-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(healthcontroller): generate only one add/remove operation per cycle #636
Conversation
4a115e6
to
0ae20f9
Compare
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #636 +/- ##
==========================================
- Coverage 64.30% 64.15% -0.15%
==========================================
Files 39 39
Lines 2905 2960 +55
==========================================
+ Hits 1868 1899 +31
- Misses 909 929 +20
- Partials 128 132 +4 ☔ View full report in Codecov by Sentry. |
979af9e
to
4666b6c
Compare
The logic to get the desired number of rooms should not depend on the logic of ensuring this desired amount. Ideally, the desired logic would stay in a separate worker and a separate operation that we are going to call SchedulerController. Then, HealthController will only be responsible for enqueueing the necessary operations to get the current state equal to the desired. Thus, rolling updates should NOT add any operation to the queue
4666b6c
to
21fa95a
Compare
After sorting by status, sort based on scheduler's active version moving the current available rooms (Pending, Ready and Occupied) to the bottom of the list. Now, we'll delete first Pending, Ready and Occupied from a non-active scheduler version to ensure version updates.
Create a matrix of test cases with table tests, validated in spreadsheets and with the desired behavior that each health_controller operation should take during a rolling update. There are 4 base cases: stable occupancy, occupancy increased, occupancy decreased, stable occupancy but only 75% of new rooms becoming active.
If the operation did not took action, do not append to history. This will avoid bloat of operation history when listing, simplifying debug.
9789b07
to
785a118
Compare
c19b25a
to
3afe568
Compare
When sending Kubernetes events, let the kubernetes clientSet infere what is the namespace of the object (Pod) instead of having one recorder per namespace.
3afe568
to
1788e34
Compare
49de795
to
2913a14
Compare
Currently, the health controller is performing both rolling updates and autoscaling. We can roughly divide these responsibilities among these functions executed by the operation:
In a lot of cases, one cycle will generate conflicting operations: autoscale will try to remove X amount of rooms but rolling update only wants Y amount of rooms deleted. Moreover, autoscale deletion is based on its policy (currently only occupancy) but rolling update wants to delete based on the version of the scheduler. Thus, to streamline and simplify this process the refactor aims to split the responsibility and let only one piece of logic handle the creation or removal of rooms.
The work of this PR can be summarized in:
1. Split the logic, inside
healthcontroller/execute.go
So that we have one logical abstraction to compute the desired number of rooms based on autoscaling and rolling update (if ongoing), and another logical abstraction responsible for fixing the drift between current and desired by enqueuing add or remove operations
All logic that is responsible for computing the desired amount can be refactored to a different operation, outside of execution worker, later on. All those methods are not receivers of the
Execute
and the dependency is injected by their arguments in the functions:2. Refactor the Remove Rooms operation to sort by version when deleting a number of game rooms
Currently, the system will sort by game room status, however, we want to take the sorted rooms and apply a sort by version on Pending, Ready, and Occupied rooms.
Given this set of rooms with their status + versions: [Error1v1, Error2v2, Pending1v1, Pending2v2, Ready1v1, Ready2v2, Occupied1v1, Occupied2v2]
Currently, the sort by priority/status will return the above list, however, we want to ensure old scheduler versions (v1) are deleted first (Kubernetes deletes pods based on creation time, so it will be similar). Thus, the new sorting version will become:
If there's only one active version of the scheduler in the system, then we skip the additional sort.
3. Simplified calculations of desired when autoscaling + rolling updating
As a baseline, the system will always use the desired from autoscaling (or roomsReplica if no autoscale is defined). However, if we have a major version of the difference between schedulers of current rooms in the system, then we are in a rolling update and we should consider the
maxSurge
.4. Add unit tests with ALL rolling update cycles
Create a matrix of test cases with table tests, validated in spreadsheets and with the desired behavior that each health_controller operation should take during a rolling update. There are 4 base cases: stable occupancy, occupancy increased, occupancy decreased, and stable occupancy but only 75% of new rooms becoming active.
Spreadsheet reference with the desired cases: https://docs.google.com/spreadsheets/d/13grNuk3os7mOesnpROqLOVTSWiGpgvi5JUgWUB3Z428/edit?usp=sharing