Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(healthcontroller): generate only one add/remove operation per cycle #636

Merged
merged 6 commits into from
Sep 2, 2024

Conversation

hspedro
Copy link
Collaborator

@hspedro hspedro commented Aug 20, 2024

Currently, the health controller is performing both rolling updates and autoscaling. We can roughly divide these responsibilities among these functions executed by the operation:

  1. performRollingUpdate: generates a pair of add and remove operations based on rolling update constraints (maxSurge and maxUnavailable = 0)
  2. ensureDesiredAmountOfInstances: generates either an add or remove operation to move the current state equal to the desired

In a lot of cases, one cycle will generate conflicting operations: autoscale will try to remove X amount of rooms but rolling update only wants Y amount of rooms deleted. Moreover, autoscale deletion is based on its policy (currently only occupancy) but rolling update wants to delete based on the version of the scheduler. Thus, to streamline and simplify this process the refactor aims to split the responsibility and let only one piece of logic handle the creation or removal of rooms.

The work of this PR can be summarized in:

1. Split the logic, inside healthcontroller/execute.go

So that we have one logical abstraction to compute the desired number of rooms based on autoscaling and rolling update (if ongoing), and another logical abstraction responsible for fixing the drift between current and desired by enqueuing add or remove operations

All logic that is responsible for computing the desired amount can be refactored to a different operation, outside of execution worker, later on. All those methods are not receivers of the Execute and the dependency is injected by their arguments in the functions:

  • GetDesiredNumberOfRooms
  • IsRollingUpdating
  • ComputeMaxSurge

2. Refactor the Remove Rooms operation to sort by version when deleting a number of game rooms

Currently, the system will sort by game room status, however, we want to take the sorted rooms and apply a sort by version on Pending, Ready, and Occupied rooms.

Given this set of rooms with their status + versions: [Error1v1, Error2v2, Pending1v1, Pending2v2, Ready1v1, Ready2v2, Occupied1v1, Occupied2v2]

Currently, the sort by priority/status will return the above list, however, we want to ensure old scheduler versions (v1) are deleted first (Kubernetes deletes pods based on creation time, so it will be similar). Thus, the new sorting version will become:

  • Before: [Error1v1, Error2v2, Pending1v1, Pending2v2, Ready1v1, Ready2v2, Occupied1v1, Occupied2v2]
  • Now: [Error1v1, Error2v2, Pending1v1, Ready1v1, Occupied1v1, Pending2v2, Ready2v2, Occupied2v2]

If there's only one active version of the scheduler in the system, then we skip the additional sort.

3. Simplified calculations of desired when autoscaling + rolling updating

As a baseline, the system will always use the desired from autoscaling (or roomsReplica if no autoscale is defined). However, if we have a major version of the difference between schedulers of current rooms in the system, then we are in a rolling update and we should consider the maxSurge.

Maestro Rollout Strategy Design Doc

4. Add unit tests with ALL rolling update cycles

Create a matrix of test cases with table tests, validated in spreadsheets and with the desired behavior that each health_controller operation should take during a rolling update. There are 4 base cases: stable occupancy, occupancy increased, occupancy decreased, and stable occupancy but only 75% of new rooms becoming active.

Spreadsheet reference with the desired cases: https://docs.google.com/spreadsheets/d/13grNuk3os7mOesnpROqLOVTSWiGpgvi5JUgWUB3Z428/edit?usp=sharing

@hspedro hspedro self-assigned this Aug 20, 2024
@hspedro hspedro force-pushed the refactor/hc-execute branch 9 times, most recently from 4a115e6 to 0ae20f9 Compare August 21, 2024 20:05
@codecov-commenter
Copy link

codecov-commenter commented Aug 21, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 60.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 64.15%. Comparing base (bb118c8) to head (404be51).
Report is 14 commits behind head on main.

Files with missing lines Patch % Lines
internal/adapters/runtime/kubernetes/game_room.go 60.00% 3 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #636      +/-   ##
==========================================
- Coverage   64.30%   64.15%   -0.15%     
==========================================
  Files          39       39              
  Lines        2905     2960      +55     
==========================================
+ Hits         1868     1899      +31     
- Misses        909      929      +20     
- Partials      128      132       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hspedro hspedro force-pushed the refactor/hc-execute branch 2 times, most recently from 979af9e to 4666b6c Compare August 22, 2024 17:11
The logic to get the desired number of rooms should not depend
on the logic of ensuring this desired amount. Ideally, the desired
logic would stay in a separate worker and a separate operation
that we are going to call SchedulerController. Then, HealthController
will only be responsible for enqueueing the necessary operations
to get the current state equal to the desired. Thus, rolling
updates should NOT add any operation to the queue
@hspedro hspedro force-pushed the refactor/hc-execute branch from 4666b6c to 21fa95a Compare August 22, 2024 19:32
After sorting by status, sort based on scheduler's active version moving
the current available rooms (Pending, Ready and Occupied) to the bottom
of the list. Now, we'll delete first Pending, Ready and Occupied from a
non-active scheduler version to ensure version updates.
Create a matrix of test cases with table tests, validated in spreadsheets
and with the desired behavior that each health_controller operation should
take during a rolling update. There are 4 base cases: stable occupancy,
occupancy increased, occupancy decreased, stable occupancy but only 75% of
new rooms becoming active.
If the operation did not took action, do not append to history. This will avoid bloat
of operation history when listing, simplifying debug.
@hspedro hspedro force-pushed the refactor/hc-execute branch from 9789b07 to 785a118 Compare August 23, 2024 17:45
@hspedro hspedro force-pushed the refactor/hc-execute branch 2 times, most recently from c19b25a to 3afe568 Compare September 2, 2024 18:25
When sending Kubernetes events, let the kubernetes clientSet infere
what is the namespace of the object (Pod) instead of having one
recorder per namespace.
@hspedro hspedro force-pushed the refactor/hc-execute branch from 3afe568 to 1788e34 Compare September 2, 2024 18:58
@hspedro hspedro force-pushed the refactor/hc-execute branch from 49de795 to 2913a14 Compare September 2, 2024 19:46
@hspedro hspedro merged commit b2ca233 into main Sep 2, 2024
6 checks passed
@hspedro hspedro deleted the refactor/hc-execute branch September 2, 2024 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants