ci/tasks.py: offload testjob post processing to its own task #1115
The reason for this change is deployments of SQUAD on auto-scaling systems such as Kubernetes. When SQUAD's load is high, Kubernetes creates new worker replicas to consume from the queue.
When the load drops back down, Kubernetes starts trimming workers that are no longer needed. There is, however, a very specific corner case with this approach.
When Kubernetes trims a worker, it sends it SIGTERM and waits 30s by default for the worker to terminate on its own. In Linaro's deployment of SQUAD, there is a particular kind of test job that comes from Android CTS/VTS. These jobs are huge and take a lot more than 30s to finish. If the worker is not done by the 30s mark, Kubernetes sends it SIGKILL and it dies abruptly, causing inconsistencies.
We could increase the 30s timeout, but if SQUAD is under heavy load, a longer grace period might still lead to inconsistencies if the worker does not terminate within it.
The solution to this problem is the creation of a new queue called 'ci_fetch_postprocess'. Deployments under heavy load should then run a different kind of worker that never dies and does not auto-scale, eliminating the problem completely.
The tasks routed to 'ci_fetch_postprocess' are the plugin post-processing ones, which are the culprit of the issue.
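
As a rough illustration, a minimal Celery sketch of this routing is shown below. The module, task, and function names here are illustrative, not the actual ones from squad/ci/tasks.py:

```python
from celery import Celery

# Hypothetical app name/module; SQUAD's real Celery app is configured elsewhere.
app = Celery('squad')

# Route only the plugin post-processing task to the dedicated queue; every
# other ci task keeps using the regular (auto-scaled) workers.
app.conf.task_routes = {
    'squad.ci.tasks.postprocess_testjob': {'queue': 'ci_fetch_postprocess'},
}

@app.task
def postprocess_testjob(testjob_id):
    # Run the (potentially very long) plugin post-processing for one test job.
    # This executes on the non-auto-scaled worker, so it is not interrupted
    # when Kubernetes scales the fetch workers down.
    ...

@app.task
def fetch(testjob_id):
    # Fetch the results quickly, then hand the heavy plugin work off to the
    # dedicated queue instead of running it inline in this task.
    postprocess_testjob.delay(testjob_id)
```

A deployment would then start a long-lived worker that consumes only that queue, with something like `celery -A squad worker -Q ci_fetch_postprocess`, while the auto-scaled workers keep consuming the default queue.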