Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: job scheduler - run jobs continuously [DHIS2-16004] #15466

Merged
merged 2 commits into from
Oct 25, 2023

Conversation

jbee
Copy link
Contributor

@jbee jbee commented Oct 23, 2023

Summary

Adds a feature to the scheduler to allow jobs to execute faster than one per type per loop cycle.

For fast jobs like small imports that only take below 1-2 seconds it is important to run more than one per 20 seconds loop.
Otherwise jobs might be added faster or in higher quantity that the scheduling can execute as it is limited to one per 20 seconds otherwise. This would be 3 per minute, 180 per hour, 4320 per day. For imports this might not be enough and not be fast enough.

Usage

For future job types to use it their enum constant need to be added to the new https://github.com/dhis2/dhis2-core/pull/15466/files#diff-3ad347fffaf9c8a6f46a781e97b1a5d9770a5fecba98121543414e69428baa48R232 method

Implementation Notes

To allow importing faster a job type can be hard coded as being continuous. For job types of that nature the loop pushes all jobs (IDs) that are potentially ready to a per type queue. If no queue exists a new one is created for the type and a worker spawned that works the queue until all jobs in the queue are processed. The main loop keeps potentially adding to the queue in every loop cycle in the background but the worker will poll and execute the jobs as fast as possible while still ensuring that only one job at a time per type is running.

The queue will only remember the job's ID and reload the job configuration when it is time to execute them because the job configuration may have changed in the database since it was added to the queue.

A worker working the queue will always assume the queue was created by the worker so when all entries in the queue are processed the worker clears the queue before it ends so that another worker may be spawned in the future when a new job of the job type is created.

In the unlikely event that the loop cycle checks if the queue is empty to find out if a worker should be spawned and it is empty because the current worker is just processing the last entry while also a new entry was created since then a second worker is spawned for the same queue. Now two workers will poll from the same queue to empty it and both will try to remove it when done. This is not an issue as they will poll different jobs and the DB still limits execution to 1 job at a time. This means a job might try to run while the other worker's job just got started. It then fails to start and is re-added to the queue in the next loop cycle as a job that is ready to run.

The goal with this implementation is that while there is in-memory state the design is robust against getting stuck or out-of sync.
In a worst case a job might be delayed and run 1 loop cycle (20 seconds) later than it could have been running but it will eventually be picked up and run if it is ready.

Manual Testing

  • goto import/export app
  • export some metadata to a file (chose one object type only to make a small import)
  • import the file using a REST tool like POSTman (inspect the request made by the app and replicate it in the tool. the main point to use a tool is so we can create many imports quickly, usually in a tool this can be done by just clicking the send button multiple times)
  • wait some 30 seconds
  • check the job configurations lastExecuted time for the created jobs (jobType=METADATA_IMPORT), they should be all started quickly after another and not be about 20 seconds apart, e.g. /api/jobConfigurations/gist?filter=jobType:eq:METADATA_IMPORT

@jbee jbee self-assigned this Oct 23, 2023
@jbee jbee marked this pull request as ready for review October 24, 2023 11:58
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

0.0% 0.0% Coverage
0.0% 0.0% Duplication

@codecov
Copy link

codecov bot commented Oct 24, 2023

Codecov Report

Merging #15466 (586d11b) into master (ada6f8a) will increase coverage by 0.01%.
Report is 6 commits behind head on master.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##             master   #15466      +/-   ##
============================================
+ Coverage     66.10%   66.12%   +0.01%     
- Complexity    31115    31139      +24     
============================================
  Files          3484     3486       +2     
  Lines        129448   129547      +99     
  Branches      15095    15107      +12     
============================================
+ Hits          85577    85659      +82     
- Misses        36792    36807      +15     
- Partials       7079     7081       +2     
Flag Coverage Δ
integration 49.69% <0.00%> (-0.02%) ⬇️
integration-h2 32.45% <0.00%> (+<0.01%) ⬆️
unit 30.32% <0.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...rc/main/java/org/hisp/dhis/scheduling/JobType.java 94.87% <0.00%> (-1.24%) ⬇️
...his/scheduling/DefaultJobConfigurationService.java 62.16% <0.00%> (+2.80%) ⬆️
...troller/scheduling/JobConfigurationController.java 30.76% <0.00%> (ø)
...his/scheduling/DefaultJobSchedulerLoopService.java 52.57% <0.00%> (+1.53%) ⬆️
...in/java/org/hisp/dhis/scheduling/JobScheduler.java 20.00% <0.00%> (-7.59%) ⬇️

... and 21 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ada6f8a...586d11b. Read the comment docs.

@david-mackessy
Copy link
Contributor

Nice. A test to cover this new behaviour would be good to have in the e2e suite. Checking that the allowed (continuous) job type finishes within 20 seconds?? Otherwise we'll never know (from a test) if this functionality breaks after it passes manual testing.

@jbee
Copy link
Contributor Author

jbee commented Oct 24, 2023

Checking that the allowed (continuous) job type finishes within 20 seconds??

Yes, if multiple small jobs of the METADATA_IMPORT type are created shortly after another they all should finish after one cycle of the scheduler loop, so in less than 20 seconds + the time it takes to process them all. Usually this should be just some additional ms or seconds for a small payload.

I will have a look if I can create an e2e test :)

@jbee jbee merged commit b565b84 into dhis2:master Oct 25, 2023
18 checks passed
@jbee
Copy link
Contributor Author

jbee commented Oct 25, 2023

I have not forgotton about adding a test but currently there are many other topics I work on that block further development which I need to prioritize so I merged this now as the added feature also is a precondition for that work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants