
Retry job on different agent #1585

Open
EmilioMoreno opened this issue Sep 20, 2024 · 3 comments

Comments

@EmilioMoreno

Is your feature request related to a problem? Please describe.
I am running several indexing processes using dkron with 3 agents, all of them with the same tags, so the indexing processes can run on any of them. Concurrency of a specific indexing process is forbidden, but several different indexing processes can run at the same time. When a process fails due to a hardware issue (low memory, low disk space), the job is always retried on the same server, where it simply fails again.
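
For reference, a setup like the one described might look roughly like the job definition below. This is only a sketch: the job name, schedule, command, and tag value are hypothetical placeholders, and the field names follow the Dkron job JSON schema as I understand it (`tags`, `concurrency`, `retries`). The retry count of 6 is an assumption taken from the retry counts mentioned later in this thread.

```python
import json

# Hypothetical job definition for one indexing process.
# Every value here is a placeholder, not the reporter's real configuration.
job = {
    "name": "index_catalog",                 # hypothetical job name
    "schedule": "@every 1h",                 # hypothetical schedule
    "executor": "shell",
    "executor_config": {"command": "/opt/indexer/run.sh"},  # hypothetical command
    "tags": {"role": "indexer:1"},           # run on one agent carrying role=indexer
    "concurrency": "forbid",                 # never overlap two runs of this job
    "retries": 6,                            # assumed retry count
}

# Serialize to the JSON document that would describe the job.
payload = json.dumps(job, indent=2)
print(payload)
```

With all three agents carrying the same tag, any of them is eligible for both the first run and each retry.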

Describe the solution you'd like
I would like an option to force the retry onto a different agent (obviously one that meets the tag criteria).
Another alternative would be an option to avoid concurrency between specific jobs (a setting that lets you name the jobs the current job should avoid running alongside, if possible), so that several hardware-intensive processes do not run at the same time on the same server and are distributed more evenly.
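
To make the first request concrete, here is a minimal sketch of the selection rule being asked for. It is not Dkron's actual implementation; `matches`, `pick_retry_agent`, and the agent names are hypothetical. The idea is to prefer eligible agents that have not yet failed this execution, and to fall back to any eligible agent only when every one of them has already failed.

```python
import random


def matches(agent_tags, job_tags):
    """Return True if the agent carries every tag the job requires."""
    return all(agent_tags.get(key) == value for key, value in job_tags.items())


def pick_retry_agent(agents, job_tags, failed, rng=random):
    """Pick an agent for a retry, avoiding agents that already failed.

    agents:   dict of agent name -> dict of tags
    job_tags: dict of tags the job requires
    failed:   set of agent names that already failed this execution
    """
    eligible = sorted(name for name, tags in agents.items() if matches(tags, job_tags))
    if not eligible:
        raise LookupError("no agent matches the job tags")
    # Prefer agents that have not failed yet; otherwise reuse any eligible one.
    fresh = [name for name in eligible if name not in failed]
    return rng.choice(fresh or eligible)


agents = {
    "agent-1": {"role": "indexer"},
    "agent-2": {"role": "indexer"},
    "agent-3": {"role": "web"},
}
print(pick_retry_agent(agents, {"role": "indexer"}, failed={"agent-1"}))
```

The fallback keeps the job retrying even when every eligible agent has failed once, which matters for transient problems such as temporarily low memory.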

Describe alternatives you've considered
As an alternative, I can increase the number of agents to get greater dispersion, but this will not solve the problem, as it will still happen from time to time.
Also, with more agents I can set new tags so that hardware-intensive jobs target a subset, but this will heavily reduce availability.
Another alternative would be dynamic tags: tags set from average disk usage or average free memory over the past X minutes, updated every minute, which jobs could then target too.
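
The dynamic-tags idea could be sketched as below. This only illustrates the concept and does not use any existing Dkron feature: `ResourceTagger`, the `disk` tag name, and the 20% threshold are all hypothetical. It keeps a rolling window of free-disk samples and derives a tag from their average.

```python
import shutil
from collections import deque
from statistics import mean


class ResourceTagger:
    """Derive a coarse 'disk' tag from a rolling average of free disk space."""

    def __init__(self, window=5, min_free=0.2):
        self.samples = deque(maxlen=window)  # keeps only the newest `window` samples
        self.min_free = min_free             # minimum average free fraction for "ok"

    def record(self, free_fraction):
        """Store one sample, expressed as a free fraction between 0 and 1."""
        self.samples.append(free_fraction)

    def sample_disk(self, path="/"):
        """Take a real sample from the filesystem that holds `path`."""
        usage = shutil.disk_usage(path)
        self.record(usage.free / usage.total)

    def tags(self):
        """Return the tag set an agent would advertise right now."""
        if not self.samples:
            return {"disk": "unknown"}
        return {"disk": "ok" if mean(self.samples) >= self.min_free else "low"}


tagger = ResourceTagger(window=3, min_free=0.2)
for fraction in (0.5, 0.1, 0.1):
    tagger.record(fraction)
print(tagger.tags())
```

An agent would call `sample_disk()` once a minute and republish `tags()`, so a job targeting `disk=ok` would skip agents whose recent average is below the threshold. A memory tag could follow the same pattern with a different sample source.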

Additional context
Add any other context or screenshots about the feature request here.

@vcastellm
Member

I had in mind to implement resource checks on nodes, but that will take some time. However, retries should already pick a new random node, as per the current implementation.

Which version are you using?

@EmilioMoreno
Author

I was using 3.1.10. Last week I upgraded to 3.2.7 and it seems to be working as you mentioned. Apologies!

@EmilioMoreno
Author

@vcastellm I stand corrected. It seems to behave erratically: sometimes it selects a different agent, while other times it sticks to the same one for all retries. Just now, two separate tasks failed, with all retries (up to 6 times per task) taking place on the same agent.
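
As a rough sanity check on how unusual that is, the snippet below computes the chance that every retry lands on one given agent. It assumes each retry picks uniformly and independently among the 3 eligible agents and that there are 6 retries per task. Neither assumption is confirmed in this thread, and `prob_all_on_one_agent` is a hypothetical helper.

```python
from fractions import Fraction


def prob_all_on_one_agent(num_agents, num_retries):
    """Chance that every retry hits one given agent under uniform, independent picks."""
    return Fraction(1, num_agents) ** num_retries


# 3 eligible agents, 6 retries, all landing on the agent that failed first.
single_task = prob_all_on_one_agent(3, 6)
print(single_task)
```

Under those assumptions the chance is 1/729 (about 0.14%) for a single task, and lower still for two independent tasks in a row, so the observed behavior would be very unlikely if each retry really chose uniformly at random.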


2 participants