Is your feature request related to a problem? Please describe.
I am running several indexing processes using dkron with 3 agents, all of them with the same tags, so the indexing processes can run on any of them. Concurrency of a specific indexing process is forbidden, but several different indexing processes can run at the same time. When a process fails due to a hardware issue (low memory, low disk space), the job is always retried on the same server, where it simply fails again.
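For reference, a minimal sketch of the kind of job described above, posted to dkron's REST API. The job name, schedule, tag values, and executor command are hypothetical placeholders; the field names follow dkron's documented job JSON, with concurrency set to forbid and a retry count.

```python
import requests

# Hypothetical indexing job: any agent tagged role=indexer may run it,
# overlapping runs of this specific job are forbidden, and failures are retried
# (currently the retries often land on the same agent, which is the problem).
job = {
    "name": "index-products",            # placeholder name
    "schedule": "@every 1h",             # placeholder schedule
    "tags": {"role": "indexer"},         # same tag carried by all 3 agents
    "concurrency": "forbid",             # no concurrent runs of this job
    "retries": 6,                        # retry on failure
    "executor": "shell",
    "executor_config": {"command": "/opt/indexer/run.sh products"},  # placeholder command
}

# Assuming a dkron server reachable at localhost:8080 and the v1 REST API.
resp = requests.post("http://localhost:8080/v1/jobs", json=job, timeout=10)
resp.raise_for_status()
print(resp.json())
```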
Describe the solution you'd like
I would like an option to force the retry on a different agent (obviously one that meets the tag criteria).
Another alternative would be an option to avoid concurrency between specific jobs (i.e. a setting that lets you list the names of jobs the current job should avoid running alongside, if possible), so that several hardware-intensive processes are not run at the same time on the same servers and are distributed more evenly.
Describe alternatives you've considered
As an alternative I can increase the number of agents for greater dispersion, but this will not solve the problem, as it will still happen from time to time anyway.
Also, with more agents I could add new tags so that each hardware-intensive job targets a subset of them (sketched below), but this would heavily reduce availability.
Another alternative would be dynamic tags: tags derived from average disk usage or average free memory over the past X minutes, updated every minute, which jobs could then target as well.
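To make the tag-subset alternative concrete, here is a rough sketch. It assumes some agents are started with an extra tag (for example `dkron agent --tag role=indexer --tag workload=heavy`; the flag usage follows dkron's agent options, the tag names and job values are illustrative only).

```python
# Sketch of the "tag subset" alternative: only agents started with workload=heavy
# qualify to run this job, so heavy jobs are confined to a subset of the cluster.
heavy_job = {
    "name": "index-catalog-full",            # placeholder name
    "schedule": "@every 6h",                 # placeholder schedule
    "tags": {"workload": "heavy"},           # targets only the tagged subset of agents
    "concurrency": "forbid",
    "retries": 3,
    "executor": "shell",
    "executor_config": {"command": "/opt/indexer/run.sh catalog"},  # placeholder command
}
# Trade-off noted above: fewer matching agents means reduced availability for these jobs.
```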
Additional context
Add any other context or screenshots about the feature request here.
I had in mind implementing resource checks on nodes, but that will take some time. The retries, however, should already pick a new random node as per the current implementation.
@vcastellm I stand corrected. It seems to behave erratically: sometimes it selects a different agent, while other times it sticks to the same one for all retries. Just now, two separate tasks failed, with all retries (up to 6 per task) taking place on the same agent.