Job heartbeat support #436

meyer9 · 2023-12-03T02:02:13Z

If jobs had a last_heartbeat field, we could more quickly detect when a job fails because of a process crash or similar, where the job may never be updated to failed.

One example would be to send a heartbeat every 15 seconds (setting the column to NOW()) and to mark a job as failed/expired if its heartbeat is more than 60 seconds old.
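
The check described above could look roughly like the following sketch. It is in-memory only and is not pg-boss code: the `HeartbeatJob` shape, the `lastHeartbeat` field, and the `beat` and `findStaleJobs` helpers are all hypothetical names chosen for illustration, and the 60-second timeout is the value from the example.

```typescript
// Hypothetical job shape for illustration; pg-boss does not expose a heartbeat field.
interface HeartbeatJob {
  id: string;
  state: "created" | "active" | "completed" | "failed";
  lastHeartbeat: Date;
}

// A job is treated as dead after 60 seconds without a heartbeat (value from the example).
const HEARTBEAT_TIMEOUT_MS = 60 * 1000;

// Worker side: called roughly every 15 seconds while the job runs.
// In-memory stand-in for "UPDATE ... SET last_heartbeat = NOW()".
function beat(job: HeartbeatJob, now: Date): void {
  job.lastHeartbeat = now;
}

// Maintenance side: return active jobs whose last heartbeat is older than the timeout.
function findStaleJobs(
  jobs: HeartbeatJob[],
  now: Date,
  timeoutMs: number = HEARTBEAT_TIMEOUT_MS
): HeartbeatJob[] {
  return jobs.filter(
    (job) =>
      job.state === "active" &&
      now.getTime() - job.lastHeartbeat.getTime() > timeoutMs
  );
}
```

A maintenance task could then mark whatever `findStaleJobs` returns as failed/expired, or queue it for retry.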


Crispy1975 · 2024-04-24T13:09:32Z

@timgit I was thinking about this exact addition to pg-boss. We have the scenario the OP describes, where a worker process exits in an uncontrolled way, leaving a job in the active state. There doesn't seem to be a way for other workers to know about this, so we end up with jobs stuck in limbo.

As the OP mentioned, a heartbeat column in the jobs table would allow a maintenance task or other workers to detect the stuck job and perhaps move it into the retry state so it can be processed again.

What are your thoughts on this? I am happy to work on this update/feature, as we have a fairly urgent need. It should benefit other pg-boss users too. 😄

schester44 · 2024-05-17T20:02:20Z

+1 to this, would love to see pg-boss better recognize when a job has stopped for reasons such as the worker process crashing.

timgit · 2024-07-13T21:31:57Z

Internal maintenance guarantees active jobs are retried or failed after their timeout/expiration. This behaves similarly to the visibility timeout in SQS. You should tune the expiration to what should be considered normal execution time as well.

KristjanTammekivi · 2024-07-16T09:55:16Z

> Internal maintenance guarantees active jobs are retried or failed after their timeout/expiration. This behaves similarly to the visibility timeout in SQS. You should tune the expiration to what should be considered normal execution time as well.

Well, you can, but sometimes a task takes longer than usual. Then you either have to set the expiration far above the normal execution time or risk the same task being run multiple times. A heartbeat should be reasonably easy to implement and would solve that issue at the cost of an extra column in the job table.

KristjanTammekivi · 2024-08-26T12:43:55Z

Hi @timgit, I see that this is marked as completed, but I can't find it in the source code or documentation. Could you please give more information on this?

timgit · 2024-08-27T00:24:23Z

I didn't mean to mark this as completed. I was resolving a lot of old issues during the v10 release.

afonsomatos · 2024-10-20T10:04:41Z

Hey, I have this issue as well.

My situation is that our jobs can take anywhere from 60 seconds to 15 minutes to complete, depending on the configuration of that particular job.

We want our users to get the job done as quickly as possible.

If the worker dies at 60 seconds, it's too much to wait another 14 minutes (for the expiration timeout).
Job heartbeat support would fix this.
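
To put numbers on that, here is a small worked calculation. It is only a sketch under stated assumptions: the function names are hypothetical, the first heartbeat is assumed to be sent at job start and then at a fixed interval, and the maintenance check is assumed to run the moment a deadline passes. The 15-second interval and 60-second timeout are the values proposed at the top of this issue.

```typescript
// Delay between a worker crash and detection when only the expiration timeout is used.
function expirationDelayMs(expirationMs: number, crashedAtMs: number): number {
  return Math.max(0, expirationMs - crashedAtMs);
}

// Delay between a worker crash and detection with a heartbeat.
// Assumes heartbeats at 0, intervalMs, 2 * intervalMs, ... measured from job start.
function heartbeatDelayMs(
  crashedAtMs: number,
  intervalMs: number,
  timeoutMs: number
): number {
  // Time of the last heartbeat sent before (or at) the crash.
  const lastBeatMs = Math.floor(crashedAtMs / intervalMs) * intervalMs;
  // The job is flagged once that heartbeat is older than the timeout.
  return lastBeatMs + timeoutMs - crashedAtMs;
}

// 15-minute expiration, worker dies 60 seconds in: 14 minutes until the job is retried.
const withExpiration = expirationDelayMs(15 * 60 * 1000, 60 * 1000);

// 15-second heartbeat, 60-second timeout, worker dies 70 seconds in: 50 seconds.
const withHeartbeat = heartbeatDelayMs(70 * 1000, 15 * 1000, 60 * 1000);

console.log(withExpiration / 1000, withHeartbeat / 1000);
```

Under these assumptions the heartbeat delay never exceeds the heartbeat timeout, regardless of how long the job is allowed to run.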

timgit closed this as completed Aug 17, 2024

timgit reopened this Aug 27, 2024
