Unable to requeue a job after sigterm signal on slurm #20542

Open
jkobject opened this issue Jan 10, 2025 · 9 comments
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.4.x

Comments

@jkobject commented Jan 10, 2025

Bug description

When running a model's fit function on a SLURM cluster, everything works correctly, but when the job hits its time limit I receive:

Epoch 10:  93%|█████████▎| 18625/20000 [5:43:21<25:20,  0.90it/s, v_num=0txx, train_loss=3.400, denoise_60%_expr=1.290, denoise_60%_emb_independence=0.0694, denoise_60%_cls=0.377, denoise_60%_ecs=0.865, gen_expr=1
Epoch 10:  93%|█████████▎| 18626/20000 [5:43:21<25:19,  0.90it/s, v_num=0txx, train_loss=3.400, denoise_60%_expr=1.290, denoise_60%_emb_independence=0.0694, denoise_60%_cls=0.377, denoise_60%_ecs=0.865, gen_expr=1.490, gen_emb_independence

slurmstepd: error: *** STEP 55595933.0 ON maestro-3017 CANCELLED AT 2025-01-10T15:27:25 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[rank: 0] Received SIGTERM: 15
Bypassing SIGTERM: 15

Epoch 10:  93%|█████████▎| 18626/20000 [5:43:21<25:19,  0.90it/s, v_num=0txx, train_loss=3.970, denoise_60%_expr=1.570, denoise_60%_emb_independence=0.068, denoise_60%_cls=0.386, denoise_60%_ecs=0.867, gen_expr=1...
Epoch 10:  93%|█████████▎| 18627/20000 [5:43:23<25:18,  0.90it/s, v_num=0txx, train_loss=3.970, denoise_60%_expr=1.570, denoise_60%_emb_independence=0.068, denoise_60%_cls=0.386, denoise_60%_ecs=0.867, gen_expr=1...
Epoch 10:  93%|█████████▎| 18627/20000 [5:43:23<25:18,  0.90it/s, v_num=0txx, train_loss=3.330, denoise_60%_expr=1.270, denoise_60%_emb_independence=0.0689, denoise_60%_cls=0.368, denoise_60%_ecs=0.867, gen_expr=1.460, gen_emb_independence=0.0584, gen_ecs=0.868, cce=0.480]

wandb: 🚀 View run super-dream-58 at: https://wandb.ai/ml4ig/scprint_v2/runs/k2oz0txx
wandb: Find logs at: ../../../../zeus/projets/p02/ml4ig_hot/Users/jkalfon/wandb/run-20250107_152923-k2oz0txx/logs

Unfortunately, the job never requeues and doesn't even save a checkpoint...
It seems I shouldn't have to add anything to my config.yml for this, but even when adding

plugins:
    - class_path: lightning.pytorch.plugins.environments.SLURMEnvironment
      init_args:
        requeue_signal: SIGHUP

it doesn't change anything. I have also specified --signal=SIGUSR1@90 in my sbatch command.
Is there a solution?

What version are you seeing the problem on?

v2.4

How to reproduce the bug

git clone https://github.com/cantinilab/scPRINT
follow the installation instructions
sbatch -p gpu -q gpu --gres=gpu:A100:1,gmem:80G --cpus-per-task 20 --mem-per-gpu 80G --ntasks-per-node=1 --signal=SIGUSR1@90 scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml

jkobject added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Jan 10, 2025
@arijit-hub

Hey,

I see you use requeue_signal: SIGHUP but pass --signal=SIGUSR1@90. You need to use the same signal in both places for it to work.

So what you should do is:

plugins:
    - class_path: lightning.pytorch.plugins.environments.SLURMEnvironment
      init_args:
        requeue_signal: SIGUSR1

and it should work.
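
For completeness, the sbatch side then has to send that same signal. A rough sketch, reusing the flags from your first message (adapt the partition and resources to your cluster):

# the signal name here must match requeue_signal in the YAML above
sbatch -p gpu -q gpu --gres=gpu:A100:1,gmem:80G --signal=SIGUSR1@90 scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml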

Hope this helps!

@jkobject (Author) commented Jan 10, 2025

Hello,

I tried both, but I do use the same signal in both places; I just copied the snippets from two different runs.

@arijit-hub

Hey,

Hmm. I was checking your repository and saw the batch files in there. Are you using them to run your jobs, or do you use the command that you wrote in "How to reproduce the bug"?

@jkobject (Author)

I used the command I wrote in "How to reproduce the bug", not the file in the git repo. I matched the --signal=SIGUSR1@90 with requeue_signal: SIGUSR1 in the .yaml.

Except for the SLURM sbatch files, what you see in the main branch of the repo is what I use.

@arijit-hub

OK, I see. Can you maybe try increasing the signal timer from 90 seconds to a higher number, say 500, by doing --signal=SIGUSR1@500? Sometimes the model needs a bit more time to be saved.

@jkobject (Author)

I have also found out there is a --requeue option on sbatch; I tried it, but it didn't seem to change anything. I will try the 500s wait :)
The current command is sbatch_tail --ntasks-per-node=1 --hint=nomultithread --gres=gpu:1 --cpus-per-task=24 --time 0:20:00 --account=xeg@h100 --nodes=1 --constraint=h100 --signal=SIGUSR1@500 --requeue scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml --data.batch_size 32 --data.collection_name "scPRINT-V2 (some)"

@jkobject (Author) commented Jan 13, 2025

No, it is still not requeuing. I'm still getting:

16.80, denoise_60%_emb_independence=0.384, denoise_60%_ecs=0.863, gen_expr=25.10, cce=1.470]
slurmstepd: error: *** JOB 1857552 ON jzxh010 CANCELLED AT 2025-01-13T09:34:39 DUE TO TIME LIMIT ***
[rank: 0] Received SIGTERM: 15
Bypassing SIGTERM: 15
Epoch 0:  17%|█▋        | 2577/14727 [19:31<1:32:02,  2.20it/s, v_num=rrp1, train_loss=57.80, full_forward_expr=0.000, full_forward_emb_independence=0.399, full_forward_cls=1.010, full_forward_ecs=0.876, mask_TF_expr=14.30, mask_TF_emb_independence=0.379, mask_TF_ecs=0.858, denoise_60%_expr=
Epoch 0:  17%|█▋        | 2577/14727 [19:31<1:32:02,  2.20it/s, v_num=rrp1, train_loss=25.10, full_forward_expr=0.000, full_forward_emb_independence=0.394, full_forward_cls=1.130, full_forward_ecs=0.866, mask_TF_expr=6.310, mask_TF_emb_independence=0.380, mask_TF_ecs=0.864, denoise_60%_expr=7.290, denoise_60%_emb_independence=0.386, denoise_60%_ecs=0.862, gen_expr=9.860, cce=1.480]
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /lustre/fswork/projects/rech/xeg/uat95fg/wandb/offline-run-20250113_091454-t6yqrrp1
wandb: Find logs at: ../wandb/offline-run-20250113_091454-t6yqrrp1/logs

While I see the SIGTERM at the end of the job, I don't see any print for the SIGUSR1 signal... but I don't know if there should be one.

@jkobject (Author)

Hello, so I found a solution: I need to put the command in an sbatch script and add exit 99 at the end. This seems to be the case for my SLURM cluster, and for others too.
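
For anyone landing here, a rough sketch of what that script can look like. The flags and configs are just the ones from this thread, srun may or may not be needed depending on your setup, and whether exit code 99 actually triggers a requeue depends on how your cluster is configured (e.g. RequeueExit=99 in slurm.conf), so check with your admins:

#!/bin/bash
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=24
#SBATCH --signal=SIGUSR1@90
#SBATCH --requeue

# Lightning's SLURMEnvironment handles the SIGUSR1 sent ~90s before the time limit
srun scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml

# exiting with 99 is what made requeueing work on my cluster
# (some clusters are configured to requeue jobs that exit with this specific code)
exit 99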

Best,

@arijit-hub

Hello,

Amazing that you found the solution.
