Unable to requeue a job after sigterm signal on slurm #20542

Open
jkobject opened this issue Jan 10, 2025 · 9 comments
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.4.x

Comments

@jkobject commented Jan 10, 2025

Bug description

When running a model's fit function on a SLURM cluster, everything works correctly, but when the job hits its time limit I receive:

Epoch 10:  93%|█████████▎| 18625/20000 [5:43:21<25:20,  0.90it/s, v_num=0txx, train_loss=3.400, denoise_60%_expr=1.290, denoise_60%_emb_independence=0.0694, denoise_60%_cls=0.377, denoise_60%_ecs=0.865, gen_expr=1
Epoch 10:  93%|█████████▎| 18626/20000 [5:43:21<25:19,  0.90it/s, v_num=0txx, train_loss=3.400, denoise_60%_expr=1.290, denoise_60%_emb_independence=0.0694, denoise_60%_cls=0.377, denoise_60%_ecs=0.865, gen_expr=1.490, gen_emb_independence

slurmstepd: error: *** STEP 55595933.0 ON maestro-3017 CANCELLED AT 2025-01-10T15:27:25 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[rank: 0] Received SIGTERM: 15
Bypassing SIGTERM: 15

Epoch 10:  93%|█████████▎| 18626/20000 [5:43:21<25:19,  0.90it/s, v_num=0txx, train_loss=3.970, denoise_60%_expr=1.570, denoise_60%_emb_independence=0.068, denoise_60%_cls=0.386, denoise_60%_ecs=0.867, gen_expr=1...
Epoch 10:  93%|█████████▎| 18627/20000 [5:43:23<25:18,  0.90it/s, v_num=0txx, train_loss=3.970, denoise_60%_expr=1.570, denoise_60%_emb_independence=0.068, denoise_60%_cls=0.386, denoise_60%_ecs=0.867, gen_expr=1...
Epoch 10:  93%|█████████▎| 18627/20000 [5:43:23<25:18,  0.90it/s, v_num=0txx, train_loss=3.330, denoise_60%_expr=1.270, denoise_60%_emb_independence=0.0689, denoise_60%_cls=0.368, denoise_60%_ecs=0.867, gen_expr=1.460, gen_emb_independence=0.0584, gen_ecs=0.868, cce=0.480]

wandb: 🚀 View run super-dream-58 at: https://wandb.ai/ml4ig/scprint_v2/runs/k2oz0txx
wandb: Find logs at: ../../../../zeus/projets/p02/ml4ig_hot/Users/jkalfon/wandb/run-20250107_152923-k2oz0txx/logs

Unfortunately, the job never requeues and doesn't even save a checkpoint...
It seems I shouldn't have to add anything to my config.yml for this, but even when adding

plugins:
    - class_path: lightning.pytorch.plugins.environments.SLURMEnvironment
      init_args:
        requeue_signal: SIGHUP

it doesn't change anything. I have also specified --signal=SIGUSR1@90 in my sbatch command.
Is there a solution?

What version are you seeing the problem on?

v2.4

How to reproduce the bug

git clone https://github.com/cantinilab/scPRINT
follow the installation instructions
sbatch -p gpu -q gpu --gres=gpu:A100:1,gmem:80G --cpus-per-task 20 --mem-per-gpu 80G --ntasks-per-node=1 --signal=SIGUSR1@90 scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml

jkobject added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Jan 10, 2025
@arijit-hub

Hey,

I see you use requeue_signal: SIGHUP but pass --signal=SIGUSR1@90. You need to use the same signal in both places for it to work.

So what you should do is:

plugins:
    - class_path: lightning.pytorch.plugins.environments.SLURMEnvironment
      init_args:
        requeue_signal: SIGUSR1

and it should work.
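
For completeness, the sbatch side then has to send that same signal. A rough sketch, reusing the flags from your first message (adapt the partition and resources to your cluster):

# the signal name here must match requeue_signal in the YAML above
sbatch -p gpu -q gpu --gres=gpu:A100:1,gmem:80G --signal=SIGUSR1@90 scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml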

Hope this helps!

@jkobject (Author) commented Jan 10, 2025

Hello,

I tried both, but I do use the same signal in both places; I just copied the snippets from two different runs.

@arijit-hub

Hey,

Hmm. I was checking your repository and saw the batch files in there. Are you using them to run your jobs, or do you use the command that you wrote in "How to reproduce the bug"?

@jkobject (Author)

I used the command I wrote in "How to reproduce the bug", not the file in the git repo. I matched the --signal=SIGUSR1@90 with requeue_signal: SIGUSR1 in the .yaml.

Except for the SLURM sbatch files, what you see in the main branch of the repo is what I use.

@arijit-hub

OK, I see. Can you maybe try increasing the signal timer from 90 seconds to a higher number, say 500, by doing --signal=SIGUSR1@500? Sometimes the model needs a bit more time to be saved.

@jkobject (Author)

I have also found out there is a --requeue option on sbatch; I tried it, but it didn't seem to change anything. I will try the 500s wait :)
The current command is sbatch_tail --ntasks-per-node=1 --hint=nomultithread --gres=gpu:1 --cpus-per-task=24 --time 0:20:00 --account=xeg@h100 --nodes=1 --constraint=h100 --signal=SIGUSR1@500 --requeue scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml --data.batch_size 32 --data.collection_name "scPRINT-V2 (some)"

@jkobject (Author) commented Jan 13, 2025

No, it is still not requeuing. I'm still getting:

16.80, denoise_60%_emb_independence=0.384, denoise_60%_ecs=0.863, gen_expr=25.10, cce=1.470]
slurmstepd: error: *** JOB 1857552 ON jzxh010 CANCELLED AT 2025-01-13T09:34:39 DUE TO TIME LIMIT ***
[rank: 0] Received SIGTERM: 15
Bypassing SIGTERM: 15
Epoch 0:  17%|█▋        | 2577/14727 [19:31<1:32:02,  2.20it/s, v_num=rrp1, train_loss=57.80, full_forward_expr=0.000, full_forward_emb_independence=0.399, full_forward_cls=1.010, full_forward_ecs=0.876, mask_TF_expr=14.30, mask_TF_emb_independence=0.379, mask_TF_ecs=0.858, denoise_60%_expr=
Epoch 0:  17%|█▋        | 2577/14727 [19:31<1:32:02,  2.20it/s, v_num=rrp1, train_loss=25.10, full_forward_expr=0.000, full_forward_emb_independence=0.394, full_forward_cls=1.130, full_forward_ecs=0.866, mask_TF_expr=6.310, mask_TF_emb_independence=0.380, mask_TF_ecs=0.864, denoise_60%_expr=7.290, denoise_60%_emb_independence=0.386, denoise_60%_ecs=0.862, gen_expr=9.860, cce=1.480]
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /lustre/fswork/projects/rech/xeg/uat95fg/wandb/offline-run-20250113_091454-t6yqrrp1
wandb: Find logs at: ../wandb/offline-run-20250113_091454-t6yqrrp1/logs

While I see the SIGTERM at the end of the job, I don't see any print for the SIGUSR1 signal... but I don't know if there should be one.

@jkobject (Author)

Hello, so I found a solution: I need to put the command in an sbatch script and add exit 99 at the end. This seems to be the case for my SLURM cluster, and for others too.
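
For anyone landing here, a rough sketch of what that script can look like. The flags and configs are just the ones from this thread, srun may or may not be needed depending on your setup, and whether exit code 99 actually triggers a requeue depends on how your cluster is configured (e.g. RequeueExit=99 in slurm.conf), so check with your admins:

#!/bin/bash
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=24
#SBATCH --signal=SIGUSR1@90
#SBATCH --requeue

# Lightning's SLURMEnvironment handles the SIGUSR1 sent ~90s before the time limit
srun scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml

# exiting with 99 is what made requeueing work on my cluster
# (some clusters are configured to requeue jobs that exit with this specific code)
exit 99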

Best,

@arijit-hub

Hello,

Amazing that you found the solution.
