Unable to requeue a job after sigterm signal on slurm #20542
Comments
Hi, I see you use So what you should do is:
and it would be alright. Hope this helps!
Hello, I tried both, but I use the same signal in each case; I just copied the output from two different runs.
Hi, I was checking your repository and saw the batch files here. Are you using them to run your jobs? Or do you use the command that you wrote in
I used the command I wrote in "How to reproduce the bug", not the files in the git repo. I matched the SIGUSR1@90 with requeue_signal: SIGUSR1 in the .yaml. Apart from the SLURM sbatch files, what you see in the main branch of the repo is what I use.
OK, I see. Could you try increasing the signal timer from 90 seconds to a higher number, maybe 500, by doing this
I have also found out there is a --requeue option for sbatch. I tried it, but it didn't seem to change anything. I will try the 500 s wait :)
No, it is still not requeuing. I am still getting:
While I see the SIGTERM at the end of the job, I don't see any print of the SIGUSR1 signal... But I don't know if there should be one.
Hello, so I found a solution. I need to put the command in an sbatch script and add exit 99 at the end. This seems to hold for my SLURM cluster and for others too. Best,
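For readers landing here later, the fix described above might look roughly like the script below. This is only a sketch, not the author's actual file: the resource options are copied from the command in "How to reproduce the bug", while the file name `job.sh`, the use of `srun` to launch the training step, and the long-form option names are assumptions. Why exit code 99 triggers a requeue is not explained in the thread; it may depend on how the cluster is configured (for example a `RequeueExit` value in slurm.conf), so treat that as cluster-specific.

```shell
#!/bin/bash
# job.sh (hypothetical file name): sketch of the workaround reported in this thread.
# Resource options mirror the original one-line sbatch command.
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:A100:1,gmem:80G
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-gpu=80G
#SBATCH --ntasks-per-node=1
# Ask SLURM to send SIGUSR1 90 seconds before the time limit;
# this should match requeue_signal: SIGUSR1 in the .yaml config.
#SBATCH --signal=SIGUSR1@90

# Assumption: launch through srun so the training runs as a job step,
# since --signal without the B: prefix targets job steps, not the batch shell.
srun scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml

# Reported fix: end the batch script with exit code 99.
# Note this is unconditional, exactly as described in the comment above.
exit 99
```

It would then be submitted with `sbatch job.sh` instead of passing the `scprint fit` command directly on the sbatch command line.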
Hello, great that you found the solution.
Bug description
When running a model fit on a SLURM cluster, everything runs correctly, but when the time limit is reached I receive
Unfortunately the model never requeues and doesn't even save a checkpoint...
It seems I don't have to add anything here in my config.yml, but even when adding
it doesn't change anything. I have also specified
--signal=SIGUSR1@90
in my sbatch command. Is there a solution?
What version are you seeing the problem on?
v2.4
How to reproduce the bug
git clone https://github.com/cantinilab/scPRINT
follow the installation instructions
sbatch -p gpu -q gpu --gres=gpu:A100:1,gmem:80G --cpus-per-task 20 --mem-per-gpu 80G --ntasks-per-node=1 --signal=SIGUSR1@90 scprint fit --config config/base_v2.yml --config config/pretrain_medium.yml