Slurm Job creation crashes on master because "SLURM_JOB_ID" not set #127
Comments
I'm having the same problem. That code runs on the calling machine, which isn't a Slurm node, so SLURM_JOB_ID isn't set.
Probably that code needs to check if that environment variable is available and only then load it.
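A minimal sketch of that check in Julia, under stated assumptions: the helper names `slurm_job_id` and `job_output_name` and the `name-id.out` pattern are hypothetical and are not taken from the package source. It only illustrates reading SLURM_JOB_ID when it exists and falling back to a name without the job ID otherwise.

```julia
# Hypothetical helper: return the Slurm job ID if the variable is set,
# otherwise `nothing`. `env` defaults to the process environment.
function slurm_job_id(env = ENV)
    return get(env, "SLURM_JOB_ID", nothing)
end

# Hypothetical helper: build an output file name, including the job ID
# only when it is available. The naming pattern is an assumption.
function job_output_name(jobname::AbstractString, env = ENV)
    jobid = slurm_job_id(env)
    return jobid === nothing ? "$(jobname).out" : "$(jobname)-$(jobid).out"
end
```

On a machine that is not a Slurm node this would produce a plain `.out` name instead of throwing a `KeyError` on the missing variable.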
I don't think so. The next line creates the (I'm happy to do the coding and make a PR if someone can tell me what it's supposed to do; for now I'll get something that works and we can see what you think)
We can probably do a quick check for job files in the directory we want to save to (a check which already kind of exists), and instead of deleting all the files:
All of these should be pretty straightforward to add. Another thing you could add is a flag that turns job files on or off, and another that turns the job_id functionality on or off.
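Roughly what the flag and the narrower file check could look like, as a sketch only: `stale_job_files`, `output_name`, the `use_job_id` keyword and the `.out` naming pattern are all hypothetical names chosen for illustration, not the package's API.

```julia
# Hypothetical: list only the output files that belong to `jobname` in `dir`,
# so cleanup does not have to touch unrelated files.
function stale_job_files(dir::AbstractString, jobname::AbstractString)
    return filter(f -> startswith(f, jobname) && endswith(f, ".out"), readdir(dir))
end

# Hypothetical flag `use_job_id`: when false, SLURM_JOB_ID is never consulted.
# When true, the job ID is used only if the variable is actually set.
function output_name(jobname::AbstractString; use_job_id::Bool = true, env = ENV)
    if use_job_id && haskey(env, "SLURM_JOB_ID")
        return "$(jobname)-$(env["SLURM_JOB_ID"]).out"
    end
    return "$(jobname).out"
end
```

With `use_job_id = false` the behaviour would match what worked before the name change, which keeps things usable on clusters where the variable is not set on the calling machine.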
On my cluster I have noticed that SLURM_JOB_ID is set only after launching a job using srun, as might be expected from the name of the variable. A workaround at the moment is to submit an interactive job, run julia on the compute node and add workers using ClusterManagers. However, we should not expect it to be set before an srun command is called.
Scripts in https://github.com/magerton/julia-slurm-example work OK, maybe this should be mentioned on the front page and in the documentation?
In order to get around the replace bug mentioned in #118, I installed directly from master, but this introduced another bug: the .out name change introduced in #123 causes job creation to crash, since my cluster does not seem to set SLURM_JOB_ID (nor any other Slurm variable) at the point where the code expects it.
Removing the jobID solved the problem.