Slurm Job creation crashes on master because "SLURM_JOB_ID" not set #127

Closed
igor-krawczuk opened this issue Oct 19, 2019 · 6 comments · Fixed by #139

@igor-krawczuk

igor-krawczuk commented Oct 19, 2019

In order to get around the replace bug mentioned in #118 I installed directly from master, but this introduced another bug: the .out name change introduced in #123 causes job creation to crash, because my cluster does not set SLURM_JOB_ID (or any other Slurm variable) at the point in time the code expects it.

Removing the jobID solved the problem.
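A minimal sketch of the failure mode (the lookup below is illustrative, not the package's exact code): on the calling machine no Slurm variables exist, so the lookup throws before srun is ever invoked.

```julia
# Outside a Slurm allocation (e.g. on a login node) this lookup fails
# immediately, which is where job creation crashes.
jobID = ENV["SLURM_JOB_ID"]      # KeyError: key "SLURM_JOB_ID" not found
outfile = "job$(jobID).out"      # never reached
```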

@grahamas
Contributor

grahamas commented Dec 3, 2019

I'm having the same problem. That code runs on the calling machine, which isn't a Slurm node, so SLURM_JOB_ID isn't set.

@vchuravy
Member

vchuravy commented Dec 3, 2019

Probably that code needs to check if that environment variable is available and only then load it.
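For instance, something like this guarded lookup (the fallback to the process id is just one illustrative choice, not the package's actual behaviour):

```julia
# Only use SLURM_JOB_ID when it exists; otherwise fall back to something
# that is always available on the calling machine.
jobID = get(ENV, "SLURM_JOB_ID", nothing)
suffix = jobID === nothing ? string(getpid()) : jobID
outfile = "job$(suffix).out"
```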

@grahamas
Contributor

grahamas commented Dec 3, 2019

> Probably that code needs to check if that environment variable is available and only then load it.

I don't think so. The next line creates the srun command and uses jobID to set the name of the output file. It seems the intent is either a) to use %j to put the job ID into the name of the output file, or b) to give the output file a known name so that we can find it. However, I don't think we can do both. Possibly, after we run the command, we could figure out the ID and then know the name of the output file, but I'm not sure how to do that.
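As a sketch of option (a), the calling side could hand the substitution off to Slurm itself via the %j filename pattern, so SLURM_JOB_ID never needs to exist on the calling machine (the command below is illustrative, not the package's actual srun invocation):

```julia
# %j is expanded by Slurm to the job ID and %t to the task number,
# so the final file name is only known once the job has been assigned an ID.
np = 4
dir = pwd()
srun_cmd = `srun -J julia-workers -n $np -D $dir --output=$dir/job%j-%t.out julia --worker`
```

The downside is exactly the one above: the manager then can't predict the file name without first recovering the job ID.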

(I'm happy to do the coding and make a PR if someone can tell me what it's supposed to do; for now I'll get something that works and we can see what you think)

@mkschleg
Contributor

We can probably do a quick check for job files in the directory we want to save to (which already kinda exists) and, instead of deleting all the files:

  • If the env variable is available, check whether there will be clashes with the job_id (i.e. check for files that look like joinpath(location, "job_%jobID")).
  • If the env variable is not available, check for any existing job files and increment the highest value, or replace job_id with the date/time or something similar, so it is clear which run is the most recent (see the sketch after this comment).

All of these should be pretty straightforward to add. Another thing you could add is a flag that turns job files on/off and the job_id functionality on and off.
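A rough sketch of that fallback naming (the function name output_prefix and the file pattern are hypothetical, not anything that exists in the package):

```julia
using Dates

# Prefer the real job ID when running inside a Slurm allocation; otherwise
# stamp the output prefix with the date/time so the newest run is obvious.
function output_prefix(location::AbstractString)
    jobID = get(ENV, "SLURM_JOB_ID", nothing)
    jobID !== nothing && return joinpath(location, "job_$jobID")
    return joinpath(location, "job_" * Dates.format(now(), "yyyymmdd-HHMMSS"))
end
```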

@jishnub

jishnub commented Dec 12, 2019

On my cluster I have noticed that SLURM_JOB_ID is set after launching a job using srun, as might be expected from the name of the variable. A workaround at the moment is to submit an interactive job, run Julia on the compute node, and add workers using ClusterManagers. However, we should not expect it to be set before an srun command is called.
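A sketch of that interactive workaround (the worker count is just an example): inside an allocation, e.g. after salloc or an interactive srun, the variable is already set, so adding workers from that Julia session works.

```julia
# Run this from a Julia session started on a compute node inside an allocation.
using Distributed, ClusterManagers

println(get(ENV, "SLURM_JOB_ID", "not inside a Slurm job"))  # set by Slurm here
addprocs(SlurmManager(4))  # 4 workers, as an example
```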

@KajWiik

KajWiik commented Mar 12, 2020

The scripts in https://github.com/magerton/julia-slurm-example work OK; maybe this should be mentioned on the front page and in the documentation?
