Slurm Job creation crashes on master because "SLURM_JOB_ID" not set #127

Closed
igor-krawczuk opened this issue Oct 19, 2019 · 6 comments · Fixed by #139

@igor-krawczuk

igor-krawczuk commented Oct 19, 2019

In order to get around the replace bug mentioned in #118 I installed directly from master, but this introduced another bug: the .out name change introduced in #123 causes job creation to crash, because my cluster does not set SLURM_JOB_ID (or any other Slurm variable) at the point in time the code expects it.

Removing the jobID solved the problem.
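A minimal sketch of the failure mode (the lookup below is illustrative, not the package's exact code): on the calling machine no Slurm variables exist, so the lookup throws before srun is ever invoked.

```julia
# Outside a Slurm allocation (e.g. on a login node) this lookup fails
# immediately, which is where job creation crashes.
jobID = ENV["SLURM_JOB_ID"]      # KeyError: key "SLURM_JOB_ID" not found
outfile = "job$(jobID).out"      # never reached
```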

@grahamas
Contributor

grahamas commented Dec 3, 2019

I'm having the same problem. That code runs on the calling machine, which isn't a Slurm node, so SLURM_JOB_ID isn't set.

@vchuravy
Member

vchuravy commented Dec 3, 2019

Probably that code needs to check if that environment variable is available and only then load it.
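For instance, something like this guarded lookup (the fallback to the process id is just one illustrative choice, not the package's actual behaviour):

```julia
# Only use SLURM_JOB_ID when it exists; otherwise fall back to something
# that is always available on the calling machine.
jobID = get(ENV, "SLURM_JOB_ID", nothing)
suffix = jobID === nothing ? string(getpid()) : jobID
outfile = "job$(suffix).out"
```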

@grahamas
Contributor

grahamas commented Dec 3, 2019

> Probably that code needs to check if that environment variable is available and only then load it.

I don't think so. The next line creates the srun command and uses jobID to set the name of the output file. It seems the intent is either a) to use %j to put the job ID into the name of the output file, or b) to give the output file a known name so that we can find it. However, I don't think we can do both. Possibly, after we run the command, we could figure out the ID and then know the name of the output file, but I'm not sure how to do that.
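As a sketch of option (a), the calling side could hand the substitution off to Slurm itself via the %j filename pattern, so SLURM_JOB_ID never needs to exist on the calling machine (the command below is illustrative, not the package's actual srun invocation):

```julia
# %j is expanded by Slurm to the job ID and %t to the task number,
# so the final file name is only known once the job has been assigned an ID.
np = 4
dir = pwd()
srun_cmd = `srun -J julia-workers -n $np -D $dir --output=$dir/job%j-%t.out julia --worker`
```

The downside is exactly the one above: the manager then can't predict the file name without first recovering the job ID.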

(I'm happy to do the coding and make a PR if someone can tell me what it's supposed to do; for now I'll get something that works and we can see what you think)

@mkschleg
Contributor

We can probably do a quick check for job files in the directory we want to save to (which already kinda exists) and, instead of deleting all the files:

  • If the env variable is available, check whether there will be clashes with the job_id (i.e. check for files that look like joinpath(location, "job_%jobID")).
  • If the env variable is not available, check for any existing job files and increment the highest value, or replace job_id with the date/time or something similar, so it is clear which run is the most recent (see the sketch after this comment).

All of these should be pretty straightforward to add. Another thing you could add is a flag that turns job files on/off and the job_id functionality on and off.
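A rough sketch of that fallback naming (the function name output_prefix and the file pattern are hypothetical, not anything that exists in the package):

```julia
using Dates

# Prefer the real job ID when running inside a Slurm allocation; otherwise
# stamp the output prefix with the date/time so the newest run is obvious.
function output_prefix(location::AbstractString)
    jobID = get(ENV, "SLURM_JOB_ID", nothing)
    jobID !== nothing && return joinpath(location, "job_$jobID")
    return joinpath(location, "job_" * Dates.format(now(), "yyyymmdd-HHMMSS"))
end
```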

@jishnub

jishnub commented Dec 12, 2019

On my cluster I have noticed that SLURM_JOB_ID is set after launching a job using srun, as might be expected from the name of the variable. A workaround at the moment is to submit an interactive job, run Julia on the compute node, and add workers using ClusterManagers. However, we should not expect it to be set before an srun command is called.
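A sketch of that interactive workaround (the worker count is just an example): inside an allocation, e.g. after salloc or an interactive srun, the variable is already set, so adding workers from that Julia session works.

```julia
# Run this from a Julia session started on a compute node inside an allocation.
using Distributed, ClusterManagers

println(get(ENV, "SLURM_JOB_ID", "not inside a Slurm job"))  # set by Slurm here
addprocs(SlurmManager(4))  # 4 workers, as an example
```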

@KajWiik

KajWiik commented Mar 12, 2020

The scripts in https://github.com/magerton/julia-slurm-example work OK; maybe this should be mentioned on the front page and in the documentation?
