
Problem with jobs being killed when using qsubcellfun #20

Open
JosePMarques opened this issue Oct 14, 2022 · 19 comments

@JosePMarques

Describe the issue
I get non-reproducible crashes of jobs submitted with qsubcellfun (while the majority run successfully).
If I rerun the script with the same input data, a different subset of the submitted jobs crashes.
The nodes where these crashes happen also don't seem to be consistent (a node might run 4 jobs successfully and fail 4 others).

Describe yourself

  1. José Marques
  2. MR physicist
  3. MR techniques

Audience
Whoever is using the cluster.

Test data
/home/common/temporary/4Staff
To Reproduce
Steps to reproduce the behavior:

  1. Open TestQsuberrors.m

  2. Run the script (a minimal sketch of the kind of qsubcellfun call it makes is given after this list)

  3. Wait 10 or 15 minutes

  4. Look at the output image; you will see that some slices are missing

  5. If you rerun the script, a different set of slices will be missing. Wall time and memory are never an issue.
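Not the actual TestQsuberrors.m, but a minimal sketch assuming a per-slice qsubcellfun submission of this kind; the function name process_slice, the number of slices and the resource requests are hypothetical placeholders, not taken from the script.

  % one MATLAB job per slice, submitted to Torque via the FieldTrip qsub toolbox
  nslices = 40;                          % hypothetical number of slices
  slices  = num2cell(1:nslices);
  results = qsubcellfun(@process_slice, slices, ...
      'memreq', 4*1024^3, ...            % requested memory per job (here 4 GB)
      'timreq', 15*60);                  % requested wall time per job (here 15 min)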

Expected behavior
All submitted jobs complete successfully, so that no slices are missing from the output image.

Screenshots
[screenshot "example" attached in the original issue]

Environment and versions
Using MATLAB on the cluster.

@schoffelen
Collaborator

OK, I copied the data over (temporarily) to a more persistent location, and we will look into it if needed.

Have you checked the error-logs?

@JosePMarques
Author

JosePMarques commented Oct 14, 2022 via email

@schoffelen
Collaborator

OK, please ping me @achetverikov if you need input. Right now I cannot reproduce it, because the AcquiredData/2022_09_invivo_data folder seems to be missing (at least on my end).

@JosePMarques
Author

JosePMarques commented Oct 14, 2022 via email

@schoffelen
Collaborator

OK thanks, it needed to be pwd indeed; I initially missed the fact that it was trying to load the mat file from the current folder.

@schoffelen
Collaborator

So far it runs through without error for me (I couldn't resist trying it out).

@achetverikov

I was able to reproduce the error this afternoon, but it works fine if qsubfeval is used instead (a sketch of that workaround is below). The script also throws errors (when using qsubcellfun) pointing at error logs in /var/spool/torque/mom_priv/jobs/, but when I try to open the logs they aren't there anymore. I'll look more into it next week.
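For reference, a hedged sketch of the qsubfeval-based workaround mentioned above (process_slice, the number of slices and the resource requests are again placeholders, not the actual script): each job is submitted individually with qsubfeval, and the outputs are collected afterwards with qsubget.

  nslices = 40;                                  % placeholder
  jobids  = cell(1, nslices);
  for i = 1:nslices                              % submit one job per slice
    jobids{i} = qsubfeval(@process_slice, i, 'memreq', 4*1024^3, 'timreq', 15*60);
  end
  results = cell(1, nslices);
  for i = 1:nslices                              % wait for and collect each result
    results{i} = qsubget(jobids{i}, 'timeout', inf);
  end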

@achetverikov

achetverikov commented Oct 14, 2022

Actually, now it seems to run fine. So: a temporary problem with a full scratch disk, or too much simultaneous reading/writing?

@JosePMarques
Author

JosePMarques commented Oct 14, 2022 via email

@marcelzwiers
Collaborator

marcelzwiers commented Oct 14, 2022 via email

@JosePMarques
Author

JosePMarques commented Oct 19, 2022 via email

@marcelzwiers
Collaborator

marcelzwiers commented Oct 19, 2022 via email

@robertoostenveld
Member

@hurngchunlee can you confirm that the jobIDs on Torque may change / are not in register with the jobIDs that are printed on screen upon submission?

@hurngchunlee
Member

Yes, qstat will keep all running and pending jobs, but completed jobs are only kept for 1 hour (i.e. jobs that completed more than an hour ago are gone from qstat). This is to reduce the memory requirement and to keep qstat responses fast.

@robertoostenveld
Member

That is not what @marcelzwiers reports: he stated that job IDs change, not that they disappear (which might also cause problems, but is a different issue).

@hurngchunlee
Member

hurngchunlee commented Oct 20, 2022

Sorry for misunderstanding the question... no, the job IDs never change; I have never seen a jobID change after submission.

@marcelzwiers
Collaborator

marcelzwiers commented Oct 20, 2022 via email

@hurngchunlee
Member

OK... but it has nothing to do with a changing jobID.

The issue Marcel mentioned is due to the fact that -o and -e point to a directory that does not exist at job submission time; the directory is only created later, as part of the job.

This confuses Torque: at submission time the -o/-e paths are resolved as files, with the given specification used as the prefix (because the directory does not exist); but at the end of the job, since the directory has become available, it changes plan and writes the -o/-e output into that directory, and in that case it uses the jobId rather than the jobName as the prefix.

If the -o/-e directory exists before job submission, this issue never occurs.
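To illustrate the point (a hedged sketch, not the qsub toolbox's own fix; the directory name is hypothetical): make sure the log directory exists on the MATLAB side before any job is submitted, so that Torque can resolve -o/-e as a directory from the start.

  logdir = fullfile(pwd, 'torque_logs');   % hypothetical directory used for -o/-e
  if ~exist(logdir, 'dir')
    mkdir(logdir);                          % create it before submission
  end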

@marcelzwiers
Collaborator

marcelzwiers commented Oct 20, 2022 via email
