Problem with jobs being killed when using qsubcellfun #20
Comments
OK, I copied the data over (temporarily) to a more persistent location, and we will look into it, if needed. Have you checked the error-logs? |
Hi, Andrey is already looking at it…
I looked at the error files; I think they were very close to: “computer says no…”
The .e????? files are just empty,
and the .o???? files don't show anything interesting.
Memory was fine and so was wall time,
and the crashing nodes were not consistent.
|
OK, please ping me @achetverikov if you need input. Right now I cannot reproduce, because the AcquiredData/2022_09_invivo_data folder seems to be missing (at least on my end). |
My fault; when moving the data to the temporary folder I removed the data folder.
Line 17 should be
Input.data_path = '/'
|
OK thanks, it indeed needed to be pwd; I initially missed the fact that it was an attempt to load the mat file from the current folder. |
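For clarity, a minimal MATLAB sketch of what the fix amounts to (variable and file names are assumptions based on the discussion; the actual script is TestQsuberrors.m): point the data path at the current working directory so the .mat file is loaded from wherever the script is run.

```matlab
% Hypothetical sketch of the data-path fix discussed above; names are assumptions.
Input.data_path = pwd;                               % was a hard-coded data folder
datafile = dir(fullfile(Input.data_path, '*.mat'));  % whichever .mat file sits in the cwd
load(fullfile(Input.data_path, datafile(1).name));   % load it from the current folder
```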
So far, it runs through without error for me (I couldn't resist trying it out) |
I was able to reproduce the error this afternoon, but it works fine if qsubfeval is used instead. The script also throws errors (when using qsubcellfun) pointing at error logs at |
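As an illustration of the workaround mentioned above, here is a hedged sketch of submitting the jobs one by one with qsubfeval and collecting them with qsubget, instead of a single qsubcellfun call; the function name, number of jobs and resource requests are placeholders for whatever TestQsuberrors.m actually submits.

```matlab
% Hedged sketch: per-job submission with qsubfeval instead of qsubcellfun.
% process_slice, the slice count and the resource requests are hypothetical.
nslices = 64;
jobid   = cell(1, nslices);
for k = 1:nslices
  jobid{k} = qsubfeval(@process_slice, k, ...
                       'memreq', 8*1024^3, ...  % ~8 GB per job
                       'timreq', 3600);         % 1 hour wall time per job
end
results = cell(1, nslices);
for k = 1:nslices
  % retrieve the result of job k once it has finished (may require waiting/polling)
  results{k} = qsubget(jobid{k});
end
```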
Actually, now it seems to run fine. So: temporary problem with full scratch disk or too much simultaneous read-write? |
This "temporary" problem has been annoying me for the last 3 days! I will check if tonight everything still runs smoothly :)
|
If it works, but not always, then it seems to me that the TG should be involved in this...
|
Sorry, only today did I have the chance to run this again.
It indeed ran smoothly… the first time.
When I tried to repeat the process by just changing the regularisation parameter (the lambda)… suddenly the jobs started opening very slowly (one job every 5 seconds)…
and after claiming to have finished all the jobs, it never actually gives the results back to MATLAB.
(I actually counted the 64 and 108 jobs, respectively, that I had submitted in two different interactive MATLAB sessions… so all successfully completed)… and it has been hanging there for the last 2 hours :S
Any idea what the problem can be? Some prioritization issue for having asked for too many jobs or too much memory in one go?
José
|
For years it has been the case that the job IDs on Torque may change / are not
in register with the job IDs that qsubcellfun has in memory. So qsubcellfun
waits forever for the e/o files, whereas the jobs have completed and have
produced different e/o files (unknown to qsubcellfun). The only thing you
can do is restart qsubcellfun...
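For illustration, a hedged sketch of how one could check by hand whether any e/o files for the printed job IDs ever appeared in the working directory; the file-name pattern assumed here is the usual <jobname>.o<jobid> / <jobname>.e<jobid> convention.

```matlab
% Hedged sketch: list the Torque stdout/stderr files present in the current
% directory and compare them with the job IDs printed at submission time.
ofiles = dir('*.o*');   % stdout files written by Torque
efiles = dir('*.e*');   % stderr files written by Torque
fprintf('found %d stdout and %d stderr files\n', numel(ofiles), numel(efiles));
% If a job ID printed by qsubcellfun does not appear in any of these file
% names, its results will never be picked up and the batch has to be resubmitted.
```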
|
@hurngchunlee can you confirm that the jobIDs on Torque may change / are not in register with the jobIDs that are printed on screen upon submission? |
Yes, the |
That is not what @marcelzwiers reports: he stated that job IDs change, not that they disappear (which might also cause problems, but is nevertheless a different thing). |
Sorry for misunderstanding the question ... no, the job IDs never change, and I have never seen job IDs being changed after submission. |
Well, it shouldn't, but it happens. It was a long time ago that we looked
into it, and I don't remember the details, but I believe we thought it may
have to do with Torque re-running/re-scheduling the job. Perhaps qsubcellfun
would be more robust if it didn't rely on the job ID to retrieve the
results, but instead used the qsub option to write the e/o files with a name
determined by qsubcellfun?
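A hypothetical sketch of that suggestion (job name and script name are placeholders): have the submitting wrapper choose the e/o file names itself via qsub's -N/-o/-e options, so the result files can be located regardless of which job ID Torque ends up reporting.

```matlab
% Hypothetical sketch: fix the e/o file names at submission time with -N/-o/-e,
% instead of relying on the job ID that Torque assigns.
jobname = 'qsub_slice_001';                  % name chosen by the submitting wrapper
cmd = sprintf('qsub -N %s -o %s.o -e %s.e submit_slice.sh', ...
              jobname, jobname, jobname);    % submit_slice.sh is a placeholder script
[status, stdout] = system(cmd);              % stdout contains the job ID reported at submission
```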
|
OK ... but it has nothing to do with a changing job ID. The issue Marcel mentioned is due to the fact that -o and -e point to a directory that doesn't exist at job submission time, but the directory is later created as part of the job. This confuses Torque: at submission time the -o/-e specification is resolved as a file name, with the specification as the prefix (because the directory doesn't exist); but at the end of the job, since the directory has become available, it changes plan and writes -o/-e into the directory (and in that case it doesn't use the job name as the prefix, but the job ID). If the directory for -o/-e exists before job submission, this issue never occurs. |
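A hedged illustration of the scenario described above (paths and script name are hypothetical): make sure the directory that -o/-e point to exists before the job is submitted.

```matlab
% Hedged sketch: create the -o/-e target directory up front, so Torque can
% resolve the specification as a directory already at submission time.
logdir = fullfile(pwd, 'logs');
if ~exist(logdir, 'dir')
  mkdir(logdir);                             % the -o/-e target must exist before qsub runs
end
system(sprintf('qsub -o %s/ -e %s/ submit.sh', logdir, logdir));  % submit.sh is a placeholder
```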
No, that's a separate issue (well, not an issue actually); qsubcellfun
doesn't specify the e/o files, i.e. they are written in the cwd.
|
Describe the issue
I get non-reproducible crashes of jobs submitted using qsubcellfun (while the majority run successfully).
If I rerun the script using the same input data, a different subset of the submitted jobs will appear to crash.
The nodes where these crashes happen also don't seem to be consistent (a node might run 4 jobs successfully and fail 4 other jobs).
Describe yourself
Audience
whoever is using the cluster....
Test data
/home/common/temporary/4Staff
To Reproduce
Steps to reproduce the behavior:
Open TestQsuberrors.m
Run the script
Wait 10 or 15 minutes
Look at the output image; you will see that some slices are missing...
If you rerun the script, another set of slices will be missing... Wall time and memory are never an issue (a minimal sketch of the submission pattern is given below).
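A minimal sketch of the submission pattern (function name, slice count and resource requests are assumptions; the real call is in TestQsuberrors.m): one job per slice, submitted in a single qsubcellfun call.

```matlab
% Hedged sketch of the qsubcellfun call pattern; process_slice and the numbers
% are hypothetical stand-ins for what TestQsuberrors.m actually submits.
slices  = num2cell(1:64);                      % one input argument per job
results = qsubcellfun(@process_slice, slices, ...
                      'memreq', 8*1024^3, ...  % ~8 GB per job
                      'timreq', 3600);         % 1 hour wall time per job
% When a job is killed on its node, its output never comes back, which shows
% up as the missing slices in the output image.
```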
Expected behavior
All submitted jobs complete and no slices are missing from the output image.
Environment and versions
Using MATLAB on the cluster.