Problem with jobs being killed when using qsubcellfun #20
Comments
OK, I copied the data over (temporarily) to a more persistent location, and we will look into it, if needed. Have you checked the error-logs? |
Hi, Andrey is already looking at it…
I looked at the error files; I think they were very close to: “computer says no…”
The .e????? files are just empty,
and the .o???? files don't show anything interesting.
Memory was fine and so was wall time,
and the crashing nodes were not consistent.
|
OK, please ping me @achetverikov if you need input. Right now I cannot reproduce, because the AcquiredData/2022_09_invivo_data folder seems to be missing (at least on my end). |
My fault; when moving the data to the temporary folder I removed the data folder.
Line 17 should be
Input.data_path = '/'
|
OK thanks, it indeed needed to be pwd; I initially missed the fact that it was an attempt to load the mat file from the current folder. |
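For clarity, a minimal MATLAB sketch of what the fix amounts to (variable and file names are assumptions based on the discussion; the actual script is TestQsuberrors.m): point the data path at the current working directory so the .mat file is loaded from wherever the script is run.

```matlab
% Hypothetical sketch of the data-path fix discussed above; names are assumptions.
Input.data_path = pwd;                               % was a hard-coded data folder
datafile = dir(fullfile(Input.data_path, '*.mat'));  % whichever .mat file sits in the cwd
load(fullfile(Input.data_path, datafile(1).name));   % load it from the current folder
```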
So far, it runs through without error for me (I couldn't resist trying it out) |
I was able to reproduce the error this afternoon, but it works fine if qsubfeval is used instead. The script also throws errors (when using qsubcellfun) pointing at error logs at |
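As an illustration of the workaround mentioned above, here is a hedged sketch of submitting the jobs one by one with qsubfeval and collecting them with qsubget, instead of a single qsubcellfun call; the function name, number of jobs and resource requests are placeholders for whatever TestQsuberrors.m actually submits.

```matlab
% Hedged sketch: per-job submission with qsubfeval instead of qsubcellfun.
% process_slice, the slice count and the resource requests are hypothetical.
nslices = 64;
jobid   = cell(1, nslices);
for k = 1:nslices
  jobid{k} = qsubfeval(@process_slice, k, ...
                       'memreq', 8*1024^3, ...  % ~8 GB per job
                       'timreq', 3600);         % 1 hour wall time per job
end
results = cell(1, nslices);
for k = 1:nslices
  % retrieve the result of job k once it has finished (may require waiting/polling)
  results{k} = qsubget(jobid{k});
end
```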
Actually, now it seems to run fine. So: temporary problem with full scratch disk or too much simultaneous read-write? |
This "temporary" problem has been annoying me for the last 3 days! I will check if tonight everything still runs smoothly :)
|
If it works, but not always, then it seems to me that the TG should be involved in this...
|
Sorry, only today did I have the chance to run this again.
It indeed ran smoothly… the first time.
When I tried to repeat the process by just changing the regularisation parameter (the lambda)… suddenly the jobs started opening very slowly (one job every 5 seconds)…
and after claiming to have finished all the jobs, it never actually gives the results back to MATLAB.
(I actually counted the 64 and 108 jobs, respectively, that I had submitted in two different interactive MATLAB sessions… so all successfully completed)… and it has been hanging there for the last 2 hours :S
Any idea what the problem can be? Some prioritization issue for having asked for too many jobs or too much memory in one go?
José
|
For years it has been the case that the job IDs on Torque may change / are not
in register with the job IDs that qsubcellfun has in memory. So qsubcellfun
waits forever for the e/o files, whereas the jobs have completed and have
produced different e/o files (unknown to qsubcellfun). The only thing you
can do is restart qsubcellfun...
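For illustration, a hedged sketch of how one could check by hand whether any e/o files for the printed job IDs ever appeared in the working directory; the file-name pattern assumed here is the usual <jobname>.o<jobid> / <jobname>.e<jobid> convention.

```matlab
% Hedged sketch: list the Torque stdout/stderr files present in the current
% directory and compare them with the job IDs printed at submission time.
ofiles = dir('*.o*');   % stdout files written by Torque
efiles = dir('*.e*');   % stderr files written by Torque
fprintf('found %d stdout and %d stderr files\n', numel(ofiles), numel(efiles));
% If a job ID printed by qsubcellfun does not appear in any of these file
% names, its results will never be picked up and the batch has to be resubmitted.
```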
|
@hurngchunlee can you confirm that the jobIDs on Torque may change / are not in register with the jobIDs that are printed on screen upon submission? |
Yes, the |
That is not what @marcelzwiers reports: he stated that job IDs change, not that they disappear (which might also cause problems, but is nevertheless a different thing). |
Sorry for misunderstanding the question ... no, the job IDs never change, and I have never seen job IDs being changed after submission. |
Well, it shouldn't, but it happens. It was a long time ago that we looked
into it, and I don't remember the details, but I believe we thought it may
have to do with Torque re-running/re-scheduling the job. Perhaps qsubcellfun
would be more robust if it didn't rely on the job ID to retrieve the
results, but instead used the qsub option to write the e/o files with a name
determined by qsubcellfun?
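A hypothetical sketch of that suggestion (job name and script name are placeholders): have the submitting wrapper choose the e/o file names itself via qsub's -N/-o/-e options, so the result files can be located regardless of which job ID Torque ends up reporting.

```matlab
% Hypothetical sketch: fix the e/o file names at submission time with -N/-o/-e,
% instead of relying on the job ID that Torque assigns.
jobname = 'qsub_slice_001';                  % name chosen by the submitting wrapper
cmd = sprintf('qsub -N %s -o %s.o -e %s.e submit_slice.sh', ...
              jobname, jobname, jobname);    % submit_slice.sh is a placeholder script
[status, stdout] = system(cmd);              % stdout contains the job ID reported at submission
```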
|
OK ... but it has nothing to do with a changing job ID. The issue Marcel mentioned is due to the fact that -o and -e point to a directory that doesn't exist at job submission time, but the directory is later created as part of the job. This confuses Torque: at submission time the -o/-e specification is resolved as a file name, with the specification as the prefix (because the directory doesn't exist); but at the end of the job, since the directory has become available, it changes plan and writes -o/-e into the directory (and in that case it doesn't use the job name as the prefix, but the job ID). If the directory for -o/-e exists before job submission, this issue never occurs. |
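A hedged illustration of the scenario described above (paths and script name are hypothetical): make sure the directory that -o/-e point to exists before the job is submitted.

```matlab
% Hedged sketch: create the -o/-e target directory up front, so Torque can
% resolve the specification as a directory already at submission time.
logdir = fullfile(pwd, 'logs');
if ~exist(logdir, 'dir')
  mkdir(logdir);                             % the -o/-e target must exist before qsub runs
end
system(sprintf('qsub -o %s/ -e %s/ submit.sh', logdir, logdir));  % submit.sh is a placeholder
```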
No, that's a separate issue (well, not an issue actually); qsubcellfun
doesn't specify the e/o files, i.e. they are written in the cwd.
|
Describe the issue
I get non-reproducible crashes of jobs submitted using qsubcellfun (while the majority run successfully).
If I rerun the script using the same input data, a different subset of the submitted jobs will appear to crash.
The nodes where these crashes happen also don't seem to be consistent (a node might run 4 jobs successfully and fail 4 other jobs).
Describe yourself
Audience
whoever is using the cluster....
Test data
/home/common/temporary/4Staff
To Reproduce
Steps to reproduce the behavior:
Open TestQsuberrors.m
Run the script
Wait 10 or 15 minutes
Look at the output image; you will see that some slices are missing...
If you rerun the script, another set of slices will be missing... Wall time and memory are never an issue (a minimal sketch of the submission pattern is given below).
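A minimal sketch of the submission pattern (function name, slice count and resource requests are assumptions; the real call is in TestQsuberrors.m): one job per slice, submitted in a single qsubcellfun call.

```matlab
% Hedged sketch of the qsubcellfun call pattern; process_slice and the numbers
% are hypothetical stand-ins for what TestQsuberrors.m actually submits.
slices  = num2cell(1:64);                      % one input argument per job
results = qsubcellfun(@process_slice, slices, ...
                      'memreq', 8*1024^3, ...  % ~8 GB per job
                      'timreq', 3600);         % 1 hour wall time per job
% When a job is killed on its node, its output never comes back, which shows
% up as the missing slices in the output image.
```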
Expected behavior
All submitted jobs complete and no slices are missing from the output image.
Environment and versions
Using MATLAB on the cluster.