Slurm: Error during job creation, leaves stale jobs #114

jishnub · 2019-05-01T10:42:27Z

I encounter this error when jobs time out:

julia> addprocs_slurm(100);
srun: job 1218546 queued and waiting for resources
Error launching Slurm job:
ERROR: UndefVarError: warn not defined
Stacktrace:
 [1] wait(::Task) at ./task.jl:191
 [2] #addprocs_locked#44(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:418
 [3] addprocs_locked at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:372 [inlined]
 [4] #addprocs#43(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:365
 [5] #addprocs_slurm#15 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:359 [inlined]
 [6] addprocs_slurm(::Int64) at /home/jb6888/.julia/packages/ClusterManagers/7pPEP/src/slurm.jl:85
 [7] top-level scope at none:0

The issue seems to be with @async_launch in cluster.jl. However, even after the error, the job is left pending on the queue and might be allocated resources later.

squeue -u jb6888
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
1218546   par_std julia-14  jb6888 PD       0:00      4 (Priority)

Shouldn't an error while launching a job also remove that job from the queue? Or is the job still there because the `warn` error prevents the subsequent clean-up from taking place?

vchuravy · 2019-05-01T14:31:54Z

Cleanup is normally performed when a process shuts down on the compute node, so you are right: we could and should do a better job with error handling here.
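
To make the error-handling idea concrete, here is a minimal sketch of what launch-side cleanup could look like. It is not the ClusterManagers.jl implementation: `launch_with_cleanup` and its `cancel` keyword are hypothetical names, and the sketch assumes the Slurm job ID is already known to the caller and that `scancel` is on the `PATH`. It also uses the `@warn` logging macro, since the `warn` function no longer exists in Julia 1.x.

```julia
# Hypothetical helper, not part of ClusterManagers.jl.
# Runs `launch()`; if it throws, cancels the Slurm job so it does not stay
# pending in the queue, then rethrows the original error.
# Assumption: `jobid` is known to the caller and `scancel` is available.
function launch_with_cleanup(launch::Function, jobid::Integer;
                             cancel::Function = id -> run(`scancel $id`))
    try
        return launch()
    catch err
        # `@warn` replaces the `warn` function removed in Julia 1.0.
        @warn "Launch failed; cancelling Slurm job" jobid exception=err
        try
            cancel(jobid)
        catch cancel_err
            # A failed cancellation should not hide the original error.
            @warn "Could not cancel Slurm job" jobid exception=cancel_err
        end
        rethrow()
    end
end
```

The `cancel` keyword exists only so the behaviour can be exercised without a Slurm installation, for example `launch_with_cleanup(() -> error("boom"), 1218546; cancel = id -> println("would cancel ", id))`.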

vchuravy changed the title ~~Timed-out jobs not removed from the queue on a Slurm cluster~~ Error during job creation, leaves stale jobs May 1, 2019

vchuravy added the help wanted label May 1, 2019

DilumAluthge changed the title ~~Error during job creation, leaves stale jobs~~ Slurm: Error during job creation, leaves stale jobs Jan 2, 2025

DilumAluthge added the manager: SLURM The Slurm Workload Manager label Jan 2, 2025
