Slurm: Error during job creation, leaves stale jobs #114

Open
jishnub opened this issue May 1, 2019 · 1 comment
Labels
help wanted manager: SLURM The Slurm Workload Manager


jishnub commented May 1, 2019

I encounter the following error when jobs time out:

julia> addprocs_slurm(100);
srun: job 1218546 queued and waiting for resources
Error launching Slurm job:
ERROR: UndefVarError: warn not defined
Stacktrace:
 [1] wait(::Task) at ./task.jl:191
 [2] #addprocs_locked#44(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:418
 [3] addprocs_locked at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:372 [inlined]
 [4] #addprocs#43(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:365
 [5] #addprocs_slurm#15 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:359 [inlined]
 [6] addprocs_slurm(::Int64) at /home/jb6888/.julia/packages/ClusterManagers/7pPEP/src/slurm.jl:85
 [7] top-level scope at none:0

The issue seems to be with @async_launch in cluster.jl. However, even after the error, the job is left pending on the queue and might be allocated resources later.

squeue -u jb6888
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
1218546   par_std julia-14  jb6888 PD       0:00      4 (Priority)

Shouldn't an error while launching a job also remove that job from the queue? Or is it still there because the warn error prevents the subsequent clean-up from taking place?

@vchuravy vchuravy changed the title Timed-out jobs not removed from the queue on a Slurm cluster Error during job creation, leaves stale jobs May 1, 2019
Member

vchuravy commented May 1, 2019

Cleanup is normally performed when a process shuts down on the compute node, so you are right: we could and should do a better job with error handling here.
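
For illustration only, here is a minimal sketch of what launch-side cleanup could look like. It is not the ClusterManagers.jl implementation: `parse_slurm_jobid` and `launch_with_cleanup` are hypothetical names, and the sketch assumes the job ID can be read from the `srun: job ... queued` line shown in the report and that the `scancel` command is available on the submitting host.

```julia
# Hypothetical helper: pull the job ID out of srun's "queued" message.
# Returns `nothing` if the line does not match.
function parse_slurm_jobid(line::AbstractString)
    m = match(r"srun: job (\d+) queued", line)
    return m === nothing ? nothing : parse(Int, m.captures[1])
end

# Hypothetical wrapper: run `launch`; if it throws, try to cancel the
# Slurm job before propagating the original error.
# `cancel` defaults to invoking the `scancel` CLI and can be replaced,
# which makes the sketch usable without a Slurm installation.
function launch_with_cleanup(launch::Function, jobid::Integer;
                             cancel::Function = (id -> run(`scancel $id`)))
    try
        return launch()
    catch err
        try
            cancel(jobid)
        catch cancel_err
            # A failed cancellation should not hide the original error.
            @warn "Could not cancel Slurm job" jobid exception = cancel_err
        end
        rethrow(err)
    end
end
```

With something along these lines, a launch that fails would no longer leave the job in the `PD` state. Until such handling exists, the stale job from the report can be removed by hand with `scancel 1218546`.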

@DilumAluthge DilumAluthge changed the title Error during job creation, leaves stale jobs Slurm: Error during job creation, leaves stale jobs Jan 2, 2025
@DilumAluthge DilumAluthge added the manager: SLURM The Slurm Workload Manager label Jan 2, 2025