Slurm example non-functional #134

Closed
tbenst opened this issue Apr 9, 2020 · 11 comments
Labels
manager: SLURM The Slurm Workload Manager

Comments

@tbenst

tbenst commented Apr 9, 2020

Thanks for your work on this!

The first problem is that the example needs using ClusterManagers, Distributed as its first line. With this fix, I get the following error:

Error launching Slurm job:
ERROR: LoadError: TaskFailedException:
MethodError: no method matching replace(::String, ::String, ::String)
Closest candidates are:
  replace(::String, !Matched::Pair{#s64,B} where B where #s64<:AbstractChar; count) at strings/util.jl:421
  replace(::String, !Matched::Pair{#s61,B} where B where #s61<:Union{Tuple{Vararg{AbstractChar,N} where N}, Set{#s58} where #s58<:AbstractChar, AbstractArray{#s59,1} where #s59<:AbstractChar}; count) at strings/util.jl:426
  replace(::String, !Matched::Pair; count) at strings/util.jl:433
  ...
Stacktrace:
 [1] launch(::SlurmManager, ::Dict{Symbol,Any}, ::Array{WorkerConfig,1}, ::Base.GenericCondition{Base.AlwaysLockedST}) at /home/users/tbenst/.julia/packages/ClusterManagers/7pPEP/src/slurm.jl:28
 [2] (::Distributed.var"#41#44"{SlurmManager,Dict{Symbol,Any},Array{WorkerConfig,1},Base.GenericCondition{Base.AlwaysLockedST}})() at ./task.jl:333
Stacktrace:
 [1] wait at ./task.jl:251 [inlined]
 [2] #addprocs_locked#40(::Base.Iterators.Pairs{Symbol,String,Tuple{Symbol,Symbol},NamedTuple{(:partition, :t),Tuple{String,String}}}, ::typeof(Distributed.addprocs_locked), ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:494
 [3] #addprocs_locked at ./none:0 [inlined]
 [4] #addprocs#39(::Base.Iterators.Pairs{Symbol,String,Tuple{Symbol,Symbol},NamedTuple{(:partition, :t),Tuple{String,String}}}, ::typeof(addprocs), ::SlurmManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:441
 [5] (::Distributed.var"#kw##addprocs")(::NamedTuple{(:partition, :t),Tuple{String,String}}, ::typeof(addprocs), ::SlurmManager) at ./none:0
 [6] top-level scope at /home/users/tbenst/code/julia_11_slurm/example.jl:13
 [7] include at ./boot.jl:328 [inlined]
 [8] include_relative(::Module, ::String) at ./loading.jl:1105
 [9] include(::Module, ::String) at ./Base.jl:31
 [10] exec_options(::Base.JLOptions) at ./client.jl:287
 [11] _start() at ./client.jl:460
in expression starting at /home/users/tbenst/code/julia_11_slurm/example.jl:13
@kescobo
Collaborator

kescobo commented Apr 12, 2020

I just ran into this as well. Without Distributed, you can use addprocs_slurm.
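
A minimal sketch of the two launch forms mentioned here, assuming a working Slurm installation. The worker count, partition name, and time limit are placeholders, not values taken from this thread:

```julia
# Option 1: load Distributed for addprocs and pass it a SlurmManager.
# This is the call form that appears in the stack trace above.
using Distributed, ClusterManagers
addprocs(SlurmManager(2), partition="debug", t="00:05:00")

# Option 2: use the wrapper exported by ClusterManagers instead,
# which does not require loading Distributed yourself.
# using ClusterManagers
# addprocs_slurm(2, partition="debug", t="00:05:00")
```

Only one of the two options should be run in a given session, since each call requests its own set of workers.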

The replace issue is fixed on master. The replace call in the current release uses the old three-argument syntax, which doesn't work on Julia 1.0+.
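
For illustration, the difference between the two replace forms looks roughly like this. The string and substitution below are made up for the example and are not the actual code in slurm.jl:

```julia
# Hypothetical input, chosen only to demonstrate the syntax.
path = "job-%j.out"

# Pre-1.0 form, removed in Julia 1.0. On Julia 1.x it raises
# "MethodError: no method matching replace(::String, ::String, ::String)":
# replace(path, "%j", "42")

# Julia 1.x form: pass the substitution as a Pair.
fixed = replace(path, "%j" => "42")
println(fixed)
```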

@kescobo
Collaborator

kescobo commented Apr 12, 2020

I just ran into #127 after fixing this though...

@zxjroger

Hi, have these issues been solved?

@kescobo
Collaborator

kescobo commented May 20, 2020

Alas no - everything is being blocked by lack of testing at the moment, and unfortunately no one seems to have the time to work on them :-(

@neversakura

I just want to report that v0.4.0 solves this problem. However, v0.4.0 is not tagged yet. I would appreciate it if v0.4.0 could be released, if there are no other concerns.

@kescobo
Collaborator

kescobo commented Aug 15, 2020

Yeah, see #118

@TimeExplorer

I also ran into this problem. Strangely, in a job array I submitted to the cluster, a random number of jobs ran successfully while the others failed with the same error as yours; see #140. Have you found a way to work around this? If v0.4.0 works, where can I get it?

@kescobo
Collaborator

kescobo commented Sep 21, 2020

@TimeExplorer In Pkg REPL: ] add ClusterManagers#master
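
The same thing can be done outside the Pkg REPL, for example from a setup script. This is a sketch using the Pkg API and assumes network access to the package repository:

```julia
using Pkg

# Track the master branch of ClusterManagers instead of the latest
# registered release.
Pkg.add(PackageSpec(name="ClusterManagers", rev="master"))
```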

@TimeExplorer

> @TimeExplorer In Pkg REPL: ] add ClusterManagers#master

Thanks! Unfortunately, even with the updated v0.4.0, some random jobs in my Slurm job array still fail with a similar error...

@grahamas
Contributor

Hi! Would anyone have a minute to review #139? I'm happy to fix any further problems that I overlooked, too...

@juliohm juliohm added the manager: SLURM The Slurm Workload Manager label Oct 6, 2020
@juliohm
Collaborator

juliohm commented Oct 6, 2020

I've just tagged v0.4.0. According to the comments, the issue should be solved there.

@juliohm juliohm closed this as completed Oct 6, 2020
7 participants