[BUG]: ClusterManager not working on PBS #419

Open
nathaliesoy opened this issue Aug 30, 2023 · 9 comments
Labels: bug (Something isn't working)

Comments

@nathaliesoy

What happened?

When using the cluster manager on PBS, the fit crashes. The workers are never launched because qsub rejects the flags it is given (see the "invalid option" lines in the log below).

Version

0.14.1

Operating System

Linux

Package Manager

pip

Interface

Script (i.e., python my_script.py)

Relevant log output

output: 
Compiling Julia backend...
Error launching workers
ErrorException("")
Activating environment on workers.
Importing installed module on workers...Finished!
Testing module on workers...Finished!
Testing entire pipeline on workers...Finished!
error: 
qsub: invalid option -- 'w'
qsub: invalid option -- 'd'
qsub: invalid option -- 't'
usage: qsub [-a date_time] [-A account_string] [-c interval]
	[-C directive_prefix] [-e path] [-f ] [-h ] [-I [-X]] [-j oe|eo] [-J X-Y[:Z]]
	[-k keep] [-l resource_list] [-m mail_options] [-M user_list]
	[-N jobname] [-o path] [-p priority] [-P project] [-q queue] [-r y|n]
	[-R o|e|oe] [-S path] [-u user_list] [-W otherattributes=value...]
	[-S path] [-u user_list] [-W otherattributes=value...]
	[-v variable_list] [-V ] [-z] [script | -- command [arg1 ...]]
       qsub --version
/srv01/agrp/soybna/.local/lib/python3.8/site-packages/pysr/sr.py:1230: UserWarning: Note: Using a large maxsize for the equation search will be exponentially slower and use significant memory. You should consider turning `use_frequency` to False, and perhaps use `warmup_maxsize_by`.
  warnings.warn(
/srv01/agrp/soybna/.local/lib/python3.8/site-packages/pysr/julia_helpers.py:195: UserWarning: Your system's Python library is static (e.g., conda), so precompilation will be turned off. For a dynamic library, try `pyenv`.
  warnings.warn(
Traceback (most recent call last):
  File "run_pysr.py", line 28, in <module>
    model.fit(traindata['features'], traindata['init_hidden_rep'])
  File "/srv01/agrp/soybna/.local/lib/python3.8/site-packages/pysr/sr.py", line 1845, in fit
    self._run(X, y, mutated_params, weights=weights, seed=seed)
  File "/srv01/agrp/soybna/.local/lib/python3.8/site-packages/pysr/sr.py", line 1705, in _run
    self.raw_julia_state_ = SymbolicRegression.EquationSearch(
RuntimeError: <PyCall.jlwrap (in a Julia function called from Python)
JULIA: MethodError: reducing over an empty collection is not allowed; consider supplying `init` to the reducer
Stacktrace:
  [1] mapreduce_empty(#unused#::typeof(identity), op::Function, T::Type)
    @ Base ./reduce.jl:367
  [2] reduce_empty(op::Base.MappingRF{typeof(identity), SymbolicRegression.SearchUtilsModule.var"#2#4"{Dict{Int64, Int64}}}, #unused#::Type{Int64})
    @ Base ./reduce.jl:356
  [3] reduce_empty_iter
    @ ./reduce.jl:379 [inlined]
  [4] mapreduce_empty_iter(f::Function, op::Function, itr::Vector{Int64}, ItrEltype::Base.HasEltype)
    @ Base ./reduce.jl:375
  [5] _mapreduce(f::typeof(identity), op::SymbolicRegression.SearchUtilsModule.var"#2#4"{Dict{Int64, Int64}}, #unused#::IndexLinear, A::Vector{Int64})
    @ Base ./reduce.jl:427
  [6] _mapreduce_dim
    @ ./reducedim.jl:365 [inlined]
  [7] #mapreduce#800
    @ ./reducedim.jl:357 [inlined]
  [8] mapreduce
    @ ./reducedim.jl:357 [inlined]
  [9] #reduce#802
    @ ./reducedim.jl:406 [inlined]
 [10] reduce
    @ ./reducedim.jl:406 [inlined]
 [11] next_worker(worker_assignment::Dict{Tuple{Int64, Int64}, Int64}, procs::Vector{Int64})
    @ SymbolicRegression.SearchUtilsModule ~/.julia/packages/SymbolicRegression/Y57Eu/src/SearchUtils.jl:23
 [12] _EquationSearch(parallelism::Symbol, datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, NamedTuple{(), Tuple{}}}}; niterations::Int64, options::Options{Int64, Optim.Options{Float64, Nothing}, L2DistLoss, Nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, numprocs::Int64, procs::Nothing, addprocs_function::typeof(addprocs_pbs), runtests::Bool, saved_state::Nothing)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/Y57Eu/src/SymbolicRegression.jl:572
 [13] _EquationSearch
    @ ~/.julia/packages/SymbolicRegression/Y57Eu/src/SymbolicRegression.jl:412 [inlined]
 [14] EquationSearch(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, NamedTuple{(), Tuple{}}}}; niterations::Int64, options::Options{Int64, Optim.Options{Float64, Nothing}, L2DistLoss, Nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, parallelism::String, numprocs::Int64, procs::Nothing, addprocs_function::typeof(addprocs_pbs), runtests::Bool, saved_state::Nothing)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/Y57Eu/src/SymbolicRegression.jl:399
 [15] EquationSearch(X::Matrix{Float32}, y::Matrix{Float32}; niterations::Int64, weights::Nothing, varMap::Vector{String}, options::Options{Int64, Optim.Options{Float64, Nothing}, L2DistLoss, Nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, parallelism::String, numprocs::Int64, procs::Nothing, addprocs_function::typeof(addprocs_pbs), runtests::Bool, saved_state::Nothing, multithreaded::Nothing, loss_type::Type)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/Y57Eu/src/SymbolicRegression.jl:332
 [16] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Any, NTuple{8, Symbol}, NamedTuple{(:weights, :niterations, :varMap, :options, :numprocs, :parallelism, :saved_state, :addprocs_function), Tuple{Nothing, Int64, Vector{String}, Options{Int64, Optim.Options{Float64, Nothing}, L2DistLoss, Nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, Int64, String, Nothing, typeof(addprocs_pbs)}}})
    @ Base ./essentials.jl:818
 [17] _pyjlwrap_call(f::Function, args_::Ptr{PyCall.PyObject_struct}, kw_::Ptr{PyCall.PyObject_struct})
    @ PyCall ~/.julia/packages/PyCall/twYvK/src/callback.jl:32
 [18] pyjlwrap_call(self_::Ptr{PyCall.PyObject_struct}, args_::Ptr{PyCall.PyObject_struct}, kw_::Ptr{PyCall.PyObject_struct})
    @ PyCall ~/.julia/packages/PyCall/twYvK/src/callback.jl:44>

Extra Info

Setting multithreading to False doesn't change anything.
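
One plausible reading of the log above: qsub rejects the command, no workers are registered, and the later MethodError is a knock-on effect of choosing a worker from an empty list (frame [11], next_worker, receives procs::Vector{Int64}). The Python sketch below is an analogy, not the SymbolicRegression.jl source. The function is a hypothetical re-creation that picks the least-busy worker with a reduce, and it fails in the same way when the worker list is empty.

```python
from functools import reduce


def next_worker(worker_assignment, procs):
    """Pick the worker id in `procs` with the fewest assigned jobs.

    Hypothetical re-creation for illustration only; not the Julia source.
    """
    job_counts = {proc: 0 for proc in procs}
    for worker in worker_assignment.values():
        job_counts[worker] += 1
    # reduce() without an initial value raises TypeError on an empty list,
    # which mirrors Julia's "reducing over an empty collection" error.
    return reduce(
        lambda a, b: a if job_counts[a] <= job_counts[b] else b,
        procs,
    )


# Workers 2 and 3 exist; worker 2 already has one job, so 3 is chosen.
least_busy = next_worker({(1, 1): 2}, [2, 3])

# No workers were launched (e.g. qsub failed): nothing to fold over.
try:
    next_worker({}, [])
    empty_error = None
except TypeError as exc:
    empty_error = str(exc)
```

If this reading is right, the Julia stack trace is a symptom, and the qsub error earlier in the log is the thing to fix.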

@MilesCranmer
Owner

Thanks! This looks like it might be an issue in ClusterManagers.jl; see JuliaParallel/ClusterManagers.jl#179.

What does qsub --version report?

@nathaliesoy
Author

pbs_version = 20.0.1

@MilesCranmer
Owner

Okay this might take a bit longer to solve. It turns out to be really hard to set up a local version of PBS for testing things. But I'm working on it!

JuliaParallel/ClusterManagers.jl#193

@MilesCranmer
Owner

Basically what we need to do is modify these lines to fix ClusterManagers.jl:

https://github.com/JuliaParallel/ClusterManagers.jl/blob/0b0ee3dc772beee0c8cccc77079d941b979ffeac/src/qsub.jl#L52-L54

            qsub_cmd = pipeline(`echo $(Base.shell_escape(cmd))` , (isPBS ?
                    `qsub -N $jobname -wd $wd -j oe -k o -t 1-$np $queue` :
                    `qsub -N $jobname -wd $wd -terse -j y -R y -t 1-$np -V $queue`))

It sounds like they haven't yet updated this qsub call for PBS version 20: the -wd and -t flags are exactly the ones qsub rejects in the log above.
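
As a starting point for working out replacement flags, here is a small Python sketch that assembles a qsub argument list using only options that appear in the usage text printed in the log (-N, -j, -k, -J, -q). It is an untested assumption, not a verified fix for ClusterManagers.jl: -J X-Y is taken to stand in for -t 1-$np, and -wd is dropped because the usage text lists no working-directory option (changing into $PBS_O_WORKDIR inside the job is one common alternative). The function name, job name, and queue name are hypothetical.

```python
import shlex


def build_pbs_qsub_args(jobname, num_workers, queue=None):
    """Return a qsub argument list for a PBS-style job array (untested sketch)."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    args = [
        "qsub",
        "-N", jobname,             # job name, as in the original call
        "-j", "oe",                # merge stderr into stdout, as in the original call
        "-k", "o",                 # carried over from the original call
        "-J", f"1-{num_workers}",  # job array range; assumed stand-in for `-t 1-$np`
    ]
    if queue:
        args += ["-q", queue]      # optional destination queue
    return args


cmd = build_pbs_qsub_args("julia-workers", 10, queue="short")
print(shlex.join(cmd))
# qsub -N julia-workers -j oe -k o -J 1-10 -q short
```

Running the printed command by hand on the cluster, before touching src/qsub.jl, is a cheap way to check whether qsub accepts it.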

If you are proficient with qsub and know what flags need to be used here, you might be able to make a local modification of ClusterManagers.jl, and then switch to that copy of ClusterManagers.jl with PySR with:

cd ClusterManagers.jl
julia --project=@pysr-0.16.3 -e 'using Pkg; pkg"dev ."'

This will get the PySR environment for 0.16.3 to use the local copy of ClusterManagers.jl. Then if you are able to update the qsub call in the src/qsub.jl file to the qsub version 20 syntax, it should work.

@nathaliesoy
Author

Thank you Miles for investigating this! I think I figured out the new PBS 20 flags and changed it accordingly.

So I added these two lines to my submission shell script

cd ClusterManagers.jl
julia --project=@pysr-0.16.3 -e 'using Pkg; pkg"dev ."'

but it doesn't look like it is picking up the local package. The julia version I am using is globally installed on the cluster. I can't recall, does the ClusterManagers.jl need to be in a specific folder? Do I need to set some path somewhere?

@MilesCranmer
Owner

MilesCranmer commented Aug 30, 2023

Even if the Julia version is globally installed, you should have the environments appear in your local folder ~/.julia/environments. There should be a pysr-0.16.3 one in that folder (or whatever version of PySR you have installed).

If you open the file ~/.julia/environments/pysr-0.16.3/Manifest.toml, and go to the "ClusterManagers.jl" section, it should tell you if it is a local version or not, and what folder it is using. Maybe the path name is a relative path rather than absolute? You could also try

julia --project=@pysr-0.16.3 -e 'using Pkg; Pkg.develop(path="/path/to/clustermanagers.jl")' 

and give the full absolute path (to the location of your modified ClusterManagers.jl) there?
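
To check the manifest without reading it by eye, a short Python sketch like the following can report the path recorded for a package. It is a simplistic line scan, not a real TOML parser, and it assumes the entry header is `[[deps.ClusterManagers]]` (newer manifests) or `[[ClusterManagers]]` (older ones) and that a developed package carries a `path = "..."` key. The function name and the sample manifest text are made up for illustration.

```python
import re


def find_dev_path(manifest_text, package):
    """Return the `path` recorded for `package` in a Manifest.toml, or None."""
    headers = {f"[[deps.{package}]]", f"[[{package}]]"}
    in_section = False
    for line in manifest_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Any table header starts a new section; track whether it is ours.
            in_section = stripped in headers
            continue
        if in_section:
            match = re.match(r'path\s*=\s*"(.*)"', stripped)
            if match:
                return match.group(1)
    return None


# Made-up manifest fragment: ClusterManagers is developed from a local folder.
sample = """
[[deps.ClusterManagers]]
deps = ["Distributed", "Logging", "Pkg", "Sockets"]
path = "/home/user/ClusterManagers.jl"

[[deps.Distributed]]
deps = ["Random", "Serialization", "Sockets"]
"""

print(find_dev_path(sample, "ClusterManagers"))  # /home/user/ClusterManagers.jl
print(find_dev_path(sample, "Distributed"))      # None
```

On a real system, the text would come from reading ~/.julia/environments/pysr-0.16.3/Manifest.toml (or the folder for your PySR version). A result of None, or a relative path, would suggest the local copy is not being picked up as intended.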

@MilesCranmer
Owner

Oh wait, sorry. I just realized you said in the original post that you are using PySR 0.14.1. So either (1) update to PySR 0.16.3 and go through the normal installation with python -m pysr install before implementing these changes, or (2) use --project=@pysr-0.14.1 instead of --project=@pysr-0.16.3.
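
To avoid this kind of version mismatch, the flag and manifest location can be derived from the installed PySR version. The Python sketch below assumes the `pysr-<version>` environment naming and the default ~/.julia depot described in this thread. The function name is hypothetical, and the version is passed in as a string rather than read from the installed package.

```python
from pathlib import Path


def pysr_julia_env(pysr_version, depot=None):
    """Return (--project flag, Manifest.toml path) for a given PySR version.

    Assumes environments are named `pysr-<version>` under
    `<depot>/environments`, as described in this thread.
    """
    depot = Path(depot) if depot is not None else Path.home() / ".julia"
    env_name = f"pysr-{pysr_version}"
    flag = f"--project=@{env_name}"
    manifest = depot / "environments" / env_name / "Manifest.toml"
    return flag, manifest


flag, manifest = pysr_julia_env("0.14.1", depot="/home/user/.julia")
print(flag)  # --project=@pysr-0.14.1
print(manifest.as_posix())
```

The returned flag is what would go after `julia` on the command line, and the manifest path is the file to inspect for the ClusterManagers entry.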

@nathaliesoy
Author

Okay, that part seems to work now, thanks!
Now the issue is that the submission can't connect to the server (errno=15010). It seems like a permissions problem. Should I take it up with our system administrator?

@MilesCranmer
Owner

MilesCranmer commented Aug 30, 2023

Hm, yeah the sysadmin might know best for that type of issue. How are you running things?

You could also try running a parallel Julia command manually, just to see if it gives a more helpful error message.

First, create an interactive job on the cluster that you can ssh into. Ssh into it and start Julia with: julia --project=@pysr-<version>, where <version> is your installed PySR version. Then, execute the following (copy-paste):

import Distributed: pmap
import ClusterManagers: addprocs_pbs

num_workers = 10

# Create the workers:
procs = addprocs_pbs(num_workers)

# Run a computation on each worker:
pmap(worker_id -> worker_id^2, procs)

It should return a vector like [4, 9, 16, ...] if successful (the square of each worker ID), with the computations distributed across the workers in the PBS allocation.
