Highlight lack of ongoing support on README #197

Closed
wants to merge 2 commits into from

MilesCranmer
Collaborator

Since many of the cluster managers do not actually work (#196 #189 #185 #179 #163 ...), I think it is very important to state this up front in the README. Currently, the README explicitly states that all of these cluster managers are "currently supported", which leads users like me to spend an enormous amount of time trying to get one working, only to find out it is not actually maintained or tested.

I sincerely empathize that it is very hard to find an active maintainer for all of these cluster managers. So I think it is just much better to be up front about this lack of ongoing support so that users don't get stuck.

@@ -1,8 +1,17 @@
# ClusterManagers

Support for different job queue systems commonly used on compute clusters.
> [!WARNING]
Collaborator

Might be a good idea to grab a repostatus.org badge. "Unsupported" or "Suspended" seem like likely fits?

Collaborator Author

I wonder if we could have it at a finer level, like "Unsupported" on the individual cluster managers? From the Discourse thread, it seems that certain managers work well.

Collaborator

FWIW, I use the LSF manager and try to help maintain it.

Collaborator Author

Indeed, it sounds like LSF is known to work well. Any others? We could just have a column indicating whether each manager works or not.

Collaborator

Condor works, under some assumptions about a common filesystem being mounted.

Collaborator

Could it be worth also adding a clause explaining that most cluster managers are not guaranteed to work out of the box? I need to supply quite a few extra program flags to the LSF manager, and I also need a separate hack to find an open port for the workers for it to work on my system. I don't consider it a bug or deficit that ClusterManagers can't find a set of working arguments for me automatically.

For the maintenance status question:
Would it be useful to indicate the level of support through a column with maintainer names (with something like "maintainer needed" indicating that a manager is not well maintained)?

Collaborator Author

@DrChainsaw I think both of those ideas are really good.

Collaborator

Where does this stand? Did you want to add that stuff to the table in the README?

Collaborator Author

Sounds good to me. Sorry I haven't adjusted the PR yet; I've just been busy with teaching.

Collaborator

All good! There's no real urgency, just want to make sure I'm not the one holding it up 😉

@DilumAluthge
Member

I do think we need some kind of warning in the README.

I'm not sure if "looking for a maintainer" is the right way to describe this package. I think the fundamental problem here is that it's not really possible to have a single maintainer for this entire package, because no one person works with all of these cluster schedulers on a day-to-day basis. I think the best long-term solution is #58, i.e. to have multiple different packages, with a different package for each scheduler. Then, different packages can have different maintainers, and the README in this repo will just link to each of those packages.

For example, @DrChainsaw mentioned that they use the LSF manager. So we should probably spin the LSF manager out into a separate repo, and then (if they are willing to be the maintainer) we can make @DrChainsaw the maintainer of that package. And then that package will be able to focus solely on LSF, and won't need to worry about other schedulers.

@DilumAluthge
Member

@DrChainsaw Would you be interested in being the maintainer of the LSF functionality? If so, I will create a new JuliaParallel/LSFClusterManagers.jl package, and I'll make you the maintainer of that repo. I can use git-filter-repo to copy over the contents from this repo, but only include the LSF-related files.

@MilesCranmer
Collaborator Author

SGTM

@DilumAluthge DilumAluthge deleted the MilesCranmer-patch-1 branch January 2, 2025 06:54
@DrChainsaw
Collaborator

> @DrChainsaw Would you be interested in being the maintainer of the LSF functionality? If so, I will create a new JuliaParallel/LSFClusterManagers.jl package, and I'll make you the maintainer of that repo. I can use git-filter-repo to copy over the contents from this repo, but only include the LSF-related files.

Sure, I can do that. @bjarthur is the 'senior' LSF maintainer, so I guess he should also be added.

@DilumAluthge
Member

Awesome. I will set up the repo this weekend, and I'll add both you and @bjarthur as maintainers.

@DilumAluthge
Member

Okay, I have done the following tasks.

  1. I've created the new repo here: https://github.com/JuliaParallel/LSFClusterManager.jl
  2. I've invited @DrChainsaw and @bjarthur to the repo.
  3. I've removed all non-LSF code, tests, and docs from the repo.

5 participants