Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update scrip_remap_conservative to use MPI #1268

Open
wants to merge 12 commits into
base: develop
Choose a base branch
from

Conversation

jcwarner-usgs
Copy link

Pull Request Summary

When using nesting, the SCRIP routines are called to compute remapping weights. But each processor computes all the weights. Here we updated SCRIP/scrip_remap_conservative.F to use all the MPI processors to compute separate portions of the remap weights, and then distribute a full set of weights to all the nodes.

Description

The results should be identical with or without this update.
Reviewer could be: Erick Rogers, NRL

Issue(s) addressed

Related to Discussion #1252

Commit Message

Update scrip_remap_conservative to be more efficient on first use.

Check list

Testing

  • How were these changes tested? I ran several personal tests on local HPC, nothing too extensive.
  • Are the changes covered by regression tests? (If not, why? Do new tests need to be added?)
  • Have the matrix regression tests been run (if yes, please note HPC and compiler)?
  • Please indicate the expected changes in the regression test output, (Note the list of known non-identical tests.)
  • Please provide the summary output of matrix.comp (matrix.Diff.txt, matrixCompFull.txt and matrixCompSummary.txt):

@JessicaMeixner-NOAA
Copy link
Collaborator

@ErickRogers pinging you as you were suggested as a possible reviewer.

Copy link
Collaborator

@ErickRogers ErickRogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks safe to me, since it's put behind a switch. I'm not proficient with MPI programming, so I can't comment on that, but I expect that it is good, since John says it works. Let me know if you want an MPI-fluent reviewer.
NB: We have some rudimentary MPI in WMGHGH. However, it splits things up like "one process per remapping", e.g., if you have 2 grid-pairs, it only splits into 2 processes. So this "within grid" parallelism should be significantly faster.

@MatthewMasarik-NOAA
Copy link
Collaborator

Thanks for submitting @jcwarner-usgs. I'll start reviewing.

@MatthewMasarik-NOAA
Copy link
Collaborator

This looks safe to me, since it's put behind a switch. I'm not proficient with MPI programming, so I can't comment on that, but I expect that it is good, since John says it works. Let me know if you want an MPI-fluent reviewer. NB: We have some rudimentary MPI in WMGHGH. However, it splits things up like "one process per remapping", e.g., if you have 2 grid-pairs, it only splits into 2 processes. So this "within grid" parallelism should be significantly faster.

Thanks for reviewing @ErickRogers.

@jcwarner-usgs
Copy link
Author

I think the MPI could be re-written to be more WW3 centric. And my switches in there are not active.
If you want me to spend more time on this i can. I was not sure if WW3 people would re-write the code or not.

@MatthewMasarik-NOAA
Copy link
Collaborator

MatthewMasarik-NOAA commented Jul 10, 2024

I think the MPI could be re-written to be more WW3 centric. And my switches in there are not active. If you want me to spend more time on this i can. I was not sure if WW3 people would re-write the code or not.

Hi @jcwarner-usgs, thanks for this follow up. I had two thoughts on the switches you mentioned. Currently they are W3_SCRIP_JCW. This should be modified to remove your initials from the end, maybe W3_SCRIP_MPI or something similar? We should also have the switches be active and set a default value (I believe if MPI isn't enabled then we wouldn't want to enter those code blocks, so maybe we should have a default value of .false.?).

Regarding re-writing to be more WW3 centric. Being as WW3 centric as possible is obviously best for the code base, though I can say in my initial review nothing jumped out as being problematic in that regard. It seemed clean and well commented which is great. We do have a Code Style Standards page though that may help out here. I would currently just make the suggestion of renaming some of these My* variables here

https://github.com/jcwarner-usgs/WW3/blob/0baa3df08da8bfeb8be1fe1c13755e1a9938c46c/model/src/SCRIP/scrip_remap_conservative.f#L269-L270

to something more descriptive if possible.

@ErickRogers
Copy link
Collaborator

@jcwarner-usgs I'm not sure I understand what you mean re: WW3 centric. What was your motivation for adding this feature? For you to use in your WW3 applications? Or some other applications?

@jcwarner-usgs
Copy link
Author

oh. I just meant like I used:

  call mpi_allreduce(Asend, Arecv, grid1_size, MPI_DOUBLE,                 &
 &                   MPI_SUM, MyComm, MyError)

but in other parts of the WW3 the mpi calls look like:

        CALL MPI_ALLREDUCE(SND_BUFF(1:NSEA), &
             RCV_BUFF(1:NSEA), &
             NSEA,     &
             MPI_REAL, &
             MPI_SUM,  &
             ID_LCOMM, &
             IL_ERR)

should i go thru and make the variables all capitals?
should i use SND_BUF and RCV_BUF instead of Asend and Arecv?
The My Start (Mystr), My end (Myend), My Rank, My COMM etc are from ROMS.
I can give it a clean up if you want.

@ErickRogers
Copy link
Collaborator

@jcwarner-usgs OK, I see. Matt and Jessica may disagree, but from my point of view, it is fine to use the conventions of the original file, rather than the conventions of WW3. I think of the stuff in the /SCRIP/ subdirectory as not strictly part of the WW3 code proper, but rather as a semi-external library that is adapted (with as light a touch as possible) to work with WW3, to be compiled into WW3. We maintain it in the same version control with WW3 because maintaining it separately would just result in extra work. (2 cents)

@ErickRogers
Copy link
Collaborator

All that aside, renaming the My* variables wouldn't hurt, for clarity purposes.

@JessicaMeixner-NOAA
Copy link
Collaborator

JessicaMeixner-NOAA commented Jul 11, 2024

I agree that keeping styles matching to what is in the routine is okay. It's more of the using your initials or "my" variable names versus just making them generic/descriptive to what they are.

That's really great to hear from @aliabdolali that it sped things up, we'll have to see if that can help us in some set-ups here. I don't see any regression tests that are using these updates.

@jcwarner-usgs Can you tell me a little more about your tests that you've run on your end so we can help advise how to best put in a test for the regression tests so we ensure this feature continues to work? Also, did you obtain identical results with and without this update?

@JessicaMeixner-NOAA
Copy link
Collaborator

@jcwarner-usgs Can you tell me a little more about your tests that you've run on your end so we can help advise how to best put in a test for the regression tests so we ensure this feature continues to work? Also, did you obtain identical results with and without this update?

@jcwarner-usgs - I realized I wasn't clear that this question was to you so I updated my comment and am pinging you on this.

Copy link
Collaborator

@MatthewMasarik-NOAA MatthewMasarik-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcwarner-usgs I wanted to touch base following the discussion on code style. I only have one request here, that is to rename the switch W3_SCRIP_JCW to something without your initials (maybe W3_SCRIP_MPI or other).

Also, I wanted to mention that a couple Github CI jobs failed. I'm re-running these manually now to see if they will resolve themselves.

@jcwarner-usgs
Copy link
Author

Thanks for checking back.
-I did make a small change today, for the size of the remap weights in the scrip_remap_conservative.f
-I will change the switch to something else. You already have SCRIP and SCRIPNC. So how about SCRIPMPI?

  • I am running some more tests here. We have a 4 nested set of grids that are challenging to get going.

@MatthewMasarik-NOAA
Copy link
Collaborator

-I will change the switch to something else. You already have SCRIP and SCRIPNC. So how about SCRIPMPI?

SCRIPMPI is fine with me.

@MatthewMasarik-NOAA
Copy link
Collaborator

Hi @jcwarner-usgs, just a note to say we are taking a short pause on our PR processing do to some temporary changes for our group. I just posted our group on the Commit Queue: Temporary PR Policy. Look forward to picking back up after our short break.

@jcwarner-usgs
Copy link
Author

Getting back onto this pull request now.
I have just updated 3 routines in my code:

  1. model/src/cmake/src_list.cmake
    changed
    ${CMAKE_CURRENT_SOURCE_DIR}/SCRIP/scrip_remap_conservative.f
    to
    ${CMAKE_CURRENT_SOURCE_DIR}/SCRIP/scrip_remap_conservative.F
    (notice the capital F extension)

  2. model/src/cmake/switches.json
    added:
    {
    "name": "scripmpi",
    "num_switches": "upto2",
    "description": "",
    "valid-options": [
    {
    "name": "SCRIPMPI"
    }
    ]
    },

  3. model/src/SCRIP/scrip_remap_conservative.F

  • changed the file extension from .f to .F This allows cpp to process this file.
  • put all my changes within #define SCRIPMPI blocks. I think we should have 3 options of SCRIP, SCRIPNC, and SCRIPMPI.
  • made use of MPI_COMM_WAVE, IAPROC, and NTPROC from w3adatmd and w3odatmd.
  • Removed
    Myrank, MyError, MyComm, MyStr and Myend and replaced with
    IAPROC, IERR_MPI, MPI_COMM_WAV, grid1/2_str, grid1/2_end
  • cleaned up the mpi send/recv calls.
  1. I tested these changes with a simple case we use of Hurricane Sandy with a nested grid. I ran the case with the original version of scrip_remap_conservative.f and the modified scrip_remap_conservative.F. For each case a new version of rmp_src_to_dst_conserv_002_001.nc was created. I compared the contents of these files by comparisons between src_address, dst_address, dst_grid_frac, and remap_matrix. The results were identical, so the scrip files being generated are the same. I also, looked at results of hs and dtp and they were identical. Happy to do any tests that you have.

Can you try this merge again?

@MatthewMasarik-NOAA
Copy link
Collaborator

Hi @jcwarner-usgs, thank you for addressing the previous comments with these updates, and including the description of the changes. Much appreciated.

A few weeks ago we posted a temporary PR Policy. It is in response to temporary resource reductions, where we don't have the manpower to review PRs, and so are limiting what we take on to just what is needed for the next GFS/GEFS releases being worked on. This is just till mid-November. I really apologize because I know you submitted this prior to that. We will pick this back up from our end just as soon as our temporary status ends sometime in November. We'll look forward to completing the review then. Thanks for understanding.

@jcwarner-usgs
Copy link
Author

All good! no worries.
I had time to make it cleaner and so i worked on it. We will keep using it and ensure it is working.
thanks
-j

@MatthewMasarik-NOAA
Copy link
Collaborator

All good! no worries. I had time to make it cleaner and so i worked on it. We will keep using it and ensure it is working. thanks -j

Awesome. thanks so much for understanding. having some more time to ensure it's working is a side benefit!

@MatthewMasarik-NOAA
Copy link
Collaborator

@jcwarner-usgs just to let you know we are back from our temporary PR policy, and I've started working on the review for this. I'll be back in touch again next week.

* mtm/scrip/mpi:
  fix: OMP do statement
  mv grid1/2 str/end integers outside of scripmpi
  update scrip_remap_conservative.F with SCRIPMPI, switches.json, and src_list.cmake
  update scrip_remap_conservative.F with SCRIPMPI, switches.json, and src_list.cmake
  update scrip_remap_conservative.F with SCRIPMPI, switches.json, and src_list.cmake
  change resize in scrip conserv
  update scrip remap conserv
@MatthewMasarik-NOAA
Copy link
Collaborator

Update
The review for this is underway, with some related OMP needed to be resolved. PR #1323 was submitted to resolve a typo in the ww3_ufs1.x regtests. Another small fix PR to this branch will be made early next week, then this PR should pass regtests b4b.

@MatthewMasarik-NOAA
Copy link
Collaborator

Hi @jcwarner-usgs. I'm about to submit a PR to this PR, which makes a slight fix to a couple of OMP sections. It will require you to review what I've submitted, then merge it if you are OK with it. After that I'll finish the review on this main PR. The PR I will submit only makes two re-ordering edits to the scrip_remap_conservative.F file, though currently you'll see changes in a few other files due to new commits, or automatic whitespace removal from the text editor.

To remove these extra changes before you merge you can sync your branch, or merging in the PR will bring those in with the PR. Since we will Squash Merge the main PR, I think either is fine. I'm happy to answer any questions on doing the merge if you like.

@MatthewMasarik-NOAA
Copy link
Collaborator

@jcwarner-usgs the PR has been submitted, jcwarner-usgs#1.

Fix OMP for scrip_remap_conservative MPI update
@MatthewMasarik-NOAA
Copy link
Collaborator

A quick update: the regtests all passed so everything looks good. I do want to do a few speed comparisons just for my own curiosity before officially approving it. I plan to be able to finish that and wrap the PR up later today or early tomorrow.

@MatthewMasarik-NOAA
Copy link
Collaborator

@jcwarner-usgs I ran into some build issues when trying to do the performance testing. I was doing this by adding 'SCRIPMPI' to a few existing regtests, and from my initial look at error log messages, I think there is something small that needs to be resolved with the calls to run_cmake_test and the cmake switch handling. I'm currently looking into it, though it might take me a day or two to report back. There shouldn't be anything more needed from you, except possibly a small sub-PR merge if I need to make an edit to the build system and/or regtest scripts.

@jcwarner-usgs
Copy link
Author

make sure in the src_list.cmake you have
${CMAKE_CURRENT_SOURCE_DIR}/SCRIP/scrip_remap_conservative.F
need to use .F. The capital F is an extension change for that file, and is needed to make sure the cpp runs that file.
is that what might be missing?

@MatthewMasarik-NOAA
Copy link
Collaborator

MatthewMasarik-NOAA commented Dec 11, 2024

That's a good thing to check. I do have the .F included in my testcase. For the testcase, I'm running a clone of the branch with edits only to the regtest switch files. I believe the issue is related to the switch file processing during a regtest call, which builds the executables at different stages which may or may not be built with MPI.

@MatthewMasarik-NOAA
Copy link
Collaborator

MatthewMasarik-NOAA commented Dec 13, 2024

@jcwarner-usgs I'm still working on how to resolve the regtest build issue, but I wanted to give an update before the weekend. The issue seems to be how our regtests have been structured, in particular where it uses the '-f' option in the call to run_cmake_test. This 'forces' a shared build of pre/post programs, which as set up, conflicts with also using MPI. I've tried simply removing this option in the call, which appears to get pass the build, but then looks to fail in MPI initialization during preprocessing. This makes sense because turning on MPI would then need those programs to be launched with mpiexec/mpirun/srun/etc, and that is not currently built in. So, this is a matter of modifying that portion of the regtest processing to allow for the possibility of compiling with MPI. I need a little more time to work on how that portion should function now potentially using MPI. I'll check in early next week with my progress.

@MatthewMasarik-NOAA
Copy link
Collaborator

MatthewMasarik-NOAA commented Dec 18, 2024

Hi @jcwarner-usgs I met with @JessicaMeixner-NOAA to discuss how to best proceed here. To recap the issue, in our regtest system using '-f' in the call to run_cmake_test forces the building of pre/post -processing programs to be built as Shared and this clashes with also requesting a Distributed build with MPI for those programs. All the existing regtests using SCRIP use the '-f', so when I tried adding SCRIPMPI and executing those tests the build stops because of those conflicting requests.

The plan we came up with to address this has two parts. 1. The regtests should continue to force a shared build when requested, so this will be enforced in run_cmake_test by removing the SCRIPMPI switch from the builds for those programs. 2. A regtest will be added to use the new switch in which a forced build will not be requested. This will be a quick setup made from an existing multi-grid regtest, and will provide a way to test this functionality, and a template for others who want to use the new switch.

There still shouldn't be anything needed from you except for a merge once I make a sub-PR to your branch for these changes. A timeline for when the sub-PR might be ready is next week after Christmas, though there may be some extra time needed due to the holidays. Let me know if you have any questions.

@MatthewMasarik-NOAA
Copy link
Collaborator

Quick update: I've made some progress on (2) above, and am continuing to try to work through (1). For that I'm hitting issues with how the switch file is manipulated by run_cmake_test. Work is ongoing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants