Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

W3SRCE refactor [draft] #1123

Draft
wants to merge 112 commits into
base: develop
Choose a base branch
from

Conversation

ukmo-ccbunney
Copy link
Collaborator

@ukmo-ccbunney ukmo-ccbunney commented Nov 10, 2023

Pull Request Summary

Major refactor of W3SRCE subroutine to calculate source terms for arrays of spectra rather than a single spectrum.
Intended to facilitate acceleration on GPU architectures. See presentation in discussion #736 for more details.

Description

The purpose of this change is to address the poor performance of the WW3 source term calculations when running on a GPU. The current implementation of the source term routines processes a single spectrum at a time, which unfortunately does not expose sufficiently large arrays/loops to fully subscribe the large number of parallel threads available on a GPU resulting in poor GPU performance.

The proposed solution to this is to refactor the source term packages to process arrays of spectra, rather than a single spectrum. This provides a large independent seapoint dimension to fully utilise the available GPU threads.

The first step in achieving this is to refactor the W3SRCE routine to accept arrays with a seapoint dimension, rather than values for a single seapoint. This moves the NSEA loop one level down from W3WAVE into W3SRCE. This will allow for the second stage of this refactoring (which will happen at a later date) which will allow individual source term routines to be parallelised over multiple seapoints. However, for this change, source term subroutine calls will be wrapped in a seapoint loop to maintain their current functionality.

This first set of changes does not add any GPU acceleration - it is merely paving the way to facilitate GPU acceleration in the next phase. There should be no change to the model outputs and the performance impact on the CPU should be minimal.

There are a few points of consideration to this change:

  1. It does add some extra complexity to the code (w3srce is already quite large and complex)
  2. New temporary arrays are needed to gather variables on the full grid (NSEA) to contiguous arrays on the local grid (NSEAL).
  3. The change in the looping structure could affect B4B reproducibility (when compared against original code) with some compilers when using higher optimization levels. I have found the Cray compiler to be particularly prone to this.

Summary of important changes:

  • W3WAVE now passes full NSEA(L) dimensioned arrays to W3SRCE`
  • Loop over seapoints NSEA(L) now done in W3SRCE
  • W3SRCE now loops over tiles are seapoints
    • To keep source term subroutines working as before, these are wrapped in sepoint loops
    • As source term subroutines are accelerated on GPU, it is expected they will be refactored to receive arrays of spectra.
    • Array tiling implemented to mitigate impact on CPU performance (though cache misses) and maintain current functionality of processing 1 seapoint at a time on the CPU. Can be set to larger tiles for GPU parallelisation.
    • New TILE sized arrays for mapping of local seapoints to full grid seapoints. (JSEA -> ISEA).
  • Various variables no longer required, or were not being used:
    • VAOLD/VAoldDummy not used in W3SRCE; removed. VAOLD used by PDLIB, but not in w3srce.
    • DELX, DELY and DELA not used. Removed.
    • REFLEC and REFLECD scalars removed - passed in REFL[CD] arrays explicitly instead. Calculation now done in w3srce
    • D50 and PSIC scalars removed. Passing whole arrays.
    • TMP[1234] scalars removed and passed in arrays direct.
    • VSIO/VDIO/SHAVEIO now optional arguments (only for PDLIB)
    • LSLOC test on VSIO/VDIO now done in w3srce.
  • A new loop added to end of main chunk loop to copy local grid variables back to full grid (e.g. USTAR)
  • The following 2D variables have been split into multiple 1D arrays to avoid temporary arrays being created when tiling:
    • WHITECAP
    • TAUICE
    • TAUBBL
    • BEDFORMS

Commit Message

Refactor of W3SRCE to allow source term for arrays of seapoints to be calculated rather than a single seapoint.
To facilitate later acceleration of source terms on GPU architecture.

Check list

Testing

  • How were these changes tested? Regression tests.
  • Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Yes.
  • Have the matrix regression tests been run (if yes, please note HPC and compiler)? Yes; Cray XC; Intel compiler.
  • Please indicate the expected changes in the regression test output, (Note the list of known non-identical tests.)
  • Please provide the summary output of matrix.comp (matrix.Diff.txt, matrixCompFull.txt and matrixCompSummary.txt):

Some code moved around to group together more logically.
Wasted most of a day finding them. :(
NOT COMPLETE! CURRENTLY EXITS AFTER FIRST W3SPR4 call
Made change to allow first W3SPR4 call (when IT=0) to run using all
arrays:
  - Updates some local variables with CHUNKSIZE dimension.
  - Some input parameters to W3SRCE now have extra NSEA[L[M]] dimension
  - Some chunk element loops (CSEA) added to blocks of code
  - W3SRCE only called once, rather than in seapoint loop
Chunk dimensions added to VSIN, VDIN and associated variables
Explicit loop added around calls to W3SINx routines in integration loop
Changes are currently B4B with develop (ST4 compile)
  - variables were being zeroed in chunk loop
  - claculations where being skipped as SRC_MASK was all true
Note: Had to recalculate the source mask after integration loop; I need
to find a better way of doing this.
Commented out some unusued vars
…after non-b4b results with non-homeogenous depth).

Also updated PHICE.
  - Required deprecation of REFLEC and REFLED locals in w3srce
  - Calculation of `REFLEC(4) * BERG` moved into w3srce
  - Needs #ifdefs in call to w3srce as REFLC and REFLD are not allocated
	if W3_REF1 not set
  - Had to make IX and IY chunk arrays in w3srce
@ukmo-ccbunney
Copy link
Collaborator Author

@JessicaMeixner-NOAA and @MatthewMasarik-NOAA
This first stage refactor of W3SRCE is mostly complete. I am marking this PR as draft at the moment, but I would appreciate it if someone could run the regression test matrix on their machine and let me know the results. It would be really useful if you could run it on one of your larger operational configs too and let me know if you see any performance hit on the CPU.

Please note, I have NOT run the OMP/OMPH tests as the original OMP acceleration for the source terms has been lost (it was in the W3WAVE loop) and I have no re-implemented it yet in W3SRCE.

For me, almost everything is B4B apart from the following:

ww3_tp2.19/./work_1B_a                     (9 files differ)
ww3_tp2.19/./work_1A_a                     (9 files differ)
ww3_tp2.19/./work_1C_a                     (9 files differ)
ww3_ts1/./work_ST4_T700                     (2 files differ)
ww3_ts1/./work_T713GQM                     (2 files differ)
ww3_ts1/./work_T707GQM                     (2 files differ)

I think the WW3_TS1 tests will be B4B once issue #1085 is resolved (Mickael has added it to his upcoming ST4 PR).

There are a few things left for me to do before this one is fully ready for testing, most importantly sorting out the OMP stuff. I need to do some more performance analysis on the CPU too.

I am also aware that @mickaelaccensi's upcoming ST4 PR will likely impact these changes, so some time will be needed to sort out any incoming conflicts anyhow.

Getting some early feedback would be good though just to catch anything that my tests might have missed. Thanks!

@ukmo-ccbunney ukmo-ccbunney marked this pull request as draft November 10, 2023 12:59
@ukmo-ccbunney ukmo-ccbunney added the enhancement New feature or request label Nov 10, 2023
@ukmo-ccbunney ukmo-ccbunney self-assigned this Nov 10, 2023
@MatthewMasarik-NOAA
Copy link
Collaborator

@ukmo-ccbunney, thank you, sure thing! I'll get the regression tests going now.

@MatthewMasarik-NOAA
Copy link
Collaborator

@ukmo-ccbunney, I just got the test results, so wanted to share some quick feedback. Below I've organized into three groups: the known non-b4b's, then the non-b4b's you found, and lastly the remaining unexpected (ww3_tp2.6/work_pdlib, ww3_tp2.17/work_mc1,mc,c, ww3_ufs1.1/work_unstr_c). The last group of unexpected tests are all unstructured cases. Tomorrow, I'll look into collecting some performance data.

**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
## known non-b4b
## -------------
mww3_test_03/./work_PR1_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (16 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (11 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (10 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (13 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (17 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2                     (14 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (16 files differ)
mww3_test_09/./work_MPI_ASCII                     (0 files differ)
ww3_tp2.6/./work_ST4_ASCII                     (0 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (6 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_ufs1.3/./work_a                     (3 files differ)

## Tests you found not b4b
## -----------------------
ww3_tp2.19/./work_1B_a                     (9 files differ)
ww3_tp2.19/./work_1A_a                     (9 files differ)
ww3_tp2.19/./work_1C_a                     (9 files differ)
ww3_ts1/./work_ST4_T700                     (2 files differ)
ww3_ts1/./work_T707GQM                     (2 files differ)
ww3_ts1/./work_T713GQM                     (2 files differ)

## unexpected
## ----------
ww3_tp2.6/./work_pdlib                     (13 files differ)
ww3_tp2.17/./work_mc1                     (6 files differ)
ww3_tp2.17/./work_mc                     (6 files differ)
ww3_tp2.17/./work_c                     (10 files differ)
ww3_ufs1.1/./work_unstr_c                     (6 files differ)

**********************************************************************
************************ identical cases *****************************
**********************************************************************

2023-11-13.matrixCompSummary.txt
2023-11-13.matrixCompFull.txt
(ps, the matrixDiff.txt was too large to post, but let me know if you'd like me to share it offline)

@ukmo-ccbunney
Copy link
Collaborator Author

That's great - thanks for dong that @MatthewMasarik-NOAA .
Your results are broadly consistent with mine - the PDLIB tests often give different results for me (I've not got to the bottom of that yet, but I am hoping one of Aron's PRs might solve that). I've also confirmed that that (for me at least), the WW3_TS1 tests are B4B if I add the upcoming WHITECAP initialisation fix to develop.

@MatthewMasarik-NOAA
Copy link
Collaborator

That's great - thanks for dong that @MatthewMasarik-NOAA . Your results are broadly consistent with mine - the PDLIB tests often give different results for me (I've not got to the bottom of that yet, but I am hoping one of Aron's PRs might solve that). I've also confirmed that that (for me at least), the WW3_TS1 tests are B4B if I add the upcoming WHITECAP initialisation fix to develop.

Sure thing, you're welcome @ukmo-ccbunney. That's good to know about the ww3_ts1 results. Can I ask how you tested the ww3_ts1 initialization fix? Separately, I got some non-b4b's running the matrix on #1124 code so I was curious if you pulled that in to test.

@ukmo-ccbunney
Copy link
Collaborator Author

ukmo-ccbunney commented Nov 14, 2023

Can I ask how you tested the ww3_ts1 initialization fix?

I manually added the WHITECAP initialisation to my checked out version of develop
Essentially updating line 2556 in w3srcemd to be WHITECAP(1:4) = 0.0
I didn't pull in any other changes from #1124 .

@MatthewMasarik-NOAA
Copy link
Collaborator

Okay, I see. thanks for explaining

@JessicaMeixner-NOAA
Copy link
Collaborator

@ukmo-ccbunney I ran the regression tests and got these as the diffs:

**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR1_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (8 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (12 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (14 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (17 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2                     (15 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (12 files differ)
mww3_test_09/./work_MPI_ASCII                     (0 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (7 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_tp2.17/./work_mc1                     (6 files differ)
ww3_tp2.17/./work_mc                     (6 files differ)
ww3_tp2.17/./work_c                     (10 files differ)
ww3_tp2.19/./work_1B_a                     (9 files differ)
ww3_tp2.19/./work_1A_a                     (9 files differ)
ww3_tp2.19/./work_1C_a                     (9 files differ)
ww3_tp2.6/./work_pdlib                     (13 files differ)
ww3_tp2.6/./work_ST4_ASCII                     (0 files differ)
ww3_ufs1.1/./work_unstr_c                     (6 files differ)
ww3_ufs1.3/./work_a                     (3 files differ)

So I also am getting that the unstructured grid cases are having minor differences, which I believe you are as well. Was this related to an initalization bug that never got properly initialized: #1017

…ce_refactor

* upstream/develop:
  Addition of Regression Test (ww3_tic1.1/IC4_M10) (NOAA-EMC#1331)
  IC4M10: New wave damping scheme in sea ice (NOAA-EMC#1293)
  Fixing uninitialized issues within the implicit scheme (NOAA-EMC#1142)
  ww3_ufs1.x: fix typo in switch_MPI_OMPH (NOAA-EMC#1323)
  README.md: update with link to doxygen documentation (NOAA-EMC#1316)
@ukmo-ccbunney
Copy link
Collaborator Author

@ukmo-ccbunney I ran the regression tests and got these as the diffs:

So I also am getting that the unstructured grid cases are having minor differences, which I believe you are as well. Was this related to an initalization bug that never got properly initialized: #1017

I am hoping it is related to the initialization bug. I've got some time to work on this again now - I'm merging in the changes and will rerun the regression tests.

@JessicaMeixner-NOAA
Copy link
Collaborator

When you have things merged in - let me know and I'm happy to run tests on my end as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants