
Memory "management" issue with intel #1322

Open · guillaumevernieres opened this issue Oct 11, 2024 · 23 comments

@guillaumevernieres (Contributor)

The soca variational application takes an insane amount of memory on Hercules and Gaea (~8 TB for the simple 3DVAR); both use Intel 2021.9.0. The same application on Hera requires ~0.8 TB of memory, and the Intel compiler version on Hera is 2021.5.0.

I have no idea if the compiler is the issue.
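
For anyone trying to narrow down where the memory goes, here is a minimal sketch (not part of soca/oops; the helper name printPeakRSS is made up) of sampling each MPI task's peak resident set size from /proc/self/status on Linux, independent of the OOPS_STATS summary printed at the end of the run:

```cpp
// Sketch only: report each task's peak resident set size (VmHWM) so runs on
// Hera vs Hercules/Gaea can be compared at the same points in the execution.
#include <mpi.h>
#include <cstdio>
#include <fstream>
#include <string>

// Peak resident set size of this process in kB, or -1 if /proc is unavailable.
long peakRSSkB() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmHWM:", 0) == 0) return std::stol(line.substr(6));
  }
  return -1;
}

void printPeakRSS(const char* tag) {
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  std::printf("rank %d [%s] peak RSS: %.2f GB\n", rank, tag,
              peakRSSkB() / (1024.0 * 1024.0));
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  printPeakRSS("after init");
  // ... the application work under investigation would go here ...
  printPeakRSS("before finalize");
  MPI_Finalize();
  return 0;
}
```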

@guillaumevernieres (Contributor, Author)

I'm labeling this as soca, but I wonder if it's an issue for the fv3-jedi application as well. @RussTreadon-NOAA, @CoryMartin-NOAA, or others, have you tried running the variational application on Hercules lately?

@travissluka

fyi @fmahebert

@guillaumevernieres (Contributor, Author)

[Figure: gnu-vs-intel comparison]

path to logs:

/work2/noaa/da/gvernier/runs/profiling-3dvar

@fmahebert

Yes, we've seen the Intel compiler produce executables that take up way more memory for a little while now. We haven't been able to pinpoint the cause yet. Thanks for opening an issue to track this and for sharing your measurements.

Related issue (presumably):

@jswhit commented Dec 12, 2024

@fmahebert has there been any progress on reducing the memory footprint with Intel? This has been blocking us for months now; we can't run even a low-resolution coupled DA experiment.

@fmahebert

@jswhit There's been no progress towards understanding this issue from the JCSDA core team, largely for lack of resources (not a lack of concern).

@dkokron commented Dec 30, 2024

If you set me up with a small (<= 32 nodes) reproducer on wcoss2, then I can take a look and work with Intel on a solution. I have a complete build of the global-workflow on dogwood.
dogwood:/lfs/h2/hpc/support/daniel.kokron/Projects/GlobalWorkflow/global-workflow

@jswhit commented Jan 2, 2025

Would using gcc for GDASapp (instead of Intel) be a potential workaround for this in the short term?

@shlyaeva (Collaborator) commented Jan 2, 2025

@jswhit yes, certain compiler/platform combinations certainly work significantly better. E.g. Intel + Hera doesn't have this issue, as Guillaume pointed out. Using GNU on Orion/Hercules is another option; I think Guillaume had a lot more success with that than with Intel.
Bo did a great summary of the gnu-vs-intel results on Hera and Orion in https://github.com/JCSDA-internal/fv3-jedi/issues/1256. It profiles fv3-jedi + LETKF, but I think it's representative of runs with soca, and of Var runs as well.

@jswhit commented Jan 3, 2025

@shlyaeva I'm mainly interested in running on gaea, but I think the issue there is the same as on hercules/orion (they all use a newer intel compiler than hera).

@dkokron commented Jan 8, 2025

Is there any interest in having me work this issue with Intel? I'll need a reproducer on WCOSS2.

@guillaumevernieres (Contributor, Author)

> Is there any interest in having me work this issue with Intel? I'll need a reproducer on WCOSS2.

It only seems to happen with a specific Intel compiler version, @dkokron (2021.9.0). I don't think that version is available on WCOSS2.

@guillaumevernieres (Contributor, Author) commented Jan 10, 2025

3DVAR timing and memory footprint

Not a real solution, but a reasonable workaround is to compute the background error at lower resolution.
These are the stats that I get for the 3DVAR with 100 inner iterations, using the 1/2-resolution grid provided by @travissluka:

OOPS_STATS ------------------------------------------------------------------------------------------------------------------
OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 400 MPI tasks) -----------------------------------
OOPS_STATS ------------------------------------------------------------------------------------------------------------------
OOPS_STATS Name                                                :     min (ms)    max (ms)    avg (ms)     % total   imbal (%)
OOPS_STATS oops::Covariance::SABER::Constructor                :      1103.82     1198.85     1146.34        1.00        8.29
OOPS_STATS oops::Covariance::SABER::multiply                   :     58715.02    59243.15    58930.03       51.27        0.90
OOPS_STATS oops::Diffusion::multiplySqrtAD                     :     23431.70    25692.76    24786.82       21.56        9.12
OOPS_STATS oops::Diffusion::multiplySqrtTL                     :     18287.15    19840.62    19109.24       16.62        8.13
OOPS_STATS oops::Geometry::Geometry                            :      2675.98     2749.84     2716.12        2.36        2.72
OOPS_STATS oops::GeometryData::setGlobalTree                   :      8839.24    11841.87    11075.01        9.64       27.11
OOPS_STATS oops::GetValues::GetValues                          :       322.96      630.51      372.30        0.32       82.61
OOPS_STATS oops::GetValues::fillGeoVaLsAD                      :        96.94      218.40      146.52        0.13       82.89
OOPS_STATS oops::GetValues::fillGeoVaLsTL                      :       333.48      727.05      566.53        0.49       69.47
OOPS_STATS oops::GetValues::finalizeAD                         :        50.92      411.44      184.99        0.16      194.88
OOPS_STATS oops::GetValues::processAD                          :       375.12      795.87      434.65        0.38       96.80
OOPS_STATS oops::GetValues::processTL                          :       484.02      910.18      549.37        0.48       77.57
OOPS_STATS oops::Increment::Increment                          :      3428.65     4929.35     4658.44        4.05       32.21
OOPS_STATS oops::Increment::diff                               :      6225.35     6495.68     6443.38        5.61        4.20
OOPS_STATS oops::Increment::fromFieldSet                       :       170.79      300.65      255.70        0.22       50.79
OOPS_STATS oops::Increment::operator+=                         :       118.17      238.08      209.65        0.18       57.19
OOPS_STATS oops::Increment::operator+=(State, Increment)       :       377.02      419.71      411.78        0.36       10.37
OOPS_STATS oops::Increment::operator=                          :      4396.43     5846.15     4677.29        4.07       30.99
OOPS_STATS oops::Increment::read                               :       826.37      988.37      887.24        0.77       18.26
OOPS_STATS oops::Increment::toFieldSet                         :       433.85      628.13      565.44        0.49       34.36
OOPS_STATS oops::Increment::write                              :      4670.61     4762.67     4683.27        4.07        1.97
OOPS_STATS oops::LinearObsOper::adt_rads_all::simulateObsAD    :        43.70      493.48      146.35        0.13      307.34
OOPS_STATS oops::LinearObsOper::adt_rads_all::simulateObsTL    :       127.41      425.35      331.05        0.29       90.00
OOPS_STATS oops::LinearVariableChange::changeVarAD             :      5457.51     6193.37     5905.87        5.14       12.46
OOPS_STATS oops::LinearVariableChange::changeVarTL             :      3758.81     4353.66     3932.29        3.42       15.13
OOPS_STATS oops::LinearVariableChange::changeVarTraj           :      1163.66     1251.76     1209.80        1.05        7.28
OOPS_STATS oops::Model::step                                   :      3149.08     3165.68     3160.71        2.75        0.53
OOPS_STATS oops::ObsSpace::ObsSpace                            :      1262.82     1464.81     1312.78        1.14       15.39
OOPS_STATS oops::ObsSpace::save                                :       873.28     1008.77      981.19        0.85       13.81
OOPS_STATS oops::ObsVector::dot_product                        :       502.90      660.94      606.96        0.53       26.04
OOPS_STATS oops::Parameters::deserialize                       :       190.62      371.58      319.44        0.28       56.65
OOPS_STATS oops::State::State                                  :      8297.06     8555.88     8466.91        7.37        3.06
OOPS_STATS oops::State::print                                  :       125.20      265.05      145.83        0.13       95.90
OOPS_STATS oops::State::read                                   :      3149.02     3165.62     3160.65        2.75        0.53
OOPS_STATS oops::State::write                                  :      6392.60     6446.01     6399.05        5.57        0.83
OOPS_STATS oops::UnstructuredInterpolator::UnstructuredInterpolator:        70.26      462.72      120.90        0.11      324.61
OOPS_STATS oops::UnstructuredInterpolator::apply               :       995.31     1399.08     1104.99        0.96       36.54
OOPS_STATS oops::UnstructuredInterpolator::applyAD             :       256.86      510.30      294.35        0.26       86.10
OOPS_STATS oops::VariableChange::changeVar                     :       793.25      909.37      815.99        0.71       14.23
OOPS_STATS oops::mpi::broadcast                                :       248.01      472.37      414.88        0.36       54.08
OOPS_STATS util::Timers::Total                                 :    114943.10   114955.84   114945.58      100.00        0.01
OOPS_STATS util::Timers::measured                              :    110541.30   111279.83   110949.84       96.52        0.67
OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 400 MPI tasks) -----------------------------------

OOPS_STATS Run end                                  - Runtime:    119.34 sec,  Memory: total:  2055.94 Gb, per task: min =     4.70 Gb, max =     6.21 Gb
Run: Finishing oops::Variational<SOCA, UFO and IODA observations> with status = 0
OOPS Ending   2025-01-10 14:23:19 (UTC-0600)

The memory footprint is still way off what it should be, but that should allow running at least the 3DVAR on the MSU machines. I like the runtime!
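
(Rough arithmetic, not a separate measurement: halving the horizontal resolution of the background-error grid cuts its horizontal point count by about a factor of four, since (Nx/2)·(Ny/2) = NxNy/4, so the fields held by the diffusion-based B should shrink by roughly that factor; that is broadly consistent with the drop from the ~8 TB reported above to the ~2 TB reported here.)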

@shlyaeva (Collaborator)

@guillaumevernieres and @travissluka excellent!

@guillaumevernieres (Contributor, Author) commented Jan 14, 2025

Envar (only ens. B) timing and memory footprint

32 nodes on Hercules, so about 16.4 TB of memory available.

OOPS_STATS ------------------------------------------------------------------------------------------------------------------
OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 384 MPI tasks) -----------------------------------
OOPS_STATS ------------------------------------------------------------------------------------------------------------------
OOPS_STATS Name                                                :     min (ms)    max (ms)    avg (ms)     % total   imbal (%)
OOPS_STATS oops::Covariance::EnsembleCovariance                :     41562.28    41566.72    41565.02       16.97        0.01
OOPS_STATS oops::Covariance::HybridCovariance                  :     41562.38    41566.81    41565.11       16.97        0.01
OOPS_STATS oops::Covariance::ensemble::multiply                :    141330.27   141468.64   141419.71       57.74        0.10
OOPS_STATS oops::Covariance::hybrid::multiply                  :    142527.23   143173.96   142725.83       58.27        0.45
OOPS_STATS oops::Diffusion::multiplySqrtAD                     :     13347.71    26988.30    20699.80        8.45       65.90
OOPS_STATS oops::Diffusion::multiplySqrtTL                     :     11974.23    12411.24    12174.95        4.97        3.59
OOPS_STATS oops::Geometry::Geometry                            :      4513.01     4826.65     4729.41        1.93        6.63
OOPS_STATS oops::GeometryData::setGlobalTree                   :      7778.75     8272.73     7953.50        3.25        6.21
OOPS_STATS oops::GetValues::GetValues                          :       672.11     1345.24     1067.41        0.44       63.06
OOPS_STATS oops::GetValues::fillGeoVaLsAD                      :       182.49      449.04      279.43        0.11       95.39
OOPS_STATS oops::GetValues::fillGeoVaLsTL                      :       551.07     2372.48     2019.65        0.82       90.18
OOPS_STATS oops::GetValues::finalizeAD                         :        90.13      953.59      456.38        0.19      189.20
OOPS_STATS oops::GetValues::processAD                          :      1139.94     3050.05     1362.92        0.56      140.15
OOPS_STATS oops::GetValues::processTL                          :      1383.54     3350.98     1607.84        0.66      122.36
OOPS_STATS oops::Increment::Increment                          :     16213.04    21797.49    17508.36        7.15       31.90
OOPS_STATS oops::Increment::deserialize                        :      1270.86     1463.40     1391.76        0.57       13.83
OOPS_STATS oops::Increment::diff                               :      5036.97     5177.78     5132.75        2.10        2.74
OOPS_STATS oops::Increment::fromFieldSet                       :      4728.63     6766.53     6040.92        2.47       33.73
OOPS_STATS oops::Increment::operator+=                         :      3679.93     4015.73     3807.93        1.55        8.82
OOPS_STATS oops::Increment::operator+=(State, Increment)       :       494.53      557.88      548.78        0.22       11.54
OOPS_STATS oops::Increment::operator=                          :     17725.19    29973.16    23352.51        9.53       52.45
OOPS_STATS oops::Increment::schur_product_with                 :      6944.84     7680.13     7179.69        2.93       10.24
OOPS_STATS oops::Increment::serialize                          :      5458.99     6124.22     5843.42        2.39       11.38
OOPS_STATS oops::Increment::toFieldSet                         :      6835.37     8428.32     7182.96        2.93       22.18
OOPS_STATS oops::Increment::write                              :      5151.49     5238.21     5194.55        2.12        1.67
OOPS_STATS oops::LinearObsOper::adt_rads_all::simulateObsTL    :        58.84      364.90      280.13        0.11      109.26
OOPS_STATS oops::LinearVariableChange::changeVarAD             :      1029.50     2871.49     1676.27        0.68      109.89
OOPS_STATS oops::LinearVariableChange::changeVarTL             :       540.71      722.36      613.74        0.25       29.60
OOPS_STATS oops::Localization::Localization                    :       341.41      361.83      354.49        0.14        5.76
OOPS_STATS oops::Localization::multiply                        :     95456.96   106501.69   101509.03       41.45       10.88
OOPS_STATS oops::Model::step                                   :      4705.01     4729.89     4718.94        1.93        0.53
OOPS_STATS oops::ObsSpace::ObsSpace                            :      2342.38     2389.17     2373.72        0.97        1.97
OOPS_STATS oops::ObsSpace::save                                :      1698.46     1807.57     1773.40        0.72        6.15
OOPS_STATS oops::ObsVector::dot_product                        :      1728.08     2340.35     2140.95        0.87       28.60
OOPS_STATS oops::Parameters::deserialize                       :       480.64      662.44      536.48        0.22       33.89
OOPS_STATS oops::State::State                                  :     47629.94    48087.49    47908.15       19.56        0.96
OOPS_STATS oops::State::print                                  :       216.71      368.33      352.11        0.14       43.06
OOPS_STATS oops::State::read                                   :      4704.97     4729.84     4718.89        1.93        0.53
OOPS_STATS oops::State::write                                  :      6295.20     6372.02     6325.02        2.58        1.21
OOPS_STATS oops::UnstructuredInterpolator::apply               :      1777.95     3953.01     2038.08        0.83      106.72
OOPS_STATS oops::UnstructuredInterpolator::applyAD             :       752.56     2610.84      924.51        0.38      201.00
OOPS_STATS oops::VariableChange::changeVar                     :       736.43      849.89      777.40        0.32       14.60
OOPS_STATS oops::mpi::broadcast                                :      6772.03     7625.82     7277.89        2.97       11.73
OOPS_STATS util::Timers::Total                                 :    244899.57   245088.85   244919.20      100.00        0.08
OOPS_STATS util::Timers::measured                              :    240758.60   243753.44   242169.49       98.88        1.24
OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 384 MPI tasks) -----------------------------------

OOPS_STATS Run end                                  - Runtime:    249.50 sec,  Memory: total: 14252.49 Gb, per task: min =    29.84 Gb, max =    39.28 Gb
Run: Finishing oops::Variational<SOCA, UFO and IODA observations> with status = 0
OOPS Ending   2025-01-14 08:15:24 (UTC-0600)
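
(For reference, the reported total of 14252.49 Gb is roughly 87% of the ~16.4 TB available across the 32 nodes, so this run is close to the memory limit.)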

@shlyaeva (Collaborator)

Interesting that Increment assignment takes so long; that may be something we can look into (maybe there are too many assignments in the ensemble covariance multiply and that can be simplified?). Also, it looks like the diffusion adjoint takes a lot longer than the TL (especially compared to the Var stats), but it could be that memory swapping is warping the statistics.
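
To illustrate the concern (a toy sketch only, not the actual oops::EnsembleCovariance code; all names below are made up): an ensemble B applied increment-at-a-time touches every member on every multiply, so ~100 inner iterations imply on the order of Nens × 100 Increment copies, which is where oops::Increment::operator= time can pile up.

```cpp
// Toy model of an "increment at a time" ensemble covariance multiply.
// Counts copy-assignments, analogous to what the operator= timer measures.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Increment {                       // stand-in for a model-space increment
  std::vector<double> v;
  static long copies;
  explicit Increment(std::size_t n = 0, double x = 0.0) : v(n, x) {}
  Increment(const Increment&) = default;
  Increment& operator=(const Increment& o) { ++copies; v = o.v; return *this; }
  double dot(const Increment& o) const {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i] * o.v[i];
    return s;
  }
  void axpy(double a, const Increment& o) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] += a * o.v[i];
  }
};
long Increment::copies = 0;

// dx_out = (1/(N-1)) * sum_k x_k' <x_k', dx>, with one workspace assignment per
// member (a real hybrid B would also localize each term).
Increment ensembleMultiply(const std::vector<Increment>& perts, const Increment& dx) {
  Increment out(dx.v.size(), 0.0);
  Increment work(dx.v.size(), 0.0);
  for (const Increment& xk : perts) {
    work = xk;                                         // per-member copy
    out.axpy(xk.dot(dx) / (perts.size() - 1.0), work);
  }
  return out;
}

int main() {
  const std::size_t n = 100000;                          // toy state size
  std::vector<Increment> perts(20, Increment(n, 0.01));  // 20 members
  Increment dx(n, 1.0);
  for (int iter = 0; iter < 100; ++iter) {               // ~100 inner iterations
    Increment bdx = ensembleMultiply(perts, dx);
    (void)bdx;
  }
  std::printf("copy-assignments: %ld\n", Increment::copies);  // 20 * 100 = 2000
  return 0;
}
```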

@shlyaeva (Collaborator)

@guillaumevernieres how many Increment::operator= calls are there compared to Covariance::ensemble::multiply calls?

@shlyaeva (Collaborator)

@guillaumevernieres we could also try switching to the saber ensemble covariance (the end result should be the same; it's just a different way of doing ensemble covariances, in saber on atlas fieldsets instead of in oops on model increments) and seeing if fieldset assignments/copies are cheaper than soca increment assignments/copies.

@travissluka

> @guillaumevernieres we could also try switching to the saber ensemble covariance (the end result should be the same; it's just a different way of doing ensemble covariances, in saber on atlas fieldsets instead of in oops on model increments) and seeing if fieldset assignments/copies are cheaper than soca increment assignments/copies.

Yes, saber should be more efficient. There are lingering inefficiencies in some soca state/increment calls where extra conversions to/from atlas are taking place; this is a temporary side effect of moving soca to store everything internally as atlas fieldsets.
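
A small sketch of the effect being described (illustrative only, not soca or atlas code; both function names are invented): copying data that already lives as a set of fields is a single pass, while a copy that round-trips through a packed temporary, the way an extra to/from-atlas conversion would, pays an additional allocation and extra traversals.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

using Field = std::vector<double>;

// "Already a field set": one contiguous copy per field.
void copyDirect(const std::vector<Field>& in, std::vector<Field>& out) {
  out = in;
}

// "Convert, copy, convert back": pack into a temporary, then unpack, mimicking
// an extra representation change on every copy.
void copyViaConversion(const std::vector<Field>& in, std::vector<Field>& out) {
  std::vector<double> packed;
  for (const Field& f : in) packed.insert(packed.end(), f.begin(), f.end());
  out.assign(in.size(), Field(in.front().size()));
  std::size_t pos = 0;
  for (Field& f : out) {
    std::copy(packed.begin() + pos, packed.begin() + pos + f.size(), f.begin());
    pos += f.size();
  }
}

int main() {
  std::vector<Field> in(10, Field(500000, 1.0)), out;
  auto time = [](auto&& fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
  };
  std::printf("direct copy    : %.4f s\n", time([&] { copyDirect(in, out); }));
  std::printf("via conversion : %.4f s\n", time([&] { copyViaConversion(in, out); }));
  return 0;
}
```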

@guillaumevernieres (Contributor, Author)

@shlyaeva, @travissluka, I'll redo the test above with a few more nodes. The first 85+ iterations were quite quick, but everything slowed down after that. I suspect the system was swapping/running out of memory, so the timing above is probably not accurate.

@guillaumevernieres (Contributor, Author) commented Jan 14, 2025

Looking at the correct output now.
Left is the SABER ensemble, right is the OOPS ensemble:
[Figure: envar-test]
So 2 TB vs 14 TB ... Good hunch @travissluka and @shlyaeva.

The results are also the same:
[Figure: envar-test-2]

@guillaumevernieres (Contributor, Author)

Same as above for the hybrid:

OOPS_STATS ------------------------------------------------------------------------------------------------------------------
OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 320 MPI tasks) -----------------------------------
OOPS_STATS ------------------------------------------------------------------------------------------------------------------
OOPS_STATS Name                                                :     min (ms)    max (ms)    avg (ms)     % total   imbal (%)
OOPS_STATS oops::Covariance::HybridCovariance                  :     32546.17    32553.11    32550.28       12.94        0.02
OOPS_STATS oops::Covariance::SABER::Constructor                :       913.14      979.31      940.93        0.37        7.03
OOPS_STATS oops::Covariance::SABER::multiply                   :    157383.20   157631.05   157485.01       62.61        0.16
OOPS_STATS oops::Covariance::hybrid::multiply                  :    159325.24   159741.42   159458.67       63.39        0.26
OOPS_STATS oops::Diffusion::multiplySqrtAD                     :     37250.82    39920.72    38443.99       15.28        6.94
OOPS_STATS oops::Diffusion::multiplySqrtTL                     :     30224.91    31180.82    30701.80       12.21        3.11
OOPS_STATS oops::Geometry::Geometry                            :      5050.11     5641.13     5261.35        2.09       11.23
OOPS_STATS oops::GeometryData::setGlobalTree                   :     11199.75    12161.53    11763.62        4.68        8.18
OOPS_STATS oops::GetValues::GetValues                          :       751.81     1127.68      823.91        0.33       45.62
OOPS_STATS oops::GetValues::fillGeoVaLsAD                      :       177.67      484.59      276.60        0.11      110.96
OOPS_STATS oops::GetValues::fillGeoVaLsTL                      :       453.26     2380.51     2049.45        0.81       94.04
OOPS_STATS oops::GetValues::finalizeAD                         :        75.25      759.79      352.43        0.14      194.23
OOPS_STATS oops::GetValues::processAD                          :       967.67     3038.35     1210.55        0.48      171.05
OOPS_STATS oops::GetValues::processTL                          :      1163.24     3244.19     1423.43        0.57      146.19
OOPS_STATS oops::Increment::Increment                          :      3785.82     4141.98     4007.73        1.59        8.89
OOPS_STATS oops::Increment::diff                               :      7440.53     7511.48     7471.24        2.97        0.95
OOPS_STATS oops::Increment::fromFieldSet                       :       462.78      641.97      541.01        0.22       33.12
OOPS_STATS oops::Increment::operator+=                         :       516.85      586.55      546.45        0.22       12.76
OOPS_STATS oops::Increment::operator+=(State, Increment)       :       395.56      400.74      397.46        0.16        1.31
OOPS_STATS oops::Increment::operator=                          :      5172.03     5698.00     5465.41        2.17        9.62
OOPS_STATS oops::Increment::read                               :       763.99      814.95      800.32        0.32        6.37
OOPS_STATS oops::Increment::toFieldSet                         :       824.81     1012.28      904.43        0.36       20.73
OOPS_STATS oops::Increment::write                              :      1403.66     1421.43     1409.53        0.56        1.26
OOPS_STATS oops::LinearObsOper::adt_rads_all::simulateObsTL    :        21.16      340.36      271.95        0.11      117.37
OOPS_STATS oops::LinearVariableChange::changeVarAD             :      4591.56     6564.58     5048.98        2.01       39.08
OOPS_STATS oops::LinearVariableChange::changeVarTL             :      3794.12     4031.92     3922.07        1.56        6.06
OOPS_STATS oops::LinearVariableChange::changeVarTraj           :       997.01     1081.35     1034.31        0.41        8.15
OOPS_STATS oops::Model::step                                   :      4681.61     4687.85     4685.61        1.86        0.13
OOPS_STATS oops::ObsSpace::ObsSpace                            :      2760.37     2824.17     2785.27        1.11        2.29
OOPS_STATS oops::ObsSpace::save                                :      1798.49     1895.83     1873.13        0.74        5.20
OOPS_STATS oops::ObsVector::dot_product                        :       890.55     1341.97     1186.49        0.47       38.05
OOPS_STATS oops::Parameters::deserialize                       :       501.14      579.66      540.28        0.21       14.53
OOPS_STATS oops::State::State                                  :     34111.91    34284.96    34205.59       13.60        0.51
OOPS_STATS oops::State::read                                   :      4681.54     4687.77     4685.55        1.86        0.13
OOPS_STATS oops::State::write                                  :      6560.15     6609.21     6562.67        2.61        0.75
OOPS_STATS oops::UnstructuredInterpolator::apply               :      1761.80     3939.60     2077.34        0.83      104.84
OOPS_STATS oops::UnstructuredInterpolator::applyAD             :       632.49     2561.78      827.84        0.33      233.05
OOPS_STATS oops::VariableChange::changeVar                     :       917.54      926.91      922.06        0.37        1.02
OOPS_STATS oops::mpi::broadcast                                :      9783.89    11783.99    10521.28        4.18       19.01
OOPS_STATS util::Timers::Total                                 :    251516.92   251592.34   251542.39      100.00        0.03
OOPS_STATS util::Timers::measured                              :    247448.18   250575.87   248576.11       98.82        1.26
OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 320 MPI tasks) -----------------------------------

OOPS_STATS Run end                                  - Runtime:    257.84 sec,  Memory: total:  2902.69 Gb, per task: min =     6.58 Gb, max =    10.19 Gb
Run: Finishing oops::Variational<SOCA, UFO and IODA observations> with status = 0
OOPS Ending   2025-01-15 07:57:32 (UTC-0600)

Seems reasonable to me. Thanks for your help/guidance, @shlyaeva and @travissluka.
