Memory "management" issue with intel #1322
I'm labeling this as
fyi @fmahebert
Yes, we've seen the intel compiler produce executables that take up far more memory for a while now. We haven't been able to pinpoint the cause yet. Thanks for opening an issue to track this and for sharing your measurements. Related issue (presumably):
Running with map profiling:
@fmahebert has there been any progress on reducing the memory footprint using intel? This has been blocking us for months now; we can't run even a low-resolution coupled DA experiment.
@jswhit There's been no progress towards understanding this issue from the JCSDA core team, largely for lack of resources (not a lack of concern).
If you set me up with a small (<= 32 nodes) reproducer on wcoss2, I can take a look and work with Intel on a solution. I have a complete build of the global-workflow on dogwood.
Would using gcc for GDASapp (instead of intel) be a potential workaround for this in the short term?
@jswhit Yes, certain compiler/platform combinations work significantly better. E.g. intel + hera doesn't have this issue, as Guillaume pointed out. Using gnu on orion/hercules is another option; I think Guillaume had much more success with that than with intel.
@shlyaeva I'm mainly interested in running on gaea, but I think the issue there is the same as on hercules/orion (they all use a newer intel compiler than hera).
Is there any interest in having me work this issue with Intel? I'll need a reproducer on WCOSS2.
It only seems to happen with a specific intel compiler version @dkokron (2021.9.0). I don't think that version is available on WCOSS2.
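Since the blow-up has so far only been reported with one compiler version, a build or test harness could flag it early. The sketch below is hypothetical (the `is_affected` helper and the affected-version constant are my own, based only on the versions quoted in this thread):

```python
def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '2021.9.0' into a comparable tuple."""
    return tuple(int(x) for x in v.split("."))

# Version reported in this thread to show the memory blow-up (hercules/gaea/orion).
AFFECTED = parse_version("2021.9.0")

def is_affected(compiler_version: str) -> bool:
    # Hypothetical check: only the exact version above is known to reproduce
    # the issue; hera's 2021.5.0 reportedly does not.
    return parse_version(compiler_version) == AFFECTED
```

Tuple comparison also makes "newer than hera's compiler" checks trivial if the affected range turns out to be wider than one version.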
3DVAR timing and memory footprint

Not a real solution, but a reasonable workaround is to run with the background error at lower resolution.

OOPS_STATS ------------------------------------------------------------------------------------------------------------------
OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 400 MPI tasks) -----------------------------------
OOPS_STATS ------------------------------------------------------------------------------------------------------------------
OOPS_STATS Name : min (ms) max (ms) avg (ms) % total imbal (%)
OOPS_STATS oops::Covariance::SABER::Constructor : 1103.82 1198.85 1146.34 1.00 8.29
OOPS_STATS oops::Covariance::SABER::multiply : 58715.02 59243.15 58930.03 51.27 0.90
OOPS_STATS oops::Diffusion::multiplySqrtAD : 23431.70 25692.76 24786.82 21.56 9.12
OOPS_STATS oops::Diffusion::multiplySqrtTL : 18287.15 19840.62 19109.24 16.62 8.13
OOPS_STATS oops::Geometry::Geometry : 2675.98 2749.84 2716.12 2.36 2.72
OOPS_STATS oops::GeometryData::setGlobalTree : 8839.24 11841.87 11075.01 9.64 27.11
OOPS_STATS oops::GetValues::GetValues : 322.96 630.51 372.30 0.32 82.61
OOPS_STATS oops::GetValues::fillGeoVaLsAD : 96.94 218.40 146.52 0.13 82.89
OOPS_STATS oops::GetValues::fillGeoVaLsTL : 333.48 727.05 566.53 0.49 69.47
OOPS_STATS oops::GetValues::finalizeAD : 50.92 411.44 184.99 0.16 194.88
OOPS_STATS oops::GetValues::processAD : 375.12 795.87 434.65 0.38 96.80
OOPS_STATS oops::GetValues::processTL : 484.02 910.18 549.37 0.48 77.57
OOPS_STATS oops::Increment::Increment : 3428.65 4929.35 4658.44 4.05 32.21
OOPS_STATS oops::Increment::diff : 6225.35 6495.68 6443.38 5.61 4.20
OOPS_STATS oops::Increment::fromFieldSet : 170.79 300.65 255.70 0.22 50.79
OOPS_STATS oops::Increment::operator+= : 118.17 238.08 209.65 0.18 57.19
OOPS_STATS oops::Increment::operator+=(State, Increment) : 377.02 419.71 411.78 0.36 10.37
OOPS_STATS oops::Increment::operator= : 4396.43 5846.15 4677.29 4.07 30.99
OOPS_STATS oops::Increment::read : 826.37 988.37 887.24 0.77 18.26
OOPS_STATS oops::Increment::toFieldSet : 433.85 628.13 565.44 0.49 34.36
OOPS_STATS oops::Increment::write : 4670.61 4762.67 4683.27 4.07 1.97
OOPS_STATS oops::LinearObsOper::adt_rads_all::simulateObsAD : 43.70 493.48 146.35 0.13 307.34
OOPS_STATS oops::LinearObsOper::adt_rads_all::simulateObsTL : 127.41 425.35 331.05 0.29 90.00
OOPS_STATS oops::LinearVariableChange::changeVarAD : 5457.51 6193.37 5905.87 5.14 12.46
OOPS_STATS oops::LinearVariableChange::changeVarTL : 3758.81 4353.66 3932.29 3.42 15.13
OOPS_STATS oops::LinearVariableChange::changeVarTraj : 1163.66 1251.76 1209.80 1.05 7.28
OOPS_STATS oops::Model::step : 3149.08 3165.68 3160.71 2.75 0.53
OOPS_STATS oops::ObsSpace::ObsSpace : 1262.82 1464.81 1312.78 1.14 15.39
OOPS_STATS oops::ObsSpace::save : 873.28 1008.77 981.19 0.85 13.81
OOPS_STATS oops::ObsVector::dot_product : 502.90 660.94 606.96 0.53 26.04
OOPS_STATS oops::Parameters::deserialize : 190.62 371.58 319.44 0.28 56.65
OOPS_STATS oops::State::State : 8297.06 8555.88 8466.91 7.37 3.06
OOPS_STATS oops::State::print : 125.20 265.05 145.83 0.13 95.90
OOPS_STATS oops::State::read : 3149.02 3165.62 3160.65 2.75 0.53
OOPS_STATS oops::State::write : 6392.60 6446.01 6399.05 5.57 0.83
OOPS_STATS oops::UnstructuredInterpolator::UnstructuredInterpolator: 70.26 462.72 120.90 0.11 324.61
OOPS_STATS oops::UnstructuredInterpolator::apply : 995.31 1399.08 1104.99 0.96 36.54
OOPS_STATS oops::UnstructuredInterpolator::applyAD : 256.86 510.30 294.35 0.26 86.10
OOPS_STATS oops::VariableChange::changeVar : 793.25 909.37 815.99 0.71 14.23
OOPS_STATS oops::mpi::broadcast : 248.01 472.37 414.88 0.36 54.08
OOPS_STATS util::Timers::Total : 114943.10 114955.84 114945.58 100.00 0.01
OOPS_STATS util::Timers::measured : 110541.30 111279.83 110949.84 96.52 0.67
OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 400 MPI tasks) -----------------------------------
OOPS_STATS Run end - Runtime: 119.34 sec, Memory: total: 2055.94 Gb, per task: min = 4.70 Gb, max = 6.21 Gb
Run: Finishing oops::Variational<SOCA, UFO and IODA observations> with status = 0
OOPS Ending 2025-01-10 14:23:19 (UTC-0600)
The memory footprint is still way off what it should be, but that should allow running at least the 3DVAR on the MSU machines. I like the runtime!
@guillaumevernieres and @travissluka excellent!
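For reference when reading these tables: the imbal (%) column is consistent with (max - min) / avg * 100. A small hypothetical parser (the regex and function name are my own) can recompute it from any timing row:

```python
import re

# Hypothetical parser for one OOPS_STATS timing row:
# name, then min/max/avg (ms), % total, and imbalance (%).
ROW = re.compile(
    r"OOPS_STATS\s+(\S+)\s*:\s*"
    r"([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)"
)

def imbalance(line: str):
    """Return (name, recomputed imbalance %, reported imbalance %)."""
    m = ROW.match(line)
    name = m.group(1)
    tmin, tmax, tavg, pct_total, imbal = map(float, m.groups()[1:])
    computed = (tmax - tmin) / tavg * 100.0  # assumed definition of imbal (%)
    return name, computed, imbal
```

Applied to the oops::Covariance::SABER::multiply row above (min 58715.02, max 59243.15, avg 58930.03), this recomputes roughly 0.90 %, matching the reported value.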
Envar (only ens. B) timing and memory footprint

32 nodes on hercules, so about 16.4 TB of memory available.

OOPS_STATS ------------------------------------------------------------------------------------------------------------------
OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 384 MPI tasks) -----------------------------------
OOPS_STATS ------------------------------------------------------------------------------------------------------------------
OOPS_STATS Name : min (ms) max (ms) avg (ms) % total imbal (%)
OOPS_STATS oops::Covariance::EnsembleCovariance : 41562.28 41566.72 41565.02 16.97 0.01
OOPS_STATS oops::Covariance::HybridCovariance : 41562.38 41566.81 41565.11 16.97 0.01
OOPS_STATS oops::Covariance::ensemble::multiply : 141330.27 141468.64 141419.71 57.74 0.10
OOPS_STATS oops::Covariance::hybrid::multiply : 142527.23 143173.96 142725.83 58.27 0.45
OOPS_STATS oops::Diffusion::multiplySqrtAD : 13347.71 26988.30 20699.80 8.45 65.90
OOPS_STATS oops::Diffusion::multiplySqrtTL : 11974.23 12411.24 12174.95 4.97 3.59
OOPS_STATS oops::Geometry::Geometry : 4513.01 4826.65 4729.41 1.93 6.63
OOPS_STATS oops::GeometryData::setGlobalTree : 7778.75 8272.73 7953.50 3.25 6.21
OOPS_STATS oops::GetValues::GetValues : 672.11 1345.24 1067.41 0.44 63.06
OOPS_STATS oops::GetValues::fillGeoVaLsAD : 182.49 449.04 279.43 0.11 95.39
OOPS_STATS oops::GetValues::fillGeoVaLsTL : 551.07 2372.48 2019.65 0.82 90.18
OOPS_STATS oops::GetValues::finalizeAD : 90.13 953.59 456.38 0.19 189.20
OOPS_STATS oops::GetValues::processAD : 1139.94 3050.05 1362.92 0.56 140.15
OOPS_STATS oops::GetValues::processTL : 1383.54 3350.98 1607.84 0.66 122.36
OOPS_STATS oops::Increment::Increment : 16213.04 21797.49 17508.36 7.15 31.90
OOPS_STATS oops::Increment::deserialize : 1270.86 1463.40 1391.76 0.57 13.83
OOPS_STATS oops::Increment::diff : 5036.97 5177.78 5132.75 2.10 2.74
OOPS_STATS oops::Increment::fromFieldSet : 4728.63 6766.53 6040.92 2.47 33.73
OOPS_STATS oops::Increment::operator+= : 3679.93 4015.73 3807.93 1.55 8.82
OOPS_STATS oops::Increment::operator+=(State, Increment) : 494.53 557.88 548.78 0.22 11.54
OOPS_STATS oops::Increment::operator= : 17725.19 29973.16 23352.51 9.53 52.45
OOPS_STATS oops::Increment::schur_product_with : 6944.84 7680.13 7179.69 2.93 10.24
OOPS_STATS oops::Increment::serialize : 5458.99 6124.22 5843.42 2.39 11.38
OOPS_STATS oops::Increment::toFieldSet : 6835.37 8428.32 7182.96 2.93 22.18
OOPS_STATS oops::Increment::write : 5151.49 5238.21 5194.55 2.12 1.67
OOPS_STATS oops::LinearObsOper::adt_rads_all::simulateObsTL : 58.84 364.90 280.13 0.11 109.26
OOPS_STATS oops::LinearVariableChange::changeVarAD : 1029.50 2871.49 1676.27 0.68 109.89
OOPS_STATS oops::LinearVariableChange::changeVarTL : 540.71 722.36 613.74 0.25 29.60
OOPS_STATS oops::Localization::Localization : 341.41 361.83 354.49 0.14 5.76
OOPS_STATS oops::Localization::multiply : 95456.96 106501.69 101509.03 41.45 10.88
OOPS_STATS oops::Model::step : 4705.01 4729.89 4718.94 1.93 0.53
OOPS_STATS oops::ObsSpace::ObsSpace : 2342.38 2389.17 2373.72 0.97 1.97
OOPS_STATS oops::ObsSpace::save : 1698.46 1807.57 1773.40 0.72 6.15
OOPS_STATS oops::ObsVector::dot_product : 1728.08 2340.35 2140.95 0.87 28.60
OOPS_STATS oops::Parameters::deserialize : 480.64 662.44 536.48 0.22 33.89
OOPS_STATS oops::State::State : 47629.94 48087.49 47908.15 19.56 0.96
OOPS_STATS oops::State::print : 216.71 368.33 352.11 0.14 43.06
OOPS_STATS oops::State::read : 4704.97 4729.84 4718.89 1.93 0.53
OOPS_STATS oops::State::write : 6295.20 6372.02 6325.02 2.58 1.21
OOPS_STATS oops::UnstructuredInterpolator::apply : 1777.95 3953.01 2038.08 0.83 106.72
OOPS_STATS oops::UnstructuredInterpolator::applyAD : 752.56 2610.84 924.51 0.38 201.00
OOPS_STATS oops::VariableChange::changeVar : 736.43 849.89 777.40 0.32 14.60
OOPS_STATS oops::mpi::broadcast : 6772.03 7625.82 7277.89 2.97 11.73
OOPS_STATS util::Timers::Total : 244899.57 245088.85 244919.20 100.00 0.08
OOPS_STATS util::Timers::measured : 240758.60 243753.44 242169.49 98.88 1.24
OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 384 MPI tasks) -----------------------------------
OOPS_STATS Run end - Runtime: 249.50 sec, Memory: total: 14252.49 Gb, per task: min = 29.84 Gb, max = 39.28 Gb
Run: Finishing oops::Variational<SOCA, UFO and IODA observations> with status = 0
OOPS Ending 2025-01-14 08:15:24 (UTC-0600)
Interesting that Increment assignment takes so long; maybe something we can look into (maybe there are too many in the ensemble covariance multiply and that can be simplified?). Also, it looks like the diffusion adjoint takes a lot longer than the TL (especially compared to the 3DVAR stats), but it could be that memory swapping is warping the statistics.
@guillaumevernieres how many Increment::operator= calls are there compared to Covariance::ensemble::multiply calls?
@guillaumevernieres we could also try switching to the saber ensemble covariance (the end result should be the same; it's just a different way of doing ensemble covariances, in saber on atlas fieldsets instead of in oops on model increments) and seeing if fieldset assignments/copies are cheaper than soca increment assignments/copies.
Yes, saber should be more efficient. There are lingering inefficiencies in some soca state/increment calls where extra conversions to/from atlas are taking place, a temporary side effect of moving soca to store everything internally as atlas fieldsets.
@shlyaeva, @travissluka, I'll redo the test above with a few more nodes. The first 85+ iterations were quite quick, but everything slowed down after that. I suspect the system was swapping/running out of memory, so the timing above is probably not accurate.
Looking at the correct output now.
Same as above for the hybrid:

OOPS_STATS ------------------------------------------------------------------------------------------------------------------
OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 320 MPI tasks) -----------------------------------
OOPS_STATS ------------------------------------------------------------------------------------------------------------------
OOPS_STATS Name : min (ms) max (ms) avg (ms) % total imbal (%)
OOPS_STATS oops::Covariance::HybridCovariance : 32546.17 32553.11 32550.28 12.94 0.02
OOPS_STATS oops::Covariance::SABER::Constructor : 913.14 979.31 940.93 0.37 7.03
OOPS_STATS oops::Covariance::SABER::multiply : 157383.20 157631.05 157485.01 62.61 0.16
OOPS_STATS oops::Covariance::hybrid::multiply : 159325.24 159741.42 159458.67 63.39 0.26
OOPS_STATS oops::Diffusion::multiplySqrtAD : 37250.82 39920.72 38443.99 15.28 6.94
OOPS_STATS oops::Diffusion::multiplySqrtTL : 30224.91 31180.82 30701.80 12.21 3.11
OOPS_STATS oops::Geometry::Geometry : 5050.11 5641.13 5261.35 2.09 11.23
OOPS_STATS oops::GeometryData::setGlobalTree : 11199.75 12161.53 11763.62 4.68 8.18
OOPS_STATS oops::GetValues::GetValues : 751.81 1127.68 823.91 0.33 45.62
OOPS_STATS oops::GetValues::fillGeoVaLsAD : 177.67 484.59 276.60 0.11 110.96
OOPS_STATS oops::GetValues::fillGeoVaLsTL : 453.26 2380.51 2049.45 0.81 94.04
OOPS_STATS oops::GetValues::finalizeAD : 75.25 759.79 352.43 0.14 194.23
OOPS_STATS oops::GetValues::processAD : 967.67 3038.35 1210.55 0.48 171.05
OOPS_STATS oops::GetValues::processTL : 1163.24 3244.19 1423.43 0.57 146.19
OOPS_STATS oops::Increment::Increment : 3785.82 4141.98 4007.73 1.59 8.89
OOPS_STATS oops::Increment::diff : 7440.53 7511.48 7471.24 2.97 0.95
OOPS_STATS oops::Increment::fromFieldSet : 462.78 641.97 541.01 0.22 33.12
OOPS_STATS oops::Increment::operator+= : 516.85 586.55 546.45 0.22 12.76
OOPS_STATS oops::Increment::operator+=(State, Increment) : 395.56 400.74 397.46 0.16 1.31
OOPS_STATS oops::Increment::operator= : 5172.03 5698.00 5465.41 2.17 9.62
OOPS_STATS oops::Increment::read : 763.99 814.95 800.32 0.32 6.37
OOPS_STATS oops::Increment::toFieldSet : 824.81 1012.28 904.43 0.36 20.73
OOPS_STATS oops::Increment::write : 1403.66 1421.43 1409.53 0.56 1.26
OOPS_STATS oops::LinearObsOper::adt_rads_all::simulateObsTL : 21.16 340.36 271.95 0.11 117.37
OOPS_STATS oops::LinearVariableChange::changeVarAD : 4591.56 6564.58 5048.98 2.01 39.08
OOPS_STATS oops::LinearVariableChange::changeVarTL : 3794.12 4031.92 3922.07 1.56 6.06
OOPS_STATS oops::LinearVariableChange::changeVarTraj : 997.01 1081.35 1034.31 0.41 8.15
OOPS_STATS oops::Model::step : 4681.61 4687.85 4685.61 1.86 0.13
OOPS_STATS oops::ObsSpace::ObsSpace : 2760.37 2824.17 2785.27 1.11 2.29
OOPS_STATS oops::ObsSpace::save : 1798.49 1895.83 1873.13 0.74 5.20
OOPS_STATS oops::ObsVector::dot_product : 890.55 1341.97 1186.49 0.47 38.05
OOPS_STATS oops::Parameters::deserialize : 501.14 579.66 540.28 0.21 14.53
OOPS_STATS oops::State::State : 34111.91 34284.96 34205.59 13.60 0.51
OOPS_STATS oops::State::read : 4681.54 4687.77 4685.55 1.86 0.13
OOPS_STATS oops::State::write : 6560.15 6609.21 6562.67 2.61 0.75
OOPS_STATS oops::UnstructuredInterpolator::apply : 1761.80 3939.60 2077.34 0.83 104.84
OOPS_STATS oops::UnstructuredInterpolator::applyAD : 632.49 2561.78 827.84 0.33 233.05
OOPS_STATS oops::VariableChange::changeVar : 917.54 926.91 922.06 0.37 1.02
OOPS_STATS oops::mpi::broadcast : 9783.89 11783.99 10521.28 4.18 19.01
OOPS_STATS util::Timers::Total : 251516.92 251592.34 251542.39 100.00 0.03
OOPS_STATS util::Timers::measured : 247448.18 250575.87 248576.11 98.82 1.26
OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 320 MPI tasks) -----------------------------------
OOPS_STATS Run end - Runtime: 257.84 sec, Memory: total: 2902.69 Gb, per task: min = 6.58 Gb, max = 10.19 Gb
Run: Finishing oops::Variational<SOCA, UFO and IODA observations> with status = 0
OOPS Ending 2025-01-15 07:57:32 (UTC-0600)
Seems reasonable to me. Thanks for your help/guidance @shlyaeva and @travissluka.
The soca variational application takes an insane amount of memory on Hercules and Gaea (~8 TB for the simple 3DVAR); both use intel 2021.9.0. The same application on Hera requires ~0.8 TB of memory; the intel compiler version on Hera is 2021.5.0. I have no idea if the compiler is the issue.
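Comparing footprints like the ones quoted in this thread is easier if the summary line is pulled out of each log programmatically. A hypothetical helper (the regex and names are my own, matching only the "Run end" format shown above):

```python
import re

# Matches the summary line OOPS prints at the end of a run, e.g.:
# "OOPS_STATS Run end - Runtime: 119.34 sec, Memory: total: 2055.94 Gb,
#  per task: min = 4.70 Gb, max = 6.21 Gb"
SUMMARY = re.compile(
    r"Runtime:\s*([\d.]+)\s*sec.*?total:\s*([\d.]+)\s*Gb.*?"
    r"min\s*=\s*([\d.]+)\s*Gb.*?max\s*=\s*([\d.]+)\s*Gb"
)

def parse_run_end(line: str) -> dict:
    """Extract runtime and memory figures from an OOPS 'Run end' line."""
    runtime, total, per_min, per_max = map(float, SUMMARY.search(line).groups())
    return {"runtime_s": runtime, "total_gb": total,
            "per_task_min_gb": per_min, "per_task_max_gb": per_max}
```

Running this over the logs from Hera and Hercules for the same configuration would give a direct per-task comparison of the two compiler versions.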