
profiled process getting killed much too early on Mac by OOM detector #287

Open
petergaultney opened this issue Jan 21, 2022 · 6 comments
@petergaultney

Version information

Fil: 2021.12.2
Python: 3.7.12 (default, Dec 20 2021, 11:33:29)
[Clang 13.0.0 (clang-1300.0.29.3)]

Additional context that could be valuable: this is on macOS Monterey on an M1 Max, but I'm specifically running this as an x64 process, not ARM.

The machine has 64 GB of RAM.

This is what is getting output:

=fil-profile= WARNING: Excessive swapping. Program itself allocated 28992299350 bytes, 19128373248 are resident, the difference (presumably swap) is 9863926102, which is more than available system bytes 9830146048
=fil-profile= WARNING: Detected out-of-memory condition, exiting soon.
=fil-profile= Host memory info: Ok(VirtualMemory { total: 68719476736, available: 9830146048, used: 7917293568, free: 2665578496, percent: 85.69525, active: 6971088896, inactive: 5435318272, wired: 946204672 }) Ok(SwapMemory { total: 1073741824, used: 106168320, free: 967573504, percent: 9.887695, swapped_in: 25114910720, swapped_out: 488624128 })
=fil-profile= Process memory info: Ok(MemoryInfo { rss: 19128999936, vms: 68041039872, page_faults: 5594374, pageins: 0 })
=fil-profile= We'll try to dump out SVGs. Note that no HTML file will be written.
=fil-profile= Preparing to write to fil-result/2022-01-21T15:51:40.273
=fil-profile= Wrote flamegraph to "fil-result/2022-01-21T15:51:40.273/out-of-memory.svg"
=fil-profile= Wrote flamegraph to "fil-result/2022-01-21T15:51:40.273/out-of-memory-reversed.svg"

I can reproduce this consistently.

However, the process runs to completion when not run under fil-profile, and in fact it seems to work just fine with --disable-oom-detection as well. (The runs take hours, so all I can be certain of so far is that the flag prevents the process from getting killed early on. I'll update or close this later if I hit an actual OOM, but this exact same run has completed successfully before, and I've already gone ~5x past the RAM that was in use when fil-profile killed it, so I rather doubt it will.)

It's nice that the flag exists, but the behavior feels like a bug to me. My machine handles the RAM usage just fine when I run the process, so fil-profile's OOM calibration seems to be way off.
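
For reference, the check the warning above describes appears to boil down to "bytes the program allocated, minus bytes resident, exceeds available system memory". Here is a minimal sketch of that logic in Python using psutil; it is inferred from the log message, not Fil's actual Rust implementation, and `allocated` stands in for Fil's internally tracked allocation count:

```python
import psutil

def looks_like_oom(allocated: int) -> bool:
    """Rough reconstruction of the check described by the warning above."""
    available = psutil.virtual_memory().available       # "available system bytes"
    resident = psutil.Process().memory_info().rss       # bytes actually resident
    presumed_swap = allocated - resident                 # "the difference (presumably swap)"
    return presumed_swap > available

# Plugging in the numbers from the log: 28_992_299_350 - 19_128_373_248
# = 9_863_926_102, which exceeds available = 9_830_146_048, so the detector fires.
```

Whether that comparison is a good proxy for "about to run out of memory" on a 64 GB machine is exactly what's in question here.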

@itamarst
Collaborator

Thanks for the bug report! I will take a look at the heuristics again.

@itamarst itamarst added the NEXT label Jan 21, 2022
@itamarst
Collaborator

itamarst commented Jan 22, 2022

Turning the above into a more readable form:

Host            value
total           68,719,476,736
available        9,830,146,048
used             7,917,293,568
free             2,665,578,496
percent         85%
active           6,971,088,896
inactive         5,435,318,272

Process         value
rss             19,128,999,936
vms             68,041,039,872

@itamarst
Collaborator

itamarst commented Jan 23, 2022

  1. Theory: bad heuristic. The heuristic was added for macOS specifically, where swapping is very aggressive, so the heuristic is essentially "you have a lot of swap in use". But I wonder if on M1 hardware swapping is even more aggressive? Anecdotally I would expect that.
  2. Theory: more RAM invalidates the heuristic. Or maybe it's just that you have a lot more RAM than the macOS machines I've tested on.
  3. Theory: bad data. The numbers reported for the host are weird. Where did the rest of the memory go? (See the quick arithmetic after this list.) So possibly the library used to get memory info is giving bad information.
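
To make theory 3 concrete, the classic page categories in the host numbers don't come close to adding up to total RAM. A quick check, using the figures from the "Host memory info" line above:

```python
# Host figures from the log output above.
total    = 68_719_476_736
free     = 2_665_578_496
active   = 6_971_088_896
inactive = 5_435_318_272
wired    = 946_204_672

accounted = free + active + inactive + wired
print(accounted)          # 16_018_190_336  (~16 GB)
print(total - accounted)  # 52_701_286_400  (~52.7 GB unaccounted for)
```

On macOS the unaccounted portion is usually file cache plus compressed memory, neither of which shows up in these fields, so "available" may be underestimating how much RAM the process can actually use; that reading of the gap is an assumption, not something the numbers themselves prove.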

@petergaultney
Author

One interesting thing I've observed since reporting this is that I can get similar behavior on Ubuntu 18.04. Sometimes fil-profiler will cause the process to exit within the first few minutes, but the process can complete successfully when --disable-oom-detection is provided.

On both of these OSes I have some form of RAM compression enabled. I think that's the default for macOS; on Ubuntu I've enabled zram. I don't know if that's relevant to any of your hypotheses.
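
If it helps with the compression angle, zram exposes its stats under /sys/block; here is a small sketch for seeing how much it is actually compressing. It assumes a single zram0 device and the mm_stat layout from kernel 4.1+, both of which are assumptions about this particular setup:

```python
def zram_compression_ratio(dev: str = "zram0") -> float:
    """Ratio of original to compressed data held in the zram device."""
    with open(f"/sys/block/{dev}/mm_stat") as f:
        fields = f.read().split()
    # First two mm_stat columns: orig_data_size, compr_data_size.
    orig_data_size, compr_data_size = int(fields[0]), int(fields[1])
    return orig_data_size / compr_data_size if compr_data_size else 0.0

print(f"zram is compressing {zram_compression_ratio():.1f}x")
```

A high ratio would mean the kernel is holding far more logical data in RAM than the raw usage numbers suggest, which could plausibly confuse an availability-based heuristic.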

@itamarst
Collaborator

Fascinating. I guess that's a fourth theory: compression might distort the RAM availability statistics.

@itamarst
Collaborator

itamarst commented Jan 30, 2022

Still thinking about what to do... Some options:

  • Status quo. Always an option. If a small enough number of people have problems with the current OOM heuristic, maybe that's fine.
  • OOM detection off by default. Most people probably won't run out of memory, and if they do they can rerun; the goal is offline profiling anyway, and there's no assumption that this is running production code. If I do this, I should probably print a message at startup saying "if the run dies half-way, re-run with this flag enabled".
  • Tweak the heuristics. Maybe they could be more flexible/accurate? Linux has memory-pressure data on a sufficiently new kernel with the right config (see the sketch after this list), but macOS doesn't.
  • Special-case compressed RAM. Not sure that's the actual issue, or that it can be detected reliably even if it is.
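
For the Linux side of the "tweak the heuristics" option, the memory-pressure data lives in /proc/pressure/memory (PSI, available on kernel 4.20+ built with CONFIG_PSI; macOS has nothing comparable). A minimal sketch of reading it, not something Fil does today:

```python
def read_memory_pressure(path: str = "/proc/pressure/memory") -> dict:
    """Parse PSI lines like 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    pressure = {}
    with open(path) as f:
        for line in f:
            kind, rest = line.split(maxsplit=1)
            pressure[kind] = {k: float(v) for k, v in
                              (item.split("=") for item in rest.split())}
    return pressure

psi = read_memory_pressure()
# "full" is the share of time all non-idle tasks were stalled waiting on memory;
# a sustained non-zero avg10 is arguably a stronger OOM signal than free-byte math.
print(psi["full"]["avg10"])
```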
