
Reenable switch profile. #25

Draft · fruffy wants to merge 3 commits into main from fruffy/reenable_switch
Conversation

fruffy (Contributor) commented Jan 16, 2025

Check what breaks.

fruffy (Contributor, Author) commented Jan 17, 2025

It is possible that we run out of memory trying to build the switch profile. This could be a regression in the compiler.

fruffy (Contributor, Author) commented Jan 17, 2025

@jafingerhut Can you try this locally?

jafingerhut (Contributor)

Yes, I will try this locally on a VM today with 2 vCPUs + 24 GB RAM. If it fails due to insufficient RAM, that would be pretty bad. I'm not sure I will be able to distinguish between that reason and whatever other failure reason might appear. Will let you know.

jafingerhut (Contributor)

FYI, it seems occasionally useful to have commands like these in the CI scripts to tell us what basic hardware resources the CI VMs have:

uname -a
nproc
cat /proc/meminfo
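
A minimal sketch of how such a diagnostic step might look in a bash-based CI script (the helper itself and its output format are illustrative, not taken from the repository):

```bash
#!/usr/bin/env bash
# Hypothetical CI helper: print the runner's basic hardware resources so
# that OOM-style failures are easier to diagnose from the logs alone.
set -euo pipefail

echo "== Kernel and architecture =="
uname -a

echo "== CPU count =="
nproc

echo "== Memory =="
# MemTotal/MemAvailable/SwapTotal are usually the interesting lines;
# dump all of /proc/meminfo if more detail is needed.
grep -E 'MemTotal|MemAvailable|SwapTotal' /proc/meminfo
```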

jafingerhut (Contributor)

With a VM with 2 vCPUs and 24 GB RAM, I was able to build this branch successfully. I was also able to start bf_switchd, press TAB at the prompt, and see bfrt_python in the list of choices. I typed bfrt_python, pressed return, and got a bfrt_python> prompt. I did not try anything at that prompt to test further, but up to that point everything looks good.

I would recommend that in CI we try to use something like the code I just added to batch-install.sh here: https://github.com/p4lang/open-p4studio/pull/27/files to limit the number of parallel tasks, in hopes of avoiding using too much RAM. Perhaps CI could even run the batch-install.sh script as is?
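
The actual changes live in the linked PR; as a rough sketch of the general idea (choosing the job count from available memory), something like the following could be used. The 4 GB-per-job figure is an illustrative assumption, not a measured value, and this is not the code from batch-install.sh:

```bash
#!/usr/bin/env bash
# Sketch of a RAM-aware parallelism cap (illustrative only; see
# p4lang/open-p4studio#27 for the real batch-install.sh changes).
set -euo pipefail

# Assume each parallel build job may need roughly this much memory.
# This is a guess for illustration, not a measured figure.
GB_PER_JOB=4

cpus=$(nproc)
mem_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
mem_jobs=$(( mem_kb / (GB_PER_JOB * 1024 * 1024) ))

# Never go below 1 job, and never above the CPU count.
jobs=$(( mem_jobs < 1 ? 1 : mem_jobs ))
jobs=$(( jobs > cpus ? cpus : jobs ))

echo "Using ${jobs} parallel job(s) (${cpus} CPUs, ${mem_kb} kB available)"
# e.g. make -j "${jobs}", or pass the value to the build driver
```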

fruffy force-pushed the fruffy/reenable_switch branch 2 times, most recently from 7da79e7 to ad74dc5 on January 17, 2025, 17:14
fruffy (Contributor, Author) commented Jan 17, 2025

> I would recommend that in CI we try to use something like the code I just added to batch-install.sh here: https://github.com/p4lang/open-p4studio/pull/27/files to limit the number of parallel tasks, in hopes of avoiding using too much RAM. Perhaps CI could even run the batch-install.sh script as is?

Hmmm, the problem is not the compiler build but bf-p4c itself, which is single-threaded. It looks like its memory usage has increased massively.

jafingerhut (Contributor)

> I would recommend that in CI we try to use something like the code I just added to batch-install.sh here: https://github.com/p4lang/open-p4studio/pull/27/files to limit the number of parallel tasks, in hopes of avoiding using too much RAM. Perhaps CI could even run the batch-install.sh script as is?

> Hmmm, the problem is not the compiler build but bf-p4c itself, which is single-threaded. It looks like its memory usage has increased massively.

We can tweak my changes to limit the parallelism down to as low as 1 parallel job, for the entire build run. Not great in that it increases the elapsed time of the rest of the build, but if it always fails, that is effectively infinite build time, so worse :-)

fruffy force-pushed the fruffy/reenable_switch branch 2 times, most recently from 6b693f6 to 8fe0e64 on January 17, 2025, 18:21
fruffy (Contributor, Author) commented Jan 17, 2025

> We can tweak my changes to limit the parallelism down to as low as 1 parallel job, for the entire build run. Not great in that it increases the elapsed time of the rest of the build, but if it always fails, that is effectively infinite build time, so worse :-)

I reduced the number of jobs to 1, but the tests still fail. The problem is not related to parallelism but to increased memory usage in bf-p4c. Also, the runners only have 16 GB available.

It is possible that we had a regression in P4C at some point. @asl pointed out the absurd memory usage with https://github.com/p4lang/p4c/issues?q=is%3Aissue%20state%3Aopen%20label%3Acompiler-performance.

Maybe the changes to bdwgc are at fault... let me check.
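
One quick way to confirm a memory regression locally is to capture the peak resident set size of a single compile. A sketch, assuming GNU time is installed at /usr/bin/time, with the bf-p4c arguments left as placeholders for whatever the switch profile build actually invokes:

```bash
# Peak memory of one bf-p4c run. "Maximum resident set size" is the line
# GNU time prints with -v; the bf-p4c arguments here are placeholders.
/usr/bin/time -v bf-p4c switch.p4 2>&1 | tee compile.log \
  | grep 'Maximum resident set size'
```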

asl commented Jan 17, 2025

@fruffy I do not see a particular error message in the log; do you see how the problem is reported?

I doubt BDWGC is an issue. It is just bad memory use practices overall in the compiler...

jafingerhut (Contributor)

If it is an out-of-memory issue, then unless GitHub CI has special mechanisms to report it, the failure tends not to be reported unless you look for it specifically.

Maybe add some commands like these at the end of a CI run, if there is a way to reliably run bash commands after one of the earlier build steps fails?

dmesg -T | egrep -i 'killed process'
grep -i 'killed process' /var/log/messages

That is from reading some SO answers here [1], but if you have knowledge of better ways to look for these events, let me know.

[1] https://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
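
In GitHub Actions, a step guarded with if: failure() runs only when an earlier step has failed, which is one way to trigger such a check. A sketch of the diagnostic commands themselves (illustrative; Ubuntu runners use journald rather than /var/log/messages):

```bash
#!/usr/bin/env bash
# Hypothetical post-failure diagnostic: look for OOM killer activity.
echo "== Checking kernel log for OOM killer activity =="
sudo dmesg -T | grep -iE 'out of memory|killed process' || true
# journalctl -k covers the same kernel messages on journald-based systems.
sudo journalctl -k --no-pager | grep -iE 'out of memory|killed process' || true
```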

fruffy (Contributor, Author) commented Jan 17, 2025

> I doubt BDWGC is an issue. It is just bad memory use practices overall in the compiler...

The reason I suspect BDWGC is simply that I do not remember the compiler consuming this much memory for the basic switch programs in the past, and the only change I can think of that could impact allocation behavior at this scale is the garbage collector.
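
One way to test that hypothesis would be to compare peak memory of the same compile across builds with the collector enabled and disabled. A sketch, assuming the upstream p4c ENABLE_GC CMake option is also exposed by this build (that, the binary path, and the test program are assumptions):

```bash
# Sketch: compare peak RSS of one compile with and without BDWGC.
# ENABLE_GC is an upstream p4c CMake option; whether the open-p4studio
# build plumbs it through identically is an assumption, as are the
# compiler binary location and the test program below.
set -euo pipefail

cmake -S . -B build-gc   -DENABLE_GC=ON
cmake -S . -B build-nogc -DENABLE_GC=OFF
cmake --build build-gc   -j"$(nproc)"
cmake --build build-nogc -j"$(nproc)"

for build in build-gc build-nogc; do
  echo "== ${build} =="
  /usr/bin/time -v "./${build}/bf-p4c" switch.p4 2>&1 \
    | grep 'Maximum resident set size'
done
```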

fruffy force-pushed the fruffy/reenable_switch branch from 8fe0e64 to 0374fd5 on January 17, 2025, 22:47
fruffy force-pushed the fruffy/reenable_switch branch from 0374fd5 to b43147c on January 17, 2025, 23:03