Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make proxy dump print out more meaningful information. #1412

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

thananon
Copy link
Contributor

@thananon thananon commented Nov 6, 2024

Details

Proxy dump is a debugging tool to print out the current state and outstanding operations of the proxy. Proxy dump was not working properly prior to this patch. This PR made some modification to proxy dump code. This does not affect the communication logic in anyway.

Example:

[0-27|1780| Coll->Send 0G/2]  (send to rank 0 / state: waiting on GPU / on channel: 2)
| `-> [0-71|1782| Coll->Send 0/2]
| `-> [0-150|1784| Coll->Send 0/2]
| `-> [0-45|1786| Coll->Send 0/2]
| `-> [0-61|1788| Coll->Send 0/2]
| `-> [0-68|1790| Coll->Send 0/2]
| `-> [0-62|1792| Coll->Send 0/2]

Note that NCCL 2.23.4 made improvements in this area. We should revisit this when we eventually sync to that NCCL version.

This proves to be useful for debugging large scale job for us, and our partner are asking us to put this in as soon as possible.

Work item: : HPEXA-63 / LWPCOMMLIBS-559

What were the changes?
Change how proxy dump prints out information (added more info).

Why were the changes made?
This proxy dump is useful in debugging large scale multi-node hang.

How was the outcome achieved?
printf

Additional Documentation:
Note that NCCL 2.23.4 made improvements in this area. We should revisit this when we eventually sync to that NCCL version.

Proxy dump is not used in normal circumstance. This only affects debug instance when we set NCCL_DUMP_PROXY_SIGNAL.

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

for (int s=0; s<op->nsubs; s++) {
struct ncclProxySubArgs* sub = op->subs+s;
if (op->state == ncclProxyOpProgress) {
char status = ' ';
if (op->pattern == ncclPatternRecv) {
if (! op->send) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's wrong with original "op->pattern == ncclPatternRecv" checking?

if (sub->posted < sub->nsteps && sub->posted < sub->done + NCCL_STEPS) status = 'I'; // Init
else if (sub->received < sub->posted) status = 'R'; // Receiving
else if (sub->received < sub->transmitted) status = 'R'; // Receiving
else if (sub->transmitted < sub->received) status = 'F'; // Flushing
else if (sub->done < sub->transmitted) status = 'G'; // Waiting on GPU
else status = 'D'; // Done
} else if (op->pattern == ncclPatternSend) {
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many "patterns". Why remove the original checking?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants