torch.distributed.batch_isend_irecv is not recorded properly in ET #134

Open
shengfukevin opened this issue Jul 16, 2024 · 0 comments
We are developing comm_replay and found a problem with torch.distributed.batch_isend_irecv, which is used in one of our test traces.
The p2p comm sequence during real training between rank 0 and rank 8 is:

rank 0: batch -> send -> batch -> send -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> recv
rank 8: batch -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> send

The p2p comm sequence recorded in the Execution Trace for replay between rank 0 and rank 8 is:

rank 0: send -> send -> recv -> send -> recv -> send -> recv -> recv
rank 8: recv -> send -> recv -> send -> recv -> send -> recv -> send

The issue can be reproduced with the collected ET for https://github.com/pytorch/pytorch/blob/main/test/distributed/test_c10d_nccl.py#L3846
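For reference, the pattern in question looks roughly like this (a minimal sketch with assumed ranks, shapes, and device mapping, not the exact unit test):

```python
import torch
import torch.distributed as dist

def batched_p2p(rank: int, peer: int) -> None:
    # Buffers for one coalesced exchange; shapes and device mapping are
    # assumptions for illustration, not taken from the unit test.
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    send_buf = torch.ones(4, device=device)
    recv_buf = torch.zeros(4, device=device)
    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    # All ops in the list are issued inside one coalesced NCCL group,
    # which is exactly the grouping the ET should preserve.
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()
```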

The two attached files, batched-send-recv-0.json and batched-send-recv-1.json, come from a simpler version of that unit test (with only one batch_isend_irecv call). In them you can find the nccl::coalesced node, which marks the end of the coalescing buffer. I think the trace is missing a node to mark the start of the coalescing buffer. Once that is added, all send/recv nodes between the start and the end of coalescing should be treated as one coalesced group during replay.
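To make the proposal concrete, here is a hedged sketch of how the replayer could consume such a pair of markers. The start-marker name nccl::coalesced_start is hypothetical (only the end marker nccl::coalesced exists in the trace today), and the node names nccl::send/nccl::recv, the node fields (name, id, peer), and the tensors map are all assumptions about the ET node layout:

```python
import torch.distributed as dist

def replay_p2p_nodes(nodes, tensors):
    pending_ops = []            # P2POps collected inside a coalescing region
    in_coalesced_group = False
    for node in nodes:
        if node.name == "nccl::coalesced_start":   # hypothetical start marker
            in_coalesced_group = True
        elif node.name == "nccl::coalesced":       # existing end-of-coalescing marker
            # Replay the whole region as one coalesced group.
            reqs = dist.batch_isend_irecv(pending_ops)
            for req in reqs:
                req.wait()
            pending_ops, in_coalesced_group = [], False
        elif node.name in ("nccl::send", "nccl::recv") and in_coalesced_group:
            op = dist.isend if node.name == "nccl::send" else dist.irecv
            pending_ops.append(dist.P2POp(op, tensors[node.id], node.peer))
        # send/recv nodes outside a coalescing region replay as standalone ops.
```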
