With 8×H200 and R1, I used nsys to capture a performance profile and found that the ncclAllGather operator accounted for 36% of it. Is this normal? Could it be that NVLink is not enabled?
Answered by
hcyz33
Feb 16, 2025
The cause appears to be DP attention, which I had enabled. In MLA, a DP process that has no requests spends almost no time in attention and quickly enters the all-gather operator, then waits there for the other processes to finish MLA. That waiting is why the all-gather operator appears to take so long.
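To make the effect concrete, here is a toy model of it, not a measurement. It assumes the all-gather cannot complete on any rank until the slowest rank arrives, so a profiler attributes each rank's waiting time to the collective. The per-rank MLA times and the transfer time below are made-up illustrative numbers, and `apparent_allgather_time` is a hypothetical helper, not part of nsys, NCCL, or any serving framework.

```python
def apparent_allgather_time(mla_ms, transfer_ms):
    """Time each rank appears to spend inside the all-gather.

    Simplified model: a rank enters the collective as soon as its own MLA
    work is done, then blocks until the slowest rank arrives, after which
    the actual data transfer takes transfer_ms on every rank.
    """
    slowest = max(mla_ms)
    return [slowest - t + transfer_ms for t in mla_ms]


# Hypothetical numbers: ranks 3 and 5 have no requests, so MLA is nearly free.
mla_ms = [40.0, 40.0, 40.0, 0.5, 40.0, 0.5, 40.0, 40.0]
transfer_ms = 1.0  # assumed cost of the transfer itself

apparent = apparent_allgather_time(mla_ms, transfer_ms)
step_ms = max(mla_ms) + transfer_ms  # wall time of MLA plus all-gather
shares = [a / step_ms for a in apparent]

for rank, (a, s) in enumerate(zip(apparent, shares)):
    print(f"rank {rank}: all-gather appears to take {a:.1f} ms ({s:.0%} of the step)")
```

In this sketch the busy ranks see the all-gather take only the transfer time, while the idle ranks see it cover almost the whole step. A profile dominated by ncclAllGather on the idle ranks therefore points to load imbalance between DP ranks rather than to a slow interconnect.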