[QUESTION] questions about Collective Communication Group Initialization Optimization in the paper #40

Closed
siddharthaOnRoad opened this issue Jun 24, 2024 · 2 comments

@siddharthaOnRoad

Hi, I'm interested in the Collective Communication Group Initialization part of the paper, which greatly reduces the initialization time of a training job (from 1047s to under 5s):
image

image

The paper says that initialization is slow because of the global barrier after every process group creation. I noticed that since PyTorch 2.1, the store-based barrier performed after initializing a process group is controlled by the TORCH_DIST_INIT_BARRIER environment variable (see the PyTorch release notes). This variable defaults to "0", which means that by default no barrier is performed after initializing a process group.
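To make the point above concrete, here is a rough sketch of the gating described. It is a toy model and not PyTorch's actual implementation: the function names `barrier_after_init_enabled` and `count_barrier_waits` are hypothetical, and the assumption that every rank takes part in one global barrier per created group is only a simplification of what the paper describes.

```python
import os


def barrier_after_init_enabled(env=None):
    """Return True if the post-init store barrier is requested.

    Assumption: an unset variable or "0" means "no barrier", matching
    the default described above. Only the literal "1" enables it here.
    """
    env = os.environ if env is None else env
    return env.get("TORCH_DIST_INIT_BARRIER", "0") == "1"


def count_barrier_waits(world_size, num_groups, env=None):
    """Toy cost model (hypothetical): one global barrier per created
    group, with every rank participating in each barrier."""
    if not barrier_after_init_enabled(env):
        return 0
    return world_size * num_groups


if __name__ == "__main__":
    # Default behaviour: no barrier after process group initialization.
    print(count_barrier_waits(2048, 10, env={}))
    # Barrier explicitly enabled: the count grows with both factors.
    print(count_barrier_waits(2048, 10, env={"TORCH_DIST_INIT_BARRIER": "1"}))
```

In this model the barrier cost disappears entirely with the default setting, which is why I am asking whether the remaining initialization work is still slow.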

image

My questions are:
(1) With the global barrier removed, is communication group initialization in PyTorch still slow? And does the optimization described in the paper still bring large benefits?
(2) Will the source code for this part be made available in the future?

I really appreciate your awesome work. Looking forward to your reply.

@liwenchangbdbz liwenchangbdbz added the question Further information is requested label Jun 28, 2024
@vocaltract
Collaborator

vocaltract commented Aug 30, 2024

(1) Our optimized KVStore performs much better even than the libuv TCPStore (introduced in torch 2.4) and brings huge benefits when scaling up.
(2) More discussion is required.

@vocaltract vocaltract self-assigned this Aug 30, 2024
@MackZackA
Collaborator

Since there has been no further follow-up discussion, we'll close this issue for now.
