Hi, I'm interested in the Collective Communication Group Initialization part of the paper, which greatly reduced the initialization time of a training task (from 1047 s to under 5 s).
The paper mentions that initialization is slow because of the global barrier after every process-group creation. I noticed that starting from PyTorch 2.1, the store-based barrier after initializing a process group is controlled by the TORCH_DIST_INIT_BARRIER environment variable (see the PyTorch release notes). This variable defaults to "0", so by default no barrier is performed after a process group is initialized.
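For context, the store-based barrier being discussed can be sketched in plain Python. This is a toy, in-memory stand-in for PyTorch's TCPStore semantics (set a per-rank key, then block until all ranks' keys appear), not the actual torch.distributed implementation; the class and function names here are illustrative only. It shows the work that TORCH_DIST_INIT_BARRIER=0 skips after every process-group creation:

```python
import threading

class InMemoryStore:
    """Toy stand-in for a distributed key-value store such as TCPStore."""
    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        # Publish a key and wake up any rank polling for it.
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def wait_for_count(self, prefix, count):
        # Block until `count` keys with the given prefix exist.
        with self._cond:
            self._cond.wait_for(
                lambda: sum(k.startswith(prefix) for k in self._data) >= count
            )

def store_based_barrier(store, rank, world_size, group_name):
    # Each rank announces its arrival under the group's namespace,
    # then blocks until all world_size ranks have done the same.
    store.set(f"{group_name}/rank{rank}", b"1")
    store.wait_for_count(f"{group_name}/", world_size)

# Simulate 8 "ranks" (threads) passing the barrier for one group.
store = InMemoryStore()
world_size = 8
threads = [
    threading.Thread(target=store_based_barrier, args=(store, r, world_size, "pg0"))
    for r in range(world_size)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(store._data))  # 8: one key per rank
```

In a real cluster every rank performs this key traffic against a single store for every group created, which is why repeating it per group becomes expensive at scale.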
My questions are:
(1) After removing the global barrier, is communication-group initialization in PyTorch still slow? Does the optimization described in the paper still bring large benefits?
(2) Will the source code for this part be released in the future?
I really appreciate your awesome work. Looking forward to your reply.
(1) Our optimized KVStore still performs much better than the libuv-based TCPStore (introduced in PyTorch 2.4) and brings large benefits when scaling up.
(2) More discussion is required.
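A rough back-of-envelope calculation, under assumed (not paper-provided) numbers, illustrates why removing the per-group barrier alone is not the whole story: the store traffic for the barrier grows with both world size and the number of subgroups created during setup.

```python
# Hypothetical scale parameters, chosen only for illustration.
world_size = 4096   # number of ranks in the job
num_groups = 100    # e.g., DP/TP/PP subgroups created during initialization

# With a store-based barrier after every new group (the pre-2.1 default),
# every rank performs at least one set() per group against the central store,
# ignoring the additional polling get() retries:
ops_with_barrier = num_groups * world_size
print(ops_with_barrier)  # 409600 store operations, all hitting one endpoint

# TORCH_DIST_INIT_BARRIER=0 skips the barrier entirely:
ops_without_barrier = 0
```

Even with the barrier disabled, ranks still exchange rendezvous data (such as backend connection info) through the store during group creation, so a faster store implementation continues to matter as world size grows.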