Description
Hi, I'm interested in the Collective Communication Group Initialization part of the paper, which reduced the initialization time of a training task from 1047 s to under 5 s.
The paper mentions that initialization is slow because of the global barrier performed after every process group creation. I noticed that since PyTorch 2.1, the store-based barrier after process group initialization is controlled by the TORCH_DIST_INIT_BARRIER environment variable (see the PyTorch release notes). This variable defaults to "0", so by default no barrier is performed after initializing a process group.
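To make the point concrete, here is a minimal sketch (my own, not from the paper) of how this env var is used. In PyTorch >= 2.1 the store-based barrier is already skipped by default, so setting the variable explicitly only documents the intent; the `init_process_group` call is shown commented out since it needs a real distributed launch:

```python
import os

# Gate for the store-based barrier that used to run after every
# process-group creation. In PyTorch >= 2.1 it defaults to "0" (skipped),
# so this line is explicit rather than strictly necessary.
os.environ.setdefault("TORCH_DIST_INIT_BARRIER", "0")

# In a real training job (assumes a proper rendezvous setup):
# import torch.distributed as dist
# dist.init_process_group(backend="nccl")  # no global barrier afterwards

print(os.environ["TORCH_DIST_INIT_BARRIER"])
```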
My questions are:
(1) With the global barrier removed, is process group initialization in PyTorch still slow? Does the optimization described in the paper still bring significant benefits?
(2) Will the source code for this part be made available in the future?
I really appreciate your awesome work. Looking forward to your reply.