Describe the bug
Should DeepSpeedEngine.save_checkpoint be run only on the main process (rank=0)?
In the PyTorch DDP tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html), the model is saved only from the main process during distributed training, but save_checkpoint does not seem to need to be restricted to the main process. Does anyone know why?
If save_checkpoint is called only under the main process in Hugging Face Diffusers, the save stage hangs.
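For reference, a minimal sketch of the two calling patterns described above. This is not the actual Diffusers training code; the toy model, DeepSpeed config, checkpoint path, and tag are placeholder assumptions used only to illustrate the question.

```python
# Minimal repro sketch (assumed setup, launched with e.g. `deepspeed --num_gpus=2 repro.py`).
# The model and config are placeholders, not the Diffusers training script.
import torch
import deepspeed

model = torch.nn.Linear(16, 16)
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# deepspeed.initialize wraps the model in a DeepSpeedEngine.
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Pattern that works: every rank calls save_checkpoint.
engine.save_checkpoint("checkpoints", tag="step_0")

# Pattern this issue asks about: only rank 0 calls save_checkpoint,
# and the save stage then hangs.
# if torch.distributed.get_rank() == 0:
#     engine.save_checkpoint("checkpoints", tag="step_0")
```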
System info (please complete the following information):
Ubuntu 20.04
Nvidia RTX 3090
CUDA Version: 11.7
Torch: 1.13.1
Diffusers: 0.15.0.dev0
deepspeed: 0.8.1
xformers: 0.0.17.dev466
accelerate: 0.16.0