Should DeepSpeedEngine.save_checkpoint be called only on the main process? #2993

**Describe the bug**
Should `DeepSpeedEngine.save_checkpoint` be called only on the main process (rank 0)?

The PyTorch DDP tutorial (`https://pytorch.org/tutorials/intermediate/ddp_tutorial.html`) saves the model only on the main process when training in distributed mode, but it seems that `save_checkpoint` does not need to be restricted to the main process. Does anyone know why?

When `save_checkpoint` is called only on the main process in Hugging Face diffusers, the save stage hangs.
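The hang described above is consistent with `save_checkpoint` being a collective call: the DeepSpeed documentation for `save_checkpoint` notes that every process must call it, not just rank 0, because each rank saves its own state and the call synchronizes with the other ranks. The sketch below only simulates that behavior with threads and a barrier. It does not use DeepSpeed, and `fake_save_checkpoint`, `run`, and `WORLD_SIZE` are hypothetical names introduced for illustration.

```python
import threading

WORLD_SIZE = 2  # hypothetical number of ranks, simulated here with threads


def run(guard_with_main_process, timeout=0.5):
    """Simulate WORLD_SIZE ranks and return each rank's outcome as a list."""
    barrier = threading.Barrier(WORLD_SIZE)
    results = [""] * WORLD_SIZE

    def fake_save_checkpoint(rank):
        # Stand-in for a save routine that synchronizes all ranks internally.
        # A real collective would block forever; the timeout makes the
        # simulated hang observable instead.
        try:
            barrier.wait(timeout=timeout)
            results[rank] = "saved"
        except threading.BrokenBarrierError:
            results[rank] = "hung"

    def worker(rank):
        # DDP-style guard: only rank 0 enters the save routine.
        if guard_with_main_process and rank != 0:
            results[rank] = "skipped"
            return
        fake_save_checkpoint(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print("guarded by rank 0 check:", run(guard_with_main_process=True))
    print("called on every rank:   ", run(guard_with_main_process=False))
```

In the guarded case rank 0 waits for a peer that never arrives, which mirrors the reported hang. Under that assumption, the likely fix in a real training script is to call `save_checkpoint` on all ranks and keep only rank-0-specific work (for example, logging) behind the main-process check.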

**System info**
Ubuntu 20.04
Nvidia RTX 3090
CUDA Version: 11.7
Torch: 1.13.1
Diffusers: 0.15.0.dev0
deepspeed: 0.8.1
xformers: 0.0.17.dev466
accelerate: 0.16.0
