should DeepSpeedEngine.save_checkpoint be only under main_process #2993

@better629

Describe the bug
Should DeepSpeedEngine.save_checkpoint be run only on the main process (rank 0)?

The PyTorch DDP tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) saves the model only from the main process in distributed training, but it seems that save_checkpoint does not need to be guarded by a main_process check. Does anyone know why?

If save_checkpoint is called only under main_process in Hugging Face diffusers, the save stage hangs.
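
For context, the DeepSpeed documentation for save_checkpoint states that all processes must call it, not just rank 0, because each process saves its own master weights and optimizer/scheduler state, and a rank-0-only call waits forever on the other ranks. That would match the hang described above. Below is a minimal sketch of the calling pattern, not the diffusers or DeepSpeed implementation: `save_training_state`, `main_process_hook`, and the `is_main_process` flag are hypothetical names (with accelerate the flag would presumably come from `accelerator.is_main_process`), and `engine` is assumed to be an already initialized DeepSpeedEngine.

```python
from typing import Any, Callable, Optional


def save_training_state(
    engine: Any,
    save_dir: str,
    tag: str,
    is_main_process: bool,
    main_process_hook: Optional[Callable[[], None]] = None,
) -> None:
    """Hypothetical helper: checkpoint on every rank, extras on rank 0 only."""
    # Called on EVERY rank. Assumption: `engine` is a DeepSpeedEngine, whose
    # save_checkpoint synchronizes across processes and lets each rank write
    # its own shard of the training state.
    engine.save_checkpoint(save_dir, tag=tag)

    # Only rank-0-specific side work (logging, copying configs, and so on)
    # is guarded by the main-process check.
    if is_main_process and main_process_hook is not None:
        main_process_hook()
```

In a training loop this would be called unconditionally, for example `save_training_state(engine, "checkpoints", f"step_{step}", is_main_process)`, rather than inside an `if is_main_process:` block.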

System info (please complete the following information):
Ubuntu 20.04
NVIDIA RTX 3090
CUDA Version: 11.7
Torch: 1.13.1
Diffusers: 0.15.0.dev0
deepspeed: 0.8.1
xformers: 0.0.17.dev466
accelerate: 0.16.0

Labels: bug, training