DeepSpeed checkpoint performance improvements #312

@adammoody

Description

I have a PR open on Microsoft's DeepSpeed page that parallelizes the task of writing per-layer checkpoint files across data parallel instances:

deepspeedai/DeepSpeed#1419

On my system, I found that this reduces checkpoint cost: much of the time seems to be spent processing data structures in torch.save() rather than actually writing the bytes to disk. I'm curious whether the bigscience training runs might benefit (or would have benefitted) from a change like this. I also have a follow-on PR that improves checkpointing further, but it requires this first PR.
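For reference, here is a minimal sketch of the idea (not the PR's actual code): round-robin the per-layer torch.save() calls across data-parallel ranks so the serialization cost is spread out instead of falling on a single rank. The helper name, file naming, and process-group assumptions below are illustrative only.

```python
import torch
import torch.distributed as dist

def save_layer_checkpoints(layers, ckpt_dir):
    """Hypothetical helper: each data-parallel rank writes a disjoint
    subset of the per-layer checkpoint files.

    layers: list of nn.Module, one entry per locally held layer.
    Assumes torch.distributed is already initialized and that the
    default process group spans the data-parallel replicas.
    """
    dp_rank = dist.get_rank()
    dp_world = dist.get_world_size()

    for idx, layer in enumerate(layers):
        # Round-robin assignment: rank r writes layers r, r + dp_world, ...
        if idx % dp_world != dp_rank:
            continue
        state = layer.state_dict()
        # File naming here is an assumption, not DeepSpeed's exact scheme.
        torch.save(state, f"{ckpt_dir}/layer_{idx:02d}-model_states.pt")

    # Ensure every rank has finished writing before the checkpoint
    # is considered complete.
    dist.barrier()
```

Since every rank already holds a full replica of the model within its data-parallel group, spreading the torch.save() calls this way trades no extra communication for a roughly world-size reduction in per-rank serialization work.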

I have updated the PR a few times over the past 9 months to keep up with changes on the main branch, but it has merge conflicts again. I've lost track of the precise version of DeepSpeed being used in the bigscience training runs.

Would someone be willing to try this out?

First of all, do the changes in this PR apply cleanly to the bigscience DeepSpeed version?

If not, would you please point me to the version that is being used?

Thanks.
