DeepSpeed checkpoint performance improvements #312

@adammoody

Description

I have a PR open on Microsoft's DeepSpeed page that parallelizes the task of writing per-layer checkpoint files across data parallel instances:

deepspeedai/DeepSpeed#1419

On my system, I found that this reduces checkpoint cost: much of the time seems to be spent processing data structures in torch.save() rather than actually writing the bytes to disk. I'm curious whether the bigscience training runs might benefit (or would have benefitted) from a change like this. I also have a follow-on PR that improves checkpointing further, but it requires this first PR.
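For reference, here is a minimal sketch of the idea (not the PR's actual code): round-robin the per-layer torch.save() calls across data-parallel ranks so the serialization cost is spread out instead of falling on a single rank. The helper name, file naming, and process-group assumptions below are illustrative only.

```python
import torch
import torch.distributed as dist

def save_layer_checkpoints(layers, ckpt_dir):
    """Hypothetical helper: each data-parallel rank writes a disjoint
    subset of the per-layer checkpoint files.

    layers: list of nn.Module, one entry per locally held layer.
    Assumes torch.distributed is already initialized and that the
    default process group spans the data-parallel replicas.
    """
    dp_rank = dist.get_rank()
    dp_world = dist.get_world_size()

    for idx, layer in enumerate(layers):
        # Round-robin assignment: rank r writes layers r, r + dp_world, ...
        if idx % dp_world != dp_rank:
            continue
        state = layer.state_dict()
        # File naming here is an assumption, not DeepSpeed's exact scheme.
        torch.save(state, f"{ckpt_dir}/layer_{idx:02d}-model_states.pt")

    # Ensure every rank has finished writing before the checkpoint
    # is considered complete.
    dist.barrier()
```

Since every rank already holds a full replica of the model within its data-parallel group, spreading the torch.save() calls this way trades no extra communication for a roughly world-size reduction in per-rank serialization work.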

I have updated the PR a few times over the past 9 months to keep up with changes on the main branch, but it has merge conflicts again. I've lost track of the precise version of DeepSpeed being used in the bigscience training runs.

Would someone be willing to try this out?

First of all, do the changes in this PR apply cleanly to the bigscience DeepSpeed version?

If not, would you please point me to the version that is being used?

Thanks.
