Fix that ds_secondary_tensor may be dirty when loading the model or zero checkpoint for zero++. #7707

zhengchenyu · 2025-11-27T02:42:37Z

ds_secondary_tensor may be dirty (stale) after loading the model or loading a zero checkpoint when zero++ is enabled.

1 Loading model

My task is transformers SFT. In the transformers code, initialization is done using code like the following:

with deepspeed.zero.Init():
    model = xxx

After this, each param is already a ds tensor, meaning both ds_tensor and ds_secondary_tensor exist. Then load_model is called to load the weights into the model.

with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
    if torch.distributed.get_rank() == 0:
        module._load_from_state_dict(*args)

In GatheredParameters.__exit__, params[0].partition is called with has_been_updated set to True, indicating that the partitioned data must be refreshed. However, has_been_updated is not passed on to _partition_param_sec, so the secondary partition is not refreshed and ds_secondary_tensor is left dirty.
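The mechanism above can be sketched with a toy model. This is a minimal illustration and not DeepSpeed code: `ToyParam`, `partition_primary`, `partition_secondary_buggy`, `partition_secondary_fixed`, and `load_weights` are hypothetical names, and the lists `primary` and `secondary` only stand in for ds_tensor and ds_secondary_tensor. The actual change in this PR may differ in detail.

```python
class ToyParam:
    """Toy stand-in for a partitioned parameter (hypothetical, not DeepSpeed)."""

    def __init__(self, values):
        self.full = list(values)       # gathered (full) parameter data
        self.primary = list(values)    # plays the role of ds_tensor
        self.secondary = list(values)  # plays the role of ds_secondary_tensor


def partition_primary(param, has_been_updated=False):
    # The primary copy is refreshed only when the caller reports an update.
    if has_been_updated:
        param.primary = list(param.full)


def partition_secondary_buggy(param):
    # No has_been_updated argument: the existing secondary copy is assumed valid.
    pass


def partition_secondary_fixed(param, has_been_updated=False):
    # With the flag forwarded, the secondary copy is refreshed as well.
    if has_been_updated:
        param.secondary = list(param.full)


def load_weights(param, new_values, fixed):
    # Mimics loading new weights inside a gather context, then re-partitioning.
    param.full = list(new_values)
    partition_primary(param, has_been_updated=True)
    if fixed:
        partition_secondary_fixed(param, has_been_updated=True)
    else:
        partition_secondary_buggy(param)
    return param


buggy = load_weights(ToyParam([0.0, 0.0]), [1.0, 2.0], fixed=False)
fixed = load_weights(ToyParam([0.0, 0.0]), [1.0, 2.0], fixed=True)
```

In the buggy path the primary copy holds the loaded weights while the secondary copy still holds the initial values. In the fixed path both copies match.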

2 Loading zero checkpoint

The zero checkpoint is loaded into fp16_partitioned_groups_flat, which updates param.ds_tensor. However, param.ds_secondary_tensor is not updated, and the next allgather uses the dirty param.ds_secondary_tensor.

A dirty ds_secondary_tensor can lead to abnormal loss. Once invalidate_secondary_tensor is called in _post_step, the loss returns to normal. This is why the loss anomaly only occurs during the first few steps.
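A rough sketch of why the anomaly disappears after the first step, again using a toy model rather than DeepSpeed code. `ToyShardedParam`, `load_checkpoint`, and `post_step_invalidate` are hypothetical names. The sketch assumes that a gather prefers the secondary copy when one exists and falls back to the primary copy otherwise, and that invalidation simply drops the secondary copy.

```python
class ToyShardedParam:
    """Toy stand-in for a parameter with a primary and a secondary shard."""

    def __init__(self, values):
        self.primary = list(values)    # plays the role of ds_tensor
        self.secondary = list(values)  # plays the role of ds_secondary_tensor

    def gather(self):
        # Assumption: the secondary copy is preferred whenever it exists.
        if self.secondary is not None:
            return list(self.secondary)
        return list(self.primary)


def load_checkpoint(param, values):
    # Only the primary copy is updated; the secondary copy is left untouched.
    param.primary = list(values)


def post_step_invalidate(param):
    # Dropping the secondary copy forces the next gather to use the primary.
    param.secondary = None


p = ToyShardedParam([0.0])
load_checkpoint(p, [3.0])
before = p.gather()   # stale data from the secondary copy
post_step_invalidate(p)
after = p.gather()    # checkpoint data from the primary copy
```

Here `before` still reflects the pre-checkpoint values, while `after` reflects the loaded checkpoint, which matches the observation that only the beginning steps are affected.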

Related issue: #7606

zhengchenyu · 2025-11-27T03:15:02Z

This picture shows that the bug has been fixed. The experimental conditions for the fixed run are exactly the same as those for the buggy run in #7606. The only difference is that the code under test includes this PR.


sfc-gh-truwase · 2025-11-27T20:48:26Z

@zhengchenyu thanks for the PR. We are taking a look.

zhengchenyu · 2025-11-28T08:31:40Z

The unit test test_compile_zero.py::TestDeepCompile::test[True-1-dtype0] failed. This seems to be unrelated to this PR. Despite multiple tests on my own server, I am still unable to reproduce the failure...

zhengchenyu requested review from loadams, tjruwase and tohtana as code owners November 27, 2025 02:42

zhengchenyu closed this Nov 27, 2025

zhengchenyu deleted the issue-7606 branch November 27, 2025 10:18

Fix that ds_secondary_tensor may be dirty when loading the model or z…

9214092

…ero checkpoint for zero++. Signed-off-by: zhengchenyu <[email protected]>

zhengchenyu reopened this Nov 27, 2025

zhengchenyu force-pushed the issue-7606 branch from 7f90ee5 to 9214092 Compare November 27, 2025 10:24
