@zhengchenyu
Contributor

With ZeRO++ enabled, ds_secondary_tensor can be left dirty (holding stale data) after model loading or after loading a ZeRO checkpoint. There are two cases:

  • 1 Loading model

My workload is SFT with transformers. The transformers code initializes the model under deepspeed.zero.Init, roughly as follows:

with deepspeed.zero.Init():
    model = xxx

After this, each param is already a ds tensor, so both ds_tensor and ds_secondary_tensor exist. load_model is then called to load the weights into the model:

with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
    if torch.distributed.get_rank() == 0:
        module._load_from_state_dict(*args)

In GatheredParameters.__exit__, params[0].partition is called with has_been_updated set to True, indicating that the partitioned data must be refreshed. However, has_been_updated is not passed on to _partition_param_sec, so ds_secondary_tensor is not refreshed and is left dirty.
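The effect of dropping the flag can be illustrated with a toy model. This is only a sketch under stated assumptions: plain Python lists stand in for tensors, and ToyParam, partition, load_weights, and forward_flag are hypothetical names, not DeepSpeed APIs. It assumes that a partition step re-copies data only when no shard exists yet or when the caller reports an update.

```python
# Toy model only: lists stand in for tensors; none of these names are DeepSpeed APIs.
class ToyParam:
    def __init__(self, values):
        self.full = list(values)   # gathered (full) parameter
        self.primary = None        # stands in for ds_tensor
        self.secondary = None      # stands in for ds_secondary_tensor

    def partition_primary(self, has_been_updated=False):
        # Re-copy when there is no shard yet or the caller reports an update.
        if self.primary is None or has_been_updated:
            self.primary = list(self.full)

    def partition_secondary(self, has_been_updated=False):
        # Same rule for the secondary shard (an assumption of this sketch).
        if self.secondary is None or has_been_updated:
            self.secondary = list(self.full)


def partition(param, has_been_updated, forward_flag):
    param.partition_primary(has_been_updated=has_been_updated)
    # forward_flag=False mimics the behaviour described above: the flag is dropped.
    sec_flag = has_been_updated if forward_flag else False
    param.partition_secondary(has_been_updated=sec_flag)


def load_weights(forward_flag):
    p = ToyParam([0.0, 0.0])
    partition(p, has_been_updated=False, forward_flag=forward_flag)  # initial partition
    p.full = [1.0, 2.0]                                              # weights loaded on the gathered param
    partition(p, has_been_updated=True, forward_flag=forward_flag)   # re-partition on exit
    return p.primary, p.secondary


buggy = load_weights(forward_flag=False)  # secondary keeps the initial values
fixed = load_weights(forward_flag=True)   # secondary matches the loaded weights
print("flag dropped:  ", buggy)
print("flag forwarded:", fixed)
```

In this toy, forwarding the flag is enough to keep both shards in sync; whether the actual change takes exactly this form is not stated here.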

  • 2 Loading zero checkpoint

The ZeRO checkpoint is loaded into fp16_partitioned_groups_flat, so param.ds_tensor is updated. The data in param.ds_secondary_tensor is not updated, yet the next allgather uses the dirty param.ds_secondary_tensor.

A dirty ds_secondary_tensor can lead to abnormal loss. Once invalidate_secondary_tensor is called in _post_step, the loss returns to normal, which is why the loss anomaly appears only in the first steps.
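The checkpoint case, and why invalidation restores correct values, can be sketched the same way. Again this is a toy under assumptions: ToyShardedParam and its methods are hypothetical, lists stand in for tensors, and the gather is assumed to prefer a valid secondary copy and to rebuild it from the primary shard once it has been invalidated.

```python
# Toy model only: not DeepSpeed code. The secondary copy acts as a cache of the primary shard.
class ToyShardedParam:
    def __init__(self, shard):
        self.primary = list(shard)    # stands in for ds_tensor
        self.secondary = list(shard)  # stands in for ds_secondary_tensor; None means invalid

    def load_checkpoint(self, shard):
        # Only the primary shard is refreshed, as in the checkpoint case above.
        self.primary = list(shard)

    def invalidate_secondary(self):
        self.secondary = None

    def all_gather(self):
        # Assumption: rebuild the secondary copy from the primary when it is invalid,
        # otherwise read the secondary copy directly.
        if self.secondary is None:
            self.secondary = list(self.primary)
        return list(self.secondary)


p = ToyShardedParam([0.0, 0.0])
p.load_checkpoint([3.0, 4.0])
stale = p.all_gather()      # reads the old secondary copy
p.invalidate_secondary()    # what happens after the first optimizer step in this toy
fresh = p.all_gather()      # rebuilt from the updated primary shard
print("before invalidation:", stale)
print("after invalidation: ", fresh)
```

In this toy, invalidating the secondary copy right after the checkpoint load would remove the stale read on the first step as well.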

Related issue: #7606

@zhengchenyu
Contributor Author

This picture shows that the bug has been fixed. The experimental conditions for the fixed run are exactly the same as those for the bug in #7606; the only difference is that the code being run has this PR applied.

[Screenshot 2025-11-27 11 10 55]

@zhengchenyu zhengchenyu deleted the issue-7606 branch November 27, 2025 10:18
…ero checkpoint for zero++.

Signed-off-by: zhengchenyu <[email protected]>
@sfc-gh-truwase
Collaborator

@zhengchenyu thanks for the PR. We are taking a look.

@zhengchenyu
Contributor Author

zhengchenyu commented Nov 28, 2025

The unit test test_compile_zero.py::TestDeepCompile::test[True-1-dtype0] failed. This seems to be unrelated to this PR. Despite multiple tests on my own server, I am still unable to reproduce the failure...
