Fix `ds_secondary_tensor` possibly being dirty when loading the model or ZeRO checkpoint for ZeRO++ #7707
+103
−3

`ds_secondary_tensor` may be dirty during model loading or ZeRO checkpointing for ZeRO++. My task is transformers SFT. In the transformers code, initialization is done using code like the following:
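(A sketch only, not the snippet from the original report.) Under ZeRO-3/ZeRO++, transformers constructs the model inside DeepSpeed's init context, so every parameter is partitioned at creation time; `ds_config`, `MyModel`, and `state_dict` below are hypothetical placeholders:

```python
import torch
import deepspeed

# Hypothetical sketch of the transformers-style flow: building the model
# under deepspeed.zero.Init partitions each parameter on creation, so
# ds_tensor (and, with ZeRO++, ds_secondary_tensor) exist afterwards.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = MyModel()

# Reloading weights later gathers each parameter, copies new data in, and
# re-partitions it when the context exits (GatheredParameters.__exit__
# calls params[0].partition with has_been_updated=True).
for name, param in model.named_parameters():
    with deepspeed.zero.GatheredParameters([param], modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            param.data.copy_(state_dict[name])
```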
After this, `param` is already a DS tensor, meaning both `ds_tensor` and `ds_secondary_tensor` exist. Then `load_model` is called to reload the model.

In `GatheredParameters.__exit__`, `params[0].partition` is called with `has_been_updated` set to `True`, indicating that the data needs to be updated. However, `_partition_param_sec` was not passed `has_been_updated`, which leaves `ds_secondary_tensor` dirty.

The ZeRO checkpoint is loaded into `fp16_partitioned_groups_flat`, so `param.ds_tensor` is updated while the data in `param.ds_secondary_tensor` is not. Yet the next `allgather` will use the dirty `param.ds_secondary_tensor`.
A dirty `ds_secondary_tensor` can lead to abnormal loss. Once `invalidate_secondary_tensor` is called in `_post_step`, the loss returns to normal, which is why the loss anomaly only appears during the first steps.

Related issue: #7606
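The whole failure mode can be illustrated with a small, self-contained Python sketch. Names such as `partition`, `_partition_param_sec`, `allgather`, and `invalidate_secondary_tensor` mirror the DeepSpeed ones, but this is a toy model of the flow, not the real implementation:

```python
class ToyParam:
    """Stand-in for a ZeRO++ parameter with a primary partition
    (ds_tensor) and a secondary partition (ds_secondary_tensor)."""
    def __init__(self, data):
        self.data = list(data)
        self.ds_tensor = list(data)
        self.ds_secondary_tensor = list(data)

def _partition_param_sec(param, has_been_updated=False):
    # Refresh the secondary partition only when told the data changed.
    if has_been_updated:
        param.ds_secondary_tensor = list(param.data)

def partition(param, has_been_updated, forward_flag):
    # Mirrors params[0].partition(has_been_updated=True) in
    # GatheredParameters.__exit__: the primary partition is refreshed...
    if has_been_updated:
        param.ds_tensor = list(param.data)
    # ...but the buggy path does not forward the flag to the secondary
    # partitioning step (forward_flag=False models the pre-fix behavior).
    _partition_param_sec(param, has_been_updated and forward_flag)

def allgather(param):
    # Toy allgather: ZeRO++ prefers the secondary partition when present.
    if param.ds_secondary_tensor is not None:
        return list(param.ds_secondary_tensor)
    return list(param.ds_tensor)

def invalidate_secondary_tensor(param):
    # Mirrors the _post_step cleanup: drop the stale secondary copy so
    # subsequent gathers fall back to the fresh primary partition.
    param.ds_secondary_tensor = None

p = ToyParam([1.0, 2.0])
p.data = [3.0, 4.0]            # weights reloaded under GatheredParameters
partition(p, has_been_updated=True, forward_flag=False)  # buggy path
assert p.ds_tensor == [3.0, 4.0]       # primary partition is fresh
assert allgather(p) == [1.0, 2.0]      # dirty secondary used -> abnormal loss
invalidate_secondary_tensor(p)
assert allgather(p) == [3.0, 4.0]      # after invalidation, loss recovers
```

Forwarding the flag (`forward_flag=True`) makes `_partition_param_sec` refresh the secondary partition in the same step, which is the direction of the fix.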