[DSV3][CP][AC] 'nan' in grad after attention op in backward when CP and full AC are used at the same time #1522

@XilunWu

Bug description

I'm running the DeepSeek v3 16B model with the following options:

  • context_parallel_degree in [1, 2]
  • activation_checkpoint_mode in [none, selective, full]

When context_parallel_degree is 1 (only FSDP is used), the backward pass is fine and no 'nan' appears in the gradients. However, when context_parallel_degree is 2 (FSDP + CP), the backward pass produces 'nan' in the gradients while executing the SDPA backward op, and this only happens when activation_checkpoint_mode is full.
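
For context, the failing composition can be sketched outside of torchtitan with PyTorch's experimental context-parallel API plus non-reentrant activation checkpointing. This is only an illustrative skeleton under assumptions (the module, shapes, and launch via `torchrun --nproc_per_node=2` are made up, and the experimental `context_parallel` usage may differ across torch versions); it is not the torchtitan code path:

```python
# Minimal sketch of the failing composition: full activation checkpointing
# wrapped around an SDPA attention block, executed inside PyTorch's
# experimental context-parallel region. All names here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.experimental import context_parallel
from torch.utils.checkpoint import checkpoint


class Attn(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        # SDPA: the op whose backward shows 'nan' grads in the failing config.
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(o.transpose(1, 2).reshape(b, s, d))


def main():
    torch.distributed.init_process_group("nccl")
    rank = torch.distributed.get_rank()
    torch.cuda.set_device(rank)
    cp_mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))

    model = Attn().cuda()
    x = torch.randn(2, 256, 64, device="cuda")

    # context_parallel shards `x` along its sequence dim and swaps SDPA for the
    # ring-attention variant while the context is active.
    with context_parallel(cp_mesh, buffers=[x], buffer_seq_dims=[1]):
        # "full" AC: the whole block is recomputed during backward, so SDPA runs
        # again under the context-parallel dispatch while grads are computed.
        out = checkpoint(model, x, use_reentrant=False)
        out.sum().backward()

    has_nan = any(torch.isnan(p.grad).any() for p in model.parameters())
    print(f"rank {rank}: nan in grads = {has_nan}")
    torch.distributed.destroy_process_group()


if __name__ == "__main__":
    main()
```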

The command to reproduce the 'nan' issue is:
NGPU=8 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.context-parallel-degree 2 --activation-checkpoint.mode full

The 'nan' in the gradients disappears if the activation checkpoint mode is changed to "none" or "selective".
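
To narrow down which gradient first becomes non-finite, one possible debugging aid (not part of torchtitan; the helper name below is made up) is to attach a post-accumulate-grad hook to every parameter, using the standard `Tensor.register_post_accumulate_grad_hook` API:

```python
# Hedged debugging sketch: print the name of any parameter whose accumulated
# gradient contains non-finite values during backward. Attach once after the
# model is built, before the training loop.
import torch


def install_nonfinite_grad_check(model: torch.nn.Module) -> None:
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue

        def _check(param, name=name):
            # Fires after the gradient for this parameter has been accumulated.
            if not torch.isfinite(param.grad).all():
                print(f"non-finite grad detected in {name}")

        p.register_post_accumulate_grad_hook(_check)
```

With this installed, the offending parameter is reported as soon as backward accumulates a non-finite gradient, which may help localize the SDPA backward call under CP + full AC.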

Versions

torchtitan==ed288bc
torch==2.9.0a0+git33a1996
