[DSV3][CP][AC] 'nan' in grad after attention op in backward when CP and full AC are used at the same time #1522

@XilunWu

Bug description

I'm running the DeepSeek v3 16B model with the following options:

  • context_parallel_degree in [1, 2]
  • activation_checkpoint_mode in [none, selective, full]

When context_parallel_degree is 1 (only FSDP is used), the backward pass is fine and no 'nan' appears in the gradients. However, when context_parallel_degree is 2 (FSDP + CP), the backward pass produces 'nan' in the gradients while executing the SDPA backward op, and this only happens when activation_checkpoint_mode is full.
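
For context, the failing composition can be sketched outside of torchtitan with PyTorch's experimental context-parallel API plus non-reentrant activation checkpointing. This is only an illustrative skeleton under assumptions (the module, shapes, and launch via `torchrun --nproc_per_node=2` are made up, and the experimental `context_parallel` usage may differ across torch versions); it is not the torchtitan code path:

```python
# Minimal sketch of the failing composition: full activation checkpointing
# wrapped around an SDPA attention block, executed inside PyTorch's
# experimental context-parallel region. All names here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.experimental import context_parallel
from torch.utils.checkpoint import checkpoint


class Attn(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        # SDPA: the op whose backward shows 'nan' grads in the failing config.
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(o.transpose(1, 2).reshape(b, s, d))


def main():
    torch.distributed.init_process_group("nccl")
    rank = torch.distributed.get_rank()
    torch.cuda.set_device(rank)
    cp_mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))

    model = Attn().cuda()
    x = torch.randn(2, 256, 64, device="cuda")

    # context_parallel shards `x` along its sequence dim and swaps SDPA for the
    # ring-attention variant while the context is active.
    with context_parallel(cp_mesh, buffers=[x], buffer_seq_dims=[1]):
        # "full" AC: the whole block is recomputed during backward, so SDPA runs
        # again under the context-parallel dispatch while grads are computed.
        out = checkpoint(model, x, use_reentrant=False)
        out.sum().backward()

    has_nan = any(torch.isnan(p.grad).any() for p in model.parameters())
    print(f"rank {rank}: nan in grads = {has_nan}")
    torch.distributed.destroy_process_group()


if __name__ == "__main__":
    main()
```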

The command to reproduce the 'nan' issue is:
NGPU=8 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.context-parallel-degree 2 --activation-checkpoint.mode full

The 'nan' in the gradients disappears if the activation checkpoint mode is changed to "none" or "selective".
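
To narrow down which gradient first becomes non-finite, one possible debugging aid (not part of torchtitan; the helper name below is made up) is to attach a post-accumulate-grad hook to every parameter, using the standard `Tensor.register_post_accumulate_grad_hook` API:

```python
# Hedged debugging sketch: print the name of any parameter whose accumulated
# gradient contains non-finite values during backward. Attach once after the
# model is built, before the training loop.
import torch


def install_nonfinite_grad_check(model: torch.nn.Module) -> None:
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue

        def _check(param, name=name):
            # Fires after the gradient for this parameter has been accumulated.
            if not torch.isfinite(param.grad).all():
                print(f"non-finite grad detected in {name}")

        p.register_post_accumulate_grad_hook(_check)
```

With this installed, the offending parameter is reported as soon as backward accumulates a non-finite gradient, which may help localize the SDPA backward call under CP + full AC.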

Versions

torchtitan==ed288bc
torch==2.9.0a0+git33a1996
