I'm running the DeepSeek V3 16B model (the deepseek_v3_16b.toml config) with the following options:
context_parallel_degree in [1, 2]
activation_checkpoint_mode in [none, selective, full]
When context_parallel_degree is 1 (FSDP only), the backward pass is fine and no NaN appears in the gradients. When context_parallel_degree is 2 (FSDP + CP), the backward pass produces NaN gradients while executing the SDPA backward op, but only when activation_checkpoint_mode is full.
The command to reproduce the 'nan' issue is: NGPU=8 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.context-parallel-degree 2 --activation-checkpoint.mode full
The NaN gradients disappear if I change the activation checkpoint mode to "none" or "selective".
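
For anyone who wants to re-check the whole matrix rather than only the failing case, here is a minimal sketch of a sweep over the six combinations above. It reuses only the command, environment variables, and flags from this report; the helper names (`build_runs`, `format_run`, `main`) are mine and not part of torchtitan, and I'm assuming that passing `--parallelism.context-parallel-degree 1` explicitly behaves the same as leaving it unset. By default it only prints the commands.

```python
import itertools
import os
import shlex
import subprocess

# Values taken from the report above.
CONFIG_FILE = "./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml"
CP_DEGREES = [1, 2]
AC_MODES = ["none", "selective", "full"]


def build_runs():
    """Return one (env_overrides, argv) pair per configuration in the matrix."""
    runs = []
    for cp, ac in itertools.product(CP_DEGREES, AC_MODES):
        env = {"NGPU": "8", "CONFIG_FILE": CONFIG_FILE}
        argv = [
            "./run_train.sh",
            "--parallelism.context-parallel-degree", str(cp),
            "--activation-checkpoint.mode", ac,
        ]
        runs.append((env, argv))
    return runs


def format_run(env, argv):
    """Render a run as a single shell command line for logging."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {shlex.join(argv)}"


def main(execute=False):
    """Print every command; launch each one only when execute=True."""
    for env, argv in build_runs():
        print(format_run(env, argv))
        if execute:
            # check=False so that one failing run does not stop the sweep.
            subprocess.run(argv, env={**os.environ, **env}, check=False)


if __name__ == "__main__":
    main(execute=False)
```

Run it from the torchtitan repository root and call `main(execute=True)` to actually launch the jobs. Based on what I see, only the last combination (context parallel degree 2 with mode full) should show NaN gradients.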