Enable bf16 check_grad_overflow by default (matching fp16) by yongzhe-wang · Pull Request #8035 · deepspeedai/DeepSpeed

yongzhe-wang · 2026-05-29T03:38:21Z

Summary

Flip DeepSpeedBF16Config.check_grad_overflow default from False to True, so bf16 users get the same gradient-overflow protection that fp16 users already get by default.

Motivation

The bf16 documentation states bf16 "does not require loss scaling" (deepspeed.ai/docs/config-json/), but this overstates the safety guarantee for the bf16 + ZeRO-2 (non-offload) partition-flat gradient accumulation path. We reproduced a deterministic catastrophic NaN under a small set of training conditions:

ZeRO-2 (non-offload) + bf16
Mixture-of-Transformers (modality-specific transformer branches)
Heterogeneous per-sample loss masks (e.g. 50% action-invalid samples in robotics VLA training)

Under these conditions, a single bf16 element in engine.optimizer.averaged_gradients[i] overflows to +inf. The downstream Adam.step then computes inf / sqrt(inf) = NaN in a fused kernel, which simultaneously corrupts the partition slice's exp_avg, exp_avg_sq, and fp32 master weights. The next forward pass propagates NaN through every layer; the training run is dead with no useful diagnostic. Reproduced consistently in DeepSpeed 0.16.9 - 0.17.1 at step ~22 with our internal repro.
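The failure arithmetic above can be illustrated with a scalar sketch. This is not DeepSpeed's fused kernel: `adam_step` is a hypothetical helper, bias correction is omitted, and the hyperparameters are ordinary Adam defaults chosen for illustration only.

```python
import math

def adam_step(param, grad, exp_avg, exp_avg_sq,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update without bias correction (illustrative only)."""
    # First and second moment estimates; a +inf gradient makes both +inf.
    exp_avg = beta1 * exp_avg + (1.0 - beta1) * grad
    exp_avg_sq = beta2 * exp_avg_sq + (1.0 - beta2) * grad * grad
    # inf / (sqrt(inf) + eps) evaluates to inf / inf, which is NaN.
    update = exp_avg / (math.sqrt(exp_avg_sq) + eps)
    param = param - lr * update
    return param, exp_avg, exp_avg_sq

# A finite gradient leaves every piece of state finite.
healthy = adam_step(param=0.5, grad=0.1, exp_avg=0.0, exp_avg_sq=0.0)

# A single +inf gradient element poisons the weight and both moments.
poisoned = adam_step(param=0.5, grad=float("inf"), exp_avg=0.0, exp_avg_sq=0.0)
```

In this sketch the weight becomes NaN on the first bad step while both moments become +inf, so none of the optimizer state for that element is recoverable afterwards.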

The infrastructure to detect and skip such steps was correctly added by #6976 (check_grad_overflow option, DeepSpeedZeroOptimizer.check_overflow method, and step-skip logic at stage_1_and_2.step() lines ~2128-2143). However, the default was set to False for bf16, so users who hit this condition do not receive the protection.

Change

Single line: check_grad_overflow: bool = False -> check_grad_overflow: bool = True in DeepSpeedBF16Config. Updated docstring + bf16 example block accordingly.

Backward compatibility

Users who have benchmarked the check as too expensive AND have separately confirmed their bf16 path cannot overflow can opt out by setting:

```json
"bf16": {
"enabled": true,
"check_grad_overflow": false
}
```

The runtime cost is one isfinite-style scan over the gradient partition per optimizer step (already implemented in DeepSpeedZeroOptimizer.check_overflow); typically under 1% of step wallclock.
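As a rough picture of what the check buys, here is a minimal sketch of an isfinite scan followed by a skip-step decision. It assumes plain Python lists in place of gradient partitions; `has_overflow` and `guarded_step` are hypothetical names for this example and do not reproduce DeepSpeedZeroOptimizer.check_overflow.

```python
import math

def has_overflow(grads):
    """Return True if any gradient element is inf or NaN."""
    return any(not math.isfinite(g) for g in grads)

def guarded_step(params, grads, lr=0.1, check_grad_overflow=True):
    """Apply a plain SGD update, skipping it when the scan finds inf/NaN.

    Returns (new_params, skipped).
    """
    if check_grad_overflow and has_overflow(grads):
        # Leave the parameters untouched; the caller moves on to the next batch.
        return list(params), True
    return [p - lr * g for p, g in zip(params, grads)], False

bad_grads = [0.5, float("inf")]

# With the check on, the step is skipped and the parameters survive.
protected, skipped = guarded_step([1.0, 2.0], bad_grads)

# With the check off, the non-finite gradient reaches the parameters.
unprotected, _ = guarded_step([1.0, 2.0], bad_grads, check_grad_overflow=False)
```

The scan is a single pass over the gradients, which is why its cost scales with partition size rather than with model depth.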

Test plan

Reproducer (private repo): with default False, run dies at step ~22; with True, training survives via DeepSpeed's existing skip-step path.
Existing CI should pass unchanged; this PR only changes a default value in precision_config.py.

tohtana · 2026-06-02T21:19:15Z

Thank you @yongzhe-wang,
Overall this change looks good to me, but I'm still not sure about the performance impact. Adding a synchronization point might make a noticeable difference.

What are your thoughts, @sfc-gh-truwase?

yongzhe-wang · 2026-06-09T02:27:25Z

Thanks @tohtana — good question. A few things that I think bound the sync-cost concern:

This isn't a new synchronization point for DeepSpeed — fp16 already does exactly this, unconditionally, every step. In engine.py the flag is hard-wired on for fp16:

if self.bfloat16_enabled():
    check_grad_overflow = self._config.bfloat16_config.check_grad_overflow
elif self.fp16_enabled():
    check_grad_overflow = True      # fp16 always pays this
else:
    check_grad_overflow = False

The work it gates — has_overflow(): an isfinite scan over the gradient partition, a scalar all_reduce(MAX), and one .item() — is the same code path fp16 has run by default for years. So this is a well-characterized production cost, not a new one; the PR just gives bf16 the same protection.

ZeRO-1/2 already incurs a per-step device sync regardless. In stage_1_and_2.step(), every non-overflow step calls scaled_global_norm() for gradient clipping, which does an all_reduce + .item(). The engine therefore isn't running fully async across the optimizer step anyway — the overflow check's .item() lands just before a sync that was already going to happen.
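The shape of the check described above (local isfinite scan, scalar MAX reduction across ranks, one host read) can be sketched without a distributed backend. This is a single-process simulation: the function names are hypothetical, `simulated_all_reduce_max` stands in for a real collective, and the final comparison stands in for the `.item()` read.

```python
import math

def local_overflow_flag(partition):
    """Per-rank scan: 1.0 if this rank's gradient partition holds inf/NaN, else 0.0."""
    return 1.0 if any(not math.isfinite(g) for g in partition) else 0.0

def simulated_all_reduce_max(flags):
    """Stand-in for a scalar all_reduce(MAX): every rank would receive this value."""
    return max(flags)

def global_overflow(partitions):
    """True if any rank saw a non-finite gradient, so all ranks skip together."""
    flags = [local_overflow_flag(p) for p in partitions]
    # In a real run this comparison is where the device-to-host sync happens.
    return simulated_all_reduce_max(flags) > 0.0

# Rank 1 holds the only bad element, yet the reduced flag is set for everyone.
ranks_with_nan = [[0.1, 0.2], [float("nan")], [0.3]]
ranks_clean = [[0.1, 0.2], [0.4], [0.3]]
```

Reducing a single scalar keeps the communication volume constant regardless of model size; the cost under discussion in this thread is the synchronization, not the payload.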

sfc-gh-truwase · 2026-06-23T23:21:57Z

-    Check for gradient overflows and underflows
+    check_grad_overflow: bool = True
+    """
+    Detect gradient overflow/underflow before optimizer step and skip the step


Make this terse, and move the details to the PR. Okay to leave the issue and PR links here.

sfc-gh-truwase · 2026-06-23T23:23:58Z

@yongzhe-wang, apologies for the delayed response. Thanks for this PR.

Similar to @tohtana's point, performance concerns were what kept this from being the default. But I think your arguments are solid: (1) parity with the known production cost of fp16, and (2) potentially reusing the current grad norm scans to reduce cost.

I left one minor code cleanup comment.

tohtana · 2026-06-25T04:09:58Z

@yongzhe-wang Let me know when you think this PR is ready to merge.

yongzhe-wang · 2026-06-26T07:16:49Z

@sfc-gh-truwase Rebased onto latest master — branch is up to date, approved, and DCO green. The remaining required checks are waiting on maintainer workflow approval (fork PR). Could you approve the workflow runs and merge when green? Thanks!

yongzhe-wang requested review from tjruwase and tohtana as code owners May 29, 2026 03:38

sfc-gh-truwase reviewed Jun 23, 2026

sfc-gh-truwase closed this Jun 23, 2026

sfc-gh-truwase reopened this Jun 23, 2026

sfc-gh-truwase approved these changes Jun 23, 2026

yongzhe-wang force-pushed the fix/bf16-check-grad-overflow-default-true branch from 46dcf7b to 90c4c8b Compare June 25, 2026 04:05

Enable bf16 check_grad_overflow by default (matching fp16)

d6fc7d9

Signed-off-by: Yongzhe Wang <yzwang2020@gmail.com>

yongzhe-wang force-pushed the fix/bf16-check-grad-overflow-default-true branch from 90c4c8b to d6fc7d9 Compare June 26, 2026 07:09

sfc-gh-truwase enabled auto-merge June 26, 2026 14:31

sfc-gh-truwase added this pull request to the merge queue Jun 26, 2026

Merged via the queue into deepspeedai:master with commit f0253c8 Jun 26, 2026
13 checks passed
