Enable bf16 check_grad_overflow by default (matching fp16)#8035
Thank you @yongzhe-wang. What are your thoughts, @sfc-gh-truwase? |
|
Thanks @tohtana — good question. A few things that I think bound the sync-cost concern:
    if self.bfloat16_enabled():
        check_grad_overflow = self._config.bfloat16_config.check_grad_overflow
    elif self.fp16_enabled():
        check_grad_overflow = True  # fp16 always pays this
    else:
        check_grad_overflow = False

The work it gates is has_overflow(): an isfinite scan over the gradient partition, a scalar all_reduce(MAX), and one .item(). This is the same code path fp16 has run by default for years. So this is a well-characterized production cost, not a new one; the PR just gives bf16 the same protection.
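To make the three pieces of that cost concrete, here is a minimal single-process sketch of the pattern described above (finite-check scan, MAX reduction, one scalar read). It is not DeepSpeed's implementation: the names `local_overflow_flag`, `simulated_all_reduce_max`, and the list-of-lists "ranks" are hypothetical stand-ins, and the reduction is simulated with `max` rather than a real collective.

```python
import math
from typing import List, Sequence


def local_overflow_flag(grad_partition: Sequence[float]) -> float:
    """Scan one rank's gradient partition; 1.0 if any element is inf/NaN."""
    return 0.0 if all(math.isfinite(g) for g in grad_partition) else 1.0


def simulated_all_reduce_max(flags: Sequence[float]) -> float:
    """Stand-in for all_reduce(MAX): every rank would see the largest flag."""
    return max(flags)


def has_overflow(partitions_per_rank: List[Sequence[float]]) -> bool:
    """Hypothetical helper: True if any rank saw a non-finite gradient."""
    flags = [local_overflow_flag(p) for p in partitions_per_rank]
    # The final comparison plays the role of the single scalar read (.item()).
    return simulated_all_reduce_max(flags) > 0.0


if __name__ == "__main__":
    clean = [[0.1, -2.0], [3.0, 0.5]]
    bad = [[0.1, -2.0], [3.0, float("inf")]]
    print(has_overflow(clean), has_overflow(bad))
```

The point of the MAX reduction is that a single non-finite element on any one rank is enough to make every rank agree to skip the step.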
|
    Check for gradient overflows and underflows
    check_grad_overflow: bool = True
    """
    Detect gradient overflow/underflow before optimizer step and skip the step
Make this terse, and move the details to the PR. Okay to leave the issue and PR links here.
|
@yongzhe-wang, apologies for the delayed response, and thanks for this PR. As @tohtana pointed out, a performance concern is what prevented making this the default. But I think your arguments are solid: (1) parity with the known production cost of fp16, and (2) potentially reusing the current grad norm scans to reduce cost. I left one minor code cleanup comment. |
46dcf7b to 90c4c8b
|
@yongzhe-wang Let me know when you think this PR is ready to merge. |
Signed-off-by: Yongzhe Wang <yzwang2020@gmail.com>
90c4c8b to d6fc7d9
|
@sfc-gh-truwase Rebased onto latest master — branch is up to date, approved, and DCO green. The remaining required checks are waiting on maintainer workflow approval (fork PR). Could you approve the workflow runs and merge when green? Thanks! |
Summary
Flip `DeepSpeedBF16Config.check_grad_overflow` default from `False` to `True`, so bf16 users get the same gradient-overflow protection that fp16 users already get by default.
Motivation
The bf16 documentation states bf16 "does not require loss scaling" (deepspeed.ai/docs/config-json/), but this overstates the safety guarantee for the bf16 + ZeRO-2 (non-offload) partition-flat gradient accumulation path. We reproduced a deterministic catastrophic NaN under a small set of training conditions:
Under these conditions, a single bf16 element in `engine.optimizer.averaged_gradients[i]` overflows to `+inf`. The downstream `Adam.step` then computes `inf / sqrt(inf) = NaN` in a fused kernel, which simultaneously corrupts the partition slice's `exp_avg`, `exp_avg_sq`, and fp32 master weights. The next forward pass propagates NaN through every layer; the training run is dead with no useful diagnostic. Reproduced consistently in DeepSpeed 0.16.9 - 0.17.1 at step ~22 with our internal repro.
The infrastructure to detect and skip such steps was correctly added by #6976 (`check_grad_overflow` option, `DeepSpeedZeroOptimizer.check_overflow` method, and step-skip logic at `stage_1_and_2.step()` lines ~2128-2143). However, the default was set to `False` for bf16, so users hitting this condition do not receive the protection.
Change
Single line: `check_grad_overflow: bool = False` -> `check_grad_overflow: bool = True` in `DeepSpeedBF16Config`. Updated docstring + bf16 example block accordingly.
Backward compatibility
Users who have benchmarked the check as too expensive AND have separately confirmed their bf16 path cannot overflow can opt out by setting:
```json
"bf16": {
"enabled": true,
"check_grad_overflow": false
}
```
The runtime cost is one isfinite-style scan over the gradient partition per optimizer step (already implemented in `DeepSpeedZeroOptimizer.check_overflow`); typically under 1% of step wallclock.
Related
`check_grad_overflow` option and underlying skip logic
Test plan
`False`, run dies at step ~22; with `True`, training survives via DeepSpeed's existing skip-step path.
`precision_config.py`.
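
For readers without access to the internal repro, the failure mode and the skip-step behavior can be illustrated with a toy scalar Adam loop. This is only a sketch under stated assumptions: `AdamState`, `adam_step`, and `train` are hypothetical names, the math is plain Python floats rather than bf16 tensors or a fused kernel, and the overflow is injected by hand instead of arising from gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass
class AdamState:
    param: float = 1.0       # stands in for an fp32 master weight
    exp_avg: float = 0.0     # first moment
    exp_avg_sq: float = 0.0  # second moment
    step: int = 0


def adam_step(state: AdamState, grad: float, lr: float = 1e-3,
              beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> None:
    """Textbook Adam update on a single scalar parameter."""
    state.step += 1
    state.exp_avg = beta1 * state.exp_avg + (1 - beta1) * grad
    state.exp_avg_sq = beta2 * state.exp_avg_sq + (1 - beta2) * grad * grad
    m_hat = state.exp_avg / (1 - beta1 ** state.step)
    v_hat = state.exp_avg_sq / (1 - beta2 ** state.step)
    # With grad = +inf this evaluates inf / inf, which is NaN.
    state.param -= lr * m_hat / (math.sqrt(v_hat) + eps)


def train(grads, check_grad_overflow: bool):
    """Run the toy loop; optionally skip steps whose gradient is non-finite."""
    state, skipped = AdamState(), 0
    for g in grads:
        if check_grad_overflow and not math.isfinite(g):
            skipped += 1  # mimic a skipped optimizer step
            continue
        adam_step(state, g)
    return state, skipped


if __name__ == "__main__":
    grads = [0.5, math.inf, 0.25]  # one hand-injected overflow
    print(train(grads, check_grad_overflow=False))
    print(train(grads, check_grad_overflow=True))
```

In this toy, a single `inf` gradient leaves both moments at `inf` and the parameter at NaN for every later step when the check is off, while the checked run skips exactly one step and keeps all state finite.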