Split wgrad&dgrad from backward() to support a2a overlap #1653
Merged
ksivaman merged 30 commits into NVIDIA:main on Apr 18, 2025
Conversation
Member: /te-ci pytorch L0 L1
denera requested changes on Apr 10, 2025
Force-pushed from a718320 to 76eea17
Collaborator: /te-ci pytorch L0 L1
Force-pushed from b80a842 to 7ec4182
Author (Contributor): /te-ci pytorch L0 L1
Author (Contributor): /te-ci pytorch L0 L1
Member: What about LayerNormMLP? Since the functionality is advertised as general across all TE modules, LNMLP should also be changed.
ksivaman reviewed on Apr 17, 2025
Author (Contributor): @ptrendx LNMLP has already been changed to delay the wgrad computation.
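To make the discussion concrete, here is a minimal sketch (hypothetical names, not the actual TE implementation) of what "delaying the wgrad computation" means: `backward()` produces only dgrad, while the weight-gradient GEMM is captured as a closure for the caller to run later.

```python
import torch

_pending_wgrads = []  # hypothetical queue, drained later by the training scheduler


class LinearSplitBw(torch.autograd.Function):
    """Toy linear layer: backward computes dgrad eagerly and defers wgrad."""

    @staticmethod
    def forward(ctx, inp, weight):
        ctx.save_for_backward(inp, weight)
        return inp @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        inp, weight = ctx.saved_tensors
        dgrad = grad_out @ weight  # upstream layers need this immediately

        def wgrad_fn():
            # Deferred weight-gradient GEMM; accumulate like autograd would.
            g = grad_out.t() @ inp
            weight.grad = g if weight.grad is None else weight.grad + g

        _pending_wgrads.append(wgrad_fn)
        return dgrad, None  # no weight grad reported here; it is produced later
```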
denera reviewed on Apr 17, 2025
ksivaman reviewed on Apr 18, 2025
Member: /te-ci pytorch L0 L1
ksivaman previously approved these changes on Apr 18, 2025
ksivaman approved these changes on Apr 18, 2025
Description
Add a flag `split_bw` to control whether wgrad should be separated from `backward()` and scheduled in a separate function, so that the a2a (all-to-all) communication can be better hidden when training MoE models.
This MR supports 1F1B with a2a overlap in MCore, similar to the idea in DualPipe.
This feature has an assertion, because the knob binds the output of wgrad to dgrad, which complicates the computing context of wgrad.
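As a rough illustration of how a pipeline scheduler could exploit this split, here is a sketch under assumed names (`_pending_wgrads` is the hypothetical deferred-wgrad queue from the sketch earlier in the conversation; the real 1F1B integration lives in MCore, not in this PR):

```python
import torch
import torch.distributed as dist


def backward_step_with_a2a_overlap(loss, a2a_input, group):
    # With split_bw enabled, backward() runs only the dgrad GEMMs; the
    # wgrad work was queued instead of being executed inline.
    loss.backward()

    # Launch the all-to-all asynchronously...
    a2a_output = torch.empty_like(a2a_input)
    handle = dist.all_to_all_single(a2a_output, a2a_input,
                                    group=group, async_op=True)

    # ...and fill the communication bubble with the deferred wgrad GEMMs.
    for wgrad_fn in _pending_wgrads:
        wgrad_fn()
    _pending_wgrads.clear()

    handle.wait()
    return a2a_output
```

The design point is that the deferred wgrad GEMMs have no consumers until the optimizer step, so they are ideal filler work while the a2a is in flight.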
Fixes # (issue)