Split wgrad&dgrad from backward() to support a2a overlap #1653
Merged
ksivaman merged 30 commits into NVIDIA:main on Apr 18, 2025
Conversation
Member: /te-ci pytorch L0 L1
denera requested changes on Apr 10, 2025
Force-pushed from a718320 to 76eea17
Collaborator: /te-ci pytorch L0 L1
Force-pushed from b80a842 to 7ec4182
Author (Contributor): /te-ci pytorch L0 L1
Author (Contributor): /te-ci pytorch L0 L1
Member: What about LayerNormMLP? Since the functionality is advertised as general across all TE modules, LNMLP should also be changed.
ksivaman reviewed on Apr 17, 2025
Author (Contributor): @ptrendx LNMLP has already been changed to delay the wgrad computation.
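To make the discussion concrete, here is a minimal sketch (hypothetical names, not the actual TE implementation) of what "delaying the wgrad computation" means: `backward()` produces only dgrad, while the weight-gradient GEMM is captured as a closure for the caller to run later.

```python
import torch

_pending_wgrads = []  # hypothetical queue, drained later by the training scheduler


class LinearSplitBw(torch.autograd.Function):
    """Toy linear layer: backward computes dgrad eagerly and defers wgrad."""

    @staticmethod
    def forward(ctx, inp, weight):
        ctx.save_for_backward(inp, weight)
        return inp @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        inp, weight = ctx.saved_tensors
        dgrad = grad_out @ weight  # upstream layers need this immediately

        def wgrad_fn():
            # Deferred weight-gradient GEMM; accumulate like autograd would.
            g = grad_out.t() @ inp
            weight.grad = g if weight.grad is None else weight.grad + g

        _pending_wgrads.append(wgrad_fn)
        return dgrad, None  # no weight grad reported here; it is produced later
```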
denera reviewed on Apr 17, 2025
ksivaman reviewed on Apr 18, 2025
Member: /te-ci pytorch L0 L1
ksivaman previously approved these changes on Apr 18, 2025
ksivaman approved these changes on Apr 18, 2025
Description
Add a flag `split_bw` to control whether wgrad should be separated from `backward()` and scheduled in a separate function, so that the a2a (all-to-all) communication can be better hidden when training MoE models.
This MR supports 1F1B with a2a overlap in MCore, similar to the idea in DualPipe.
This feature has an assertion, because the knob binds the output of wgrad to dgrad, which complicates the computing context of wgrad.
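As a rough illustration of how a pipeline scheduler could exploit this split, here is a sketch under assumed names (`_pending_wgrads` is the hypothetical deferred-wgrad queue from the sketch earlier in the conversation; the real 1F1B integration lives in MCore, not in this PR):

```python
import torch
import torch.distributed as dist


def backward_step_with_a2a_overlap(loss, a2a_input, group):
    # With split_bw enabled, backward() runs only the dgrad GEMMs; the
    # wgrad work was queued instead of being executed inline.
    loss.backward()

    # Launch the all-to-all asynchronously...
    a2a_output = torch.empty_like(a2a_input)
    handle = dist.all_to_all_single(a2a_output, a2a_input,
                                    group=group, async_op=True)

    # ...and fill the communication bubble with the deferred wgrad GEMMs.
    for wgrad_fn in _pending_wgrads:
        wgrad_fn()
    _pending_wgrads.clear()

    handle.wait()
    return a2a_output
```

The design point is that the deferred wgrad GEMMs have no consumers until the optimizer step, so they are ideal filler work while the a2a is in flight.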
Fixes # (issue)