
[Feature] Over 99% communication overlap in Tensor Parallelism using Domino #286

Open · wants to merge 4 commits into base: main
Conversation

@hwchen2017 commented Mar 1, 2025

Hi nanotron team, this is the follow-up work to #285. This PR implements cross-layer communication-computation overlapping and achieves over 99% communication overlap on a single node. With Domino, TP training on a single node is almost communication "free".
Domino paper: https://arxiv.org/abs/2409.15241
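
To make the idea concrete, here is a minimal sketch (illustrative only, not the code in this PR), assuming an already-initialized TP process group and TP-sharded layers whose partial outputs need an all-reduce; the function and variable names are made up for illustration:

```python
import torch
import torch.distributed as dist

def overlapped_tp_forward(layers, x, tp_group):
    """Illustrative Domino-style scheduling: split the batch into two chunks so
    each chunk's TP all-reduce runs while the other chunk (or the next layer)
    is computing. Assumes dist.init_process_group() has already been called and
    that each layer produces a partial result that must be all-reduced."""
    chunk_a, chunk_b = x.chunk(2, dim=0)
    pending_b = None  # in-flight all-reduce for chunk B from the previous layer
    for layer in layers:
        out_a = layer(chunk_a)
        handle_a = dist.all_reduce(out_a, group=tp_group, async_op=True)
        if pending_b is not None:
            pending_b.wait()  # finished overlapping with chunk A's compute above
        out_b = layer(chunk_b)
        pending_b = dist.all_reduce(out_b, group=tp_group, async_op=True)
        handle_a.wait()       # overlapped with chunk B's compute above
        chunk_a, chunk_b = out_a, out_b
    if pending_b is not None:
        pending_b.wait()
    return torch.cat([chunk_a, chunk_b], dim=0)
```

The point of the interleaving is that every all-reduce is launched asynchronously and only waited on after independent compute for the other chunk has been issued, which is what hides the TP communication.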

Profiling results

Forward pass

[profiling screenshot]

Backward pass
[profiling screenshot]

Please note that the flag enabling Domino is currently hardcoded. It should be made configurable for ease of use, but I haven't done that yet. There may also be other considerations on your side; you can further improve the code quality based on this branch. Please let me know if you have any other questions.
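
As a rough illustration of the kind of change meant here (the field and class names below are hypothetical, not nanotron's actual config schema), the hardcoded flag could be lifted into a config option along these lines:

```python
# Hypothetical sketch of exposing Domino as a config option; the names below
# are illustrative placeholders, not nanotron's real config classes.
from dataclasses import dataclass

@dataclass
class ParallelismArgs:
    tp: int = 1
    domino_enabled: bool = False   # turn cross-layer overlap on/off
    domino_num_chunks: int = 2     # number of per-batch chunks used for overlap

def build_model(config: ParallelismArgs):
    if config.domino_enabled:
        # construct TP layers with async all-reduce and chunked scheduling
        ...
    else:
        # fall back to the standard synchronous TP path
        ...
```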

Cc: @GuanhuaWang

@xrsrke (Member) commented Mar 3, 2025

Hi, thanks for the PR. I did a bit of profiling and saw that the forward pass is impressive, as the communication fully overlaps, but the backward pass isn't as good.

[profiling screenshots]

@xrsrke (Member) commented Mar 3, 2025

Thanks for the PR! I'm working on bringing your inter-layer overlap into PR #285, where we overlap the backward pass better and have also tested convergence.

@GuanhuaWang

@xrsrke great catch! If possible, let's do it step by step: merge this cross-layer overlap with its strong forward pass and the not-so-good backward pass, and call it ver 0.1.0.

We have noted down this issue and will provide solutions later (the fixed version could then be called ver 0.1.1). Feel free to discuss any plan here; we are more than happy to help!


@hwchen2017 (Author) commented

Hi @xrsrke, the performance of the backward pass can be further optimized. However, optimizing the BWD pass conflicts with the FWD pass (we will explain this in our updated paper), which makes optimal end-to-end optimization complicated.

We may need different implementations for different models, configs, and node counts, and this is not finalized yet. You may still see some bubbles in the computation stream for both the FWD and BWD passes, depending on the settings.

This branch already brings a solid improvement over async Megatron and avoids the potential data-race problem. It's a good starting point for thinking about the integration with your existing code. We will let you know once a better solution is available.

@NouamaneTazi NouamaneTazi self-requested a review March 4, 2025 11:09