[Feature] Over 99% communication overlap in Tensor Parallelism using Domino #286

hwchen2017 · 2025-03-01T00:41:27Z

Hi nanotron team, this is a follow-up to #285. This PR implements cross-layer communication-computation overlap and achieves over 99% communication overlap on a single node. With Domino, TP training on a single node is almost communication "free".
Domino paper: https://arxiv.org/abs/2409.15241
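For readers who want to check the overlap figure against their own traces, here is a minimal sketch of one way such a percentage could be computed. It assumes the metric is the share of communication time that coincides with compute kernel time, and that kernel intervals have already been exported from a profiler trace as `(start, end)` pairs. The helper names `merge` and `overlap_fraction` are hypothetical and not part of this PR or of nanotron.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end), e.g. in milliseconds


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge touching or overlapping intervals so busy time is not double counted."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_fraction(comm: List[Interval], compute: List[Interval]) -> float:
    """Fraction of communication time hidden behind compute (assumed metric)."""
    comm_m, compute_m = merge(comm), merge(compute)
    total = sum(end - start for start, end in comm_m)
    if total == 0:
        return 1.0  # nothing to hide
    hidden = 0.0
    for cs, ce in comm_m:
        for ks, ke in compute_m:
            # Length of the intersection of the two intervals, or 0 if disjoint.
            hidden += max(0.0, min(ce, ke) - max(cs, ks))
    return hidden / total


if __name__ == "__main__":
    # Made-up intervals for illustration only, not measurements from this PR.
    comm = [(0, 4), (10, 14)]
    compute = [(0, 3), (11, 20)]
    print(f"{overlap_fraction(comm, compute):.2%}")
```

Intervals are merged first so that concurrent kernels on the same stream group are not counted twice.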

Profiling results

Forward pass

Backward pass

Please note that the flag enabling Domino is hardcoded here. It needs to be configurable for ease of use, but I haven't done it yet. There might also be other considerations on your side. You can further improve the code quality based on this branch. Please let me know if you have any other questions.

Cc: @GuanhuaWang

xrsrke · 2025-03-03T10:34:20Z

Hi, thanks for the PR. I did a bit of profiling and saw that the forward pass is impressive, as the communication fully overlaps, but the backward pass isn't as good.

xrsrke · 2025-03-03T15:37:24Z

Thanks for the PR! I'm working on bringing your inter-layer overlap into PR #285, where we overlap the backward pass better and have also tested convergence.

GuanhuaWang · 2025-03-03T19:12:48Z

@xrsrke great catch! If possible, let's do it step by step: merge this cross-layer forward overlap together with the not-so-good backward pass first, and call it ver 0.1.0.

We have noted this issue and will provide a solution later (the fixed version could then be called ver 0.1.1). Feel free to discuss any plans here; we are more than happy to help!

hwchen2017 · 2025-03-03T21:33:04Z

Hi @xrsrke, the performance of the backward pass can be optimized further. However, optimizing the BWD pass conflicts with the FWD pass (we will explain this later in our updated paper), which complicates finding the optimal end-to-end optimization.

We may need different implementations for different models/configs/nodes, and this is not finalized yet. You may still see some bubbles in the computation stream for both the FWD and BWD passes, depending on the settings.

This branch brings a pretty good improvement over async Megatron, and it doesn't have the potential data race problem. It's a good starting point for thinking about integration with your existing code. We will let you know once a better solution is available.

hwchen2017 added 4 commits February 27, 2025 23:30

save temp work

5637615

change row linear layer

be01146

FWD cross layer overlap

c66e46b

Clean code

62312ff

NouamaneTazi self-requested a review March 4, 2025 11:09
