Intra-node shared memory (SHM) optimizations for CPU primitives #458

gaopengff · 2025-07-31T07:57:42Z

This PR implements RFC #455. It adds a shared-memory (shm) allreduce.

gloo/allreduce.cc

d4l3k · 2025-08-11T17:16:28Z

gloo/allreduce_shm.cc

+
+bool is_intra_node(const int size) {
+    // must launch with torchrun
+  auto local_size_string = std::getenv("LOCAL_WORLD_SIZE");


I don't think this check is safe. With torchft, for instance, we often run Gloo only across hosts, and in an 8x8 configuration this would trigger the shm logic for cross-host comms.

In my understanding, if we launch the program with an 8x8 configuration, world_size is 64 and local_size is 8, so shm allreduce would not be selected.

@gaopengff With hybrid parallelism we often create cross host groups that can be smaller than LOCAL_WORLD_SIZE.

For example, with HSDP in a 2x8 configuration we could be using FSDP within one host and replicating across the hosts. Each worker then has two instances of Gloo: one of size 8 within the host, and another of size 2 between each pair of ranks across hosts.

The opposite can happen as well. Say we're running with a tensor parallel dimension of 2 -- that means each pair of workers on one host will have a group of size 2, and this logic would incorrectly not use shm for that operation.
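The two failure modes can be made concrete with a small sketch. `sizeHeuristicSaysIntraNode` is a hypothetical stand-in for the check under review, and the exact comparison (group size equal to LOCAL_WORLD_SIZE) is an assumption, not the PR's actual code. The group shapes come from the 8x8 and tensor-parallel examples in this thread, assuming 8 workers per host.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the size-based check: a group is treated as
// intra-node when its size equals LOCAL_WORLD_SIZE. The exact comparison
// used in the PR is an assumption.
bool sizeHeuristicSaysIntraNode(int groupSize, int localWorldSize) {
  return groupSize == localWorldSize;
}

// One process group as the heuristic sees it, plus the ground truth.
struct GroupCase {
  std::string description;
  int groupSize;
  int localWorldSize;
  bool actuallyIntraNode;
};

// Illustrative groups taken from the discussion (8 workers per host).
std::vector<GroupCase> exampleCases() {
  return {
      {"8x8, cross-host group of 8", 8, 8, false},
      {"tensor-parallel group of 2 on one host", 2, 8, true},
      {"intra-host group of 8", 8, 8, true},
  };
}

// Returns the cases where the heuristic disagrees with the ground truth.
std::vector<GroupCase> misclassified(const std::vector<GroupCase>& cases) {
  std::vector<GroupCase> wrong;
  for (const auto& c : cases) {
    const bool guess = sizeHeuristicSaysIntraNode(c.groupSize, c.localWorldSize);
    if (guess != c.actuallyIntraNode) {
      wrong.push_back(c);
    }
  }
  return wrong;
}
```

Group size alone cannot separate the first and third cases, since both have size 8, which is why the check needs actual placement information rather than an environment variable.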

In this hybrid scenario, checking LOCAL_WORLD_SIZE from the environment is indeed incorrect, and it is difficult to get local_world_size within the Gloo context. So I changed the code to check whether max_local_rank equals world_size. Suppose we are running an 8x8 configuration:
Group1: 1x8, intra-node: max_local_rank(8) == world_size(8) -> shm allreduce
Group2: 8x1, inter-node: max_local_rank(1) != world_size(8) -> ring allreduce
Do you think this check is right?

d4l3k

Still need to fix the triggering logic + add some unit tests for this behavior

d4l3k · 2025-08-29T16:48:03Z

gloo/allreduce.cc


-  switch (opts.algorithm) {
+  auto algorithm = opts.algorithm;
+  if (is_intra_node(opts)) {


Doing a broadcast prior to every single allreduce seems pretty expensive -- can we move this logic to the TCP init in createAndConnectAllPairs? We have enough information there I believe to compute the topology since we have all the hostnames.

The other option would be to inspect the pairs that are participating and check if they all share the same IP

Thanks for your advice. I have moved the intra-node check to createAndConnectAllPairs, where it now compares hostnames. Could you help review? Thanks.
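A minimal sketch of a hostname-based check, assuming the hostnames of all ranks in the group have already been collected during connection setup. `allRanksShareHost` is a hypothetical helper name, not a function from Gloo; the point is that the result can be computed once at init and cached on the context instead of being recomputed before every allreduce.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns true when every rank in the group reported the same hostname.
// An empty list is treated as "not intra-node" so callers fall back to the
// default algorithm when no topology information is available.
bool allRanksShareHost(const std::vector<std::string>& hostnames) {
  if (hostnames.empty()) {
    return false;
  }
  const std::string& first = hostnames.front();
  return std::all_of(
      hostnames.begin(), hostnames.end(),
      [&first](const std::string& h) { return h == first; });
}
```

For the HSDP case discussed earlier, a cross-host group of size 2 yields two different hostnames and returns false, while a tensor-parallel group of size 2 on one host returns true, regardless of how many workers run per host.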

d4l3k · 2025-09-02T18:28:07Z

gloo/allreduce.cc

+  auto algorithm = opts.algorithm;
+
+#ifndef _WIN32
+  if (context->isIntraNode()) {


Can we add a check to make sure the inputs are on CPU before switching to shm? I believe this would fail when using the local ibv CUDA backend.

We could also check device->hasGPUDirect() and disable shm for all backends that support GPU direct.

Sure. I've changed the condition to "(context->isIntraNode() && !context->getDevice()->hasGPUDirect())".
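The resulting selection logic can be sketched as follows. `SketchDevice`, `SketchContext`, and the `Algorithm` enum are hypothetical stand-ins that only mirror the method names quoted above; they are not Gloo's real types.

```cpp
// Hypothetical stand-ins mirroring the method names in the condition above.
enum class Algorithm { kRing, kShm };

struct SketchDevice {
  bool gpuDirect;
  bool hasGPUDirect() const { return gpuDirect; }
};

struct SketchContext {
  bool intraNode;
  SketchDevice device;
  bool isIntraNode() const { return intraNode; }
  const SketchDevice* getDevice() const { return &device; }
};

// Switches to shm only when all ranks share a host and the transport does
// not support GPU direct; otherwise the requested algorithm is kept.
Algorithm chooseAlgorithm(const SketchContext& context, Algorithm requested) {
  if (context.isIntraNode() && !context.getDevice()->hasGPUDirect()) {
    return Algorithm::kShm;
  }
  return requested;
}
```

Gating on the device capability is coarser than inspecting each input buffer, but it avoids the shm path entirely on backends where inputs may live in GPU memory.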

d4l3k

LGTM

gaopengff · 2025-09-05T12:33:37Z

I have made some small changes to fix the timeout UT error. It now throws a timeout exception as expected. Could you help trigger CI and then merge? Thanks. @d4l3k

d4l3k · 2025-09-09T00:00:19Z

@gaopengff looks like a legitimate failure: AllreduceNewRing/AllreduceNewTest

gaopengff · 2025-09-12T02:21:16Z

@d4l3k This failure is caused by an existing issue with fetching the hostname from the store at link. It should use store->wait_get() rather than store->wait(), like in the TCP context. The test passes after this change.
I have also added a finalize method that frees the work buffer and closes the shared memory file descriptor, which makes the cleanup more complete. Could you help trigger CI? Thanks.

gaopengff · 2025-09-16T03:15:45Z

@d4l3k I could not reproduce the current UT failure on our machine. Could you please retrigger it and provide more details about your CI environment?

d4l3k · 2025-09-30T23:14:57Z

@gaopengff any updates on this? Looks like CI is still failing.

In terms of environment, this uses the standard GitHub ubuntu-24.04 runner, a pretty standard environment built on Ubuntu 24.04.

gaopengff · 2025-10-10T09:12:01Z

Hi @d4l3k, do you know the detailed CI run configuration? I think this failure may be caused by running out of memory in the Docker image. Our shared memory allreduce allocates some fixed-size buffers during initialization, which may be too large for the CI environment. I couldn't reproduce the CI failure on our machine, which has sufficiently large memory.
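One way to test this hypothesis is a pre-flight capacity check before allocating the buffers. The sketch below assumes the buffers are backed by a tmpfs mount such as /dev/shm (where Linux keeps POSIX shared memory objects); whether the PR allocates that way is an assumption. `availableBytes` and `canAllocateShm` are hypothetical helper names.

```cpp
#include <cstdint>
#include <sys/statvfs.h>

// Stores the bytes available to unprivileged users on the filesystem that
// contains `path`. Returns false if the path cannot be queried.
bool availableBytes(const char* path, std::uint64_t* out) {
  struct statvfs st;
  if (statvfs(path, &st) != 0) {
    return false;
  }
  *out = static_cast<std::uint64_t>(st.f_bavail) *
         static_cast<std::uint64_t>(st.f_frsize);
  return true;
}

// Returns true only when the mount can be queried and has room for
// `requiredBytes`; callers can fall back to another algorithm otherwise.
bool canAllocateShm(const char* shmDir, std::uint64_t requiredBytes) {
  std::uint64_t avail = 0;
  return availableBytes(shmDir, &avail) && avail >= requiredBytes;
}
```

Calling `canAllocateShm("/dev/shm", totalBufferBytes)` during initialization, where `totalBufferBytes` is a placeholder for the combined buffer size, and logging the result in CI would show whether the runner's shared memory mount is the limiting factor.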

gaopengff added 6 commits July 11, 2025 01:46

add shm allreduce

1738e8e

add bf16 and half support

76d1114

remove bf16 support

2d152a3

add bf16 support

8c29eeb

use reduce function to do reduce job

554d317

refine format

0fdde35

jianan-gu mentioned this pull request Jul 31, 2025

[RFC] Intra-node shared memory (SHM) optimizations for communication operators on CPUs #455

Open

4 tasks

fix accuracy issue

be7da7c

d4l3k requested changes Aug 11, 2025

move intro-node check to allreduce()

5b698dc

gaopengff marked this pull request as ready for review August 20, 2025 07:46

remove debug code

3564e95

meta-cla bot added the CLA Signed label Aug 20, 2025

gaopengff requested a review from d4l3k August 21, 2025 01:42

d4l3k requested changes Aug 21, 2025

use local_rank to check intra-node condition

3ac8065

gaopengff requested a review from d4l3k August 25, 2025 11:51

add support for multi-thread and code for local reduction

8d3ae22

d4l3k requested changes Aug 29, 2025

gaopengff added 2 commits September 1, 2025 04:05

move intra-node check to createAndConnectAllPairs

2985c3e

update to main

704d35d

gaopengff requested a review from d4l3k September 1, 2025 08:26

d4l3k requested changes Sep 2, 2025

gaopengff added 2 commits September 3, 2025 04:21

add check for gpu input and fix format issue

9ae2842

fix format issue

1b7660a

gaopengff requested a review from d4l3k September 3, 2025 08:36

Valentine233 mentioned this pull request Sep 4, 2025

[Flex Attn][CPU] support flash decoding for cpu pytorch/pytorch#159835

Open

d4l3k approved these changes Sep 4, 2025

fix timeout ut

ab9c63b

gaopengff requested a review from d4l3k September 5, 2025 12:33

gaopengff mentioned this pull request Sep 5, 2025

[Draft] Update gloo commit pytorch/pytorch#162260

Draft

add fininalize method and fix ut

9f726e9
