[WIP] enable rdma for weight sync #314
base: main
Conversation
Codecov Report
❌ Patch coverage is

Additional details and impacted files

@@           Coverage Diff            @@
##             main     #314    +/-   ##
=========================================
  Coverage        ?   63.37%
=========================================
  Files           ?       78
  Lines           ?     7717
  Branches        ?        0
=========================================
  Hits            ?     4891
  Misses          ?     2826
  Partials        ?        0
src/forge/actors/trainer.py
Outdated
      key = get_param_key(policy_version, name)
-     await ts.put(key, param)
+     # RDMA is still broken on GPU, so we need to copy to CPU
+     await ts.put(key, param.detach().cpu())
Is this a bad thing? I thought we were writing it to CPU memory anyway on the trainer put side.
Theoretically we wouldn't need this extra copy if RDMA on GPU were working. Yes, we are writing to CPU memory, but currently the path is local GPU -> local CPU -> remote CPU.
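To make the two paths concrete, here is a minimal sketch of the trainer-side publish step, assuming a loop over the model state dict and a `torchstore as ts` import path; only `ts.put` and `get_param_key` come from the diff above, everything else (including the placeholder key format) is illustrative:

```python
import torch
import torchstore as ts  # assumed import path for the store used in the diff


def get_param_key(policy_version: int, name: str) -> str:
    # Placeholder for the helper referenced in the diff; the real key format is unknown.
    return f"{policy_version}/{name}"


async def push_weights(model: torch.nn.Module, policy_version: int) -> None:
    # Sketch only: iterating the state dict like this is an assumption,
    # not necessarily how trainer.py structures its publish step.
    for name, param in model.state_dict().items():
        key = get_param_key(policy_version, name)
        # Ideal path once GPU RDMA works: put straight from device memory
        # (local GPU -> remote), with no host staging:
        #     await ts.put(key, param)
        # Current workaround: stage the tensor on the host first, so the data
        # moves local GPU -> local CPU -> remote CPU.
        await ts.put(key, param.detach().cpu())
```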
- Why is the GPU RDMA not working?
- What is the perf penalty with the GPU-CPU copy?
- Also, what's the corresponding access on the read side?
- Why is the GPU RDMA not working?
Memory registration error. This could be a build issue, but I doubt we have enough time to debug. CPU should work now.
- What is the perf penalty with the GPU-CPU copy?
Not sure; it needs profiling. I'd guess anywhere between 30% and 100% increased latency.
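One rough way to bound that number is to time just the extra device-to-host copy (sketch only: the tensor shape and iteration count are illustrative, and this ignores the `ts.put` itself and any overlap with other work):

```python
import time

import torch


def time_host_staging(param: torch.Tensor, iters: int = 50) -> float:
    """Average seconds per .detach().cpu() copy, i.e. the copy the workaround adds."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = param.detach().cpu()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


# Illustrative tensor: 4096 x 4096 fp16 (~32 MB) resident on the GPU.
p = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
print(f"avg host-staging copy: {time_host_staging(p) * 1e3:.2f} ms")
```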
- Also, what's the corresponding access on the read side?
On the read side, it's basically remote CPU -> local CPU -> GPU (vLLM worker).
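For reference, a sketch of what that read path could look like on the vLLM worker, assuming `ts.get` exists as the counterpart of `ts.put` and that the same key helper is reused (none of this is taken from the actual worker code; `get_param_key` is the placeholder defined in the earlier sketch):

```python
import torch
import torchstore as ts  # assumed import path, matching the put side


async def pull_weights(model: torch.nn.Module, policy_version: int) -> None:
    # Sketch only: ts.get is assumed as the read counterpart of ts.put.
    device = next(model.parameters()).device
    with torch.no_grad():
        for name, param in model.state_dict().items():
            key = get_param_key(policy_version, name)  # same key scheme as the put side (assumed)
            fetched = await ts.get(key)        # remote CPU -> local CPU
            param.copy_(fetched.to(device))    # local CPU -> GPU
```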
This needs the relevant torchstore update to land first for the Monarch API change, and we also have to target a newer version of Monarch.