
Conversation

casteryh (Contributor) commented Oct 5, 2025

This needs the relevant torchstore update to land first for the monarch API change, and we also have to target a newer version of monarch.

meta-cla bot added the CLA Signed label Oct 5, 2025
casteryh marked this pull request as draft October 5, 2025 01:26
codecov-commenter commented Oct 5, 2025

Codecov Report

❌ Patch coverage is 13.51351% with 32 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@63e5beb).

Files with missing lines               Patch %   Lines
src/forge/actors/policy.py               3.22%   30 Missing ⚠️
src/forge/actors/trainer.py              0.00%    1 Missing ⚠️
src/forge/controller/provisioner.py     80.00%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #314   +/-   ##
=======================================
  Coverage        ?   63.37%           
=======================================
  Files           ?       78           
  Lines           ?     7717           
  Branches        ?        0           
=======================================
  Hits            ?     4891           
  Misses          ?     2826           
  Partials        ?        0           

  key = get_param_key(policy_version, name)
- await ts.put(key, param)
+ # RDMA is still broken on GPU, so we need to copy to CPU
+ await ts.put(key, param.detach().cpu())
Contributor:

Is this a bad thing? I thought we were writing it to CPU memory anyway on the trainer put side.

casteryh (Author) replied:

Theoretically we don't need this extra copy if RDMA on GPU is working. Yes, we are writing to CPU memory, but currently it is local GPU -> local CPU -> remote CPU (see the sketch below).
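A minimal sketch of the two write paths, assuming torchstore is imported as ts and get_param_key is the same helper used in the diff above; the RDMA_ON_GPU_OK flag and push_param wrapper are hypothetical and only illustrate the trade-off, they are not part of this PR:

import torchstore as ts

RDMA_ON_GPU_OK = False  # hypothetical toggle; flip once GPU memory registration is fixed

async def push_param(policy_version: int, name: str, param) -> None:
    # get_param_key as used in the diff above (import path omitted)
    key = get_param_key(policy_version, name)
    if RDMA_ON_GPU_OK:
        # direct path: hand the GPU tensor to torchstore, no staging copy
        await ts.put(key, param)
    else:
        # current workaround: local GPU -> local CPU -> remote CPU
        await ts.put(key, param.detach().cpu())

The else branch is what this PR does today; the if branch is the path to return to once GPU RDMA registration works.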

vidhyav (Contributor) commented Oct 7, 2025

  1. Why is the GPU RDMA not working?

  2. What is the perf penalty with the GPU-CPU copy?

  3. Also, what's the corresponding access on the read side?

casteryh (Author) replied:

  1. Why is the GPU RDMA not working?

Memory registration error. This could be a build issue, but I doubt we have enough time to debug it. CPU should work now.

  2. What is the perf penalty with the GPU-CPU copy?

Not sure; it needs profiling. I'd guess anywhere between 30% and 100% increased latency (a rough timing sketch is included after this reply).

  3. Also, what's the corresponding access on the read side?

On the read side, it's basically remote CPU -> local CPU -> GPU (vLLM worker); see the sketch below.
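For question 3, a rough sketch of that read path; ts.get is an assumption that the read side mirrors ts.put and returns a CPU tensor, and load_param is a hypothetical wrapper, not code from this PR:

async def load_param(policy_version: int, name: str, device: str = "cuda"):
    key = get_param_key(policy_version, name)
    param = await ts.get(key)  # remote CPU -> local CPU (assumed torchstore read API)
    # the vLLM worker then moves the weights onto its own device: local CPU -> GPU
    return param.to(device, non_blocking=True)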
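For question 2, a standalone micro-benchmark one could use to bound just the extra device-to-host copy (not the end-to-end ts.put latency); the tensor shape and dtype are illustrative only:

import time
import torch

def time_d2h_copy(shape=(4096, 4096), iters=20) -> float:
    """Average seconds for one GPU -> CPU copy of a tensor of the given shape."""
    param = torch.randn(shape, device="cuda", dtype=torch.bfloat16)
    for _ in range(3):
        param.detach().cpu()  # warm-up so allocator/lazy-init costs are excluded
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        param.detach().cpu()  # the copy this change adds before ts.put
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

This measures only the staging copy; the end-to-end penalty also depends on how torchstore registers and transfers the resulting CPU tensor.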

casteryh marked this pull request as ready for review October 9, 2025 21:34