[WIP] enable rdma for weight sync #314
base: main
Conversation
Codecov Report
❌ Patch coverage is

Additional details and impacted files

@@           Coverage Diff            @@
##             main     #314    +/-   ##
=========================================
  Coverage        ?   63.37%
=========================================
  Files           ?       78
  Lines           ?     7717
  Branches        ?        0
=========================================
  Hits            ?     4891
  Misses          ?     2826
  Partials        ?        0
src/forge/actors/trainer.py
Outdated
      key = get_param_key(policy_version, name)
-     await ts.put(key, param)
+     # RDMA is still broken on GPU, so we need to copy to CPU
+     await ts.put(key, param.detach().cpu())
Is this a bad thing? I thought we were writing it to CPU memory anyway on the trainer put side.
Theoretically we wouldn't need this extra copy if RDMA on GPU were working. Yes, we are writing to CPU memory, but currently the path is local GPU -> local CPU -> remote CPU.
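To make the two paths concrete, here is a minimal sketch of the trainer-side publish step, assuming a loop over the model state dict and a `torchstore as ts` import path; only `ts.put` and `get_param_key` come from the diff above, everything else (including the placeholder key format) is illustrative:

```python
import torch
import torchstore as ts  # assumed import path for the store used in the diff


def get_param_key(policy_version: int, name: str) -> str:
    # Placeholder for the helper referenced in the diff; the real key format is unknown.
    return f"{policy_version}/{name}"


async def push_weights(model: torch.nn.Module, policy_version: int) -> None:
    # Sketch only: iterating the state dict like this is an assumption,
    # not necessarily how trainer.py structures its publish step.
    for name, param in model.state_dict().items():
        key = get_param_key(policy_version, name)
        # Ideal path once GPU RDMA works: put straight from device memory
        # (local GPU -> remote), with no host staging:
        #     await ts.put(key, param)
        # Current workaround: stage the tensor on the host first, so the data
        # moves local GPU -> local CPU -> remote CPU.
        await ts.put(key, param.detach().cpu())
```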
- Why is the GPU RDMA not working?
- What is the perf penalty with the GPU-CPU copy?
- Also, what's the corresponding access on the read side?
- Why is the GPU RDMA not working?
Memory registration error. This could be a build issue, but I doubt we have enough time to debug. CPU should work now.
- What is the perf penalty with the GPU-CPU copy?
Not sure; it needs profiling. I'd guess anywhere between 30% and 100% increased latency.
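One rough way to bound that number is to time just the extra device-to-host copy (sketch only: the tensor shape and iteration count are illustrative, and this ignores the `ts.put` itself and any overlap with other work):

```python
import time

import torch


def time_host_staging(param: torch.Tensor, iters: int = 50) -> float:
    """Average seconds per .detach().cpu() copy, i.e. the copy the workaround adds."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = param.detach().cpu()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


# Illustrative tensor: 4096 x 4096 fp16 (~32 MB) resident on the GPU.
p = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
print(f"avg host-staging copy: {time_host_staging(p) * 1e3:.2f} ms")
```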
- Also, what's the corresponding access on the read side?
On the read side, it's basically remote CPU -> local CPU -> GPU (vLLM worker).
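For reference, a sketch of what that read path could look like on the vLLM worker, assuming `ts.get` exists as the counterpart of `ts.put` and that the same key helper is reused (none of this is taken from the actual worker code; `get_param_key` is the placeholder defined in the earlier sketch):

```python
import torch
import torchstore as ts  # assumed import path, matching the put side


async def pull_weights(model: torch.nn.Module, policy_version: int) -> None:
    # Sketch only: ts.get is assumed as the read counterpart of ts.put.
    device = next(model.parameters()).device
    with torch.no_grad():
        for name, param in model.state_dict().items():
            key = get_param_key(policy_version, name)  # same key scheme as the put side (assumed)
            fetched = await ts.get(key)        # remote CPU -> local CPU
            param.copy_(fetched.to(device))    # local CPU -> GPU
```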
This needs the relevant torchstore update to land first for the Monarch API change, and we also have to target a newer version of Monarch.