Hello dora-rs maintainers and community,
I am currently architecting a dataflow for a dynamic Gaussian Splatting SLAM (GS-SLAM) system using dora-rs.
While reviewing past performance work in the framework, I studied Issue #651, where PCL alignment caused abnormal CPU spikes. That issue highlighted the performance ceilings, specifically memory alignment and cache misses, that appear when interfacing heavy external libraries with Dora's Arrow-based shared memory on the CPU side.
While the CPU-side PCL bottleneck has been resolved, my GS-SLAM implementation introduces a different architectural stress test, this time on the GPU side. We are streaming massive, high-frequency updates of 3D Gaussian primitives: large parameter matrices (typically N×14, covering covariance and spherical-harmonics coefficients) that live on the GPU as CUDA tensors.
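For concreteness, here is a minimal sketch of how such a parameter matrix could be packed for a dora output. The N×14 split in the comment (3 position + 4 rotation + 3 scale + 1 opacity + 3 SH DC coefficients) is just one plausible layout, and the `"width"` metadata key is my own convention, not part of the dora API:

```python
import numpy as np
import pyarrow as pa

N = 1_000_000
params = np.random.rand(N, 14).astype(np.float32)  # one row per Gaussian

# Arrow arrays are one-dimensional, so flatten the matrix and carry the
# row width in metadata so the receiver can reshape it back.
flat = pa.array(params.ravel())
# node.send_output("gaussians", flat, metadata={"width": "14"})
```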
Building upon the lessons from #651, I would like to discuss the current architectural boundaries regarding GPU memory in dora-rs:
1. Cross-Language VRAM Mapping Overhead: When bridging Arrow-backed shared memory from the Rust core into Python (e.g., a PyTorch node for GS rendering/optimization), what overhead should we expect when these arrays are mapped into GPU VRAM? Does the current PyArrow FFI implementation trigger an implicit Host-to-Device memory copy that negates the Arrow zero-copy advantage? (The first sketch after this list marks where that copy shows up in a typical consumer node.)
2. CUDA IPC Roadmap: Are there any existing paradigms, experimental features, or roadmap plans (perhaps relevant to GSoC 2026 ideas) for supporting direct GPU-to-GPU memory sharing across nodes, such as CUDA IPC? Ideally, high-frequency tensor streams would bypass CPU host memory allocation entirely; the second sketch below shows that pattern outside of dora.
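To illustrate question 1, here is a hypothetical consumer node that makes the Host-to-Device boundary explicit. It assumes the standard dora Python API (`dora.Node`, with `event["value"]` arriving as a pyarrow array backed by host shared memory) and the N×14 flattening from the earlier sketch; the input name `"gaussians"` is my own:

```python
import torch
from dora import Node

node = Node()
for event in node:
    if event["type"] == "INPUT" and event["id"] == "gaussians":
        arr = event["value"]                      # pyarrow view on host shared memory
        host = arr.to_numpy(zero_copy_only=True)  # still zero-copy on the host side
        # PyTorch warns that the view is read-only; fine here, since we only
        # copy it to the device and never mutate the host buffer.
        gauss = torch.from_numpy(host).reshape(-1, 14)  # host tensor, no copy yet
        gauss_gpu = gauss.to("cuda")              # <- the Host-to-Device copy in question
        # ... GS rendering/optimization on gauss_gpu ...
```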
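And for question 2, a sketch of the CUDA IPC pattern itself, outside of dora: `torch.multiprocessing` shares CUDA tensors between processes via `cudaIpcGetMemHandle`/`cudaIpcOpenMemHandle` under the hood, so the consumer maps the producer's VRAM directly with no host staging buffer. A dora integration would presumably ship the serialized handle as message metadata rather than through an `mp.Queue`:

```python
import torch
import torch.multiprocessing as mp

def consumer(queue):
    gauss = queue.get()  # rebuilt from the IPC handle: same VRAM, no H2D copy
    gauss += 1.0         # writes directly into the producer's device buffer

if __name__ == "__main__":
    mp.set_start_method("spawn")  # required for sharing CUDA tensors
    queue = mp.Queue()
    gauss = torch.zeros(100_000, 14, device="cuda")  # N x 14 Gaussian parameters
    proc = mp.Process(target=consumer, args=(queue,))
    proc.start()
    queue.put(gauss)              # the IPC handle, not the data, crosses processes
    proc.join()
    torch.cuda.synchronize()
    print(gauss[0, 0].item())     # prints 1.0: the update landed in place
```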
I am very interested in exploring how dora-rs can be pushed to handle high-bandwidth neural rendering dataflows, and whether this direction aligns with the community's future priorities. Any architectural insights or pointers would be highly appreciated.
Thank you.