Commit 42e8954

fix(ep): mlx5 collapsed CQ + dedicated dispatch send buffer for internode-v1 (#366)
- Fix mlx5-RoCE internode-v1 silent corruption: combine overwrote dispatch's in-flight send source `staging`; dispatch now uses a dedicated `dispatchStaging` buffer.
- Add mlx5 collapsed CQ (cc=1/oi=1): track completions via `CQE[0].wqe_counter`; final per-pe quiet waits for live `postIdx`, recycle gate keeps a snapshot.
- Allocate GPU control structures (CQ/QP/doorbell/atomic ibuf) uncached; device-scope fences in the CQ drain.
- Drop dead `outstandingWqe` writes on mlx5/psd.
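The completion tracking described in the second bullet depends on comparing a counter that wraps. As a rough illustration only: the sketch below assumes the counter read from `CQE[0].wqe_counter` is 16 bits wide and that the quiet path wants to know whether it has reached a target index. The helper name `CounterReached` is hypothetical and does not come from this commit.

```cpp
#include <cstdint>

// Hypothetical helper, not taken from the commit: returns true once a wrapping
// 16-bit completion counter has reached (or passed) `target`.
// Assumption: the two values are never more than 2^15 - 1 apart, so the
// modular difference is unambiguous.
inline bool CounterReached(uint16_t counter, uint16_t target) {
  // Unsigned subtraction wraps modulo 2^16; a difference in the lower half of
  // the range means `counter` is at or ahead of `target`.
  return static_cast<uint16_t>(counter - target) < 0x8000u;
}
```

Under these assumptions, a final quiet would re-read the live post index as `target` on every poll, while a recycle gate would compare against a value captured once, matching the live-versus-snapshot distinction in the commit message.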
1 parent d87651c commit 42e8954

5 files changed

Lines changed: 97 additions & 137 deletions

include/mori/ops/dispatch_combine/dispatch_combine.hpp

Lines changed: 2 additions & 0 deletions
@@ -182,6 +182,8 @@ struct ShmemBufsInterNodeV1 {
   mori::application::SymmMemObjPtr dispatchOut;
   mori::application::SymmMemObjPtr combineOut;
   mori::application::SymmMemObjPtr staging;
+  // Dispatch send source, separate from `staging` so combine can't overwrite it.
+  mori::application::SymmMemObjPtr dispatchStaging;
 };
 
 // InterNode / AsyncLL: full 5-buffer set used by the non-V1 RDMA paths.
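
The hazard behind the new `dispatchStaging` field can be reproduced with a small host-side model. This is a simplified sketch, not the mori implementation: `PendingSend` and `SimulateDispatchThenCombine` are hypothetical names, and the asynchronous RDMA send is modeled as a copy that reads its source only when the transfer completes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of an asynchronous send: the source buffer is read when
// the transfer completes, not when it is posted.
struct PendingSend {
  const std::vector<int>* src;
  std::vector<int>* dst;
  void Complete() const { *dst = *src; }
};

// Returns the data the remote side receives from the dispatch send.
std::vector<int> SimulateDispatchThenCombine(bool useDedicatedBuffer) {
  std::vector<int> staging(4, 0);
  std::vector<int> dispatchStaging(4, 0);
  std::vector<int> remote;

  // Dispatch fills its send source and posts the send (still in flight).
  std::vector<int>& dispatchSrc = useDedicatedBuffer ? dispatchStaging : staging;
  dispatchSrc = {1, 2, 3, 4};
  PendingSend send{&dispatchSrc, &remote};

  // Combine starts and writes `staging` before the send has completed.
  staging = {9, 9, 9, 9};

  // The send completes and reads whatever its source holds at that moment.
  send.Complete();
  return remote;
}
```

In this model, `SimulateDispatchThenCombine(false)` delivers combine's data (`9, 9, 9, 9`) in place of the dispatch payload, while `SimulateDispatchThenCombine(true)` delivers `1, 2, 3, 4` because combine never touches the dispatch source.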
