Summary
nixlLibfabricEngine::releaseReqH() does not delete the nixlLibfabricBackendH object allocated in prepXfer(), causing a memory leak on every transfer request lifecycle.
Environment
- NIXL main branch (latest commit as of 2026-06-03)
- libfabric backend with EFA provider
- Multi-GPU benchmark (8x GPU, 1024 requests/iteration)
Root Cause
In src/plugins/libfabric/libfabric_backend.cpp, releaseReqH() simply returns without freeing the handle:
nixl_status_t
nixlLibfabricEngine::releaseReqH(nixlBackendReqH *handle) const {
    if (!handle) {
        return NIXL_SUCCESS;
    }
    // Let NIXL framework handle the deletion <-- INCORRECT: framework does NOT delete
    NIXL_DEBUG << "releaseReqH completed successfully";
    return NIXL_SUCCESS;
}
The comment "Let NIXL framework handle the deletion" is incorrect. The framework's ~nixlXferReqH() only calls engine->releaseReqH(backendHandle); it never calls delete backendHandle. This is confirmed by:
- src/core/transfer_request.h: the destructor only calls releaseReqH, with no delete
- docs/BackendGuide.md line 93: "releaseReqH(): Releases a transfer request handle... freeing resources"
- Every other backend (UCX, Mooncake, HiXL, HF3FS, OBJ, CUDA GDS) performs delete handle inside releaseReqH()
Impact
- Every postXfer() call allocates a new nixlLibfabricBackendH() (in prepXfer)
- The handle is never freed → unbounded memory growth proportional to transfer count
- In production disaggregated inference: thousands of transfers/second → rapid OOM
Proposed Fix
nixl_status_t
nixlLibfabricEngine::releaseReqH(nixlBackendReqH *handle) const {
    if (!handle) {
        return NIXL_SUCCESS;
    }
    nixlLibfabricBackendH *lf_handle = static_cast<nixlLibfabricBackendH *>(handle);
    delete lf_handle;
    NIXL_DEBUG << "releaseReqH completed successfully";
    return NIXL_SUCCESS;
}
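The ownership contract behind this fix can be checked outside NIXL with a small mock. The sketch below is not NIXL code: MockBackendReqH, MockEngine, MockXferReqH and liveHandlesAfter are hypothetical stand-ins that model only the pattern described above (the backend allocates in prepXfer(), and the framework destructor calls nothing but releaseReqH()). The leaky engine remembers its handles so that the demo itself exits clean; the real backend has no such bookkeeping.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the NIXL types; only the ownership pattern is modeled.
static int g_live = 0;  // number of backend handles currently allocated

struct MockBackendReqH {
    MockBackendReqH() { ++g_live; }
    ~MockBackendReqH() { --g_live; }
};

struct MockEngine {
    virtual ~MockEngine() = default;
    // Mirrors prepXfer(): the backend allocates the handle.
    MockBackendReqH *prepXfer() const { return new MockBackendReqH(); }
    virtual void releaseReqH(MockBackendReqH *handle) const = 0;
};

// Current behavior: releaseReqH() returns without freeing the handle.
struct LeakyEngine : MockEngine {
    // Demo-only bookkeeping so the mock can clean up when the engine dies.
    mutable std::vector<MockBackendReqH *> leaked;
    void releaseReqH(MockBackendReqH *handle) const override { leaked.push_back(handle); }
    ~LeakyEngine() override {
        for (MockBackendReqH *h : leaked) delete h;
    }
};

// Proposed behavior: the backend deletes what it allocated.
struct FixedEngine : MockEngine {
    void releaseReqH(MockBackendReqH *handle) const override { delete handle; }
};

// Mirrors the framework-side request: its destructor only calls releaseReqH().
struct MockXferReqH {
    const MockEngine &engine;
    MockBackendReqH *backendHandle;
    explicit MockXferReqH(const MockEngine &e) : engine(e), backendHandle(e.prepXfer()) {}
    ~MockXferReqH() { engine.releaseReqH(backendHandle); }
};

// Runs n request lifecycles and returns how many handles remain allocated afterwards.
int liveHandlesAfter(const MockEngine &engine, int n) {
    const int before = g_live;
    for (int i = 0; i < n; ++i) {
        MockXferReqH req(engine);  // constructed and destroyed each iteration
    }
    return g_live - before;
}
```

In this mock, liveHandlesAfter(LeakyEngine{}, 1024) returns 1024 and liveHandlesAfter(FixedEngine{}, 1024) returns 0, which matches the per-request growth described under Impact.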
Additional Related Issues Found
During the same code audit, we also identified several other resource management issues in the libfabric backend:
- Control request leak in postXfer(): a ControlRequestPool request is allocated unconditionally at the start of postXfer() but only released (via send completion callback) when hasNotif == true. Without notification, the request permanently leaks from the 256-entry pool.
- received_remote_writes_ unbounded growth: the unordered_set<uint32_t> tracking received XFER_IDs is never cleared, growing indefinitely and causing false positives after XFER_ID wraparound (20-bit counter).
- CQ error path does not release failed requests: when fi_cq_read returns error, fi_cq_readerr is called for logging but the errored request (identifiable via err_entry.op_context) is never returned to the pool.
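The first item can be illustrated with a self-contained mock of the accounting. This is a sketch under assumptions, not the backend's code: MockControlRequestPool, postXferLeaky, postXferFixed and successfulPosts are hypothetical names, the completion callback is collapsed into an immediate release, and acquiring the request only when a notification is needed is just one possible fix (releasing on the no-notification path would be another).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size pool; only acquire/release accounting is modeled.
class MockControlRequestPool {
public:
    explicit MockControlRequestPool(std::size_t capacity) : free_(capacity) {}
    bool acquire() {
        if (free_ == 0) return false;  // pool exhausted
        --free_;
        return true;
    }
    void release() { ++free_; }
    std::size_t available() const { return free_; }

private:
    std::size_t free_;
};

// Behavior as described in the issue: the request is taken up front and
// only given back on the notification path.
bool postXferLeaky(MockControlRequestPool &pool, bool hasNotif) {
    if (!pool.acquire()) return false;
    if (hasNotif) {
        pool.release();  // stands in for the send completion callback
    }
    return true;  // without notification the entry is never returned
}

// One possible fix (an assumption): take a request only when it will be used.
bool postXferFixed(MockControlRequestPool &pool, bool hasNotif) {
    if (!hasNotif) return true;
    if (!pool.acquire()) return false;
    pool.release();  // stands in for the send completion callback
    return true;
}

// Counts how many notification-free posts succeed against a fresh 256-entry pool.
std::size_t successfulPosts(bool (*post)(MockControlRequestPool &, bool),
                            std::size_t attempts) {
    MockControlRequestPool pool(256);
    std::size_t ok = 0;
    for (std::size_t i = 0; i < attempts; ++i) {
        if (post(pool, false)) ++ok;
    }
    return ok;
}
```

With 1000 attempts, the leaky variant succeeds 256 times in this mock and then fails on every call, while the fixed variant succeeds all 1000 times.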
Happy to submit PRs for any/all of these.
References
- UCX backend correctly deletes: src/plugins/ucx/ucx_backend.cpp (releaseReqH → delete intHandle)
- Mooncake backend correctly deletes: src/plugins/mooncake/mooncake_backend.cpp (releaseReqH → delete priv)
- BackendGuide documentation: docs/BackendGuide.md line 93