Summary
nixl-cu13==0.10.1 (pip wheel) bundles UCX 1.20.0 and a libplugin_UCX.so linked against it. When two NIXL agents try to initialize concurrently on the same host (e.g. prefill + decode workers in a disaggregated-serving setup), each agent's nixlUcxContext constructor enters a runaway-realloc loop inside uct_md_query_tl_resources and never returns. The md_resources buffer grows past 1 GiB and keeps growing at ~7 MB/s with no terminating condition.
The same bug reproduces with two workers on a single GPU, so multi-GPU hardware is not needed to reproduce it.
Reproduction
Run two NIXL agents in two processes on the same host. Minimal pattern (using TRT-LLM's native disagg transceiver as the consumer; same hang reproduces in any two-agent scenario):
docker run --rm --gpus '"device=0"' nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc14 bash -c '
pip install nixl-cu13==0.10.1
# ... launch two python processes each creating an nixl.Agent with backend=UCX ...
'
Both processes wedge during nixlAgent::createBackend("UCX", ...). Single-agent runs work fine; only the concurrent two-agent case triggers the bug.
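A minimal harness for the two-process pattern could look like the sketch below. It starts N Python interpreters at the same time and reports whether each one exited or was still running at the deadline. The harness uses only the standard library. The agent-creation code is passed in as a string because the exact NIXL Python call is not shown in this report; the `nixl._api` names in the trailing comment are an assumption about the wheel's Python bindings, not something confirmed here.

```python
import subprocess
import sys
import time


def run_concurrently(snippet, n=2, timeout_s=60.0):
    """Start n interpreters running `snippet` at once and classify each one.

    Returns a list of "ok", "failed", or "hung" (still alive at the deadline).
    """
    procs = [subprocess.Popen([sys.executable, "-c", snippet]) for _ in range(n)]
    deadline = time.monotonic() + timeout_s
    results = []
    for proc in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            code = proc.wait(timeout=remaining)
            results.append("ok" if code == 0 else "failed")
        except subprocess.TimeoutExpired:
            proc.kill()  # a wedged process never exits on its own
            proc.wait()
            results.append("hung")
    return results


# Hypothetical snippet (assumed API, not confirmed by this report):
# SNIPPET = ('from nixl._api import nixl_agent, nixl_agent_config; '
#            'nixl_agent("repro", nixl_agent_config(backends=["UCX"]))')
# print(run_concurrently(SNIPPET, n=2, timeout_s=60))
```

Running the same snippet with `n=1` gives the single-agent baseline for comparison.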
Stack trace (captured with gdb -p <pid> on the hung process)
#0 syscall ()
#1 ucm_event_call_orig (event_type=UCM_EVENT_MREMAP, ...) at event/event.c:80
#2 ucm_event_dispatch (event_type=UCM_EVENT_MREMAP, ...) at event/event.c:145
#3 ucm_mremap (old_address=0x7833c17cf000, old_size=1182994432, new_size=1182998528, ...) at event/event.c:277
#4 realloc () from /lib/x86_64-linux-gnu/libc.so.6
#5 ucs_realloc (size=1182994436, name="md_resources") at debug/memtrack.c:347
#6 uct_md_query_tl_resources (md=0x... <md>, ...) at base/uct_md.c:106
#7 ucp_add_tl_resources (...) at core/ucp_context.c:1299
#8 ucp_add_component_resources (...) at core/ucp_context.c:1710
#9 ucp_fill_resources (...) at core/ucp_context.c:2003
#10 ucp_init_version (...) at core/ucp_context.c:2505
#11 nixlUcxContext::nixlUcxContext (...) at .../nixl_cu13.mesonpy.libs/plugins/libplugin_UCX.so
#12 nixlUcxEngine::nixlUcxEngine (...)
#13 nixlUcxThreadEngine::nixlUcxThreadEngine (...)
#14 nixlUcxEngine::create (...)
#15 nixlBackendPluginCreator<nixlUcxEngine>::createEngine (...)
#16 nixlAgent::createBackend ("UCX", ...) at .../nixl_cu13.mesonpy.libs/libnixl.so
Sampled buffer size grows monotonically: 906 MB → 1.18 GB → 1.40 GB across separate gdb snapshots ~5 s apart.
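The sizes in frame #3 can be checked with a few lines of arithmetic. This sketch uses only the old_size and new_size values from the trace and the ~7 MB/s figure from the summary. The last line rests on an assumption: that growth is steady and every mremap call adds the same amount as the one captured here.

```python
GIB = 1024 ** 3

old_size = 1_182_994_432  # bytes, frame #3 old_size
new_size = 1_182_998_528  # bytes, frame #3 new_size

step = new_size - old_size   # bytes added by this one mremap call
size_gb = old_size / 1e9     # decimal gigabytes
size_gib = old_size / GIB    # binary gibibytes

# Assumption: steady ~7 MB/s growth, and every mremap call adds the same
# step as the one captured in this snapshot.
calls_per_s = 7e6 / step

print(f"step per mremap: {step} bytes")
print(f"buffer size:     {size_gb:.2f} GB = {size_gib:.2f} GiB")
print(f"mremap rate:     ~{calls_per_s:.0f} calls/s")
```

The captured call grows the mapping by 4096 bytes, and the buffer in that snapshot is 1.18 GB, or about 1.10 GiB.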
Environment
- nixl==0.10.1, nixl-cu13==0.10.1 (installed from PyPI)
- Bundled UCX: 1.20.0 (loaded from /opt/dynamo/venv/lib/python3.12/.../nixl_cu13.mesonpy.libs/plugins/...)
- Tested in nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc14 runtime image, amd64
Plugin enumeration before hang (info-level logs)
nixl_plugin_manager.cpp:303] Loading plugins from: .../nixl_cu13.mesonpy.libs/plugins
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: GUSLI
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: GDS
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: UCX
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: AZURE_BLOB
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: POSIX
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: OBJ
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: GDS_MT
config.cpp:45] Modified UCX config: ADDRESS_VERSION=v2
config.cpp:45] Modified UCX config: RNDV_THRESH=inf
config.cpp:45] Modified UCX config: MAX_RMA_RAILS=2
config.cpp:42] Failed to modify UCX config: IB_PCI_RELAXED_ORDERING=try: Invalid parameter
config.cpp:45] Modified UCX config: RCACHE_MAX_UNRELEASED=1024
config.cpp:42] Failed to modify UCX config: RC_GDA_NUM_CHANNELS=4: Invalid parameter
config.cpp:45] Modified UCX config: MAX_COMPONENT_MDS=32
ucp_context.c:2463 UCX INFO Version 1.20.0 (loaded from /usr/local/ucx//lib/libucp.so.0)
ucp_context.c:2463 UCX INFO Version 1.20.0 (loaded from /usr/local/ucx//lib/libucp.so.0)
# <--- hangs here, no further output --->
UCX_TLS=cuda_ipc,self,tcp does not help — the filter is applied after uct_md_query_tl_resources, so the explosion happens upstream of TLS filtering.
Workaround
Force-load TRT-LLM's bundled libnixl.so 0.9.0 (which links against system UCX) ahead of nixl-cu13's DT_RPATH:
ENV LD_PRELOAD=/usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/nixl/libnixl.so
NIXL 0.9.0 + system UCX 1.20.0 does not trigger the bug; the same UCX version works fine, suggesting the regression is in how nixl-cu13's plugin uses UCX (possibly the modified config or ucp_init argument shape) rather than in UCX itself.
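Since the workaround depends on which copy of each library the dynamic loader picks, it can help to confirm what a given process actually mapped. The sketch below is Linux-only and assumes the standard `/proc/<pid>/maps` layout. It filters a maps dump for the NIXL and UCX libraries. The default name fragments come from the library names in this report, plus `libuct` for UCX's transport layer; the helper name itself is hypothetical.

```python
def loaded_libraries(maps_text,
                     needles=("libnixl", "libucp", "libuct", "libplugin_UCX")):
    """Return the sorted, de-duplicated file paths in a /proc/<pid>/maps dump
    whose file name contains one of `needles`."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [path]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        name = path.rsplit("/", 1)[-1]
        if any(needle in name for needle in needles):
            paths.add(path)
    return sorted(paths)


# Usage on Linux, with `pid` set to the process under inspection:
# with open(f"/proc/{pid}/maps") as f:
#     print("\n".join(loaded_libraries(f.read())))
```

Comparing the output with and without LD_PRELOAD set shows whether the 0.9.0 libnixl.so and the system UCX are the copies in use.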
Ask
- Is the md_resources enumeration loop a known bug in either nixl-cu13's UCX plugin or UCX 1.20.0?
- Can nixl-cu13 0.10.x ship with a fixed UCX or avoid the codepath?
- Is there a documented workaround beyond LD_PRELOAD?
Tracked downstream in ai-dynamo/dynamo#9654 — once a fix lands, we can drop the LD_PRELOAD.