UCX 1.20.0 (bundled in nixl-cu13==0.10.1) hangs in uct_md_query_tl_resources with concurrent NIXL agents #1668

@tanmayv25

Summary

nixl-cu13==0.10.1 (pip wheel) bundles UCX 1.20.0 and a libplugin_UCX.so linked against it. When two NIXL agents try to initialize concurrently on the same host (e.g. prefill + decode workers in a disaggregated-serving setup), each agent's nixlUcxContext constructor enters a runaway-realloc loop inside uct_md_query_tl_resources and never returns. The md_resources buffer grows past 1 GiB and keeps growing at ~7 MB/s with no terminating condition.

The bug also reproduces with two workers on a single GPU, so multi-GPU hardware is not needed.

Reproduction

Run two NIXL agents in two processes on the same host. Minimal pattern (using TRT-LLM's native disagg transceiver as the consumer; the same hang reproduces in any two-agent scenario):

docker run --rm --gpus '"device=0"' nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc14 bash -c '
  pip install nixl-cu13==0.10.1
  # ... launch two python processes each creating an nixl.Agent with backend=UCX ...
'

Both processes wedge during nixlAgent::createBackend("UCX", ...). Single-agent runs work fine; only the concurrent two-agent case triggers the bug.
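
A minimal sketch of a harness for the two-process pattern, in case it helps others reproduce the hang without the TRT-LLM transceiver. The harness itself uses only the standard library. The worker snippet in NIXL_WORKER is an assumption about the NIXL Python bindings (import path nixl._api, nixl_agent, nixl_agent_config with a backends list) and may need adjusting for the installed wheel; run_concurrently and nixl_worker_command are hypothetical helper names.

```python
import subprocess
import sys
import time

# ASSUMPTION: the import path and constructor below reflect the NIXL Python
# bindings as I understand them; adjust to whatever the installed wheel exposes.
NIXL_WORKER = (
    "from nixl._api import nixl_agent, nixl_agent_config\n"
    "nixl_agent('worker', nixl_agent_config(backends=['UCX']))\n"
)


def nixl_worker_command():
    """Command line that creates one UCX-backed agent in a fresh interpreter."""
    return [sys.executable, "-c", NIXL_WORKER]


def run_concurrently(commands, timeout_s):
    """Start every command at once; report 'exit:<code>' or 'hung' for each."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    deadline = time.monotonic() + timeout_s  # one shared deadline for all workers
    results = []
    for proc in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results.append(f"exit:{proc.wait(timeout=remaining)}")
        except subprocess.TimeoutExpired:
            proc.kill()  # a wedged process will not exit on its own
            proc.wait()
            results.append("hung")
    return results
```

Calling run_concurrently([nixl_worker_command(), nixl_worker_command()], timeout_s=60) should report "hung" for both workers if the bug fires; the 60 s timeout is an arbitrary choice.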

Stack trace (captured with gdb -p <pid> on the hung process)

#0  syscall ()
#1  ucm_event_call_orig (event_type=UCM_EVENT_MREMAP, ...) at event/event.c:80
#2  ucm_event_dispatch (event_type=UCM_EVENT_MREMAP, ...) at event/event.c:145
#3  ucm_mremap (old_address=0x7833c17cf000, old_size=1182994432, new_size=1182998528, ...) at event/event.c:277
#4  realloc () from /lib/x86_64-linux-gnu/libc.so.6
#5  ucs_realloc (size=1182994436, name="md_resources") at debug/memtrack.c:347
#6  uct_md_query_tl_resources (md=0x... <md>, ...) at base/uct_md.c:106
#7  ucp_add_tl_resources (...) at core/ucp_context.c:1299
#8  ucp_add_component_resources (...) at core/ucp_context.c:1710
#9  ucp_fill_resources (...) at core/ucp_context.c:2003
#10 ucp_init_version (...) at core/ucp_context.c:2505
#11 nixlUcxContext::nixlUcxContext (...) at .../nixl_cu13.mesonpy.libs/plugins/libplugin_UCX.so
#12 nixlUcxEngine::nixlUcxEngine (...)
#13 nixlUcxThreadEngine::nixlUcxThreadEngine (...)
#14 nixlUcxEngine::create (...)
#15 nixlBackendPluginCreator<nixlUcxEngine>::createEngine (...)
#16 nixlAgent::createBackend ("UCX", ...) at .../nixl_cu13.mesonpy.libs/libnixl.so
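
For readers decoding the sizes in frames #3 and #5, here is a small arithmetic sketch using only the numbers from the trace. The 4096-byte page size is an assumption (typical for x86_64 Linux), not something the trace states.

```python
# Values copied from frames #3 and #5 of the stack trace above.
old_size = 1182994432   # ucm_mremap old_size
new_size = 1182998528   # ucm_mremap new_size
requested = 1182994436  # ucs_realloc size

PAGE = 4096  # ASSUMPTION: page size on this host

growth = new_size - old_size
print(f"mapping grows by {growth} bytes ({growth // PAGE} page)")
print(f"old mapping: {old_size / 1e9:.2f} GB = {old_size / 2**30:.2f} GiB")
print(f"request exceeds old mapping by {requested - old_size} bytes")
```

In this snapshot the mremap call extends the mapping by a single page, and the old mapping is about 1.18 GB in decimal units (about 1.10 GiB).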

Sampled buffer size grows monotonically: 906 MB → 1.18 GB → 1.40 GB across separate gdb snapshots ~5 s apart. (The 1.18 figure corresponds to old_size=1182994432 in frame #3, i.e. decimal GB.)
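
As an alternative to repeated gdb snapshots, the growth can be sampled from /proc. This is a rough sketch, assuming Linux and assuming that the process's resident set size is a usable proxy for the md_resources buffer (it includes everything else the process has mapped and touched). The function names are hypothetical.

```python
import time


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value (KiB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # line looks like "VmRSS:   2048 kB"
    return None


def sample_rss(pid, count, interval_s):
    """Collect (monotonic_time, rss_kib) pairs for a running process."""
    samples = []
    for i in range(count):
        with open(f"/proc/{pid}/status") as f:
            samples.append((time.monotonic(), parse_vm_rss_kib(f.read())))
        if i + 1 < count:
            time.sleep(interval_s)
    return samples


def growth_mib_per_s(samples):
    """Average growth rate between the first and last sample."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    return (r1 - r0) / 1024 / (t1 - t0)
```

For example, growth_mib_per_s(sample_rss(pid, count=5, interval_s=1.0)) gives an average rate over roughly four seconds for the hung process.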

Environment

  • nixl==0.10.1, nixl-cu13==0.10.1 (installed from PyPI)
  • Bundled UCX: 1.20.0 (loaded from /opt/dynamo/venv/lib/python3.12/.../nixl_cu13.mesonpy.libs/plugins/...)
  • Tested in nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc14 runtime image, amd64

Plugin enumeration before hang (info-level logs)

nixl_plugin_manager.cpp:303] Loading plugins from: .../nixl_cu13.mesonpy.libs/plugins
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: GUSLI
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: GDS
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: UCX
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: AZURE_BLOB
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: POSIX
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: OBJ
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: GDS_MT
config.cpp:45] Modified UCX config: ADDRESS_VERSION=v2
config.cpp:45] Modified UCX config: RNDV_THRESH=inf
config.cpp:45] Modified UCX config: MAX_RMA_RAILS=2
config.cpp:42] Failed to modify UCX config: IB_PCI_RELAXED_ORDERING=try: Invalid parameter
config.cpp:45] Modified UCX config: RCACHE_MAX_UNRELEASED=1024
config.cpp:42] Failed to modify UCX config: RC_GDA_NUM_CHANNELS=4: Invalid parameter
config.cpp:45] Modified UCX config: MAX_COMPONENT_MDS=32
ucp_context.c:2463 UCX INFO Version 1.20.0 (loaded from /usr/local/ucx//lib/libucp.so.0)
ucp_context.c:2463 UCX INFO Version 1.20.0 (loaded from /usr/local/ucx//lib/libucp.so.0)
# <--- hangs here, no further output --->

UCX_TLS=cuda_ipc,self,tcp does not help: the transport filter is applied after uct_md_query_tl_resources, so the runaway growth happens before UCX_TLS filtering takes effect.

Workaround

Force-load TRT-LLM's bundled libnixl.so 0.9.0 (which links against system UCX) ahead of nixl-cu13's DT_RPATH:

ENV LD_PRELOAD=/usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/nixl/libnixl.so

NIXL 0.9.0 with system UCX 1.20.0 does not trigger the bug. Since the same UCX version works there, the regression is likely in how nixl-cu13's plugin uses UCX (possibly the modified config or the shape of the ucp_init arguments) rather than in UCX itself.
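
Because the logs above report libucp.so.0 loaded from /usr/local/ucx//lib, it can be worth confirming which NIXL and UCX copies a process actually has mapped, with and without the LD_PRELOAD. A minimal sketch, assuming Linux and the standard /proc/<pid>/maps layout; the function names and the default list of library name fragments are my own choices.

```python
def mapped_libraries(maps_text,
                     needles=("libnixl", "libucp", "libuct", "libplugin_UCX")):
    """Return sorted unique mapped file paths whose file name contains a needle."""
    found = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            name = path.rsplit("/", 1)[-1]
            if any(n in name for n in needles):
                found.add(path)
    return sorted(found)


def mapped_libraries_of(pid):
    """Read /proc/<pid>/maps and list the NIXL/UCX shared objects in it."""
    with open(f"/proc/{pid}/maps") as f:
        return mapped_libraries(f.read())
```

Running mapped_libraries_of(pid) against a hung worker and against a worker started with the workaround should show whether the two cases resolve libnixl.so and libucp.so.0 from different locations.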

Ask

  • Is the md_resources enumeration loop a known bug in either nixl-cu13's UCX plugin or UCX 1.20.0?
  • Can nixl-cu13 0.10.x ship with a fixed UCX or avoid the codepath?
  • Is there a documented workaround other than LD_PRELOAD?

Tracked downstream in ai-dynamo/dynamo#9654 — once a fix lands, we can drop the LD_PRELOAD.
