UCX 1.20.0 (bundled in nixl-cu13==0.10.1) hangs in uct_md_query_tl_resources with concurrent NIXL agents #1668

## Summary

`nixl-cu13==0.10.1` (pip wheel) bundles UCX 1.20.0 and a `libplugin_UCX.so` linked against it. When two NIXL agents try to initialize concurrently on the same host (e.g. prefill + decode workers in a disaggregated-serving setup), each agent's `nixlUcxContext` constructor enters a runaway-realloc loop inside `uct_md_query_tl_resources` and never returns. The `md_resources` buffer grows past 1 GiB and keeps growing at ~7 MB/s with no terminating condition.

The same bug reproduces with two workers on a **single GPU**, so no multi-GPU hardware is needed to reproduce it.

## Reproduction

Run two NIXL agents in two processes on the same host. Minimal pattern (using TRT-LLM's native disagg transceiver as the consumer; same hang reproduces in any two-agent scenario):

```bash
docker run --rm --gpus '"device=0"' nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc14 bash -c '
  pip install nixl-cu13==0.10.1
  # ... launch two python processes each creating an nixl.Agent with backend=UCX ...
'
```

Both processes hang during `nixlAgent::createBackend("UCX", ...)`. Single-agent runs work fine; only the concurrent two-agent case triggers the bug.
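
For anyone who wants a scripted version of the two-process launch, here is a minimal sketch. It is not the exact script used above. It assumes the wheel exposes `nixl_agent` and `nixl_agent_config` under `nixl._api`, as in NIXL's Python examples; `AGENT_SNIPPET` and `launch_concurrently` are hypothetical names introduced only for this illustration.

```python
import subprocess
import sys

# Assumption: nixl._api.nixl_agent / nixl_agent_config exist as in NIXL's
# Python examples; adjust the import if your wheel lays things out differently.
AGENT_SNIPPET = (
    "from nixl._api import nixl_agent, nixl_agent_config\n"
    "nixl_agent('repro', nixl_agent_config(backends=['UCX']))\n"
    "print('agent created')\n"
)


def launch_concurrently(snippet, count=2, timeout_s=60.0):
    """Start `count` interpreters running `snippet` at once; report each outcome."""
    # All processes are started before any of them is waited on, so their
    # initialization overlaps.
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", snippet],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        for _ in range(count)
    ]
    outcomes = []
    for proc in procs:
        try:
            # The timeout is applied to each process in turn.
            code = proc.wait(timeout=timeout_s)
            outcomes.append(f"exited({code})")
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
            outcomes.append("hung")
    return outcomes
```

Calling `launch_concurrently(AGENT_SNIPPET)` inside the container should report `hung` for both processes if the bug fires. Raise `timeout_s` if you want time to attach gdb before the processes are killed.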

## Stack trace (captured with `gdb -p <pid>` on the hung process)

```
#0  syscall ()
#1  ucm_event_call_orig (event_type=UCM_EVENT_MREMAP, ...) at event/event.c:80
#2  ucm_event_dispatch (event_type=UCM_EVENT_MREMAP, ...) at event/event.c:145
#3  ucm_mremap (old_address=0x7833c17cf000, old_size=1182994432, new_size=1182998528, ...) at event/event.c:277
#4  realloc () from /lib/x86_64-linux-gnu/libc.so.6
#5  ucs_realloc (size=1182994436, name="md_resources") at debug/memtrack.c:347
#6  uct_md_query_tl_resources (md=0x... <md>, ...) at base/uct_md.c:106
#7  ucp_add_tl_resources (...) at core/ucp_context.c:1299
#8  ucp_add_component_resources (...) at core/ucp_context.c:1710
#9  ucp_fill_resources (...) at core/ucp_context.c:2003
#10 ucp_init_version (...) at core/ucp_context.c:2505
#11 nixlUcxContext::nixlUcxContext (...) at .../nixl_cu13.mesonpy.libs/plugins/libplugin_UCX.so
#12 nixlUcxEngine::nixlUcxEngine (...)
#13 nixlUcxThreadEngine::nixlUcxThreadEngine (...)
#14 nixlUcxEngine::create (...)
#15 nixlBackendPluginCreator<nixlUcxEngine>::createEngine (...)
#16 nixlAgent::createBackend ("UCX", ...) at .../nixl_cu13.mesonpy.libs/libnixl.so
```

Sampled buffer size grows monotonically: 906 MB → 1.18 GB → 1.40 GB across separate gdb snapshots ~5 s apart (the 1.18 GB sample is the `size=1182994436` shown in frame #5 above).
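
The growth can also be watched without attaching gdb each time. The sketch below polls `/proc/<pid>/status` on Linux; the helper names are hypothetical. Note that `VmRSS` only counts touched pages, so `VmSize` may be the better proxy if the reallocated buffer is not fully written, and neither field isolates the `md_resources` buffer from the rest of the process.

```python
import time


def parse_status_kib(status_text, field="VmRSS"):
    """Return a kB-valued field (e.g. VmRSS, VmSize) from /proc/<pid>/status text."""
    prefix = field + ":"
    for line in status_text.splitlines():
        if line.startswith(prefix):
            # Lines look like "VmRSS:\t 1155268 kB"
            return int(line.split()[1])
    raise ValueError(f"{field} not found")


def growth_mib_per_s(samples):
    """samples: list of (seconds, kib) pairs; average growth over the whole window."""
    (t0, k0), (t1, k1) = samples[0], samples[-1]
    return (k1 - k0) / 1024.0 / (t1 - t0)


def sample_status(pid, field="VmRSS", count=3, interval_s=5.0):
    """Read `field` for a live Linux process `count` times, `interval_s` apart."""
    samples = []
    for i in range(count):
        with open(f"/proc/{pid}/status") as fh:
            samples.append((time.monotonic(), parse_status_kib(fh.read(), field)))
        if i + 1 < count:
            time.sleep(interval_s)
    return samples
```

For a hung worker, `growth_mib_per_s(sample_status(pid))` gives a rough rate that can be compared against the gdb snapshots.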

## Environment

- `nixl==0.10.1`, `nixl-cu13==0.10.1` (installed from PyPI)
- Bundled UCX: `1.20.0` (loaded from `/opt/dynamo/venv/lib/python3.12/.../nixl_cu13.mesonpy.libs/plugins/...`)
- Tested in `nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc14` runtime image, amd64

## Plugin enumeration before hang (info-level logs)

```
nixl_plugin_manager.cpp:303] Loading plugins from: .../nixl_cu13.mesonpy.libs/plugins
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: GUSLI
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: GDS
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: UCX
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: AZURE_BLOB
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: POSIX
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: OBJ
nixl_plugin_manager.cpp:460] Discovered and loaded backend plugin: GDS_MT
config.cpp:45] Modified UCX config: ADDRESS_VERSION=v2
config.cpp:45] Modified UCX config: RNDV_THRESH=inf
config.cpp:45] Modified UCX config: MAX_RMA_RAILS=2
config.cpp:42] Failed to modify UCX config: IB_PCI_RELAXED_ORDERING=try: Invalid parameter
config.cpp:45] Modified UCX config: RCACHE_MAX_UNRELEASED=1024
config.cpp:42] Failed to modify UCX config: RC_GDA_NUM_CHANNELS=4: Invalid parameter
config.cpp:45] Modified UCX config: MAX_COMPONENT_MDS=32
ucp_context.c:2463 UCX INFO Version 1.20.0 (loaded from /usr/local/ucx//lib/libucp.so.0)
ucp_context.c:2463 UCX INFO Version 1.20.0 (loaded from /usr/local/ucx//lib/libucp.so.0)
# <--- hangs here, no further output --->
```

`UCX_TLS=cuda_ipc,self,tcp` does **not** help — the filter is applied *after* `uct_md_query_tl_resources`, so the explosion happens upstream of TLS filtering.

## Workaround

Force-load TRT-LLM's bundled `libnixl.so` 0.9.0 (which links against system UCX) ahead of `nixl-cu13`'s `DT_RPATH`:

```dockerfile
ENV LD_PRELOAD=/usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/nixl/libnixl.so
```

NIXL 0.9.0 + system UCX 1.20.0 does **not** trigger the bug; the same UCX version works fine, suggesting the regression is in how `nixl-cu13`'s plugin uses UCX (possibly the modified config or `ucp_init` argument shape) rather than in UCX itself.
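
To confirm which `libnixl.so` and UCX libraries a process actually mapped (for example, to check that the `LD_PRELOAD` took effect), something like the following sketch can help. It only parses `/proc/<pid>/maps` on Linux; `mapped_libraries` and its default name filters are hypothetical choices made for this illustration.

```python
import os


def mapped_libraries(maps_text, needles=("libnixl", "libucp", "libuct", "libucs")):
    """Return sorted unique paths from /proc/<pid>/maps text whose basename matches."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, then an optional pathname.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if any(n in os.path.basename(path) for n in needles):
                paths.add(path)
    return sorted(paths)
```

Reading `/proc/<pid>/maps` of a worker and passing the text to `mapped_libraries` lists the resolved paths, which makes it easy to see whether the wheel's bundled copies or the system ones were loaded.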

## Ask

- Is the `md_resources` enumeration loop a known bug in either `nixl-cu13`'s UCX plugin or UCX 1.20.0?
- Can `nixl-cu13` 0.10.x ship with a fixed UCX or avoid the codepath?
- Is there a documented workaround other than `LD_PRELOAD`?

Tracked downstream in https://github.com/ai-dynamo/dynamo/pull/9654 — once a fix lands, we can drop the LD_PRELOAD.

