GPU training crashes with CUDA_ERROR_NOT_SUPPORTED on NVIDIA A40-48C vGPU profile: CheckVmAlloc's driver-version heuristic doesn't detect VMM-blocked vGPU profiles (#12217)

XGBoost 3.x (every version from `3.0.0` through `3.2.0`, including the fixes from #11391 / #11434) crashes at the very first GPU allocation when run on an NVIDIA A40 GPU exposed to the guest VM through the **A40-48C** vGPU profile, under a modern host driver (`570.158.01` / CUDA 12.8 native). The failing driver call is `cuMemGetAllocationGranularity`, which the vGPU host scheduler refuses with `CUDA_ERROR_NOT_SUPPORTED` even though both the CUDA driver API and `nvidia-smi` report a fully modern stack.

Because both heuristics inside `dh::CheckVmAlloc()` pass, XGBoost selects its CUDA Virtual Memory Management (VMM) allocator and dies at first allocation. The legacy `cudaMalloc` fallback path **does** work on this vGPU — it just isn't reachable through configuration.

I have a working local workaround (a tiny `nvidia-smi` shim that returns a fabricated old driver to fool `GetVersionFromSmi`), but that's fragile and not suitable for a public project. Filing this issue so the maintainers are aware.

## Environment

| Component | Version |
|---|---|
| XGBoost | bisected — fails on 3.0.0, 3.0.5, 3.1.0, 3.1.3, **3.2.0** (latest) |
| Python | 3.12.13 (CPython, conda-forge) |
| OS | Ubuntu 24.04.4 LTS (kernel 6.8.0-117-generic) |
| glibc | 2.39 |
| gcc | 13.3.0 |
| GPU | NVIDIA A40 (Ampere, compute capability 8.6), **passed through to a single VM** (the card is dedicated, with no multi-tenant sharing). It is nonetheless attached through NVIDIA's vGPU stack with the full-framebuffer **A40-48C** profile (`Product Brand: NVIDIA Virtual Compute Server`, `Virtualization Mode: VGPU` in `nvidia-smi -q`). The guest therefore gets all 48 GB and is the only consumer of the card, but the vGPU layer is still in the path and vGPU API restrictions (specifically the CUDA VMM family) still apply. |
| NVIDIA driver (host & guest) | 570.158.01 |
| NVML | 570.158 |
| nvidia-smi reported CUDA version | 12.8 |
| `cuDriverGetVersion` (libcuda.so) | 12080 (CUDA 12.8) |
| CUDA runtime libs in env (from `torch==2.11.0+cu129`) | 12.9 |
| Install command | `pip install xgboost==3.2.0` |
| Container | none (bare conda env on Ubuntu 24.04) |

The same hardware previously ran **xgboost 2.1.4** GPU training successfully on the same vGPU profile and driver. The regression is specific to the 3.x line.

### `nvidia-smi` output

Default `nvidia-smi`:

```
Mon May 18 23:28:52 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01             Driver Version: 570.158.01     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40-48C                 On  |   00000000:00:06.0 Off |                    0 |
| N/A   N/A    P0            N/A  /  N/A  |      24MiB /  49152MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

CSV summary:

```
$ nvidia-smi --query-gpu=name,driver_version,vbios_version,compute_cap,memory.total --format=csv
name, driver_version, vbios_version, compute_cap, memory.total [MiB]
NVIDIA A40-48C, 570.158.01, 00.00.00.00.00, 8.6, 49152 MiB
```

The `--query-gpu=driver_version` form (which is what `dh::CheckVmAlloc` shells out to via `xgboost::cudr::GetVersionFromSmi`):

```
$ nvidia-smi --query-gpu=driver_version --format=csv
driver_version
570.158.01
```

Relevant excerpts from `nvidia-smi -q` confirming this is a vCS vGPU profile, not bare-metal:

```
Driver Version                            : 570.158.01
CUDA Version                              : 12.8

    Product Name                          : NVIDIA A40-48C
    Product Brand                         : NVIDIA Virtual Compute Server
    Product Architecture                  : Ampere
    Display Mode                          : Enabled

    GPU Virtualization Mode
        Virtualization Mode               : VGPU
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    vGPU Software Licensed Product
        Product Name                      : NVIDIA Virtual Compute Server
        License Status                    : Licensed (Expiry: 2026-5-19 19:43:1 GMT)

    Compute Mode                          : Default
```

## Steps to Reproduce

Minimal reproducible script (`mre.py`):

```python
import numpy as np
import xgboost as xgb

print(f"xgboost            = {xgb.__version__}")
print(f"build_info.USE_CUDA = {xgb.build_info()['USE_CUDA']}")

# A trivial 16-row dataset is enough — failure is at allocator init,
# not at any actual training step.
X = np.zeros((16, 4), dtype=np.float32)
X[:, 0] = np.arange(16)
y = (np.arange(16) > 7).astype(np.float32)
dm = xgb.DMatrix(X, label=y)

xgb.train(
    {
        "objective": "binary:logistic",
        "tree_method": "hist",
        "device": "cuda",
        "verbosity": 1,
    },
    dm,
    num_boost_round=1,
)
print("OK")
```

Run on a VM that's been assigned an A40-48C vGPU profile:

```
$ python mre.py
xgboost            = 3.2.0
build_info.USE_CUDA = True
[traceback below]
```

## Error Message

```
[23:12:42] /__w/xgboost/xgboost/src/common/cuda_dr_utils.cc:92:
GetGlobalCuDriverApi().cuMemGetAllocationGranularity(
    &granularity, prop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED)
[/__w/xgboost/xgboost/src/common/cuda_dr_utils.h:107]:
CUDA driver error: CUDA_ERROR_NOT_SUPPORTED. operation not supported

Stack trace:
  [bt] (0) .../xgboost/lib/libxgboost.so(+0x2c1a8c) [0x...]
  [bt] (1) .../xgboost/lib/libxgboost.so(+0x3a7c1d) [0x...]
  [bt] (2) .../xgboost/lib/libxgboost.so(+0xb7d4b9) [0x...]
  [bt] (3) .../xgboost/lib/libxgboost.so(+0xb7c3bc) [0x...]
  [bt] (4) .../xgboost/lib/libxgboost.so(+0x141cc23) [0x...]
  [bt] (5) .../xgboost/lib/libxgboost.so(+0x141dd76) [0x...]
  [bt] (6) .../xgboost/lib/libxgboost.so(+0x142a2b9) [0x...]
  [bt] (7) .../xgboost/lib/libxgboost.so(+0x6dc8a1) [0x...]
  [bt] (8) .../xgboost/lib/libxgboost.so(+0x6dd972) [0x...]
```

Crash is at the **first** GPU allocation; no training round has started yet. The driver call returning `CUDA_ERROR_NOT_SUPPORTED` is `cuMemGetAllocationGranularity`, which is part of the CUDA Virtual Memory Management API family.
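To check the driver call in isolation, without XGBoost in the process, something like the following `ctypes` sketch could be used. It is an illustration, not XGBoost's code: the structure layout and the numeric constants are transcribed from `cuda.h` as I understand it and should be verified against the installed header, the allocation properties XGBoost actually passes may differ, and `probe_vmm` is a hypothetical helper name.

```python
import ctypes

# Constants as defined in cuda.h (assumption: verify against your header).
CUDA_SUCCESS = 0
CUDA_ERROR_NOT_SUPPORTED = 801
CU_MEM_ALLOCATION_TYPE_PINNED = 1
CU_MEM_LOCATION_TYPE_DEVICE = 1
CU_MEM_ALLOC_GRANULARITY_RECOMMENDED = 1


class CUmemLocation(ctypes.Structure):
    _fields_ = [("type", ctypes.c_int), ("id", ctypes.c_int)]


class CUmemAllocFlags(ctypes.Structure):
    _fields_ = [
        ("compressionType", ctypes.c_ubyte),
        ("gpuDirectRDMACapable", ctypes.c_ubyte),
        ("usage", ctypes.c_ushort),
        ("reserved", ctypes.c_ubyte * 4),
    ]


class CUmemAllocationProp(ctypes.Structure):
    _fields_ = [
        ("type", ctypes.c_int),
        ("requestedHandleTypes", ctypes.c_int),
        ("location", CUmemLocation),
        ("win32HandleMetaData", ctypes.c_void_p),
        ("allocFlags", CUmemAllocFlags),
    ]


def probe_vmm(driver, device_id=0):
    """Return the raw status code of cuMemGetAllocationGranularity.

    `driver` is anything exposing cuInit and cuMemGetAllocationGranularity,
    normally ctypes.CDLL("libcuda.so.1").
    """
    status = driver.cuInit(0)
    if status != CUDA_SUCCESS:
        return status
    prop = CUmemAllocationProp()  # zero-initialised by ctypes
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE
    prop.location.id = device_id
    granularity = ctypes.c_size_t(0)
    return driver.cuMemGetAllocationGranularity(
        ctypes.byref(granularity),
        ctypes.byref(prop),
        CU_MEM_ALLOC_GRANULARITY_RECOMMENDED,
    )


if __name__ == "__main__":
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError as exc:
        print(f"libcuda not loadable: {exc}")
    else:
        code = probe_vmm(libcuda)
        print(f"status = {code} (801 would be CUDA_ERROR_NOT_SUPPORTED)")
```

If the status printed on the affected VM is 801 while the same script returns 0 on a bare-metal GPU, that would point at the vGPU layer rather than at XGBoost's allocator code.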

## Analysis — why the PR #11391 fallback doesn't trigger here

The gate in `src/common/device_helpers.cu` (post-PR-#11391):

```cpp
[[nodiscard]] bool IsSupportedDrVer(std::int32_t major, std::int32_t minor) {
  return major > 12 || (major == 12 && minor >= 5);
}

[[nodiscard]] bool CheckVmAlloc() {
  std::call_once(once, [] {
    std::int32_t major{0}, minor{0};
    xgboost::curt::DrVersion(&major, &minor);
    if (IsSupportedDrVer(major, minor)) {
      vm_flag = xgboost::cudr::GetVersionFromSmi(&major, &minor) && major >= 555;
    } else {
      vm_flag = false;
    }
  });
  return vm_flag;
}
```

On this system both gates pass:

1. `xgboost::curt::DrVersion` (via `cuDriverGetVersion`) reports `12.8` → `IsSupportedDrVer(12, 8)` is **true**.
2. `xgboost::cudr::GetVersionFromSmi` parses `nvidia-smi --query-gpu=driver_version --format=csv` → `570.158.01` → `major = 570 >= 555` is **true**.

So `vm_flag = true`, XGBoost picks the VMM allocator, and the very first `cuMemGetAllocationGranularity` call returns `NOT_SUPPORTED`.
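The decision above can be modelled in a few lines of Python. This is only a model of the C++ gate quoted earlier: `parse_smi_major` is a hypothetical stand-in for whatever parsing `GetVersionFromSmi` really does, and the decoding of the `cuDriverGetVersion` integer assumes the usual `1000 * major + 10 * minor` encoding.

```python
def parse_smi_major(csv_text):
    """Driver major version from `nvidia-smi --query-gpu=driver_version --format=csv`."""
    lines = [ln.strip() for ln in csv_text.splitlines() if ln.strip()]
    if len(lines) < 2:  # expect a header line followed by at least one GPU row
        return None
    try:
        return int(lines[1].split(".")[0])
    except ValueError:
        return None


def is_supported_dr_ver(major, minor):
    # Mirrors IsSupportedDrVer: CUDA driver API 12.5 or newer.
    return major > 12 or (major == 12 and minor >= 5)


def check_vm_alloc(cu_driver_version, smi_output):
    """Model of the two version gates; True means the VMM allocator is selected."""
    major = cu_driver_version // 1000
    minor = (cu_driver_version % 1000) // 10
    if not is_supported_dr_ver(major, minor):
        return False
    smi_major = parse_smi_major(smi_output)
    return smi_major is not None and smi_major >= 555


# Values reported on the affected VM: both gates pass, so VMM is selected.
print(check_vm_alloc(12080, "driver_version\n570.158.01\n"))  # True
# The spoofed driver version used by the workaround flips the second gate.
print(check_vm_alloc(12080, "driver_version\n535.161.08\n"))  # False
```

Nothing in either input describes whether the VMM calls are permitted, which is why the real `570.158.01` values select an allocator that cannot work here.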

The PR #11391 fallback correctly addresses the **libcuda.so / system-driver mismatch** case (#11397), but the vGPU case is fundamentally different: the driver version *is* modern, libcuda *is* modern, and the API surface advertises VMM, yet the **host-side vGPU scheduler refuses VMM calls** at runtime. No version comparison can detect this. Detecting it requires either a feature probe (call the VMM API once and fall back if it fails) or an explicit operator-controlled opt-out (environment variable or config field).

Notably, the **cudaMalloc fallback path is healthy on this vGPU** — when I trick `CheckVmAlloc` into returning `false`, training works end to end at full speed.

## Workaround currently in use (please don't recommend this — it's why I'm filing)

I ship a small bash shim that intercepts only the one specific `nvidia-smi --query-gpu=driver_version --format=csv` call XGBoost uses, returns a fabricated `535.161.08` (`< 555`), and forwards every other invocation to `/usr/bin/nvidia-smi` unchanged:

```bash
#!/bin/bash
case " $* " in
  *" --query-gpu=driver_version "*)
    echo "driver_version"
    echo "535.161.08"
    exit 0
    ;;
esac
exec /usr/bin/nvidia-smi "$@"
```

The shim is prepended to `PATH` before any `import xgboost`, so `CheckVmAlloc`'s `std::call_once` caches `vm_flag = false`. Training runs to completion on GPU. torch / ctranslate2 / monitoring tools call the CUDA driver API directly and never query `nvidia-smi`, so they see the real driver and are unaffected.
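For completeness, this is roughly how the `PATH` change can be made from Python before XGBoost is imported. It is a sketch of my local setup, not a recommendation: `prepend_shim_dir` is a hypothetical helper, the directory is a placeholder, and it assumes the `nvidia-smi` lookup inherits the process environment.

```python
import os


def prepend_shim_dir(shim_dir, environ=os.environ):
    """Put shim_dir first on PATH, dropping any earlier copy of it."""
    parts = [
        p for p in environ.get("PATH", "").split(os.pathsep)
        if p and p != shim_dir
    ]
    environ["PATH"] = os.pathsep.join([shim_dir] + parts)
    return environ["PATH"]


# Usage (placeholder path), done before the first `import xgboost`:
#   prepend_shim_dir("/opt/xgb-shim")
#   import xgboost as xgb
```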

This works but is brittle: it relies on intercepting a specific shell command at a specific point in the process lifetime, and it's not something I'd want every operator of every vGPU-deployed XGBoost installation to have to reproduce.

## Suggested fixes (in increasing order of effort)

1. **Environment-variable opt-out**, e.g. `XGBOOST_CUDA_DISABLE_VMM=1`. Smallest possible patch — read the env var at the top of `CheckVmAlloc`'s `call_once` lambda and force `vm_flag = false` if set. This alone would let operators choose the legacy allocator without shell tricks.
2. **Runtime probe**: replace (or augment) the `nvidia-smi` major-version check with an actual `cuMemGetAllocationGranularity` call on a trivial allocation property, with the error handled rather than propagated (a try/catch around the checked call, or a direct test of the returned status). If it returns `CUDA_ERROR_NOT_SUPPORTED`, set `vm_flag = false`. The cost is one driver-API call at first GPU use, and it covers vGPU cases without per-host configuration.
3. **vGPU detection heuristic**: parse `nvidia-smi --query-gpu=name` (or `Product Brand` from `nvidia-smi -q`) and pattern-match the `-NNC` (vCS) vGPU suffixes / the `Virtual Compute Server` brand. Less general than (1) or (2), but doesn't require running another CUDA call.
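How the three options could combine is sketched below as a Python model of the decision order. The real change would live in the C++ `CheckVmAlloc`. The names here are illustrative: `XGBOOST_CUDA_DISABLE_VMM` is only the variable proposed in this issue, `choose_vmm` and its helpers are hypothetical, and the `-NNC` regular expression is my reading of the vCS profile naming rather than an exhaustive rule.

```python
import os
import re

# Matches names ending in "-<digits>C", e.g. "NVIDIA A40-48C" (assumed vCS pattern).
VGPU_COMPUTE_SUFFIX = re.compile(r"-\d+C$")


def vmm_disabled_by_env(environ=os.environ):
    """Option 1: explicit operator opt-out via the proposed environment variable."""
    value = environ.get("XGBOOST_CUDA_DISABLE_VMM", "").strip().lower()
    return value not in ("", "0", "false")


def looks_like_vcs_profile(gpu_name):
    """Option 3: name-based guess that the device is a vCS vGPU profile."""
    return bool(VGPU_COMPUTE_SUFFIX.search(gpu_name.strip()))


def choose_vmm(version_gate_passed, probe_status=None, gpu_name="",
               environ=os.environ):
    """Return True if the VMM allocator should be used.

    probe_status is the status code from a trial cuMemGetAllocationGranularity
    call (0 means success), or None if no probe was run (option 2).
    """
    if vmm_disabled_by_env(environ):
        return False
    if not version_gate_passed:
        return False
    if probe_status is not None:
        return probe_status == 0
    return not looks_like_vcs_profile(gpu_name)


# On the affected VM the probe would report 801 (CUDA_ERROR_NOT_SUPPORTED).
print(choose_vmm(True, probe_status=801, environ={}))            # False
print(choose_vmm(True, gpu_name="NVIDIA A40-48C", environ={}))   # False
print(choose_vmm(True, gpu_name="NVIDIA A40", environ={}))       # True
```

The opt-out is checked first so that an operator can always force the legacy allocator, and the probe result takes precedence over the name heuristic because it tests the capability directly.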

## Closing note

For context: I tested the existing fallback explicitly by spoofing the `nvidia-smi` output. With `vm_flag = false`, XGBoost 3.2.0 trains a `rank:ndcg` model with 50 groups × 10 candidates × 2,373 features in 0.28 s on this vGPU. The legacy path is fast and stable; the only barrier is reaching it from configuration.

