GPU training crashes with CUDA_ERROR_NOT_SUPPORTED on NVIDIA A40-48C vGPU profile — CheckVmAlloc's driver-version heuristic doesn't detect VMM-blocked vGPU profiles #12217

@maged-mostafa-ds

Description

XGBoost 3.x (every version from 3.0.0 through 3.2.0, including the fixes from #11391 / #11434) crashes at the very first GPU allocation when run on an NVIDIA A40 GPU exposed to the guest VM through the A40-48C vGPU profile, under a modern host driver (570.158.01 / CUDA 12.8 native). The failing driver call is cuMemGetAllocationGranularity, which the vGPU host scheduler refuses with CUDA_ERROR_NOT_SUPPORTED even though both the CUDA driver API and nvidia-smi report a fully modern stack.

Because both heuristics inside dh::CheckVmAlloc() pass, XGBoost selects its CUDA Virtual Memory Management (VMM) allocator and dies at first allocation. The legacy cudaMalloc fallback path does work on this vGPU — it just isn't reachable through configuration.

I have a working local workaround (a tiny nvidia-smi shim that returns a fabricated old driver to fool GetVersionFromSmi), but that's fragile and not suitable for a public project. Filing this issue so the maintainers are aware.

Environment

Component | Version
XGBoost | bisected; fails on 3.0.0, 3.0.5, 3.1.0, 3.1.3, 3.2.0 (latest)
Python | 3.12.13 (CPython, conda-forge)
OS | Ubuntu 24.04.4 LTS (kernel 6.8.0-117-generic)
glibc | 2.39
gcc | 13.3.0
GPU | NVIDIA A40 (Ampere, compute capability 8.6), passed through to a single VM (the card is dedicated, with no multi-tenant sharing). It is nonetheless attached through NVIDIA's vGPU stack with the full-frame A40-48C profile (Product Brand: NVIDIA Virtual Compute Server, Virtualization Mode: VGPU in nvidia-smi -q). The guest therefore gets all 48 GB and is the only consumer of the card, but the vGPU layer is still in the path and vGPU API restrictions (specifically the CUDA VMM family) still apply.
NVIDIA driver (host & guest) | 570.158.01
NVML | 570.158
nvidia-smi reported CUDA version | 12.8
cuDriverGetVersion (libcuda.so) | 12080 (CUDA 12.8)
CUDA runtime libs in env (from torch==2.11.0+cu129) | 12.9
Install command | pip install xgboost==3.2.0
Container | none (bare conda env on Ubuntu 24.04)
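
For readers unfamiliar with the 12080 value in the cuDriverGetVersion row: the CUDA driver API packs its version as 1000 * major + 10 * minor. A minimal sketch of the decoding follows; the helper name decode_cuda_version is illustrative and not part of XGBoost or CUDA.

```python
from typing import Tuple


def decode_cuda_version(packed: int) -> Tuple[int, int]:
    """Split a packed CUDA version (1000 * major + 10 * minor) into (major, minor)."""
    major, rest = divmod(packed, 1000)
    return major, rest // 10


print(decode_cuda_version(12080))  # (12, 8)
```

This is the (major, minor) pair that the first gate in CheckVmAlloc ends up comparing against 12.5.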

The same hardware previously ran XGBoost 2.1.4 GPU training successfully on the same vGPU profile and driver. The regression is specific to the 3.x line.

nvidia-smi output

Default nvidia-smi:

Mon May 18 23:28:52 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01             Driver Version: 570.158.01     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40-48C                 On  |   00000000:00:06.0 Off |                    0 |
| N/A   N/A    P0            N/A  /  N/A  |      24MiB /  49152MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

CSV summary:

$ nvidia-smi --query-gpu=name,driver_version,vbios_version,compute_cap,memory.total --format=csv
name, driver_version, vbios_version, compute_cap, memory.total [MiB]
NVIDIA A40-48C, 570.158.01, 00.00.00.00.00, 8.6, 49152 MiB

The --query-gpu=driver_version form (which is what dh::CheckVmAlloc shells out to via xgboost::cudr::GetVersionFromSmi):

$ nvidia-smi --query-gpu=driver_version --format=csv
driver_version
570.158.01

Relevant excerpts from nvidia-smi -q confirming this is a vCS vGPU profile, not bare-metal:

Driver Version                            : 570.158.01
CUDA Version                              : 12.8

    Product Name                          : NVIDIA A40-48C
    Product Brand                         : NVIDIA Virtual Compute Server
    Product Architecture                  : Ampere
    Display Mode                          : Enabled

    GPU Virtualization Mode
        Virtualization Mode               : VGPU
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    vGPU Software Licensed Product
        Product Name                      : NVIDIA Virtual Compute Server
        License Status                    : Licensed (Expiry: 2026-5-19 19:43:1 GMT)

    Compute Mode                          : Default
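
The fields above can also be read programmatically. The following is a rough sketch that assumes the "Key : Value" layout shown in the excerpt; the function names are hypothetical, and treating Virtualization Mode == "VGPU" as the marker of a vGPU guest is an assumption based only on the output in this issue.

```python
from typing import Dict

SAMPLE = """\
Driver Version                            : 570.158.01
CUDA Version                              : 12.8

    Product Name                          : NVIDIA A40-48C
    Product Brand                         : NVIDIA Virtual Compute Server
    GPU Virtualization Mode
        Virtualization Mode               : VGPU
    vGPU Software Licensed Product
        Product Name                      : NVIDIA Virtual Compute Server
        License Status                    : Licensed (Expiry: 2026-5-19 19:43:1 GMT)
"""


def parse_smi_q(text: str) -> Dict[str, str]:
    """Collect 'Key : Value' lines from `nvidia-smi -q` style output.

    The first occurrence of a key wins, so the GPU-level 'Product Name'
    is kept rather than the one nested under the licensing block.
    """
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # section headers have no ' : ' and are skipped
            fields.setdefault(key.strip(), value.strip())
    return fields


def looks_like_vgpu_guest(fields: Dict[str, str]) -> bool:
    """Assumption: a guest on a vGPU profile reports Virtualization Mode 'VGPU'."""
    return fields.get("Virtualization Mode") == "VGPU"


fields = parse_smi_q(SAMPLE)
print(fields["Product Name"], "|", fields["Product Brand"], "|", looks_like_vgpu_guest(fields))
```

In practice the text would come from running nvidia-smi -q; it is passed in as a string here so the parsing can be checked without a GPU.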

Steps to Reproduce

Minimal reproducible script (mre.py):

import numpy as np
import xgboost as xgb

print(f"xgboost            = {xgb.__version__}")
print(f"build_info.USE_CUDA = {xgb.build_info()['USE_CUDA']}")

# A trivial 16-row dataset is enough — failure is at allocator init,
# not at any actual training step.
X = np.zeros((16, 4), dtype=np.float32)
X[:, 0] = np.arange(16)
y = (np.arange(16) > 7).astype(np.float32)
dm = xgb.DMatrix(X, label=y)

xgb.train(
    {
        "objective": "binary:logistic",
        "tree_method": "hist",
        "device": "cuda",
        "verbosity": 1,
    },
    dm,
    num_boost_round=1,
)
print("OK")

Run on a VM that's been assigned an A40-48C vGPU profile:

$ python mre.py
xgboost            = 3.2.0
build_info.USE_CUDA = True
[traceback below]

Error Message

[23:12:42] /__w/xgboost/xgboost/src/common/cuda_dr_utils.cc:92:
GetGlobalCuDriverApi().cuMemGetAllocationGranularity(
    &granularity, prop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED)
[/__w/xgboost/xgboost/src/common/cuda_dr_utils.h:107]:
CUDA driver error: CUDA_ERROR_NOT_SUPPORTED. operation not supported

Stack trace:
  [bt] (0) .../xgboost/lib/libxgboost.so(+0x2c1a8c) [0x...]
  [bt] (1) .../xgboost/lib/libxgboost.so(+0x3a7c1d) [0x...]
  [bt] (2) .../xgboost/lib/libxgboost.so(+0xb7d4b9) [0x...]
  [bt] (3) .../xgboost/lib/libxgboost.so(+0xb7c3bc) [0x...]
  [bt] (4) .../xgboost/lib/libxgboost.so(+0x141cc23) [0x...]
  [bt] (5) .../xgboost/lib/libxgboost.so(+0x141dd76) [0x...]
  [bt] (6) .../xgboost/lib/libxgboost.so(+0x142a2b9) [0x...]
  [bt] (7) .../xgboost/lib/libxgboost.so(+0x6dc8a1) [0x...]
  [bt] (8) .../xgboost/lib/libxgboost.so(+0x6dd972) [0x...]

Crash is at the first GPU allocation; no training round has started yet. The driver call returning CUDA_ERROR_NOT_SUPPORTED is cuMemGetAllocationGranularity, which is part of the CUDA Virtual Memory Management API family.
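
For anyone triaging similar reports, this failure can be told apart from other CUDA errors by the two strings in the message. The sketch below only inspects error text; the helper name is hypothetical, and matching on message substrings is an assumption that holds for the message quoted above but may not survive future rewording.

```python
VMM_MARKERS = ("cuMemGetAllocationGranularity", "CUDA_ERROR_NOT_SUPPORTED")

# Error text as quoted in this issue (path prefix and stack trace omitted).
SAMPLE_ERROR = (
    "GetGlobalCuDriverApi().cuMemGetAllocationGranularity(\n"
    "    &granularity, prop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED)\n"
    "CUDA driver error: CUDA_ERROR_NOT_SUPPORTED. operation not supported\n"
)


def is_vmm_not_supported(message: str) -> bool:
    """True if the message names both the VMM granularity call and NOT_SUPPORTED."""
    return all(marker in message for marker in VMM_MARKERS)


print(is_vmm_not_supported(SAMPLE_ERROR))  # True
```

With XGBoost installed, the message would typically be obtained by catching xgboost.core.XGBoostError around xgb.train and passing str(err) to the helper.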

Analysis — why the PR #11391 fallback doesn't trigger here

The gate in src/common/device_helpers.cu (post-PR-#11391):

[[nodiscard]] bool IsSupportedDrVer(std::int32_t major, std::int32_t minor) {
  return major > 12 || (major == 12 && minor >= 5);
}

[[nodiscard]] bool CheckVmAlloc() {
  std::call_once(once, [] {
    std::int32_t major{0}, minor{0};
    xgboost::curt::DrVersion(&major, &minor);
    if (IsSupportedDrVer(major, minor)) {
      vm_flag = xgboost::cudr::GetVersionFromSmi(&major, &minor) && major >= 555;
    } else {
      vm_flag = false;
    }
  });
  return vm_flag;
}

On this system both gates pass:

  1. xgboost::curt::DrVersion (via cuDriverGetVersion) reports 12.8, so IsSupportedDrVer(12, 8) is true.
  2. xgboost::cudr::GetVersionFromSmi parses the output of nvidia-smi --query-gpu=driver_version --format=csv, which is 570.158.01, so major = 570 >= 555 is true.

So vm_flag = true, XGBoost picks the VMM allocator, and the very first cuMemGetAllocationGranularity call returns NOT_SUPPORTED.
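
The decision can be traced with a small Python model of the C++ gate quoted above. This is a sketch, not XGBoost code: the internals of GetVersionFromSmi are not shown in this issue, so smi_major below is an assumption about how the CSV output is parsed.

```python
from typing import Optional, Tuple


def is_supported_dr_ver(major: int, minor: int) -> bool:
    """Mirror of IsSupportedDrVer: driver API version must be at least 12.5."""
    return major > 12 or (major == 12 and minor >= 5)


def smi_major(csv_output: str) -> Optional[int]:
    """Major driver version from the two-line CSV output, or None if unparseable."""
    lines = [ln.strip() for ln in csv_output.splitlines() if ln.strip()]
    if len(lines) < 2 or lines[0] != "driver_version":
        return None
    head = lines[1].split(".")[0]
    return int(head) if head.isdigit() else None


def check_vm_alloc(driver_api_version: Tuple[int, int], smi_output: str) -> bool:
    """Model of CheckVmAlloc without the call_once caching."""
    major, minor = driver_api_version
    if not is_supported_dr_ver(major, minor):
        return False
    smi = smi_major(smi_output)
    return smi is not None and smi >= 555


# Values reported on the affected VM: both gates pass, so VMM is selected.
print(check_vm_alloc((12, 8), "driver_version\n570.158.01\n"))  # True
# A reported driver below 555 flips the result.
print(check_vm_alloc((12, 8), "driver_version\n535.161.08\n"))  # False
```

Nothing in either input distinguishes a vGPU guest from bare metal, which is why the heuristic cannot catch this case.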

The PR #11391 fallback correctly addresses the libcuda.so / system-driver mismatch case (#11397), but the vGPU case is fundamentally different: the driver version is modern, libcuda is modern, and the API surface advertises VMM, yet the host-side vGPU scheduler refuses VMM calls at runtime. No version comparison can detect this; it requires either a feature probe (try a tiny VMM call and catch the failure), or an explicit operator-controlled opt-out (env var / config field).
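
A feature probe of this kind could look roughly like the following ctypes sketch, which issues the same cuMemGetAllocationGranularity call and inspects the returned CUresult. It is untested on the affected vGPU. The struct layout and enum values are transcribed from the CUDA driver API header and should be checked against the installed cuda.h; the function names are hypothetical; and whether the call succeeds without an active context on every driver is an assumption.

```python
import ctypes
from typing import Optional

CUDA_SUCCESS = 0
CUDA_ERROR_NOT_SUPPORTED = 801
CU_MEM_ALLOCATION_TYPE_PINNED = 1
CU_MEM_LOCATION_TYPE_DEVICE = 1
CU_MEM_ALLOC_GRANULARITY_RECOMMENDED = 1


class CUmemLocation(ctypes.Structure):
    _fields_ = [("type", ctypes.c_int), ("id", ctypes.c_int)]


class CUmemAllocFlags(ctypes.Structure):
    _fields_ = [
        ("compressionType", ctypes.c_ubyte),
        ("gpuDirectRDMACapable", ctypes.c_ubyte),
        ("usage", ctypes.c_ushort),
        ("reserved", ctypes.c_ubyte * 4),
    ]


class CUmemAllocationProp(ctypes.Structure):
    _fields_ = [
        ("type", ctypes.c_int),
        ("requestedHandleTypes", ctypes.c_int),
        ("location", CUmemLocation),
        ("win32HandleMetaData", ctypes.c_void_p),
        ("allocFlags", CUmemAllocFlags),
    ]


def interpret(result: int) -> Optional[bool]:
    """Map a CUresult to: True (VMM usable), False (refused), None (inconclusive)."""
    if result == CUDA_SUCCESS:
        return True
    if result == CUDA_ERROR_NOT_SUPPORTED:
        return False
    return None


def probe_vmm(device: int = 0) -> Optional[bool]:
    """Ask the driver for the VMM allocation granularity on one device."""
    try:
        cuda = ctypes.CDLL("libcuda.so.1")
        if cuda.cuInit(0) != CUDA_SUCCESS:
            return None
        prop = CUmemAllocationProp()
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE
        prop.location.id = device
        granularity = ctypes.c_size_t(0)
        result = cuda.cuMemGetAllocationGranularity(
            ctypes.byref(granularity),
            ctypes.byref(prop),
            ctypes.c_int(CU_MEM_ALLOC_GRANULARITY_RECOMMENDED),
        )
    except (OSError, AttributeError):
        # No driver library, or a driver too old to export the symbol.
        return None
    return interpret(result)


print(probe_vmm())
```

A result of None (no driver, no device, or an unexpected error code) is deliberately kept separate from False, so a caller can decide whether to fall back to the legacy allocator or surface the error.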

Notably, the cudaMalloc fallback path is healthy on this vGPU — when I trick CheckVmAlloc into returning false, training works end to end at full speed.

Workaround currently in use (please don't recommend this — it's why I'm filing)

I ship a small bash shim that intercepts only the one specific nvidia-smi --query-gpu=driver_version --format=csv call XGBoost uses, returns a fabricated 535.161.08 (< 555), and forwards every other invocation to /usr/bin/nvidia-smi unchanged:

#!/bin/bash
case " $* " in
  *" --query-gpu=driver_version "*)
    echo "driver_version"
    echo "535.161.08"
    exit 0
    ;;
esac
exec /usr/bin/nvidia-smi "$@"

The shim is prepended to PATH before any import xgboost, so CheckVmAlloc's std::call_once caches vm_flag = false. Training runs to completion on GPU. torch / ctranslate2 / monitoring tools call the CUDA driver API directly and never query nvidia-smi, so they see the real driver and are unaffected.
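
The ordering constraint is the fragile part. For anyone reproducing the experiment, the sketch below only checks that a shim directory placed first on PATH is the one that gets resolved; it uses a stand-in file rather than the real shim, and the helper name is illustrative.

```python
import os
import shutil
import tempfile


def path_with_shim(shim_dir: str, path: str) -> str:
    """Return a PATH value that searches shim_dir before everything else."""
    return shim_dir + os.pathsep + path


with tempfile.TemporaryDirectory() as shim_dir:
    shim = os.path.join(shim_dir, "nvidia-smi")
    with open(shim, "w") as fh:
        fh.write("#!/bin/bash\nexit 0\n")  # stand-in body, not the real shim
    os.chmod(shim, 0o755)

    new_path = path_with_shim(shim_dir, os.environ.get("PATH", ""))
    resolved = shutil.which("nvidia-smi", path=new_path)
    shim_wins = resolved == shim

print(shim_wins)
```

In a real process the new value would have to be assigned to os.environ["PATH"] before XGBoost first evaluates CheckVmAlloc, since the result is cached for the lifetime of the process.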

This works but is brittle: it relies on intercepting a specific shell command at a specific point in the process lifetime, and it's not something I'd want every operator of every vGPU-deployed XGBoost installation to have to reproduce.

Suggested fixes (in increasing order of effort)

  1. Environment-variable opt-out, e.g. XGBOOST_CUDA_DISABLE_VMM=1. Smallest possible patch — read the env var at the top of CheckVmAlloc's call_once lambda and force vm_flag = false if set. This alone would let operators choose the legacy allocator without shell tricks.
  2. Runtime probe: replace (or augment) the nvidia-smi major-version check with an actual cuMemGetAllocationGranularity call on a trivial property, with the failure caught instead of propagated. If the call reports CUDA_ERROR_NOT_SUPPORTED, set vm_flag = false. Cost is one driver-API call at first GPU use; covers vGPU cases without needing per-host configuration.
  3. vGPU detection heuristic: parse nvidia-smi --query-gpu=name (or Product Brand from nvidia-smi -q) and pattern-match the -NNC (vCS) vGPU suffixes / the Virtual Compute Server brand. Less general than (1) or (2), but doesn't require running another CUDA call.
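
Options (1) and (3) can be sketched in a few lines of Python to show how they would layer on top of the existing version gates. XGBOOST_CUDA_DISABLE_VMM is the name proposed in this issue and does not exist in XGBoost today; the accepted truthy values, the regular expression, and all function names are assumptions made for illustration.

```python
import os
import re
from typing import Mapping

# Proposed in this issue; not an existing XGBoost setting.
ENV_NAME = "XGBOOST_CUDA_DISABLE_VMM"

# Matches vCS-style profile names such as "NVIDIA A40-48C".
VCS_SUFFIX = re.compile(r"-\d+C$")


def vmm_disabled_by_env(environ: Mapping[str, str] = os.environ) -> bool:
    """Option 1: explicit operator opt-out."""
    return environ.get(ENV_NAME, "").strip().lower() in {"1", "true", "on", "yes"}


def is_vcs_profile(gpu_name: str) -> bool:
    """Option 3: name-based heuristic for vCS vGPU profiles."""
    return bool(VCS_SUFFIX.search(gpu_name.strip()))


def choose_vmm(version_gates_pass: bool, gpu_name: str,
               environ: Mapping[str, str] = os.environ) -> bool:
    """Opt-out first, then the vGPU heuristic, then the existing version gates."""
    if vmm_disabled_by_env(environ):
        return False
    if is_vcs_profile(gpu_name):
        return False
    return version_gates_pass


print(choose_vmm(True, "NVIDIA A40-48C", {}))             # False: vCS profile
print(choose_vmm(True, "NVIDIA A40", {ENV_NAME: "1"}))    # False: opt-out set
print(choose_vmm(True, "NVIDIA A40", {}))                 # True: bare metal
```

Checking the opt-out first keeps the operator in control even if the name heuristic or a probe misclassifies a device.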

Closing note

For context: I tested the existing fallback explicitly by spoofing the nvidia-smi output. With vm_flag = false, XGBoost 3.2.0 trains a rank:ndcg model with 50 groups × 10 candidates × 2,373 features in 0.28 s on this vGPU. The legacy path is fast and stable; the only barrier is reaching it from configuration.
