XGBoost 3.x (every version from 3.0.0 through 3.2.0, including the fixes from #11391 / #11434) crashes at the very first GPU allocation when run on an NVIDIA A40 GPU exposed to the guest VM through the A40-48C vGPU profile, under a modern host driver (570.158.01 / CUDA 12.8 native). The failing driver call is cuMemGetAllocationGranularity, which the vGPU host scheduler refuses with CUDA_ERROR_NOT_SUPPORTED even though both the CUDA driver API and nvidia-smi report a fully modern stack.
Because both heuristics inside dh::CheckVmAlloc() pass, XGBoost selects its CUDA Virtual Memory Management (VMM) allocator and dies at first allocation. The legacy cudaMalloc fallback path does work on this vGPU — it just isn't reachable through configuration.
I have a working local workaround (a tiny nvidia-smi shim that returns a fabricated old driver to fool GetVersionFromSmi), but that's fragile and not suitable for a public project. Filing this issue so the maintainers are aware.
Environment
| Component | Version |
|---|---|
| XGBoost | bisected: fails on 3.0.0, 3.0.5, 3.1.0, 3.1.3, 3.2.0 (latest) |
| Python | 3.12.13 (CPython, conda-forge) |
| OS | Ubuntu 24.04.4 LTS (kernel 6.8.0-117-generic) |
| glibc | 2.39 |
| gcc | 13.3.0 |
| GPU | NVIDIA A40 (Ampere, compute capability 8.6), passed through to a single VM (the card is dedicated; no multi-tenant sharing). It is nonetheless attached through NVIDIA's vGPU stack with the full-frame A40-48C profile (Product Brand: NVIDIA Virtual Compute Server, Virtualization Mode: VGPU in nvidia-smi -q). So the guest gets all 48 GB and is the only consumer of the card, but the vGPU layer is still in the path and vGPU API restrictions (specifically the CUDA VMM family) still apply. |
| NVIDIA driver (host & guest) | 570.158.01 |
| NVML | 570.158 |
| nvidia-smi reported CUDA version | 12.8 |
| cuDriverGetVersion (libcuda.so) | 12080 (CUDA 12.8) |
| CUDA runtime libs in env (from torch==2.11.0+cu129) | 12.9 |
| Install command | pip install xgboost==3.2.0 |
| Container | none (bare conda env on Ubuntu 24.04) |
The same hardware previously ran xgboost 2.1.4 GPU training successfully on the same vGPU profile and driver. The regression is specific to the 3.x line.
nvidia-smi output
Default nvidia-smi:
Mon May 18 23:28:52 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01 Driver Version: 570.158.01 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40-48C On | 00000000:00:06.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 24MiB / 49152MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
CSV summary:
$ nvidia-smi --query-gpu=name,driver_version,vbios_version,compute_cap,memory.total --format=csv
name, driver_version, vbios_version, compute_cap, memory.total [MiB]
NVIDIA A40-48C, 570.158.01, 00.00.00.00.00, 8.6, 49152 MiB
The --query-gpu=driver_version form (which is what dh::CheckVmAlloc shells out to via xgboost::cudr::GetVersionFromSmi):
$ nvidia-smi --query-gpu=driver_version --format=csv
driver_version
570.158.01
Relevant excerpts from nvidia-smi -q confirming this is a vCS vGPU profile, not bare-metal:
Driver Version : 570.158.01
CUDA Version : 12.8
Product Name : NVIDIA A40-48C
Product Brand : NVIDIA Virtual Compute Server
Product Architecture : Ampere
Display Mode : Enabled
GPU Virtualization Mode
Virtualization Mode : VGPU
Host VGPU Mode : N/A
vGPU Heterogeneous Mode : N/A
vGPU Software Licensed Product
Product Name : NVIDIA Virtual Compute Server
License Status : Licensed (Expiry: 2026-5-19 19:43:1 GMT)
Compute Mode : Default
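For anyone who wants to check whether their own guest is in the same situation, the excerpt above can be tested programmatically. The following is a minimal sketch, not anything XGBoost ships: `parse_smi_fields` and `looks_like_vgpu` are hypothetical helper names, the `-NNC` suffix pattern is an assumption based on the profile name seen here, and the bare-metal sample in the test values is illustrative rather than captured output.

```python
import re

# Excerpt of `nvidia-smi -q` output from this report.
SAMPLE_VGPU = """\
Product Name : NVIDIA A40-48C
Product Brand : NVIDIA Virtual Compute Server
GPU Virtualization Mode
Virtualization Mode : VGPU
vGPU Software Licensed Product
Product Name : NVIDIA Virtual Compute Server
"""


def parse_smi_fields(text: str) -> dict[str, str]:
    """Collect 'Key : Value' lines; the first occurrence of a key wins."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        key = key.strip()
        # Section headers have no ' : ' separator and are skipped.
        if sep and key not in fields:
            fields[key] = value.strip()
    return fields


def looks_like_vgpu(fields: dict[str, str]) -> bool:
    """Heuristic: any one of three vGPU signals is enough."""
    mode = fields.get("Virtualization Mode", "").upper()
    brand = fields.get("Product Brand", "")
    name = fields.get("Product Name", "")
    # Assumption: vCS profile names end in '-<digits>C', e.g. 'A40-48C'.
    has_vcs_suffix = re.search(r"-\d+C$", name) is not None
    return mode == "VGPU" or "Virtual Compute Server" in brand or has_vcs_suffix


if __name__ == "__main__":
    print(looks_like_vgpu(parse_smi_fields(SAMPLE_VGPU)))
```

The first-occurrence rule matters because "Product Name" appears twice in the real output: once for the GPU and once under the licensed-product section.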
Steps to Reproduce
Minimal reproducible script (mre.py):
import numpy as np
import xgboost as xgb

print(f"xgboost = {xgb.__version__}")
print(f"build_info.USE_CUDA = {xgb.build_info()['USE_CUDA']}")

# A trivial 16-row dataset is enough: the failure is at allocator init,
# not at any actual training step.
X = np.zeros((16, 4), dtype=np.float32)
X[:, 0] = np.arange(16)
y = (np.arange(16) > 7).astype(np.float32)
dm = xgb.DMatrix(X, label=y)

xgb.train(
    {
        "objective": "binary:logistic",
        "tree_method": "hist",
        "device": "cuda",
        "verbosity": 1,
    },
    dm,
    num_boost_round=1,
)
print("OK")
Run on a VM that's been assigned an A40-48C vGPU profile:
$ python mre.py
xgboost = 3.2.0
build_info.USE_CUDA = True
[traceback below]
Error Message
[23:12:42] /__w/xgboost/xgboost/src/common/cuda_dr_utils.cc:92:
GetGlobalCuDriverApi().cuMemGetAllocationGranularity(
&granularity, prop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED)
[/__w/xgboost/xgboost/src/common/cuda_dr_utils.h:107]:
CUDA driver error: CUDA_ERROR_NOT_SUPPORTED. operation not supported
Stack trace:
[bt] (0) .../xgboost/lib/libxgboost.so(+0x2c1a8c) [0x...]
[bt] (1) .../xgboost/lib/libxgboost.so(+0x3a7c1d) [0x...]
[bt] (2) .../xgboost/lib/libxgboost.so(+0xb7d4b9) [0x...]
[bt] (3) .../xgboost/lib/libxgboost.so(+0xb7c3bc) [0x...]
[bt] (4) .../xgboost/lib/libxgboost.so(+0x141cc23) [0x...]
[bt] (5) .../xgboost/lib/libxgboost.so(+0x141dd76) [0x...]
[bt] (6) .../xgboost/lib/libxgboost.so(+0x142a2b9) [0x...]
[bt] (7) .../xgboost/lib/libxgboost.so(+0x6dc8a1) [0x...]
[bt] (8) .../xgboost/lib/libxgboost.so(+0x6dd972) [0x...]
Crash is at the first GPU allocation; no training round has started yet. The driver call returning CUDA_ERROR_NOT_SUPPORTED is cuMemGetAllocationGranularity, which is part of the CUDA Virtual Memory Management API family.
Analysis — why the PR #11391 fallback doesn't trigger here
The gate in src/common/device_helpers.cu (post-PR-#11391):
[[nodiscard]] bool IsSupportedDrVer(std::int32_t major, std::int32_t minor) {
  return major > 12 || (major == 12 && minor >= 5);
}

[[nodiscard]] bool CheckVmAlloc() {
  std::call_once(once, [] {
    std::int32_t major{0}, minor{0};
    xgboost::curt::DrVersion(&major, &minor);
    if (IsSupportedDrVer(major, minor)) {
      vm_flag = xgboost::cudr::GetVersionFromSmi(&major, &minor) && major >= 555;
    } else {
      vm_flag = false;
    }
  });
  return vm_flag;
}
On this system both gates pass:
xgboost::curt::DrVersion (via cuDriverGetVersion) reports 12.8 → IsSupportedDrVer(12, 8) is true.
xgboost::cudr::GetVersionFromSmi parses nvidia-smi --query-gpu=driver_version --format=csv → 570.158.01 → major = 570 >= 555 is true.
So vm_flag = true, XGBoost picks the VMM allocator, and the very first cuMemGetAllocationGranularity call returns NOT_SUPPORTED.
The PR #11391 fallback correctly addresses the libcuda.so / system-driver mismatch case (#11397), but the vGPU case is fundamentally different: the driver version is modern, the libcuda is modern, the API surface advertises VMM, yet the host-side vGPU scheduler refuses VMM calls at runtime. No version comparison can detect this; it requires either a feature probe (attempt a VMM call and handle the failure) or an explicit operator-controlled opt-out (env var / config field).
Notably, the cudaMalloc fallback path is healthy on this vGPU — when I trick CheckVmAlloc into returning false, training works end to end at full speed.
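To make the gate logic easy to replay with other driver combinations, here is a small Python model of the two checks quoted above. It is only a sketch: the function names are mine, the `cuDriverGetVersion` decoding assumes the usual `1000 * major + 10 * minor` encoding, and the way `GetVersionFromSmi` actually parses the CSV output is an assumption (header line followed by one version line).

```python
from typing import Optional


def is_supported_dr_ver(major: int, minor: int) -> bool:
    """Mirror of IsSupportedDrVer: CUDA driver API 12.5 or newer."""
    return major > 12 or (major == 12 and minor >= 5)


def smi_major(csv_output: str) -> Optional[int]:
    """Extract the driver major version from the nvidia-smi CSV output.

    Assumed format: a 'driver_version' header, then e.g. '570.158.01'.
    """
    lines = [ln.strip() for ln in csv_output.splitlines() if ln.strip()]
    if len(lines) < 2 or lines[0] != "driver_version":
        return None
    head = lines[1].split(".")[0]
    return int(head) if head.isdigit() else None


def check_vm_alloc(cu_driver_version: int, smi_csv: str) -> bool:
    """Model of CheckVmAlloc: True means the VMM allocator is selected."""
    major = cu_driver_version // 1000
    minor = (cu_driver_version % 1000) // 10
    if not is_supported_dr_ver(major, minor):
        return False
    parsed = smi_major(smi_csv)
    return parsed is not None and parsed >= 555


if __name__ == "__main__":
    # Values reported on the affected vGPU guest.
    print(check_vm_alloc(12080, "driver_version\n570.158.01\n"))
    # Same guest with the spoofed nvidia-smi output.
    print(check_vm_alloc(12080, "driver_version\n535.161.08\n"))
```

With the real values from this guest both checks pass and the model selects VMM; only the spoofed 535 driver string flips it. Nothing in either input says whether the vGPU host will accept VMM calls.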
Workaround currently in use (please don't recommend this — it's why I'm filing)
I ship a small bash shim that intercepts only the one specific nvidia-smi --query-gpu=driver_version --format=csv call XGBoost uses, returns a fabricated 535.161.08 (< 555), and forwards every other invocation to /usr/bin/nvidia-smi unchanged:
#!/bin/bash
case " $* " in
  *" --query-gpu=driver_version "*)
    echo "driver_version"
    echo "535.161.08"
    exit 0
    ;;
esac
exec /usr/bin/nvidia-smi "$@"
The shim is prepended to PATH before any import xgboost, so CheckVmAlloc's std::call_once caches vm_flag = false. Training runs to completion on GPU. torch / ctranslate2 / monitoring tools call the CUDA driver API directly and never query nvidia-smi, so they see the real driver and are unaffected.
This works but is brittle: it relies on intercepting a specific shell command at a specific point in the process lifetime, and it's not something I'd want every operator of every vGPU-deployed XGBoost installation to have to reproduce.
Suggested fixes (in increasing order of effort)
1. Environment-variable opt-out, e.g. XGBOOST_CUDA_DISABLE_VMM=1. Smallest possible patch: read the env var at the top of CheckVmAlloc's call_once lambda and force vm_flag = false if set. This alone would let operators choose the legacy allocator without shell tricks.
2. Runtime probe: replace (or augment) the nvidia-smi major-version check with an actual cuMemGetAllocationGranularity call on a trivial property, wrapped in a try/catch. If it returns CUDA_ERROR_NOT_SUPPORTED, set vm_flag = false. Cost is one driver-API call at first GPU use; covers vGPU cases without needing per-host configuration.
3. vGPU detection heuristic: parse nvidia-smi --query-gpu=name (or Product Brand from nvidia-smi -q) and pattern-match the -NNC (vCS) vGPU suffixes / the Virtual Compute Server brand. Less general than (1) or (2), but doesn't require running another CUDA call.
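As a rough illustration of how the env-var opt-out and the runtime probe could compose, here is a Python sketch of the decision order. It is not a patch: `choose_vmm` is a hypothetical name, `XGBOOST_CUDA_DISABLE_VMM` is the proposed variable from this issue and does not exist in XGBoost today, and the probe is injected as a callable standing in for a real `cuMemGetAllocationGranularity` call that returns a `CUresult` code.

```python
import os
from typing import Callable, Mapping

# CUresult values from cuda.h.
CUDA_SUCCESS = 0
CUDA_ERROR_NOT_SUPPORTED = 801


def choose_vmm(
    version_gate_ok: bool,
    probe: Callable[[], int],
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Return True if the VMM allocator should be used.

    Order: operator opt-out, then the existing version gate, then a
    one-off probe of the driver call that fails on the vGPU guest.
    """
    # Proposed (not yet existing) opt-out; checked first so that no
    # driver call is made when the operator has disabled VMM.
    if env.get("XGBOOST_CUDA_DISABLE_VMM", "") == "1":
        return False
    if not version_gate_ok:
        return False
    # Any non-success result, including CUDA_ERROR_NOT_SUPPORTED,
    # falls back to the legacy allocator.
    return probe() == CUDA_SUCCESS


if __name__ == "__main__":
    # Simulates the A40-48C guest: versions pass, the probe is refused.
    print(choose_vmm(True, lambda: CUDA_ERROR_NOT_SUPPORTED, env={}))
```

Checking the opt-out before the probe keeps the override usable even on a host where the probe itself misbehaves.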
Closing note
For context: I tested the existing fallback explicitly by spoofing the nvidia-smi output. With vm_flag = false, XGBoost 3.2.0 trains a rank:ndcg model with 50 groups × 10 candidates × 2,373 features in 0.28 s on this vGPU. The legacy path is fast and stable; the only barrier is reaching it from configuration.