Published CUDA v0.60.3 bundle crashes at first matmul on A30 (SM 8.0) — SM 8.0 cubins ARE present, cause is unclear
Summary
The mesh-llm-x86_64-unknown-linux-gnu-cuda.tar.gz bundle published for v0.60.3 crashes at the first matmul with CUDA error: device kernel image is invalid when run against an NVIDIA A30 (Ampere, compute capability 8.0) on an R535-series driver. By the same apparent mechanism, we'd expect the same failure on A100 (also compute 8.0), though that hasn't been verified on our hardware.
Note — I originally hypothesized this was a missing-SM-8.0-cubin problem. That hypothesis is wrong — cuobjdump --list-elf on the bundled llama-server-cuda binary shows SM 8.0 cubins present for every kernel (alongside 75, 86, 89, 90, 120a). The crash is real and reproducible, but the underlying cause is something else (see "Revised hypotheses" below).
Environment
Release: v0.60.3 bundle (mesh-llm-x86_64-unknown-linux-gnu-cuda.tar.gz)
GPU: NVIDIA A30 (24 GB), accessed via a MIG 4g.24gb compute instance (full GPU)
Host driver: 535.288.01 (R535 series, CUDA 12.2 native)
Container base for runtime: nvidia/cuda:12.6.3-runtime-ubuntu24.04
/usr/local/cuda-12.6/compat/ removed so the host's libcuda (535) is used instead of the forward-compat shim (the shim silently breaks cudaGetDeviceCount() on this driver — a separate issue).
Deployment: docker run --gpus '"device=MIG-<uuid>"' ... --llama-flavor cuda (not --gpus all; under MIG the parent GPU UUID doesn't work and must be the MIG instance UUID).
Steps to reproduce
# Host: A30 with nvidia-container-toolkit, MIG enabled and one 4g.24gb instance created
# (sudo nvidia-smi mig -cgi 0 -C if no instance exists)
MIG_UUID=$(nvidia-smi -L | grep -oP 'MIG-[a-f0-9-]+' | head -1)
curl -fsSL https://github.com/Mesh-LLM/mesh-llm/releases/download/v0.60.3/mesh-llm-x86_64-unknown-linux-gnu-cuda.tar.gz \
| tar xz
docker run --rm --gpus "\"device=$MIG_UUID\"" \
-v $PWD/mesh-bundle:/opt/mesh-bundle:ro \
-v $PWD/models:/models:ro \
--entrypoint /bin/bash \
nvidia/cuda:12.6.3-runtime-ubuntu24.04 \
-c 'apt-get update && apt-get install -y libssl3t64 libgomp1 curl && \
    rm -rf /usr/local/cuda-12.6/compat && \
    /opt/mesh-bundle/mesh-llm serve \
      --gguf /models/<any-gguf>.gguf \
      --listen-all --llama-flavor cuda'
Expected
Model loads and inference serves normally.
Actual
rpc-server-cuda comes up, reports the GPU correctly, claims ~512 MB VRAM. llama-server-cuda loads the model weights into VRAM (~21 GB of 24 GB used), then crashes on the first forward pass during the warm-up run:
sched_reserve: CUDA0 compute buffer size = 300.75 MiB
sched_reserve: CUDA_Host compute buffer size = 136.09 MiB
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
ggml_cuda_compute_forward: MUL_MAT failed
/__w/mesh-llm/mesh-llm/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:98: CUDA error
CUDA error: device kernel image is invalid
current device: 0, in function ggml_cuda_compute_forward at
/__w/mesh-llm/mesh-llm/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2825
Evidence that SM 8.0 is NOT missing from the bundle
Ran cuobjdump --list-elf /opt/mesh-bundle/llama-server-cuda inside nvidia/cuda:12.8.0-devel-ubuntu22.04. SM 8.0 cubins are present: 60 of them, one per kernel variant, alongside 75/86/89/90/120a. The same pattern holds for rpc-server-cuda.
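For anyone repeating the check on another bundle, a small sketch that tallies cubins per SM architecture from a cuobjdump --list-elf listing. The three listing lines below are a fabricated stand-in for the real output, which has hundreds of entries (run the actual cuobjdump command inside a CUDA devel image); the grep pattern assumes cuobjdump names embedded ELFs with an sm_XX suffix, as it did in my run.

```shell
# Stand-in for: cuobjdump --list-elf /opt/mesh-bundle/llama-server-cuda
# (fabricated sample lines for illustration only)
listing='ELF file    1: llama-server-cuda.1.sm_75.cubin
ELF file    2: llama-server-cuda.2.sm_80.cubin
ELF file    3: llama-server-cuda.3.sm_80.cubin'

# Tally cubins per SM architecture (the "a" suffix covers e.g. sm_120a):
printf '%s\n' "$listing" | grep -o 'sm_[0-9]*a\{0,1\}' | sort | uniq -c
```

On the real binary this printed 60 for sm_80, matching the count above.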
Revised hypotheses (testable)
Differences between the failing upstream build and a from-source build that does work on the same A30:

|                          | Upstream release (fails on A30) | Source build (works on A30) |
|--------------------------|---------------------------------|-----------------------------|
| CUDA toolkit             | 12.8 (nvidia/cuda:12.8.0-devel-ubuntu22.04 per .github/workflows/release.yml) | 12.6 |
| GGML_CUDA_FA_ALL_QUANTS  | ON (default)                    | OFF (CI opt-out)            |
| CMAKE_CUDA_ARCHITECTURES | 75;80;86;87;89;90;100;103;120   | 80;86;89                    |
| Targets built            | all                             | llama-server rpc-server llama-moe-split llama-moe-analyze only |

Host driver, MIG config, container base, and model are identical across both tests.
Plausible causes, in decreasing order of likelihood:

1. CUDA-toolkit / driver minor-version mismatch. The binary was assembled by CUDA 12.8 nvcc; the host libcuda is 535-series (CUDA 12.2 native). With the forward-compat shim removed (required to even get past cudaGetDeviceCount on this driver), cubins tagged as CUDA 12.8 may be rejected at load by the CUDA 12.2 driver for a specific kernel set. Failure at a late stage (first matmul, not context creation) is consistent with a lazy kernel-load path hitting the first unloadable cubin on the hot path.
2. A FlashAttention kernel variant that requires features only present in the CUDA 12.8 runtime. Rebuilding upstream's recipe with MESH_LLM_CUDA_FA_ALL_QUANTS=off and checking whether it runs would confirm or rule this out.
3. A kernel that legitimately requires sm_89/sm_90 features slipping into a code path executed on SM 8.0 (e.g. an #ifdef that doesn't correctly guard against 8.0, or __CUDA_ARCH__ >= 890 being the default branch). This would manifest as an sm_80 cubin that is technically present but non-functional for that specific kernel.
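One cheap probe for hypothesis 1 that needs no rebuild: CUDA 11.7+ honors the documented CUDA_MODULE_LOADING environment variable. If lazy module loading is what defers the bad cubin to the first matmul, forcing eager loading should move the crash to startup. A command sketch, reusing the flags from the repro above:

```shell
# If hypothesis 1 is right, this should fail during initialization rather
# than at the first matmul, since every cubin is loaded up front.
CUDA_MODULE_LOADING=EAGER /opt/mesh-bundle/mesh-llm serve \
  --gguf /models/<any-gguf>.gguf --listen-all --llama-flavor cuda
```

I have not yet run this variant; reporting it here in case it saves someone a rebuild cycle.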
Concrete experiments that would disambiguate

1. Rebuild upstream's exact recipe (CUDA 12.8 devel, FA_ALL_QUANTS=on, full arch list, all targets) and test on A30 + driver 535. If it still fails identically → not a build-config issue; hypothesis 1 becomes the most likely cause.
2. Same as experiment 1 but with CUDA 12.6 devel (everything else identical). If that works → confirms the CUDA-toolkit/driver mismatch as the root cause, and the fix is pinning the release build to CUDA ≤ 12.6, OR baking the appropriate forward-compat libs into the release.
3. Same as experiment 1 but with FA_ALL_QUANTS=off only. If that works → points at a specific FA kernel family as the offender.
Workaround (confirmed working on A30 / driver 535)
Build from source with CUDA ≤ 12.6 and the narrower config:
git clone https://github.com/Mesh-LLM/mesh-llm
cd mesh-llm
MESH_LLM_CUDA_FA_ALL_QUANTS=off \
MESH_LLM_LLAMA_TARGETS="llama-server rpc-server llama-moe-split llama-moe-analyze" \
just release-build-cuda "80;86;89"
Suggested fix (symptom-based — cause-based fix pending the experiments above)
Regardless of which hypothesis turns out to be right, the release pipeline currently has no functional smoke test that would catch this class of bug. The existing ./mesh-llm --version check does not exercise the CUDA matmul path. A minimal fix that protects users from the current failure mode:
Add an A30 or A100 runner step to .github/workflows/release.yml that actually serves a tiny GGUF through mesh-llm ... --llama-flavor cuda and hits /v1/chat/completions for one token. This would have caught the bug before the v0.60.3 release shipped.
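A rough sketch of what such a job could look like, assuming a self-hosted GPU runner. The runner label, listen port, warm-up sleep, and model path below are all invented for illustration, not the project's actual values:

```yaml
# Hypothetical fragment for .github/workflows/release.yml
cuda-smoke-test:
  runs-on: [self-hosted, gpu-a30]   # assumed runner label
  steps:
    - name: Serve a tiny GGUF and request one token over the CUDA path
      run: |
        ./mesh-bundle/mesh-llm serve --gguf tiny.gguf --llama-flavor cuda &
        sleep 60   # crude warm-up wait; a readiness poll would be better
        curl -sf http://127.0.0.1:8080/v1/chat/completions \
          -H 'Content-Type: application/json' \
          -d '{"messages":[{"role":"user","content":"hi"}],"max_tokens":1}'
```

The one-token completion forces a real forward pass, which is exactly the step that crashes in this report.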
Happy to submit a PR for the smoke-test wiring if you can tell me what runner type is available to the project (A30/A100/other).
Related
Detected during a private-LAN deployment of mesh-llm on 3 NVIDIA A30 GPUs.
configure_compiler_cache() silent-kill (filed separately as a build-script robustness issue).