
A30 / driver 535: v0.60.3 CUDA bundle fails at matmul despite sm_80 cubins being present #304

@ventz

Description


Published CUDA v0.60.3 bundle crashes at first matmul on A30 (SM 8.0) — SM 8.0 cubins ARE present, cause is unclear

Summary

The mesh-llm-x86_64-unknown-linux-gnu-cuda.tar.gz bundle published for v0.60.3 crashes at the first matmul with CUDA error: device kernel image is invalid when run against an NVIDIA A30 (Ampere, compute capability 8.0) on an R535-series driver. By the same apparent mechanism, we'd expect the same failure on A100 (also compute 8.0), though that hasn't been verified on our hardware.

Note: I originally hypothesized this was a missing-SM-8.0-cubin problem. That hypothesis is wrong: cuobjdump --list-elf on the bundled llama-server-cuda binary shows SM 8.0 cubins present for every kernel (alongside 75, 86, 89, 90, and 120a). The crash is real and reproducible, but the underlying cause is something else (see "Revised hypotheses" below).

Environment

  • Version: v0.60.3 release bundle (mesh-llm-x86_64-unknown-linux-gnu-cuda.tar.gz)
  • GPU: NVIDIA A30 (24 GB), accessed via a MIG 4g.24gb compute instance (full GPU)
  • Host driver: 535.288.01 (R535 series, CUDA 12.2 native)
  • Container base for runtime: nvidia/cuda:12.6.3-runtime-ubuntu24.04
    • /usr/local/cuda-12.6/compat/ removed so the host's libcuda (535) is used instead of the forward-compat shim (the shim silently breaks cudaGetDeviceCount() on this driver — a separate issue).
  • Deployment: docker run --gpus '"device=MIG-<uuid>"' ... --llama-flavor cuda (not --gpus all; under MIG the parent GPU UUID doesn't work, so the device must be addressed by its MIG instance UUID).

Steps to reproduce

# Host: A30 with nvidia-container-toolkit, MIG enabled and one 4g.24gb instance created
# (sudo nvidia-smi mig -cgi 0 -C if no instance exists)
MIG_UUID=$(nvidia-smi -L | grep -oP 'MIG-[a-f0-9-]+' | head -1)

curl -fsSL https://github.com/Mesh-LLM/mesh-llm/releases/download/v0.60.3/mesh-llm-x86_64-unknown-linux-gnu-cuda.tar.gz \
  | tar xz
docker run --rm --gpus "\"device=$MIG_UUID\"" \
  -v $PWD/mesh-bundle:/opt/mesh-bundle:ro \
  -v $PWD/models:/models:ro \
  --entrypoint /bin/bash \
  nvidia/cuda:12.6.3-runtime-ubuntu24.04 \
  -c 'apt-get update && apt-get install -y libssl3t64 libgomp1 curl && \
      rm -rf /usr/local/cuda-12.6/compat && \
      /opt/mesh-bundle/mesh-llm serve \
        --gguf /models/<any-gguf>.gguf \
        --listen-all --llama-flavor cuda'

Expected

Model loads and inference serves normally.

Actual

rpc-server-cuda comes up, reports the GPU correctly, and reserves ~512 MB of VRAM. llama-server-cuda loads the model weights into VRAM (~21 GB of 24 GB used), then crashes on the first forward pass during the warm-up run:

sched_reserve:      CUDA0 compute buffer size =   300.75 MiB
sched_reserve:  CUDA_Host compute buffer size =   136.09 MiB
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
ggml_cuda_compute_forward: MUL_MAT failed
/__w/mesh-llm/mesh-llm/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:98: CUDA error
CUDA error: device kernel image is invalid
  current device: 0, in function ggml_cuda_compute_forward at
  /__w/mesh-llm/mesh-llm/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2825

Evidence that SM 8.0 is NOT missing from the bundle

Ran cuobjdump --list-elf /opt/mesh-bundle/llama-server-cuda inside nvidia/cuda:12.8.0-devel-ubuntu22.04. SM 8.0 cubins are present: 60 of them, one per kernel variant, alongside 75/86/89/90/120a:

ELF file  1:  llama-server-cuda.1.sm_75.cubin
ELF file  2:  llama-server-cuda.2.sm_80.cubin    ← present
ELF file  3:  llama-server-cuda.3.sm_86.cubin
ELF file  4:  llama-server-cuda.4.sm_89.cubin
ELF file  5:  llama-server-cuda.5.sm_90.cubin
ELF file  6:  llama-server-cuda.6.sm_120a.cubin
ELF file  7:  llama-server-cuda.7.sm_75.cubin
ELF file  8:  llama-server-cuda.8.sm_80.cubin    ← present
ELF file  9:  llama-server-cuda.9.sm_86.cubin
...
(pattern repeats; 60+ kernels × 6 arches each, including sm_80 consistently)

Same pattern holds for rpc-server-cuda.
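
For quick re-checking, the same inspection reduces to a one-liner per binary (cuobjdump is available in any CUDA devel image, e.g. the 12.8.0-devel one used above; paths match the bundle layout from the repro steps):

cuobjdump --list-elf /opt/mesh-bundle/llama-server-cuda | grep -c sm_80   # 60 in this bundle
cuobjdump --list-elf /opt/mesh-bundle/rpc-server-cuda | grep -c sm_80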

Revised hypotheses (testable)

Differences between the failing upstream build and a from-source build that does work on the same A30:

|                          | Upstream release (fails on A30)                                               | Source build (works on A30)                                     |
|--------------------------|-------------------------------------------------------------------------------|-----------------------------------------------------------------|
| CUDA toolkit             | 12.8 (nvidia/cuda:12.8.0-devel-ubuntu22.04 per .github/workflows/release.yml) | 12.6                                                            |
| GGML_CUDA_FA_ALL_QUANTS  | ON (default)                                                                  | OFF (CI opt-out)                                                |
| CMAKE_CUDA_ARCHITECTURES | 75;80;86;87;89;90;100;103;120                                                 | 80;86;89                                                        |
| Targets built            | all                                                                           | llama-server rpc-server llama-moe-split llama-moe-analyze only  |

Host driver, MIG config, container base, and model are identical across both tests.

Plausible causes, in decreasing order of likelihood:

  1. CUDA-toolkit / driver minor-version mismatch. The binary is assembled by CUDA 12.8 nvcc; the host libcuda is 535-series (CUDA 12.2 native). With the forward-compat shim removed (required to even get past cudaGetDeviceCount on this driver), cubins produced by CUDA 12.8 may be rejected at load by the CUDA 12.2 driver for a specific kernel set. The failure appearing late (first matmul, not context creation) is consistent with a lazy kernel-load path hitting the first unloadable cubin on the hot path. (A quick probe for this is sketched after this list.)

  2. A FlashAttention kernel variant that requires features only available with the CUDA 12.8 runtime. Rebuilding upstream's recipe with MESH_LLM_CUDA_FA_ALL_QUANTS=off and checking whether the result runs would confirm or rule this out.

  3. A kernel that legitimately requires sm_89/sm_90 features slipping into a code path executed on SM 8.0 (e.g. an #ifdef that doesn't correctly guard SM 8.0, or __CUDA_ARCH__ >= 890 being the default branch). This would manifest as an sm_80.cubin that is technically present but non-functional for that specific kernel.
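
A quick probe for hypothesis 1 that doesn't need a rebuild (my suggestion, not something upstream documents): force eager module loading so that any cubin the 12.2 driver refuses to load surfaces at startup instead of at the first matmul, and double-check what CUDA version the 535 driver actually reports.

# Inside the same container, before launching the server
nvidia-smi | grep 'CUDA Version'    # R535 should report CUDA Version: 12.2
export CUDA_MODULE_LOADING=EAGER    # disable lazy kernel loading (env var exists since CUDA 11.7)
/opt/mesh-bundle/mesh-llm serve --gguf /models/<any-gguf>.gguf --listen-all --llama-flavor cuda
# If hypothesis 1 is right, "device kernel image is invalid" should now appear
# during initialization rather than during the warm-up matmul.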

Concrete experiments that would disambiguate

  1. Rebuild upstream's exact recipe (CUDA 12.8 devel, FA_ALL_QUANTS=on, full arch list, all targets) and test on A30 + driver 535. If it still fails identically → not a build-config issue, likely hypothesis 1.
  2. Same as experiment 1 but with CUDA 12.6 devel (everything else identical). If that works → confirms the CUDA-toolkit/driver mismatch is the root cause, and the fix is pinning the release build to CUDA ≤ 12.6, OR baking the appropriate forward-compat libs into the release.
  3. Same as experiment 1 but with FA_ALL_QUANTS=off only. If that works → points at a specific FA kernel family as the offender.

Workaround (confirmed working on A30 / driver 535)

Build from source with CUDA ≤ 12.6 and the narrower config:

git clone https://github.com/Mesh-LLM/mesh-llm
cd mesh-llm
MESH_LLM_CUDA_FA_ALL_QUANTS=off \
MESH_LLM_LLAMA_TARGETS="llama-server rpc-server llama-moe-split llama-moe-analyze" \
just release-build-cuda "80;86;89"

Suggested fix (symptom-based — cause-based fix pending the experiments above)

Regardless of which hypothesis turns out right, the release pipeline currently has no functional smoke test that would catch this class of bug. The existing ./mesh-llm --version check does not exercise the CUDA matmul path. A minimum fix that protects users from the current failure mode:

  • Add an A30 or A100 runner step to .github/workflows/release.yml that actually serves a tiny GGUF through mesh-llm ... --llama-flavor cuda and hits /v1/chat/completions for one token (rough sketch below). That would have caught this bug before the v0.60.3 release shipped.
  • Happy to submit a PR for the smoke-test wiring if you can tell me what runner type is available to the project (A30/A100/other).
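
For concreteness, a rough sketch of what that smoke-test step could look like. The port, model filename, readiness loop, and the OpenAI-style "content" field in the response are my assumptions, not the project's documented conventions; the serve invocation and the one-token /v1/chat/completions call are the parts taken from this issue.

# Assumes a tiny GGUF has already been fetched into the runner workspace and that
# the server listens on localhost:8080 (adjust to whatever mesh-llm actually defaults to).
./mesh-llm serve --gguf tiny-test.gguf --listen-all --llama-flavor cuda &
SERVER_PID=$!

# Poll until the endpoint answers, giving up after ~2 minutes.
for i in $(seq 1 60); do
  curl -sf -o /dev/null http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"user","content":"hi"}],"max_tokens":1}' && break
  sleep 2
done

# The actual check: one generated token means the CUDA matmul path executed,
# which ./mesh-llm --version never exercises.
curl -sf http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"hi"}],"max_tokens":1}' \
  | grep -q '"content"' || { kill $SERVER_PID; exit 1; }

kill $SERVER_PID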
