
A30 / driver 535: v0.60.3 CUDA bundle fails at matmul despite sm_80 cubins being present #304

@ventz

Description


Published CUDA v0.60.3 bundle crashes at first matmul on A30 (SM 8.0) — SM 8.0 cubins ARE present, cause is unclear

Summary

The mesh-llm-x86_64-unknown-linux-gnu-cuda.tar.gz bundle published for v0.60.3 crashes at the first matmul with CUDA error: device kernel image is invalid when run against an NVIDIA A30 (Ampere, compute capability 8.0) on an R535-series driver. By the same apparent mechanism, we'd expect the same failure on A100 (also compute 8.0), though that hasn't been verified on our hardware.

Note: I originally hypothesized this was a missing-SM-8.0-cubin problem. That hypothesis is wrong: cuobjdump --list-elf on the bundled llama-server-cuda binary shows SM 8.0 cubins present for every kernel (alongside 75, 86, 89, 90, and 120a). The crash is real and reproducible, but the underlying cause is something else (see "Revised hypotheses" below).

Environment

  • Version: v0.60.3 release bundle (mesh-llm-x86_64-unknown-linux-gnu-cuda.tar.gz)
  • GPU: NVIDIA A30 (24 GB), accessed via a MIG 4g.24gb compute instance (full GPU)
  • Host driver: 535.288.01 (R535 series, CUDA 12.2 native)
  • Container base for runtime: nvidia/cuda:12.6.3-runtime-ubuntu24.04
    • /usr/local/cuda-12.6/compat/ removed so the host's libcuda (535) is used instead of the forward-compat shim (the shim silently breaks cudaGetDeviceCount() on this driver — a separate issue).
  • Deployment: docker run --gpus '"device=MIG-<uuid>"' ... --llama-flavor cuda (not --gpus all; under MIG the parent GPU UUID doesn't work, so the device must be addressed by its MIG instance UUID).

Steps to reproduce

# Host: A30 with nvidia-container-toolkit, MIG enabled and one 4g.24gb instance created
# (sudo nvidia-smi mig -cgi 0 -C if no instance exists)
MIG_UUID=$(nvidia-smi -L | grep -oP 'MIG-[a-f0-9-]+' | head -1)

curl -fsSL https://github.com/Mesh-LLM/mesh-llm/releases/download/v0.60.3/mesh-llm-x86_64-unknown-linux-gnu-cuda.tar.gz \
  | tar xz
docker run --rm --gpus "\"device=$MIG_UUID\"" \
  -v $PWD/mesh-bundle:/opt/mesh-bundle:ro \
  -v $PWD/models:/models:ro \
  --entrypoint /bin/bash \
  nvidia/cuda:12.6.3-runtime-ubuntu24.04 \
  -c 'apt-get update && apt-get install -y libssl3t64 libgomp1 curl && \
      rm -rf /usr/local/cuda-12.6/compat && \
      /opt/mesh-bundle/mesh-llm serve \
        --gguf /models/<any-gguf>.gguf \
        --listen-all --llama-flavor cuda'

Expected

Model loads and inference serves normally.

Actual

rpc-server-cuda comes up, reports the GPU correctly, and reserves ~512 MB of VRAM. llama-server-cuda loads the model weights into VRAM (~21 GB of 24 GB used), then crashes on the first forward pass during the warm-up run:

sched_reserve:      CUDA0 compute buffer size =   300.75 MiB
sched_reserve:  CUDA_Host compute buffer size =   136.09 MiB
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
ggml_cuda_compute_forward: MUL_MAT failed
/__w/mesh-llm/mesh-llm/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:98: CUDA error
CUDA error: device kernel image is invalid
  current device: 0, in function ggml_cuda_compute_forward at
  /__w/mesh-llm/mesh-llm/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2825

Evidence that SM 8.0 is NOT missing from the bundle

Ran cuobjdump --list-elf /opt/mesh-bundle/llama-server-cuda inside nvidia/cuda:12.8.0-devel-ubuntu22.04. SM 8.0 cubins are present: 60 of them, one per kernel variant, alongside 75/86/89/90/120a:

ELF file  1:  llama-server-cuda.1.sm_75.cubin
ELF file  2:  llama-server-cuda.2.sm_80.cubin    ← present
ELF file  3:  llama-server-cuda.3.sm_86.cubin
ELF file  4:  llama-server-cuda.4.sm_89.cubin
ELF file  5:  llama-server-cuda.5.sm_90.cubin
ELF file  6:  llama-server-cuda.6.sm_120a.cubin
ELF file  7:  llama-server-cuda.7.sm_75.cubin
ELF file  8:  llama-server-cuda.8.sm_80.cubin    ← present
ELF file  9:  llama-server-cuda.9.sm_86.cubin
...
(pattern repeats; 60+ kernels × 6 arches each, including sm_80 consistently)

Same pattern holds for rpc-server-cuda.
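
For quick re-checking, the same inspection reduces to a one-liner per binary (cuobjdump is available in any CUDA devel image, e.g. the 12.8.0-devel one used above; paths match the bundle layout from the repro steps):

cuobjdump --list-elf /opt/mesh-bundle/llama-server-cuda | grep -c sm_80   # 60 in this bundle
cuobjdump --list-elf /opt/mesh-bundle/rpc-server-cuda | grep -c sm_80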

Revised hypotheses (testable)

Differences between the failing upstream build and a from-source build that does work on the same A30:

|                          | Upstream release (fails on A30)                                               | Source build (works on A30)                                     |
|--------------------------|-------------------------------------------------------------------------------|-----------------------------------------------------------------|
| CUDA toolkit             | 12.8 (nvidia/cuda:12.8.0-devel-ubuntu22.04 per .github/workflows/release.yml) | 12.6                                                            |
| GGML_CUDA_FA_ALL_QUANTS  | ON (default)                                                                  | OFF (CI opt-out)                                                |
| CMAKE_CUDA_ARCHITECTURES | 75;80;86;87;89;90;100;103;120                                                 | 80;86;89                                                        |
| Targets built            | all                                                                           | llama-server rpc-server llama-moe-split llama-moe-analyze only  |

Host driver, MIG config, container base, and model are identical across both tests.

Plausible causes, in decreasing order of likelihood:

  1. CUDA-toolkit / driver minor-version mismatch. The binary is assembled by CUDA 12.8 nvcc; the host libcuda is 535-series (CUDA 12.2 native). With the forward-compat shim removed (required to even get past cudaGetDeviceCount on this driver), cubins produced by CUDA 12.8 may be rejected at load by the CUDA 12.2 driver for a specific kernel set. The failure appearing late (first matmul, not context creation) is consistent with a lazy kernel-load path hitting the first unloadable cubin on the hot path. (A quick probe for this is sketched after this list.)

  2. A FlashAttention kernel variant that requires features only available with the CUDA 12.8 runtime. Rebuilding upstream's recipe with MESH_LLM_CUDA_FA_ALL_QUANTS=off and checking whether the result runs would confirm or rule this out.

  3. A kernel that legitimately requires sm_89/sm_90 features slipping into a code path executed on SM 8.0 (e.g. an #ifdef that doesn't correctly guard SM 8.0, or __CUDA_ARCH__ >= 890 being the default branch). This would manifest as an sm_80.cubin that is technically present but non-functional for that specific kernel.
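
A quick probe for hypothesis 1 that doesn't need a rebuild (my suggestion, not something upstream documents): force eager module loading so that any cubin the 12.2 driver refuses to load surfaces at startup instead of at the first matmul, and double-check what CUDA version the 535 driver actually reports.

# Inside the same container, before launching the server
nvidia-smi | grep 'CUDA Version'    # R535 should report CUDA Version: 12.2
export CUDA_MODULE_LOADING=EAGER    # disable lazy kernel loading (env var exists since CUDA 11.7)
/opt/mesh-bundle/mesh-llm serve --gguf /models/<any-gguf>.gguf --listen-all --llama-flavor cuda
# If hypothesis 1 is right, "device kernel image is invalid" should now appear
# during initialization rather than during the warm-up matmul.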

Concrete experiments that would disambiguate

  1. Rebuild upstream's exact recipe (CUDA 12.8 devel, FA_ALL_QUANTS=on, full arch list, all targets) and test on A30 + driver 535. If it still fails identically → not a build-config issue, likely hypothesis 1.
  2. Same as experiment 1 but with CUDA 12.6 devel (everything else identical). If that works → confirms the CUDA-toolkit/driver mismatch is the root cause, and the fix is pinning the release build to CUDA ≤ 12.6, OR baking the appropriate forward-compat libs into the release.
  3. Same as experiment 1 but with FA_ALL_QUANTS=off only. If that works → points at a specific FA kernel family as the offender.

Workaround (confirmed working on A30 / driver 535)

Build from source with CUDA ≤ 12.6 and the narrower config:

git clone https://github.com/Mesh-LLM/mesh-llm
cd mesh-llm
MESH_LLM_CUDA_FA_ALL_QUANTS=off \
MESH_LLM_LLAMA_TARGETS="llama-server rpc-server llama-moe-split llama-moe-analyze" \
just release-build-cuda "80;86;89"

Suggested fix (symptom-based — cause-based fix pending the experiments above)

Regardless of which hypothesis turns out right, the release pipeline currently has no functional smoke test that would catch this class of bug. The existing ./mesh-llm --version check does not exercise the CUDA matmul path. A minimum fix that protects users from the current failure mode:

  • Add an A30 or A100 runner step to .github/workflows/release.yml that actually serves a tiny GGUF through mesh-llm ... --llama-flavor cuda and hits /v1/chat/completions for one token (rough sketch below). That would have caught this bug before the v0.60.3 release shipped.
  • Happy to submit a PR for the smoke-test wiring if you can tell me what runner type is available to the project (A30/A100/other).
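
For concreteness, a rough sketch of what that smoke-test step could look like. The port, model filename, readiness loop, and the OpenAI-style "content" field in the response are my assumptions, not the project's documented conventions; the serve invocation and the one-token /v1/chat/completions call are the parts taken from this issue.

# Assumes a tiny GGUF has already been fetched into the runner workspace and that
# the server listens on localhost:8080 (adjust to whatever mesh-llm actually defaults to).
./mesh-llm serve --gguf tiny-test.gguf --listen-all --llama-flavor cuda &
SERVER_PID=$!

# Poll until the endpoint answers, giving up after ~2 minutes.
for i in $(seq 1 60); do
  curl -sf -o /dev/null http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"user","content":"hi"}],"max_tokens":1}' && break
  sleep 2
done

# The actual check: one generated token means the CUDA matmul path executed,
# which ./mesh-llm --version never exercises.
curl -sf http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"hi"}],"max_tokens":1}' \
  | grep -q '"content"' || { kill $SERVER_PID; exit 1; }

kill $SERVER_PID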
