
Add vLLM as a wrapped server backend (ROCm) #1537

Open

ramkrishna2910 wants to merge 36 commits into main from test-vllm

Conversation

ramkrishna2910 (Contributor) commented Apr 4, 2026

Status: early rough draft

Tested only on gfx1151 (Strix Halo / Radeon 8060S). The backend also supports gfx1150, gfx110X, and gfx120X, but those haven't been tested yet. Feedback from people with other AMD GPUs is the main thing we need before treating this PR as ready to merge.

Summary

Adds vLLM as a WrappedServer backend using AMD's pre-built ROCm wheels from lemonade-sdk/vllm-rocm. Follows the same install/load/unload pattern as llama.cpp-ROCm.

Scope of testing

Verified end-to-end (install → load → inference → benchmark) on exactly one configuration:

GPU: Ryzen AI MAX+ PRO 395 w/ Radeon 8060S (gfx1151, Strix Halo)
OS: Ubuntu 24.04
Kernel: 6.18.22 mainline from kernel.ubuntu.com/mainline
amdgpu: built-in driver from the 6.18.22 kernel package; amdgpu-dkms 6.19.0 from ROCm 31.20 also installed but not actively loading
Models: Qwen3 (0.6B–8B), Qwen3.5 (0.8B–9B), Llama-3.2 (1B/3B), Phi-4-mini, Gemma-3-4B, both FP16 and AWQ variants

Not tested:

gfx1150 / gfx110X / gfx120X — the install flow fetches per-arch assets that exist but have never been exercised. Other architectures may have their own kernel/driver gotchas we haven't discovered.
Other distros — everything below assumes Ubuntu 24.04. The kernel and amdgpu setup on Fedora/Arch/openSUSE likely works but is untested.
Multi-user / batched serving — all benchmarks are single-user, one request at a time. vLLM's scheduler strengths (paged attention, prefix caching, continuous batching) are not exercised here.

Prerequisites for testing

Hardware

AMD GPU in one of: gfx1151 (Strix Halo), gfx1150 (Strix Point), gfx110X (RDNA3), gfx120X (RDNA4)
Kernel — strict requirement
You need a kernel with the CWSR (Context Wave Save/Restore) fix. Without it, any GPU dispatch triggers a GCVM_L2_PROTECTION_FAULT and the backend hangs. The verified path is mainline 6.18.4+.

Full doc: docs/gfx1151_linux.html.

If amdgpu-dkms is installed
The default Radeon repo (amdgpu/30.30) ships amdgpu-dkms 6.16.13, which overrides the kernel's built-in driver with a broken version. Either switch the repo to amdgpu/31.20:

sudo sed -i 's|amdgpu/30\.30/|amdgpu/31.20/|g' /etc/apt/sources.list.d/amdgpu*.list
sudo apt update
sudo apt install -y amdgpu-dkms amdgpu-dkms-firmware
sudo reboot

Or uninstall amdgpu-dkms entirely — vLLM ships its own ROCm user space, so you don't need the DKMS package unless you also want to run other ROCm tools outside Lemonade.

Verify prerequisites

Kernel version

uname -r # expect 6.18.4 or newer

CWSR properties exported

grep -E "cwsr_size|ctl_stack_size" /sys/class/kfd/kfd/topology/nodes/*/properties
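If you want to script this check (for example in a validation harness), here is a minimal Python sketch of the same verification. The property names come from the grep above; the sysfs node layout is assumed to match it.

```python
# Sketch: flag KFD topology nodes that do not export the CWSR properties
# checked by the grep above. Sysfs layout assumed, not part of this PR.
import glob

for path in glob.glob("/sys/class/kfd/kfd/topology/nodes/*/properties"):
    props = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                props[parts[0]] = parts[1]
    ok = "cwsr_size" in props and "ctl_stack_size" in props
    print(f"{path}: {'OK' if ok else 'missing cwsr_size / ctl_stack_size'}")
```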


Test plan

1. Build

git fetch origin
git checkout test-vllm
./setup.sh
cmake --preset default
cmake --build --preset default -j$(nproc)


2. Start server

./build/lemond --port 8083

In another terminal, verify it's up:

curl -s http://localhost:8083/v1/health | python3 -m json.tool
3. Install vLLM backend

The first install pulls ~5 GB (split into two ~2.5 GB parts). Expect ~60 s on a fast link.
curl -s -X POST http://localhost:8083/v1/install \
  -H "Content-Type: application/json" \
  -d '{"recipe": "vllm", "backend": "rocm"}'
  {"backend":"rocm","recipe":"vllm","status":"success"}
4. Load a small model
time curl -s -X POST http://localhost:8083/v1/load \
  -H "Content-Type: application/json" \
  -d '{"model_name": "Qwen3-0.6B-vllm"}'
{"checkpoint":"Qwen/Qwen3-0.6B","model_name":"Qwen3-0.6B-vllm","recipe":"vllm","status":"success"}

First load takes 20–30 s (Triton JIT compile for the architecture). Subsequent loads are faster.

5. Run inference

curl -s -X POST http://localhost:8083/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen3-0.6B-vllm","messages":[{"role":"user","content":"Say hi in 3 words"}],"max_tokens":30,"temperature":0}'

A scripted variant for repeated, timed runs is sketched below.

Known gotchas

First-run Triton JIT: cold load of a new model size compiles kernels for your GPU, taking 20–350 s. Subsequent loads hit the on-disk cache.
huggingface-hub version conflict: fixed by forcing PYTHONNOUSERSITE=1 when launching vllm-server. If you still hit this, make sure ~/.local/lib/python3.12/site-packages/huggingface_hub isn't shadowing the bundled one (a quick check is sketched after this list).
Transformers version lag: the bundled vLLM 0.19.0 pins transformers <5. Models whose config.json declares model_type: qwen3_5_text (only some newer Qwen3.5 variants) won't load until a vLLM release that bumps the transformers pin. The model registry in this PR avoids those repos.
amdgpu-dkms 6.16.13 masking the built-in driver: see prerequisites. Uninstall or upgrade.
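For the huggingface-hub gotcha above, one quick way to see which copy of the package would win is a check like the following, run with the bundled interpreter. This is a generic Python sketch, not part of the PR.

```python
# Sketch: report which huggingface_hub would be imported and whether it lives
# in the user site-packages (which PYTHONNOUSERSITE=1 is meant to exclude).
import importlib.util
import site

spec = importlib.util.find_spec("huggingface_hub")
origin = spec.origin if spec else None
print("huggingface_hub resolves to:", origin)

user_site = site.getusersitepackages()
if origin and origin.startswith(user_site):
    print("WARNING: user-site copy is shadowing the bundled package:", user_site)
```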

Integrates vLLM as a new WrappedServer backend using AMD's pre-built
ROCm wheels from lemonade-sdk/vllm-rocm. Follows the same patterns
as the llama.cpp backend.

New backend: VLLMServer (recipe: "vllm", binary: "vllm-server")
- Linux-only, ROCm backend (gfx1150, gfx1151, gfx120X)
- Uses HuggingFace model IDs directly (no GGUF)
- Split archive download support for >2GB GitHub release assets

Models: OPT-125M-vllm, Qwen3-0.6B-vllm, Llama-3.2-1B-Instruct-vllm

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
superm1 (Member) commented Apr 4, 2026

Why nightly TheRock SDK? I would think you are better off taking the tagged release (and then you can re-use with different backends like I'm doing for SD...)

ramkrishna2910 (Contributor, Author)

Yeah, plan is to move to tagged release. This PR should be in draft 😅

@ramkrishna2910 ramkrishna2910 marked this pull request as draft April 4, 2026 17:36
Without this, Lemonade forwards requests with its model name
(e.g. "OPT-125M-vllm") but vLLM only accepts the HuggingFace ID
(e.g. "facebook/opt-125m"), causing 404 errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ramkrishna2910 and others added 21 commits April 5, 2026 16:47
- Add 'vllm' to RECIPE_ORDER and RECIPE_DISPLAY_NAMES ("vLLM ROCm")
  so it appears properly in the Backend Manager
- Set suggested=true on all 3 vLLM models so they appear in the
  Model Manager

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the OPT-125M test model. Add Qwen3 family (0.6B, 1.7B, 4B, 8B)
and Qwen3.5 MoE models (3B-A1B, 7B-A3B) for a range of sizes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Qwen3.5 models are multimodal (vision+text) with different naming.
Keep only verified text-only Qwen3 models. Fix Qwen3-8B size to 16.6GB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b1006 includes bundled portable Python (no system Python dependency),
pre-compiled Triton HIP utils, Python headers for Triton JIT, and
all 3 GPU targets (gfx1150, gfx1151, gfx120X).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Release archives are now named {tag}-{arch}-x64.tar.gz instead of
vllm-{tag}-ubuntu-rocm-{arch}-x64.tar.gz. The tag itself contains
the version info (e.g. vllm0.19.0-rocm7.12.0-b1).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New naming scheme with version info. Release includes bundled portable
Python, bundled clang (no system gcc needed), all 3 GPU targets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tar archives lose execute permissions. Previously only the found
binary (vllm-server) was chmod'd, but bundled python3 also needs it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add version_override to InstallParams so backends can specify a
release tag different from backend_versions.json. vLLM uses this to
append the GPU target to the version, creating per-target release
tags (e.g. vllm0.19.0-rocm7.12.0-gfx1150).

Update backend_versions.json to base version without target suffix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enables vLLM on RX 7900/7800/7700 series discrete GPUs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --enforce-eager, --dtype float16, --max-model-len 4096 as
  defaults in vllm_server.cpp (needed for consumer GPU inference)
- Add AWQ quantized models: Qwen3-4B-AWQ, Qwen3-8B-AWQ
- Add more models: Llama-3.2-1B/3B (AWQ), Gemma-3-4b-it, Phi-4-mini
- Use AWQ checkpoints for Llama (casperhansen)
- vLLM auto-detects AWQ from model config, no flag needed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vLLM auto-selects awq_marlin which is extremely slow on consumer
AMD GPUs (gfx1150: 2 tok/s). Force --quantization awq (GEMM kernel)
when model name contains AWQ, which runs at 12 tok/s.

Also reduce default max-model-len to 2048.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add PYTHONNOUSERSITE=1 env var when launching vllm-server to prevent
  system/user Python packages from leaking into the bundled environment
  (fixes ImportError from huggingface-hub version mismatch)
- Include vllm in the gfx1151 CWSR action URL check so users get the
  proper guidance when the kernel fix is missing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous filename-based check forced --quantization awq on any repo
with "AWQ" in the name. Some repos (e.g. cyankiwi/Qwen3.5-4B-AWQ-4bit)
actually use compressed-tensors format, which caused vLLM to fail the
load with a quantization method mismatch.

Read quantization_config.quant_method from the model's config.json:
- Fast path: HF hub cache (no network on subsequent loads)
- Fallback: HTTP GET from huggingface.co on first load, so detection
  works before vLLM has downloaded anything

Still force --quantization awq when the method is AWQ, to keep the
existing workaround for slow awq_marlin on consumer GPUs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
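The change itself is in C++ (vllm_server.cpp); as a rough illustration only, here is a Python sketch of the detection logic described above. The HF cache layout and config.json URL follow standard Hugging Face conventions, but treat the exact paths and the example repo id as assumptions.

```python
# Sketch of the quant-method detection described above. The real implementation
# is C++; cache location, URL shape, and the example repo id are assumptions.
import json
import pathlib
import urllib.request

def detect_quant_method(repo_id: str, hf_cache="~/.cache/huggingface/hub"):
    # Fast path: a config.json already present in the HF hub cache.
    cache_dir = pathlib.Path(hf_cache).expanduser() / f"models--{repo_id.replace('/', '--')}"
    for cfg in cache_dir.glob("snapshots/*/config.json"):
        data = json.loads(cfg.read_text())
        return data.get("quantization_config", {}).get("quant_method")
    # Fallback: fetch config.json from huggingface.co before vLLM downloads anything.
    url = f"https://huggingface.co/{repo_id}/resolve/main/config.json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    return data.get("quantization_config", {}).get("quant_method")

extra_args = []
if detect_quant_method("Qwen/Qwen3-4B-AWQ") == "awq":   # example repo id
    extra_args += ["--quantization", "awq"]  # avoid slow awq_marlin auto-selection
```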
- Qwen3.5 vLLM FP16: 0.8B, 2B, 4B, 9B
- BF16 GGUF (llamacpp) for: Qwen3, Qwen3.5, Llama-3.2-1B/3B, Phi-4-mini,
  Gemma-3-4b-it
- AWQ vLLM for: Qwen3-0.6B/1.7B, Qwen3.5-0.8B/2B/4B/9B, Phi-4-mini

Enables a full throughput matrix across (fp16 / int4) × (vulkan / rocm / vllm)
for benchmarking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch to the same release layout as lemonade-sdk/llamacpp-rocm:
one release per version (e.g. vllm0.19.0-rocm7.12.0) with one asset
per GPU target (e.g. vllm0.19.0-rocm7.12.0-gfx1151-x64.tar.gz),
instead of a separate release per architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The download loop retries every failure 5 times with exponential backoff.
That burns 30+s on permanent 404s that will never succeed — notably when
probing for a single-file asset that's stored as split parts, and when
detecting the end of the parts list.

Fast-exit the retry loop on 4xx client errors. 408 (Request Timeout) and
429 (Too Many Requests) are treated as transient and still retried.

Cuts vLLM install time from 2:55 to 1:00 on a cold cache.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
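As an illustration of the retry policy this commit describes (the real download loop is C++), a hedged Python sketch. The attempt count and backoff constants are assumptions; the 408/429 carve-out follows the message above.

```python
# Sketch of the download retry policy described above: back off on transient
# failures, but fail fast on 4xx client errors except 408 and 429.
import time

import requests

RETRYABLE_4XX = {408, 429}

def fetch(url, attempts=5):
    delay = 1.0
    for attempt in range(attempts):
        try:
            r = requests.get(url, timeout=60)
            if r.status_code < 400:
                return r.content
            if 400 <= r.status_code < 500 and r.status_code not in RETRYABLE_4XX:
                raise RuntimeError(f"permanent client error {r.status_code} for {url}")
        except requests.ConnectionError:
            pass  # transient network error: fall through to backoff
        if attempt + 1 < attempts:
            time.sleep(delay)
            delay *= 2
    raise RuntimeError(f"giving up on {url} after {attempts} attempts")
```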
The existing CWSR sysfs check is insufficient on gfx1151: some OEM/DKMS
combos (e.g. linux-oem-24.04 6.17 + amdgpu-dkms 6.19) expose cwsr_size
and ctl_stack_size but still page-fault on any GPU dispatch. Add a
kernel version guard so /v1/install fails fast on pre-6.18.4 kernels
and points users to docs/gfx1151_linux.html instead of letting them
install a backend that will hang at first inference.

Provides an opt-out (LEMONADE_SKIP_KERNEL_CHECK=1) for users on a
vendor kernel with a known-good backport.

Rewrite docs/gfx1151_linux.html with a tested recipe:
- Recommend mainline 6.18.4+ (linux-oem-24.04 alone is unreliable)
- Call out the amdgpu-dkms 6.16.13 masking problem
- Add a HIP-level verification test, since sysfs properties alone
  don't prove the fix is active
- Document the new LEMONADE_SKIP_KERNEL_CHECK escape hatch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 19 extra entries (12 BF16/F16 GGUFs, 7 AWQ vLLM repos from
third-party uploaders) were added to build a full benchmark matrix
but aren't core to the vLLM backend itself. Keep only the four
Qwen3.5 FP16 vLLM entries for test coverage of the new family.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2K was arbitrary and surprises users coming from long-context models.
16K is large enough to be useful without risking OOM on 32–96 GB VRAM
systems and without blowing up first-run Triton JIT compile time.

Override per-load with vllm_args="--max-model-len 32768" if you need
more, or lower if you're memory-constrained.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
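Usage example for the override mentioned above. The /v1/load endpoint matches the test plan; passing vllm_args in the load body is inferred from this commit message, so treat the exact field shape as an assumption.

```python
# Sketch: load a model with a larger context window via vllm_args.
# The vllm_args field name comes from the commit message above; the exact
# request shape is an assumption.
import requests

requests.post(
    "http://localhost:8083/v1/load",
    json={
        "model_name": "Qwen3.5-4B-vllm",
        "vllm_args": "--max-model-len 32768",
    },
    timeout=600,
).raise_for_status()
```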
Resolved two conflicts, both additive:
- backend_utils.cpp: keep both VLLMServer::SPEC (ours) and
  FastFlowLMServer::SPEC (main) in try_get_spec_for_recipe
- recipe_options.cpp: keep the vllm option entries (ours) alongside
  main's sampling_method / flow_shift additions for sd-cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- validate_vllm.py: iterate all vllm models labeled "hot", install backend
  once, load/test/unload each, emit per-model JSON matching
  llamacpp_validation_*.json shape (model, pass, response, input_tokens,
  output_tokens, time_to_first_token, tokens_per_second). Adds --lite,
  --output, --logs-dir, --skip-install flags mirroring validate_llamacpp.py.
- validate_vllm.yml: four-job pipeline matching validate_llamacpp.yml:
  get-latest-releases (auto-discover vllm-rocm tag), build (update
  backend_versions.json + cmake build), validate (self-hosted stx-halo
  Linux runner, rich artifact upload), create-pr (auto-open bump PR on
  schedule/workflow_dispatch success). Adds pull_request trigger that
  runs in LITE_MODE, upgrades to checkout@v5 and upload-artifact@v7.
- server_models.json: mark Qwen3-0.6B-vllm, Qwen3-4B-AWQ-vllm,
  Qwen3.5-4B-vllm, Llama-3.2-1B-Instruct-vllm as "hot" so validation
  finds them via the same label filter used for llamacpp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
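For reference, a sketch of the per-model record shape described above; the field names come from this commit message, and the values are made up purely for illustration.

```python
# Illustrative per-model record emitted by validate_vllm.py, per the field list
# above. Values are placeholders, not real measurements.
record = {
    "model": "Qwen3-0.6B-vllm",
    "pass": True,
    "response": "Hi there, friend!",
    "input_tokens": 12,
    "output_tokens": 30,
    "time_to_first_token": 0.41,
    "tokens_per_second": 116.7,
}
```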
@ramkrishna2910 ramkrishna2910 self-assigned this Apr 14, 2026
ramkrishna2910 and others added 2 commits April 14, 2026 17:25
The web-app compile step can hit intermittent EACCES errors when webpack
tries to overwrite KaTeX font files placed into build/resources/web-app
by CMake. CI only validates the backend, so opt out via -DBUILD_WEB_APP=OFF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The self-hosted stx-halo runner's kernel tripped the preflight check we
added, blocking /install even though the runner may have a vendor
backport that makes vLLM work. Set LEMONADE_SKIP_KERNEL_CHECK=1 in the
validate job so the pipeline attempts the real install/load; if the
runner is actually broken, vLLM's own failure will surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blocking install at the preflight produced false negatives: users on
vendor kernels with working backports got stranded, and the
self-hosted CI runner (kernel heuristic said 'no', reality said 'yes')
had to set a magic env var to get past our own guard.

Change the contract:
- is_recipe_installed() no longer returns false based on CWSR state
- build_recipes_info() no longer redirects the action to the help URL
- vllm_server.cpp enriches the wait_for_ready timeout error with the
  help URL when needs_gfx1151_cwsr_fix() is true, so users who do hit
  the real page-fault symptom still see the pointer to the docs
- needs_gfx1151_cwsr_fix() goes back to a pure sysfs check; the
  kernel-version heuristic and LEMONADE_SKIP_KERNEL_CHECK env var are
  gone (no longer needed since we don't block)

Docs: drop the now-defunct "Escape hatch" section.
CI: drop the LEMONADE_SKIP_KERNEL_CHECK workaround from the vllm job.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ghost commented Apr 14, 2026

vLLM ROCm Burn Test — All 14 Models on Strix Halo (gfx1151)

Hardware: AMD Ryzen AI MAX+ 395, 128GB unified RAM, Radeon 8060S (gfx1151)
OS: Arch Linux (CachyOS kernel 7.0.0-1-mainline)
Backend: vLLM ROCm via this PR (built from source, test-vllm branch)
Method: 5 runs per model, 2 warmup runs, 200 max tokens, mean ± stddev
Date: 2026-04-14

Results — 12/14 PASSED

| Model | Status | Mean tok/s | ±StdDev | Min | Max | Load Time |
|---|---|---|---|---|---|---|
| Qwen3-0.6B-vllm | ✅ PASS | 116.7 | ±0.2 | 116.4 | 116.9 | 17.0s |
| Qwen3.5-0.8B-vllm | ❌ FAIL | - | - | - | - | 0.5s |
| Qwen3-1.7B-vllm | ✅ PASS | 25.1 | ±26.0 | 2.5 | 47.7 | 19.7s |
| Qwen3.5-2B-vllm | ✅ PASS | 44.0 | ±0.0 | 44.0 | 44.0 | 216.2s |
| Llama-3.2-1B-Instruct-vllm | ✅ PASS | 110.4 | ±0.0 | 110.3 | 110.5 | 38.4s |
| Llama-3.2-3B-Instruct-vllm | ✅ PASS | 50.5 | ±0.1 | 50.3 | 50.5 | 42.8s |
| Qwen3-4B-vllm | ✅ PASS | 25.4 | ±0.0 | 25.4 | 25.4 | 21.8s |
| Qwen3-4B-AWQ-vllm | ✅ PASS | 42.8 | ±0.0 | 42.8 | 42.8 | 75.6s |
| Qwen3.5-4B-vllm | ✅ PASS | 23.8 | ±0.0 | 23.8 | 23.8 | 331.7s |
| Gemma-3-4b-it-vllm | ❌ FAIL | - | - | - | - | 0.4s |
| Phi-4-mini-instruct-vllm | ✅ PASS | 25.1 | ±0.0 | 25.1 | 25.1 | 102.0s |
| Qwen3-8B-vllm | ✅ PASS | 12.3 | ±0.0 | 12.3 | 12.3 | 24.7s |
| Qwen3-8B-AWQ-vllm | ✅ PASS | 22.8 | ±0.0 | 22.8 | 22.8 | 81.2s |
| Qwen3.5-9B-vllm | ✅ PASS | 11.6 | ±0.0 | 11.6 | 11.7 | 309.3s |

Notes

  • First load per model size includes Triton JIT compilation (20-350s). Subsequent loads of same architecture hit on-disk cache.
  • All measurements are wall-clock (request → response).
  • --enforce-eager set by Lemonade's vLLM wrapper (no CUDA graphs).
  • Server ran on port 8083 per PR test plan.
  • Qwen3-1.7B shows high variance (±26.0) — may warrant investigation.
  • AWQ quantized models show ~1.8x speedup over FP16 counterparts (Qwen3-4B: 25.4 → 42.8, Qwen3-8B: 12.3 → 22.8).

Failed Models

  • Qwen3.5-0.8B: Immediate failure (0.5s load), likely unsupported config
  • Gemma-3-4b-it: Immediate failure (0.4s load), likely architecture not yet supported in vLLM ROCm wrapper

Monster Burn (Queued)

Next up: Qwen2.5-72B-AWQ, Mixtral-8x22B, Qwen3-235B-A22B — testing large model viability on 128GB unified.


Tested on bare metal Arch Linux, no containers. Happy to run additional models or reproduce with different parameters. Great work on this PR — the wrapper cleanly solves the Triton HIP "invalid device ordinal" crash we were hitting running vLLM directly.

@ramkrishna2910 ramkrishna2910 marked this pull request as ready for review April 14, 2026 21:27
Comment thread docs/gfx1151_linux.html Outdated
Comment thread src/cpp/server/system_info.cpp Outdated
Comment on lines +312 to +316
// Generic installation check.
// Note: we intentionally do NOT block install here based on kernel/CWSR heuristics.
// The signal is unreliable (vendor backports and older kernels can still work), and
// a false negative strands users. If the kernel is genuinely broken, vLLM/llama.cpp
// will surface a clear failure at load time (see vllm_server.cpp wait_for_ready).
Member

No. It's not a false negative. There are real problems that are incredibly difficult to debug. We need to flag it.

superm1 (Member) left a comment

Please don't make any gfx1151 CWSR detection changes. The problems that can happen are very difficult to debug and will "appear randomly". The fixed kernels are all rolled out in Ubuntu 24.04 6.17 HWE, Ubuntu 24.04 6.14 OEM, and Ubuntu 26.04. They're fixed in Arch and Fedora.

If someone is on something unique they need to upgrade.

Comment thread src/cpp/server/backends/backend_utils.cpp
ghost commented Apr 15, 2026

Test Report — CachyOS / Kernel 7.0 / gfx1151 (Strix Halo)

Tester: @stampby
Hardware: AMD Strix Halo, Radeon 8060S (gfx1151), 128GB unified
OS: Arch Linux (CachyOS), Kernel 7.0.0-1-mainline
Build: ./setup.sh && cmake --build --preset default from test-vllm branch
Server: ./build/lemond --port 8083 per PR test plan
Method: 5 runs per model, 2 warmup discarded, 200 max tokens, mean ± stddev

All 14 Catalog vLLM Models — vLLM ROCm on gfx1151

| Model | Status | Mean tok/s | ±StdDev | Min | Max | Load Time |
|---|---|---|---|---|---|---|
| Qwen3-0.6B-vllm | PASS | 116.7 | ±0.2 | 116.4 | 116.9 | 17.0s |
| Qwen3.5-0.8B-vllm | FAIL | - | - | - | - | 0.5s |
| Qwen3-1.7B-vllm | PASS | 25.1 | ±26.0* | 2.5 | 47.7 | 19.7s |
| Qwen3.5-2B-vllm | PASS | 44.0 | ±0.0 | 44.0 | 44.0 | 216.2s |
| Llama-3.2-1B-Instruct-vllm | PASS | 110.4 | ±0.0 | 110.3 | 110.5 | 38.4s |
| Llama-3.2-3B-Instruct-vllm | PASS | 50.5 | ±0.1 | 50.3 | 50.5 | 42.8s |
| Qwen3-4B-vllm | PASS | 25.4 | ±0.0 | 25.4 | 25.4 | 21.8s |
| Qwen3-4B-AWQ-vllm | PASS | 42.8 | ±0.0 | 42.8 | 42.8 | 75.6s |
| Qwen3.5-4B-vllm | PASS | 23.8 | ±0.0 | 23.8 | 23.8 | 331.7s |
| Gemma-3-4b-it-vllm | FAIL | - | - | - | - | 0.4s |
| Phi-4-mini-instruct-vllm | PASS | 25.1 | ±0.0 | 25.1 | 25.1 | 102.0s |
| Qwen3-8B-vllm | PASS | 12.3 | ±0.0 | 12.3 | 12.3 | 24.7s |
| Qwen3-8B-AWQ-vllm | PASS | 22.8 | ±0.0 | 22.8 | 22.8 | 81.2s |
| Qwen3.5-9B-vllm | PASS | 11.6 | ±0.0 | 11.6 | 11.7 | 309.3s |

12/14 models passed.

Failures

  1. Qwen3.5-0.8B-vllm — load failed immediately (0.5s). May be a model config or transformers version issue.
  2. Gemma-3-4b-it-vllm — load failed. HuggingFace download error on .gitattributes. Likely needs HF auth token for gated model access.

Notes

  • *Qwen3-1.7B had one cold outlier run at 2.5 tok/s (first bench run after JIT), subsequent runs were 47.4-47.7. The ±26.0 stddev reflects that single outlier.
  • AWQ models show ~1.7-1.9x speedup over FP16 counterparts (e.g. Qwen3-4B: 25.4 FP16 vs 42.8 AWQ, Qwen3-8B: 12.3 FP16 vs 22.8 AWQ).
  • Qwen3.5 models have significantly longer first-load times (216-331s) due to Triton JIT compilation for a new architecture. Subsequent loads from cache are fast.
  • All variance is essentially zero (±0.0-0.2) once warmed up. Rock solid.
  • Qwen2.5-72B-Instruct-AWQ is currently running as a bonus test (72B dense on 128GB unified memory, ~2.4 tok/s generation). Results will follow.

Environment

Hardware:     AMD Strix Halo, 128GB unified, Radeon 8060S (gfx1151)
OS:           Arch Linux (CachyOS)
Kernel:       7.0.0-1-mainline
lemond:       Built from test-vllm branch
vllm-rocm:    Installed via /v1/install (rocm backend)
Server port:  8083

Happy to retest anything or run additional models. Hardware is available.

ghost commented Apr 15, 2026

vLLM ROCm gfx1151 Results — Strix Halo, 128GB unified

14 models tested, 12 passed. 5 runs, 2 warmup, 200 max tokens.

| Model | tok/s | ±stddev | Type |
|---|---|---|---|
| Qwen3-0.6B | 116.7 | ±0.2 | FP16 |
| Llama-3.2-1B-Instruct | 110.4 | ±0.0 | AWQ |
| Llama-3.2-3B-Instruct | 50.5 | ±0.1 | AWQ |
| Qwen3.5-2B | 44.0 | ±0.0 | FP16 |
| Qwen3-4B-AWQ | 42.8 | ±0.0 | AWQ |
| Qwen3-4B | 25.4 | ±0.0 | FP16 |
| Phi-4-mini | 25.1 | ±0.0 | FP16 |
| Qwen3-8B-AWQ | 22.8 | ±0.0 | AWQ |
| Qwen3-8B | 12.3 | ±0.0 | FP16 |
| Qwen3.5-9B | 11.6 | ±0.0 | FP16 |
| Qwen2.5-72B-AWQ | 2.3 | ±0.1 | AWQ (72B dense) |

Full results + comparison with MLX Engine ROCm: https://github.com/stampby/bleeding-edge

eddierichter-amd (Contributor)

@ramkrishna2910

I followed the manual smoke-test portion of the vllm test plan against a local Lemonade server, and it worked out of the box without any extra code changes or request tweaking.

Command used:

curl -s -X POST http://localhost:8083/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen3-0.6B-vllm","messages":[{"role":"user","content":"Tell me a story"}],"max_tokens":30,"temperature":0}'

Response:

{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "message": {
        "annotations": null,
        "audio": null,
        "content": "<think>\nOkay, the user asked for a story. I need to come up with something engaging. Let me think about a
simple yet memorable story.",
        "function_call": null,
        "reasoning": null,
        "refusal": null,
        "role": "assistant",
        "tool_calls": []
      },
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "created": 1776357013,
  "id": "chatcmpl-8704fee7a2990b81",
  "kv_transfer_params": null,
  "model": "Qwen3-0.6B-vllm",
  "object": "chat.completion",
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 30,
    "prompt_tokens": 12,
    "prompt_tokens_details": null,
    "total_tokens": 42
  }
}

The request completed successfully and returned a response from Qwen3-0.6B-vllm.

ramkrishna2910 and others added 7 commits April 22, 2026 13:39
Revert the preflight softening per PR feedback. The strict guard
(blocking /install on kernels < 6.18.4 for gfx1151 ROCm) is the
intended behavior: sysfs cwsr_size/ctl_stack_size alone isn't enough
on some OEM/DKMS combos, and letting an install succeed on a broken
kernel leads to a hang at first inference with no actionable pointer
to the docs.

Users on a vendor kernel with a known-good backport opt out with
LEMONADE_SKIP_KERNEL_CHECK=1 (already wired up for the self-hosted
vllm CI runner in 0c093d2).

This reverts commit 2be57b2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep this PR scoped to adding vLLM support; stop touching the CWSR
detection logic and docs. Reviewer note on PR #1537: the fixed kernels
are already rolled out in Ubuntu 24.04 (6.17 HWE, 6.14 OEM), Ubuntu
26.04, Arch, and Fedora. The pre-existing sysfs-only detection passes
on all of those. Adding a >=6.18.4 version guard on top wrongly blocked
users on known-good backported kernels (e.g. linux-oem-24.04's current
6.17.0-1017 build), inconsistent with the old doc's "apt install
linux-oem-24.04" instruction.

Changes:
- src/cpp/server/system_info.cpp: drop kernel_version_lacks_cwsr_fix
  and the LEMONADE_SKIP_KERNEL_CHECK escape hatch; needs_gfx1151_cwsr_fix
  returns to pure sysfs check (cwsr_size and ctl_stack_size both exported).
- docs/gfx1151_linux.html: restore origin/main version (concise
  "upgrade your kernel" guidance, no mainline-install recipe, no DKMS
  sed hack, no HIP verification step, no Escape hatch section).
- .github/workflows/validate_vllm.yml: remove LEMONADE_SKIP_KERNEL_CHECK
  env block (env var no longer exists; runner's sysfs check passes
  on its own on Ubuntu 6.17 OEM).
- src/cpp/server/backends/vllm_server.cpp: keep the load-time "kernel
  may be missing CWSR fix" hint in the wait_for_ready timeout message
  as strictly-additive UX for the rare sysfs-reports-OK-but-GPU-still-
  page-faults case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the Qwen3, Llama-3.2, Gemma-3, and Phi-4-mini vLLM entries; keep
only the four Qwen3.5 sizes (0.8B / 2B / 4B / 9B) which are the set
this PR actually exercises and benchmarks end-to-end. Other model
families can be re-added in follow-ups once they're validated on the
target hardware.

Net: 14 vllm entries -> 4. Total registry: 175 -> 165.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	src/cpp/server/backend_manager.cpp
#	src/cpp/server/system_info.cpp
Main consolidated AMD GPU device types from amd_igpu/amd_dgpu into a
single amd_gpu (consistent with sd-cpp/llamacpp ROCm entries). vLLM's
RECIPE_DEF still used the old split, so the device-type lookup never
matched and every model load returned 404 with "Requires Radeon RX
7000 series (RDNA3)" even on supported gfx1151 hardware.

Switch to the unified amd_gpu type with the union of supported
families (gfx1150/gfx1151 iGPUs + gfx110X/gfx120X dGPUs). Verified
on gfx1151: Qwen3.5-0.8B-vllm load returns 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The label is redundant with the recipe field (already "vllm") and the
"-vllm" suffix in each model id. No code consumes it (only test/validate_vllm.py
filters by recipe + "hot", not the "vllm" label). The convention across
the registry is to use labels for capabilities/curation tags
(reasoning, hot, vision, tool-calling, ...) rather than backend names —
59 of 66 llamacpp entries already follow this convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/cpp/server/recipe_options.cpp Outdated
Comment thread src/cpp/server/system_info.cpp Outdated
superm1 (Member) left a comment

Generally speaking it looks fine now. A few minor nits on things that don't need to change but are being changed.

I do want to eventually get to the point we're not installing another copy of ROCm to make this work though.

Surface the experimental status in three places where end users and
contributors are most likely to encounter the backend:

- README.md: add a vllm/rocm row to the supported-configurations table
  with "(experimental)" annotation; remove the now-stale "vLLM support"
  entry from the Under Consideration column of the roadmap.
- AGENTS.md: add a vLLM row to the backend-abstraction table with
  "Experimental, validated only on gfx1151 (Strix Halo)" in the Purpose
  column.
- src/cpp/resources/server_models.json: add an "experimental" label to
  each of the four Qwen3.5 vllm entries. Labels are already rendered in
  the model picker UI and surfaced via /v1/models, so this propagates
  with no UI plumbing.

No code change. The /install gate, /recipes action redirect, and load
path are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ramkrishna2910 (Contributor, Author)

> Generally speaking it looks fine now. A few minor nits on things that don't need to change but are being changed.
>
> I do want to eventually get to the point we're not installing another copy of ROCm to make this work though.

Yes, agreed. We are using the pre-packaged wheels here. I would suggest sticking to this approach for the experimental release and eventually moving to the common ROCm backend.

ramkrishna2910 and others added 2 commits April 28, 2026 12:43
- recipe_options.cpp: drop the four spurious additions to
  OPTION_TO_CLI_FLAG (steps / cfg_scale / width / height). main
  intentionally removed these flags ("recipe-level only" — see the
  comment at line 198 of main); they shouldn't have come back on this
  branch and are unrelated to vLLM. Per @superm1's review.
- system_info.cpp: revert the cosmetic rewording of the CWSR-check
  doc comment. Per @superm1's review — the original wording was fine,
  shouldn't have been touched in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The steps / cfg_scale / width / height entries got re-added to the
CLI-flag map at some point on this branch, but main intentionally
removed them — recipe-level defaults only, never exposed as CLI args
(see the comment at lines 41-43 of this file: "Image generation params
... are recipe-level defaults only — not exposed as CLI arguments").

Per @superm1's PR review (#1537 r3156710056). Unrelated to vLLM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>