
Add vLLM as a wrapped server backend (ROCm) #1537

Open

ramkrishna2910 wants to merge 36 commits into main from test-vllm

Conversation

ramkrishna2910 (Contributor) commented Apr 4, 2026

Status: early rough draft

Tested only on gfx1151 (Strix Halo / Radeon 8060S). The backend also supports gfx1150, gfx110X, and gfx120X, but those haven't been tested yet. Feedback from people with other AMD GPUs is the main thing we need before treating this PR as ready to merge.

Summary

Adds vLLM as a WrappedServer backend using AMD's pre-built ROCm wheels from lemonade-sdk/vllm-rocm. Follows the same install/load/unload pattern as llama.cpp-ROCm.

Scope of testing

Verified end-to-end (install → load → inference → benchmark) on exactly one configuration:

GPU: Ryzen AI MAX+ PRO 395 w/ Radeon 8060S (gfx1151, Strix Halo)
OS: Ubuntu 24.04
Kernel: 6.18.22 mainline from kernel.ubuntu.com/mainline
amdgpu: built-in driver from the 6.18.22 kernel package; amdgpu-dkms 6.19.0 from ROCm 31.20 also installed but not actively loading
Models: Qwen3 (0.6B–8B), Qwen3.5 (0.8B–9B), Llama-3.2 (1B/3B), Phi-4-mini, Gemma-3-4B, both FP16 and AWQ variants

Not tested:

gfx1150 / gfx110X / gfx120X — the install flow fetches per-arch assets that exist but have never been exercised. Other architectures may have their own kernel/driver gotchas we haven't discovered.
Other distros — everything below assumes Ubuntu 24.04. The kernel and amdgpu setup on Fedora/Arch/openSUSE likely works but is untested.
Multi-user / batched serving — all benchmarks are single-user, one request at a time. vLLM's scheduler strengths (paged attention, prefix caching, continuous batching) are not exercised here.

Prerequisites for testing

Hardware

AMD GPU in one of: gfx1151 (Strix Halo), gfx1150 (Strix Point), gfx110X (RDNA3), gfx120X (RDNA4)
Kernel — strict requirement
You need a kernel with the CWSR (Context Wave Save/Restore) fix. Without it, any GPU dispatch triggers a GCVM_L2_PROTECTION_FAULT and the backend hangs. The verified path is mainline 6.18.4+.

Full doc: docs/gfx1151_linux.html.

If amdgpu-dkms is installed
The default Radeon repo (amdgpu/30.30) ships amdgpu-dkms 6.16.13, which overrides the kernel's built-in driver with a broken version. Either switch the repo to amdgpu/31.20:

sudo sed -i 's|amdgpu/30\.30/|amdgpu/31.20/|g' /etc/apt/sources.list.d/amdgpu*.list
sudo apt update
sudo apt install -y amdgpu-dkms amdgpu-dkms-firmware
sudo reboot

Or uninstall amdgpu-dkms entirely — vLLM ships its own ROCm user space, so you don't need the DKMS package unless you also want to run other ROCm tools outside Lemonade.

Verify prerequisites

Kernel version

uname -r # expect 6.18.4 or newer

CWSR properties exported

grep -E "cwsr_size|ctl_stack_size" /sys/class/kfd/kfd/topology/nodes/*/properties
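If you want to script this check (for example in a validation harness), here is a minimal Python sketch of the same verification. The property names come from the grep above; the sysfs node layout is assumed to match it.

```python
# Sketch: flag KFD topology nodes that do not export the CWSR properties
# checked by the grep above. Sysfs layout assumed, not part of this PR.
import glob

for path in glob.glob("/sys/class/kfd/kfd/topology/nodes/*/properties"):
    props = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                props[parts[0]] = parts[1]
    ok = "cwsr_size" in props and "ctl_stack_size" in props
    print(f"{path}: {'OK' if ok else 'missing cwsr_size / ctl_stack_size'}")
```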


Test plan

1. Build

git fetch origin
git checkout test-vllm
./setup.sh
cmake --preset default
cmake --build --preset default -j$(nproc)


2. Start server

./build/lemond --port 8083

In another terminal, verify it's up:

curl -s http://localhost:8083/v1/health | python3 -m json.tool
3. Install vLLM backend

The first install pulls ~5 GB (split into two ~2.5 GB parts). Expect ~60 s on a fast link.
curl -s -X POST http://localhost:8083/v1/install \
  -H "Content-Type: application/json" \
  -d '{"recipe": "vllm", "backend": "rocm"}'
  {"backend":"rocm","recipe":"vllm","status":"success"}
4. Load a small model
time curl -s -X POST http://localhost:8083/v1/load \
  -H "Content-Type: application/json" \
  -d '{"model_name": "Qwen3-0.6B-vllm"}'
{"checkpoint":"Qwen/Qwen3-0.6B","model_name":"Qwen3-0.6B-vllm","recipe":"vllm","status":"success"}

First load takes 20–30 s (Triton JIT compile for the architecture). Subsequent loads are faster.

5. Run inference

curl -s -X POST http://localhost:8083/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen3-0.6B-vllm","messages":[{"role":"user","content":"Say hi in 3 words"}],"max_tokens":30,"temperature":0}'

A scripted variant for repeated, timed runs is sketched below.

Known gotchas

First-run Triton JIT: cold load of a new model size compiles kernels for your GPU, taking 20–350 s. Subsequent loads hit the on-disk cache.
huggingface-hub version conflict: fixed by forcing PYTHONNOUSERSITE=1 when launching vllm-server. If you still hit this, make sure ~/.local/lib/python3.12/site-packages/huggingface_hub isn't shadowing the bundled one (a quick check is sketched after this list).
Transformers version lag: the bundled vLLM 0.19.0 pins transformers <5. Models whose config.json declares model_type: qwen3_5_text (only some newer Qwen3.5 variants) won't load until a vLLM release that bumps the transformers pin. The model registry in this PR avoids those repos.
amdgpu-dkms 6.16.13 masking the built-in driver: see prerequisites. Uninstall or upgrade.
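For the huggingface-hub gotcha above, one quick way to see which copy of the package would win is a check like the following, run with the bundled interpreter. This is a generic Python sketch, not part of the PR.

```python
# Sketch: report which huggingface_hub would be imported and whether it lives
# in the user site-packages (which PYTHONNOUSERSITE=1 is meant to exclude).
import importlib.util
import site

spec = importlib.util.find_spec("huggingface_hub")
origin = spec.origin if spec else None
print("huggingface_hub resolves to:", origin)

user_site = site.getusersitepackages()
if origin and origin.startswith(user_site):
    print("WARNING: user-site copy is shadowing the bundled package:", user_site)
```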

Integrates vLLM as a new WrappedServer backend using AMD's pre-built
ROCm wheels from lemonade-sdk/vllm-rocm. Follows the same patterns
as the llama.cpp backend.

New backend: VLLMServer (recipe: "vllm", binary: "vllm-server")
- Linux-only, ROCm backend (gfx1150, gfx1151, gfx120X)
- Uses HuggingFace model IDs directly (no GGUF)
- Split archive download support for >2GB GitHub release assets

Models: OPT-125M-vllm, Qwen3-0.6B-vllm, Llama-3.2-1B-Instruct-vllm

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
superm1 (Member) commented Apr 4, 2026

Why nightly TheRock SDK? I would think you are better off taking the tagged release (and then you can re-use with different backends like I'm doing for SD...)

ramkrishna2910 (Contributor, Author)

Yeah, plan is to move to tagged release. This PR should be in draft 😅

@ramkrishna2910 ramkrishna2910 marked this pull request as draft April 4, 2026 17:36
Without this, Lemonade forwards requests with its model name
(e.g. "OPT-125M-vllm") but vLLM only accepts the HuggingFace ID
(e.g. "facebook/opt-125m"), causing 404 errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ramkrishna2910 and others added 21 commits April 5, 2026 16:47
- Add 'vllm' to RECIPE_ORDER and RECIPE_DISPLAY_NAMES ("vLLM ROCm")
  so it appears properly in the Backend Manager
- Set suggested=true on all 3 vLLM models so they appear in the
  Model Manager

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the OPT-125M test model. Add Qwen3 family (0.6B, 1.7B, 4B, 8B)
and Qwen3.5 MoE models (3B-A1B, 7B-A3B) for a range of sizes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Qwen3.5 models are multimodal (vision+text) with different naming.
Keep only verified text-only Qwen3 models. Fix Qwen3-8B size to 16.6GB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b1006 includes bundled portable Python (no system Python dependency),
pre-compiled Triton HIP utils, Python headers for Triton JIT, and
all 3 GPU targets (gfx1150, gfx1151, gfx120X).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Release archives are now named {tag}-{arch}-x64.tar.gz instead of
vllm-{tag}-ubuntu-rocm-{arch}-x64.tar.gz. The tag itself contains
the version info (e.g. vllm0.19.0-rocm7.12.0-b1).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New naming scheme with version info. Release includes bundled portable
Python, bundled clang (no system gcc needed), all 3 GPU targets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tar archives lose execute permissions. Previously only the found
binary (vllm-server) was chmod'd, but bundled python3 also needs it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add version_override to InstallParams so backends can specify a
release tag different from backend_versions.json. vLLM uses this to
append the GPU target to the version, creating per-target release
tags (e.g. vllm0.19.0-rocm7.12.0-gfx1150).

Update backend_versions.json to base version without target suffix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enables vLLM on RX 7900/7800/7700 series discrete GPUs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --enforce-eager, --dtype float16, --max-model-len 4096 as
  defaults in vllm_server.cpp (needed for consumer GPU inference)
- Add AWQ quantized models: Qwen3-4B-AWQ, Qwen3-8B-AWQ
- Add more models: Llama-3.2-1B/3B (AWQ), Gemma-3-4b-it, Phi-4-mini
- Use AWQ checkpoints for Llama (casperhansen)
- vLLM auto-detects AWQ from model config, no flag needed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vLLM auto-selects awq_marlin which is extremely slow on consumer
AMD GPUs (gfx1150: 2 tok/s). Force --quantization awq (GEMM kernel)
when model name contains AWQ, which runs at 12 tok/s.

Also reduce default max-model-len to 2048.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add PYTHONNOUSERSITE=1 env var when launching vllm-server to prevent
  system/user Python packages from leaking into the bundled environment
  (fixes ImportError from huggingface-hub version mismatch)
- Include vllm in the gfx1151 CWSR action URL check so users get the
  proper guidance when the kernel fix is missing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous filename-based check forced --quantization awq on any repo
with "AWQ" in the name. Some repos (e.g. cyankiwi/Qwen3.5-4B-AWQ-4bit)
actually use compressed-tensors format, which caused vLLM to fail the
load with a quantization method mismatch.

Read quantization_config.quant_method from the model's config.json:
- Fast path: HF hub cache (no network on subsequent loads)
- Fallback: HTTP GET from huggingface.co on first load, so detection
  works before vLLM has downloaded anything

Still force --quantization awq when the method is AWQ, to keep the
existing workaround for slow awq_marlin on consumer GPUs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
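The change itself is in C++ (vllm_server.cpp); as a rough illustration only, here is a Python sketch of the detection logic described above. The HF cache layout and config.json URL follow standard Hugging Face conventions, but treat the exact paths and the example repo id as assumptions.

```python
# Sketch of the quant-method detection described above. The real implementation
# is C++; cache location, URL shape, and the example repo id are assumptions.
import json
import pathlib
import urllib.request

def detect_quant_method(repo_id: str, hf_cache="~/.cache/huggingface/hub"):
    # Fast path: a config.json already present in the HF hub cache.
    cache_dir = pathlib.Path(hf_cache).expanduser() / f"models--{repo_id.replace('/', '--')}"
    for cfg in cache_dir.glob("snapshots/*/config.json"):
        data = json.loads(cfg.read_text())
        return data.get("quantization_config", {}).get("quant_method")
    # Fallback: fetch config.json from huggingface.co before vLLM downloads anything.
    url = f"https://huggingface.co/{repo_id}/resolve/main/config.json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    return data.get("quantization_config", {}).get("quant_method")

extra_args = []
if detect_quant_method("Qwen/Qwen3-4B-AWQ") == "awq":   # example repo id
    extra_args += ["--quantization", "awq"]  # avoid slow awq_marlin auto-selection
```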
- Qwen3.5 vLLM FP16: 0.8B, 2B, 4B, 9B
- BF16 GGUF (llamacpp) for: Qwen3, Qwen3.5, Llama-3.2-1B/3B, Phi-4-mini,
  Gemma-3-4b-it
- AWQ vLLM for: Qwen3-0.6B/1.7B, Qwen3.5-0.8B/2B/4B/9B, Phi-4-mini

Enables a full throughput matrix across (fp16 / int4) × (vulkan / rocm / vllm)
for benchmarking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch to the same release layout as lemonade-sdk/llamacpp-rocm:
one release per version (e.g. vllm0.19.0-rocm7.12.0) with one asset
per GPU target (e.g. vllm0.19.0-rocm7.12.0-gfx1151-x64.tar.gz),
instead of a separate release per architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The download loop retries every failure 5 times with exponential backoff.
That burns 30+s on permanent 404s that will never succeed — notably when
probing for a single-file asset that's stored as split parts, and when
detecting the end of the parts list.

Fast-exit the retry loop on 4xx client errors. 408 (Request Timeout) and
429 (Too Many Requests) are treated as transient and still retried.

Cuts vLLM install time from 2:55 to 1:00 on a cold cache.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
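As an illustration of the retry policy this commit describes (the real download loop is C++), a hedged Python sketch. The attempt count and backoff constants are assumptions; the 408/429 carve-out follows the message above.

```python
# Sketch of the download retry policy described above: back off on transient
# failures, but fail fast on 4xx client errors except 408 and 429.
import time

import requests

RETRYABLE_4XX = {408, 429}

def fetch(url, attempts=5):
    delay = 1.0
    for attempt in range(attempts):
        try:
            r = requests.get(url, timeout=60)
            if r.status_code < 400:
                return r.content
            if 400 <= r.status_code < 500 and r.status_code not in RETRYABLE_4XX:
                raise RuntimeError(f"permanent client error {r.status_code} for {url}")
        except requests.ConnectionError:
            pass  # transient network error: fall through to backoff
        if attempt + 1 < attempts:
            time.sleep(delay)
            delay *= 2
    raise RuntimeError(f"giving up on {url} after {attempts} attempts")
```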
The existing CWSR sysfs check is insufficient on gfx1151: some OEM/DKMS
combos (e.g. linux-oem-24.04 6.17 + amdgpu-dkms 6.19) expose cwsr_size
and ctl_stack_size but still page-fault on any GPU dispatch. Add a
kernel version guard so /v1/install fails fast on pre-6.18.4 kernels
and points users to docs/gfx1151_linux.html instead of letting them
install a backend that will hang at first inference.

Provides an opt-out (LEMONADE_SKIP_KERNEL_CHECK=1) for users on a
vendor kernel with a known-good backport.

Rewrite docs/gfx1151_linux.html with a tested recipe:
- Recommend mainline 6.18.4+ (linux-oem-24.04 alone is unreliable)
- Call out the amdgpu-dkms 6.16.13 masking problem
- Add a HIP-level verification test, since sysfs properties alone
  don't prove the fix is active
- Document the new LEMONADE_SKIP_KERNEL_CHECK escape hatch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 19 extra entries (12 BF16/F16 GGUFs, 7 AWQ vLLM repos from
third-party uploaders) were added to build a full benchmark matrix
but aren't core to the vLLM backend itself. Keep only the four
Qwen3.5 FP16 vLLM entries for test coverage of the new family.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2K was arbitrary and surprises users coming from long-context models.
16K is large enough to be useful without risking OOM on 32–96 GB VRAM
systems and without blowing up first-run Triton JIT compile time.

Override per-load with vllm_args="--max-model-len 32768" if you need
more, or lower if you're memory-constrained.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
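Usage example for the override mentioned above. The /v1/load endpoint matches the test plan; passing vllm_args in the load body is inferred from this commit message, so treat the exact field shape as an assumption.

```python
# Sketch: load a model with a larger context window via vllm_args.
# The vllm_args field name comes from the commit message above; the exact
# request shape is an assumption.
import requests

requests.post(
    "http://localhost:8083/v1/load",
    json={
        "model_name": "Qwen3.5-4B-vllm",
        "vllm_args": "--max-model-len 32768",
    },
    timeout=600,
).raise_for_status()
```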
Resolved two conflicts, both additive:
- backend_utils.cpp: keep both VLLMServer::SPEC (ours) and
  FastFlowLMServer::SPEC (main) in try_get_spec_for_recipe
- recipe_options.cpp: keep the vllm option entries (ours) alongside
  main's sampling_method / flow_shift additions for sd-cpp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- validate_vllm.py: iterate all vllm models labeled "hot", install backend
  once, load/test/unload each, emit per-model JSON matching
  llamacpp_validation_*.json shape (model, pass, response, input_tokens,
  output_tokens, time_to_first_token, tokens_per_second). Adds --lite,
  --output, --logs-dir, --skip-install flags mirroring validate_llamacpp.py.
- validate_vllm.yml: four-job pipeline matching validate_llamacpp.yml:
  get-latest-releases (auto-discover vllm-rocm tag), build (update
  backend_versions.json + cmake build), validate (self-hosted stx-halo
  Linux runner, rich artifact upload), create-pr (auto-open bump PR on
  schedule/workflow_dispatch success). Adds pull_request trigger that
  runs in LITE_MODE, upgrades to checkout@v5 and upload-artifact@v7.
- server_models.json: mark Qwen3-0.6B-vllm, Qwen3-4B-AWQ-vllm,
  Qwen3.5-4B-vllm, Llama-3.2-1B-Instruct-vllm as "hot" so validation
  finds them via the same label filter used for llamacpp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
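For reference, a sketch of the per-model record shape described above; the field names come from this commit message, and the values are made up purely for illustration.

```python
# Illustrative per-model record emitted by validate_vllm.py, per the field list
# above. Values are placeholders, not real measurements.
record = {
    "model": "Qwen3-0.6B-vllm",
    "pass": True,
    "response": "Hi there, friend!",
    "input_tokens": 12,
    "output_tokens": 30,
    "time_to_first_token": 0.41,
    "tokens_per_second": 116.7,
}
```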
@ramkrishna2910 ramkrishna2910 self-assigned this Apr 14, 2026
ramkrishna2910 and others added 2 commits April 14, 2026 17:25
The web-app compile step can hit intermittent EACCES errors when webpack
tries to overwrite KaTeX font files placed into build/resources/web-app
by CMake. CI only validates the backend, so opt out via -DBUILD_WEB_APP=OFF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The self-hosted stx-halo runner's kernel tripped the preflight check we
added, blocking /install even though the runner may have a vendor
backport that makes vLLM work. Set LEMONADE_SKIP_KERNEL_CHECK=1 in the
validate job so the pipeline attempts the real install/load; if the
runner is actually broken, vLLM's own failure will surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blocking install at the preflight produced false negatives: users on
vendor kernels with working backports got stranded, and the
self-hosted CI runner (kernel heuristic said 'no', reality said 'yes')
had to set a magic env var to get past our own guard.

Change the contract:
- is_recipe_installed() no longer returns false based on CWSR state
- build_recipes_info() no longer redirects the action to the help URL
- vllm_server.cpp enriches the wait_for_ready timeout error with the
  help URL when needs_gfx1151_cwsr_fix() is true, so users who do hit
  the real page-fault symptom still see the pointer to the docs
- needs_gfx1151_cwsr_fix() goes back to a pure sysfs check; the
  kernel-version heuristic and LEMONADE_SKIP_KERNEL_CHECK env var are
  gone (no longer needed since we don't block)

Docs: drop the now-defunct "Escape hatch" section.
CI: drop the LEMONADE_SKIP_KERNEL_CHECK workaround from the vllm job.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ghost commented Apr 14, 2026

vLLM ROCm Burn Test — All 14 Models on Strix Halo (gfx1151)

Hardware: AMD Ryzen AI MAX+ 395, 128GB unified RAM, Radeon 8060S (gfx1151)
OS: Arch Linux (CachyOS kernel 7.0.0-1-mainline)
Backend: vLLM ROCm via this PR (built from source, test-vllm branch)
Method: 5 runs per model, 2 warmup runs, 200 max tokens, mean ± stddev
Date: 2026-04-14

Results — 12/14 PASSED

| Model | Status | Mean tok/s | ±StdDev | Min | Max | Load Time |
|---|---|---|---|---|---|---|
| Qwen3-0.6B-vllm | ✅ PASS | 116.7 | ±0.2 | 116.4 | 116.9 | 17.0s |
| Qwen3.5-0.8B-vllm | ❌ FAIL | - | - | - | - | 0.5s |
| Qwen3-1.7B-vllm | ✅ PASS | 25.1 | ±26.0 | 2.5 | 47.7 | 19.7s |
| Qwen3.5-2B-vllm | ✅ PASS | 44.0 | ±0.0 | 44.0 | 44.0 | 216.2s |
| Llama-3.2-1B-Instruct-vllm | ✅ PASS | 110.4 | ±0.0 | 110.3 | 110.5 | 38.4s |
| Llama-3.2-3B-Instruct-vllm | ✅ PASS | 50.5 | ±0.1 | 50.3 | 50.5 | 42.8s |
| Qwen3-4B-vllm | ✅ PASS | 25.4 | ±0.0 | 25.4 | 25.4 | 21.8s |
| Qwen3-4B-AWQ-vllm | ✅ PASS | 42.8 | ±0.0 | 42.8 | 42.8 | 75.6s |
| Qwen3.5-4B-vllm | ✅ PASS | 23.8 | ±0.0 | 23.8 | 23.8 | 331.7s |
| Gemma-3-4b-it-vllm | ❌ FAIL | - | - | - | - | 0.4s |
| Phi-4-mini-instruct-vllm | ✅ PASS | 25.1 | ±0.0 | 25.1 | 25.1 | 102.0s |
| Qwen3-8B-vllm | ✅ PASS | 12.3 | ±0.0 | 12.3 | 12.3 | 24.7s |
| Qwen3-8B-AWQ-vllm | ✅ PASS | 22.8 | ±0.0 | 22.8 | 22.8 | 81.2s |
| Qwen3.5-9B-vllm | ✅ PASS | 11.6 | ±0.0 | 11.6 | 11.7 | 309.3s |

Notes

  • First load per model size includes Triton JIT compilation (20-350s). Subsequent loads of same architecture hit on-disk cache.
  • All measurements are wall-clock (request → response).
  • --enforce-eager set by Lemonade's vLLM wrapper (no CUDA graphs).
  • Server ran on port 8083 per PR test plan.
  • Qwen3-1.7B shows high variance (±26.0) — may warrant investigation.
  • AWQ quantized models show ~1.8x speedup over FP16 counterparts (Qwen3-4B: 25.4 → 42.8, Qwen3-8B: 12.3 → 22.8).

Failed Models

  • Qwen3.5-0.8B: Immediate failure (0.5s load), likely unsupported config
  • Gemma-3-4b-it: Immediate failure (0.4s load), likely architecture not yet supported in vLLM ROCm wrapper

Monster Burn (Queued)

Next up: Qwen2.5-72B-AWQ, Mixtral-8x22B, Qwen3-235B-A22B — testing large model viability on 128GB unified.


Tested on bare metal Arch Linux, no containers. Happy to run additional models or reproduce with different parameters. Great work on this PR — the wrapper cleanly solves the Triton HIP "invalid device ordinal" crash we were hitting running vLLM directly.

@ramkrishna2910 ramkrishna2910 marked this pull request as ready for review April 14, 2026 21:27
Comment thread docs/gfx1151_linux.html Outdated
Comment thread src/cpp/server/system_info.cpp Outdated
Comment on lines +312 to +316
// Generic installation check.
// Note: we intentionally do NOT block install here based on kernel/CWSR heuristics.
// The signal is unreliable (vendor backports and older kernels can still work), and
// a false negative strands users. If the kernel is genuinely broken, vLLM/llama.cpp
// will surface a clear failure at load time (see vllm_server.cpp wait_for_ready).
Member

No. It's not a false negative. There are real problems that are incredibly difficult to debug. We need to flag it.

superm1 (Member) left a comment

Please don't make any gfx1151 CWSR detection changes. The problems that can happen are very difficult to debug and will "appear randomly". The fixed kernels are all rolled out in Ubuntu 24.04 6.17 HWE, Ubuntu 24.04 6.14 OEM, and Ubuntu 26.04. They're fixed in Arch and Fedora.

If someone is on something unique they need to upgrade.

Comment thread src/cpp/server/backends/backend_utils.cpp
ghost commented Apr 15, 2026

Test Report — CachyOS / Kernel 7.0 / gfx1151 (Strix Halo)

Tester: @stampby
Hardware: AMD Strix Halo, Radeon 8060S (gfx1151), 128GB unified
OS: Arch Linux (CachyOS), Kernel 7.0.0-1-mainline
Build: ./setup.sh && cmake --build --preset default from test-vllm branch
Server: ./build/lemond --port 8083 per PR test plan
Method: 5 runs per model, 2 warmup discarded, 200 max tokens, mean ± stddev

All 14 Catalog vLLM Models — vLLM ROCm on gfx1151

| Model | Status | Mean tok/s | ±StdDev | Min | Max | Load Time |
|---|---|---|---|---|---|---|
| Qwen3-0.6B-vllm | PASS | 116.7 | ±0.2 | 116.4 | 116.9 | 17.0s |
| Qwen3.5-0.8B-vllm | FAIL | - | - | - | - | 0.5s |
| Qwen3-1.7B-vllm | PASS | 25.1 | ±26.0* | 2.5 | 47.7 | 19.7s |
| Qwen3.5-2B-vllm | PASS | 44.0 | ±0.0 | 44.0 | 44.0 | 216.2s |
| Llama-3.2-1B-Instruct-vllm | PASS | 110.4 | ±0.0 | 110.3 | 110.5 | 38.4s |
| Llama-3.2-3B-Instruct-vllm | PASS | 50.5 | ±0.1 | 50.3 | 50.5 | 42.8s |
| Qwen3-4B-vllm | PASS | 25.4 | ±0.0 | 25.4 | 25.4 | 21.8s |
| Qwen3-4B-AWQ-vllm | PASS | 42.8 | ±0.0 | 42.8 | 42.8 | 75.6s |
| Qwen3.5-4B-vllm | PASS | 23.8 | ±0.0 | 23.8 | 23.8 | 331.7s |
| Gemma-3-4b-it-vllm | FAIL | - | - | - | - | 0.4s |
| Phi-4-mini-instruct-vllm | PASS | 25.1 | ±0.0 | 25.1 | 25.1 | 102.0s |
| Qwen3-8B-vllm | PASS | 12.3 | ±0.0 | 12.3 | 12.3 | 24.7s |
| Qwen3-8B-AWQ-vllm | PASS | 22.8 | ±0.0 | 22.8 | 22.8 | 81.2s |
| Qwen3.5-9B-vllm | PASS | 11.6 | ±0.0 | 11.6 | 11.7 | 309.3s |

12/14 models passed.

Failures

  1. Qwen3.5-0.8B-vllm — load failed immediately (0.5s). May be a model config or transformers version issue.
  2. Gemma-3-4b-it-vllm — load failed. HuggingFace download error on .gitattributes. Likely needs HF auth token for gated model access.

Notes

  • *Qwen3-1.7B had one cold outlier run at 2.5 tok/s (first bench run after JIT), subsequent runs were 47.4-47.7. The ±26.0 stddev reflects that single outlier.
  • AWQ models show ~1.7-1.9x speedup over FP16 counterparts (e.g. Qwen3-4B: 25.4 FP16 vs 42.8 AWQ, Qwen3-8B: 12.3 FP16 vs 22.8 AWQ).
  • Qwen3.5 models have significantly longer first-load times (216-331s) due to Triton JIT compilation for a new architecture. Subsequent loads from cache are fast.
  • All variance is essentially zero (±0.0-0.2) once warmed up. Rock solid.
  • Qwen2.5-72B-Instruct-AWQ is currently running as a bonus test (72B dense on 128GB unified memory, ~2.4 tok/s generation). Results will follow.

Environment

Hardware:     AMD Strix Halo, 128GB unified, Radeon 8060S (gfx1151)
OS:           Arch Linux (CachyOS)
Kernel:       7.0.0-1-mainline
lemond:       Built from test-vllm branch
vllm-rocm:    Installed via /v1/install (rocm backend)
Server port:  8083

Happy to retest anything or run additional models. Hardware is available.

ghost commented Apr 15, 2026

vLLM ROCm gfx1151 Results — Strix Halo, 128GB unified

14 models tested, 12 passed. 5 runs, 2 warmup, 200 max tokens.

| Model | tok/s | ±stddev | Type |
|---|---|---|---|
| Qwen3-0.6B | 116.7 | ±0.2 | FP16 |
| Llama-3.2-1B-Instruct | 110.4 | ±0.0 | AWQ |
| Llama-3.2-3B-Instruct | 50.5 | ±0.1 | AWQ |
| Qwen3.5-2B | 44.0 | ±0.0 | FP16 |
| Qwen3-4B-AWQ | 42.8 | ±0.0 | AWQ |
| Qwen3-4B | 25.4 | ±0.0 | FP16 |
| Phi-4-mini | 25.1 | ±0.0 | FP16 |
| Qwen3-8B-AWQ | 22.8 | ±0.0 | AWQ |
| Qwen3-8B | 12.3 | ±0.0 | FP16 |
| Qwen3.5-9B | 11.6 | ±0.0 | FP16 |
| Qwen2.5-72B-AWQ | 2.3 | ±0.1 | AWQ (72B dense) |

Full results + comparison with MLX Engine ROCm: https://github.com/stampby/bleeding-edge

eddierichter-amd (Contributor)

@ramkrishna2910

I followed the manual smoke-test portion of the vllm test plan against a local Lemonade server, and it worked out of the box without any extra code changes or request tweaking.

Command used:

curl -s -X POST http://localhost:8083/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen3-0.6B-vllm","messages":[{"role":"user","content":"Tell me a story"}],"max_tokens":30,"temperature":0}'

Response:

{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "message": {
        "annotations": null,
        "audio": null,
        "content": "<think>\nOkay, the user asked for a story. I need to come up with something engaging. Let me think about a
simple yet memorable story.",
        "function_call": null,
        "reasoning": null,
        "refusal": null,
        "role": "assistant",
        "tool_calls": []
      },
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "created": 1776357013,
  "id": "chatcmpl-8704fee7a2990b81",
  "kv_transfer_params": null,
  "model": "Qwen3-0.6B-vllm",
  "object": "chat.completion",
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 30,
    "prompt_tokens": 12,
    "prompt_tokens_details": null,
    "total_tokens": 42
  }
}

The request completed successfully and returned a response from Qwen3-0.6B-vllm.

ramkrishna2910 and others added 7 commits April 22, 2026 13:39
Revert the preflight softening per PR feedback. The strict guard
(blocking /install on kernels < 6.18.4 for gfx1151 ROCm) is the
intended behavior: sysfs cwsr_size/ctl_stack_size alone isn't enough
on some OEM/DKMS combos, and letting an install succeed on a broken
kernel leads to a hang at first inference with no actionable pointer
to the docs.

Users on a vendor kernel with a known-good backport opt out with
LEMONADE_SKIP_KERNEL_CHECK=1 (already wired up for the self-hosted
vllm CI runner in 0c093d2).

This reverts commit 2be57b2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep this PR scoped to adding vLLM support; stop touching the CWSR
detection logic and docs. Reviewer note on PR #1537: the fixed kernels
are already rolled out in Ubuntu 24.04 (6.17 HWE, 6.14 OEM), Ubuntu
26.04, Arch, and Fedora. The pre-existing sysfs-only detection passes
on all of those. Adding a >=6.18.4 version guard on top wrongly blocked
users on known-good backported kernels (e.g. linux-oem-24.04's current
6.17.0-1017 build), inconsistent with the old doc's "apt install
linux-oem-24.04" instruction.

Changes:
- src/cpp/server/system_info.cpp: drop kernel_version_lacks_cwsr_fix
  and the LEMONADE_SKIP_KERNEL_CHECK escape hatch; needs_gfx1151_cwsr_fix
  returns to pure sysfs check (cwsr_size and ctl_stack_size both exported).
- docs/gfx1151_linux.html: restore origin/main version (concise
  "upgrade your kernel" guidance, no mainline-install recipe, no DKMS
  sed hack, no HIP verification step, no Escape hatch section).
- .github/workflows/validate_vllm.yml: remove LEMONADE_SKIP_KERNEL_CHECK
  env block (env var no longer exists; runner's sysfs check passes
  on its own on Ubuntu 6.17 OEM).
- src/cpp/server/backends/vllm_server.cpp: keep the load-time "kernel
  may be missing CWSR fix" hint in the wait_for_ready timeout message
  as strictly-additive UX for the rare sysfs-reports-OK-but-GPU-still-
  page-faults case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the Qwen3, Llama-3.2, Gemma-3, and Phi-4-mini vLLM entries; keep
only the four Qwen3.5 sizes (0.8B / 2B / 4B / 9B) which are the set
this PR actually exercises and benchmarks end-to-end. Other model
families can be re-added in follow-ups once they're validated on the
target hardware.

Net: 14 vllm entries -> 4. Total registry: 175 -> 165.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	src/cpp/server/backend_manager.cpp
#	src/cpp/server/system_info.cpp
Main consolidated AMD GPU device types from amd_igpu/amd_dgpu into a
single amd_gpu (consistent with sd-cpp/llamacpp ROCm entries). vLLM's
RECIPE_DEF still used the old split, so the device-type lookup never
matched and every model load returned 404 with "Requires Radeon RX
7000 series (RDNA3)" even on supported gfx1151 hardware.

Switch to the unified amd_gpu type with the union of supported
families (gfx1150/gfx1151 iGPUs + gfx110X/gfx120X dGPUs). Verified
on gfx1151: Qwen3.5-0.8B-vllm load returns 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The label is redundant with the recipe field (already "vllm") and the
"-vllm" suffix in each model id. No code consumes it (only test/validate_vllm.py
filters by recipe + "hot", not the "vllm" label). The convention across
the registry is to use labels for capabilities/curation tags
(reasoning, hot, vision, tool-calling, ...) rather than backend names —
59 of 66 llamacpp entries already follow this convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/cpp/server/recipe_options.cpp Outdated
Comment thread src/cpp/server/system_info.cpp Outdated
superm1 (Member) left a comment

Generally speaking it looks fine now. A few minor nits on things that don't need to change but are being changed.

I do want to eventually get to the point we're not installing another copy of ROCm to make this work though.

Surface the experimental status in three places where end users and
contributors are most likely to encounter the backend:

- README.md: add a vllm/rocm row to the supported-configurations table
  with "(experimental)" annotation; remove the now-stale "vLLM support"
  entry from the Under Consideration column of the roadmap.
- AGENTS.md: add a vLLM row to the backend-abstraction table with
  "Experimental, validated only on gfx1151 (Strix Halo)" in the Purpose
  column.
- src/cpp/resources/server_models.json: add an "experimental" label to
  each of the four Qwen3.5 vllm entries. Labels are already rendered in
  the model picker UI and surfaced via /v1/models, so this propagates
  with no UI plumbing.

No code change. The /install gate, /recipes action redirect, and load
path are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ramkrishna2910 (Contributor, Author)

> Generally speaking it looks fine now. A few minor nits on things that don't need to change but are being changed.
>
> I do want to eventually get to the point we're not installing another copy of ROCm to make this work though.

Yes, agreed. We are using the pre-packaged wheels here. I would suggest sticking to this approach for the experimental release and eventually moving to the common ROCm backend.

ramkrishna2910 and others added 2 commits April 28, 2026 12:43
- recipe_options.cpp: drop the four spurious additions to
  OPTION_TO_CLI_FLAG (steps / cfg_scale / width / height). main
  intentionally removed these flags ("recipe-level only" — see the
  comment at line 198 of main); they shouldn't have come back on this
  branch and are unrelated to vLLM. Per @superm1's review.
- system_info.cpp: revert the cosmetic rewording of the CWSR-check
  doc comment. Per @superm1's review — the original wording was fine,
  shouldn't have been touched in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The steps / cfg_scale / width / height entries got re-added to the
CLI-flag map at some point on this branch, but main intentionally
removed them — recipe-level defaults only, never exposed as CLI args
(see the comment at lines 41-43 of this file: "Image generation params
... are recipe-level defaults only — not exposed as CLI arguments").

Per @superm1's PR review (#1537 r3156710056). Unrelated to vLLM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>