Add vLLM as a wrapped server backend (ROCm)#1537
Add vLLM as a wrapped server backend (ROCm)#1537ramkrishna2910 wants to merge 36 commits intomainfrom
Conversation
Integrates vLLM as a new WrappedServer backend using AMD's pre-built ROCm wheels from lemonade-sdk/vllm-rocm. Follows the same patterns as the llama.cpp backend. New backend: VLLMServer (recipe: "vllm", binary: "vllm-server") - Linux-only, ROCm backend (gfx1150, gfx1151, gfx120X) - Uses HuggingFace model IDs directly (no GGUF) - Split archive download support for >2GB GitHub release assets Models: OPT-125M-vllm, Qwen3-0.6B-vllm, Llama-3.2-1B-Instruct-vllm Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Why nightly TheRock SDK? I would think you are better off taking the tagged release (and then you can re-use with different backends like I'm doing for SD...) |
|
Yeah, plan is to move to tagged release. This PR should be in draft 😅 |
Without this, Lemonade forwards requests with its model name (e.g. "OPT-125M-vllm") but vLLM only accepts the HuggingFace ID (e.g. "facebook/opt-125m"), causing 404 errors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add 'vllm' to RECIPE_ORDER and RECIPE_DISPLAY_NAMES ("vLLM ROCm")
so it appears properly in the Backend Manager
- Set suggested=true on all 3 vLLM models so they appear in the
Model Manager
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the OPT-125M test model. Add Qwen3 family (0.6B, 1.7B, 4B, 8B) and Qwen3.5 MoE models (3B-A1B, 7B-A3B) for a range of sizes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Qwen3.5 models are multimodal (vision+text) with different naming. Keep only verified text-only Qwen3 models. Fix Qwen3-8B size to 16.6GB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b1006 includes bundled portable Python (no system Python dependency), pre-compiled Triton HIP utils, Python headers for Triton JIT, and all 3 GPU targets (gfx1150, gfx1151, gfx120X). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Release archives are now named {tag}-{arch}-x64.tar.gz instead of
vllm-{tag}-ubuntu-rocm-{arch}-x64.tar.gz. The tag itself contains
the version info (e.g. vllm0.19.0-rocm7.12.0-b1).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New naming scheme with version info. Release includes bundled portable Python, bundled clang (no system gcc needed), all 3 GPU targets. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tar archives lose execute permissions. Previously only the found binary (vllm-server) was chmod'd, but bundled python3 also needs it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add version_override to InstallParams so backends can specify a release tag different from backend_versions.json. vLLM uses this to append the GPU target to the version, creating per-target release tags (e.g. vllm0.19.0-rocm7.12.0-gfx1150). Update backend_versions.json to base version without target suffix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enables vLLM on RX 7900/7800/7700 series discrete GPUs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --enforce-eager, --dtype float16, --max-model-len 4096 as defaults in vllm_server.cpp (needed for consumer GPU inference) - Add AWQ quantized models: Qwen3-4B-AWQ, Qwen3-8B-AWQ - Add more models: Llama-3.2-1B/3B (AWQ), Gemma-3-4b-it, Phi-4-mini - Use AWQ checkpoints for Llama (casperhansen) - vLLM auto-detects AWQ from model config, no flag needed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vLLM auto-selects awq_marlin which is extremely slow on consumer AMD GPUs (gfx1150: 2 tok/s). Force --quantization awq (GEMM kernel) when model name contains AWQ, which runs at 12 tok/s. Also reduce default max-model-len to 2048. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add PYTHONNOUSERSITE=1 env var when launching vllm-server to prevent system/user Python packages from leaking into the bundled environment (fixes ImportError from huggingface-hub version mismatch) - Include vllm in the gfx1151 CWSR action URL check so users get the proper guidance when the kernel fix is missing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous filename-based check forced --quantization awq on any repo with "AWQ" in the name. Some repos (e.g. cyankiwi/Qwen3.5-4B-AWQ-4bit) actually use compressed-tensors format, which caused vLLM to fail the load with a quantization method mismatch. Read quantization_config.quant_method from the model's config.json: - Fast path: HF hub cache (no network on subsequent loads) - Fallback: HTTP GET from huggingface.co on first load, so detection works before vLLM has downloaded anything Still force --quantization awq when the method is AWQ, to keep the existing workaround for slow awq_marlin on consumer GPUs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Qwen3.5 vLLM FP16: 0.8B, 2B, 4B, 9B - BF16 GGUF (llamacpp) for: Qwen3, Qwen3.5, Llama-3.2-1B/3B, Phi-4-mini, Gemma-3-4b-it - AWQ vLLM for: Qwen3-0.6B/1.7B, Qwen3.5-0.8B/2B/4B/9B, Phi-4-mini Enables a full throughput matrix across (fp16 / int4) × (vulkan / rocm / vllm) for benchmarking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch to the same release layout as lemonade-sdk/llamacpp-rocm: one release per version (e.g. vllm0.19.0-rocm7.12.0) with one asset per GPU target (e.g. vllm0.19.0-rocm7.12.0-gfx1151-x64.tar.gz), instead of a separate release per architecture. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The download loop retries every failure 5 times with exponential backoff. That burns 30+s on permanent 404s that will never succeed — notably when probing for a single-file asset that's stored as split parts, and when detecting the end of the parts list. Fast-exit the retry loop on 4xx client errors. 408 (Request Timeout) and 429 (Too Many Requests) are treated as transient and still retried. Cuts vLLM install time from 2:55 to 1:00 on a cold cache. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The existing CWSR sysfs check is insufficient on gfx1151: some OEM/DKMS combos (e.g. linux-oem-24.04 6.17 + amdgpu-dkms 6.19) expose cwsr_size and ctl_stack_size but still page-fault on any GPU dispatch. Add a kernel version guard so /v1/install fails fast on pre-6.18.4 kernels and points users to docs/gfx1151_linux.html instead of letting them install a backend that will hang at first inference. Provides an opt-out (LEMONADE_SKIP_KERNEL_CHECK=1) for users on a vendor kernel with a known-good backport. Rewrite docs/gfx1151_linux.html with a tested recipe: - Recommend mainline 6.18.4+ (linux-oem-24.04 alone is unreliable) - Call out the amdgpu-dkms 6.16.13 masking problem - Add a HIP-level verification test, since sysfs properties alone don't prove the fix is active - Document the new LEMONADE_SKIP_KERNEL_CHECK escape hatch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 19 extra entries (12 BF16/F16 GGUFs, 7 AWQ vLLM repos from third-party uploaders) were added to build a full benchmark matrix but aren't core to the vLLM backend itself. Keep only the four Qwen3.5 FP16 vLLM entries for test coverage of the new family. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2K was arbitrary and surprises users coming from long-context models. 16K is large enough to be useful without risking OOM on 32–96 GB VRAM systems and without blowing up first-run Triton JIT compile time. Override per-load with vllm_args="--max-model-len 32768" if you need more, or lower if you're memory-constrained. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolved two conflicts, both additive: - backend_utils.cpp: keep both VLLMServer::SPEC (ours) and FastFlowLMServer::SPEC (main) in try_get_spec_for_recipe - recipe_options.cpp: keep the vllm option entries (ours) alongside main's sampling_method / flow_shift additions for sd-cpp Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- validate_vllm.py: iterate all vllm models labeled "hot", install backend once, load/test/unload each, emit per-model JSON matching llamacpp_validation_*.json shape (model, pass, response, input_tokens, output_tokens, time_to_first_token, tokens_per_second). Adds --lite, --output, --logs-dir, --skip-install flags mirroring validate_llamacpp.py. - validate_vllm.yml: four-job pipeline matching validate_llamacpp.yml: get-latest-releases (auto-discover vllm-rocm tag), build (update backend_versions.json + cmake build), validate (self-hosted stx-halo Linux runner, rich artifact upload), create-pr (auto-open bump PR on schedule/workflow_dispatch success). Adds pull_request trigger that runs in LITE_MODE, upgrades to checkout@v5 and upload-artifact@v7. - server_models.json: mark Qwen3-0.6B-vllm, Qwen3-4B-AWQ-vllm, Qwen3.5-4B-vllm, Llama-3.2-1B-Instruct-vllm as "hot" so validation finds them via the same label filter used for llamacpp. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The web-app compile step can hit intermittent EACCES errors when webpack tries to overwrite KaTeX font files placed into build/resources/web-app by CMake. CI only validates the backend, so opt out via -DBUILD_WEB_APP=OFF. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The self-hosted stx-halo runner's kernel tripped the preflight check we added, blocking /install even though the runner may have a vendor backport that makes vLLM work. Set LEMONADE_SKIP_KERNEL_CHECK=1 in the validate job so the pipeline attempts the real install/load; if the runner is actually broken, vLLM's own failure will surface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blocking install at the preflight produced false negatives: users on vendor kernels with working backports got stranded, and the self-hosted CI runner (kernel heuristic said 'no', reality said 'yes') had to set a magic env var to get past our own guard. Change the contract: - is_recipe_installed() no longer returns false based on CWSR state - build_recipes_info() no longer redirects the action to the help URL - vllm_server.cpp enriches the wait_for_ready timeout error with the help URL when needs_gfx1151_cwsr_fix() is true, so users who do hit the real page-fault symptom still see the pointer to the docs - needs_gfx1151_cwsr_fix() goes back to a pure sysfs check; the kernel-version heuristic and LEMONADE_SKIP_KERNEL_CHECK env var are gone (no longer needed since we don't block) Docs: drop the now-defunct "Escape hatch" section. CI: drop the LEMONADE_SKIP_KERNEL_CHECK workaround from the vllm job. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vLLM ROCm Burn Test — All 14 Models on Strix Halo (gfx1151)Hardware: AMD Ryzen AI MAX+ 395, 128GB unified RAM, Radeon 8060S (gfx1151) Results — 12/14 PASSED
Notes
Failed Models
Monster Burn (Queued)Next up: Qwen2.5-72B-AWQ, Mixtral-8x22B, Qwen3-235B-A22B — testing large model viability on 128GB unified. Tested on bare metal Arch Linux, no containers. Happy to run additional models or reproduce with different parameters. Great work on this PR — the wrapper cleanly solves the Triton HIP "invalid device ordinal" crash we were hitting running vLLM directly. |
| // Generic installation check. | ||
| // Note: we intentionally do NOT block install here based on kernel/CWSR heuristics. | ||
| // The signal is unreliable (vendor backports and older kernels can still work), and | ||
| // a false negative strands users. If the kernel is genuinely broken, vLLM/llama.cpp | ||
| // will surface a clear failure at load time (see vllm_server.cpp wait_for_ready). |
There was a problem hiding this comment.
No. It's not a false negative. There are real problems that are incredibly difficult to debug. We need to flag it.
superm1
left a comment
There was a problem hiding this comment.
Please don't change any gfx1151 CWSR detection changes. The problems that can happen are very difficult to debug and will "appear randomly". The fixed kernels are all rolled out in Ubuntu 24.04 6.17 HWE, Ubuntu 24.04 6.14 OEM, Ubuntu 26.04. They're fixed in Arch and Fedora.
If someone is on something unique they need to upgrade.
Test Report — CachyOS / Kernel 7.0 / gfx1151 (Strix Halo)Tester: @stampby All 14 Catalog vLLM Models — vLLM ROCm on gfx1151
12/14 models passed. Failures
Notes
EnvironmentHappy to retest anything or run additional models. Hardware is available. |
vLLM ROCm gfx1151 Results — Strix Halo, 128GB unified14 models tested, 12 passed. 5 runs, 2 warmup, 200 max tokens.
Full results + comparison with MLX Engine ROCm: https://github.com/stampby/bleeding-edge |
|
I followed the manual smoke-test portion of the Command used: curl -s -X POST http://localhost:8083/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen3-0.6B-vllm","messages":[{"role":"user","content":"Tell me a story"}],"max_tokens":30,"temperature":0}'
Response:
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"message": {
"annotations": null,
"audio": null,
"content": "<think>\nOkay, the user asked for a story. I need to come up with something engaging. Let me think about a
simple yet memorable story.",
"function_call": null,
"reasoning": null,
"refusal": null,
"role": "assistant",
"tool_calls": []
},
"stop_reason": null,
"token_ids": null
}
],
"created": 1776357013,
"id": "chatcmpl-8704fee7a2990b81",
"kv_transfer_params": null,
"model": "Qwen3-0.6B-vllm",
"object": "chat.completion",
"prompt_logprobs": null,
"prompt_token_ids": null,
"service_tier": null,
"system_fingerprint": null,
"usage": {
"completion_tokens": 30,
"prompt_tokens": 12,
"prompt_tokens_details": null,
"total_tokens": 42
}
}
The request completed successfully and returned a response from Qwen3-0.6B-vllm. |
Revert the preflight softening per PR feedback. The strict guard (blocking /install on kernels < 6.18.4 for gfx1151 ROCm) is the intended behavior: sysfs cwsr_size/ctl_stack_size alone isn't enough on some OEM/DKMS combos, and letting an install succeed on a broken kernel leads to a hang at first inference with no actionable pointer to the docs. Users on a vendor kernel with a known-good backport opt out with LEMONADE_SKIP_KERNEL_CHECK=1 (already wired up for the self-hosted vllm CI runner in 0c093d2). This reverts commit 2be57b2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep this PR scoped to adding vLLM support; stop touching the CWSR detection logic and docs. Reviewer note on PR #1537: the fixed kernels are already rolled out in Ubuntu 24.04 (6.17 HWE, 6.14 OEM), Ubuntu 26.04, Arch, and Fedora. The pre-existing sysfs-only detection passes on all of those. Adding a >=6.18.4 version guard on top wrongly blocked users on known-good backported kernels (e.g. linux-oem-24.04's current 6.17.0-1017 build), inconsistent with the old doc's "apt install linux-oem-24.04" instruction. Changes: - src/cpp/server/system_info.cpp: drop kernel_version_lacks_cwsr_fix and the LEMONADE_SKIP_KERNEL_CHECK escape hatch; needs_gfx1151_cwsr_fix returns to pure sysfs check (cwsr_size and ctl_stack_size both exported). - docs/gfx1151_linux.html: restore origin/main version (concise "upgrade your kernel" guidance, no mainline-install recipe, no DKMS sed hack, no HIP verification step, no Escape hatch section). - .github/workflows/validate_vllm.yml: remove LEMONADE_SKIP_KERNEL_CHECK env block (env var no longer exists; runner's sysfs check passes on its own on Ubuntu 6.17 OEM). - src/cpp/server/backends/vllm_server.cpp: keep the load-time "kernel may be missing CWSR fix" hint in the wait_for_ready timeout message as strictly-additive UX for the rare sysfs-reports-OK-but-GPU-still- page-faults case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the Qwen3, Llama-3.2, Gemma-3, and Phi-4-mini vLLM entries; keep only the four Qwen3.5 sizes (0.8B / 2B / 4B / 9B) which are the set this PR actually exercises and benchmarks end-to-end. Other model families can be re-added in follow-ups once they're validated on the target hardware. Net: 14 vllm entries -> 4. Total registry: 175 -> 165. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts: # src/cpp/server/backend_manager.cpp # src/cpp/server/system_info.cpp
Main consolidated AMD GPU device types from amd_igpu/amd_dgpu into a single amd_gpu (consistent with sd-cpp/llamacpp ROCm entries). vLLM's RECIPE_DEF still used the old split, so the device-type lookup never matched and every model load returned 404 with "Requires Radeon RX 7000 series (RDNA3)" even on supported gfx1151 hardware. Switch to the unified amd_gpu type with the union of supported families (gfx1150/gfx1151 iGPUs + gfx110X/gfx120X dGPUs). Verified on gfx1151: Qwen3.5-0.8B-vllm load returns 200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The label is redundant with the recipe field (already "vllm") and the "-vllm" suffix in each model id. No code consumes it (only test/validate_vllm.py filters by recipe + "hot", not the "vllm" label). The convention across the registry is to use labels for capabilities/curation tags (reasoning, hot, vision, tool-calling, ...) rather than backend names — 59 of 66 llamacpp entries already follow this convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
superm1
left a comment
There was a problem hiding this comment.
Generally speaking it looks fine now. a few minor nits on things that don't need to change but are being changed.
I do want to eventually get to the point we're not installing another copy of ROCm to make this work though.
Surface the experimental status in three places where end users and contributors are most likely to encounter the backend: - README.md: add a vllm/rocm row to the supported-configurations table with "(experimental)" annotation; remove the now-stale "vLLM support" entry from the Under Consideration column of the roadmap. - AGENTS.md: add a vLLM row to the backend-abstraction table with "Experimental, validated only on gfx1151 (Strix Halo)" in the Purpose column. - src/cpp/resources/server_models.json: add an "experimental" label to each of the four Qwen3.5 vllm entries. Labels are already rendered in the model picker UI and surfaced via /v1/models, so this propagates with no UI plumbing. No code change. The /install gate, /recipes action redirect, and load path are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Yes, agreed. We are using the pre packaged whls here. I would suggest sticking to this approach for the experimental release and eventually moving to the common rocm backend. |
- recipe_options.cpp: drop the four spurious additions to
OPTION_TO_CLI_FLAG (steps / cfg_scale / width / height). main
intentionally removed these flags ("recipe-level only" — see the
comment at line 198 of main); they shouldn't have come back on this
branch and are unrelated to vLLM. Per @superm1's review.
- system_info.cpp: revert the cosmetic rewording of the CWSR-check
doc comment. Per @superm1's review — the original wording was fine,
shouldn't have been touched in this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The steps / cfg_scale / width / height entries got re-added to the CLI-flag map at some point on this branch, but main intentionally removed them — recipe-level defaults only, never exposed as CLI args (see the comment at lines 41-43 of this file: "Image generation params ... are recipe-level defaults only — not exposed as CLI arguments"). Per @superm1's PR review (#1537 r3156710056). Unrelated to vLLM. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Status: early rough draft
Tested only on gfx1151 (Strix Halo / Radeon 8060S). The backend does support gfx1150, gfx110X, and gfx120X as well but havent been tested yet. Feedback from people with other AMD GPUs is the main thing we need before treating this PR as ready to merge.
Summary
Adds vLLM as a WrappedServer backend using AMD's pre-built ROCm wheels from lemonade-sdk/vllm-rocm. Follows the same install/load/unload pattern as llama.cpp-ROCm.
Scope of testing
Verified end-to-end (install → load → inference → benchmark) on exactly one configuration:
GPU: Ryzen AI MAX+ PRO 395 w/ Radeon 8060S (gfx1151, Strix Halo)
OS: Ubuntu 24.04
Kernel: 6.18.22 mainline from kernel.ubuntu.com/mainline
amdgpu: built-in driver from the 6.18.22 kernel package; amdgpu-dkms 6.19.0 from ROCm 31.20 also installed but not actively loading
Models: Qwen3 (0.6B–8B), Qwen3.5 (0.8B–9B), Llama-3.2 (1B/3B), Phi-4-mini, Gemma-3-4B, both FP16 and AWQ variants
Not tested:
gfx1150 / gfx110X / gfx120X — the install flow fetches per-arch assets that exist but have never been exercised. Other architectures may have their own kernel/driver gotchas we haven't discovered.
Other distros — everything below assumes Ubuntu 24.04. The kernel and amdgpu setup on Fedora/Arch/openSUSE likely works but is untested.
Multi-user / batched serving — all benchmarks are single-user, one request at a time. vLLM's scheduler strengths (paged attention, prefix caching, continuous batching) are not exercised here.
Prerequisites for testing
Hardware
AMD GPU in one of: gfx1151 (Strix Halo), gfx1150 (Strix Point), gfx110X (RDNA3), gfx120X (RDNA4)
Kernel — strict requirement
You need a kernel with the CWSR (Context Wave Save/Restore) fix. Without it, any GPU dispatch triggers a GCVM_L2_PROTECTION_FAULT and the backend hangs. The verified path is mainline 6.18.4+.
Full doc: docs/gfx1151_linux.html.
If amdgpu-dkms is installed
The default Radeon repo (amdgpu/30.30) ships amdgpu-dkms 6.16.13 which overrides the kernel's built-in driver with a broken version. Either switch to amdgpu/31.20:
Or uninstall
amdgpu-dkmsentirely — vLLM ships its own ROCm user-space, you don't need the DKMS package unless you also want to run other ROCm tools outside Lemonade.Verify prerequisites
Kernel version
uname -r # expect 6.18.4 or newer
CWSR properties exported
grep -E "cwsr_size|ctl_stack_size" /sys/class/kfd/kfd/topology/nodes/*/properties
Test plan
1. Build
git fetch origin git checkout test-vllm ./setup.sh cmake --preset default cmake --build --preset default -j$(nproc)Test plan
Build
git fetch origin
git checkout test-vllm
./setup.sh
cmake --preset default
cmake --build --preset default -j$(nproc)
Start server
In another terminal, verify it's up:
First install pulls ~5 GB (split into two ~2.5 GB parts). Expect ~60 s on a fast link.
First load takes 20–30 s (Triton JIT compile for the architecture). Subsequent loads are faster.
Known gotchas
First-run Triton JIT: cold load of a new model size compiles kernels for your GPU, taking 20–350 s. Subsequent loads hit the on-disk cache.
huggingface-hub version conflict: fixed by forcing PYTHONNOUSERSITE=1 when launching vllm-server. If you still hit this, make sure ~/.local/lib/python3.12/site-packages/huggingface_hub isn't shadowing the bundled one.
Transformers version lag: the bundled vLLM 0.19.0 pins transformers <5. Models whose config.json declares model_type: qwen3_5_text (only some newer Qwen3.5 variants) won't load until a vLLM release that bumps the transformers pin. The model registry in this PR avoids those repos.
amdgpu-dkms 6.16.13 masking the built-in driver: see prerequisites. Uninstall or upgrade.