You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat: enable dynamic LoRA adapter loading on ministral-3-14b
Adds --enable-lora with max-loras=2, max-lora-rank=64, plus
VLLM_ALLOW_RUNTIME_LORA_UPDATING=true so adapters can be both
preloaded via --lora-modules and runtime-loaded through
/v1/load_lora_adapter. max-loras=2 (vs 4 on H100) keeps KV cache
headroom on the A30s at gpu-memory-utilization=0.95.
Adapter sources:
- HF Hub (HF_TOKEN already wired)
- Local PVC at /adapters (RWX nfs-csi)
Also adds a model-cache PVC so HF-downloaded adapters survive
pod restarts.
Pins the vLLM image to a digest for reproducibility:
vllm/vllm-openai:latest-cu130
-> sha256:04563c302537a91aa49ebdfbceda96111c5712275999b7e8804fa598f0b5641d
Supporting changes:
- models/_template: commented LoRA block with H100/A30 sizing hint
- base/litellm/configmap.yaml: commented model_list example for
routing a LoRA-served name (no live registration yet)
- docs/adding-models.md: LoRA section covering enabling, sizing,
PVC convention, kubectl-cp recipe, runtime load/unload,
GET /v1/models health check, and security note on
unauthenticated load/unload endpoints
gpt-oss-120b LoRA enablement is intentionally a separate commit
pending MoE+LoRA compatibility verification on the pinned digest.
Copy file name to clipboardExpand all lines: docs/adding-models.md
+103Lines changed: 103 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -234,6 +234,109 @@ The first deploy took ~30 minutes because the model weights (~70 GiB) had to be
234
234
- [ ] Model registered with LiteLLM (UI or script)
235
235
- [ ] `curl` test returns a valid response
236
236
237
+
## LoRA adapters (optional)
238
+
239
+
vLLM supports serving LoRA adapters on top of a base model, both **preloaded at startup** (`--lora-modules NAME=path`) and **runtime-loaded** via `POST /v1/load_lora_adapter`. Adapters can be sourced from a Hugging Face repo ID or a local path on a mounted PVC.
240
+
241
+
Currently enabled on: `ministral-3-14b`, `gpt-oss-120b`.
242
+
243
+
### Enabling on a new model
244
+
245
+
Add these to the deployment's vLLM `args`:
246
+
247
+
```yaml
248
+
- "--enable-lora"
249
+
- "--max-loras"
250
+
- "4" # 4 on H100 (80GB), 2 on A30 (24GB) — sized to leave KV cache headroom
251
+
- "--max-lora-rank"
252
+
- "64"
253
+
# Optional preloads (also accepts HF repo IDs):
254
+
# - "--lora-modules"
255
+
# - "my-adapter=/adapters/my-adapter"
256
+
```
257
+
258
+
And this env var to allow runtime load/unload:
259
+
260
+
```yaml
261
+
- name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
262
+
value: "true"
263
+
```
264
+
265
+
`max-loras × max-lora-rank` is what governs the GPU memory pre-allocated for adapter slots. On 24 GB cards (A30) running at `--gpu-memory-utilization 0.95`, start with `max-loras: 2` and watch vLLM's KV cache utilization log line under load before bumping.
266
+
267
+
### Adapter PVC convention
268
+
269
+
Add a `<model>-adapters` PVC (`storageClassName: nfs-csi`, `ReadWriteMany`) and mount it at `/adapters`:
270
+
271
+
```yaml
272
+
volumeMounts:
273
+
- name: adapters
274
+
mountPath: /adapters
275
+
volumes:
276
+
- name: adapters
277
+
persistentVolumeClaim:
278
+
claimName: <model>-adapters
279
+
```
280
+
281
+
RWX means you can drop adapter files onto the PVC from a temporary debug pod without bouncing the model pod.
282
+
283
+
### Getting adapter weights onto the PVC
284
+
285
+
Spin up a one-shot pod that mounts the adapters PVC, then `kubectl cp` into it:
286
+
287
+
```bash
288
+
kubectl -n litellm run nfs-tools --rm -it --restart=Never \
Use the loaded `lora_name` as the `model` field in subsequent chat completions.
320
+
321
+
### Health check: which adapters are currently loaded?
322
+
323
+
After a pod restart, runtime-loaded adapters are gone (the model cache PVC keeps the *downloaded weights*, but vLLM's loaded-adapter list is in-memory). To check current state:
The list contains the base served-model-name plus every currently-loaded adapter name. Compare against your expected set and re-load anything missing. Preloading via `--lora-modules` is the way to make a specific adapter survive restarts.
331
+
332
+
### Registering a LoRA-served name with LiteLLM
333
+
334
+
Each adapter can be registered as a separate `model_list` entry pointing at the same vLLM service. There's a commented example in `base/litellm/configmap.yaml`; either add an entry there or register at runtime via the LiteLLM UI / `add-model.sh` (the `served-model-name` you give LiteLLM must match the `lora_name` exposed by vLLM).
335
+
336
+
### Security note
337
+
338
+
`POST /v1/load_lora_adapter`and `POST /v1/unload_lora_adapter` are **unauthenticated** on the cluster network. Any pod that can reach the vLLM service can load an arbitrary adapter and shift model behavior. Acceptable today because the `litellm` namespace is locked down, but should be revisited (NetworkPolicy, Authentik-fronted ingress, or a vLLM auth flag once available) before any broader exposure. Track as a hardening follow-up.
0 commit comments