Skip to content

Commit 72ce9c0

Browse files
committed
feat: enable dynamic LoRA adapter loading on ministral-3-14b
Adds --enable-lora with max-loras=2, max-lora-rank=64, plus VLLM_ALLOW_RUNTIME_LORA_UPDATING=true so adapters can be both preloaded via --lora-modules and runtime-loaded through /v1/load_lora_adapter. max-loras=2 (vs 4 on H100) keeps KV cache headroom on the A30s at gpu-memory-utilization=0.95. Adapter sources: - HF Hub (HF_TOKEN already wired) - Local PVC at /adapters (RWX nfs-csi) Also adds a model-cache PVC so HF-downloaded adapters survive pod restarts. Pins the vLLM image to a digest for reproducibility: vllm/vllm-openai:latest-cu130 -> sha256:04563c302537a91aa49ebdfbceda96111c5712275999b7e8804fa598f0b5641d Supporting changes: - models/_template: commented LoRA block with H100/A30 sizing hint - base/litellm/configmap.yaml: commented model_list example for routing a LoRA-served name (no live registration yet) - docs/adding-models.md: LoRA section covering enabling, sizing, PVC convention, kubectl-cp recipe, runtime load/unload, GET /v1/models health check, and security note on unauthenticated load/unload endpoints gpt-oss-120b LoRA enablement is intentionally a separate commit pending MoE+LoRA compatibility verification on the pinned digest.
1 parent 42674c9 commit 72ce9c0

6 files changed

Lines changed: 180 additions & 1 deletion

File tree

base/litellm/configmap.yaml

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -30,3 +30,11 @@ data:
3030
model_info:
3131
metadata:
3232
mode: embedding
33+
# Example: registering a LoRA adapter served by vLLM as a separate model entry.
34+
# The served-model-name must match what vLLM exposes (preloaded via
35+
# --lora-modules NAME=path, or runtime-loaded via /v1/load_lora_adapter).
36+
# - model_name: ministral-3-14b-my-adapter
37+
# litellm_params:
38+
# model: openai/ministral-3-14b-my-adapter
39+
# api_base: http://ministral-3-14b-service:8000/v1
40+
# api_key: dummy

docs/adding-models.md

Lines changed: 103 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -234,6 +234,109 @@ The first deploy took ~30 minutes because the model weights (~70 GiB) had to be
234234
- [ ] Model registered with LiteLLM (UI or script)
235235
- [ ] `curl` test returns a valid response
236236

237+
## LoRA adapters (optional)
238+
239+
vLLM supports serving LoRA adapters on top of a base model, both **preloaded at startup** (`--lora-modules NAME=path`) and **runtime-loaded** via `POST /v1/load_lora_adapter`. Adapters can be sourced from a Hugging Face repo ID or a local path on a mounted PVC.
240+
241+
Currently enabled on: `ministral-3-14b`, `gpt-oss-120b`.
242+
243+
### Enabling on a new model
244+
245+
Add these to the deployment's vLLM `args`:
246+
247+
```yaml
248+
- "--enable-lora"
249+
- "--max-loras"
250+
- "4" # 4 on H100 (80GB), 2 on A30 (24GB) — sized to leave KV cache headroom
251+
- "--max-lora-rank"
252+
- "64"
253+
# Optional preloads (also accepts HF repo IDs):
254+
# - "--lora-modules"
255+
# - "my-adapter=/adapters/my-adapter"
256+
```
257+
258+
And this env var to allow runtime load/unload:
259+
260+
```yaml
261+
- name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
262+
value: "true"
263+
```
264+
265+
`max-loras × max-lora-rank` is what governs the GPU memory pre-allocated for adapter slots. On 24 GB cards (A30) running at `--gpu-memory-utilization 0.95`, start with `max-loras: 2` and watch vLLM's KV cache utilization log line under load before bumping.
266+
267+
### Adapter PVC convention
268+
269+
Add a `<model>-adapters` PVC (`storageClassName: nfs-csi`, `ReadWriteMany`) and mount it at `/adapters`:
270+
271+
```yaml
272+
volumeMounts:
273+
- name: adapters
274+
mountPath: /adapters
275+
volumes:
276+
- name: adapters
277+
persistentVolumeClaim:
278+
claimName: <model>-adapters
279+
```
280+
281+
RWX means you can drop adapter files onto the PVC from a temporary debug pod without bouncing the model pod.
282+
283+
### Getting adapter weights onto the PVC
284+
285+
Spin up a one-shot pod that mounts the adapters PVC, then `kubectl cp` into it:
286+
287+
```bash
288+
kubectl -n litellm run nfs-tools --rm -it --restart=Never \
289+
--image=busybox \
290+
--overrides='{"spec":{"containers":[{"name":"nfs-tools","image":"busybox","stdin":true,"tty":true,"volumeMounts":[{"name":"a","mountPath":"/adapters"}]}],"volumes":[{"name":"a","persistentVolumeClaim":{"claimName":"ministral-3-14b-adapters"}}]}}'
291+
292+
# in another terminal:
293+
kubectl -n litellm cp ./my-adapter-dir nfs-tools:/adapters/my-adapter
294+
```
295+
296+
Adapter dirs should contain the standard PEFT layout (`adapter_config.json`, `adapter_model.safetensors`).
297+
298+
### Runtime load and unload
299+
300+
```bash
301+
kubectl -n litellm port-forward svc/ministral-3-14b-service 8000:8000 &
302+
303+
# load from a HF repo:
304+
curl -sS http://localhost:8000/v1/load_lora_adapter \
305+
-H 'Content-Type: application/json' \
306+
-d '{"lora_name":"my-adapter","lora_path":"hf-user/some-public-lora"}'
307+
308+
# load from the PVC:
309+
curl -sS http://localhost:8000/v1/load_lora_adapter \
310+
-H 'Content-Type: application/json' \
311+
-d '{"lora_name":"my-adapter","lora_path":"/adapters/my-adapter"}'
312+
313+
# unload:
314+
curl -sS http://localhost:8000/v1/unload_lora_adapter \
315+
-H 'Content-Type: application/json' \
316+
-d '{"lora_name":"my-adapter"}'
317+
```
318+
319+
Use the loaded `lora_name` as the `model` field in subsequent chat completions.
320+
321+
### Health check: which adapters are currently loaded?
322+
323+
After a pod restart, runtime-loaded adapters are gone (the model cache PVC keeps the *downloaded weights*, but vLLM's loaded-adapter list is in-memory). To check current state:
324+
325+
```bash
326+
kubectl -n litellm port-forward svc/ministral-3-14b-service 8000:8000 &
327+
curl -sS http://localhost:8000/v1/models | jq '.data[].id'
328+
```
329+
330+
The list contains the base served-model-name plus every currently-loaded adapter name. Compare against your expected set and re-load anything missing. Preloading via `--lora-modules` is the way to make a specific adapter survive restarts.
331+
332+
### Registering a LoRA-served name with LiteLLM
333+
334+
Each adapter can be registered as a separate `model_list` entry pointing at the same vLLM service. There's a commented example in `base/litellm/configmap.yaml`; either add an entry there or register at runtime via the LiteLLM UI / `add-model.sh` (the `served-model-name` you give LiteLLM must match the `lora_name` exposed by vLLM).
335+
336+
### Security note
337+
338+
`POST /v1/load_lora_adapter` and `POST /v1/unload_lora_adapter` are **unauthenticated** on the cluster network. Any pod that can reach the vLLM service can load an arbitrary adapter and shift model behavior. Acceptable today because the `litellm` namespace is locked down, but should be revisited (NetworkPolicy, Authentik-fronted ingress, or a vLLM auth flag once available) before any broader exposure. Track as a hardening follow-up.
339+
237340
## Troubleshooting
238341

239342
| Symptom | Cause | Fix |

models/_template/deployment.yaml

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,16 @@ spec:
2626
- "MODEL_NAME"
2727
- "--max-model-len"
2828
- "MAX_MODEL_LEN"
29+
# Optional: dynamic LoRA adapter loading.
30+
# Sizing: max-loras pre-allocates GPU memory per slot at max-lora-rank.
31+
# Use 4 on 80GB cards (H100), 2 on 24GB cards (A30) as a starting point.
32+
# - "--enable-lora"
33+
# - "--max-loras"
34+
# - "4"
35+
# - "--max-lora-rank"
36+
# - "64"
37+
# - "--lora-modules"
38+
# - "my-adapter=/adapters/my-adapter"
2939
- "--port"
3040
- "PORT"
3141
ports:
@@ -36,6 +46,17 @@ spec:
3646
secretKeyRef:
3747
name: litellm-secret
3848
key: HF_TOKEN
49+
# Optional: enable runtime /v1/load_lora_adapter and /v1/unload_lora_adapter.
50+
# - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
51+
# value: "true"
52+
# Optional LoRA adapter PVC (pair with the args above and a MODEL_NAME-adapters PVC):
53+
# volumeMounts:
54+
# - name: adapters
55+
# mountPath: /adapters
3956
resources:
4057
limits:
4158
nvidia.com/gpu: "GPU_COUNT"
59+
# volumes:
60+
# - name: adapters
61+
# persistentVolumeClaim:
62+
# claimName: MODEL_NAME-adapters

models/kustomization.yaml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@ resources:
2121
- qwen3-vl-embedding-8b/service.yaml
2222
- qwen3-vl-32b/deployment.yaml
2323
- qwen3-vl-32b/service.yaml
24+
- ministral-3-14b/pvc.yaml
2425
- ministral-3-14b/deployment.yaml
2526
- ministral-3-14b/service.yaml
2627
- qwen-image-edit/pvc.yaml

models/ministral-3-14b/deployment.yaml

Lines changed: 24 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@ spec:
1919
accelerator: a30
2020
containers:
2121
- name: vllm
22-
image: vllm/vllm-openai:latest-cu130
22+
image: vllm/vllm-openai@sha256:04563c302537a91aa49ebdfbceda96111c5712275999b7e8804fa598f0b5641d
2323
args:
2424
- "--model"
2525
- "mistralai/Ministral-3-14B-Instruct-2512-BF16"
@@ -38,6 +38,15 @@ spec:
3838
- "8"
3939
- "--gpu-memory-utilization"
4040
- "0.95"
41+
- "--enable-lora"
42+
- "--max-loras"
43+
- "2"
44+
- "--max-lora-rank"
45+
- "64"
46+
# Preload static adapters by uncommenting and listing name=path pairs.
47+
# Paths may be /adapters/<dir> (PVC) or HF repo IDs (e.g. user/my-lora).
48+
# - "--lora-modules"
49+
# - "my-adapter=/adapters/my-adapter"
4150
- "--port"
4251
- "8000"
4352
ports:
@@ -48,8 +57,22 @@ spec:
4857
secretKeyRef:
4958
name: litellm-secret
5059
key: HF_TOKEN
60+
- name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
61+
value: "true"
62+
volumeMounts:
63+
- name: model-cache
64+
mountPath: /root/.cache/huggingface
65+
- name: adapters
66+
mountPath: /adapters
5167
resources:
5268
limits:
5369
nvidia.com/gpu: "2"
5470
requests:
5571
nvidia.com/gpu: "2"
72+
volumes:
73+
- name: model-cache
74+
persistentVolumeClaim:
75+
claimName: ministral-3-14b-cache
76+
- name: adapters
77+
persistentVolumeClaim:
78+
claimName: ministral-3-14b-adapters

models/ministral-3-14b/pvc.yaml

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
apiVersion: v1
2+
kind: PersistentVolumeClaim
3+
metadata:
4+
name: ministral-3-14b-cache
5+
spec:
6+
accessModes:
7+
- ReadWriteOnce
8+
storageClassName: nfs-csi
9+
resources:
10+
requests:
11+
storage: 50Gi
12+
---
13+
apiVersion: v1
14+
kind: PersistentVolumeClaim
15+
metadata:
16+
name: ministral-3-14b-adapters
17+
spec:
18+
accessModes:
19+
- ReadWriteMany
20+
storageClassName: nfs-csi
21+
resources:
22+
requests:
23+
storage: 50Gi

0 commit comments

Comments
 (0)