feat: enable dynamic LoRA adapter loading on ministral-3-14b

felixboelter · felixboelter · commit 72ce9c0ae6d1 · 2026-05-08T09:57:35.000+02:00
Adds --enable-lora with max-loras=2, max-lora-rank=64, plus
VLLM_ALLOW_RUNTIME_LORA_UPDATING=true so adapters can be both
preloaded via --lora-modules and runtime-loaded through
/v1/load_lora_adapter. max-loras=2 (vs 4 on H100) keeps KV cache
headroom on the A30s at gpu-memory-utilization=0.95.

Adapter sources:
- HF Hub (HF_TOKEN already wired)
- Local PVC at /adapters (RWX nfs-csi)

Also adds a model-cache PVC so HF-downloaded adapters survive
pod restarts.

Pins the vLLM image to a digest for reproducibility:
vllm/vllm-openai:latest-cu130
  -> sha256:04563c302537a91aa49ebdfbceda96111c5712275999b7e8804fa598f0b5641d

Supporting changes:
- models/_template: commented LoRA block with H100/A30 sizing hint
- base/litellm/configmap.yaml: commented model_list example for
  routing a LoRA-served name (no live registration yet)
- docs/adding-models.md: LoRA section covering enabling, sizing,
  PVC convention, kubectl-cp recipe, runtime load/unload,
  GET /v1/models health check, and security note on
  unauthenticated load/unload endpoints

gpt-oss-120b LoRA enablement is intentionally a separate commit
pending MoE+LoRA compatibility verification on the pinned digest.
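
As an illustration of the GET /v1/models health check documented in
docs/adding-models.md, here is a minimal Python sketch (not part of this
commit) that compares the model IDs reported by vLLM against an expected
adapter set. The function names, sample payload, and adapter names are
hypothetical. The payload shape assumes the OpenAI-style
{"data": [{"id": ...}]} response that the docs query with jq.

```python
import json
import urllib.request


def loaded_model_ids(models_response: dict) -> set[str]:
    """Collect the `id` of every entry in a /v1/models style payload."""
    return {entry["id"] for entry in models_response.get("data", [])}


def missing_adapters(models_response: dict, expected: set[str]) -> list[str]:
    """Return the expected adapter names that the server does not list."""
    return sorted(expected - loaded_model_ids(models_response))


def fetch_models(base_url: str = "http://localhost:8000") -> dict:
    """GET /v1/models; assumes a port-forward to the vLLM service is active."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return json.load(resp)


# Hypothetical payload: the base model plus one runtime-loaded adapter.
sample = {"data": [{"id": "ministral-3-14b"}, {"id": "adapter-a"}]}
print(missing_adapters(sample, {"adapter-a", "adapter-b"}))  # ['adapter-b']
```

Against a live pod, missing_adapters(fetch_models(), expected) would list
the names to re-load through /v1/load_lora_adapter after a restart.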
diff --git a/base/litellm/configmap.yaml b/base/litellm/configmap.yaml
@@ -30,3 +30,11 @@ data:
         model_info:
           metadata:
             mode: embedding
+      # Example: registering a LoRA adapter served by vLLM as a separate model entry.
+      # The `model` suffix after openai/ must match the name vLLM exposes (preloaded via
+      # --lora-modules NAME=path, or runtime-loaded via /v1/load_lora_adapter).
+      # - model_name: ministral-3-14b-my-adapter
+      #   litellm_params:
+      #     model: openai/ministral-3-14b-my-adapter
+      #     api_base: http://ministral-3-14b-service:8000/v1
+      #     api_key: dummy
diff --git a/docs/adding-models.md b/docs/adding-models.md
@@ -234,6 +234,109 @@ The first deploy took ~30 minutes because the model weights (~70 GiB) had to be
 - [ ] Model registered with LiteLLM (UI or script)
 - [ ] `curl` test returns a valid response
 
+## LoRA adapters (optional)
+
+vLLM supports serving LoRA adapters on top of a base model, both **preloaded at startup** (`--lora-modules NAME=path`) and **runtime-loaded** via `POST /v1/load_lora_adapter`. Adapters can be sourced from a Hugging Face repo ID or a local path on a mounted PVC.
+
+Currently enabled on: `ministral-3-14b`. `gpt-oss-120b` is pending MoE+LoRA compatibility verification.
+
+### Enabling on a new model
+
+Add these to the deployment's vLLM `args`:
+
+```yaml
+- "--enable-lora"
+- "--max-loras"
+- "4"        # 4 on H100 (80GB), 2 on A30 (24GB) — sized to leave KV cache headroom
+- "--max-lora-rank"
+- "64"
+# Optional preloads (also accepts HF repo IDs):
+# - "--lora-modules"
+# - "my-adapter=/adapters/my-adapter"
+```
+
+And this env var to allow runtime load/unload:
+
+```yaml
+- name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
+  value: "true"
+```
+
+vLLM pre-allocates GPU memory for `max-loras` adapter slots, each sized for `max-lora-rank`, so the product `max-loras × max-lora-rank` governs the adapter memory overhead. On 24 GB cards (A30) running at `--gpu-memory-utilization 0.95`, start with `--max-loras 2` and watch vLLM's KV cache utilization log line under load before raising it.
+
+### Adapter PVC convention
+
+Add a `<model>-adapters` PVC (`storageClassName: nfs-csi`, `ReadWriteMany`) and mount it at `/adapters`:
+
+```yaml
+volumeMounts:
+  - name: adapters
+    mountPath: /adapters
+volumes:
+  - name: adapters
+    persistentVolumeClaim:
+      claimName: <model>-adapters
+```
+
+Because the PVC is `ReadWriteMany`, a temporary debug pod can mount it and receive adapter files while the model pod keeps running.
+
+### Getting adapter weights onto the PVC
+
+Spin up a one-shot pod that mounts the adapters PVC, then `kubectl cp` into it:
+
+```bash
+kubectl -n litellm run nfs-tools --rm -it --restart=Never \
+  --image=busybox \
+  --overrides='{"spec":{"containers":[{"name":"nfs-tools","image":"busybox","stdin":true,"tty":true,"volumeMounts":[{"name":"a","mountPath":"/adapters"}]}],"volumes":[{"name":"a","persistentVolumeClaim":{"claimName":"ministral-3-14b-adapters"}}]}}'
+
+# in another terminal:
+kubectl -n litellm cp ./my-adapter-dir nfs-tools:/adapters/my-adapter
+```
+
+Adapter dirs should contain the standard PEFT layout (`adapter_config.json`, `adapter_model.safetensors`).
+
+### Runtime load and unload
+
+```bash
+kubectl -n litellm port-forward svc/ministral-3-14b-service 8000:8000 &
+
+# load from a HF repo:
+curl -sS http://localhost:8000/v1/load_lora_adapter \
+  -H 'Content-Type: application/json' \
+  -d '{"lora_name":"my-adapter","lora_path":"hf-user/some-public-lora"}'
+
+# load from the PVC:
+curl -sS http://localhost:8000/v1/load_lora_adapter \
+  -H 'Content-Type: application/json' \
+  -d '{"lora_name":"my-adapter","lora_path":"/adapters/my-adapter"}'
+
+# unload:
+curl -sS http://localhost:8000/v1/unload_lora_adapter \
+  -H 'Content-Type: application/json' \
+  -d '{"lora_name":"my-adapter"}'
+```
+
+Use the loaded `lora_name` as the `model` field in subsequent chat completions.
+
+### Health check: which adapters are currently loaded?
+
+After a pod restart, runtime-loaded adapters are gone (the model cache PVC keeps the *downloaded weights*, but vLLM's loaded-adapter list is in-memory). To check current state:
+
+```bash
+kubectl -n litellm port-forward svc/ministral-3-14b-service 8000:8000 &
+curl -sS http://localhost:8000/v1/models | jq '.data[].id'
+```
+
+The list contains the base served-model-name plus every currently-loaded adapter name. Compare against your expected set and re-load anything missing. Preloading via `--lora-modules` is the way to make a specific adapter survive restarts.
+
+### Registering a LoRA-served name with LiteLLM
+
+Each adapter can be registered as a separate `model_list` entry pointing at the same vLLM service. There's a commented example in `base/litellm/configmap.yaml`; either add an entry there or register at runtime via the LiteLLM UI / `add-model.sh`. In both cases, the model name after the `openai/` prefix in `litellm_params.model` must match the `lora_name` exposed by vLLM.
+
+### Security note
+
+`POST /v1/load_lora_adapter` and `POST /v1/unload_lora_adapter` are **unauthenticated** on the cluster network. Any pod that can reach the vLLM service can load an arbitrary adapter and change model behavior. This is acceptable today because the `litellm` namespace is locked down, but it should be revisited (NetworkPolicy, Authentik-fronted ingress, or a vLLM auth flag once available) before any broader exposure. Track this as a hardening follow-up.
+
 ## Troubleshooting
 
 | Symptom                           | Cause                                | Fix                                              |
diff --git a/models/_template/deployment.yaml b/models/_template/deployment.yaml
@@ -26,6 +26,16 @@ spec:
             - "MODEL_NAME"
             - "--max-model-len"
             - "MAX_MODEL_LEN"
+            # Optional: dynamic LoRA adapter loading.
+            # Sizing: max-loras pre-allocates GPU memory per slot at max-lora-rank.
+            # Use 4 on 80GB cards (H100), 2 on 24GB cards (A30) as a starting point.
+            # - "--enable-lora"
+            # - "--max-loras"
+            # - "4"
+            # - "--max-lora-rank"
+            # - "64"
+            # - "--lora-modules"
+            # - "my-adapter=/adapters/my-adapter"
             - "--port"
             - "PORT"
           ports:
@@ -36,6 +46,17 @@ spec:
                 secretKeyRef:
                   name: litellm-secret
                   key: HF_TOKEN
+            # Optional: enable runtime /v1/load_lora_adapter and /v1/unload_lora_adapter.
+            # - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
+            #   value: "true"
+          # Optional LoRA adapter PVC (pair with the args above and a MODEL_NAME-adapters PVC):
+          # volumeMounts:
+          #   - name: adapters
+          #     mountPath: /adapters
           resources:
             limits:
               nvidia.com/gpu: "GPU_COUNT"
+      # volumes:
+      #   - name: adapters
+      #     persistentVolumeClaim:
+      #       claimName: MODEL_NAME-adapters
diff --git a/models/kustomization.yaml b/models/kustomization.yaml
@@ -21,6 +21,7 @@ resources:
   - qwen3-vl-embedding-8b/service.yaml
   - qwen3-vl-32b/deployment.yaml
   - qwen3-vl-32b/service.yaml
+  - ministral-3-14b/pvc.yaml
   - ministral-3-14b/deployment.yaml
   - ministral-3-14b/service.yaml
   - qwen-image-edit/pvc.yaml
diff --git a/models/ministral-3-14b/deployment.yaml b/models/ministral-3-14b/deployment.yaml
@@ -19,7 +19,7 @@ spec:
         accelerator: a30
       containers:
         - name: vllm
-          image: vllm/vllm-openai:latest-cu130
+          image: vllm/vllm-openai@sha256:04563c302537a91aa49ebdfbceda96111c5712275999b7e8804fa598f0b5641d
           args:
             - "--model"
             - "mistralai/Ministral-3-14B-Instruct-2512-BF16"
@@ -38,6 +38,15 @@ spec:
             - "8"
             - "--gpu-memory-utilization"
             - "0.95"
+            - "--enable-lora"
+            - "--max-loras"
+            - "2"
+            - "--max-lora-rank"
+            - "64"
+            # Preload static adapters by uncommenting and listing name=path pairs.
+            # Paths may be /adapters/<dir> (PVC) or HF repo IDs (e.g. user/my-lora).
+            # - "--lora-modules"
+            # - "my-adapter=/adapters/my-adapter"
             - "--port"
             - "8000"
           ports:
@@ -48,8 +57,22 @@ spec:
                 secretKeyRef:
                   name: litellm-secret
                   key: HF_TOKEN
+            - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
+              value: "true"
+          volumeMounts:
+            - name: model-cache
+              mountPath: /root/.cache/huggingface
+            - name: adapters
+              mountPath: /adapters
           resources:
             limits:
               nvidia.com/gpu: "2"
             requests:
               nvidia.com/gpu: "2"
+      volumes:
+        - name: model-cache
+          persistentVolumeClaim:
+            claimName: ministral-3-14b-cache
+        - name: adapters
+          persistentVolumeClaim:
+            claimName: ministral-3-14b-adapters
diff --git a/models/ministral-3-14b/pvc.yaml b/models/ministral-3-14b/pvc.yaml
@@ -0,0 +1,23 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: ministral-3-14b-cache
+spec:
+  accessModes:
+    - ReadWriteOnce
+  storageClassName: nfs-csi
+  resources:
+    requests:
+      storage: 50Gi
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: ministral-3-14b-adapters
+spec:
+  accessModes:
+    - ReadWriteMany
+  storageClassName: nfs-csi
+  resources:
+    requests:
+      storage: 50Gi