Commit 70aa572

feat: enable dynamic LoRA adapter loading on gpt-oss-120b
Adds --enable-lora with max-loras=4, max-lora-rank=64, plus VLLM_ALLOW_RUNTIME_LORA_UPDATING=true and a /adapters mount backed by a new gpt-oss-120b-adapters PVC (50Gi RWX nfs-csi). max-loras=4 fits comfortably on the H100 (80GB).

Pins the vLLM image to the same digest as the ministral commit:
sha256:04563c302537a91aa49ebdfbceda96111c5712275999b7e8804fa598f0b5641d

DO NOT APPLY without first running Phase 0.3 from the rollout plan: gpt-oss-120b is MoE; vLLM LoRA support for MoE has been incremental and must be verified against the pinned digest before this lands in prod. Acceptable evidence is either a successful runtime /v1/load_lora_adapter against a small public adapter or a release-note confirmation for the pinned vLLM build.

If MoE+LoRA turns out to be unsupported, revert just this commit; the ministral rollout and shared infra (template, docs, litellm configmap) stay intact.
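The runtime check named as acceptable evidence could be scripted roughly as follows. This is a minimal sketch, not part of the commit: it assumes the pinned build accepts the `lora_name` / `lora_path` JSON body described in the upstream vLLM LoRA documentation, and the base URL, adapter name, and adapter path are hypothetical placeholders.

```python
import json
import urllib.request


def build_load_request(base_url: str, lora_name: str, lora_path: str) -> urllib.request.Request:
    """Build the POST request for the runtime adapter-loading endpoint.

    Assumption: the server expects a JSON body with `lora_name` and
    `lora_path`, as in the upstream vLLM docs. Verify against the pinned build.
    """
    payload = json.dumps({"lora_name": lora_name, "lora_path": lora_path}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/load_lora_adapter",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load_adapter(base_url: str, lora_name: str, lora_path: str, timeout: float = 120.0) -> int:
    """Send the request and return the HTTP status code.

    urllib raises urllib.error.HTTPError on a non-2xx response, so an
    unsupported MoE+LoRA combination should surface as an exception here.
    """
    request = build_load_request(base_url, lora_name, lora_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical usage, e.g. through a port-forward to the pod's port 8000:
#   load_adapter("http://localhost:8000", "demo-adapter", "/adapters/demo-adapter")
```

A 2xx status only shows that the adapter was accepted; sending one completion request with `"model": "demo-adapter"` afterwards would confirm that inference through the adapter also works.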
1 parent 72ce9c0 commit 70aa572

2 files changed

Lines changed: 30 additions & 2 deletions

models/gpt-oss-120b/deployment.yaml

Lines changed: 18 additions & 2 deletions
@@ -18,7 +18,7 @@ spec:
         accelerator: h100
       containers:
         - name: vllm
-          image: vllm/vllm-openai:latest-cu130
+          image: vllm/vllm-openai@sha256:04563c302537a91aa49ebdfbceda96111c5712275999b7e8804fa598f0b5641d
           args:
             - "--model"
             - "openai/gpt-oss-120b"
@@ -29,10 +29,19 @@ spec:
             - "--tensor-parallel-size"
             - "1"
             - "--tool-call-parser"
-            - "openai"
+            - "openai"
             - "--enable-auto-tool-choice"
             - "--gpu-memory-utilization"
             - "0.90"
+            - "--enable-lora"
+            - "--max-loras"
+            - "4"
+            - "--max-lora-rank"
+            - "64"
+            # Preload static adapters by uncommenting and listing name=path pairs.
+            # Paths may be /adapters/<dir> (PVC) or HF repo IDs (e.g. user/my-lora).
+            # - "--lora-modules"
+            # - "my-adapter=/adapters/my-adapter"
             - "--port"
             - "8000"
           ports:
@@ -45,11 +54,15 @@ spec:
                 secretKeyRef:
                   name: litellm-secret
                   key: HF_TOKEN
+            - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
+              value: "true"
           volumeMounts:
             - name: model-cache
               mountPath: /root/.cache/huggingface
             - name: vllm-config
               mountPath: /etc/vllm
+            - name: adapters
+              mountPath: /adapters
           resources:
             limits:
               nvidia.com/gpu: "1"
@@ -66,3 +79,6 @@ spec:
         - name: vllm-config
           configMap:
             name: gpt-oss-120b-config
+        - name: adapters
+          persistentVolumeClaim:
+            claimName: gpt-oss-120b-adapters

models/gpt-oss-120b/pvc.yaml

Lines changed: 12 additions & 0 deletions
@@ -9,3 +9,15 @@ spec:
   resources:
     requests:
       storage: 200Gi
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: gpt-oss-120b-adapters
+spec:
+  accessModes:
+    - ReadWriteMany
+  storageClassName: nfs-csi
+  resources:
+    requests:
+      storage: 50Gi
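For a rough sense of how far the 50Gi claim goes: a LoRA adapter of rank r adds r * (d_in + d_out) parameters for each adapted weight matrix of shape d_in x d_out. The sketch below applies that formula. The layer shapes in the usage note are hypothetical placeholders, not gpt-oss-120b's real dimensions, and the 2 bytes per parameter assumes adapters stored in a 16-bit format.

```python
def lora_param_count(rank: int, layer_shapes: list[tuple[int, int]]) -> int:
    """Total LoRA parameters: each adapted (d_in, d_out) matrix gets
    an A matrix (rank x d_in) and a B matrix (d_out x rank)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)


def adapter_size_bytes(rank: int, layer_shapes: list[tuple[int, int]], bytes_per_param: int = 2) -> int:
    """Approximate on-disk size, ignoring file-format metadata overhead."""
    return lora_param_count(rank, layer_shapes) * bytes_per_param


def adapters_that_fit(pvc_gib: int, adapter_bytes: int) -> int:
    """How many adapters of the given size fit in a claim of pvc_gib GiB."""
    return (pvc_gib * 1024**3) // adapter_bytes
```

As an illustration with made-up numbers: adapting four 4096 x 4096 projections in each of 32 layers at rank 64 gives `adapter_size_bytes(64, [(4096, 4096)] * 128)` = 134217728 bytes (128 MiB), so `adapters_that_fit(50, 134217728)` returns 400. Real adapters for this model will differ depending on which modules they target.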
