LiteLLM Kubernetes Deployment

AI model gateway

Quick Start

# 1. Clone repo
git clone https://github.com/aihpi/litellm-k8s.git
cd litellm-k8s

# 2. Create secrets (see secrets/README.md)
cp secrets/example-secrets.yaml secrets/secrets.yaml
# Edit secrets/secrets.yaml with your values

# 3. Deploy
./scripts/deploy.sh dev

# 4. Port-forward
kubectl port-forward -n litellm service/litellm-service 4000:4000

# 5. Access UI
open http://localhost:4000/ui/login/

Architecture

Internet -> nginx proxy (LoadBalancer) -> LiteLLM (ClusterIP)
                                               |
                                               v
                                         ClusterIP Services
                                               |
                                               v
                                         vLLM Model Pods

An nginx reverse proxy sits in front of LiteLLM as the external-facing LoadBalancer. It returns 403 for external requests to the /metrics endpoint, while internal Prometheus scrapers can still reach LiteLLM directly via the ClusterIP service.
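
A minimal sketch of the nginx rule that enforces this split (illustrative only; the upstream name and ports are assumptions, not the repo's actual proxy config):

```nginx
server {
    listen 80;

    # Deny external access to Prometheus metrics.
    location /metrics {
        return 403;
    }

    # Everything else proxies through to the LiteLLM ClusterIP service.
    location / {
        proxy_pass http://litellm-service.litellm.svc.cluster.local:4000;
        proxy_set_header Host $host;
    }
}
```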

Staging Namespace

Use ./scripts/deploy.sh staging to create a litellm-staging namespace without redeploying the GPU model workloads.

This staging overlay deploys LiteLLM, the KISZ Auth Wrapper, Postgres, Qdrant, and the Postgres PVC in litellm-staging, then points LiteLLM at the existing model services in litellm via cluster DNS. The staging litellm-service and kisz-auth-wrapper-service are both exposed as LoadBalancer services.

./scripts/deploy.sh staging
kubectl get svc -n litellm-staging litellm-service
kubectl get svc -n litellm-staging kisz-auth-wrapper-service
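
The cross-namespace wiring relies on ordinary cluster DNS: a model entry in the staging LiteLLM config can reference a service in the litellm namespace by its fully qualified name. A sketch of such an entry (model name, service name, and port are assumptions, not the repo's actual config):

```yaml
model_list:
  - model_name: llama-3b
    litellm_params:
      model: openai/llama-3b
      # Model service in the original namespace, reachable from litellm-staging:
      api_base: http://llama-3b-service.litellm.svc.cluster.local:8000/v1
```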

For the wrapper OIDC flow, create a temporary HTTPS hostname such as llm-portal-staging.<your-domain> that points to the wrapper LoadBalancer and set Authentik's redirect URI to:

https://llm-portal-staging.<your-domain>/callback

Adding Models

See docs/adding-models.md

Infrastructure

  • Cluster: HPI K8s (40x A30)
  • Namespace: litellm
  • GPU Scheduling: model deployments declare GPU resource requests, so the scheduler places their pods on nodes with free GPUs
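
Concretely, GPU placement works because each model deployment requests a GPU in its container spec; a hedged excerpt (the resource name assumes the standard NVIDIA device plugin):

```yaml
# Excerpt from a model Deployment's container spec: the scheduler only
# places this pod on a node advertising a free GPU via the device plugin.
resources:
  limits:
    nvidia.com/gpu: 1
```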

Maintenance

  • Logs: kubectl logs -n litellm deployment/litellm-proxy -f
  • Restart: kubectl rollout restart -n litellm deployment/litellm-proxy
  • Scale: kubectl scale -n litellm deployment/llama-3b --replicas=2

Handoff / Recent Changes

  • Added scripts:
    • scripts/call_qwen_image_edit.py (image edit via LiteLLM /v1/images/edits)
    • scripts/test_octen_embedding.py (embeddings via LiteLLM /v1/embeddings)
  • Added octen-embedding-8b to LiteLLM model list (default encoding_format: float).
  • Added models/gpt-oss-120b (deployment/service/pvc) with vLLM config mounted from models/gpt-oss-120b/configmap.yaml using GPT-OSS_EAGLE3_Hopper.yaml. (Note: model is not yet added to LiteLLM proxy config.)

Apply model resources:

kubectl apply -k models
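
The -k flag assumes models/ contains a kustomization.yaml aggregating each model's manifests; an illustrative sketch (file names are assumptions):

```yaml
# models/kustomization.yaml (illustrative)
resources:
  - gpt-oss-120b/deployment.yaml
  - gpt-oss-120b/service.yaml
  - gpt-oss-120b/pvc.yaml
  - gpt-oss-120b/configmap.yaml
```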

Calling the API (via LiteLLM)

Port-forward in dev or access via your ingress.

kubectl port-forward -n litellm service/litellm-service 4000:4000

Chat/completions (example)

curl -sS -H "Authorization: Bearer $LITELLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3b","messages":[{"role":"user","content":"Hello"}]}' \
  http://localhost:4000/v1/chat/completions
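
The same call from Python using only the standard library (a sketch against the port-forwarded endpoint above, not one of the repo's scripts):

```python
import json
import os
import urllib.request

API_BASE = "http://localhost:4000"  # via kubectl port-forward

def build_chat_payload(model: str, content: str) -> dict:
    """OpenAI-compatible single-turn chat payload, as LiteLLM expects."""
    return {"model": model, "messages": [{"role": "user", "content": content}]}

def chat(model: str, content: str) -> dict:
    """POST to LiteLLM's /v1/chat/completions and return the parsed response."""
    req = urllib.request.Request(
        f"{API_BASE}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, content)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['LITELLM_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```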

Embeddings (octen-embedding-8b)

curl -sS -H "Authorization: Bearer $LITELLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"octen-embedding-8b","input":"Hello from octen","encoding_format":"float"}' \
  http://localhost:4000/v1/embeddings
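
A common next step with the returned float vectors is similarity scoring; a minimal cosine-similarity helper (pure Python, no external dependencies):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# e.g. compare two vectors from the /v1/embeddings response:
# vec_a = resp["data"][0]["embedding"]
# vec_b = resp["data"][1]["embedding"]
# score = cosine(vec_a, vec_b)
```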

Or run:

LITELLM_API_KEY=sk-... python3 scripts/test_octen_embedding.py

Image edits (qwen-image-edit)

LITELLM_API_KEY=sk-... python3 scripts/call_qwen_image_edit.py \
  --api-base http://localhost:4000 \
  --prompt "Remove the sleeves; keep fabric/lighting unchanged"

UI Login

Default credentials:

  • Username: admin
  • Password: the value of your LITELLM_MASTER_KEY

Contributors

  • Felix Boelter (@felixboelter)
