The Workload-Variant-Autoscaler (WVA) is a Kubernetes controller that performs intelligent, saturation-based autoscaling for inference model servers: it determines the optimal replica count for a given request traffic load. The current algorithm explores capacity only; its high-level details are described here.
- Intelligent Autoscaling: Optimizes replica count and GPU allocation based on inference server saturation
- Cost Optimization: Minimizes infrastructure costs while meeting SLO requirements
- Kubernetes v1.31.0+ (or OpenShift 4.18+)
- Helm 3.x
- kubectl
```shell
# Add the WVA Helm repository (when published)
helm upgrade -i workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  --namespace workload-variant-autoscaler-system \
  --create-namespace \
  --set-file prometheus.caCert=/tmp/prometheus-ca.crt \
  --set variantAutoscaling.accelerator=L40S \
  --set variantAutoscaling.modelID=unsloth/Meta-Llama-3.1-8B \
  --set vllmService.enabled=true \
  --set vllmService.nodePort=30000
```

```shell
# Deploy WVA with llm-d infrastructure on a local Kind cluster
make deploy-llm-d-wva-emulated-on-kind
# This creates a Kind cluster with emulated GPUs and deploys:
# - WVA controller
# - llm-d infrastructure (simulation mode)
# - Prometheus and monitoring stack
# - vLLM emulator for testing
```

Works on Mac (Apple Silicon/Intel) and Windows - no physical GPUs needed! Perfect for development and testing with GPU emulation.
See the Installation Guide for detailed instructions.
WVA consists of several key components:
- Reconciler: Kubernetes controller that manages VariantAutoscaling resources
- Collector: Gathers cluster state and vLLM server metrics
- Optimizer: Capacity model that computes saturation-based scaling decisions against configured thresholds
- Actuator: Emits metrics to Prometheus and updates deployment replicas
- Platform admin deploys llm-d infrastructure (including model servers) and waits for servers to warm up and start serving requests
- Platform admin creates a VariantAutoscaling CR for the running deployment
- WVA continuously monitors request rates and server performance via Prometheus metrics
- Capacity model evaluates KV-cache utilization and queue depth of the inference servers, accounting for slack capacity, to determine the desired replica count
- Actuator emits optimization metrics to Prometheus and updates VariantAutoscaling status
- External autoscaler (HPA/KEDA) reads the metrics and scales the deployment accordingly
Important Notes:
- Configure HPA stabilization window (recommend 120s+) for gradual scaling behavior
- WVA updates the VA status with current and desired allocations every reconciliation cycle
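As an illustration, an HPA that consumes a WVA-emitted metric and applies the recommended stabilization window might look like the following. The metric name, deployment name, and target values here are assumptions for the sketch, not the actual names WVA publishes; consult the docs for those.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-8b-hpa              # illustrative name
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-8b                # illustrative deployment name
  minReplicas: 1
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120   # recommended 120s+ for gradual scaling
  metrics:
    - type: External
      external:
        metric:
          name: wva_desired_replicas    # illustrative metric name; see WVA docs
        target:
          type: AverageValue
          averageValue: "1"
```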
```yaml
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-autoscaler
  namespace: llm-inference
spec:
  modelId: "meta/llama-3.1-8b"
```

More examples in config/samples/.
We welcome contributions! See the llm-d Contributing Guide for guidelines.
Join the llm-d autoscaling community meetings to get involved.
Apache 2.0 - see LICENSE for details.
For detailed documentation, visit the docs directory.