The Workload-Variant-Autoscaler (WVA) is a Kubernetes controller that performs intelligent, saturation-based autoscaling for inference model servers: it determines the optimal replica count for a given request traffic load. The current algorithm explores capacity only; its high-level details are described here.
- Intelligent Autoscaling: Optimizes replica count and GPU allocation based on inference server saturation
- Cost Optimization: Minimizes infrastructure costs while meeting SLO requirements
- Kubernetes v1.31.0+ (or OpenShift 4.18+)
- Helm 3.x
- kubectl
```shell
# Add the WVA Helm repository (when published)
helm upgrade -i workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  --namespace workload-variant-autoscaler-system \
  --create-namespace \
  --set-file prometheus.caCert=/tmp/prometheus-ca.crt \
  --set variantAutoscaling.accelerator=L40S \
  --set variantAutoscaling.modelID=unsloth/Meta-Llama-3.1-8B \
  --set vllmService.enabled=true \
  --set vllmService.nodePort=30000
```

```shell
# Deploy WVA with llm-d infrastructure on a local Kind cluster
make deploy-llm-d-wva-emulated-on-kind
# This creates a Kind cluster with emulated GPUs and deploys:
# - WVA controller
# - llm-d infrastructure (simulation mode)
# - Prometheus and monitoring stack
# - vLLM emulator for testing
```

Works on Mac (Apple Silicon/Intel) and Windows - no physical GPUs needed! Perfect for development and testing with GPU emulation.
See the Installation Guide for detailed instructions.
WVA consists of several key components:
- Reconciler: Kubernetes controller that manages VariantAutoscaling resources
- Collector: Gathers cluster state and vLLM server metrics
- Optimizer: Capacity model that computes saturation-based scaling decisions against configured thresholds
- Actuator: Emits metrics to Prometheus and updates deployment replicas
- Platform admin deploys llm-d infrastructure (including model servers) and waits for servers to warm up and start serving requests
- Platform admin creates a VariantAutoscaling CR for the running deployment
- WVA continuously monitors request rates and server performance via Prometheus metrics
- Capacity model evaluates KV-cache utilization and queue depth of the inference servers, accounting for slack capacity, to determine the desired replica count
- Actuator emits optimization metrics to Prometheus and updates VariantAutoscaling status
- External autoscaler (HPA/KEDA) reads the metrics and scales the deployment accordingly
Important Notes:
- Configure HPA stabilization window (recommend 120s+) for gradual scaling behavior
- WVA updates the VA status with current and desired allocations every reconciliation cycle
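As an illustration, an HPA that consumes a WVA-emitted metric and applies the recommended stabilization window might look like the following. The metric name, deployment name, and target values here are assumptions for the sketch, not the actual names WVA publishes; consult the docs for those.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-8b-hpa              # illustrative name
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-8b                # illustrative deployment name
  minReplicas: 1
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120   # recommended 120s+ for gradual scaling
  metrics:
    - type: External
      external:
        metric:
          name: wva_desired_replicas    # illustrative metric name; see WVA docs
        target:
          type: AverageValue
          averageValue: "1"
```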
```yaml
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-autoscaler
  namespace: llm-inference
spec:
  modelId: "meta/llama-3.1-8b"
```

More examples in config/samples/.
We welcome contributions! See the llm-d Contributing Guide for guidelines.
Join the llm-d autoscaling community meetings to get involved.
Apache 2.0 - see LICENSE for details.
For detailed documentation, visit the docs directory.