The Workload-Variant-Autoscaler (WVA) is a Kubernetes controller that performs intelligent, saturation-based autoscaling for inference model servers. The high-level details of the algorithm, which currently explores capacity only, are here. Given a request traffic load, WVA determines the optimal replica count for each inference server.
- Intelligent Autoscaling: Optimizes replica count and GPU allocation based on inference server saturation
- Cost Optimization: Minimizes infrastructure costs while meeting SLO requirements
- Kubernetes v1.31.0+ (or OpenShift 4.18+)
- Helm 3.x
- kubectl
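One quick way to confirm these prerequisites from your workstation:

```bash
# Client and cluster versions should meet the minimums above
kubectl version      # server should report v1.31.0+ (or OpenShift 4.18+)
helm version --short # should report v3.x
```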
```bash
# Install WVA from the local chart (a hosted Helm repository will be available once published)
helm upgrade -i workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  --namespace workload-variant-autoscaler-system \
  --create-namespace \
  --set-file prometheus.caCert=/tmp/prometheus-ca.crt \
  --set variantAutoscaling.accelerator=L40S \
  --set variantAutoscaling.modelID=unsloth/Meta-Llama-3.1-8B \
  --set vllmService.enabled=true \
  --set vllmService.nodePort=30000
```
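After the install completes, you can confirm the controller is running (assuming the release and namespace names used above):

```bash
# The controller pod should reach Running state
kubectl get pods -n workload-variant-autoscaler-system
# Review the values the release was rendered with
helm get values workload-variant-autoscaler -n workload-variant-autoscaler-system
```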
```bash
# Deploy WVA with llm-d infrastructure on a local Kind cluster
make deploy-wva-emulated-on-kind CREATE_CLUSTER=true DEPLOY_LLM_D=true

# This creates a Kind cluster with emulated GPUs and deploys:
# - WVA controller
# - llm-d infrastructure (simulation mode)
# - Prometheus and monitoring stack
# - vLLM emulator for testing
```

Works on Mac (Apple Silicon/Intel) and Windows - no physical GPUs needed! Perfect for development and testing with GPU emulation.
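Once the make target finishes, a quick sanity check of the emulated environment; the monitoring namespaces vary by setup, so the `grep` below is just a loose filter:

```bash
# Kind node(s) should be Ready
kubectl get nodes
# WVA controller should be running
kubectl get pods -n workload-variant-autoscaler-system
# Prometheus stack and vLLM emulator pods, wherever they were deployed
kubectl get pods -A | grep -i -e prometheus -e vllm
```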
See the Installation Guide for detailed instructions.
WVA consists of several key components:
- Reconciler: Kubernetes controller that manages VariantAutoscaling resources
- Collector: Gathers cluster state and vLLM server metrics
- Optimizer: Capacity model that makes saturation-based scaling decisions against configurable thresholds
- Actuator: Emits metrics to Prometheus and updates deployment replicas
- Platform admin deploys llm-d infrastructure (including model servers) and waits for servers to warm up and start serving requests
- Platform admin creates a VariantAutoscaling CR for the running deployment
- WVA continuously monitors request rates and server performance via Prometheus metrics
- Capacity model obtains KV-cache utilization and queue depth of inference servers with slack capacity to determine the replica count (an example query follows this list)
- Actuator emits optimization metrics to Prometheus and updates VariantAutoscaling status
- External autoscaler (HPA/KEDA) reads the metrics and scales the deployment accordingly
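To make the saturation signals concrete, the sketch below queries Prometheus for vLLM's standard KV-cache and queue-depth metrics (`vllm:gpu_cache_usage_perc`, `vllm:num_requests_waiting`). The Prometheus address and the exact metric names and labels WVA consumes are assumptions here; check your deployment's metrics endpoint for the authoritative names.

```bash
# Assumed Prometheus address; adjust for your cluster (e.g. via port-forward)
PROM=http://localhost:9090

# Average KV-cache utilization per model (vLLM's exporter metric name, assumed)
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=avg by (model_name) (vllm:gpu_cache_usage_perc)'

# Total queue depth (waiting requests) per model
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=sum by (model_name) (vllm:num_requests_waiting)'
```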
Important Notes:
- WVA handles the creation order gracefully - you can create the VA before or after the deployment
- If a deployment is deleted, the VA status is immediately updated to reflect the missing deployment
- When the deployment is recreated, the VA automatically resumes operation
- Configure the HPA stabilization window (120s+ recommended) for gradual scaling behavior; see the sketch after this list
- WVA updates the VA status with current and desired allocations every reconciliation cycle
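For the HPA integration described above, a minimal sketch with the recommended stabilization window. The external metric name `wva_desired_replicas` is a placeholder, not WVA's actual metric; substitute whichever metric your WVA deployment emits to Prometheus.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-8b-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-8b
  minReplicas: 1
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120  # 120s+ recommended for gradual scaling
  metrics:
    - type: External
      external:
        metric:
          name: wva_desired_replicas   # placeholder; use the metric WVA actually emits
        target:
          type: AverageValue
          averageValue: "1"            # scale until replicas match the desired count
```

A sample VariantAutoscaling resource: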
```yaml
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-autoscaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    kind: Deployment
    name: llama-8b
  modelID: "meta/llama-3.1-8b"
  variantCost: "10.0"  # Optional, defaults to "10.0"
```

More examples in config/samples/.
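To try the sample, apply it and watch WVA fill in the status (the file name `va-sample.yaml` is just an assumption for where you saved the manifest):

```bash
# Create the resource and inspect the status WVA reconciles onto it
kubectl apply -f va-sample.yaml
kubectl get variantautoscaling llama-8b-autoscaler -n llm-inference -o yaml
```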
Important: Helm does not automatically update CRDs during helm upgrade. When upgrading WVA to a new version with CRD changes, you must manually apply the updated CRDs first:
```bash
# Apply the latest CRDs before upgrading
kubectl apply -f charts/workload-variant-autoscaler/crds/

# Then upgrade the Helm release
helm upgrade workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  --namespace workload-variant-autoscaler-system \
  [your-values...]
```

- VariantAutoscaling CRD: Added `scaleTargetRef` as a required field. v0.4.1 VariantAutoscaling resources without `scaleTargetRef` must be updated before upgrading:
  - Impact on Scale-to-Zero: VAs without `scaleTargetRef` will not scale to zero properly, even with HPAScaleToZero enabled and HPA `minReplicas: 0`, because the HPA cannot reference the target deployment.
  - Migration: Update existing VAs to include `scaleTargetRef`:

    ```yaml
    spec:
      scaleTargetRef:
        kind: Deployment
        name: <your-deployment-name>
    ```

  - Validation: After the CRD update, VAs without `scaleTargetRef` will fail validation.
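To find resources that still need migrating, a rough sketch using `jq`:

```bash
# List VariantAutoscaling resources missing spec.scaleTargetRef
kubectl get variantautoscalings -A -o json \
  | jq -r '.items[] | select(.spec.scaleTargetRef == null) | "\(.metadata.namespace)/\(.metadata.name)"'
```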
To check whether your cluster has the latest CRD schema:

```bash
# Check the CRD fields
kubectl get crd variantautoscalings.llmd.ai -o jsonpath='{.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties}' | jq 'keys'
```

We welcome contributions! See the llm-d Contributing Guide for guidelines.
Join the llm-d autoscaling community meetings to get involved.
Apache 2.0 - see LICENSE for details.
For detailed documentation, visit the docs directory.