Workload-Variant-Autoscaler (WVA)


The Workload-Variant-Autoscaler (WVA) is a Kubernetes controller that performs intelligent, saturation-based autoscaling for inference model servers. It determines the optimal replica count for a given request traffic load. The high-level details of the algorithm, in which only capacity is explored, are described here.

Key Features

  • Intelligent Autoscaling: Optimizes replica count and GPU allocation based on inference server saturation
  • Cost Optimization: Minimizes infrastructure costs while meeting SLO requirements

Quick Start

Prerequisites

  • Kubernetes v1.31.0+ (or OpenShift 4.18+)
  • Helm 3.x
  • kubectl

Install with Helm (Recommended)

# Install WVA from the local chart (a hosted Helm repository will be published later)
helm upgrade -i workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  --namespace workload-variant-autoscaler-system \
  --create-namespace \
  --set-file prometheus.caCert=/tmp/prometheus-ca.crt \
  --set variantAutoscaling.accelerator=L40S \
  --set variantAutoscaling.modelID=unsloth/Meta-Llama-3.1-8B \
  --set vllmService.enabled=true \
  --set vllmService.nodePort=30000
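
After the install completes, a quick sanity check (the release name and namespace match the command above; pod names will differ per release):

# Confirm the release and controller pods are up
helm status workload-variant-autoscaler -n workload-variant-autoscaler-system
kubectl get pods -n workload-variant-autoscaler-system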

Try it Locally with Kind (No GPU Required!)

# Deploy WVA with llm-d infrastructure on a local Kind cluster
make deploy-wva-emulated-on-kind CREATE_CLUSTER=true DEPLOY_LLM_D=true

# This creates a Kind cluster with emulated GPUs and deploys:
# - WVA controller
# - llm-d infrastructure (simulation mode)
# - Prometheus and monitoring stack
# - vLLM emulator for testing

Works on macOS (Apple Silicon/Intel) and Windows with no physical GPUs needed! Perfect for development and testing with GPU emulation.
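
Once the make target finishes, you can sanity-check the emulated environment. The cluster name and namespaces depend on the Makefile defaults, so treat the output as illustrative:

# List Kind clusters and see everything that was deployed
kind get clusters
kubectl get pods -A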

See the Installation Guide for detailed instructions.

Documentation

User Guide

Integrations

Deployment Options

Architecture

WVA consists of several key components:

  • Reconciler: Kubernetes controller that manages VariantAutoscaling resources
  • Collector: Gathers cluster state and vLLM server metrics
  • Optimizer: Capacity model that computes saturation-based scaling decisions against configured thresholds
  • Actuator: Emits metrics to Prometheus and updates deployment replicas

How It Works

  1. Platform admin deploys llm-d infrastructure (including model servers) and waits for the servers to warm up and start serving requests
  2. Platform admin creates a VariantAutoscaling CR for the running deployment
  3. WVA continuously monitors request rates and server performance via Prometheus metrics
  4. The capacity model reads KV cache utilization and queue depth from the inference servers and, accounting for slack capacity, determines the desired replica count
  5. The actuator emits optimization metrics to Prometheus and updates the VariantAutoscaling status
  6. An external autoscaler (HPA/KEDA) reads the metrics and scales the deployment accordingly (see the example HPA sketch after the notes below)

Important Notes:

  • WVA handles the creation order gracefully - you can create the VA before or after the deployment
  • If a deployment is deleted, the VA status is immediately updated to reflect the missing deployment
  • When the deployment is recreated, the VA automatically resumes operation
  • Configure HPA stabilization window (recommend 120s+) for gradual scaling behavior
  • WVA updates the VA status with current and desired allocations every reconciliation cycle
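
To make step 6 concrete, below is a minimal HPA sketch that consumes a WVA-emitted metric through the external metrics API and sets the recommended 120s+ stabilization window. The metric name and label selector are placeholders, not necessarily the names WVA actually publishes; check the documentation for the real metric names.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-8b-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-8b
  minReplicas: 1
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120  # gradual scale-down, per the note above
  metrics:
  - type: External
    external:
      metric:
        name: wva_desired_replicas  # placeholder metric name, not confirmed by this README
        selector:
          matchLabels:
            variant_autoscaling_name: llama-8b-autoscaler  # placeholder label
      target:
        type: Value
        value: "1"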

Example

apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-autoscaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    kind: Deployment
    name: llama-8b
  modelID: "meta/llama-3.1-8b"
  variantCost: "10.0"  # Optional, defaults to "10.0"

More examples in config/samples/.
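
Assuming the example above is saved to a file (the file name here is arbitrary) and the target deployment exists, applying and inspecting it looks roughly like this; the exact status fields depend on the WVA version:

# Apply the sample VariantAutoscaling and inspect its status
kubectl apply -f llama-8b-autoscaler.yaml
kubectl get variantautoscalings -n llm-inference
kubectl get variantautoscaling llama-8b-autoscaler -n llm-inference -o yaml  # status shows current and desired allocations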

Upgrading

CRD Updates

Important: Helm does not automatically update CRDs during helm upgrade. When upgrading WVA to a new version with CRD changes, you must manually apply the updated CRDs first:

# Apply the latest CRDs before upgrading
kubectl apply -f charts/workload-variant-autoscaler/crds/

# Then upgrade the Helm release
helm upgrade workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  --namespace workload-variant-autoscaler-system \
  [your-values...]

Breaking Changes

v0.5.0 (upcoming)

  • VariantAutoscaling CRD: The scaleTargetRef field is now required. VariantAutoscaling resources from v0.4.1 that lack scaleTargetRef must be updated before upgrading:
    • Impact on Scale-to-Zero: VAs without scaleTargetRef will not scale to zero properly, even with HPAScaleToZero enabled and HPA minReplicas: 0, because the HPA cannot reference the target deployment.
    • Migration: Update existing VAs to include scaleTargetRef (a kubectl patch sketch follows this list):
      spec:
        scaleTargetRef:
          kind: Deployment
          name: <your-deployment-name>
    • Validation: After CRD update, VAs without scaleTargetRef will fail validation.
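
One way to apply that migration in place is a merge patch; the VA name, namespace, and deployment name below are placeholders for your own:

# Add scaleTargetRef to an existing VariantAutoscaling (names are placeholders)
kubectl patch variantautoscaling <va-name> -n <namespace> --type merge \
  -p '{"spec":{"scaleTargetRef":{"kind":"Deployment","name":"<your-deployment-name>"}}}'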

Verifying CRD Version

To check if your cluster has the latest CRD schema:

# Check the CRD fields
kubectl get crd variantautoscalings.llmd.ai -o jsonpath='{.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties}' | jq 'keys'
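
On v0.5.0 or later the key list should include scaleTargetRef; an equivalent direct check against the same CRD and schema path:

# Returns true if the spec schema includes scaleTargetRef
kubectl get crd variantautoscalings.llmd.ai \
  -o jsonpath='{.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties}' | jq 'has("scaleTargetRef")'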

Contributing

We welcome contributions! See the llm-d Contributing Guide for guidelines.

Join the llm-d autoscaling community meetings to get involved.

License

Apache 2.0 - see LICENSE for details.

Related Projects

References


For detailed documentation, visit the docs directory.
