
Kubernetes HPA & VPA Guide — Autoscaling Workloads

DodaTech Updated 2026-06-24 9 min read

This tutorial covers the Kubernetes Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA): key concepts, practical examples, and best practices for applying them to real workloads.

Kubernetes HPA and VPA automatically adjust pod replicas or resource allocations based on observed metrics, ensuring applications have enough capacity during traffic spikes and don't waste resources during idle periods.

What You'll Learn

You'll master HPA for horizontal scaling (adding/removing replicas based on CPU, memory, or custom metrics) and VPA for vertical scaling (adjusting CPU/memory requests and limits based on historical usage patterns).

Why This Problem Matters

Static replica counts waste money. Provisioning 5x capacity to survive peak traffic leaves servers idle about 80% of the time, while under-provisioning causes performance degradation and outages. Autoscaling can save 30-50% on infrastructure costs while maintaining SLOs.

Real-World Use

Durga Antivirus Pro uses HPA to scale analysis workers based on queue depth (custom metric). When a new malware variant triggers a surge, workers scale from 5 to 200 within minutes. VPA recommends optimal resource sizes for long-running service pods.

Autoscaling Architecture

flowchart TB
  subgraph MetricsSources
    CPU[CPU Metrics]
    MEM[Memory Metrics]
    Custom["Custom Metrics<br/>e.g., queue depth"]
    External["External Metrics<br/>e.g., pub/sub"]
  end
  subgraph Control
    HPA[Horizontal Pod Autoscaler]
    VPA[Vertical Pod Autoscaler]
    MM[Metrics Server]
  end
  subgraph Actions
    ScaleOut["Scale Out<br/>Add replicas"]
    ScaleIn["Scale In<br/>Remove replicas"]
    ResizeUp["Increase<br/>Resource Requests"]
    ResizeDown["Decrease<br/>Resource Requests"]
  end
  subgraph Workload
    Deploy[Deployment]
    Sts[StatefulSet]
  end
  CPU --> MM
  MEM --> MM
  Custom --> HPA
  External --> HPA
  MM --> HPA
  MM --> VPA
  HPA --> ScaleOut
  HPA --> ScaleIn
  VPA --> ResizeUp
  VPA --> ResizeDown
  ScaleOut --> Deploy
  ScaleIn --> Deploy
  ResizeUp --> Deploy

HPA Setup

# hpa-cpu.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

kubectl apply -f hpa-cpu.yaml
kubectl get hpa web-hpa -w

Expected output:

NAME      REFERENCE         TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
web-hpa   Deployment/web   45%/70%         3          20        3          30s
web-hpa   Deployment/web   85%/70%         3          20        5          45s
web-hpa   Deployment/web   70%/70%         3          20        5          1m
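
The percentages in the TARGETS column are measured against each pod's resource request, not against node capacity, which is why containers need CPU and memory requests for a Utilization target to work. The following is a minimal sketch of that arithmetic, assuming every pod has the same CPU request; the function name and the sample numbers are hypothetical.

```python
def average_utilization(usage_millicores, request_millicores):
    """Average CPU utilization (%) across pods, relative to the CPU request."""
    if not usage_millicores:
        raise ValueError("need at least one pod sample")
    if request_millicores <= 0:
        raise ValueError("pods must declare a positive CPU request")
    mean_usage = sum(usage_millicores) / len(usage_millicores)
    return 100.0 * mean_usage / request_millicores


# Hypothetical numbers: three pods, each requesting 500m CPU.
usage = [400, 450, 425]
print(f"{average_utilization(usage, 500):.0f}%/70%")  # prints 85%/70%
```

An average of 425m used against a 500m request is 85%, which is above the 70% target, so the HPA would add replicas.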

Custom Metrics HPA

Scale based on application-level metrics like queue depth or requests per second:

# hpa-custom.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: queue_items_processed
        target:
          type: AverageValue
          averageValue: 100
    - type: Object
      object:
        metric:
          name: queue_depth
        describedObject:
          apiVersion: v1
          kind: Service
          name: message-queue
        target:
          type: Value
          value: 500
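
The two target types in this manifest are evaluated differently. The sketch below shows one way to read them, assuming the documented HPA behaviour of rounding up; it ignores the tolerance band and the min/max clamp, and the function names and sample values are hypothetical.

```python
import math


def replicas_for_average_value(per_pod_values, target_average):
    """Pods metric, AverageValue target: total across pods / target, rounded up."""
    return math.ceil(sum(per_pod_values) / target_average)


def replicas_for_object_value(current_replicas, current_value, target_value):
    """Object metric, Value target: scale the current replica count by the ratio."""
    return math.ceil(current_replicas * current_value / target_value)


# Four workers reporting queue_items_processed, target averageValue 100.
print(replicas_for_average_value([150, 130, 140, 180], 100))  # 600 / 100 -> 6

# queue_depth of 1200 on the message-queue Service, target value 500.
print(replicas_for_object_value(4, 1200, 500))  # 4 * 2.4 = 9.6 -> 10
```

With both metrics configured, the HPA would act on the larger proposal (10 here), then clamp it to the 2-50 range from the manifest.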

Prometheus-Based HPA

# prometheus-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prometheus-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
          selector:
            matchLabels:
              service: api-server
        target:
          type: AverageValue
          averageValue: "1000"
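
A per-second metric like http_requests_per_second is usually derived from a monotonically increasing request counter by a metrics adapter. The sketch below illustrates that derivation under stated assumptions: the pod names and counter samples are made up, counter resets are not handled, and the real adapter computes the rate with a Prometheus query rather than Python.

```python
import math


def per_second_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter samples for two api-server pods over 60 seconds.
counters = {
    "api-server-a": [(0, 10_000), (60, 88_000)],  # 78,000 requests -> 1300/s
    "api-server-b": [(0, 5_000), (60, 77_000)],   # 72,000 requests -> 1200/s
}

rates = {pod: per_second_rate(s) for pod, s in counters.items()}
average_rate = sum(rates.values()) / len(rates)

# Compare the per-pod average with the averageValue target of 1000.
desired = math.ceil(len(rates) * average_rate / 1000)
print(f"average {average_rate:.0f} req/s per pod -> {desired} replicas")
```

Here the per-pod average of 1250 requests per second exceeds the target of 1000, so two replicas become three.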

HPA Algorithm

The HPA controller calculates desired replicas using:

desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))

import math

def calculate_hpa_replicas(
    current_replicas: int,
    current_metric: float,
    desired_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 100
) -> int:
    ratio = current_metric / desired_metric

    # HPA skips scaling when the ratio is within the default 10% tolerance
    if abs(1.0 - ratio) <= 0.1:
        return current_replicas

    # Round up, as the HPA controller does, then clamp to the bounds
    desired = math.ceil(current_replicas * ratio)
    desired = min(desired, max_replicas)
    desired = max(desired, min_replicas)
    return desired

# Example: 10 pods, current CPU 90%, target CPU 70%
replicas = calculate_hpa_replicas(10, 90, 70)
print(f"CPU 90% -> 70%: {replicas} replicas")

# Example: 5 pods, current CPU 40%, target CPU 70%
replicas = calculate_hpa_replicas(5, 40, 70)
print(f"CPU 40% -> 70%: {replicas} replicas")

# Example: 3 pods, current CPU 180%, target CPU 70%
replicas = calculate_hpa_replicas(3, 180, 70)
print(f"CPU 180% -> 70%: {replicas} replicas")

Expected output:

CPU 90% -> 70%: 13 replicas
CPU 40% -> 70%: 3 replicas
CPU 180% -> 70%: 8 replicas

VPA Setup

# vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Auto"  # Off, Initial, Recreate, or Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi

# Install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# Check VPA recommendations
kubectl describe vpa app-vpa

Expected output:

Recommendation:
  Container Recommendations:
    app:
      Lower Bound:
        Cpu:      100m
        Memory:   150Mi
      Upper Bound:
        Cpu:      2
        Memory:   1Gi
      Target:
        Cpu:      250m
        Memory:   320Mi

VPA Recommendation Engine

import random
import statistics
from collections import deque

class VPARecommender:
    def __init__(self, window_size: int = 1440):
        self.cpu_samples = deque(maxlen=window_size)
        self.mem_samples = deque(maxlen=window_size)

    def add_sample(self, cpu_millicores: int, memory_mb: int):
        self.cpu_samples.append(cpu_millicores)
        self.mem_samples.append(memory_mb)

    def recommend(self) -> dict:
        if len(self.cpu_samples) < 10:
            return {"cpu": "100m", "memory": "128Mi"}

        cpu_p99 = sorted(self.cpu_samples)[int(len(self.cpu_samples) * 0.99)]
        mem_p99 = sorted(self.mem_samples)[int(len(self.mem_samples) * 0.99)]

        cpu_target = cpu_p99 * 1.15  # 15% safety buffer
        mem_target = mem_p99 * 1.15

        # Round up to sensible values
        cpu_target = max(50, ((int(cpu_target) + 49) // 50) * 50)
        mem_target = max(64, ((int(mem_target) + 63) // 64) * 64)

        return {
            "cpu": f"{cpu_target}m",
            "memory": f"{mem_target}Mi"
        }

vpa = VPARecommender()
for _ in range(2000):
    cpu = random.gauss(200, 50)
    mem = random.gauss(300, 80)
    vpa.add_sample(max(10, cpu), max(32, mem))

rec = vpa.recommend()
print(f"Recommended: CPU {rec['cpu']}, Memory {rec['memory']}")

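Expected output (the samples are random, so values vary between runs; this is a typical result):

Recommended: CPU 400m, Memory 576Mi

The 99th percentile of the simulated CPU samples is about 316m. Adding the 15% buffer gives roughly 364m, which rounds up to the next multiple of 50. Memory follows the same steps and rounds up to the next multiple of 64.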

Scaling Behavior Comparison

| Aspect | HPA | VPA |
|---|---|---|
| What it scales | Number of replicas | Resource requests/limits |
| Response time | Seconds to minutes | Minutes to hours |
| Metric type | Resource (CPU/memory), custom, and external metrics | CPU/memory only |
| Restart required | No | Yes (mode: Auto) |
| Stateful workloads | Limited (use with PDB) | Works well |
| Cost savings | High (replicas follow demand; no scale to zero by default) | Medium (right-sizing) |

Common Mistakes

1. HPA Without Metrics Server

HPA requires the Metrics Server for CPU/memory metrics. Without it, HPA reports <unknown>/70% and never scales. Install the Metrics Server with: kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

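2. Leaving VPA in UpdateMode: Off

With updateMode: Off, VPA generates recommendations but never applies them. Set updateMode: Auto for automatic application, or Initial to apply only at pod creation. Note that when updatePolicy is omitted, VPA defaults to Auto and will evict pods to apply its recommendations.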

3. Conflicting HPA and VPA on CPU/Memory

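VPA changes CPU/memory requests, which changes the denominator of HPA's utilization calculation, so the two controllers fight each other. Don't run both on the same resource metric for one workload: use VPA only for workloads not managed by HPA, or drive HPA from custom or external metrics and leave CPU/memory right-sizing to VPA.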
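
The interaction can be illustrated with a small calculation. This is a simplified sketch with hypothetical numbers, not the controllers' actual code: it holds per-pod usage constant and shows how a larger request alone changes what the HPA wants.

```python
import math


def hpa_desired(replicas, usage_m, request_m, target_pct):
    """Return (desired replicas, utilization %) for a CPU Utilization target."""
    utilization = 100.0 * usage_m / request_m
    return math.ceil(replicas * utilization / target_pct), utilization


# 10 pods each using 350m CPU with a 500m request: exactly on the 70% target.
print(hpa_desired(10, 350, 500, 70))  # (10, 70.0)

# VPA raises the request to 700m. Usage is unchanged, but utilization
# drops to 50%, so the HPA now wants to remove pods.
print(hpa_desired(10, 350, 700, 70))  # (8, 50.0)
```

After the scale-in, the same traffic lands on fewer pods, per-pod usage rises, and VPA may raise its recommendation again, which is the feedback loop described above.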

4. Too Aggressive Scaling

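HPA can thrash: scaling up, then down, then up again. Scale-down is damped by a stabilization window that defaults to 5 minutes and is set cluster-wide with the kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization. To tune it per workload, set behavior.scaleDown.stabilizationWindowSeconds in the autoscaling/v2 HPA spec.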
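
A stabilization window works by remembering recent recommendations and acting on the highest one, so a brief dip in load cannot remove pods right away. The class below is a minimal sketch of that idea, assuming a 300-second window; the class name and the timeline are hypothetical, and the real controller tracks more state than this.

```python
from collections import deque


class DownscaleStabilizer:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.recommendations = deque()  # (timestamp, desired_replicas)

    def stabilize(self, now, desired):
        """Record a recommendation and return the highest one still in the window."""
        self.recommendations.append((now, desired))
        # Drop recommendations that are older than the window.
        while self.recommendations[0][0] < now - self.window:
            self.recommendations.popleft()
        return max(replicas for _, replicas in self.recommendations)


stabilizer = DownscaleStabilizer(window_seconds=300)
timeline = [(0, 10), (60, 4), (200, 4), (301, 4)]  # (seconds, raw recommendation)
applied = [stabilizer.stabilize(t, d) for t, d in timeline]
print(applied)  # [10, 10, 10, 4]
```

The replica count stays at 10 until the recommendation of 10 ages out of the window, and only then drops to 4. Scale-ups pass through immediately because a higher recommendation is always the maximum.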

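5. Poorly Chosen Min/Max Replicas

maxReplicas is required and minReplicas defaults to 1, so HPA neither scales to zero nor grows without limit. The mistake is leaving minReplicas at 1, which leaves no redundancy, or choosing a maxReplicas that is too low to absorb peak load or too high for the cluster's capacity and budget. Always set reasonable bounds.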

6. Ignoring Pod Startup Time

When scaling up, new pods take time to start and serve traffic. During that time, the metric is still elevated, causing more scale-up. Use readiness probes and startup probes to prevent over-scaling.

7. No Custom Metrics Pipeline

Custom metrics require a metrics adapter (Prometheus Adapter, Datadog Cluster Agent). The pipeline adds latency and complexity. Ensure the adapter is reliable before relying on custom metrics for scaling.

Practice Questions

1. How does HPA calculate the desired number of replicas?

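desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue)). If the ratio of current to desired metric value is above 1.1, HPA scales up; if it is below 0.9, HPA scales down; in between (the default 10% tolerance), it does nothing. The HPA controller reevaluates every 15 seconds by default.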

2. What is the difference between VPA UpdateMode Auto, Initial, and Off?

Auto: automatically applies recommendations by recreating pods. Initial: applies recommendations only at pod creation time (existing pods keep their current resources). Off: generates recommendations only, no automatic application.

3. Why does VPA require pod recreation to change resources?

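Historically, Kubernetes could not change resource requests on a running container, so the pod had to be recreated with the new resource values. Newer Kubernetes releases add in-place pod resizing, but recreation is still what the classic update modes do. VPA with updateMode: Auto evicts pods to apply new values and respects any PodDisruptionBudget while doing so.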

4. Can HPA scale based on multiple metrics simultaneously?

Yes. HPA considers each metric independently and chooses the highest desired replicas across all metrics. If CPU suggests 5 replicas and queue depth suggests 10, the HPA uses 10 replicas.
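
The selection rule can be sketched as follows. This is a simplified illustration rather than the controller's implementation: the function names and metric values are hypothetical, and it assumes the default 10% tolerance.

```python
import math


def desired_for_metric(current_replicas, current_value, target_value, tolerance=0.1):
    """Desired replicas for one metric, with the tolerance band applied."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def desired_replicas(current_replicas, metrics, min_replicas, max_replicas):
    """Take the largest per-metric proposal, then clamp it to the bounds."""
    proposals = [
        desired_for_metric(current_replicas, current, target)
        for current, target in metrics.values()
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


metrics = {
    "cpu": (85, 70),              # proposes ceil(4 * 85 / 70) = 5
    "queue_depth": (1250, 500),   # proposes ceil(4 * 2.5) = 10
}
print(desired_replicas(4, metrics, min_replicas=2, max_replicas=50))  # 10
```

Taking the maximum means the most demanding metric always wins, so adding a metric can only raise the replica count, never lower it.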

5. Challenge: Design a scaling strategy for a batch job processing system.

A system processes jobs from a queue. Each job takes 5-60 minutes. Scale should respond to queue depth, but avoid scaling down while jobs are in progress. Design an HPA configuration with custom metrics and PodDisruptionBudget.

Mini Project: Autoscaling Simulator

import math
import random
import time

class HPA:
    def __init__(self, current: int = 3, min_pods: int = 1, max_pods: int = 20):
        self.replicas = current
        self.min_pods = min_pods
        self.max_pods = max_pods
        self.history = []

    def tick(self, cpu_utilization: float, target_cpu: float = 70.0):
        ratio = cpu_utilization / target_cpu
        # Round up, as the real HPA controller does
        desired = math.ceil(self.replicas * ratio)

        if abs(1.0 - ratio) <= 0.1:
            pass  # Within tolerance, no change
        elif ratio > 1.1:
            self.replicas = min(desired, self.max_pods)
        else:
            self.replicas = max(desired, self.min_pods)

        self.history.append((cpu_utilization, self.replicas))

class LoadSimulator:
    def __init__(self, base_load: float = 30, peak_load: float = 200):
        self.base = base_load
        self.peak = peak_load
        self.time = 0

    def current_load(self) -> float:
        # Sinusoidal load between base and peak, plus Gaussian noise
        pattern = (
            self.base
            + (self.peak - self.base) * abs(math.sin(self.time / 10))
        )
        self.time += 1
        return pattern + random.gauss(0, 10)

hpa = HPA(current=3, min_pods=2, max_pods=20)
load = LoadSimulator(base_load=40, peak_load=180)

print(f"{'Time':>6} {'Load%':>8} {'Pods':>6}")
print("-" * 22)
for i in range(30):
    cpu = load.current_load()
    hpa.tick(cpu, target_cpu=70)
    if i % 3 == 0:
        print(f"{i:>6} {cpu:>7.1f}% {hpa.replicas:>6}")
    time.sleep(0.1)

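Expected output:

The script prints a header and one row every third tick, from time 0 to time 27:

  Time    Load%   Pods
----------------------

Exact values differ on every run because the load includes random noise. The load follows 40 + 140 * |sin(t / 10)|, so it starts near 40%, peaks near 180% around tick 15, and is still close to 100% at tick 27. Because the simulated load does not fall as pods are added, the replica count typically climbs to max_pods (20) within about the first ten ticks and stays there for the remaining rows.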

FAQ

Should I use HPA or VPA for my workload?

Use HPA for stateless workloads that scale horizontally (web servers, APIs, workers). Use VPA for stateful workloads (databases, message brokers) where adding replicas is complex. VPA is also useful for right-sizing initial resource requests in new deployments.

Why does my HPA show unknown targets?

The metrics server is not installed or not collecting metrics. Run kubectl top pods to verify. If that fails, install the metrics server. Also check that the metrics server has sufficient permissions.

What is the cooldown period for HPA scale-up?

HPA has no cooldown for scale-up — it can scale up immediately if metrics warrant. For scale-down, the default stabilization window is 5 minutes (--horizontal-pod-autoscaler-downscale-stabilization). This prevents flapping.

What's Next

Kubernetes Scheduling Guide
Kubernetes Monitoring
Kubernetes Metrics Server

Congratulations on completing this HPA & VPA guide! Here's where to go from here:

  • Practice daily — Add HPA to one of your deployments
  • Build a project — Set up custom metrics autoscaling with Prometheus
  • Explore related topics — Cluster autoscaler, pod disruption budgets, predictive autoscaling
  • Join the community — Share your autoscaling configurations and get feedback

Remember: every expert was once a beginner. Keep scaling!
