Horizontal Pod Autoscaling: Metrics, Policies & Custom Autoscalers

DodaTech 5 min read

This tutorial explains how Horizontal Pod Autoscaling (HPA) works in Kubernetes: the metrics that drive it, the policies that shape its behavior, and how custom metrics and the Vertical Pod Autoscaler fit in.

Horizontal Pod Autoscaling automatically adjusts the number of pod replicas based on observed metrics, helping applications handle traffic spikes while minimizing cost during low-demand periods.

What You'll Learn

This tutorial covers HPA configuration with CPU and memory metrics, custom Prometheus-based metrics, scaling behavior policies including stabilization windows, and when to use the Vertical Pod Autoscaler (VPA) alongside HPA for comprehensive scaling.

Why It Matters

Static replica counts waste resources during low traffic and risk outages during spikes. Autoscaling can reduce cloud bills by 30-50 percent while maintaining application responsiveness under variable load.

Real-World Use

Zalando uses HPA with custom Prometheus metrics based on order queue depth to scale their e-commerce platform during flash sales, going from 50 to 500 pods in under two minutes. Lyft uses HPA with gRPC request metrics to scale backend services across thousands of microservices.

graph LR
  A[Metrics Server / Prometheus] --> B[HPA Controller]
  B --> C{Calculate desired replicas}
  C --> D[Scale Up]
  C --> E[Scale Down]
  D --> F[Update Deployment replicas]
  E --> F
  F --> G[Pod count adjusted]
  G --> A

Expected output: a diagram of the HPA feedback loop. Metrics feed the controller, which calculates the desired replica count and updates the Deployment.

Resource-Based HPA

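The simplest HPA configuration uses CPU and memory utilization from the metrics server. Utilization is measured as a percentage of each container's resource requests, so the pods in the target Deployment must define CPU and memory requests.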

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

# Check metrics server is running
kubectl -n kube-system get pods -l k8s-app=metrics-server

# Create the HPA
kubectl apply -f hpa-cpu-memory.yaml

# Watch HPA status
kubectl get hpa api-server-hpa --watch

Expected output: the HPA shows current CPU and memory utilization percentages alongside the target values. As load increases, the replica count rises toward maxReplicas.
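To make the arithmetic behind those numbers concrete, here is a minimal Python sketch of how the replica count could be derived for the manifest above. It is a simplified model, not the controller's code: the function name and the sample utilization figures are made up for illustration, all pods are assumed to be ready and reporting metrics, and the 0.1 tolerance is the kube-controller-manager default.

```python
import math

def desired_replicas(current_replicas, metrics, min_replicas=3, max_replicas=20,
                     tolerance=0.1):
    """Simplified HPA calculation (hypothetical helper, not the real controller).

    metrics maps a metric name to (current_value, target_value),
    e.g. {"cpu": (90, 75)} for 90% observed against a 75% target.
    """
    proposals = []
    for current_value, target_value in metrics.values():
        ratio = current_value / target_value
        if abs(ratio - 1.0) <= tolerance:
            # Close enough to the target: this metric asks for no change.
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    # The metric that needs the most replicas wins, then min/max bounds apply.
    return max(min_replicas, min(max_replicas, max(proposals)))

# 4 replicas, CPU at 90% (target 75), memory at 60% (target 80)
print(desired_replicas(4, {"cpu": (90, 75), "memory": (60, 80)}))  # 5
```

CPU asks for ceil(4 × 90 / 75) = 5 replicas and memory for 3, so the sketch returns 5. The defaults for min_replicas and max_replicas mirror the minReplicas and maxReplicas values in the manifest.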

Custom Metrics HPA

Use custom metrics from Prometheus to scale based on application-specific signals.

# adapter-values.yaml (Helm values for prometheus-adapter)
rules:
  custom:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^http_requests_total$"
      as: "requests_per_second"
    # Sum per pod so each pod yields one value even if the counter carries extra labels
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 30
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "500"

# Install Prometheus adapter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --values adapter-values.yaml

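# Verify the custom metric is registered
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .

# Read current per-pod values (assumes the pods run in the default namespace)
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/requests_per_second" | jq .

Expected output: the first response lists the available custom metrics, including pods/requests_per_second. The second returns the current value for each pod.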
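The path from the http_requests_total counter to a replica count can be sketched in a few lines of Python. This is an illustration under stated assumptions: simple_rate is a naive stand-in for Prometheus's rate() (which also handles counter resets and extrapolation), both function names and the sample counter values are invented, and the HPA tolerance is ignored.

```python
import math

def simple_rate(samples):
    """Naive per-second rate from (timestamp_seconds, counter_value) samples."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)

def replicas_for_average_value(per_pod_values, target_average):
    """Replicas needed so the per-pod average drops to the AverageValue target."""
    return math.ceil(sum(per_pod_values) / target_average)

# Two pods, each sampled over a 2-minute window (hypothetical counter values)
pod_a = simple_rate([(0, 10_000), (120, 106_000)])  # 800.0 requests/s
pod_b = simple_rate([(0, 5_000), (120, 89_000)])    # 700.0 requests/s

print(replicas_for_average_value([pod_a, pod_b], 500))  # 3
```

With 1500 requests per second in total and a target of 500 per pod, the sketch proposes 3 replicas, up from the current 2.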

Scaling Behavior Policies

Fine-tune how fast the HPA scales up and down to avoid thrashing.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa-policies
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Pods
        value: 5
        periodSeconds: 15
      - type: Percent
        value: 100
        periodSeconds: 15

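The scaleDown policy removes at most 10 percent of the current replicas per minute, and the 5-minute stabilization window makes the controller act on the highest replica count it recommended during the last 300 seconds. The scaleUp policies allow adding 5 pods or doubling the replica count every 15 seconds. Because selectPolicy defaults to Max, the policy that permits the larger change applies. This manifest omits the metrics field, so the HPA falls back to the default target of 80 percent average CPU utilization.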
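The effect of these rate limits is easier to see with numbers. The following Python sketch steps through the policies from the manifest above, one policy period per step. It is a rough model with hypothetical helper names: it ignores the stabilization window and assumes the metric keeps demanding the final replica count throughout.

```python
def scale_up_limit(current, pods=5, percent=100):
    """Largest replica count allowed after one 15-second scale-up period."""
    by_pods = current + pods
    by_percent = -(-current * (100 + percent) // 100)  # ceiling division
    return max(by_pods, by_percent)  # selectPolicy defaults to Max

def scale_down_limit(current, percent=10):
    """Smallest replica count allowed after one 60-second scale-down period."""
    return current * (100 - percent) // 100  # resulting count is rounded down

def simulate(start, desired, min_replicas=2, max_replicas=50):
    """Return the replica count after each policy period until desired is reached."""
    desired = max(min_replicas, min(max_replicas, desired))
    steps, current = [start], start
    while current != desired:
        if desired > current:
            current = min(desired, scale_up_limit(current))
        else:
            current = max(desired, scale_down_limit(current))
        steps.append(current)
    return steps

print(simulate(2, 50))   # [2, 7, 14, 28, 50]
print(simulate(50, 40))  # [50, 45, 40]
```

In this model, scaling from 2 to 50 replicas takes four 15-second periods: the Pods policy wins the first step (2 + 5 = 7) and the Percent policy wins afterwards. Scaling down from 50 to 40 takes two 60-second periods.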

# Generate load to trigger scaling
kubectl run load-generator --image=busybox -- /bin/sh -c \
  "while true; do wget -q -O- http://web-frontend; done"

# Observe scaling behavior
kubectl describe hpa web-hpa-policies

Expected output: the describe output shows the current replicas, metrics, and scaling events including timestamps for each scale-up and scale-down action.
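The 300-second scaleDown stabilization window in the manifest above can be pictured as a rolling memory of recent recommendations. The Python sketch below is a simplified illustration, not the controller's implementation: the class and method names are hypothetical, and it models only the scale-down window, with scale-up applied immediately as with stabilizationWindowSeconds: 0.

```python
from collections import deque

class ScaleDownStabilizer:
    """Remembers recent recommendations and delays scale-down (illustrative only)."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilize(self, now, current_replicas, recommendation):
        self.history.append((now, recommendation))
        # Forget recommendations that have fallen out of the window.
        while self.history[0][0] < now - self.window_seconds:
            self.history.popleft()
        if recommendation >= current_replicas:
            return recommendation  # scale-up is not delayed
        highest_recent = max(replicas for _, replicas in self.history)
        return min(current_replicas, highest_recent)

stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.stabilize(0, 10, 10))    # 10
print(stabilizer.stabilize(60, 10, 4))    # 10, the recommendation of 10 is still in the window
print(stabilizer.stabilize(301, 10, 4))   # 4, the older recommendation has expired
```

A dip in load at 60 seconds does not shrink the Deployment right away. The scale-down to 4 replicas only goes through once every recommendation in the last 300 seconds agrees on it.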

Vertical Pod Autoscaler

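VPA adjusts the CPU and memory requests of individual pods, complementing HPA. The manifest below targets the same api-server Deployment as the resource-based HPA above. Because that HPA scales on CPU and memory, running both as written would make the two autoscalers conflict. Either switch the HPA to a custom metric or set updateMode to "Off" so that VPA only publishes recommendations.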

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "4"
        memory: 4Gi

# Install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# Check VPA recommendations
kubectl get vpa api-server-vpa -o yaml

Expected output: the VPA status section shows recommended CPU and memory values (lowerBound, target, upperBound) based on historical usage.

Practice Questions

How does HPA calculate the desired number of replicas? It divides the current metric value by the target value, multiplies the result by the current replica count, and rounds up: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). No scaling happens while the ratio stays within the tolerance (10 percent by default). When multiple metrics are defined, the metric that requires the most replicas wins.
What is the purpose of the stabilization window? It prevents flapping. The controller remembers the replica counts it computed during the window and, when scaling down, acts on the highest of them (when scaling up, on the lowest). A brief dip or spike in a metric therefore does not trigger rapid scale-down and scale-up cycles.
Can HPA and VPA be used together? Yes, but not on the same resource metric for the same workload. Use HPA for horizontal scaling on load metrics (custom or external) and VPA for right-sizing container CPU and memory requests based on historical usage patterns.

Frequently Asked Questions

How long does it take for HPA to react to load changes?

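The HPA controller evaluates metrics every 15 seconds by default (configurable via the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period). The metrics server collects pod metrics on its own interval, set by --metric-resolution (60 seconds in older releases, commonly 15 seconds in current manifests), and new pods still need time to be scheduled and become ready. The total reaction time is therefore typically 1-2 minutes from load increase to new pods starting. For faster response, use KEDA with Prometheus metrics scraped at shorter intervals.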

What happens when a metric is not available?

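If a metric is missing for some pods, HPA sets those pods aside, computes the ratio from the pods that did report, and then recomputes it conservatively: missing pods are assumed to be at 100 percent of the target value for a scale-down and at 0 percent for a scale-up, which dampens the change. If no pods have metrics, HPA does not scale, and the autoscaler records an event explaining why it could not calculate. Ensure the metrics server or Prometheus adapter is healthy and scraping all pods.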
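The conservative handling of missing metrics can be sketched as follows. This Python model follows the behavior described in the Kubernetes documentation for a per-pod average-value metric. The function name and sample values are hypothetical, and not-yet-ready pods are left out for brevity.

```python
import math

def replicas_with_missing(values, missing_count, target, current_replicas,
                          tolerance=0.1):
    """Simplified replica calculation when some pods report no metric.

    values: metric values from pods that reported
    missing_count: number of pods with no metric
    """
    if not values:
        return current_replicas  # nothing to calculate from, so do not scale
    ratio = (sum(values) / len(values)) / target
    if missing_count == 0:
        if abs(ratio - 1.0) <= tolerance:
            return current_replicas
        return math.ceil(ratio * len(values))
    # Assume the least disruptive value for pods without metrics:
    # the target itself on a scale-down, zero on a scale-up.
    fill = target if ratio < 1.0 else 0.0
    filled = list(values) + [fill] * missing_count
    new_ratio = (sum(filled) / len(filled)) / target
    reversed_direction = (ratio < 1.0 < new_ratio) or (new_ratio < 1.0 < ratio)
    if abs(new_ratio - 1.0) <= tolerance or reversed_direction:
        return current_replicas
    new_replicas = math.ceil(new_ratio * len(filled))
    # Never move against the direction the recomputed ratio indicates.
    if (new_ratio < 1.0 and new_replicas > current_replicas) or \
       (new_ratio > 1.0 and new_replicas < current_replicas):
        return current_replicas
    return new_replicas

# 4 pods, target 500 per pod, one pod reports nothing
print(replicas_with_missing([900, 900, 900], 1, 500, 4))       # 6
# Same load with all four pods reporting
print(replicas_with_missing([900, 900, 900, 900], 0, 500, 4))  # 8
```

With one pod silent, the sketch scales to 6 replicas instead of 8, because the missing pod is counted as contributing zero load.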

How do I prevent HPA from scaling down too aggressively?

Set stabilizationWindowSeconds in the scaleDown behavior. The default is 300 seconds, and a window of 300-600 seconds is typical. You can also raise minReplicas to keep a baseline that the HPA never scales below, and configure a Percent policy that limits how many pods can be removed per minute.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
