Horizontal Pod Autoscaling: Metrics, Policies & Custom Autoscalers
This tutorial explains how the Horizontal Pod Autoscaler decides when to add or remove pods, and walks through practical configurations and best practices for resource metrics, custom metrics, and scaling policies.
Horizontal Pod Autoscaling (HPA) automatically adjusts the number of pod replicas based on observed metrics, helping applications absorb traffic spikes while keeping costs down during low-demand periods.
What You'll Learn
This tutorial covers HPA configuration with CPU and memory metrics, custom Prometheus-based metrics, scaling behavior policies including stabilization windows, and when to use the Vertical Pod Autoscaler (VPA) alongside HPA.
Why It Matters
Static replica counts waste resources during low traffic and can cause outages during spikes. Autoscaling can reduce cloud bills (by 30-50 percent in some cases, depending on how variable the load is) while keeping applications responsive.
Real-World Use
Zalando uses HPA with custom Prometheus metrics based on order queue depth to scale their e-commerce platform during flash sales, going from 50 to 500 pods in under two minutes. Lyft uses HPA with gRPC request metrics to scale backend services across thousands of microservices.
```mermaid
graph LR
    A[Metrics Server / Prometheus] --> B[HPA Controller]
    B --> C{Calculate desired replicas}
    C --> D[Scale Up]
    C --> E[Scale Down]
    D --> F[Update Deployment replicas]
    E --> F
    F --> G[Pod count adjusted]
    G --> A
```
Expected output: a diagram of the HPA feedback loop, in which metrics feed the controller, the controller calculates the desired replica count and updates the Deployment, and the resized set of pods produces new metrics.
Resource-Based HPA
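The simplest HPA configuration uses CPU and memory utilization reported by the Metrics Server. Utilization targets are percentages of the pods' resource requests, so every container in the target Deployment must declare CPU and memory requests; otherwise the HPA cannot compute utilization for that resource.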
```yaml
# hpa-cpu-memory.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # percent of the pods' CPU requests
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # percent of the pods' memory requests
```
```bash
# Check metrics server is running
kubectl -n kube-system get pods -l k8s-app=metrics-server

# Create the HPA
kubectl apply -f hpa-cpu-memory.yaml

# Watch HPA status
kubectl get hpa api-server-hpa --watch
```
Expected output: the TARGETS column shows current CPU and memory utilization against the 75 and 80 percent targets (it may read <unknown> briefly while metrics are first collected). As load increases, the REPLICAS column rises, capped at maxReplicas (20).
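To make the numbers concrete, here is a minimal Python sketch of how a Utilization target becomes a replica count. It assumes every pod is ready and reporting metrics, treats utilization as total usage divided by total requests (the plain average when every pod requests the same amount), and applies the controller's default tolerance of 0.1. The function names and pod readings are invented for illustration; the real controller also handles unready pods and missing metrics, which this sketch ignores.

```python
import math

# Minimal sketch of how api-server-hpa turns resource readings into a replica count.
# Assumptions: all pods are ready and reporting; readings below are invented.

TOLERANCE = 0.1  # default HPA tolerance: ratios within 10% of 1.0 cause no change


def utilization_percent(usages: list[int], requests: list[int]) -> int:
    """Total usage as a whole-number percentage of total requests."""
    total_requests = sum(requests)
    if total_requests == 0:
        raise ValueError("containers must declare resource requests")
    return sum(usages) * 100 // total_requests


def replicas_for_metric(current: int, utilization: int, target: int) -> int:
    ratio = utilization / target
    if abs(ratio - 1.0) <= TOLERANCE:
        return current  # close enough to the target: keep the current count
    return math.ceil(current * ratio)


def desired_replicas(proposals: list[int], min_r: int, max_r: int) -> int:
    # The metric asking for the most replicas wins, then min/max bounds apply.
    return max(min_r, min(max_r, max(proposals)))


if __name__ == "__main__":
    # Four pods, each requesting 500m CPU and 512Mi memory.
    cpu = utilization_percent([450, 480, 420, 450], [500] * 4)  # millicores
    mem = utilization_percent([300, 320, 280, 300], [512] * 4)  # MiB
    proposals = [replicas_for_metric(4, cpu, 75), replicas_for_metric(4, mem, 80)]
    print(cpu, mem, desired_replicas(proposals, 3, 20))  # → 90 58 5
```

CPU at 90 percent against a 75 percent target asks for 5 pods, memory at 58 percent against 80 asks for 3, and the larger proposal wins.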
Custom Metrics HPA
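HPA cannot query Prometheus directly. An adapter such as prometheus-adapter exposes Prometheus series through the custom.metrics.k8s.io API, which the HPA reads, so you can scale on application-specific signals such as request rate.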
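```yaml
# adapter-values.yaml (prometheus-adapter Helm values)
# Set prometheus.url / prometheus.port if Prometheus is not at the chart's default address.
rules:
  custom:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "http_requests_total"
        as: "requests_per_second"
      # Per-second request rate over 2 minutes, summed across each pod's series
      metricsQuery: 'sum(rate(http_requests_total{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```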
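The adapter relies on PromQL's rate() to turn the ever-increasing http_requests_total counter into requests per second. The Python sketch below approximates that calculation for a single series: it adds up the increases between samples, treating any drop as a counter reset, and divides by the time covered. It is a simplification (Prometheus also extrapolates to the edges of the 2-minute window), and simple_rate and the sample values are invented for illustration.

```python
# Simplified model of PromQL rate(counter[2m]) for a single series.
# Assumptions: samples are (unix_seconds, counter_value) pairs already inside
# the window; unlike Prometheus, no extrapolation to the window edges is done.


def simple_rate(samples: list[tuple[float, float]]) -> float:
    if len(samples) < 2:
        return 0.0  # Prometheus returns no result here; this sketch returns 0.0
    samples = sorted(samples)
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter that goes down was reset (for example, the pod restarted),
        # so everything it counted since the reset is new.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    # Scraped every 30 seconds; the counter resets between t=60 and t=90.
    scrape = [(0, 1000), (30, 16000), (60, 31000), (90, 14000), (120, 29000)]
    print(f"{simple_rate(scrape):.1f} requests/second")  # → 491.7 requests/second
```

Counting the post-reset value as new requests keeps a pod restart from showing up as a negative rate.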
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "500"
```
```bash
# Install Prometheus adapter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --values adapter-values.yaml

# Verify custom metric is available
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
```
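Expected output: the API response lists the available custom metrics, including pods/requests_per_second. To see the current value for each pod, query /apis/custom.metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/*/requests_per_second.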
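An AverageValue of "500" means the HPA aims for an average of 500 requests per second per pod. The minimal Python sketch below shows that calculation, assuming one requests_per_second reading per ready pod and leaving out the tolerance check and the minReplicas/maxReplicas bounds; desired_for_average_value and the readings are illustrative.

```python
import math

# Sketch: evaluating a Pods metric with an AverageValue target.
# Assumptions: one requests_per_second reading per ready pod; the tolerance
# check and minReplicas/maxReplicas bounds are left out.


def desired_for_average_value(per_pod: list[float], target: float) -> int:
    current_average = sum(per_pod) / len(per_pod)
    ratio = current_average / target
    # Scale the pod count by the ratio and round up so the new per-pod
    # average lands at or below the target.
    return math.ceil(ratio * len(per_pod))


if __name__ == "__main__":
    readings = [620.0, 580.0, 700.0, 500.0]  # invented values, average 600
    print(desired_for_average_value(readings, target=500.0))  # → 5
```

With four pods averaging 600 req/s, the total of 2,400 req/s needs 4.8 pods at 500 req/s each, which rounds up to 5.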
Scaling Behavior Policies
Fine-tune how fast the HPA scales up and down to avoid thrashing.
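```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa-policies
spec:
  # Same Deployment as web-hpa-custom: apply one of the two, because two
  # HPAs must not scale the same Deployment.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 50
  # No metrics listed, so autoscaling/v2 defaults to 80% average CPU utilization.
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      # The default selectPolicy (Max) applies whichever policy allows more pods.
      policies:
        - type: Pods
          value: 5
          periodSeconds: 15
        - type: Percent
          value: 100
          periodSeconds: 15
```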
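The scaleDown policy removes at most 10 percent of the current replicas per minute, and the 300-second stabilization window makes the HPA act on the highest replica recommendation from the last five minutes. The scaleUp policy has no stabilization window and allows adding 5 pods or doubling the replica count every 15 seconds, whichever permits more pods (the default selectPolicy, Max).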
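The following Python sketch shows how these policies cap a single scaling step. It assumes no other scaling happened earlier in the same period and leaves out the stabilization window; the rounding (up for a Percent scale-up, down for a Percent scale-down) is an approximation of the controller's behavior, and Policy and next_replicas are illustrative names, not Kubernetes APIs.

```python
import math
from dataclasses import dataclass

# Sketch of how web-hpa-policies' behavior section limits one scaling step.
# Assumptions: no scaling happened earlier in the current period, the
# stabilization window is ignored, and selectPolicy keeps its default (Max).


@dataclass
class Policy:
    kind: str   # "Pods" or "Percent" (mirrors the YAML "type" field)
    value: int


WEB_SCALE_UP = [Policy("Pods", 5), Policy("Percent", 100)]  # per 15 seconds
WEB_SCALE_DOWN = [Policy("Percent", 10)]                     # per 60 seconds


def scale_up_limit(current: int, policies: list[Policy]) -> int:
    proposals = []
    for p in policies:
        if p.kind == "Pods":
            proposals.append(current + p.value)
        else:  # Percent, rounded up so a small Deployment can still grow
            proposals.append(math.ceil(current * (1 + p.value / 100)))
    return max(proposals)  # Max: the policy allowing the biggest increase wins


def scale_down_limit(current: int, policies: list[Policy]) -> int:
    proposals = []
    for p in policies:
        if p.kind == "Pods":
            proposals.append(current - p.value)
        else:  # Percent, rounded down
            proposals.append(int(current * (1 - p.value / 100)))
    return min(proposals)  # biggest change when shrinking means the lowest floor


def next_replicas(current: int, desired: int, min_r: int = 2, max_r: int = 50) -> int:
    if desired > current:
        desired = min(desired, scale_up_limit(current, WEB_SCALE_UP))
    elif desired < current:
        desired = max(desired, scale_down_limit(current, WEB_SCALE_DOWN))
    return max(min_r, min(max_r, desired))


if __name__ == "__main__":
    print(next_replicas(4, 30))   # → 9  (adding 5 beats doubling to 8)
    print(next_replicas(20, 60))  # → 40 (doubling beats adding 5)
    print(next_replicas(20, 5))   # → 18 (at most 10 percent removed)
```

Because each limit applies per period, a sustained spike compounds: the Deployment can keep doubling every 15 seconds until it reaches the desired count or maxReplicas.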
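```bash
# Generate load to trigger scaling
# (assumes a Service named web-frontend in the same namespace)
kubectl run load-generator --image=busybox --restart=Never -- /bin/sh -c \
  "while true; do wget -q -O- http://web-frontend; done"

# Observe scaling behavior
kubectl describe hpa web-hpa-policies

# Stop the load when finished
kubectl delete pod load-generator
```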
Expected output: the describe output shows the current replicas and metrics, plus SuccessfulRescale events recording when each scale-up and scale-down happened.
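The 300-second scaleDown stabilization window can be modeled as a rolling maximum over recent replica recommendations. This Python sketch assumes recommendations arrive in time order, one per controller sync (every 15 seconds by default), and covers scale-down only; ScaleDownStabilizer is a hypothetical name and the recommendation values are invented.

```python
from collections import deque

# Sketch of a scale-down stabilization window (a rolling maximum).
# Assumptions: recommendations arrive in time order, one per controller sync,
# and only scale-down stabilization is modeled.


class ScaleDownStabilizer:
    def __init__(self, window_seconds: float = 300):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp_seconds, recommended_replicas) pairs

    def stabilize(self, now: float, recommendation: int) -> int:
        self.history.append((now, recommendation))
        # Drop recommendations that have aged out of the window.
        while self.history[0][0] < now - self.window_seconds:
            self.history.popleft()
        # Only shrink as far as the highest recent recommendation allows.
        return max(replicas for _, replicas in self.history)


if __name__ == "__main__":
    stabilizer = ScaleDownStabilizer()
    for t, rec in [(0, 10), (15, 4), (30, 9), (315, 4), (345, 4)]:
        print(t, stabilizer.stabilize(t, rec))
```

The stabilized count stays at 10 through the brief dip at t=15, falls to 9 at t=315 once the t=0 entry expires, and reaches 4 only at t=345, when every recommendation left in the window agrees.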
Vertical Pod Autoscaler
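VPA adjusts CPU and memory requests for a workload's pods, complementing HPA: HPA changes how many pods run, while VPA changes how large each one is. VPA is not part of core Kubernetes and is installed separately from the kubernetes/autoscaler repository. In Auto mode it applies new requests by evicting pods so they are recreated with the updated values.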
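```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    # "Off" only publishes recommendations. api-server is also scaled by
    # api-server-hpa on CPU and memory, so letting VPA apply changes ("Auto")
    # would make both autoscalers act on the same signals.
    # Quote the value: unquoted Off is parsed as a YAML boolean.
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "4"
          memory: 4Gi
```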
```bash
# Install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# Check VPA recommendations
kubectl get vpa api-server-vpa -o yaml
```
Expected output: once the recommender has gathered usage data, the status.recommendation section lists lowerBound, target, uncappedTarget, and upperBound CPU and memory values for each container, based on historical usage.
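VPA keeps its recommendations inside the minAllowed and maxAllowed bounds from resourcePolicy; the uncappedTarget field shows the value before this capping. The Python sketch below illustrates the clamping step. It parses only the quantity suffixes used in this tutorial (m, Ki, Mi, Gi), and parse_quantity, cap, and the raw recommendations are invented for illustration.

```python
from decimal import Decimal

# Minimal sketch of capping a VPA recommendation to resourcePolicy bounds.
# Assumption: only the suffixes used in this tutorial are supported
# (m = milli for CPU; Ki/Mi/Gi = binary multiples for memory).

_SUFFIXES = {
    "m": Decimal("0.001"),
    "Ki": Decimal(2**10),
    "Mi": Decimal(2**20),
    "Gi": Decimal(2**30),
}


def parse_quantity(quantity: str) -> Decimal:
    """Convert a Kubernetes quantity such as '100m' or '4Gi' to a plain number."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * factor
    return Decimal(quantity)  # no suffix, e.g. cpu: "4"


def cap(recommended: str, min_allowed: str, max_allowed: str) -> Decimal:
    """Clamp a raw recommendation into [min_allowed, max_allowed]."""
    value = parse_quantity(recommended)
    return min(max(value, parse_quantity(min_allowed)), parse_quantity(max_allowed))


if __name__ == "__main__":
    # Bounds from api-server-vpa; raw recommendations are invented.
    print(cap("50m", "100m", "4"))     # CPU raised to minAllowed (0.1 cores)
    print(cap("1500m", "100m", "4"))   # CPU within bounds, unchanged
    print(cap("6Gi", "128Mi", "4Gi"))  # memory lowered to maxAllowed (4Gi in bytes)
```

Decimal keeps values such as 100m exact, so comparisons against the bounds do not suffer from floating-point rounding.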
Practice Questions
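How does HPA calculate the desired number of replicas? It multiplies the current replica count by the ratio of the current metric value to the target value and rounds up: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). If the ratio is within the tolerance (0.1 by default), it leaves the count unchanged. When multiple metrics are defined, it computes a count for each and uses the largest.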
What is the purpose of the stabilization window? It prevents flapping. The controller remembers the replica recommendations it computed during the window and, when scaling down, uses the highest of them (when scaling up, the lowest), so a brief dip or spike in a metric does not trigger a scaling action that is reversed moments later.
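Can HPA and VPA be used together? Yes, but they should not act on the same metrics: avoid letting VPA change CPU and memory requests on a workload that HPA scales by CPU or memory utilization. Use HPA for horizontal scaling on load metrics such as request rate, and VPA for right-sizing container resource requests based on historical usage patterns.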
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.