Kubernetes HPA & VPA Guide — Autoscaling Workloads
This tutorial covers the Kubernetes Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA): key concepts, practical examples, and best practices for applying them effectively.
Kubernetes HPA and VPA automatically adjust pod replicas or resource allocations based on observed metrics, ensuring applications have enough capacity during traffic spikes and don't waste resources during idle periods.
What You'll Learn
You'll master HPA for horizontal scaling (adding/removing replicas based on CPU, memory, or custom metrics) and VPA for vertical scaling (adjusting CPU/memory requests and limits based on historical usage patterns).
Why This Problem Matters
Static replica counts waste money. Provisioning permanently for a peak that is 5x the baseline leaves roughly 80% of that capacity idle outside peak hours, while under-provisioning causes performance degradation and outages. Autoscaling can cut infrastructure costs substantially (savings of 30-50% are commonly cited) while maintaining SLOs.
Real-World Use
Durga Antivirus Pro uses HPA to scale analysis workers based on queue depth (custom metric). When a new malware variant triggers a surge, workers scale from 5 to 200 within minutes. VPA recommends optimal resource sizes for long-running service pods.
Autoscaling Architecture
flowchart TB
    subgraph MetricsSources
        CPU[CPU Metrics]
        MEM[Memory Metrics]
        Custom["Custom Metrics<br/>e.g., queue depth"]
        External["External Metrics<br/>e.g., pub/sub"]
    end
    subgraph Control
        HPA[Horizontal Pod Autoscaler]
        VPA[Vertical Pod Autoscaler]
        MM[Metrics Server]
    end
    subgraph Actions
        ScaleOut["Scale Out<br/>Add replicas"]
        ScaleIn["Scale In<br/>Remove replicas"]
        ResizeUp["Increase<br/>Resource Requests"]
        ResizeDown["Decrease<br/>Resource Requests"]
    end
    subgraph Workload
        Deploy[Deployment]
        Sts[StatefulSet]
    end
    CPU --> MM
    MEM --> MM
    Custom --> HPA
    External --> HPA
    MM --> HPA
    MM --> VPA
    HPA --> ScaleOut
    HPA --> ScaleIn
    VPA --> ResizeUp
    VPA --> ResizeDown
    ScaleOut --> Deploy
    ScaleIn --> Deploy
    ResizeUp --> Deploy
    ResizeDown --> Deploy
HPA Setup
# hpa-cpu.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
kubectl apply -f hpa-cpu.yaml
kubectl get hpa web-hpa -w
Expected output:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
web-hpa Deployment/web 45%/70% 3 20 3 30s
web-hpa Deployment/web 85%/70% 3 20 5 45s
web-hpa Deployment/web 70%/70% 3 20 5 1m
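To make the `averageUtilization: 70` target concrete, here is a minimal sketch of how a utilization percentage and a replica count can be derived from per-pod CPU usage. It is a simplified model, not the controller's actual code: the function names and sample numbers are hypothetical, every pod is assumed to have the same CPU request, and details such as the tolerance band and not-yet-ready pods are ignored.

```python
import math


def average_utilization(usage_millicores, request_millicores):
    """Percent of requested CPU currently in use, summed across all pods."""
    return 100.0 * sum(usage_millicores) / sum(request_millicores)


def desired_replicas(current, utilization, target, min_replicas, max_replicas):
    """Round up the scaled replica count, then clamp to the HPA bounds."""
    desired = math.ceil(current * utilization / target)
    return max(min_replicas, min(desired, max_replicas))


# Hypothetical sample: 3 pods, each requesting 500m CPU.
usage = [450, 400, 425]
requests = [500, 500, 500]

util = average_utilization(usage, requests)
print(f"utilization: {util:.0f}%")                      # utilization: 85%
print("desired:", desired_replicas(3, util, 70, 3, 20))  # desired: 4
```

Because utilization is measured against requests, a Deployment whose containers have no CPU request cannot be scaled on CPU utilization.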
Custom Metrics HPA
Scale based on application-level metrics like queue depth or requests per second:
# hpa-custom.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: queue_items_processed
        target:
          type: AverageValue
          averageValue: 100
    - type: Object
      object:
        metric:
          name: queue_depth
        describedObject:
          apiVersion: v1
          kind: Service
          name: message-queue
        target:
          type: Value
          value: 500
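The two metric types in this manifest are evaluated differently: a Pods metric with an AverageValue target compares the per-pod average to the target, while an Object metric with a Value target compares a single reading to the target. The sketch below illustrates that difference and the rule that the largest proposal wins. The helper names and sample readings are hypothetical, and tolerance handling is left out.

```python
import math


def replicas_for_pods_metric(per_pod_values, target_average):
    """Pods / AverageValue: compare the mean across pods to the target."""
    current = len(per_pod_values)
    average = sum(per_pod_values) / current
    return math.ceil(current * average / target_average)


def replicas_for_object_metric(current_replicas, metric_value, target_value):
    """Object / Value: compare one reading from the described object to the target."""
    return math.ceil(current_replicas * metric_value / target_value)


# Hypothetical readings for a 4-pod worker Deployment.
per_pod_processed = [150, 130, 170, 150]   # queue_items_processed per pod
queue_depth = 1200                         # queue_depth on Service/message-queue

from_pods = replicas_for_pods_metric(per_pod_processed, 100)
from_object = replicas_for_object_metric(4, queue_depth, 500)

# With several metrics, the largest proposal wins, clamped to min/max replicas.
desired = max(2, min(max(from_pods, from_object), 50))
print(from_pods, from_object, desired)     # 6 10 10
```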
Prometheus-Based HPA
# prometheus-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prometheus-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
          selector:
            matchLabels:
              service: api-server
        target:
          type: AverageValue
          averageValue: "1000"
HPA Algorithm
The HPA controller calculates desired replicas using:
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
import math


def calculate_hpa_replicas(
    current_replicas: int,
    current_metric: float,
    desired_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    ratio = current_metric / desired_metric
    # HPA tolerates small deviations: no change while the ratio stays within 0.9 to 1.1
    if abs(1.0 - ratio) <= 0.1:
        return current_replicas
    # Round up (not down), then clamp to the configured bounds
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(desired, max_replicas))


# Example: 10 pods, current CPU 90%, target CPU 70%
replicas = calculate_hpa_replicas(10, 90, 70)
print(f"CPU 90% -> 70%: {replicas} replicas")
# Example: 5 pods, current CPU 40%, target CPU 70%
replicas = calculate_hpa_replicas(5, 40, 70)
print(f"CPU 40% -> 70%: {replicas} replicas")
# Example: 3 pods, current CPU 180%, target CPU 70%
replicas = calculate_hpa_replicas(3, 180, 70)
print(f"CPU 180% -> 70%: {replicas} replicas")
Expected output:
CPU 90% -> 70%: 13 replicas
CPU 40% -> 70%: 3 replicas
CPU 180% -> 70%: 8 replicas
VPA Setup
# vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Auto"  # Auto, Initial, Off
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
# Install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
# Check VPA recommendations
kubectl describe vpa app-vpa
Expected output:
Recommendation:
Container Recommendations:
app:
Lower Bound:
Cpu: 100m
Memory: 150Mi
Upper Bound:
Cpu: 2
Memory: 1Gi
Target:
Cpu: 250m
Memory: 320Mi
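The `minAllowed` and `maxAllowed` values in the resource policy act as a clamp on whatever the recommender proposes. The sketch below illustrates that clamping with the bounds from `vpa.yaml`. The parsing helpers are hypothetical and handle only the quantity forms used in this guide (`m`, whole cores, `Mi`, `Gi`), not the full Kubernetes quantity grammar.

```python
def cpu_to_millicores(quantity: str) -> int:
    """'250m' -> 250, '4' -> 4000. Only these two forms are handled."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_to_mib(quantity: str) -> int:
    """'320Mi' -> 320, '8Gi' -> 8192. Only Mi and Gi are handled."""
    if quantity.endswith("Gi"):
        return int(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported memory quantity: {quantity}")


def clamp(value: int, low: int, high: int) -> int:
    return max(low, min(value, high))


def apply_policy(cpu: str, memory: str) -> tuple:
    """Clamp a raw recommendation to the minAllowed/maxAllowed bounds above."""
    cpu_m = clamp(cpu_to_millicores(cpu),
                  cpu_to_millicores("50m"), cpu_to_millicores("4"))
    mem_mi = clamp(memory_to_mib(memory),
                   memory_to_mib("64Mi"), memory_to_mib("8Gi"))
    return f"{cpu_m}m", f"{mem_mi}Mi"


print(apply_policy("250m", "320Mi"))  # ('250m', '320Mi'), already within bounds
print(apply_policy("10m", "16Gi"))    # ('50m', '8192Mi'), clamped on both sides
```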
VPA Recommendation Engine
# Simplified illustration of percentile-based right-sizing.
# This is not the actual VPA recommender algorithm.
import random
from collections import deque


class VPARecommender:
    def __init__(self, window_size: int = 1440):
        # Keep only the most recent window_size samples
        self.cpu_samples = deque(maxlen=window_size)
        self.mem_samples = deque(maxlen=window_size)

    def add_sample(self, cpu_millicores: int, memory_mb: int):
        self.cpu_samples.append(cpu_millicores)
        self.mem_samples.append(memory_mb)

    def recommend(self) -> dict:
        # Too little data: fall back to fixed defaults
        if len(self.cpu_samples) < 10:
            return {"cpu": "100m", "memory": "128Mi"}
        cpu_p99 = sorted(self.cpu_samples)[int(len(self.cpu_samples) * 0.99)]
        mem_p99 = sorted(self.mem_samples)[int(len(self.mem_samples) * 0.99)]
        cpu_target = cpu_p99 * 1.15  # 15% safety buffer
        mem_target = mem_p99 * 1.15
        # Round up to the next multiple of 50m CPU and 64Mi memory
        cpu_target = max(50, ((int(cpu_target) + 49) // 50) * 50)
        mem_target = max(64, ((int(mem_target) + 63) // 64) * 64)
        return {
            "cpu": f"{cpu_target}m",
            "memory": f"{mem_target}Mi",
        }


vpa = VPARecommender()
for _ in range(2000):
    cpu = random.gauss(200, 50)
    mem = random.gauss(300, 80)
    # Samples are whole millicores / MiB, floored at a small minimum
    vpa.add_sample(max(10, int(cpu)), max(32, int(mem)))
rec = vpa.recommend()
print(f"Recommended: CPU {rec['cpu']}, Memory {rec['memory']}")
Expected output (the samples are random, so the result can vary between runs; CPU is always a multiple of 50m and memory a multiple of 64Mi, and with these parameters the result is typically):
Recommended: CPU 400m, Memory 576Mi
Scaling Behavior Comparison
| Aspect | HPA | VPA |
|---|---|---|
| What it scales | Number of replicas | Resource requests/limits |
| Response time | Seconds to minutes | Minutes to hours |
| Metric type | Resource, custom, and external metrics | CPU/memory only |
| Restart required | No | Yes (mode: Auto) |
| Stateful workloads | Limited (use with PDB) | Works well |
| Cost savings | High (replica count tracks load) | Medium (right-sizing) |
Common Mistakes
1. HPA Without Metrics Server
HPA requires the Metrics Server for CPU/memory metrics. Without it, HPA reports <unknown>/70% and never scales. Install the Metrics Server: kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml.
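2. Leaving VPA in UpdateMode: Off
With updateMode: Off, VPA generates recommendations but never applies them. That is useful as a dry run, but it is easy to forget to switch it afterwards. Note that Off is not the default: in the VPA v1 API, the documented default when updatePolicy is omitted is Auto. Set updateMode: Auto for automatic application, or Initial to apply only at pod creation.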
3. Conflicting HPA and VPA on CPU/Memory
VPA changes CPU/memory requests, and HPA's utilization is measured against those requests, so the two controllers fight each other when both act on the same resource. Use VPA only for workloads not managed by HPA, or let VPA manage CPU/memory while HPA scales on custom or external metrics instead.
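The interaction can be shown with simple arithmetic. In this sketch, actual CPU usage stays constant, yet HPA's proposed replica count changes purely because VPA lowered the request. The function names and numbers are hypothetical, and the tolerance band is ignored.

```python
import math


def utilization(usage_millicores: float, request_millicores: float) -> float:
    """CPU utilization as HPA sees it: usage as a percentage of the request."""
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current: int, util: float, target: float) -> int:
    return math.ceil(current * util / target)


usage = 350       # per-pod usage in millicores, unchanged throughout
target = 70       # HPA averageUtilization target
replicas = 10

before = utilization(usage, 500)   # request before VPA acts
after = utilization(usage, 400)    # VPA lowers the request to 400m

print(before, desired_replicas(replicas, before, target))  # 70.0 10
print(after, desired_replicas(replicas, after, target))    # 87.5 13
```

HPA then adds pods, per-pod usage drops, VPA may lower requests again, and the cycle repeats.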
4. Too Aggressive Scaling
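HPA can thrash: scaling up, then down, then up again. The kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization (default 5 minutes) delays scale-down to prevent this. With autoscaling/v2 you can also set spec.behavior.scaleDown.stabilizationWindowSeconds on an individual HPA.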
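The idea behind the stabilization window can be sketched as follows: the controller remembers recent recommendations and acts on the highest one within the window, so a brief dip in load does not remove pods right away. This is a simplified model under that assumption. The class name, timestamps, and recommendation values are hypothetical.

```python
from collections import deque


class ScaleDownStabilizer:
    """Applies the highest recommendation seen within the last window_seconds."""

    def __init__(self, window_seconds: int = 300):
        self.window_seconds = window_seconds
        self.recommendations = deque()  # (timestamp, desired_replicas)

    def stabilize(self, now: float, desired: int) -> int:
        self.recommendations.append((now, desired))
        # Drop recommendations that have aged out of the window
        cutoff = now - self.window_seconds
        while self.recommendations[0][0] < cutoff:
            self.recommendations.popleft()
        return max(replicas for _, replicas in self.recommendations)


stabilizer = ScaleDownStabilizer(window_seconds=300)
raw = [(0, 10), (60, 4), (120, 6), (330, 4), (400, 12)]  # (seconds, raw desired)
applied = [stabilizer.stabilize(t, d) for t, d in raw]
print(applied)  # [10, 10, 10, 6, 12]
```

The dip to 4 replicas at 60 seconds is held back until the earlier recommendation of 10 ages out, while the rise to 12 takes effect immediately.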
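5. Not Setting Sensible Min/Max Replicas
maxReplicas is a required field and minReplicas defaults to 1, so HPA will not scale to 0 or without limit on its own. The risk is careless values: a minReplicas of 1 removes redundancy, and an oversized maxReplicas can exhaust cluster capacity or budget. Always set reasonable bounds.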
6. Ignoring Pod Startup Time
When scaling up, new pods take time to start and serve traffic. During that time, the metric is still elevated, causing more scale-up. Use readiness probes and startup probes to prevent over-scaling.
7. No Custom Metrics Pipeline
Custom metrics require a metrics adapter (Prometheus Adapter, Datadog Cluster Agent). The pipeline adds latency and complexity. Ensure the adapter is reliable before relying on custom metrics for scaling.
Practice Questions
1. How does HPA calculate the desired number of replicas?
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue)). With the default tolerance of 0.1, HPA scales up when the ratio exceeds 1.1 and scales down when it falls below 0.9. The HPA controller reevaluates every 15 seconds by default.
2. What is the difference between VPA UpdateMode Auto, Initial, and Off?
Auto: automatically applies recommendations by recreating pods. Initial: applies recommendations only at pod creation time (existing pods keep their current resources). Off: generates recommendations only, no automatic application.
3. Why does VPA require pod recreation to change resources?
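Historically, Kubernetes could not change the resource requests of a running container, so the pod had to be recreated with the new values (in-place pod resize first appeared as an alpha feature in Kubernetes 1.27). VPA with updateMode: Auto evicts pods to apply new values, and those evictions respect any PodDisruptionBudget defined for the workload.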
4. Can HPA scale based on multiple metrics simultaneously?
Yes. HPA considers each metric independently and chooses the highest desired replicas across all metrics. If CPU suggests 5 replicas and queue depth suggests 10, the HPA uses 10 replicas.
5. Challenge: Design a scaling strategy for a batch job processing system.
A system processes jobs from a queue. Each job takes 5-60 minutes. Scale should respond to queue depth, but avoid scaling down while jobs are in progress. Design an HPA configuration with custom metrics and PodDisruptionBudget.
Mini Project: Autoscaling Simulator
import math
import random
import time


class HPA:
    def __init__(self, current: int = 3, min_pods: int = 1, max_pods: int = 20):
        self.replicas = current
        self.min_pods = min_pods
        self.max_pods = max_pods
        self.history = []

    def tick(self, cpu_utilization: float, target_cpu: float = 70.0):
        ratio = cpu_utilization / target_cpu
        # Only act when the ratio is outside the 0.9 to 1.1 tolerance band
        if abs(1.0 - ratio) > 0.1:
            desired = math.ceil(self.replicas * ratio)
            self.replicas = max(self.min_pods, min(desired, self.max_pods))
        self.history.append((cpu_utilization, self.replicas))


class LoadSimulator:
    def __init__(self, base_load: float = 30, peak_load: float = 200):
        self.base = base_load
        self.peak = peak_load
        self.time = 0

    def current_load(self) -> float:
        # Sinusoidal load between base and peak, plus random noise
        pattern = (
            self.base
            + (self.peak - self.base) * abs(math.sin(self.time / 10))
        )
        self.time += 1
        return pattern + random.gauss(0, 10)


hpa = HPA(current=3, min_pods=2, max_pods=20)
load = LoadSimulator(base_load=40, peak_load=180)
print(f"{'Time':>6} {'Load%':>8} {'Pods':>6}")
print("-" * 22)
for i in range(30):
    cpu = load.current_load()
    hpa.tick(cpu, target_cpu=70)
    if i % 3 == 0:
        print(f"{i:>6} {cpu:>7.1f}% {hpa.replicas:>6}")
    time.sleep(0.1)
Expected output (the load includes random noise, so the numbers differ on every run; the script prints this header followed by one row for every third tick, 0 through 27):
  Time    Load%   Pods
----------------------
Because the simulated load does not drop when pods are added, the replica count climbs toward max_pods while the load stays above the target. A more realistic simulator would divide the total load by the number of replicas.
What's Next
Congratulations on completing this HPA & VPA guide! Here's where to go from here:
- Practice daily — Add HPA to one of your deployments
- Build a project — Set up custom metrics autoscaling with Prometheus
- Explore related topics — Cluster autoscaler, pod disruption budgets, predictive autoscaling
- Join the community — Share your autoscaling configurations and get feedback
Remember: every expert was once a beginner. Keep scaling!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro