Kubernetes Chaos Testing — Pod, Node & Cluster Resilience
In this tutorial, you'll learn about Kubernetes Chaos Testing. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Kubernetes chaos testing validates that your cluster can survive failures at every level of the stack: pods, nodes, control plane components, and the underlying etcd cluster. Unlike basic pod-kill experiments, comprehensive chaos engineering on Kubernetes tests the entire orchestration layer.
What You Will Learn
This tutorial teaches you how to test Kubernetes resilience at four levels: pod disruptions and scheduling, node failures and taints, control plane API degradation, and etcd quorum and consensus failures.
Why It Matters
Kubernetes adds significant complexity to distributed systems. The control plane, scheduler, kubelet, and etcd each introduce failure modes that do not exist in traditional infrastructure. Testing these components in isolation and in combination reveals configuration errors and architectural weaknesses that pod-level experiments miss.
Real-World Use
DodaTech runs a quarterly "cluster stress test" that simulates a gradual failure of three worker nodes in the Durga Antivirus Pro scanning cluster. The test verifies that the remaining nodes can absorb the workload and that the cluster autoscaler provisions replacement nodes within the target window.
Prerequisites
Before starting, you should understand:
- Kubernetes architecture (control plane, nodes, kubelet, etcd)
- Chaos Mesh installation and fault types
- Chaos Engineering fundamentals
- kubectl and cluster admin access
Step 1: Test Pod Disruption Budgets
Verify that PodDisruptionBudgets (PDBs) protect critical services during voluntary disruptions. PDBs are enforced only through the Eviction API, which kubectl drain uses. Direct deletions, including kubectl delete pod and the Chaos Mesh pod-kill action, bypass them, so the manifest below provides the unprotected baseline for comparison:
# pdb-test.yaml
# pod-kill deletes pods directly, so the PDB does not limit this experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pdb-violation-test
spec:
  action: pod-kill
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: critical-service
  duration: 10s
# Create a PDB in the same namespace that allows at most 1 unavailable pod
kubectl create poddisruptionbudget critical-pdb \
  --namespace production \
  --selector app=critical-service \
  --max-unavailable 1
# Expected output:
# poddisruptionbudget.policy/critical-pdb created
# Evict the pods by draining a node that hosts them
# (worker-node-1 is an example name; kubectl delete pod would bypass the PDB)
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data
# From a second terminal, check how many pods are disrupted while the drain runs
kubectl get pods -n production -l app=critical-service
# Expected output:
# NAME                            READY   STATUS
# critical-service-7d9f8c6b4f-1   1/1     Running
# critical-service-7d9f8c6b4f-2   1/1     Running
# critical-service-7d9f8c6b4f-3   1/1     Running
# critical-service-7d9f8c6b4f-4   0/1     Terminating
# Only 1 pod is disrupted at a time: the PDB rejects further evictions until a replacement is Ready
# Make the node schedulable again after the test
kubectl uncordon worker-node-1
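The headroom a PDB leaves can be checked before and during a test. The following is a minimal sketch, not a definitive check: the function names are illustrative, and `pdb_disruptions_allowed` assumes a kubectl context with read access to the budget you pass in.

```shell
#!/usr/bin/env bash
# allowed_disruptions HEALTHY DESIRED MAX_UNAVAILABLE
# Budget arithmetic for a maxUnavailable PDB:
# allowed = healthy - (desired - maxUnavailable), floored at 0.
allowed_disruptions() {
  local allowed
  allowed=$(( $1 - ($2 - $3) ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

# pdb_disruptions_allowed NAME NAMESPACE
# Reads the live value that Kubernetes publishes in the PDB status.
pdb_disruptions_allowed() {
  kubectl get poddisruptionbudget "$1" --namespace "$2" \
    -o jsonpath='{.status.disruptionsAllowed}'
}
```

With four healthy replicas and a maxUnavailable of 1, `allowed_disruptions 4 4 1` prints 1. Once one pod is down, `allowed_disruptions 3 4 1` prints 0, which is the point at which further evictions are refused.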
Step 2: Test Node Taints and Tolerations
Add a NoExecute taint to a node and verify that only tolerating pods remain. A NoSchedule taint would only block new pods and leave running pods in place:
# Add a taint simulating node degradation
# (NoExecute evicts running pods that lack a matching toleration)
kubectl taint nodes worker-node-1 disaster=failed:NoExecute
# Expected output:
# node/worker-node-1 tainted
# Check which pods remain on the tainted node
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=worker-node-1
# Expected output:
# NAMESPACE     NAME                         READY   STATUS
# kube-system   kube-proxy-worker-node-1     1/1     Running
# kube-system   chaos-daemon-worker-node-1   1/1     Running
# Only pods that tolerate the taint remain. In kubeadm clusters kube-proxy tolerates all taints;
# chaos-daemon stays only if its DaemonSet is configured with a matching toleration
# Verify the taint on the node
kubectl describe node worker-node-1 | grep Taints
# Expected output:
# Taints:             disaster=failed:NoExecute
# Remove the taint after testing
kubectl taint nodes worker-node-1 disaster=failed:NoExecute-
# Expected output:
# node/worker-node-1 untainted
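Reading the pod list by eye works for a single run, but a repeatable test usually needs a pass/fail signal. The sketch below is one rough way to get it, under the assumption that the pods expected to survive can be described by a namespace/name pattern. That is a simplification: the authoritative rule is each pod's tolerations. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# pods_on_node NODE
# Prints one "namespace/name" line per pod scheduled on the node.
pods_on_node() {
  kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=$1" \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
}

# unexpected_pods NODE ALLOWED_REGEX
# Prints pods still on the node that do NOT match the allowed pattern.
# Empty output means only expected pods remain.
unexpected_pods() {
  pods_on_node "$1" | grep -Ev "$2" || true
}
```

For example, `unexpected_pods worker-node-1 '^kube-system/'` lists any pod outside kube-system that is still on the node after the taint is applied.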
Step 3: Test Control Plane API Degradation
Simulate etcd latency to test control plane resilience:
# etcd-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: etcd-latency
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - kube-system
    labelSelectors:
      component: etcd
  delay:
    latency: 2000ms
    jitter: 500ms
  duration: 120s
kubectl apply -f etcd-latency.yaml
# Expected output:
# networkchaos.chaos-mesh.org/etcd-latency created
# While latency is active, test API responsiveness
time kubectl get pods --all-namespaces
# Expected output:
# real 0m4.234s
# (API calls are delayed by 2000ms from etcd latency)
# Check kube-apiserver logs for latency impact
kubectl logs -n kube-system -l component=kube-apiserver --tail=20
# Expected output (illustrative; the log format varies by Kubernetes version):
# I0623 10:00:05.123456 ... "etcd" "latency"=2012ms
# The API server logs show increased etcd request latency
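Besides timing a single call, the API server exposes its own view of etcd through the `/readyz/etcd` health endpoint. The sketch below combines both signals. The function names and the latency budget are assumptions for illustration, and `api_latency_ms` assumes GNU date, since `%N` is not supported by the BSD/macOS date.

```shell
#!/usr/bin/env bash
# etcd_check_ok
# Succeeds when the API server's own etcd readiness check reports "ok".
etcd_check_ok() {
  [ "$(kubectl get --raw='/readyz/etcd' 2>/dev/null)" = "ok" ]
}

# api_latency_ms
# Prints the wall-clock time of one list call in milliseconds.
api_latency_ms() {
  local start end
  start=$(date +%s%N)
  kubectl get pods --all-namespaces >/dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# within_budget MILLIS BUDGET_MILLIS
# Succeeds when the measured latency does not exceed the budget.
within_budget() {
  [ "$1" -le "$2" ]
}
```

A test run might call `within_budget "$(api_latency_ms)" 5000 && etcd_check_ok` while the fault is active, and again after it expires, to confirm that the control plane degrades and recovers as expected.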
Step 4: Test etcd Quorum Loss
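Simulate the loss of two of three etcd members and observe cluster behavior. The commands assume etcd runs as a StatefulSet named etcd-master; kubeadm clusters run etcd as static pods, which cannot be scaled this way. If this etcd backs the API server you are talking to, kubectl itself may time out once quorum is lost, so run the experiment against a disposable cluster:
# Get etcd pods
kubectl get pods -n kube-system -l component=etcd
# Expected output:
# NAME            READY   STATUS
# etcd-master-0   1/1     Running
# etcd-master-1   1/1     Running
# etcd-master-2   1/1     Running
# Scale down etcd to 1 pod (simulate quorum loss)
kubectl scale statefulset etcd-master -n kube-system --replicas=1
# Expected output:
# statefulset.apps/etcd-master scaled
# Verify etcd health
kubectl exec -n kube-system etcd-master-0 -- etcdctl endpoint health
# Expected output (similar to):
# http://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
# (Removing pods does not change etcd membership: the survivor is still 1 of 3 members, below the quorum of 2)
# Try to write a configuration change
kubectl exec -n kube-system etcd-master-0 -- etcdctl put /test/key value
# Expected output (similar to):
# Error: context deadline exceeded
# Writes fail until a second member returns and quorum is restored
# Restore the members after testing
kubectl scale statefulset etcd-master -n kube-system --replicas=3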
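Quorum is computed from the configured membership, not from the number of pods that happen to be running: a cluster of n members needs floor(n/2) + 1 of them to commit a write. The arithmetic can be sketched as follows (the function names are illustrative):

```shell
#!/usr/bin/env bash
# etcd_quorum MEMBERS
# Members that must be reachable to commit a write: floor(n/2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# etcd_fault_tolerance MEMBERS
# Members that can fail while the cluster keeps quorum.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# has_quorum MEMBERS HEALTHY
# Succeeds when the healthy members still form a majority.
has_quorum() {
  [ "$2" -ge "$(etcd_quorum "$1")" ]
}

for members in 1 3 5; do
  echo "members=$members quorum=$(etcd_quorum "$members") tolerates=$(etcd_fault_tolerance "$members")"
done
# members=1 quorum=1 tolerates=0
# members=3 quorum=2 tolerates=1
# members=5 quorum=3 tolerates=2
```

A three-member cluster with one healthy member fails `has_quorum 3 1`, so it cannot accept writes even though the surviving process is still running.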
Step 5: Test Persistent Volume Failures
Simulate a persistent volume becoming unavailable:
# Find pods that mount a PVC (pods without one show <none> in the PVC column)
kubectl get pods \
  -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,PVC:.spec.volumes[*].persistentVolumeClaim.claimName'
# Expected output (the PVC column shows each pod's claim name):
# NAME         NODE       PVC
# postgres-0   worker-1   <claim-name>
# Block access to the PV through network chaos
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pv-block
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: postgres
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - default
      labelSelectors:
        role: storage-node
  duration: 30s
EOF
# Expected output:
# networkchaos.chaos-mesh.org/pv-block created
# Watch the pod status
kubectl get pod postgres-0 -w
# Expected output:
# postgres-0   1/1   Running
# postgres-0   0/1   Running
# The pod stays in the Running phase but drops to 0/1 Ready once a readiness probe that
# depends on storage starts failing; a running pod does not return to Pending
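Once the partition expires, recovery matters as much as the failure itself. The sketch below waits for the pod to report Ready again and compares the container restart count with a baseline recorded before the experiment. The function names, the 120s timeout, and the use of the first container's restart count are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# wait_for_ready POD NAMESPACE TIMEOUT
# Blocks until the pod reports the Ready condition or the timeout expires.
wait_for_ready() {
  kubectl wait --for=condition=Ready "pod/$1" --namespace "$2" --timeout="$3" >/dev/null
}

# restart_count POD NAMESPACE
# Prints the restart count of the pod's first container.
restart_count() {
  kubectl get pod "$1" --namespace "$2" \
    -o jsonpath='{.status.containerStatuses[0].restartCount}'
}

# verify_recovery POD NAMESPACE BASELINE_RESTARTS
# Reports whether the pod recovered in place or only after a restart.
verify_recovery() {
  local current
  if ! wait_for_ready "$1" "$2" 120s; then
    echo "not-ready"
    return 1
  fi
  current=$(restart_count "$1" "$2")
  if [ "$current" -gt "$3" ]; then
    echo "recovered-after-restart"
  else
    echo "recovered-in-place"
  fi
}
```

Record the baseline with `restart_count postgres-0 default` before applying the fault, then pass it to `verify_recovery` afterwards. A data-level check, such as reading back a row written before the partition, is still needed to rule out data loss.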
Learning Path
flowchart LR
    A[Kubernetes Chaos Basics] --> B[Kubernetes Chaos Testing]
    B --> C[Database Chaos]
    C --> D[Network Chaos]
    D --> E[Chaos Observability]
    style B fill:#f90,color:#fff
Common Errors
- Testing etcd by directly killing pods instead of simulating network issues: Killing etcd pods triggers the statefulset controller to restart them. Network partitions are more realistic.
- Forgetting that NoExecute taints evict all pods without tolerations: NoExecute taints immediately evict running pods. Use NoSchedule for less disruptive testing.
- PDB experiments without understanding maxUnavailable behavior: PDBs limit voluntary disruptions made through the Eviction API, such as kubectl drain. They do not protect against direct pod deletions or node failures. Test both scenarios.
- Scaling down etcd without verifying the backup: If the remaining etcd member fails during the experiment, the cluster cannot be recovered without a snapshot. Always verify backups first.
- Not testing PV recovery after partition heals: After a PV becomes available again, verify that the pod can remount and resume normal operation without data loss.
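The backup advice above can be turned into a preflight guard that refuses to start an etcd experiment without a recent snapshot on disk. This is a sketch under assumptions: the snapshot is taken separately (for example with `etcdctl snapshot save`), the function names are hypothetical, and a freshness check is only a weak proxy, since restoring the snapshot in a scratch environment is the real verification.

```shell
#!/usr/bin/env bash
# snapshot_is_fresh FILE MAX_AGE_MINUTES
# Succeeds when FILE exists, is non-empty, and was modified
# less than MAX_AGE_MINUTES ago.
snapshot_is_fresh() {
  [ -s "$1" ] && [ -n "$(find "$1" -mmin "-$2" 2>/dev/null)" ]
}

# require_fresh_snapshot FILE MAX_AGE_MINUTES
# Prints a confirmation, or an error on stderr with a non-zero exit.
require_fresh_snapshot() {
  if snapshot_is_fresh "$1" "$2"; then
    echo "snapshot ok: $1"
  else
    echo "refusing to run: no fresh snapshot at $1" >&2
    return 1
  fi
}
```

A wrapper script could call `require_fresh_snapshot "$SNAPSHOT_PATH" 60 || exit 1` before scaling or partitioning etcd, where `SNAPSHOT_PATH` is whatever location your backup job writes to.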
Practice Questions
- How does a PodDisruptionBudget protect services during voluntary disruptions?
- What is the difference between NoSchedule and NoExecute taints?
- How does etcd latency affect Kubernetes API server performance?
- What happens to a pod when its persistent volume becomes unavailable?
- Why should you test both pod-kill and network-partition experiments on stateful workloads?
Challenge
Design and execute a comprehensive Kubernetes chaos test suite that covers: pod disruption budget validation by attempting to evict more pods than the PDB allows, node taint testing by adding a NoExecute taint and verifying pod migration, etcd quorum testing by partitioning one etcd member, and persistent volume recovery by blocking and unblocking network access to the storage backend.
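One possible skeleton for such a suite is a small runner that checks steady state, injects the fault, verifies the outcome, and always cleans up. This is a sketch only: `run_experiment` and the callback convention are assumptions, and each callback would wrap the kubectl or Chaos Mesh commands for one experiment.

```shell
#!/usr/bin/env bash
# run_experiment NAME STEADY INJECT VERIFY CLEANUP
# Each argument after NAME is the name of a shell function.
# Cleanup runs whether or not injection and verification succeed.
run_experiment() {
  local name steady inject verify cleanup rc
  name=$1; steady=$2; inject=$3; verify=$4; cleanup=$5; rc=0
  if ! "$steady"; then
    echo "$name: SKIPPED (steady state not met)"
    return 2
  fi
  "$inject" || rc=1
  if [ "$rc" -eq 0 ]; then
    "$verify" || rc=1
  fi
  "$cleanup" || echo "$name: cleanup failed, inspect the cluster manually" >&2
  if [ "$rc" -eq 0 ]; then
    echo "$name: PASS"
  else
    echo "$name: FAIL"
  fi
  return "$rc"
}
```

Skipping when the steady-state check fails keeps an already degraded cluster from being stressed further, and it avoids blaming the experiment for a problem that existed before it started.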