Kubernetes Chaos Testing — Pod, Node & Cluster Resilience

DodaTech | Updated 2026-06-23 | 6 min read

This tutorial walks through Kubernetes chaos testing with Chaos Mesh and kubectl: the key concepts, five hands-on experiments, and the mistakes to avoid.

Kubernetes chaos testing validates that your cluster can survive failures at every level of the stack: pods, nodes, control plane components, and the underlying etcd cluster. Unlike basic pod-kill experiments, comprehensive chaos engineering on Kubernetes tests the entire orchestration layer.

What You Will Learn

This tutorial teaches you how to test Kubernetes resilience at five levels: pod disruptions and scheduling, node failures and taints, control plane API degradation, etcd quorum and consensus failures, and persistent volume failures.

Why It Matters

Kubernetes adds significant complexity to distributed systems. The control plane, scheduler, kubelet, and etcd each introduce failure modes that do not exist in traditional infrastructure. Testing these components in isolation and in combination reveals configuration errors and architectural weaknesses that pod-level experiments miss.

Real-World Use

DodaTech runs a quarterly "cluster stress test" that simulates a gradual failure of three worker nodes in the Durga Antivirus Pro scanning cluster. The test verifies that the remaining nodes can absorb the workload and that the cluster autoscaler provisions replacement nodes within the target window.

Prerequisites

Before starting you should understand:

Kubernetes architecture (control plane, nodes, kubelet, etcd)
Chaos Mesh installation and fault types
Chaos Engineering fundamentals
kubectl and cluster admin access

Step 1: Test Pod Disruption Budgets

Verify that PodDisruptionBudgets protect critical services during voluntary disruptions:

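# pdb-test.yaml
# Note: pod-kill deletes pods directly, so a PDB does not limit it.
# Use this experiment as the involuntary-disruption baseline.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pdb-violation-test
spec:
  action: pod-kill
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: critical-service
  duration: 10s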

# Create a PDB in the same namespace as the pods, allowing at most 1 unavailable pod
kubectl create poddisruptionbudget critical-pdb \
  --namespace production \
  --selector app=critical-service \
  --max-unavailable 1

# Expected output:
# poddisruptionbudget.policy/critical-pdb created

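# Drain a node that hosts the pods (worker-node-1 is an example name).
# drain goes through the Eviction API, which honors the PDB; kubectl delete
# bypasses PDBs, and --all cannot be combined with a label selector.
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data

# Expected output (excerpt, if more than one matching pod runs on the node):
# evicting pod production/critical-service-7d9f8c6b4f-4
# error when evicting pods/"critical-service-7d9f8c6b4f-3" -n "production" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

# Check how many pods are actually unavailable
kubectl get pods -n production -l app=critical-service
# Expected output:
# NAME                             READY   STATUS
# critical-service-7d9f8c6b4f-1   1/1     Running
# critical-service-7d9f8c6b4f-2   1/1     Running
# critical-service-7d9f8c6b4f-3   1/1     Running
# critical-service-7d9f8c6b4f-4   0/1     Terminating
# Only 1 pod is evicted at a time because the PDB prevents more than 1 unavailable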
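
The arithmetic a PDB applies when it decides whether another eviction is allowed can be sketched in a few lines of shell. This is a minimal sketch, assuming a PDB that sets maxUnavailable as an absolute number; the function name pdb_allowed_disruptions is hypothetical, and the formula mirrors how the disruptionsAllowed status field is derived (healthy pods minus the number that must stay healthy).

```shell
#!/bin/sh
# pdb_allowed_disruptions EXPECTED HEALTHY MAX_UNAVAILABLE
# Prints how many more pods may be evicted right now.
# Hypothetical helper, not part of kubectl.
pdb_allowed_disruptions() {
  expected=$1
  healthy=$2
  max_unavailable=$3

  # Number of pods that must stay healthy; never below zero.
  desired_healthy=$((expected - max_unavailable))
  if [ "$desired_healthy" -lt 0 ]; then
    desired_healthy=0
  fi

  # Remaining budget; an exhausted budget is reported as zero.
  allowed=$((healthy - desired_healthy))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

# 4 replicas, all healthy, maxUnavailable=1: prints 1
pdb_allowed_disruptions 4 4 1
# One pod is already terminating: prints 0, further evictions are refused
pdb_allowed_disruptions 4 3 1
```

To compare the result with a live cluster, read the same value with kubectl get pdb critical-pdb -o jsonpath='{.status.disruptionsAllowed}', adding -n for the namespace where the PDB lives.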

Step 2: Test Node Taints and Toleration

Add a taint to a node and verify that only tolerating pods remain:

# Add a NoExecute taint simulating node degradation
# (NoSchedule only blocks new pods; NoExecute also evicts running pods without a toleration)
kubectl taint nodes worker-node-1 disaster=failed:NoExecute

# Expected output:
# node/worker-node-1 tainted

# Check which pods remain on the tainted node
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=worker-node-1
# Expected output:
# NAMESPACE   NAME                              READY   STATUS
# kube-system kube-proxy-worker-node-1          1/1     Running
# kube-system chaos-daemon-worker-node-1        1/1     Running
# Pods whose tolerations match the taint stay; all other pods are evicted.
# DaemonSet pods stay only if their tolerations cover this taint.

# Verify the taint is set
kubectl describe node worker-node-1 | grep Taints
# Expected output:
# Taints:             disaster=failed:NoExecute

# Remove the taint after testing
kubectl taint nodes worker-node-1 disaster=failed:NoExecute-
# Expected output:
# node/worker-node-1 untainted
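
The three taint effects differ in what they do to running pods and to new pods. The decision table below is a minimal sketch; the function names running_pod_outcome and new_pod_outcome are hypothetical, and the sketch ignores tolerationSeconds, which delays a NoExecute eviction instead of preventing it.

```shell
#!/bin/sh
# running_pod_outcome EFFECT TOLERATED
# EFFECT: NoSchedule | PreferNoSchedule | NoExecute; TOLERATED: yes | no
# Prints what happens to a pod that is already running on the tainted node.
running_pod_outcome() {
  if [ "$1" = "NoExecute" ] && [ "$2" = "no" ]; then
    echo "evicted"
  else
    echo "keeps-running"
  fi
}

# new_pod_outcome EFFECT TOLERATED
# Prints how the scheduler treats a new pod with respect to the tainted node.
new_pod_outcome() {
  if [ "$2" = "yes" ]; then
    echo "allowed"
    return 0
  fi
  case "$1" in
    NoSchedule|NoExecute) echo "blocked" ;;
    PreferNoSchedule)     echo "discouraged" ;;
    *)                    echo "allowed" ;;
  esac
}

for effect in NoSchedule PreferNoSchedule NoExecute; do
  echo "$effect: running=$(running_pod_outcome "$effect" no) new=$(new_pod_outcome "$effect" no)"
done
```

Only the NoExecute row reports an eviction, which is why a NoSchedule taint alone leaves existing workloads on the node.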

Step 3: Test Control Plane API Degradation

Simulate etcd latency to test control plane resilience:

# etcd-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: etcd-latency
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - kube-system
    labelSelectors:
      component: etcd
  delay:
    latency: 2000ms
    jitter: 500ms
  duration: 120s

kubectl apply -f etcd-latency.yaml
# Expected output:
# networkchaos.chaos-mesh.org/etcd-latency created

# While latency is active, test API responsiveness
time kubectl get pods --all-namespaces
# Expected output:
# real    0m4.234s
# (Requests that reach etcd slow down by roughly 1.5 to 2.5 s per delayed exchange)

# Check kube-apiserver logs for latency impact
kubectl logs -n kube-system -l component=kube-apiserver --tail=20
# Expected: slow-request trace entries and etcd timeout warnings.
# The exact log format depends on the Kubernetes version.
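
The latency and jitter values in etcd-latency.yaml translate into a range of added delay rather than a fixed number. The sketch below assumes jitter is applied as a plus-or-minus range around the base latency, as Linux netem does, and that every delayed exchange adds one such delay; the function name delay_bounds and the exchange count are hypothetical illustration parameters.

```shell
#!/bin/sh
# delay_bounds LATENCY_MS JITTER_MS EXCHANGES
# Prints "MIN MAX": the range of added delay in milliseconds.
delay_bounds() {
  latency=$1
  jitter=$2
  exchanges=$3

  # A single delay cannot be negative, even if jitter exceeds latency.
  per_min=$((latency - jitter))
  if [ "$per_min" -lt 0 ]; then
    per_min=0
  fi
  per_max=$((latency + jitter))

  echo "$((per_min * exchanges)) $((per_max * exchanges))"
}

# One delayed exchange with 2000ms latency and 500ms jitter: prints "1500 2500"
delay_bounds 2000 500 1
# Two delayed exchanges: prints "3000 5000"
delay_bounds 2000 500 2
```

A measured wall-clock time of about 4 seconds would be consistent with a request that involves two delayed exchanges, which is why a single timing should not be read as the configured latency.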

Step 4: Test etcd Quorum Loss

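Stop two of the three etcd members and observe how the cluster behaves without quorum. This step assumes etcd runs as a StatefulSet named etcd-master; on kubeadm clusters etcd runs as static pods, so kubectl scale does not apply there.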

# Get etcd pods
kubectl get pods -n kube-system -l component=etcd
# Expected output:
# NAME           READY   STATUS
# etcd-master-0  1/1     Running
# etcd-master-1  1/1     Running
# etcd-master-2  1/1     Running

# Scale down etcd to 1 replica (simulate quorum loss)
kubectl scale statefulset etcd-master -n kube-system --replicas=1

# Expected output:
# statefulset.apps/etcd-master scaled

# Verify etcd health
kubectl exec -n kube-system etcd-master-0 -- etcdctl endpoint health
# Expected output:
# 127.0.0.1:2379 is unhealthy: failed to commit proposal: context deadline exceeded
# Error: unhealthy cluster
# (The member list still holds 3 members, so quorum is 2; 1 running member cannot reach it)

# Try to write a configuration change
kubectl exec -n kube-system etcd-master-0 -- etcdctl put /test/key value
# Expected output:
# Error: context deadline exceeded
# Writes fail until a second member returns. If this etcd backs the API server,
# kubectl itself stops responding, so run etcdctl on the etcd host instead.
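
etcd commits a write only when a majority of its configured members agree, so quorum depends on the member list and not on how many pods happen to be running. The helper functions below are a minimal sketch of that majority rule; their names (etcd_quorum, etcd_fault_tolerance, has_quorum) are hypothetical.

```shell
#!/bin/sh
# etcd_quorum MEMBERS: smallest majority of the configured member count.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# etcd_fault_tolerance MEMBERS: how many members can fail while quorum holds.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# has_quorum MEMBERS RUNNING: succeeds if the running members form a majority.
has_quorum() {
  [ "$2" -ge "$(etcd_quorum "$1")" ]
}

for members in 1 3 5; do
  echo "members=$members quorum=$(etcd_quorum "$members") tolerates=$(etcd_fault_tolerance "$members")"
done

# A 3-member cluster with a single running member has no quorum.
if has_quorum 3 1; then
  echo "3 members, 1 running: quorum held"
else
  echo "3 members, 1 running: quorum lost"
fi
```

A cluster of one member has quorum only if the member list itself contains one member, which requires removing the other members with etcdctl member remove rather than stopping their pods.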

Step 5: Test Persistent Volume Failures

Simulate a persistent volume becoming unavailable:

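# Find pods that mount a PVC (pods without one show <none> and are filtered out)
kubectl get pods -o custom-columns='NAME:.metadata.name,PVC:.spec.volumes[*].persistentVolumeClaim.claimName,NODE:.spec.nodeName' | grep -v '<none>'
# Expected output (the PVC name is an example):
# NAME         PVC               NODE
# postgres-0   data-postgres-0   worker-1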

# Block pod traffic to the storage backend through network chaos
# (this only affects storage traffic that flows through pod networking)
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pv-block
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: postgres
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - default
      labelSelectors:
        role: storage-node
  duration: 30s
EOF

# Expected output:
# networkchaos.chaos-mesh.org/pv-block created

# Watch the pod status
kubectl get pod postgres-0 -w
# Expected output:
# postgres-0   1/1   Running
# postgres-0   0/1   Running
# The pod stays scheduled but turns unready once its readiness probe fails,
# because it cannot reach the storage backend (assumes a readiness probe is configured)
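
Once the partition ends, recovery should be checked against a deadline rather than by eye. The polling helper below is a minimal sketch; wait_until is a hypothetical function name, and the timeout and interval values are illustrative.

```shell
#!/bin/sh
# wait_until TIMEOUT_S INTERVAL_S COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_S seconds until it succeeds.
# Returns 0 on success, 1 if TIMEOUT_S seconds pass without success.
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_elapsed=0
  while :; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
      return 1
    fi
    sleep "$wu_interval"
    wu_elapsed=$((wu_elapsed + wu_interval))
  done
}

# Demonstration with a check that succeeds immediately.
if wait_until 3 1 test -d /; then
  echo "recovered"
else
  echo "not recovered in time"
fi
```

Against the experiment above, a call such as wait_until 120 5 kubectl exec postgres-0 -- pg_isready would fail the test if PostgreSQL does not accept connections within two minutes of the partition ending; this assumes the container image ships pg_isready.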

Learning Path

Kubernetes Chaos Basics → Kubernetes Chaos Testing (this tutorial) → Database Chaos → Network Chaos → Chaos Observability

Common Errors

Testing etcd by directly killing pods instead of simulating network issues: Killing etcd pods triggers the statefulset controller to restart them. Network partitions are more realistic.
Forgetting that NoExecute taints evict all pods without tolerations: NoExecute taints immediately evict running pods. Use NoSchedule for less disruptive testing.
PDB experiments without understanding maxUnavailable behavior: PDBs only limit voluntary evictions made through the Eviction API (for example, kubectl drain). They do not protect against direct pod deletion or node failures. Test both scenarios.
Scaling down etcd without verifying the backup: If the remaining etcd node fails during the experiment, the cluster is unrecoverable. Always verify backups first.
Not testing PV recovery after partition heals: After a PV becomes available again, verify that the pod can remount and resume normal operation without data loss.

Practice Questions

How does a PodDisruptionBudget protect services during voluntary disruptions?
What is the difference between NoSchedule and NoExecute taints?
How does etcd latency affect Kubernetes API server performance?
What happens to a pod when its persistent volume becomes unavailable?
Why should you test both pod-kill and network-partition experiments on stateful workloads?

Challenge

Design and execute a comprehensive Kubernetes chaos test suite that covers: pod disruption budget validation by attempting to evict more pods than the PDB allows, node taint testing by adding a NoExecute taint and verifying that evicted pods are rescheduled on other nodes, etcd quorum testing by partitioning one etcd member, and persistent volume recovery by blocking and unblocking network access to the storage backend.

FAQ

What is Kubernetes chaos testing?

Kubernetes chaos testing validates cluster resilience by injecting faults at the pod, node, control plane, and etcd levels to uncover configuration errors and architectural weaknesses.

How do PodDisruptionBudgets work during chaos experiments?

PDBs limit how many pods of a service can be unavailable at once during voluntary disruptions. They are enforced only for evictions made through the Eviction API, for example by kubectl drain; a direct pod deletion, such as a pod-kill fault, is not limited by the PDB.

What happens when you taint a Kubernetes node?

Adding a NoSchedule taint prevents new pods from scheduling on the node. Adding a NoExecute taint evicts existing pods that do not tolerate the taint.

How does etcd failure affect Kubernetes?

Losing etcd quorum makes the Kubernetes API server read-only or completely unavailable, which blocks cluster operations such as deployments, scaling, and pod scheduling. Pods that are already running keep running, but nothing can be changed until quorum is restored.

Can Kubernetes recover from a persistent volume failure?

If the storage backend recovers, the pod can remount the volume. Data consistency depends on the storage system and application. Always test recovery after volume failures.
