Kubernetes Chaos — Pod Failures, DNS Issues & Resource Pressure
This tutorial covers Chaos Engineering on Kubernetes: the key concepts, worked examples, and the practices that keep experiments safe.
Kubernetes Chaos Engineering targets the failure modes specific to container orchestration: pods crash, nodes drain, DNS records change, and resource limits are exceeded. Running these experiments helps you verify that your cluster survives such common failures.
What You Will Learn
This tutorial teaches you how to inject pod-level faults, DNS failures, node resource pressure, and container resource exhaustion on Kubernetes clusters using Chaos Mesh and kubectl.
Why It Matters
Kubernetes clusters are complex distributed systems with many moving parts. A single misconfigured readiness probe, a DNS cache issue, or a node under memory pressure can cascade into a cluster-wide outage. Testing these scenarios proactively reveals configuration errors and architectural weaknesses before they cause an incident.
Real-World Use
DodaTech runs a daily Kubernetes chaos schedule that kills one random pod in every deployment with at least 3 replicas. The experiment verifies that each deployment meets the minimum replica requirement and that the horizontal pod autoscaler responds correctly.
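The "at least 3 replicas" rule above can be sketched as a small filter. This is a minimal illustration, not DodaTech's actual tooling: the function name `eligible_deployments` is hypothetical, and it assumes its input is one `NAME REPLICAS` pair per line.

```shell
# Hypothetical helper: read "NAME REPLICAS" lines on stdin and print the
# names of deployments that have at least MIN replicas (default 3).
# Lines whose replica column is not a number are skipped.
eligible_deployments() {
  min="${1:-3}"
  awk -v min="$min" 'NF >= 2 && $2 ~ /^[0-9]+$/ && $2 + 0 >= min + 0 { print $1 }'
}

# Example wiring against a live cluster (not executed in this sketch):
#   kubectl get deployments --no-headers \
#     -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas \
#     | eligible_deployments 3
```

Keeping the filter separate from the kubectl call makes it easy to review which deployments an experiment would touch before anything is killed.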
Prerequisites
Before starting you should understand:
- Kubernetes operations (pods, deployments, services, nodes)
- Chaos Engineering concepts (steady state, hypothesis, blast radius)
- Chaos Mesh installation and basic usage
- kubectl command-line proficiency
Step 1: Kill a Pod and Verify Recovery
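The most basic Kubernetes chaos experiment deletes a single pod and watches the deployment replace it:
# Pick one pod by label, then delete only that pod.
# (Passing -l directly to `kubectl delete pod` would delete every matching pod.)
POD=$(kubectl get pods -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD"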
# Watch the deployment recreate the pod
kubectl get pods -l app=payment-service -w
# Expected output:
# payment-service-7d9f8c6b4f-abc1 1/1 Terminating
# payment-service-7d9f8c6b4f-abc1 0/1 Completed
# payment-service-7d9f8c6b4f-abc2 0/1 Pending
# payment-service-7d9f8c6b4f-abc2 0/1 ContainerCreating
# payment-service-7d9f8c6b4f-abc2 1/1 Running
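Watching the output by eye works for a first run, but a steady-state check is easier to automate. The sketch below counts fully ready pods from `kubectl get pods --no-headers` output; the helper name `ready_pod_count` is hypothetical, and it assumes the default column order (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: count pods whose READY column shows every container
# ready (for example 1/1) and whose STATUS is Running.
# Input: lines as printed by `kubectl get pods --no-headers`.
ready_pod_count() {
  awk '{
    split($2, r, "/")
    if ($3 == "Running" && r[1] == r[2] && r[2] > 0) n++
  } END { print n + 0 }'
}

# Example wiring against a live cluster (not executed in this sketch):
#   kubectl get pods -l app=payment-service --no-headers | ready_pod_count
# kubectl also has a built-in wait; this assumes the deployment itself is
# named payment-service:
#   kubectl rollout status deployment/payment-service --timeout=60s
```

Comparing the count before and after the pod deletion gives a simple pass or fail signal for the recovery hypothesis.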
Step 2: Inject DNS Failures
Use Chaos Mesh DNSChaos to simulate DNS resolution failures:
# dns-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-failure-test
spec:
  action: error
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: web-service
  patterns:
    - "*.external-service.com"
  duration: 120s
kubectl apply -f dns-failure.yaml
# Expected output:
# dnschaos.chaos-mesh.org/dns-failure-test created
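# From inside an affected pod, test DNS resolution
# (kubectl exec takes a pod name, not a label selector, so look the pod up first)
POD=$(kubectl get pods -n staging -l app=web-service -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "$POD" -n staging -- nslookup external-service.com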
# Expected output:
# ;; Got recursion not available from ::1, trying next server
# ;; connection timed out; no servers could be reached
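To turn the lookup into an automated check, the nslookup output can be classified rather than read by hand. The following is a rough heuristic sketch: the helper name `classify_lookup` is hypothetical, and it assumes a successful lookup prints a line starting with `Name:`, which is true of common nslookup builds but may not hold for every image.

```shell
# Hypothetical helper: read nslookup output on stdin and print "resolved"
# if an answer section (a line starting with "Name:") is present,
# otherwise print "failed".
classify_lookup() {
  if grep -q '^Name:'; then
    echo resolved
  else
    echo failed
  fi
}

# Example wiring against a live cluster (not executed in this sketch),
# where $POD holds the name of an affected pod:
#   kubectl exec "$POD" -n staging -- nslookup external-service.com | classify_lookup
```

While the experiment is active the targeted domain should classify as `failed`, and a domain outside the chaos patterns should still classify as `resolved`.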
Step 3: Simulate Node Resource Pressure
Draining a node evicts its pods and pushes that load onto the remaining nodes. Drain one node and verify that the workloads migrate:
# Mark a node as unschedulable and drain it
kubectl drain worker-node-1 \
--ignore-daemonsets \
--delete-emptydir-data
# Expected output:
# node/worker-node-1 drained
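# Verify that no workload pods remain on the drained node (all namespaces)
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=worker-node-1
# Expected output:
# (only DaemonSet-managed pods remain on worker-node-1; all other pods have migrated)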
# Uncordon the node after testing
kubectl uncordon worker-node-1
# Expected output:
# node/worker-node-1 uncordoned
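Before draining, it helps to confirm that enough schedulable nodes remain to absorb the evicted pods. The sketch below counts them from `kubectl get nodes --no-headers` output; the helper name `schedulable_node_count` is hypothetical, and it assumes the default column order (NAME, STATUS, ...). It does not account for taints, so tainted control-plane nodes are counted too.

```shell
# Hypothetical helper: count nodes whose STATUS column is exactly "Ready".
# Cordoned nodes show "Ready,SchedulingDisabled" and are not counted.
# Input: lines as printed by `kubectl get nodes --no-headers`.
schedulable_node_count() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Example wiring against a live cluster (not executed in this sketch):
#   kubectl get nodes --no-headers | schedulable_node_count
# To make sure the node is uncordoned even if the script aborts:
#   trap 'kubectl uncordon worker-node-1' EXIT
```

If the count would drop below what the workloads need, skip the drain rather than turn the experiment into a real outage.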
Step 4: Test Container Resource Limits
Configure a container with strict limits and push beyond them:
# resource-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-stress-test
spec:
  containers:
    - name: stress
      image: polinux/stress
      resources:
        limits:
          memory: "128Mi"
          cpu: "500m"
      command: ["stress"]
      args: ["--vm", "1", "--vm-bytes", "256M", "--timeout", "30"]
kubectl apply -f resource-pod.yaml
# Expected output:
# pod/memory-stress-test created
# Watch what happens when memory limit is exceeded
kubectl describe pod memory-stress-test
# Expected output:
# State: Waiting
# Reason: CrashLoopBackOff
# Last State: Terminated
# Reason: OOMKilled
# The pod is OOMKilled because 256MB exceeds the 128Mi limit
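The comparison behind the OOM kill can be checked with a little arithmetic. The sketch below converts memory quantities to bytes using Kubernetes suffix rules (`Mi` is 1024 * 1024, `M` is 1000 * 1000) and compares an allocation against a limit. The function names are hypothetical, only integer values with the listed suffixes are handled, and the stress tool may interpret its own `M` suffix differently, so treat the result as an estimate.

```shell
# Hypothetical helper: convert an integer Kubernetes memory quantity
# (for example 128Mi, 256M, 1Gi) to bytes. Unsupported input returns 1.
quantity_to_bytes() {
  num=${1%%[!0-9]*}        # leading digits
  suffix=${1#"$num"}       # whatever follows the digits
  [ -n "$num" ] || { echo "unsupported quantity: $1" >&2; return 1; }
  case "$suffix" in
    "") mult=1 ;;
    k)  mult=1000 ;;
    M)  mult=1000000 ;;
    G)  mult=1000000000 ;;
    Ki) mult=1024 ;;
    Mi) mult=1048576 ;;
    Gi) mult=1073741824 ;;
    *)  echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
  echo $((num * mult))
}

# Succeeds when the allocation ($1) is larger than the limit ($2).
exceeds_limit() {
  [ "$(quantity_to_bytes "$1")" -gt "$(quantity_to_bytes "$2")" ]
}

# Example: exceeds_limit 256M 128Mi && echo "expect OOMKilled"
```

With the values from this step, 128Mi is 134217728 bytes and 256M is 256000000 bytes, so the allocation exceeds the limit under this interpretation.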
Step 5: Use PodChaos in Scheduler Mode
Schedule regular chaos experiments using cron syntax. In Chaos Mesh 2.x, recurring experiments are defined with a Schedule resource that wraps the experiment spec:
# scheduled-pod-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: daily-pod-kill
spec:
  schedule: "@daily"
  historyLimit: 2
  concurrencyPolicy: Forbid
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
kubectl apply -f scheduled-pod-chaos.yaml
# Expected output:
# schedule.chaos-mesh.org/daily-pod-kill created
# List scheduled experiments
kubectl get schedule
# The daily-pod-kill Schedule should appear in the list
Learning Path
Infrastructure Faults -> Kubernetes Chaos (this tutorial) -> Chaos Engineering Pipeline
Common Errors
- Draining a node with critical daemonsets: Daemonsets like kube-proxy and networking plugins must be excluded from drain with --ignore-daemonsets.
- DNS chaos blocking essential DNS records: Blocking kubernetes.default.svc.cluster.local can cut the affected pods off from the Kubernetes API and in-cluster service discovery. Scope DNS chaos to external service patterns only.
- Forgetting to uncordon drained nodes: A drained node stays unschedulable indefinitely. Always uncordon after testing.
- Setting resource limits too tight for stress testing: The stress process must allocate more memory than the container's memory limit to trigger an OOM kill. Separately, if the pod's resource requests exceed the node's allocatable capacity, the pod will not be scheduled.
- Running PodChaos on singleton deployments: Killing the only pod of a deployment creates a real outage. Use mode: one only on deployments with 3 or more replicas.
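The DNS scoping advice in the list above can be enforced with a pre-flight check before a DNSChaos manifest is applied. This is a minimal sketch under stated assumptions: the helper name `is_external_pattern` is hypothetical, and it assumes the cluster uses the default `cluster.local` domain.

```shell
# Hypothetical guard: succeed only for DNS chaos patterns that do not
# look like cluster-internal names. Rejects empty patterns, a bare "*",
# anything containing cluster.local, and names ending in or containing .svc
is_external_pattern() {
  case "$1" in
    ""|"*"|*cluster.local*|*.svc|*.svc.*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example:
#   is_external_pattern "*.external-service.com" && echo "safe to apply"
```

A check like this fits naturally in a CI step that reviews chaos manifests, so an overly broad pattern is caught before it reaches the cluster.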
Practice Questions
- How does kubectl drain differ from kubectl delete pod?
- What is the purpose of DNS chaos in a Kubernetes environment?
- What happens when a container exceeds its memory limit?
- How does the scheduler cron syntax work in Chaos Mesh?
- Why should you exclude daemonsets when draining a node?
Challenge
Create a scheduled Chaos Experiment suite that runs daily at midnight: kill one pod in each deployment with 3+ replicas, inject 2 seconds of DNS latency to the authentication service, and apply 80 percent CPU stress to one node for 5 minutes. Configure alerts that page the on-call engineer if any SLO is breached during the experiment.