Kubernetes Chaos — Pod Failures, DNS Issues & Resource Pressure

DodaTech Updated 2026-06-21 5 min read

In this tutorial, you'll learn about Kubernetes Chaos. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Kubernetes chaos engineering focuses on the failure modes specific to container orchestration: pods crash, nodes drain, DNS records change, and resource limits are exceeded. Running chaos experiments on Kubernetes helps you verify that your cluster survives these common failure scenarios.

What You Will Learn

This tutorial teaches you how to inject pod-level faults, DNS failures, node resource pressure, and container resource exhaustion on Kubernetes clusters using Chaos Mesh and kubectl.

Why It Matters

Kubernetes clusters are complex distributed systems with many moving parts. A single misconfigured readiness probe, a DNS cache issue, or a node under memory pressure can cascade into a cluster-wide outage. Testing these scenarios proactively reveals configuration errors and architectural weaknesses.

Real-World Use

DodaTech runs a daily Kubernetes chaos schedule that kills one random pod in every deployment with at least 3 replicas. The experiment verifies that each deployment meets the minimum replica requirement and that the horizontal pod autoscaler responds correctly.
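
The selection rule described above (only deployments with at least 3 replicas) can be sketched as a small shell filter. This is an illustration, not DodaTech's actual tooling: the function names eligible_deployments and list_deployments and the "name replicas" line format are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of deployments with at least MIN replicas.
# Reads "name replicas" pairs, one per line, on stdin. MIN defaults to 3.
eligible_deployments() {
  local min="${1:-3}"
  awk -v min="$min" 'NF >= 2 && $2 + 0 >= min { print $1 }'
}

# Hypothetical helper: list deployments in a namespace as "name replicas" pairs.
list_deployments() {
  kubectl get deployments -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.replicas}{"\n"}{end}'
}
```

A possible invocation is list_deployments production | eligible_deployments 3, which would print only the deployments that are safe targets under this rule.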

Prerequisites

Before starting you should understand:

Kubernetes operations (pods, deployments, services, nodes)
Chaos Engineering concepts (steady state, hypothesis, blast radius)
Chaos Mesh installation and basic usage
kubectl command-line proficiency

Step 1: Kill a Pod and Verify Recovery

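The most basic Kubernetes chaos experiment deletes a single pod and confirms that its controller replaces it:

# Pick one pod from the deployment (a bare label selector would delete every matching pod)
POD=$(kubectl get pods -l app=payment-service -o jsonpath='{.items[0].metadata.name}')

# Delete that pod
kubectl delete pod "$POD"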

# Watch the deployment recreate the pod
kubectl get pods -l app=payment-service -w
# Expected output:
# payment-service-7d9f8c6b4f-abc1   1/1   Terminating
# payment-service-7d9f8c6b4f-abc1   0/1   Completed
# payment-service-7d9f8c6b4f-abc2   0/1   Pending
# payment-service-7d9f8c6b4f-abc2   0/1   ContainerCreating
# payment-service-7d9f8c6b4f-abc2   1/1   Running
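
Watching the pod list by eye works for a first run, but the recovery check can also be scripted. The sketch below is one possible steady-state check; the function names replicas_ready and deployment_status are hypothetical, and the "ready/desired" string format is an assumption of this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when a "READY/DESIRED" string (e.g. "3/3")
# shows that every desired replica is ready.
replicas_ready() {
  local ready="${1%%/*}" desired="${1##*/}"
  [ -n "$ready" ] && [ -n "$desired" ] \
    && [ "$desired" -gt 0 ] && [ "$ready" -eq "$desired" ]
}

# Hypothetical helper: read ready/desired replica counts for a deployment.
# readyReplicas is omitted from the status when no replica is ready,
# which yields a string such as "/3".
deployment_status() {
  kubectl get deployment "$1" \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}'
}
```

For example, replicas_ready "$(deployment_status payment-service)" returns success once the replacement pod is ready. The built-in kubectl rollout status deployment/payment-service --timeout=60s is an alternative that blocks until the deployment is fully available.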

Step 2: Inject DNS Failures

Use Chaos Mesh DNSChaos to simulate DNS resolution failures:

# dns-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-failure-test
spec:
  action: error
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: web-service
  patterns:
    # Chaos Mesh documents the * wildcard only at the end of a pattern, so match the exact domain here
    - "external-service.com"
  duration: 120s

kubectl apply -f dns-failure.yaml
# Expected output:
# dnschaos.chaos-mesh.org/dns-failure-test created

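# From inside one affected pod, test DNS resolution (kubectl exec needs a pod name, not a label selector)
POD=$(kubectl get pods -n staging -l app=web-service -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n staging "$POD" -- nslookup external-service.com
# Expected: the lookup fails. The exact error text depends on the resolver in the container image, for example:
# ;; connection timed out; no servers could be reached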

Step 3: Simulate Node Resource Pressure

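Draining a node removes its capacity from the cluster and forces the remaining nodes to absorb its workloads. Drain a node and verify that its pods are rescheduled: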

# Mark a node as unschedulable and drain it
kubectl drain worker-node-1 \
  --ignore-daemonsets \
  --delete-emptydir-data

# Expected output:
# node/worker-node-1 drained

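# Verify that no workload pods remain on the node (DaemonSet pods stay because of --ignore-daemonsets)
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=worker-node-1
# Expected output:
# (only DaemonSet-managed pods, such as kube-proxy, are still listed)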

# Uncordon the node after testing
kubectl uncordon worker-node-1
# Expected output:
# node/worker-node-1 uncordoned
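
Because a forgotten uncordon leaves the node unschedulable, it can help to wrap the drain so that the node is always restored. The following is a minimal sketch under that assumption; the function name drain_and_restore and the 120-second drain timeout are illustrative choices, not part of the original procedure.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: drain NODE, run an optional verification command,
# then uncordon the node whether or not the drain or the verification failed.
# Returns the exit status of the first step that failed, or 0.
drain_and_restore() {
  local node="$1"; shift
  [ "$#" -gt 0 ] || set -- true   # no verification command given: use a no-op
  local status=0
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=120s \
    && "$@" \
    || status=$?
  kubectl uncordon "$node"
  return "$status"
}
```

A possible call is drain_and_restore worker-node-1 kubectl rollout status deployment/payment-service --timeout=60s, which drains the node, waits for the deployment to become available elsewhere, and then uncordons the node in every case.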

Step 4: Test Container Resource Limits

Configure a container with strict limits and push beyond them:

# resource-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-stress-test
spec:
  containers:
    - name: stress
      image: polinux/stress
      resources:
        limits:
          memory: "128Mi"
          cpu: "500m"
      command: ["stress"]
      args: ["--vm", "1", "--vm-bytes", "256M", "--timeout", "30"]

kubectl apply -f resource-pod.yaml
# Expected output:
# pod/memory-stress-test created

# Watch what happens when memory limit is exceeded
kubectl describe pod memory-stress-test
# Expected output:
# State:          Waiting
# Reason:         CrashLoopBackOff
# Last State:     Terminated
# Reason:         OOMKilled
# The pod is OOMKilled because 256MB exceeds the 128Mi limit
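
Reading kubectl describe output is fine interactively, but an automated experiment usually needs a pass/fail check. The sketch below queries the termination reason directly; the function name was_oomkilled is hypothetical, and it assumes the stress container is the first container in the pod, as in the manifest above.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed if the first container of POD is currently, or
# was most recently, terminated with reason OOMKilled.
was_oomkilled() {
  local reasons
  # Current state covers the moment right after the kill; lastState covers
  # the period after the kubelet has restarted the container.
  reasons=$(kubectl get pod "$1" -o \
    jsonpath='{.status.containerStatuses[0].state.terminated.reason} {.status.containerStatuses[0].lastState.terminated.reason}')
  case " $reasons " in
    *" OOMKilled "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, was_oomkilled memory-stress-test && echo "limit enforced" confirms that the memory limit was enforced rather than the container exiting for some other reason.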

Step 5: Use PodChaos in Scheduler Mode

Schedule recurring chaos experiments using cron syntax. In Chaos Mesh 2.x, the schedule lives in a separate Schedule resource that wraps the PodChaos spec (Chaos Mesh 1.x used a scheduler.cron field on the experiment itself):

# scheduled-pod-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: daily-pod-kill
spec:
  schedule: "0 0 * * *"   # every day at midnight
  concurrencyPolicy: Forbid
  historyLimit: 2
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service

kubectl apply -f scheduled-pod-chaos.yaml
# Expected output:
# schedule.chaos-mesh.org/daily-pod-kill created

# List scheduled experiments
kubectl get schedule
# Expected: daily-pod-kill appears in the list

Learning Path

flowchart LR
  A[Infrastructure Faults] --> B[Kubernetes Chaos]
  B --> C[Chaos Engineering Pipeline]
  style B fill:#f90,color:#fff

Common Errors

Draining a node that runs DaemonSet pods: kubectl drain refuses to proceed when DaemonSet-managed pods such as kube-proxy or networking plugins are present. Pass --ignore-daemonsets so the drain skips them and leaves them running.
DNS chaos blocking essential DNS records: Blocking kubernetes.default.svc.cluster.local or other in-cluster names breaks service discovery and API server access for the affected pods. Scope DNS chaos to external service patterns only.
Forgetting to uncordon drained nodes: A drained node stays unschedulable indefinitely. Always uncordon after testing.
Sizing the stress test incorrectly: The stress workload must allocate more memory than the container's limit to trigger an OOM kill. Separately, if the pod's resource requests exceed the node's allocatable capacity, the pod stays Pending and is never scheduled.
Running PodChaos on singleton deployments: Killing the only pod of a deployment creates a real outage. Use mode: one only on deployments with 3 or more replicas.
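
The DNS scoping advice in this list can be enforced with a pre-flight check before an experiment is applied. The sketch below is one way to do that; the function name is_safe_dns_pattern and the specific suffixes it rejects are assumptions of this example, not a Chaos Mesh feature.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: reject DNSChaos patterns that could match
# in-cluster names (anything ending in cluster.local or .svc), a bare
# catch-all, or an empty pattern.
is_safe_dns_pattern() {
  case "$1" in
    ""|"*"|*cluster.local|*cluster.local.|*.svc|*.svc.) return 1 ;;
    *) return 0 ;;
  esac
}
```

A wrapper script could call is_safe_dns_pattern on every entry under patterns and refuse to run kubectl apply if any entry fails the check.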

Practice Questions

How does kubectl drain differ from kubectl delete pod?
What is the purpose of DNS chaos in a Kubernetes environment?
What happens when a container exceeds its memory limit?
How does the scheduler cron syntax work in Chaos Mesh?
Why should you exclude daemonsets when draining a node?

Challenge

Create a scheduled chaos experiment suite that runs daily at midnight: kill one pod in each deployment with 3+ replicas, inject 2 seconds of DNS latency to the authentication service, and apply 80 percent CPU stress to one node for 5 minutes. Configure alerts that page the on-call engineer if any SLO is breached during the experiment.

FAQ

What is Kubernetes Chaos Engineering?

Kubernetes Chaos Engineering injects pod, node, DNS, and resource faults into Kubernetes clusters to test resilience and validate configuration.

How do I safely test pod failures?

Always target deployments with 3 or more replicas. Use mode: one in Chaos Mesh to limit the blast radius. Verify that the ReplicaSet controller recreates the pod.

What is node draining in Kubernetes?

Draining marks a node as unschedulable and evicts all pods gracefully, allowing them to be rescheduled on other nodes. It simulates a planned maintenance event.

How does DNS chaos work in Chaos Mesh?

DNSChaos intercepts DNS queries from affected pods and, for the specified domain patterns, can either return errors (action: error) or return random IP addresses (action: random).

Can I run Kubernetes chaos experiments in production?

Yes, with proper safety controls: a small blast radius, automated duration limits, monitoring-backed guardrails, and team notification before execution.
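
A monitoring-backed guardrail can be as simple as deleting the chaos resource when a health metric crosses a threshold, since deleting a Chaos Mesh experiment stops the injection. The sketch below assumes that approach; the function name abort_if_unhealthy is hypothetical, and how the error rate is obtained (for example, from your monitoring system) is left open.

```shell
#!/usr/bin/env bash
# Hypothetical guardrail: delete a chaos resource when the observed error
# rate (integer percent) exceeds the threshold. Returns 1 if it aborted.
abort_if_unhealthy() {
  local kind="$1" name="$2" namespace="$3" error_rate="$4" threshold="$5"
  if [ "$error_rate" -gt "$threshold" ]; then
    echo "error rate ${error_rate}% > ${threshold}%, aborting ${kind}/${name}" >&2
    kubectl delete "$kind" "$name" -n "$namespace"
    return 1
  fi
  return 0
}
```

For example, abort_if_unhealthy dnschaos dns-failure-test default "$RATE" 5 could run in a loop for the duration of the experiment, where RATE is an integer percentage supplied by whatever metrics query your team uses.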
