Chaos Mesh on Kubernetes — Practical Fault Injection Guide
This tutorial walks through Chaos Mesh on Kubernetes: the key concepts, worked examples you can apply to a cluster, and the practices that keep experiments safe.
Chaos Mesh is an open-source chaos engineering platform built for Kubernetes. It exposes chaos experiments as Kubernetes custom resources, so you can define, version, and manage fault injection with the same tools you use for application deployments.
What You Will Learn
This tutorial teaches you how to install Chaos Mesh on any Kubernetes cluster, create pod-kill and network latency experiments, schedule recurring chaos, and monitor active experiments through the dashboard.
Why It Matters
Chaos Mesh turns chaos engineering into native Kubernetes operations. Experiments are ordinary custom resources, so there is no new configuration language to learn beyond standard Kubernetes resource definitions. No sidecar or agent has to be added to application pods: faults are injected by the chaos-daemon DaemonSet that runs on each node, and a dashboard ships with the Helm chart. For teams already running Kubernetes, this keeps the barrier to entry low.
Real-World Use
DodaTech runs Chaos Mesh across four Kubernetes clusters hosting microservices for Doda Browser and Durga Antivirus Pro. Each service team owns weekly chaos experiments defined as YAML files in its Git repository, making every experiment auditable and repeatable.
Prerequisites
Before starting, you should be familiar with:
- Kubernetes cluster administration and kubectl commands
- Chaos engineering fundamentals (hypothesis, steady state, blast radius)
- The Helm package manager for Kubernetes
- Basic YAML and bash scripting
Step 1: Install Chaos Mesh
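Install Chaos Mesh using Helm. The command below turns off dashboard authentication (dashboard.securityMode=false) to keep the walkthrough short. That is acceptable on a disposable test cluster but not in production, where the default (true) should stay in place. The chaosDaemon.runtime and chaosDaemon.socketPath values assume containerd nodes, so adjust them to match your container runtime.
# Add the Chaos Mesh Helm repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
# Install Chaos Mesh in a dedicated namespace
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set dashboard.securityMode=false \
  --version 2.7.0
# Verify all pods are running
kubectl get pods -n chaos-mesh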
Expected output (name suffixes will differ, and there is one chaos-daemon pod per node):
NAME                                       READY   STATUS
chaos-controller-manager-7d9f8c6b4f-abc1   1/1     Running
chaos-daemon-5h6k8                         1/1     Running
chaos-daemon-7j2k9                         1/1     Running
chaos-dashboard-abc123                     1/1     Running
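For repeatable installs, the Helm settings can live in a values file kept in Git next to the experiments. The following is a minimal sketch, not an official recommendation: the filename is hypothetical, the runtime and socket path assume containerd nodes, and securityMode is left enabled, as one would want outside a test cluster.

```yaml
# chaos-mesh-values.yaml (hypothetical filename)
chaosDaemon:
  # Assumption: nodes run containerd; change both values for other runtimes
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
dashboard:
  # Keep token-based login enabled outside disposable test clusters
  securityMode: true
```

The file would be passed to helm install with -f chaos-mesh-values.yaml in place of the individual --set flags.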
Step 2: Create a Pod Kill Experiment
A good first chaos experiment kills a single pod and verifies that the system continues serving traffic.
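# pod-kill-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-demo
spec:
  action: pod-kill
  mode: one            # affect exactly one matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  # pod-kill is a one-shot action, so no duration is set here;
  # duration matters for faults that persist, such as pod-failure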
# Apply the experiment
kubectl apply -f pod-kill-experiment.yaml
# Expected output:
# podchaos.chaos-mesh.org/pod-kill-demo created
# Watch pods being killed and recreated
kubectl get pods -l app=nginx -w
Expected output showing pod lifecycle:
nginx-7d9f8c6b4f-abc1   1/1   Running
nginx-7d9f8c6b4f-abc2   1/1   Running
nginx-7d9f8c6b4f-abc1   0/1   Terminating
nginx-7d9f8c6b4f-abc1   0/1   Completed
nginx-7d9f8c6b4f-abc3   0/1   Pending
nginx-7d9f8c6b4f-abc3   1/1   Running
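pod-kill removes a pod once and lets its controller replace it. To hold a pod unavailable for a fixed window instead, PodChaos also offers the pod-failure action, which is where the duration field matters. The following is a sketch that reuses the nginx selector from above; the file and resource names are illustrative.

```yaml
# pod-failure-experiment.yaml (illustrative filename)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-demo
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  duration: "30s"   # the pod is restored once this window ends
```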
Step 3: Inject Network Latency
Network latency experiments reveal how services behave under degraded network conditions.
# network-latency-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency-demo
spec:
  action: delay
  mode: all              # every pod matching the selector
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-service
  delay:
    latency: "500ms"
    correlation: "50"    # percentage, passed as a string
    jitter: "100ms"
  duration: "60s"
kubectl apply -f network-latency-experiment.yaml
# Expected output:
# networkchaos.chaos-mesh.org/network-latency-demo created
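By default, the delay applies to all traffic of the selected pods. NetworkChaos also accepts a direction and a target selector, which limit the fault to a single traffic path. The sketch below assumes a hypothetical backend labelled app: orders-db in the same namespace, so that only traffic from web-service to those pods is delayed.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency-to-db   # illustrative name
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-service
  direction: to          # delay traffic sent from the selected pods to the target
  target:
    mode: all
    selector:
      namespaces:
        - default
      labelSelectors:
        app: orders-db   # hypothetical label, not part of the tutorial
  delay:
    latency: "500ms"
    jitter: "100ms"
  duration: "60s"
```

Scoping the fault this way keeps the blast radius to one dependency, which makes the result easier to attribute.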
Step 4: Schedule Recurring Experiments
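Chaos Mesh 2.x schedules recurring experiments with a dedicated Schedule resource, which wraps an experiment spec and creates a new experiment object on each run (the inline scheduler field from 1.x is no longer part of the API).
# scheduled-pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: scheduled-pod-kill
spec:
  schedule: "@every 6h"
  type: PodChaos
  historyLimit: 5             # finished runs to keep
  concurrencyPolicy: Forbid   # skip a run if the previous one is still active
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: payment-service
kubectl apply -f scheduled-pod-kill.yaml
# Inspect the schedule and the experiments it has created
kubectl get schedule scheduled-pod-kill
kubectl describe schedule scheduled-pod-kill
kubectl get podchaos
Each run appears as its own PodChaos object, typically named after the schedule with a generated suffix, and the Schedule's status records when it last fired.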
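Chaos Mesh 2.x provides a Schedule kind that can wrap any experiment type, not only PodChaos. As a sketch, the latency experiment from Step 3 could run nightly using a standard five-field cron expression. The resource name, run time, and history settings below are illustrative choices.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: nightly-network-latency   # illustrative name
spec:
  schedule: "0 2 * * *"           # minute 0, hour 2, every day
  type: NetworkChaos
  historyLimit: 3
  concurrencyPolicy: Forbid
  networkChaos:
    action: delay
    mode: all
    selector:
      namespaces:
        - default
      labelSelectors:
        app: web-service
    delay:
      latency: "500ms"
      correlation: "50"
      jitter: "100ms"
    duration: "60s"
```

The experiment spec sits under a key named after the type (networkChaos here), and the duration inside it bounds each individual run.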
Step 5: Monitor Through the Dashboard
Access the Chaos Mesh dashboard to view active experiments, historical runs, and cluster-wide fault statistics.
# Port-forward the dashboard
kubectl port-forward svc/chaos-dashboard 2333:2333 -n chaos-mesh
# Expected output:
# Forwarding from 127.0.0.1:2333 -> 2333
Open http://localhost:2333 in your browser. The dashboard shows active experiments, completed runs, and per-namespace fault history.
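When the dashboard runs with securityMode enabled, it asks for a token at login. One way to provide one is a namespace-scoped service account with read-only access to chaos resources. The manifest below is a sketch built from standard Kubernetes RBAC objects; every name in it is hypothetical.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-viewer          # hypothetical account name
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-viewer-role
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]   # read-only access to experiments
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-viewer-binding
  namespace: default
subjects:
  - kind: ServiceAccount
    name: chaos-viewer
    namespace: default
roleRef:
  kind: Role
  name: chaos-viewer-role
  apiGroup: rbac.authorization.k8s.io
```

On Kubernetes 1.24 or later, a token for this account can be issued with kubectl create token chaos-viewer -n default and pasted into the dashboard login form.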
Learning Path
flowchart LR
    A[Game Days] --> B[Chaos Mesh]
    B --> C[LitmusChaos]
    C --> D[Gremlin]
    D --> E[AWS Fault Injection]
    style B fill:#f90,color:#fff
Common Errors
- Forgetting to set a duration on persistent faults: Without a duration, a fault such as network delay stays in place until the resource is deleted. Set a duration on every experiment that holds a fault open; one-shot actions such as pod-kill do not need one.
- Using mode: all without understanding selector scope: mode: all affects every matching pod. Use mode: one for initial experiments and expand scope gradually.
- Network chaos blocking Kubernetes control plane traffic: Carefully scope network chaos selectors. Avoid targeting namespaces that run critical cluster components, such as kube-system.
- Chaos Daemon pods not running on all nodes: If chaos-daemon is missing from a node, experiments targeting pods on that node cannot inject their fault. Compare the output of kubectl get pods -n chaos-mesh -o wide with kubectl get nodes.
- Applying chaos resources to the wrong namespace: Double-check both the namespace the chaos resource is created in and the namespaces listed in its selector. When staging and production share the same labels, one wrong namespace entry is enough to hit production.
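Between mode: one and mode: all, Chaos Mesh offers fixed and fixed-percent modes, which take a value. The sketch below applies the Step 3 delay to a quarter of the matching pods, with the namespace and duration stated explicitly. The resource name and percentage are illustrative.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency-quarter   # illustrative name
spec:
  action: delay
  mode: fixed-percent
  value: "25"              # percentage of matching pods, passed as a string
  selector:
    namespaces:
      - staging            # explicit namespace limits the blast radius
    labelSelectors:
      app: web-service
  delay:
    latency: "500ms"
    jitter: "100ms"
  duration: "60s"          # fault is removed automatically after one minute
```

Raising the value step by step between runs is one way to expand scope gradually.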
Practice Questions
- Name five fault types available in Chaos Mesh.
- How do you limit a Chaos Mesh experiment to affect only one pod at a time?
- What is the purpose of the chaos-daemon component?
- How do you schedule a Chaos Mesh experiment to run every 12 hours?
- How can you verify that a Chaos Mesh experiment is currently running?
Challenge
Create a Chaos Mesh experiment that injects 300ms of latency into all pods with label tier: frontend in the staging namespace for 90 seconds. Configure the experiment to run automatically every 8 hours. Verify the experiment is running through the CLI and dashboard, then stop it manually using kubectl delete.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.