Chaos Mesh — Kubernetes Chaos Engineering Platform
In this tutorial, you'll learn about Chaos Mesh. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Chaos Mesh is an open-source Chaos Engineering platform designed specifically for Kubernetes. It provides a rich set of fault types — pod kills, network partitions, DNS failures, disk I/O delays, and CPU stress — all managed through Kubernetes custom resources.
What You Will Learn
This tutorial teaches you how to install Chaos Mesh, define chaos experiments as Kubernetes resources, and run safe fault injection experiments on your cluster.
Why It Matters
Chaos Mesh turns Chaos Engineering into a native Kubernetes experience. You define experiments with the same tools and workflows you already use for deployments. This reduces the barrier to entry and makes experiments reproducible and version-controlled.
Real-World Use
DodaTech uses Chaos Mesh to run weekly experiments across 40 microservices. Every experiment is defined as a YAML file stored in the same Git repository as the service manifests, making experiments auditable and repeatable.
Prerequisites
Before starting, you should understand:
- Kubernetes cluster operations and kubectl commands
- Chaos Engineering fundamentals from the previous tutorials
- How to create and apply Kubernetes custom resources
- Basic YAML syntax
Step 1: Install Chaos Mesh
Install Chaos Mesh using Helm or the quick installation script:
# Install Chaos Mesh using Helm
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --version 2.7.0

# Verify installation
kubectl get pods -n chaos-mesh
# Expected output:
# NAME                                       READY   STATUS
# chaos-controller-manager-7d9f8c6b4f-abc1   1/1     Running
# chaos-daemon-5h6k8                         1/1     Running
# chaos-dashboard-abc123                     1/1     Running
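The chaos-daemon pods talk to each node's container runtime, so on clusters that do not run Docker the chart's runtime settings usually need to be overridden. The following is a minimal sketch of a Helm values file, assuming containerd nodes with the socket at a commonly used path. The file name `values-containerd.yaml` is hypothetical, and the socket path is an assumption to verify on your own nodes.

```yaml
# values-containerd.yaml (hypothetical file name)
# Assumes the nodes run containerd; check the socket path on your cluster
# before using this, since distributions place it in different locations.
chaosDaemon:
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
```

If this matches your cluster, the file can be passed to the install command with Helm's `-f values-containerd.yaml` option.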
Step 2: Explore the Fault Types
Chaos Mesh supports multiple fault types organized into categories:
# List available chaos kinds
kubectl api-resources | grep chaos
# Expected output (abridged):
# podchaos       chaos-mesh.org/v1alpha1
# networkchaos   chaos-mesh.org/v1alpha1
# dnschaos       chaos-mesh.org/v1alpha1
# httpchaos      chaos-mesh.org/v1alpha1
# iochaos        chaos-mesh.org/v1alpha1
# stresschaos    chaos-mesh.org/v1alpha1
# kernelchaos    chaos-mesh.org/v1alpha1
# timechaos      chaos-mesh.org/v1alpha1
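Each kind in that list is configured through its own `spec` fields. As an illustration of a kind not covered in the later steps, here is a sketch of a StressChaos resource that applies CPU pressure to one pod. The resource name `cpu-stress-demo` and the load values are assumptions chosen for the example, and the `app: nginx` label is borrowed from the pod kill step.

```yaml
# cpu-stress-demo.yaml (illustrative; names and values are assumptions)
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-demo
spec:
  mode: one                 # stress a single matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  stressors:
    cpu:
      workers: 1            # number of stress workers
      load: 50              # target CPU load per worker, in percent
  duration: 30s             # the stress stops after 30 seconds
```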
Step 3: Create a Pod Kill Experiment
The simplest Chaos Mesh experiment kills a single pod:
# pod-kill-example.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-demo
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  duration: 30s

kubectl apply -f pod-kill-example.yaml
# Expected output:
# podchaos.chaos-mesh.org/pod-kill-demo created
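A pod kill is a one-time action: the pod is deleted and its controller recreates it. When the goal is to keep a pod unavailable for a period of time instead, PodChaos also offers the `pod-failure` action, where `duration` controls how long the fault lasts. A minimal sketch, reusing the same selector (the resource name `pod-failure-demo` is an assumption):

```yaml
# pod-failure-example.yaml (illustrative variant of the pod kill manifest)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-demo
spec:
  action: pod-failure       # make the pod unavailable rather than deleting it
  mode: one                 # affect a single matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  duration: 30s             # the pod recovers after 30 seconds
```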
Step 4: Create a Network Latency Experiment
Simulate network delays to test how services handle slow connections:
# network-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency-demo
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-service
  delay:
    latency: 500ms
    # correlation is a string field in the resource schema, so quote the number
    correlation: "50"
    jitter: 100ms
  duration: 60s

kubectl apply -f network-latency.yaml
# Expected output:
# networkchaos.chaos-mesh.org/network-latency-demo created
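The same resource kind covers network partitions. The sketch below cuts traffic between the web service pods and a second group of pods by adding a `target` selector and a `direction`. The `app: database` label and the resource name are hypothetical and stand in for whatever dependency you want to isolate.

```yaml
# network-partition.yaml (illustrative; the target label is an assumption)
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition-demo
spec:
  action: partition
  mode: all
  selector:                 # source side of the partition
    namespaces:
      - default
    labelSelectors:
      app: web-service
  direction: both           # block traffic in both directions
  target:                   # other side of the partition
    mode: all
    selector:
      namespaces:
        - default
      labelSelectors:
        app: database
  duration: 60s
```

Scoping both sides with label selectors keeps the partition between two known workloads rather than between a workload and everything else.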
Step 5: Monitor and Stop Experiments
Chaos Mesh provides a dashboard and CLI for monitoring active experiments:
# List active experiments
kubectl get podchaos
# Expected output:
# NAME            ACTION     DURATION   STATUS
# pod-kill-demo   pod-kill   30s        Running

# Manually stop an experiment
kubectl delete podchaos pod-kill-demo
# Expected output:
# podchaos.chaos-mesh.org "pod-kill-demo" deleted
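Deleting the resource ends the experiment for good. Chaos Mesh also supports pausing a running experiment through the `experiment.chaos-mesh.org/pause` annotation, which is useful when you want to suspend a continuous fault and resume it later. A sketch of the network latency manifest from Step 4 with that annotation added, assuming the rest of the spec is unchanged:

```yaml
# network-latency.yaml with a pause annotation (sketch)
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency-demo
  annotations:
    # Setting this to "true" pauses the experiment; removing it resumes it
    experiment.chaos-mesh.org/pause: "true"
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-service
  delay:
    latency: 500ms
    correlation: "50"
    jitter: 100ms
  duration: 60s
```

Pausing is mainly meaningful for continuous faults such as delay; a one-time action like pod-kill has nothing left to suspend.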
Learning Path
Game Days -> Chaos Mesh (this tutorial) -> LitmusChaos -> Gremlin Platform -> AWS Fault Injection
Common Errors
- Forgetting to set a duration: Without a duration, a continuous fault such as network delay stays active until the resource is deleted. Always set a duration or use a scheduler.
- Using mode: all without understanding blast radius: mode: all affects every matching pod. Use mode: one for initial experiments.
- Network chaos blocking critical system traffic: Scope network chaos selectors carefully to avoid blocking Kubernetes control plane traffic.
- Not verifying Chaos Mesh pod status after installation: If the chaos-daemon pods are not running, experiments can fail without an obvious error.
- Applying chaos resources to the wrong namespace: Double-check the selector namespace. An experiment meant for staging might target production.
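For the scheduler mentioned in the first item, Chaos Mesh provides a Schedule resource that wraps an experiment in a cron expression. The sketch below runs the pod kill from Step 3 once a week. The resource name, the cron time, and the history limit are assumptions chosen for illustration.

```yaml
# weekly-pod-kill.yaml (illustrative; name and timing are assumptions)
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-kill
spec:
  schedule: "0 10 * * 1"      # every Monday at 10:00
  historyLimit: 5             # keep the last five runs for review
  concurrencyPolicy: Forbid   # do not start a run while one is still active
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one                 # limit the blast radius to a single pod
    selector:
      namespaces:
        - default             # double-check this is the namespace you intend
      labelSelectors:
        app: nginx
```

Keeping `mode: one` and an explicit namespace in the scheduled spec addresses the blast radius and wrong-namespace items in the same manifest.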
Practice Questions
- What are the main fault types supported by Chaos Mesh?
- How do you limit a Chaos Mesh experiment to a single pod?
- What is the purpose of the chaos-daemon component?
- How do you stop a running Chaos Mesh experiment?
- Why should you set a duration on every experiment?
Challenge
Create a Chaos Mesh experiment that injects 300ms of latency into a web service for 90 seconds, targeting only pods with the label tier: frontend in the staging namespace. Verify the experiment is running and then stop it manually.