Chaos Mesh on Kubernetes — Practical Fault Injection Guide

DodaTech Updated 2026-06-23 5 min read

This tutorial covers Chaos Mesh on Kubernetes: the key concepts, worked examples, and practices for running fault injection safely.

Chaos Mesh is an open-source Chaos Engineering platform purpose-built for Kubernetes. It exposes chaos experiments as Kubernetes custom resources, so you can define, version, and manage fault injection with the same tools you use for application deployments.

What You Will Learn

This tutorial teaches you how to install Chaos Mesh on any Kubernetes cluster, create pod-kill and network latency experiments, schedule recurring chaos, and monitor active experiments through the dashboard.

Why It Matters

Chaos Mesh reduces Chaos Engineering to native Kubernetes operations. There are no sidecars or agents to add to application pods (faults are injected by a node-level chaos-daemon), the dashboard ships with the platform, and experiments are ordinary Kubernetes resource definitions rather than a new syntax. That makes it an accessible starting point for teams already running Kubernetes.

Real-World Use

DodaTech runs Chaos Mesh across four Kubernetes clusters hosting microservices for Doda Browser and Durga Antivirus Pro. Each service team owns weekly chaos experiments defined as YAML files in their Git repositories, making every experiment auditable and repeatable.

Prerequisites

Before starting you should understand:

Kubernetes cluster administration and kubectl commands
Chaos Engineering fundamentals (hypothesis, steady state, blast radius)
Helm package manager for Kubernetes
Basic YAML and bash scripting

Step 1: Install Chaos Mesh

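Install Chaos Mesh using Helm. The command below turns off the dashboard's security mode for convenience, which is suitable for a test cluster only; on production clusters, omit that flag so the dashboard keeps its default token-based login.

# Add the Chaos Mesh Helm repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Install Chaos Mesh in a dedicated namespace.
# dashboard.securityMode=false disables dashboard authentication (test clusters only).
# On containerd-based clusters you may also need to set chaosDaemon.runtime
# and chaosDaemon.socketPath to match the node's container runtime.
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set chaosDaemon.mode=daemonset \
  --set dashboard.securityMode=false \
  --version 2.7.0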

# Verify all pods are running
kubectl get pods -n chaos-mesh

Expected output:

NAME                                        READY   STATUS
chaos-controller-manager-7d9f8c6b4f-abc1   1/1     Running
chaos-daemon-5h6k8                         1/1     Running
chaos-daemon-7j2k9                         1/1     Running
chaos-dashboard-abc123                     1/1     Running
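
Pods can take a minute to pull images, so a single `kubectl get pods` may catch transient states. The following is a minimal sketch of a readiness check, assuming the chaos-mesh namespace used above; the helper names `not_running_pods` and `chaos_mesh_ready` are hypothetical and not part of Chaos Mesh.

```shell
#!/usr/bin/env bash

# Print the names of pods whose STATUS column is not "Running".
# Expects rows in the format of `kubectl get pods --no-headers` on stdin
# (NAME READY STATUS ...).
not_running_pods() {
  awk '$3 != "Running" { print $1 }'
}

# Hypothetical wrapper: succeed only when the namespace has pods
# and every one of them is Running.
chaos_mesh_ready() {
  local rows pending
  rows=$(kubectl get pods -n chaos-mesh --no-headers) || return 1
  [ -n "$rows" ] || return 1
  pending=$(printf '%s\n' "$rows" | not_running_pods)
  if [ -n "$pending" ]; then
    echo "not ready: $pending" >&2
    return 1
  fi
}
```

Calling `chaos_mesh_ready` in a retry loop before applying experiments is one option; `kubectl wait --for=condition=Ready pod --all -n chaos-mesh --timeout=120s` covers the same need without a helper.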

Step 2: Create a Pod Kill Experiment

The simplest and safest chaos experiment kills a single pod and verifies that the system continues serving traffic.

# pod-kill-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-demo
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  duration: 30s

# Apply the experiment
kubectl apply -f pod-kill-experiment.yaml

# Expected output:
# podchaos.chaos-mesh.org/pod-kill-demo created

# Watch pods being killed and recreated
kubectl get pods -l app=nginx -w

Representative output showing the pod lifecycle (pod names and intermediate states will differ on your cluster):

nginx-7d9f8c6b4f-abc1   1/1   Running
nginx-7d9f8c6b4f-abc2   1/1   Running
nginx-7d9f8c6b4f-abc1   0/1   Terminating
nginx-7d9f8c6b4f-abc1   0/1   Completed
nginx-7d9f8c6b4f-abc3   0/1   Pending
nginx-7d9f8c6b4f-abc3   1/1   Running
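
Watching pods restart shows that the fault happened, not that the service kept serving traffic. A rough sketch of a steady-state probe follows, assuming the nginx pods sit behind an HTTP endpoint you can reach; the helper names `success_rate` and `probe` are hypothetical, and the URL in the usage note is a placeholder.

```shell
#!/usr/bin/env bash

# Read one HTTP status code per line and print the percentage
# of 2xx responses as a whole number (rounded down).
success_rate() {
  awk '/^2[0-9][0-9]$/ { ok++ }
       NF             { total++ }
       END { if (total) printf "%d\n", ok * 100 / total; else print 0 }'
}

# Hypothetical probe: send COUNT requests to URL, one per second,
# printing the status code of each (curl reports 000 on connection failure).
probe() {
  local url=$1 count=$2 i
  for ((i = 0; i < count; i++)); do
    curl -s -o /dev/null -m 2 -w '%{http_code}\n' "$url"
    sleep 1
  done
}
```

Running something like `probe http://<your-nginx-endpoint>/ 30 | success_rate` while the experiment is active gives a number to compare against your steady-state hypothesis.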

Step 3: Inject Network Latency

Network latency experiments reveal how services behave under degraded network conditions.

# network-latency-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency-demo
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-service
  delay:
    latency: 500ms
    # correlation is a string field in the CRD, so the value must be quoted
    correlation: "50"
    jitter: 100ms
  duration: 60s

kubectl apply -f network-latency-experiment.yaml

# Expected output:
# networkchaos.chaos-mesh.org/network-latency-demo created
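
To confirm the delay is actually in effect, you can time a request against the target service while the experiment is active. This is a sketch only: `at_least_ms` and `request_seconds` are hypothetical helpers, and the threshold you pick is an assumption about your own baseline latency.

```shell
#!/usr/bin/env bash

# Succeed when a duration in seconds (for example "0.612")
# is at least MIN_MS milliseconds.
at_least_ms() {
  local seconds=$1 min_ms=$2
  awk -v s="$seconds" -v m="$min_ms" 'BEGIN { exit !(s * 1000 >= m) }'
}

# Hypothetical timing probe: print the total request time in seconds.
request_seconds() {
  curl -s -o /dev/null -w '%{time_total}' "$1"
}
```

For example, `at_least_ms "$(request_seconds http://<your-web-service>/)" 400 && echo "delay observed"` checks for at least 400 ms, roughly the configured latency minus jitter. When you are done, `kubectl delete networkchaos network-latency-demo` removes the experiment before its duration expires.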

Step 4: Schedule Recurring Experiments

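Schedule chaos experiments to run automatically at defined intervals. In Chaos Mesh 2.x, the version installed in Step 1, recurring runs are defined with a separate Schedule resource that wraps the experiment spec; the inline scheduler field from 1.x is no longer part of PodChaos.

# scheduled-pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: scheduled-pod-kill
spec:
  schedule: "@every 6h"
  type: PodChaos
  # Keep the last five runs and never start a run while one is still active
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: payment-service
    duration: 30s

kubectl apply -f scheduled-pod-kill.yaml

# Inspect the schedule and the experiments it has created
kubectl get schedule scheduled-pod-kill -o yaml
kubectl get podchaos

The status section of the Schedule records when it last fired, and each run appears as its own PodChaos object, typically named after the schedule with a generated suffix.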
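
A recurring experiment sometimes needs to be suspended, for example during an incident. Chaos Mesh documents a pause annotation, `experiment.chaos-mesh.org/pause`, for this. The sketch below wraps it in two helpers; `pause_chaos`, `resume_chaos`, and the `KUBECTL` override variable are hypothetical names introduced here for illustration.

```shell
#!/usr/bin/env bash

# Allow the kubectl binary to be overridden (useful for dry runs).
KUBECTL=${KUBECTL:-kubectl}

# usage: pause_chaos <kind> <name>
# Adds the pause annotation to the given chaos resource.
pause_chaos() {
  $KUBECTL annotate "$1" "$2" experiment.chaos-mesh.org/pause=true --overwrite
}

# usage: resume_chaos <kind> <name>
# Removes the pause annotation (a trailing "-" deletes an annotation).
resume_chaos() {
  $KUBECTL annotate "$1" "$2" experiment.chaos-mesh.org/pause-
}
```

Pass the kind and name of the resource you created, for example `pause_chaos schedule scheduled-pod-kill`. Setting `KUBECTL=echo` first prints the command instead of running it, which is a cheap way to review what would be sent to the cluster.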

Step 5: Monitor Through the Dashboard

Access the Chaos Mesh dashboard to view active experiments, historical runs, and cluster-wide fault statistics.

# Port-forward the dashboard
kubectl port-forward svc/chaos-dashboard 2333:2333 -n chaos-mesh

# Expected output:
# Forwarding from 127.0.0.1:2333 -> 2333

Open http://localhost:2333 in your browser. The dashboard shows active experiments, completed runs, and per-namespace fault history.

Learning Path

flowchart LR
  A[Game Days] --> B[Chaos Mesh]
  B --> C[LitmusChaos]
  C --> D[Gremlin]
  D --> E[AWS Fault Injection]
  style B fill:#f90,color:#fff

Common Errors

Forgetting to set a duration on experiments: Without a duration, a continuous fault such as network delay stays active until the resource is deleted. Always set a duration field on these experiments.
Using mode: all without understanding selector scope: mode: all affects every matching pod. Use mode: one for initial experiments and expand scope gradually.
Network chaos blocking Kubernetes control plane traffic: Carefully scope network chaos selectors. Avoid targeting namespaces that run critical cluster components, such as kube-system.
Chaos Daemon pods not running on all nodes: If chaos-daemon is missing from a node, faults that depend on it cannot be injected into pods on that node, and the failure may only be visible in the experiment's events. Check that a chaos-daemon pod is Running on every node.
Applying chaos resources to the wrong namespace: Double-check the selector's namespaces field. A selector that points at the wrong namespace will hit production pods whenever their labels match.
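
The first mistake in the list can be caught before anything reaches the cluster. As a rough sketch, a pre-commit or CI step could flag experiment manifests that never declare a duration; `has_duration` and `missing_duration` are hypothetical helpers, and the check is a plain text match rather than a real YAML parse.

```shell
#!/usr/bin/env bash

# Succeed when the manifest contains a non-empty "duration:" key.
has_duration() {
  grep -Eq '^[[:space:]]*duration:[[:space:]]*[^[:space:]]' "$1"
}

# Print every manifest that lacks a duration; return non-zero if any do.
missing_duration() {
  local f status=0
  for f in "$@"; do
    if ! has_duration "$f"; then
      echo "$f"
      status=1
    fi
  done
  return "$status"
}
```

Running `missing_duration chaos/*.yaml` in CI would fail the build and list the offending files, assuming your experiments live in a directory named chaos.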

Practice Questions

Name five fault types available in Chaos Mesh (PodChaos and NetworkChaos are two of them).
How do you limit a Chaos Mesh experiment to affect only one pod at a time?
What is the purpose of the chaos-daemon component?
How do you schedule a Chaos Mesh experiment to run every 12 hours?
How can you verify that a Chaos Mesh experiment is currently running?

Challenge

Create a Chaos Mesh experiment that injects 300ms of latency into all pods with label tier: frontend in the staging namespace for 90 seconds. Configure the experiment to run automatically every 8 hours. Verify the experiment is running through the CLI and dashboard, then stop it manually using kubectl delete.

FAQ

What is Chaos Mesh?

Chaos Mesh is an open-source Chaos Engineering platform for Kubernetes that provides fault types as Kubernetes custom resources, enabling native GitOps workflows for chaos experiments.

How does Chaos Mesh differ from LitmusChaos?

Chaos Mesh focuses on fine-grained fault types with deep Kubernetes integration. LitmusChaos emphasizes workflow orchestration and CI/CD integration. The two can be used side by side.

Can Chaos Mesh run experiments without affecting all replicas?

Yes. Use mode: one to target a single pod, or use the value field with mode: fixed to target a specific number of pods.
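
As an illustration of the fixed mode, the sketch below generates a PodChaos manifest that targets an exact number of pods. The function name `fixed_pod_failure`, the resource name, and the nginx selector are assumptions chosen to match the earlier examples; note that value is written as a quoted string.

```shell
#!/usr/bin/env bash

# Print a PodChaos manifest that makes exactly COUNT matching pods
# unavailable for 30 seconds.
fixed_pod_failure() {
  local count=$1
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-fixed-demo
spec:
  action: pod-failure
  mode: fixed
  value: "${count}"
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  duration: 30s
EOF
}
```

`fixed_pod_failure 2 | kubectl apply -f -` would apply the experiment to two of the matching pods and leave the remaining replicas untouched.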

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
