Chaos Mesh Advanced — Workflows, Schedules & Custom Faults
This tutorial covers the advanced features of Chaos Mesh, with key concepts, practical examples, and best practices you can apply to your own clusters.
Chaos Mesh's advanced features go beyond basic fault injection to include workflow orchestration, scheduled experiments, custom fault types, and multi-cluster management. Chaos engineering teams use these capabilities to automate resilience testing at scale.
What You Will Learn
This tutorial teaches you how to use Chaos Mesh workflows for complex experiment orchestration, schedule recurring experiments with cron, create custom fault types, and manage chaos experiments across multiple Kubernetes clusters.
Why It Matters
Basic fault injection is the first step, but production-grade chaos engineering requires automation, scheduling, and orchestration. Chaos Mesh workflows let you chain multiple faults with conditions, schedules enable continuous verification, and custom faults extend the platform to your specific failure modes.
Real-World Use
DodaTech uses Chaos Mesh workflows to run a "deployment qualification" suite that executes five sequential experiments after every production deployment. If any experiment fails, the workflow stops and the CI/CD pipeline automatically rolls back the deployment.
Prerequisites
Before starting, you should understand:
- Basic Chaos Mesh installation and fault types
- Kubernetes custom resource definitions and controllers
- Chaos engineering experiment design principles
- Cron syntax for scheduling
Step 1: Create a Chaos Mesh Workflow
Workflows chain multiple experiments with conditional logic:
# deployment-qualification-workflow.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: deployment-qualification
spec:
  entry: main-entry
  templates:
    - name: main-entry
      templateType: Serial
      children:
        - pod-kill-test
        - network-latency-test
        - dns-failure-test
    - name: pod-kill-test
      templateType: PodChaos
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces: ["staging"]
          labelSelectors:
            app: web-service
        duration: 30s
    - name: network-latency-test
      templateType: NetworkChaos
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces: ["staging"]
          labelSelectors:
            app: web-service
        delay:
          latency: 300ms
        duration: 60s
    - name: dns-failure-test
      templateType: DNSChaos
      dnsChaos:
        action: error
        mode: all
        selector:
          namespaces: ["staging"]
          labelSelectors:
            app: web-service
        patterns:
          - "*.external-api.com"
        duration: 30s
kubectl apply -f deployment-qualification-workflow.yaml
# Expected output:
# workflow.chaos-mesh.org/deployment-qualification created
# Watch workflow progress
kubectl get workflow deployment-qualification -o yaml
# Expected output (abbreviated):
# status:
#   condition: Accomplished
#   startTime: "2026-06-23T10:00:00Z"
#   endTime: "2026-06-23T10:03:00Z"
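Because a Serial template refers to its children only by name, a misspelled child name is an easy mistake to make. The sketch below checks those references before you run kubectl apply. It is a minimal illustration, not part of Chaos Mesh: the function name check_workflow is hypothetical, and the spec is written as a plain Python dict that mirrors the manifest above instead of being loaded from the YAML file.

```python
def check_workflow(spec: dict) -> list:
    """Return a list of problems found in a Workflow spec (empty if none)."""
    names = [t["name"] for t in spec.get("templates", [])]
    defined = set(names)
    problems = []
    # Template names must be unique, otherwise references are ambiguous.
    for name in sorted(defined):
        if names.count(name) > 1:
            problems.append(f"duplicate template name: {name}")
    # The entry point must name one of the defined templates.
    if spec.get("entry") not in defined:
        problems.append(f"entry {spec.get('entry')!r} is not a defined template")
    # Every child listed by a template must itself be defined.
    for template in spec.get("templates", []):
        for child in template.get("children", []):
            if child not in defined:
                problems.append(
                    f"{template['name']} references undefined child {child!r}"
                )
    return problems


# Hand-written mirror of deployment-qualification-workflow.yaml (names only).
spec = {
    "entry": "main-entry",
    "templates": [
        {"name": "main-entry", "templateType": "Serial",
         "children": ["pod-kill-test", "network-latency-test", "dns-failure-test"]},
        {"name": "pod-kill-test", "templateType": "PodChaos"},
        {"name": "network-latency-test", "templateType": "NetworkChaos"},
        {"name": "dns-failure-test", "templateType": "DNSChaos"},
    ],
}

if __name__ == "__main__":
    print(check_workflow(spec))  # prints []
```

An empty list means every reference resolves. The check looks only at names; it does not validate the fault specifications themselves.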
Step 2: Schedule Recurring Experiments
Schedule experiments to run automatically using Cron syntax:
# weekly-chaos-schedule.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-chaos
spec:
  schedule: "0 2 * * 0"   # every Sunday at 02:00
  type: Workflow          # must match the embedded spec key (workflow) below
  historyLimit: 5
  concurrencyPolicy: Forbid
  workflow:
    entry: weekly-main
    templates:
      - name: weekly-main
        templateType: Serial
        children:
          - payment-pod-kill
          - api-network-delay
      - name: payment-pod-kill
        templateType: PodChaos
        podChaos:
          action: pod-kill
          mode: one
          selector:
            namespaces: ["production"]
            labelSelectors:
              app: payment-service
          duration: 30s
      - name: api-network-delay
        templateType: NetworkChaos
        networkChaos:
          action: delay
          mode: all
          selector:
            namespaces: ["production"]
            labelSelectors:
              app: api-gateway
          delay:
            latency: 200ms
          duration: 60s
kubectl apply -f weekly-chaos-schedule.yaml
# Expected output:
# schedule.chaos-mesh.org/weekly-pod-chaos created
# List scheduled experiments
kubectl get schedule
# Expected output:
# NAME               SCHEDULE    TYPE       STATUS
# weekly-pod-chaos   0 2 * * 0   Workflow   Running
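To make the schedule string concrete, here is a minimal sketch in plain Python that computes when a five-field cron expression next fires. It is an illustration only: the helper names field_matches and next_run are hypothetical, the parser handles just "*" and single integers (no ranges, lists, or steps), and Chaos Mesh evaluates schedules with its own cron library, not with this code.

```python
from datetime import datetime, timedelta


def field_matches(field: str, value: int) -> bool:
    """Match one cron field; this sketch supports only '*' and a single integer."""
    return field == "*" or int(field) == value


def next_run(expr: str, after: datetime) -> datetime:
    """Return the first minute strictly after `after` that matches `expr`."""
    minute, hour, dom, month, dow = expr.split()
    # Start at the next whole minute after the reference time.
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    # Scan minute by minute; one year is more than enough for a weekly schedule.
    for _ in range(366 * 24 * 60):
        cron_dow = candidate.isoweekday() % 7  # cron counts Sunday as 0
        if (field_matches(minute, candidate.minute)
                and field_matches(hour, candidate.hour)
                and field_matches(dom, candidate.day)
                and field_matches(month, candidate.month)
                and field_matches(dow, cron_dow)):
            return candidate
        candidate += timedelta(minutes=1)
    raise ValueError(f"no matching time within a year for {expr!r}")


if __name__ == "__main__":
    start = datetime(2026, 6, 23, 10, 0)  # a Tuesday
    print(next_run("0 2 * * 0", start))   # prints 2026-06-28 02:00:00
```

Reading the fields left to right, "0 2 * * 0" means minute 0, hour 2, any day of the month, any month, day-of-week 0 (Sunday), so the schedule fires once a week.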
Step 3: Use the Chaos Dashboard
Install and use the Chaos Mesh Dashboard for visual experiment management:
# Access the Chaos Dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
# Expected output:
# Forwarding from 127.0.0.1:2333 -> 2333
# Then open http://localhost:2333 in your browser
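#!/usr/bin/env python3
"""Programmatic access to the Chaos Mesh Dashboard API."""
import requests

# Assumes the port-forward from the previous command is still running.
DASHBOARD_URL = "http://localhost:2333"


def list_experiments():
    """Print the name, status, and kind of every experiment the dashboard reports."""
    response = requests.get(f"{DASHBOARD_URL}/api/experiments", timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors such as 401
    for exp in response.json():
        print(f"Name: {exp['name']}, Status: {exp['status']}, Kind: {exp['kind']}")


def create_experiment(experiment_yaml):
    """Submit an experiment definition and return the decoded JSON response.

    The request body shape can differ between Chaos Mesh versions; check the
    dashboard API of your release before relying on this payload.
    """
    response = requests.post(
        f"{DASHBOARD_URL}/api/experiments",
        json={"yaml_content": experiment_yaml},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # List all experiments
    list_experiments()
# Expected output:
# Name: pod-kill-demo, Status: Running, Kind: PodChaos
# Name: network-latency-demo, Status: Completed, Kind: NetworkChaos
# Name: dns-failure-test, Status: Failed, Kind: DNSChaos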
Step 4: Create Custom Fault Types with KernelChaos
Use KernelChaos to inject custom faults using Linux BPF programs:
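# kernelchaos-example.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: KernelChaos
metadata:
  name: syscall-failure
spec:
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: file-service
  failKernRequest:
    # callchain is a list of kernel function frames, not a single string
    callchain:
      - funcname: "__x64_sys_openat"
    # failtype is an integer: 0 = slab allocation, 1 = page allocation, 2 = bio
    failtype: 0
kubectl apply -f kernelchaos-example.yaml
# Expected output:
# kernelchaos.chaos-mesh.org/syscall-failure created
# From inside an affected pod (kubectl exec takes a pod name, not a label selector)
kubectl exec -n staging "$(kubectl get pod -n staging -l app=file-service -o name | head -n 1)" -- cat /data/config.json
# Expected result:
# cat fails because slab allocations made under the openat syscall are forced
# to fail. The error is typically reported as "Cannot allocate memory";
# KernelChaos cannot inject an arbitrary errno such as ENOENT.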
Learning Path
flowchart LR
    A[Chaos Mesh Basics] --> B[Chaos Mesh Advanced]
    B --> C[Chaos Engineering Pipeline]
    C --> D[Kubernetes Chaos Testing]
    D --> E[Chaos Observability]
    style B fill:#f90,color:#fff
Common Errors
- Workflow templates with circular dependencies: Ensure your workflow DAG has no cycles. Chaos Mesh validates this at creation time but complex workflows can still have logical loops.
- Cron schedules with overlapping executions: Use concurrencyPolicy: Forbid to prevent overlapping scheduled experiments from running simultaneously.
- KernelChaos experiments on nodes without BPF support: KernelChaos requires Linux kernel 4.18+ with CONFIG_BPF_KPROBE_OVERRIDE enabled. Verify kernel capabilities before using this fault type.
- Dashboard access without proper RBAC: The Chaos Dashboard requires its own ServiceAccount and RBAC bindings. Check the dashboard pod logs for permission errors.
- Forgetting to set historyLimit on schedules: Without a historyLimit, Chaos Mesh retains every completed experiment record indefinitely, consuming etcd storage.
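The first item in the list above, circular dependencies, can be caught before a manifest reaches the cluster. The following sketch runs a depth-first search over the children references of a template list and reports one cycle if any exists. It is a hedged illustration: find_cycle is a hypothetical helper, the templates are plain Python dicts that stand in for a parsed manifest, and it says nothing about how Chaos Mesh performs its own validation.

```python
from typing import Optional


def find_cycle(templates: list) -> Optional[list]:
    """Return one cycle as a list of template names, or None if there is none."""
    children = {t["name"]: t.get("children", []) for t in templates}
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {name: WHITE for name in children}
    path = []

    def visit(name):
        color[name] = GREY
        path.append(name)
        for child in children.get(name, []):
            if child not in color:
                continue  # undefined child: a separate problem, ignored here
            if color[child] == GREY:
                # The child is already on the current path, so we found a loop.
                return path[path.index(child):] + [child]
            if color[child] == WHITE:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        color[name] = BLACK
        return None

    for name in children:
        if color[name] == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


looping = [
    {"name": "a", "templateType": "Serial", "children": ["b"]},
    {"name": "b", "templateType": "Serial", "children": ["a"]},
]

if __name__ == "__main__":
    print(find_cycle(looping))  # prints ['a', 'b', 'a']
```

A result of None means the references form a tree or DAG. Logical loops that arise from conditions at run time are outside what a static check like this can see.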
Practice Questions
- How do Chaos Mesh workflows chain multiple experiments together?
- What is the purpose of the concurrencyPolicy field in a Schedule?
- How do you access the Chaos Mesh Dashboard and what can you do there?
- What kernel features are required for KernelChaos experiments?
- How does the historyLimit field protect etcd storage?
Challenge
Create a Chaos Mesh schedule that runs every Monday at 3 AM and executes a workflow with four steps: kill one pod in the auth service, add 200ms latency to the database connection, inject a DNS error for the external payment API, and apply 50 percent CPU stress to a worker node. The workflow must stop if any step fails. Set a history limit of 10 records.