Chaos Mesh Advanced — Workflows, Schedules & Custom Faults

DodaTech Updated 2026-06-23 6 min read

This tutorial covers the advanced features of Chaos Mesh, with key concepts, practical examples, and best practices for each.

Chaos Mesh's advanced features go beyond basic fault injection to include workflow orchestration, scheduled experiments, custom fault types, and multi-cluster management. Chaos engineering teams use these capabilities to automate resilience testing at scale.

What You Will Learn

This tutorial teaches you how to use Chaos Mesh workflows for complex experiment orchestration, schedule recurring experiments with cron expressions, create custom fault types, and manage chaos experiments across multiple Kubernetes clusters.

Why It Matters

Basic fault injection is the first step, but production-grade chaos engineering requires automation, scheduling, and orchestration. Chaos Mesh workflows let you chain multiple faults with conditions. Schedules enable continuous verification. Custom faults extend the platform to your specific failure modes.

Real-World Use

DodaTech uses Chaos Mesh workflows to run a "deployment qualification" suite that executes five sequential experiments after every production deployment. If any experiment fails, the workflow stops and the CI/CD pipeline automatically rolls back the deployment.

Prerequisites

Before starting, you should understand:

Basic Chaos Mesh installation and fault types
Kubernetes custom resource definitions and controllers
Chaos Engineering experiment design principles
Cron syntax for scheduling

Step 1: Create a Chaos Mesh Workflow

Workflows chain multiple experiments. The example below uses a Serial template to run three faults one after another against the staging namespace:

# deployment-qualification-workflow.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: deployment-qualification
spec:
  entry: main-entry
  templates:
    # Serial runs the children one after another, in the order listed
    - name: main-entry
      templateType: Serial
      children:
        - pod-kill-test
        - network-latency-test
        - dns-failure-test
    # In a workflow, each step's run time is set with deadline on the template
    - name: pod-kill-test
      templateType: PodChaos
      deadline: 30s
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces: ["staging"]
          labelSelectors:
            app: web-service
    - name: network-latency-test
      templateType: NetworkChaos
      deadline: 60s
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces: ["staging"]
          labelSelectors:
            app: web-service
        delay:
          latency: 300ms
    - name: dns-failure-test
      templateType: DNSChaos
      deadline: 30s
      dnsChaos:
        action: error
        mode: all
        selector:
          namespaces: ["staging"]
          labelSelectors:
            app: web-service
        patterns:
          - "*.external-api.com"

kubectl apply -f deployment-qualification-workflow.yaml
# Expected output:
# workflow.chaos-mesh.org/deployment-qualification created

# Watch workflow progress
kubectl get workflow deployment-qualification -o yaml
# Expected output (abbreviated):
# status:
#   conditions:
#     - type: Accomplished
#       status: "True"
#   startTime: "2026-06-23T10:00:00Z"
#   endTime: "2026-06-23T10:03:00Z"
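
A CI job that gates on the workflow usually needs a yes/no answer rather than raw YAML. The sketch below is one way to get it, assuming the Workflow status carries a `conditions` list whose entries have `type` and `status` fields. That shape is an assumption to confirm against your Chaos Mesh version. The helper names `workflow_accomplished` and `fetch_workflow` are hypothetical.

```python
import json
import subprocess


def workflow_accomplished(workflow: dict) -> bool:
    """Return True if the Workflow reports an Accomplished condition set to "True".

    Assumes status.conditions is a list of {"type": ..., "status": ...} entries.
    """
    conditions = workflow.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Accomplished" and c.get("status") == "True"
        for c in conditions
    )


def fetch_workflow(name: str, namespace: str = "default") -> dict:
    """Fetch a Workflow object as a dict by shelling out to kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "workflow", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Usage (needs cluster access):
#   wf = fetch_workflow("deployment-qualification")
#   print("done" if workflow_accomplished(wf) else "still running")
```

Note that an accomplished workflow only means every step ran to completion. Whether the service stayed healthy during the faults still has to come from your own monitoring checks.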

Step 2: Schedule Recurring Experiments

Schedule experiments to run automatically using cron syntax. The expression "0 2 * * 0" below means 02:00 every Sunday:

# weekly-chaos-schedule.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-chaos
spec:
  schedule: "0 2 * * 0"
  # type names the embedded field below; a schedule that wraps a workflow uses Workflow
  type: Workflow
  historyLimit: 5
  concurrencyPolicy: Forbid
  workflow:
    entry: weekly-main
    templates:
      - name: weekly-main
        templateType: Serial
        children:
          - payment-pod-kill
          - api-network-delay
      # In a workflow, each step's run time is set with deadline on the template
      - name: payment-pod-kill
        templateType: PodChaos
        deadline: 30s
        podChaos:
          action: pod-kill
          mode: one
          selector:
            namespaces: ["production"]
            labelSelectors:
              app: payment-service
      - name: api-network-delay
        templateType: NetworkChaos
        deadline: 60s
        networkChaos:
          action: delay
          mode: all
          selector:
            namespaces: ["production"]
            labelSelectors:
              app: api-gateway
          delay:
            latency: 200ms

kubectl apply -f weekly-chaos-schedule.yaml
# Expected output:
# schedule.chaos-mesh.org/weekly-pod-chaos created

# List scheduled experiments
kubectl get schedule
# Expected output (columns may vary by Chaos Mesh version):
# NAME                 SCHEDULE       TYPE      STATUS
# weekly-pod-chaos     0 2 * * 0     Workflow  Running
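
Before applying a schedule, it can help to sanity-check when a cron expression fires. The sketch below is a deliberately small matcher for five-field expressions, not a full cron implementation. It handles only `*`, single numbers, and comma-separated lists. It also requires the day-of-month and day-of-week fields to match together, whereas standard cron treats them as alternatives when both are restricted. The function name `cron_matches` is hypothetical.

```python
from datetime import datetime


def _field_matches(field: str, value: int) -> bool:
    """Match one cron field that is '*', a number, or a comma-separated list of numbers."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` falls on a minute selected by the five-field expression."""
    minute, hour, day_of_month, month, day_of_week = expr.split()
    # cron counts Sunday as 0; isoweekday() counts Monday=1 .. Sunday=7
    cron_weekday = when.isoweekday() % 7
    return (
        _field_matches(minute, when.minute)
        and _field_matches(hour, when.hour)
        and _field_matches(day_of_month, when.day)
        and _field_matches(month, when.month)
        and _field_matches(day_of_week, cron_weekday)
    )


# 2026-06-28 is a Sunday, so the weekly schedule above matches at 02:00 that day
print(cron_matches("0 2 * * 0", datetime(2026, 6, 28, 2, 0)))  # True
print(cron_matches("0 2 * * 0", datetime(2026, 6, 23, 2, 0)))  # False (a Tuesday)
```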

Step 3: Use the Chaos Dashboard

Use the Chaos Dashboard for visual experiment management. The commands below assume Chaos Mesh and its dashboard are already installed in the chaos-mesh namespace:

# Access the Chaos Dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
# Expected output:
# Forwarding from 127.0.0.1:2333 -> 2333

# Then open http://localhost:2333 in your browser

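#!/usr/bin/env python3
"""Programmatic access to the Chaos Mesh Dashboard API."""
import requests

DASHBOARD_URL = "http://localhost:2333"


def list_experiments():
    """Print the name, status, and kind of every experiment the dashboard returns."""
    response = requests.get(f"{DASHBOARD_URL}/api/experiments", timeout=10)
    response.raise_for_status()  # fail on HTTP errors instead of parsing an error body
    for exp in response.json():
        print(f"Name: {exp['name']}, Status: {exp['status']}, Kind: {exp['kind']}")


def create_experiment(manifest):
    """Submit an experiment manifest given as a dict (apiVersion, kind, metadata, spec).

    The request body shape can differ between Chaos Mesh releases. Sending the
    manifest itself as JSON is an assumption to verify against your version.
    """
    response = requests.post(
        f"{DASHBOARD_URL}/api/experiments", json=manifest, timeout=10
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # List all experiments
    list_experiments()
    # Example output (status values depend on your Chaos Mesh version):
    # Name: pod-kill-demo, Status: Running, Kind: PodChaos
    # Name: network-latency-demo, Status: Completed, Kind: NetworkChaos
    # Name: dns-failure-test, Status: Failed, Kind: DNSChaos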
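
Once the experiment list is available as parsed JSON, a small helper can group it for reporting or alerting. The sketch below assumes each entry has the `name` and `status` keys used in the listing above. The helper name `names_by_status` is hypothetical, and the sample data simply mirrors the example output.

```python
from collections import defaultdict
from typing import Dict, List


def names_by_status(experiments: List[dict]) -> Dict[str, List[str]]:
    """Group experiment names by their reported status, preserving input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for exp in experiments:
        grouped[exp["status"]].append(exp["name"])
    return dict(grouped)


# Sample data shaped like the listing above
sample = [
    {"name": "pod-kill-demo", "status": "Running", "kind": "PodChaos"},
    {"name": "network-latency-demo", "status": "Completed", "kind": "NetworkChaos"},
    {"name": "dns-failure-test", "status": "Failed", "kind": "DNSChaos"},
]
print(names_by_status(sample).get("Failed", []))  # ['dns-failure-test']
```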

Step 4: Create Custom Fault Types with KernelChaos

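Use KernelChaos to inject kernel-level faults. Chaos Mesh uses BPF to watch for the kernel call chain you name and makes the selected memory allocation or block I/O request fail when it is reached through that chain:

# kernelchaos-example.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: KernelChaos
metadata:
  name: syscall-failure
spec:
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: file-service
  failKernRequest:
    # Kernel call chain to match; this openat entry-point name applies to x86-64 kernels
    callchain:
      - funcname: "__x64_sys_openat"
    # 0 = slab allocation failure, 1 = page allocation failure, 2 = bio failure
    failtype: 0

kubectl apply -f kernelchaos-example.yaml
# Expected output:
# kernelchaos.chaos-mesh.org/syscall-failure created

# From inside an affected pod (kubectl exec needs a pod name, not a label selector)
POD=$(kubectl get pod -n staging -l app=file-service -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n staging "$POD" -- cat /data/config.json
# While the fault is active, opening the file may fail with an allocation error
# such as "Cannot allocate memory" instead of printing the file contents.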
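
Because KernelChaos depends on kernel features, it can be worth comparing each node's kernel release against the minimum your Chaos Mesh version documents before applying the experiment. The sketch below only compares version numbers. It does not check kernel build options, and the helper names are hypothetical. The release strings are the kind shown in the KERNEL-VERSION column of `kubectl get nodes -o wide`.

```python
import re
from typing import Tuple


def parse_kernel_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release: str, minimum: Tuple[int, int]) -> bool:
    """Return True if the release is at or above the (major, minor) minimum."""
    return parse_kernel_release(release) >= minimum


# The (4, 18) threshold here is only an example value; use the one from your docs
print(kernel_at_least("5.15.0-91-generic", (4, 18)))  # True
print(kernel_at_least("4.14.0-1-amd64", (4, 18)))     # False
```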

Learning Path

flowchart LR
  A[Chaos Mesh Basics] --> B[Chaos Mesh Advanced]
  B --> C[Chaos Engineering Pipeline]
  C --> D[Kubernetes Chaos Testing]
  D --> E[Chaos Observability]
  style B fill:#f90,color:#fff

Common Errors

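Workflow templates with circular dependencies: Ensure the children references in your workflow form a graph with no cycles. Chaos Mesh validates workflows at creation time, but in complex workflows it is still worth checking for a template that ends up referencing itself through its children.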
Cron schedules with overlapping executions: Use concurrencyPolicy: Forbid to prevent overlapping scheduled experiments from running simultaneously.
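KernelChaos experiments on nodes without BPF support: KernelChaos relies on BPF-based kernel fault injection, which needs a sufficiently recent Linux kernel (the Chaos Mesh documentation lists 4.18 or later) built with CONFIG_BPF_KPROBE_OVERRIDE. Verify the kernel version and configuration of every target node before using this fault type.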
Dashboard access without proper RBAC: The Chaos Dashboard requires its own ServiceAccount and RBAC bindings. Check the dashboard pod logs for permission errors.
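Setting historyLimit carelessly on schedules: historyLimit controls how many finished experiment records a Schedule keeps. A very large value lets records accumulate in etcd, while a very small one discards the history you need for debugging. Pick a bounded value that matches how far back you investigate.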
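
The first item above can be checked mechanically before a workflow is applied. The sketch below runs a depth-first search over the `children` references in a list of templates and returns the first cycle it finds. It assumes templates are dicts shaped like the YAML in Step 1. The function name `find_cycle` is hypothetical, and this is an illustration rather than a description of Chaos Mesh's own validation.

```python
from typing import Dict, List, Optional


def find_cycle(templates: List[dict]) -> Optional[List[str]]:
    """Return a list of template names forming a cycle, or None if there is none."""
    children: Dict[str, List[str]] = {
        t["name"]: list(t.get("children", [])) for t in templates
    }
    visiting = set()  # templates on the current DFS path
    done = set()      # templates fully explored with no cycle found
    path: List[str] = []

    def visit(name: str) -> Optional[List[str]]:
        if name in done:
            return None
        if name in visiting:
            # Reached a template already on the path: slice out the loop
            return path[path.index(name):] + [name]
        visiting.add(name)
        path.append(name)
        for child in children.get(name, []):
            cycle = visit(child)
            if cycle is not None:
                return cycle
        path.pop()
        visiting.discard(name)
        done.add(name)
        return None

    for name in children:
        cycle = visit(name)
        if cycle is not None:
            return cycle
    return None


looping = [
    {"name": "a", "templateType": "Serial", "children": ["b"]},
    {"name": "b", "templateType": "Serial", "children": ["a"]},
]
print(find_cycle(looping))  # ['a', 'b', 'a']
```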

Practice Questions

How do Chaos Mesh workflows chain multiple experiments together?
What is the purpose of the concurrencyPolicy field in a Schedule?
How do you access the Chaos Mesh Dashboard and what can you do there?
What kernel features are required for KernelChaos experiments?
How does the historyLimit field protect etcd storage?

Challenge

Create a Chaos Mesh schedule that runs every Monday at 3 AM and executes a workflow with four steps: kill one pod in the auth service, add 200ms latency to the database connection, inject a DNS error for the external payment API, and apply 50 percent CPU stress to the worker pods. The workflow must stop if any step fails. Set a history limit of 10 records.

FAQ

What is a Chaos Mesh workflow?

A Chaos Mesh workflow is a DAG (directed acyclic graph) of chaos experiments that can run serially, in parallel, or conditionally, enabling complex experiment orchestration.

How do scheduled experiments work in Chaos Mesh?

Scheduled experiments use a cron expression to run chaos experiments automatically at specified intervals. The Schedule custom resource wraps a chaos kind or a workflow with scheduling logic.

What is KernelChaos?

KernelChaos is a Chaos Mesh fault type that injects kernel-level faults using Linux BPF. It makes memory allocation (slab or page) or block I/O requests fail when they are reached through a kernel call chain you specify, such as a particular system call.

Can I run Chaos Mesh experiments across multiple clusters?

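Yes. Recent Chaos Mesh releases provide a RemoteCluster resource that lets the Chaos Mesh installation in one cluster manage experiments in other Kubernetes clusters. Check the documentation for your version before relying on it, because multi-cluster support was added later than workflows and schedules.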

How do I monitor Chaos Mesh workflow progress?

Use kubectl get workflow <name> -o yaml to see the workflow status, or use the Chaos Dashboard for visual progress tracking.
