Chaos Engineering Pipeline — Automating Experiments in CI/CD

DodaTech Updated 2026-06-21 5 min read

This tutorial shows how to automate chaos experiments inside a CI/CD pipeline. It covers the key concepts, practical examples, and best practices you need to apply the approach to your own deployments.

A Chaos Engineering Pipeline integrates automated Chaos Engineering experiments into your CI/CD workflow. Instead of running experiments manually, you define them as code, trigger them on deployments, and gate releases based on resilience test results. This is the final step in making Chaos Engineering a continuous practice.

What You Will Learn

This tutorial teaches you how to build a Chaos Engineering Pipeline: defining experiments as code, running them automatically after deployments, using GitOps for experiment management, and measuring resilience as a release quality gate.

Why It Matters

Manual chaos experiments are valuable, but they do not scale. When you deploy multiple times per day, you cannot manually test every release. An automated chaos pipeline runs a baseline resilience check against every deployment before it reaches production users.

Real-World Use

DodaTech runs a continuous resilience pipeline that executes experiments at each stage of the deployment: staging (full experiment suite), canary (critical experiments only), and production (safety-net experiments). A failed experiment at any stage blocks the release from progressing.
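
The staged progression described above can be sketched as a small driver. This is an illustration only, not DodaTech's actual tooling: the experiment names, the `STAGE_SUITES` mapping, and the `run_experiment` callback are hypothetical stand-ins for whatever launches an experiment and reports whether it passed.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical mapping of pipeline stage to the experiments it runs.
# Stage names follow the text; experiment names are placeholders.
STAGE_SUITES: Dict[str, List[str]] = {
    "staging": ["pod-kill", "network-latency", "cpu-stress"],  # full suite
    "canary": ["pod-kill", "network-latency"],                 # critical only
    "production": ["pod-kill"],                                # safety net
}


def promote(run_experiment: Callable[[str, str], bool]) -> Optional[str]:
    """Run each stage in order and stop at the first failed experiment.

    run_experiment(stage, name) must return True when the experiment passed.
    Returns the name of the stage that blocked the release, or None when
    every stage passed.
    """
    for stage, experiments in STAGE_SUITES.items():
        for name in experiments:
            if not run_experiment(stage, name):
                return stage  # a failure blocks every later stage
    return None
```

Because the function returns as soon as one experiment fails, later stages never run for a release that has already been blocked.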

Prerequisites

Before starting, you should understand:

Chaos Engineering fundamentals and experiment design
CI/CD pipeline concepts (stages, gates, artifacts)
Kubernetes operations and Chaos Mesh or LitmusChaos
GitOps principles (Git as source of truth)

Step 1: Define Experiments as Code

Store all experiment definitions in your Git repository:

# experiments/production/pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pipeline-pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: web-service
  duration: 30s

# Add experiments to version control
git add experiments/
git commit -m "Add production pod kill experiment"
# Expected output:
# [main 1a2b3c4] Add production pod kill experiment
# 1 file changed, 14 insertions(+)
# create mode 100644 experiments/production/pod-kill.yaml

Step 2: Create the Resilience Pipeline

Build a CI/CD pipeline that runs chaos experiments after deployment:

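# .github/workflows/resilience-pipeline.yml
name: Resilience Pipeline
on:
  # deployment_status has no activity types to filter on,
  # so the success check lives in the job-level `if` below.
  deployment_status:
jobs:
  chaos-tests:
    if: github.event.deployment_status.state == 'success'
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      # Assumes cluster credentials (kubeconfig) are already configured for the runner.
      - uses: actions/checkout@v4

      - name: Deploy Application
        run: kubectl apply -f k8s/deployment.yaml

      - name: Wait for Rollout
        run: kubectl rollout status deployment/web-service --timeout=5m

      - name: Run Pod Kill Experiment
        run: |
          kubectl apply -f experiments/production/pod-kill.yaml
          sleep 45

      - name: Verify SLOs
        run: |
          # error_rate is assumed to be a metric that reports a percentage.
          ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query?query=error_rate' | jq -r '.data.result[0].value[1]')
          # A missing metric must fail the check rather than pass silently.
          if [ -z "$ERROR_RATE" ] || [ "$ERROR_RATE" = "null" ]; then
            echo "No error_rate data returned"
            exit 1
          fi
          if (( $(echo "$ERROR_RATE > 1.0" | bc -l) )); then
            echo "SLO violation: error rate $ERROR_RATE%"
            exit 1
          fi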

Expected pipeline output:

Deploy Application ......... ✅
Wait for Rollout .......... ✅
Run Pod Kill Experiment ... ✅
Verify SLOs ............... ✅
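
The `Verify SLOs` step reduces to parsing one number out of a Prometheus instant-query response and comparing it with a threshold. The sketch below shows that logic in isolation, assuming the standard response shape of the Prometheus HTTP API and assuming that `error_rate` reports a percentage; the function names are illustrative.

```python
import json


def error_rate_from_response(body: str) -> float:
    """Extract the first sample value from a Prometheus instant-query response.

    Raises ValueError when the query failed or returned no series, so that a
    missing metric fails the check instead of passing silently.
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    result = payload["data"]["result"]
    if not result:
        raise ValueError("query returned no series")
    # Each instant-vector sample is [timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def slo_met(error_rate_percent: float, threshold_percent: float = 1.0) -> bool:
    """Mirror the shell check: the SLO is violated only above the threshold."""
    return error_rate_percent <= threshold_percent
```

Treating an empty result as an error is a deliberate choice: a resilience check that passes because no data arrived gives false confidence.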

Step 3: Gate Deployments on Resilience

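Add quality gates that block releases if experiments fail. The ConfigMap only stores the thresholds; the pipeline is responsible for comparing measured results against them: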

# resilience-gate.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: deployment-gates
data:
  gates: |
    - name: pod-failure-resilience
      experiment: pod-kill
      threshold:
        error_rate: 1.0
        p99_latency: 1000
    - name: network-resilience
      experiment: network-latency
      threshold:
        error_rate: 0.5
        p99_latency: 2000

# Apply the gate configuration
kubectl apply -f resilience-gate.yaml

# The pipeline compares measured results with these thresholds before
# promoting to the next stage. The echo stands in for that check.
echo "Gate check: pod-failure-resilience PASSED"
# Expected output:
# Gate check: pod-failure-resilience PASSED
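
A gate check of this kind could look roughly like the following. It is a sketch under assumptions: the thresholds are copied from the ConfigMap above into a Python list rather than loaded from the cluster, and `evaluate_gates` and the shape of `measured` are hypothetical names chosen for the example.

```python
from typing import Dict, List

# Thresholds copied from the ConfigMap above; a real pipeline would load them
# from the cluster or from the YAML file rather than hard-coding them.
GATES = [
    {"name": "pod-failure-resilience", "experiment": "pod-kill",
     "threshold": {"error_rate": 1.0, "p99_latency": 1000}},
    {"name": "network-resilience", "experiment": "network-latency",
     "threshold": {"error_rate": 0.5, "p99_latency": 2000}},
]


def evaluate_gates(measured: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the names of gates that fail.

    measured maps experiment name -> metric name -> observed value.
    A metric that was not measured counts as a failure.
    """
    failed = []
    for gate in GATES:
        observed = measured.get(gate["experiment"], {})
        for metric, limit in gate["threshold"].items():
            if metric not in observed or observed[metric] > limit:
                failed.append(gate["name"])
                break  # one breached metric is enough to fail the gate
    return failed
```

A pipeline step would exit with a non-zero status whenever the returned list is not empty, which is what actually blocks promotion to the next stage.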

Step 4: Use GitOps for Experiment Management

Store experiments with application manifests and sync them automatically:

# ArgoCD syncs experiments from Git to cluster
argocd app sync chaos-experiments
# Expected result (illustrative summary; the CLI prints more detail):
# NAME                STATUS   HEALTH   SYNC
# chaos-experiments   Synced   Healthy  ✅

# Update experiment by pushing to Git
git push origin main
# Expected output:
# remote: Resolving deltas: 100% (6/6)
# With an automated sync policy enabled, ArgoCD then applies the updated experiment

Step 5: Track Resilience Metrics

Measure and trend the resilience score over time:

# Query resilience pass rate for the last 30 days
# (resilience_experiment_result is assumed to be a custom metric exported by the pipeline; date -d requires GNU date)
curl -s "http://prometheus:9090/api/v1/query_range?query=resilience_experiment_result&start=$(date -d '30 days ago' +%s)&end=$(date +%s)&step=1d"
# Expected output (abridged):
# {
#   "data": {
#     "result": [
#       {"values": [[1718400000, "1"], [1718486400, "1"], [1718572800, "0"]]}
#     ]
#   }
# }
# A value of 1 means the experiment passed. 0 means it failed.
# Current trend: 96.7% pass rate over 30 days
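
Turning the `values` array from that response into a pass rate takes only a few lines. The helper below is a minimal sketch that assumes each sample is a timestamp paired with the string "1" or "0", as in the expected output; `pass_rate` is an illustrative name.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, str]  # (unix timestamp, value as a string)


def pass_rate(values: Sequence[Sample]) -> float:
    """Percentage of samples whose value is 1 (experiment passed)."""
    if not values:
        raise ValueError("no samples in range")
    passed = sum(1 for _, value in values if float(value) == 1.0)
    return 100.0 * passed / len(values)
```

With 29 passing samples out of 30, this yields the 96.7% figure once rounded to one decimal place.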

Learning Path

flowchart LR
  A[Kubernetes Chaos] --> B[Chaos Engineering Pipeline]
  style B fill:#f90,color:#fff

Common Errors

Running experiments before the deployment is stable: If the pod is still starting when you kill it, the deployment may be marked as failed. Add a stabilization wait period after the rollout completes.
Using the same experiment suite for all environments: Staging should run a full suite while production runs only safety-net experiments. Different environments need different risk profiles.
Not setting pipeline timeouts: A chaos experiment that hangs will also hang your pipeline. Always set explicit timeouts on pipeline steps.
Ignoring experiment result trends: A single experiment failure is less important than a declining trend. Monitor the rolling pass rate over time.
Failing to alert on resilience degradation: The pipeline should not only block releases but also alert the team when resilience metrics decline.
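
The first and third errors above share a remedy: wait for stability, but never wait forever. One possible shape for such a helper is sketched here. The `is_healthy` callback is hypothetical (for example, a function that probes a health endpoint), and the default values are arbitrary choices for illustration.

```python
import time
from typing import Callable


def wait_until_stable(
    is_healthy: Callable[[], bool],
    required_successes: int = 3,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_healthy until it succeeds several times in a row.

    Returns True once required_successes consecutive checks pass, or False
    when timeout_s elapses first, so the caller never waits forever.
    """
    deadline = clock() + timeout_s
    streak = 0
    while clock() < deadline:
        streak = streak + 1 if is_healthy() else 0
        if streak >= required_successes:
            return True
        sleep(interval_s)
    return False
```

Requiring consecutive successes, rather than a single one, avoids starting an experiment during a brief healthy moment in an otherwise unstable rollout. The `clock` and `sleep` parameters exist only so the function can be tested without real delays.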

Practice Questions

What is the purpose of a resilience pipeline in CI/CD?
How do you define experiments as code in a Git repository?
What is a deployment quality gate and how does it use chaos results?
How does GitOps help manage chaos experiments?
How do you measure resilience trends over time?

Challenge

Build a complete Chaos Engineering Pipeline for a microservices application. Define three experiment files (pod-kill, network-latency, cpu-stress), create a GitHub Actions workflow that runs them sequentially after deployment, add quality gates that block the release if any experiment fails, and set up a Grafana dashboard showing the 30-day resilience trend.

FAQ

What is a Chaos Engineering Pipeline?

A Chaos Engineering Pipeline automates the execution of chaos experiments as part of your CI/CD workflow, gating deployments on resilience test results.

How do I store chaos experiments?

Store experiment definitions as YAML files in your Git repository, alongside your application manifests. Version control ensures experiments are auditable and reproducible.

What experiments should run in production?

Run low-risk experiments in production: single pod kills (on deployments with many replicas), low-percentage latency injection, and read-only dependency failures.

How do I prevent experiments from blocking emergency releases?

Implement an escalation path: a human can override the gate for emergency releases, but the override is logged and triggers a follow-up review.
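
As a rough sketch of that escalation path, the decision can be modeled as a function that lets a failed gate through only when an override is supplied, and records every override. The `Override` type, the field names, and the in-memory `audit_log` list are assumptions made for the example; a real system would write to durable storage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Override:
    approver: str   # who authorized skipping the gate
    reason: str     # why the release could not wait


def decide_release(gate_passed: bool, override: Optional[Override],
                   audit_log: List[dict]) -> bool:
    """Return True when the release may proceed.

    A failed gate blocks the release unless an override is supplied; every
    override is appended to audit_log and flagged for follow-up review.
    """
    if gate_passed:
        return True
    if override is None:
        return False
    audit_log.append({
        "approver": override.approver,
        "reason": override.reason,
        "needs_review": True,
    })
    return True
```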

What metrics should I track for the resilience pipeline?

Track experiment pass rate, time to recover from each experiment, and the number of experiments run per deployment. Trend these metrics over weeks and months.
