Chaos Engineering Pipeline — Automating Experiments in CI/CD
A Chaos Engineering Pipeline integrates automated Chaos Engineering experiments into your CI/CD workflow. Instead of running experiments manually, you define them as code, trigger them on deployments, and gate releases based on resilience test results. This is the final step in making Chaos Engineering a continuous practice.
What You Will Learn
This tutorial teaches you how to build a Chaos Engineering Pipeline: defining experiments as code, running them automatically after deployments, using GitOps for experiment management, and measuring resilience as a release quality gate.
Why It Matters
Manual chaos experiments are valuable, but they do not scale. When you deploy multiple times per day, you cannot manually test every release. An automated chaos pipeline ensures that every deployment passes a baseline resilience check before reaching production users.
Real-World Use
DodaTech runs a continuous resilience pipeline that executes experiments at each stage of the deployment: staging (full experiment suite), canary (critical experiments only), and production (safety-net experiments). A failed experiment at any stage blocks the release from progressing.
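The staged progression described above can be sketched as a small loop that stops at the first failed experiment. This is only an illustration: the `STAGE_SUITES` mapping, the assignment of experiments to stages, and the `run_experiment` callback are assumptions made for the example, not DodaTech's actual configuration.

```python
from typing import Callable, List, Tuple

# Hypothetical stage-to-suite mapping; which experiment belongs to which
# stage is an assumption made for this sketch.
STAGE_SUITES: List[Tuple[str, List[str]]] = [
    ("staging", ["pod-kill", "network-latency", "cpu-stress"]),  # full suite
    ("canary", ["pod-kill", "network-latency"]),                 # critical only
    ("production", ["pod-kill"]),                                # safety net
]


def promote(run_experiment: Callable[[str, str], bool]) -> Tuple[bool, List[str]]:
    """Run each stage's suite in order and stop at the first failure.

    run_experiment(stage, experiment) must return True when the experiment
    passed. Returns (released, log), where log has one line per experiment run.
    """
    log: List[str] = []
    for stage, suite in STAGE_SUITES:
        for experiment in suite:
            passed = run_experiment(stage, experiment)
            log.append(f"{stage}/{experiment}: {'PASS' if passed else 'FAIL'}")
            if not passed:
                return False, log  # a failure blocks every later stage
    return True, log
```

In a real pipeline, `run_experiment` would apply the experiment and check the SLOs; passing it in as a callback keeps the promotion logic testable without a cluster.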
Prerequisites
Before starting you should understand:
- Chaos Engineering fundamentals and experiment design
- CI/CD pipeline concepts (stages, gates, artifacts)
- Kubernetes operations and Chaos Mesh or LitmusChaos
- GitOps principles (Git as source of truth)
Step 1: Define Experiments as Code
Store all experiment definitions in your Git repository:
# experiments/production/pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pipeline-pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: web-service
  duration: 30s
# Add experiments to version control
git add experiments/
git commit -m "Add production pod kill experiment"
# Expected output (the commit hash will differ):
# [main 1a2b3c4] Add production pod kill experiment
# 1 file changed, 14 insertions(+)
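Because experiments live in Git, they can be checked before merge like any other code. The sketch below is one possible pre-merge lint, assuming the manifest has already been parsed from YAML into a dictionary by a YAML library. The function name `lint_experiment` and the specific rules are hypothetical house rules, not part of Chaos Mesh.

```python
from typing import Any, Dict, List

REQUIRED_TOP_LEVEL = ("apiVersion", "kind", "metadata", "spec")


def lint_experiment(manifest: Dict[str, Any], allowed_namespaces: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the manifest passed."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in manifest]
    selector = manifest.get("spec", {}).get("selector", {})
    namespaces = selector.get("namespaces", [])
    if not namespaces:
        problems.append("selector.namespaces is empty")
    for namespace in namespaces:
        # Hypothetical rule: an experiment may only target approved namespaces.
        if namespace not in allowed_namespaces:
            problems.append(f"namespace not allowed: {namespace}")
    if not selector.get("labelSelectors"):
        problems.append("selector.labelSelectors is empty")
    return problems
```

A CI job could run this over every file under experiments/ and fail the build when any list is non-empty.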
Step 2: Create the Resilience Pipeline
Build a CI/CD pipeline that runs chaos experiments after deployment:
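# .github/workflows/resilience-pipeline.yml
name: Resilience Pipeline
on:
  deployment_status:
jobs:
  chaos-tests:
    # deployment_status has no activity types, so filter on the reported state instead
    if: github.event.deployment_status.state == 'success'
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - name: Check Out Repository
        uses: actions/checkout@v4
      - name: Deploy Application
        run: kubectl apply -f k8s/deployment.yaml
      - name: Wait for Rollout
        run: kubectl rollout status deployment/web-service --timeout=5m
      - name: Run Pod Kill Experiment
        run: |
          kubectl apply -f experiments/production/pod-kill.yaml
          sleep 45
      - name: Verify SLOs
        run: |
          # Quote the URL and the jq filter so the shell does not expand ? or [ ]
          ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query?query=error_rate' | jq -r '.data.result[0].value[1]')
          # Fail closed when Prometheus returns no sample
          if [ -z "$ERROR_RATE" ] || [ "$ERROR_RATE" = "null" ]; then
            echo "No error_rate data returned"
            exit 1
          fi
          if (( $(echo "$ERROR_RATE > 1.0" | bc -l) )); then
            echo "SLO violation: error rate $ERROR_RATE%"
            exit 1
          fi
Expected pipeline output:
Check Out Repository ...... ✅
Deploy Application ......... ✅
Wait for Rollout .......... ✅
Run Pod Kill Experiment ... ✅
Verify SLOs ............... ✅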
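The SLO check is easier to test when the parsing and the comparison are separated from the HTTP call. The following is a minimal sketch, assuming the standard Prometheus instant-query response shape and the 1.0 error-rate limit used in the workflow. The function names are hypothetical, and treating missing data as a violation is a design choice made here.

```python
import math
from typing import Any, Dict, Optional


def instant_value(response: Dict[str, Any]) -> Optional[float]:
    """Return the first sample of an instant-query response, or None if empty."""
    results = response.get("data", {}).get("result", [])
    if not results:
        return None
    # Each instant-vector sample is [timestamp, "value-as-string"].
    return float(results[0]["value"][1])


def slo_violated(response: Dict[str, Any], max_error_rate: float = 1.0) -> bool:
    """True when the error rate exceeds the limit or cannot be determined."""
    value = instant_value(response)
    if value is None or math.isnan(value):
        return True  # fail closed: no data must not look like a healthy system
    return value > max_error_rate
```

Failing closed matters because an experiment that takes down the metrics path would otherwise pass the gate silently.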
Step 3: Gate Deployments on Resilience
Add quality gates that block releases if experiments fail:
# resilience-gate.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: deployment-gates
data:
  gates: |
    - name: pod-failure-resilience
      experiment: pod-kill
      threshold:
        error_rate: 1.0
        p99_latency: 1000
    - name: network-resilience
      experiment: network-latency
      threshold:
        error_rate: 0.5
        p99_latency: 2000
# Apply the gate configuration
kubectl apply -f resilience-gate.yaml
# The pipeline checks gates before promoting to next stage
echo "Gate check: pod-failure-resilience PASSED"
# Expected output:
# Gate check: pod-failure-resilience PASSED
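The gate check itself can be sketched as a comparison of measured metrics against each gate's thresholds. This example assumes the thresholds are upper bounds and that measurements arrive as a dictionary keyed by experiment name. `GATES` mirrors the ConfigMap above, while `check_gates` and the measurement format are hypothetical.

```python
from typing import Any, Dict, List

# Same gates as the ConfigMap, written as Python data for the sketch.
GATES: List[Dict[str, Any]] = [
    {"name": "pod-failure-resilience", "experiment": "pod-kill",
     "threshold": {"error_rate": 1.0, "p99_latency": 1000}},
    {"name": "network-resilience", "experiment": "network-latency",
     "threshold": {"error_rate": 0.5, "p99_latency": 2000}},
]


def check_gates(gates: List[Dict[str, Any]],
                measurements: Dict[str, Dict[str, float]]) -> List[str]:
    """Return one 'Gate check: <name> PASSED' or 'FAILED' line per gate."""
    lines = []
    for gate in gates:
        observed = measurements.get(gate["experiment"], {})
        # A gate passes only if every thresholded metric was measured
        # and is at or below its limit.
        passed = all(
            metric in observed and observed[metric] <= limit
            for metric, limit in gate["threshold"].items()
        )
        lines.append(f"Gate check: {gate['name']} {'PASSED' if passed else 'FAILED'}")
    return lines
```

A missing measurement counts as a failure, so a gate cannot pass just because its experiment never ran.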
Step 4: Use GitOps for Experiment Management
Store experiments with application manifests and sync them automatically:
# ArgoCD syncs experiments from Git to cluster
argocd app sync chaos-experiments
# Expected output:
# NAME STATUS HEALTH SYNC
# chaos-experiments Synced Healthy ✅
# Update experiment by pushing to Git
git push origin main
# Expected output:
# remote: Resolving deltas: 100% (6/6)
# ArgoCD automatically syncs the updated experiment
Step 5: Track Resilience Metrics
Measure the resilience score and track its trend over time:
# Query resilience pass rate for the last 30 days
curl -s "http://prometheus:9090/api/v1/query_range?query=resilience_experiment_result&start=$(date -d '30 days ago' +%s)&end=$(date +%s)&step=1d"
# Expected output:
# {
# "data": {
# "result": [
# {"values": [["1718400000", "1"], ["1718486400", "1"], ["1718572800", "0"]]}
# ]
# }
# }
# A value of 1 means the experiment passed. 0 means it failed.
# Current trend: 96.7% pass rate over 30 days
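The pass rate can be derived from the range-query response by counting the samples equal to 1. This is a rough sketch, assuming the response shape shown above with one sample per day. The function name `pass_rate` is hypothetical.

```python
from typing import Any, Dict


def pass_rate(response: Dict[str, Any]) -> float:
    """Percentage of samples equal to 1 across all series in the response."""
    samples = [
        float(value)
        for series in response["data"]["result"]
        for _timestamp, value in series["values"]
    ]
    if not samples:
        raise ValueError("no samples in response")
    passed = sum(1 for value in samples if value == 1.0)
    return 100.0 * passed / len(samples)
```

With 29 passing days out of 30, the function returns roughly 96.7, which matches the trend figure quoted above.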
Learning Path
flowchart LR
    A[Kubernetes Chaos] --> B[Chaos Engineering Pipeline]
    style B fill:#f90,color:#fff
Common Errors
- Running experiments before the deployment is stable: If the pod is still starting when you kill it, the deployment may be marked as failed. Add a stabilization wait period after the rollout completes.
- Using the same experiment suite for all environments: Staging should run the full suite, while production runs only safety-net experiments. Different environments have different risk profiles.
- Not setting pipeline timeouts: A chaos experiment that hangs will also hang your pipeline. Always set explicit timeouts on pipeline steps.
- Ignoring experiment result trends: A single experiment failure is less important than a declining trend. Monitor the rolling pass rate over time.
- Failing to alert on resilience degradation: The pipeline should not only block releases but also alert the team when resilience metrics decline.
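The first and third errors above both come down to waiting with a bound. One way to sketch this is a polling helper that gives up after a deadline, so a stabilization check can never hang the pipeline. The name `wait_until` and its parameters are hypothetical. The clock and sleep functions are injectable only so the helper can be tested without real delays.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout_s: float, interval_s: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll condition until it is True or timeout_s has elapsed.

    Returns True if the condition was met, False if the deadline passed.
    """
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A pipeline step could call this with a condition that checks whether all replicas are ready, and start the experiment only when it returns True.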
Practice Questions
- What is the purpose of a resilience pipeline in CI/CD?
- How do you define experiments as code in a Git repository?
- What is a deployment quality gate and how does it use chaos results?
- How does GitOps help manage chaos experiments?
- How do you measure resilience trends over time?
Challenge
Build a complete Chaos Engineering Pipeline for a microservices application. Define three experiment files (pod-kill, network-latency, cpu-stress), create a GitHub Actions workflow that runs them sequentially after deployment, add quality gates that block the release if any experiment fails, and set up a Grafana dashboard showing the 30-day resilience trend.