Designing Chaos Experiments — From Idea to Execution
Designing a chaos engineering experiment means moving from a vague concern about system reliability to a specific, testable, and safe fault injection. This tutorial covers the complete experiment design lifecycle, from the initial idea to post-experiment analysis.
What You Will Learn
This tutorial walks through the full experiment design process: identifying weaknesses, formulating hypotheses, selecting fault types, executing with safety controls, and analyzing results.
Why It Matters
A well-designed experiment yields clear answers about system resilience. A poorly designed experiment wastes time, creates risk, and produces inconclusive results. Following a structured process helps ensure that every experiment generates actionable learning.
Real-World Use
DodaTech runs a weekly experiment review where engineers present their experiment designs for peer feedback before execution. This practice has caught design flaws in 30 percent of proposed experiments before they ran.
Prerequisites
Before starting, you should understand:
- Chaos engineering principles and steady-state hypotheses
- How blast radius controls work
- Basic Kubernetes operations (kubectl, pods, deployments)
Step 1: Identify a System Weakness
Start with a specific concern about your system. Common sources of experiment ideas:
- Recent incidents or near-misses
- Architecture review findings
- Known single points of failure
- Services with insufficient redundancy
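# Identify deployments with a low replica count (potential weakness).
# The default table shows READY as "2/2", which awk cannot compare as a number,
# so print the desired replica count explicitly and drop the header row.
kubectl get deployments --all-namespaces --no-headers \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,REPLICAS:.spec.replicas' \
  | awk '$3 < 3 {print $1, $2, $3}'
# Expected output:
# production payment-service 2
# production notification-svc 1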
Services with fewer than three replicas are good candidates for experiments.
Step 2: Write the Experiment Card
Document the experiment using a standardized template:
# experiment-card.yaml
experiment:
  title: "Payment Service Single Pod Failure"
  concern: "Payment service has only 2 replicas"
  hypothesis: "With 1 of 2 pods killed, p99 latency stays under 1s and error rate stays under 2%"
  fault: pod-kill
  target:
    service: payment-service
    namespace: staging
    replicas: 2
  metrics:
    - p99_latency < 1000ms
    - error_rate < 2%
    - throughput > 80% of baseline
  blast_radius: 1 pod, 50% of traffic
  duration: 60s
  guardrails:
    error_rate_breach: abort
    latency_breach: warn
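The thresholds in the card can be turned into a mechanical pass/fail check. The following is a minimal sketch, not part of any chaos tool: the function name `evaluate_hypothesis` is hypothetical, and the error-rate and throughput readings in the examples are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: checks observed metrics against the card's thresholds.
# Arguments: p99 latency (ms), error rate (%), throughput (% of baseline).
evaluate_hypothesis() {
  awk -v p99="$1" -v err="$2" -v tput="$3" 'BEGIN {
    # All three conditions from the card must hold.
    ok = (p99 < 1000) && (err < 2) && (tput > 80)
    print (ok ? "hypothesis holds" : "hypothesis rejected")
  }'
}

# Illustrative readings only.
evaluate_hypothesis 412 0.4 93    # prints: hypothesis holds
evaluate_hypothesis 1250 0.4 93   # prints: hypothesis rejected
```

Keeping the check this explicit makes it easy for a reviewer to confirm that the code and the card agree on every threshold.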
Step 3: Implement the Fault Injection
Use the appropriate tool to inject the fault defined in the experiment card.
# Execute the experiment using Chaos Mesh
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: experiment-payment-kill-001
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: ["staging"]
    labelSelectors:
      app: payment-service
  duration: 60s
EOF
Expected output:
podchaos.chaos-mesh.org/experiment-payment-kill-001 created
Step 4: Observe in Real Time
Watch the pods and the card's metrics during the experiment to confirm that the service recovers and that the guardrails are working.
# Real-time monitoring during experiment
kubectl get pods -n staging -l app=payment-service -w
# Expected output showing pod restart:
# payment-service-7d9f8c6b4f-abc1 1/1 Running
# payment-service-7d9f8c6b4f-abc2 1/1 Running
# payment-service-7d9f8c6b4f-abc1 0/1 Terminating
# payment-service-7d9f8c6b4f-abc1 0/1 Completed
# payment-service-7d9f8c6b4f-abc3 0/1 Pending
# payment-service-7d9f8c6b4f-abc3 1/1 Running
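The guardrail policy from the experiment card (abort on an error-rate breach, warn on a latency breach) can be expressed as a small decision function. This is a sketch under assumptions: `guardrail_action` is a hypothetical name, the readings are invented, and how the current values are collected is left to your monitoring setup.

```shell
#!/bin/sh
# Hypothetical guardrail check mirroring the experiment card:
#   error_rate_breach -> abort, latency_breach -> warn.
# Arguments: current error rate (%), current p99 latency (ms).
guardrail_action() {
  awk -v err="$1" -v p99="$2" 'BEGIN {
    if (err >= 2) { print "abort" }
    else if (p99 >= 1000) { print "warn" }
    else { print "ok" }
  }'
}

# Examples with made-up readings:
guardrail_action 0.5 400    # prints: ok
guardrail_action 0.5 1200   # prints: warn
guardrail_action 3.1 400    # prints: abort

# On "abort", the experiment object created in Step 3 would be removed, e.g.:
#   kubectl delete podchaos experiment-payment-kill-001
```

The error-rate check comes first so that an abort always takes precedence over a warning when both thresholds are breached.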
Step 5: Analyze and Share Results
After the experiment completes, analyze the data and share the findings with the team.
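# Compare pre- and post-experiment metrics.
# The URL and jq filter are quoted so the shell does not treat ? and [ ] as glob patterns.
# Pre-experiment (run this before Step 3 to capture the baseline):
PRE=$(curl -s 'http://prometheus:9090/api/v1/query?query=p99_latency' | jq -r '.data.result[0].value[1]')
# Post-experiment (run this after the experiment has finished):
POST=$(curl -s 'http://prometheus:9090/api/v1/query?query=p99_latency' | jq -r '.data.result[0].value[1]')
echo "Pre: p99=${PRE}ms"
echo "Post: p99=${POST}ms"
# Expected output:
# Pre: p99=342ms
# Post: p99=412ms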
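To make the comparison easier to share, the two readings can be reduced to a relative change. This is a minimal sketch; `percent_change` is a hypothetical helper name, and the inputs are the example values from the expected output above.

```shell
#!/bin/sh
# Hypothetical helper: relative change from a baseline reading, in percent.
# Arguments: pre-experiment value, post-experiment value.
percent_change() {
  awk -v pre="$1" -v post="$2" 'BEGIN {
    # A zero baseline has no meaningful relative change.
    if (pre == 0) { print "n/a"; exit }
    printf "%.1f\n", (post - pre) / pre * 100
  }'
}

percent_change 342 412   # prints: 20.5
```

With these example values, p99 latency rose by about 20.5 percent but stayed well under the 1000 ms threshold in the card, so the latency part of the hypothesis would hold.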
Learning Path
flowchart LR
    A[Steady State Hypothesis] --> B[Designing Experiments]
    B --> C[Game Days]
    C --> D[Chaos Mesh Platform]
    D --> E[Automated Pipeline]
    style B fill:#f90,color:#fff
Common Errors
- Designing experiments without a specific concern: "Let's see what happens" is not a plan. Start with a known weakness.
- Choosing faults that are too complex: Begin with simple faults like pod kills. Graduate to network partitions and other complex failures once the simple cases pass.
- Running experiments without peer review: Always have another engineer review the experiment card before execution. They will spot things you missed.
- Skipping the rollback plan: Know exactly how to stop the experiment and restore normal operation. Automate this when possible.
- Not scheduling follow-up experiments: If the hypothesis was rejected, verify the fix with a repeat experiment.
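For the rollback plan mentioned in the list above, one possible automation is a small wrapper that removes the chaos object and waits for the deployment to recover. This is a sketch under assumptions: `rollback_experiment` and the `KUBECTL` override are hypothetical conveniences, and the deployment is assumed to be named `payment-service`, which the tutorial does not state explicitly.

```shell
#!/bin/sh
# Hypothetical rollback helper. KUBECTL can be overridden (e.g. KUBECTL=echo)
# to print the commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

# Arguments: chaos object name, target namespace, deployment name.
rollback_experiment() {
  chaos_name="$1"; namespace="$2"; deployment="$3"
  # Remove the chaos object so no further faults are injected.
  $KUBECTL delete podchaos "$chaos_name" --ignore-not-found
  # Wait until the deployment reports all replicas ready again.
  $KUBECTL rollout status "deployment/$deployment" -n "$namespace" --timeout=120s
}

# Usage (assumes the deployment is named payment-service):
#   rollback_experiment experiment-payment-kill-001 staging payment-service
```

Running it once with `KUBECTL=echo` before the experiment is a cheap way to let a peer reviewer see exactly which commands the rollback would issue.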
Practice Questions
- What information should be included in an experiment card?
- Why should you start with simple fault types like pod kills?
- How do you decide which service to target for an experiment?
- What is the purpose of peer review in experiment design?
- How do you determine if an experiment produced useful results?
Challenge
Design a complete experiment for a database connection pool exhaustion scenario. Write the experiment card, implement the fault injection using a proxy-based approach, set guardrails, and define success criteria. Execute the experiment in a staging environment and document the results.