Designing Chaos Experiments — From Idea to Execution

DodaTech Updated 2026-06-21

Designing a chaos engineering experiment means moving from a vague concern about system reliability to a specific, testable, and safe fault injection. This tutorial covers the complete experiment design lifecycle, from idea to post-experiment analysis.

What You Will Learn

This tutorial walks through the full experiment design process: identifying weaknesses, formulating hypotheses, selecting fault types, executing with safety controls, and analyzing results.

Why It Matters

A well-designed experiment yields clear answers about system resilience. A poorly designed experiment wastes time, creates risk, and produces inconclusive results. Following a structured process helps every experiment generate actionable learning.

Real-World Use

DodaTech runs a weekly experiment review where engineers present their experiment designs for peer feedback before execution. This practice has caught design flaws in 30 percent of proposed experiments before they ran.

Prerequisites

Before starting, you should understand:

Chaos engineering principles and steady-state hypotheses
How blast radius controls work
Basic Kubernetes operations (kubectl, pods, deployments)

Step 1: Identify a System Weakness

Start with a specific concern about your system. Common sources of experiment ideas:

Recent incidents or near-misses
Architecture review findings
Known single points of failure
Services with insufficient redundancy

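# Identify deployments with fewer than 3 desired replicas (potential weakness).
# custom-columns prints the numeric replica count; the default READY column
# is a string such as "2/2" and does not compare correctly in awk.
kubectl get deployments --all-namespaces --no-headers -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,REPLICAS:.spec.replicas | awk '$3 < 3 {print $1, $2, $3}'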
# Expected output:
# production    payment-service   2
# production    notification-svc  1

Services with fewer than three replicas are good candidates for experiments.
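The threshold check can also be tried offline before pointing it at a cluster. The following is a minimal sketch, assuming input rows of the form "namespace name replicas"; the function name `low_replica_candidates` and the `checkout-api` row are hypothetical and not part of the original tutorial.

```shell
# Sketch: flag deployments whose desired replica count is below a threshold.
# Reads "namespace name replicas" rows on stdin; $1 is the threshold.
low_replica_candidates() {
  # Adding 0 forces a numeric comparison, so 10 is not treated as a string.
  awk -v min="$1" '($3 + 0) < (min + 0) { print $1, $2, $3 }'
}

# Hypothetical sample rows standing in for real cluster output.
printf '%s\n' \
  'production payment-service 2' \
  'production notification-svc 1' \
  'production checkout-api 10' |
  low_replica_candidates 3
```

With the sample rows above, only the two deployments with fewer than three replicas are printed. Passing the threshold as an argument makes it easy to tighten or relax the definition of "low redundancy" per environment.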

Step 2: Write the Experiment Card

Document the experiment using a standardized template:

# experiment-card.yaml
experiment:
  title: "Payment Service Single Pod Failure"
  concern: "Payment service has only 2 replicas"
  hypothesis: "Killing 1 of 2 pods keeps p99 latency under 1s and error rate under 2%"
  fault: pod-kill
  target:
    service: payment-service
    namespace: staging
    replicas: 2
  metrics:
    - p99_latency < 1000ms
    - error_rate < 2%
    - throughput > 80% of baseline
  blast_radius: 1 pod, 50% of traffic
  duration: 60s
  guardrails:
    error_rate_breach: abort
    latency_breach: warn
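A card is only useful for review if it is complete. As a rough sketch, the check below confirms that a card file names every field used in the template above. The helper `check_card` is an illustrative assumption, not part of any chaos tooling, and it only looks for key names, not valid values.

```shell
# Sketch: verify an experiment card defines every field the template uses.
# Usage: check_card experiment-card.yaml && echo "card complete"
check_card() {
  card="$1"
  missing=0
  for field in title concern hypothesis fault target metrics blast_radius duration guardrails; do
    # Match the key at any indentation, followed by a colon.
    if ! grep -Eq "^[[:space:]]*${field}:" "$card"; then
      echo "missing field: ${field}"
      missing=1
    fi
  done
  # Exit status 0 when all fields are present, 1 otherwise.
  return "$missing"
}
```

Running this before peer review catches omissions such as a card with no guardrails, leaving reviewers free to focus on whether the hypothesis and blast radius make sense.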

Step 3: Implement the Fault Injection

Use the appropriate tool to inject the fault defined in the experiment card.

# Execute the experiment using Chaos Mesh
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: experiment-payment-kill-001
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: ["staging"]
    labelSelectors:
      app: payment-service
  duration: 60s
EOF

Expected output:

podchaos.chaos-mesh.org/experiment-payment-kill-001 created

Step 4: Observe in Real Time

Monitor the metrics during the experiment to confirm the guardrails are working.

# Real-time monitoring during experiment
kubectl get pods -n staging -l app=payment-service -w
# Expected output showing the killed pod being replaced:
# payment-service-7d9f8c6b4f-abc1   1/1   Running
# payment-service-7d9f8c6b4f-abc2   1/1   Running
# payment-service-7d9f8c6b4f-abc1   0/1   Terminating
# payment-service-7d9f8c6b4f-abc1   0/1   Completed
# payment-service-7d9f8c6b4f-abc3   0/1   Pending
# payment-service-7d9f8c6b4f-abc3   1/1   Running
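The guardrails in the experiment card can be expressed as a small decision function. This is a minimal sketch, assuming the current error rate (in percent) and p99 latency (in milliseconds) have already been read from your monitoring system. The function name `guardrail_action` is hypothetical; the thresholds are the 2% and 1000ms limits from the card.

```shell
# Sketch: map current readings to a guardrail action.
# $1 = error rate in percent, $2 = p99 latency in milliseconds.
guardrail_action() {
  awk -v err="$1" -v p99="$2" 'BEGIN {
    # An error rate breach aborts; a latency breach only warns (per the card).
    if (err + 0 >= 2) { print "abort" }
    else if (p99 + 0 >= 1000) { print "warn" }
    else { print "ok" }
  }'
}

guardrail_action 0.4 412    # prints: ok
guardrail_action 0.4 1250   # prints: warn
guardrail_action 3.1 412    # prints: abort
```

The error rate is checked first so that an abort always takes priority over a warning when both limits are breached at once.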

Step 5: Analyze the Results

After the experiment completes, analyze the data and share findings with the team.

# Compare pre- and post-experiment metrics.
# The URL and jq filter are quoted so the shell does not expand ? or [].
# Run before injecting the fault:
pre=$(curl -s 'http://prometheus:9090/api/v1/query?query=p99_latency' | jq -r '.data.result[0].value[1]')
echo "Pre: p99=${pre}ms"
# Run after the experiment completes:
post=$(curl -s 'http://prometheus:9090/api/v1/query?query=p99_latency' | jq -r '.data.result[0].value[1]')
echo "Post: p99=${post}ms"
# Expected output:
# Pre: p99=342ms
# Post: p99=412ms
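To turn the two readings into a finding, it helps to report the relative change alongside the hypothesis limit. The sketch below uses the example values from the expected output; `pct_change` is a hypothetical helper name, and it assumes the pre-experiment value is non-zero.

```shell
# Sketch: percentage change between two readings, to one decimal place.
# $1 = pre-experiment value, $2 = post-experiment value (pre must be non-zero).
pct_change() {
  awk -v pre="$1" -v post="$2" 'BEGIN { printf "%.1f\n", (post - pre) / pre * 100 }'
}

pre_ms=342
post_ms=412
echo "p99 changed by $(pct_change "$pre_ms" "$post_ms")% (limit: 1000ms, observed: ${post_ms}ms)"
# prints: p99 changed by 20.5% (limit: 1000ms, observed: 412ms)
```

Here p99 latency rose by about a fifth but stayed well under the 1000ms limit, so the latency part of the hypothesis holds for this run.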

Learning Path

flowchart LR
  A[Steady State Hypothesis] --> B[Designing Experiments]
  B --> C[Game Days]
  C --> D[Chaos Mesh Platform]
  D --> E[Automated Pipeline]
  style B fill:#f90,color:#fff

Common Errors

Designing experiments without a specific concern: "Let's see what happens" is not a plan. Start with a known weakness.
Choosing faults that are too complex: Begin with simple faults like pod kills. Graduate to network partitions and complex failures.
Running experiments without peer review: Always have another engineer review the experiment card before execution. They will spot things you missed.
Skipping the rollback plan: Know exactly how to stop the experiment and restore normal operation. Automate this when possible.
Not scheduling follow-up experiments: If the hypothesis was rejected, the fix should be verified with a repeat experiment.
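A rollback plan can be as small as one function that removes the chaos object and waits for the deployment to recover. The following is a sketch under a few assumptions: the PodChaos object was created in the current namespace (as in the manifest from Step 3), the deployment is named `payment-service`, and the 120s timeout is arbitrary. The function name `rollback_experiment` and the `KUBECTL` override are illustrative.

```shell
# Sketch: stop an experiment and wait for the target deployment to recover.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

rollback_experiment() {
  name="$1"; namespace="$2"; deployment="$3"
  # Deleting the PodChaos resource asks Chaos Mesh to end the experiment.
  $KUBECTL delete podchaos "$name" --ignore-not-found || return 1
  # Block until the rollout reports all replicas available, or time out.
  $KUBECTL rollout status "deployment/$deployment" -n "$namespace" --timeout=120s
}

# Example call (not executed here):
# rollback_experiment experiment-payment-kill-001 staging payment-service
```

Setting `KUBECTL=echo` prints the two commands instead of running them, which is a cheap way to review the rollback during the experiment card review.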

Practice Questions

What information should be included in an experiment card?
Why should you start with simple fault types like pod kills?
How do you decide which service to target for an experiment?
What is the purpose of peer review in experiment design?
How do you determine if an experiment produced useful results?

Challenge

Design a complete experiment for a database connection pool exhaustion scenario. Write the experiment card, implement the fault injection using a proxy-based approach, set guardrails, and define success criteria. Execute the experiment in a staging environment and document the results.

FAQ

What is an experiment card?

An experiment card is a standardized document that captures the hypothesis, fault type, target, metrics, blast radius, duration, and guardrails for a chaos experiment.

How many experiments should I run per week?

Start with one or two experiments per week. Quality matters more than quantity. Increase frequency as your team gains confidence and automation matures.

Can I design an experiment without knowing the expected outcome?

You should always have an expected outcome: that is your hypothesis. If you have no expectation, you are not ready to run.

What is the most common experiment for beginners?

Killing one pod in a multi-replica deployment is the safest and most common first experiment. It teaches the fundamentals without high risk.

How do I know if an experiment is too risky?

If you cannot confidently predict the blast radius, or if the service has no redundancy, the experiment is too risky. Reduce scope or move to staging.
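For pod-kill experiments, the blast radius can be estimated from the replica count before anything runs. This is a rough sketch that assumes traffic is spread evenly across pods; the helpers `blast_radius_pct` and `precheck` are hypothetical names.

```shell
# Sketch: share of pods removed when $1 of $2 replicas are killed.
blast_radius_pct() {
  awk -v k="$1" -v n="$2" 'BEGIN { printf "%d\n", k / n * 100 }'
}

# Refuse experiments that would leave no healthy replica.
precheck() {
  killed="$1"; replicas="$2"
  if [ "$replicas" -le "$killed" ]; then
    echo "too risky: no redundancy left"
    return 1
  fi
  echo "blast radius: $(blast_radius_pct "$killed" "$replicas")% of pods"
}

precheck 1 2   # prints: blast radius: 50% of pods
```

Killing 1 of 2 pods matches the "1 pod, 50% of traffic" blast radius in the experiment card, while a single-replica service fails the precheck outright.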
