Designing Chaos Experiments — Structured Fault Injection for Resilient Systems

DodaTech. Updated 2026-06-23.

Designing a Chaos Engineering experiment requires a structured process that turns a reliability concern into a testable, safe, and repeatable fault injection. This tutorial covers the full experiment lifecycle, from hypothesis to post-experiment report.

What You Will Learn

This tutorial teaches you how to design chaos experiments using a repeatable framework: formulating falsifiable hypotheses, selecting appropriate fault types, defining blast radius boundaries, implementing guardrails, and analyzing results with Python.

Why It Matters

A well-designed experiment produces clear, actionable insights about system resilience. A poorly designed experiment wastes engineering time, introduces unnecessary risk, and generates inconclusive data. Following a structured design process means every run tests a stated prediction against measured thresholds, so each outcome, confirmed or rejected, teaches the team something.

Real-World Use

DodaTech uses a standardized experiment card template for every chaos run. Engineers submit cards for peer review before execution. This process has caught design flaws in over 30 percent of proposed experiments, preventing potential incidents and ensuring every experiment contributes to the team's resilience knowledge base.

Prerequisites

Before starting you should understand:

  • Core Chaos Engineering principles and steady state hypotheses
  • How blast radius controls limit experiment scope
  • Basic Kubernetes operations (kubectl, pods, deployments)
  • Familiarity with YAML and Python scripting

Step 1: Formulate a Falsifiable Hypothesis

Every experiment begins with a specific concern about system behavior. Translate that concern into a falsifiable hypothesis using the standard template: fault, expected outcome, metric thresholds, and reasoning.

# experiment-hypothesis.yaml
hypothesis:
  concern: "Payment service has only 2 replicas"
  fault: pod-kill
  target:
    service: payment-service
    namespace: staging
  prediction: "p99 latency stays below 1000ms and error rate stays below 2%"
  metric_thresholds:
    p99_latency: 1000ms
    error_rate: 2%
    throughput_degradation: 20%
  reasoning: >
    The remaining replica can handle full traffic
    because the load balancer detects pod failure
    within 5 seconds and redirects all traffic.

# Print the hypothesis as an if/then/because statement for peer review
cat <<'EOF'
If we kill 1 of 2 payment-service pods,
then p99 latency remains under 1000ms and error rate under 2%
because the remaining replica and load balancer
handle the full traffic within 5 seconds.
EOF

Expected output: The hypothesis is printed to stdout for documentation and peer review.
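
The heredoc above only prints the statement; it does not check that all four template parts are present. A minimal sketch of such a check follows. The `Hypothesis` class and its field names are hypothetical, chosen here to mirror the template (fault, expected outcome, metric thresholds, reasoning), and are not part of any chaos tool.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Hypothetical container for the four parts of the hypothesis template."""
    fault: str
    expected_outcome: str
    metric_thresholds: dict
    reasoning: str

    def problems(self):
        """Return the reasons this hypothesis is not yet reviewable."""
        issues = []
        if not self.fault.strip():
            issues.append("no fault specified")
        if not self.expected_outcome.strip():
            issues.append("no expected outcome")
        if not self.metric_thresholds:
            issues.append("no metric thresholds, so nothing can falsify it")
        if not self.reasoning.strip():
            issues.append("no reasoning clause")
        return issues

    def statement(self):
        """Render the if/then/because sentence used for peer review."""
        return (f"If we {self.fault}, then {self.expected_outcome} "
                f"because {self.reasoning}")


hypothesis = Hypothesis(
    fault="kill 1 of 2 payment-service pods",
    expected_outcome="p99 latency remains under 1000ms and error rate under 2%",
    metric_thresholds={"p99_latency_ms": 1000, "error_rate": 0.02},
    reasoning="the remaining replica and load balancer handle the full traffic within 5 seconds.",
)

print(hypothesis.problems())   # an empty list means all four parts are present
print(hypothesis.statement())
```

An empty metric threshold map is treated as a problem because a prediction without numbers cannot be proven wrong.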

Step 2: Define the Experiment Card

The experiment card captures all details in a single document that serves as the source of truth for execution and post-experiment analysis.

# experiment-card.yaml
experiment:
  id: "EXP-2026-06-001"
  title: "Payment Service Single Pod Failure"
  hypothesis: "Kill 1 of 2 pods, p99 under 1000ms, errors under 2%"
  fault:
    type: pod-kill
    tool: chaos-mesh
  target:
    namespace: staging
    label: app=payment-service
  metrics:
    - name: p99_latency
      query: 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[1m])))'
      threshold: 1.0
    - name: error_rate
      query: 'sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))'
      threshold: 0.02
  blast_radius: "1 pod, 50% of traffic"
  duration: 60s
  guardrails:
    error_rate_breach: abort_experiment
    latency_breach: warn_only
  rollback: "kubectl delete podchaos experiment-payment-kill-001"
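
Before a card goes to peer review, a small script can catch missing sections. The sketch below assumes the card has already been loaded into a Python dict (for example with a YAML parser). The `REQUIRED_FIELDS` tuple and both function names are illustrative choices based on the fields shown in the card above, not a published standard.

```python
# Fields taken from the example card; adjust to your team's template.
REQUIRED_FIELDS = (
    "hypothesis", "fault", "target", "metrics",
    "blast_radius", "duration", "guardrails", "rollback",
)


def missing_fields(card):
    """Return required fields that are absent or empty in the card dict."""
    return [name for name in REQUIRED_FIELDS if not card.get(name)]


def metrics_without_threshold(card):
    """Return names of metrics that lack a numeric threshold."""
    return [
        metric.get("name", "<unnamed>")
        for metric in card.get("metrics", [])
        if not isinstance(metric.get("threshold"), (int, float))
    ]


# A draft card with two deliberate gaps: no rollback, one metric without a threshold.
draft_card = {
    "hypothesis": "Kill 1 of 2 pods, p99 under 1000ms, errors under 2%",
    "fault": {"type": "pod-kill", "tool": "chaos-mesh"},
    "target": {"namespace": "staging", "label": "app=payment-service"},
    "metrics": [
        {"name": "p99_latency", "threshold": 1.0},
        {"name": "error_rate"},
    ],
    "blast_radius": "1 pod, 50% of traffic",
    "duration": "60s",
    "guardrails": {"error_rate_breach": "abort_experiment"},
}

print(missing_fields(draft_card))             # ['rollback']
print(metrics_without_threshold(draft_card))  # ['error_rate']
```

A check like this does not replace peer review; it only removes the mechanical omissions so reviewers can focus on the design.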

Step 3: Implement Fault Injection

Execute the experiment using the chosen Chaos Engineering tool. This example uses Chaos Mesh to kill a single pod.

# Apply the experiment
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: experiment-payment-kill-001
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: ["staging"]
    labelSelectors:
      app: payment-service
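  # Note: pod-kill is a one-shot action in Chaos Mesh, so the pod is killed once
  # rather than being kept down for the whole duration.
  duration: 60s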
EOF

Expected output:

podchaos.chaos-mesh.org/experiment-payment-kill-001 created
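
While the fault is active, something has to enforce the guardrails from the experiment card. The following is a minimal sketch of a polling loop, assuming the caller supplies a `read_metrics` function that returns current values (for instance from Prometheus). The function name `watch_guardrails`, the metric keys, and the mapping of metrics to actions are assumptions made for illustration; the loop only reports the decision and leaves the actual rollback command to the caller.

```python
import time


def watch_guardrails(read_metrics, thresholds, actions, duration_s,
                     interval_s=5, sleep=time.sleep):
    """Poll metrics for duration_s seconds.

    Returns ("abort", metric_name, warnings) on the first breach whose action
    is "abort_experiment", otherwise ("completed", None, warnings).
    """
    warnings = []
    elapsed = 0
    while elapsed < duration_s:
        observed = read_metrics()
        for name, limit in thresholds.items():
            value = observed.get(name)
            if value is None or value <= limit:
                continue
            if actions.get(name) == "abort_experiment":
                return "abort", name, warnings
            # Any other action (such as warn_only) is recorded, not acted on.
            warnings.append((elapsed, name, value))
        sleep(interval_s)
        elapsed += interval_s
    return "completed", None, warnings


thresholds = {"p99_latency": 1.0, "error_rate": 0.02}
actions = {"p99_latency": "warn_only", "error_rate": "abort_experiment"}

# Canned samples stand in for live metrics so the sketch runs without a cluster.
samples = iter([
    {"p99_latency": 0.4, "error_rate": 0.001},
    {"p99_latency": 1.2, "error_rate": 0.010},
    {"p99_latency": 0.9, "error_rate": 0.050},
])
outcome = watch_guardrails(lambda: next(samples), thresholds, actions,
                           duration_s=60, interval_s=5, sleep=lambda _: None)
print(outcome)
```

With these canned samples the second poll records a latency warning and the third poll triggers an abort on error rate. In a real run, an "abort" result is the point at which the rollback command from the experiment card would be executed.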

Step 4: Analyze Results with Python

After the experiment completes, analyze the metrics to determine whether the hypothesis was confirmed or rejected.

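import json
import urllib.parse
import urllib.request

# Prometheus HTTP API endpoint for instant queries
prometheus_url = "http://prometheus:9090/api/v1/query"

def query_metric(query):
    """Run an instant PromQL query and return the first sample as a float."""
    params = "?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(prometheus_url + params, timeout=10) as resp:
        data = json.loads(resp.read())
    results = data["data"]["result"]
    if not results:
        raise ValueError(f"No data returned for query: {query}")
    return float(results[0]["value"][1])

# Query right after the experiment, while the 1m window still covers it.
# Summing buckets by le yields a single p99 across all series.
p99 = query_metric(
    'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[1m])))'
)
error_rate = query_metric(
    'sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))'
)

print(f"Observed p99 latency: {p99*1000:.0f}ms")
print(f"Observed error rate: {error_rate*100:.2f}%")
print(f"Hypothesis confirmed: {p99 < 1.0 and error_rate < 0.02}")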

Expected output:

Observed p99 latency: 412ms
Observed error rate: 0.30%
Hypothesis confirmed: True
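
The hypothesis in Step 1 also sets a throughput degradation threshold, which the script above does not measure. A minimal sketch of that calculation, and of a per-metric verdict that reports which thresholds were breached, might look like this. The function names and the requests-per-second figures are made up for illustration; in practice the baseline and in-experiment throughput would come from your metrics system.

```python
def throughput_degradation(baseline_rps, during_rps):
    """Fractional drop in throughput relative to the pre-experiment baseline."""
    if baseline_rps <= 0:
        raise ValueError("baseline throughput must be positive")
    return max(0.0, (baseline_rps - during_rps) / baseline_rps)


def evaluate(observed, thresholds):
    """Return (confirmed, breaches); a metric breaches when it reaches its limit."""
    breaches = {
        name: (observed[name], limit)
        for name, limit in thresholds.items()
        if observed[name] >= limit
    }
    return (not breaches), breaches


thresholds = {"p99_latency": 1.0, "error_rate": 0.02, "throughput_degradation": 0.20}

# Illustrative inputs: 250 req/s before the fault, 220 req/s during it.
observed = {
    "p99_latency": 0.412,
    "error_rate": 0.003,
    "throughput_degradation": throughput_degradation(250.0, 220.0),
}

confirmed, breaches = evaluate(observed, thresholds)
print(f"Throughput degradation: {observed['throughput_degradation']:.0%}")
print(f"Hypothesis confirmed: {confirmed}")
print(f"Breached metrics: {breaches}")
```

Returning the breached metrics alongside the verdict makes a rejected hypothesis easier to act on, since the report can name the exact threshold that failed.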

Step 5: Document Findings

Record the experiment outcome, observations, and action items. Store this report alongside the experiment card for future reference.

# experiment-report.yaml
report:
  experiment_id: "EXP-2026-06-001"
  hypothesis: "Kill 1 of 2 pods, p99 under 1000ms, errors under 2%"
  result: confirmed
  observed_metrics:
    p99_latency: 412ms
    error_rate: 0.3%
    throughput_degradation: 12%
  findings:
    - "System handled single pod failure correctly"
    - "Load balancer detected failure in 4 seconds"
    - "Recovery time was 8 seconds total"
  action_items:
    - "Reduce readiness probe interval from 10s to 5s"
    - "Add third replica for safety margin"
  next_steps:
    - "Run experiment with 500ms latency injection"
    - "Test cascading failure scenario"
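
Reports are easier to keep consistent when the result field is derived from the measurements instead of typed by hand. The sketch below builds a report dict shaped like the YAML above and prints it as JSON, which avoids a third-party YAML dependency. The `build_report` function and the `breached_metrics` field are hypothetical additions for illustration.

```python
import json


def build_report(experiment_id, hypothesis, observed, thresholds,
                 findings, action_items):
    """Assemble a report dict; the result is derived from the thresholds."""
    breached = sorted(
        name for name, limit in thresholds.items() if observed[name] >= limit
    )
    return {
        "report": {
            "experiment_id": experiment_id,
            "hypothesis": hypothesis,
            "result": "rejected" if breached else "confirmed",
            "breached_metrics": breached,
            "observed_metrics": observed,
            "findings": findings,
            "action_items": action_items,
        }
    }


report = build_report(
    experiment_id="EXP-2026-06-001",
    hypothesis="Kill 1 of 2 pods, p99 under 1000ms, errors under 2%",
    observed={"p99_latency": 0.412, "error_rate": 0.003},
    thresholds={"p99_latency": 1.0, "error_rate": 0.02},
    findings=["System handled single pod failure correctly"],
    action_items=["Add third replica for safety margin"],
)
print(json.dumps(report, indent=2))
```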

Learning Path

flowchart LR
  A[Steady State Hypothesis] --> B[Designing Experiments]
  B --> C[Game Days]
  C --> D[Chaos Mesh]
  D --> E[Automated Pipeline]
  style B fill:#f90,color:#fff

Common Errors

  1. Running experiments without a specific hypothesis: "Let me see what happens" is not a plan. Every experiment must have a falsifiable prediction.
  2. Choosing overly complex faults for first experiments: Start with simple pod kills. Graduate to network partitions and complex scenarios only after mastering basics.
  3. Skipping the blast radius analysis: Always document which users, services, or data could be affected. Start with staging and expand to production gradually.
  4. Failing to set guardrails: Without abort conditions an experiment can cause cascading failures. Define metric thresholds that trigger automatic rollback.
  5. Not scheduling follow-up experiments: If the hypothesis is rejected, the fix must be verified with a repeat experiment. Treat rejected hypotheses as feature requests.
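
For pod-kill experiments, a rough blast radius estimate can be computed before the run. This sketch assumes traffic is spread evenly across replicas, which real load balancers only approximate; the function name `pod_kill_blast_radius` is hypothetical.

```python
def pod_kill_blast_radius(total_replicas, killed):
    """Estimate the impact of killing pods, assuming evenly balanced traffic.

    Returns (affected_share, load_factor): the fraction of traffic that was
    routed to the killed pods, and how much more load each survivor carries.
    """
    if not 0 < killed < total_replicas:
        raise ValueError("killed must be at least 1 and leave a survivor")
    affected_share = killed / total_replicas
    load_factor = total_replicas / (total_replicas - killed)
    return affected_share, load_factor


for replicas in (2, 3):
    share, load = pod_kill_blast_radius(replicas, killed=1)
    print(f"{replicas} replicas, 1 killed: "
          f"{share:.0%} of traffic affected, survivors at {load:.1f}x load")
```

With two replicas, one kill affects half the traffic and doubles the load on the survivor. With three replicas the same fault affects a third of the traffic and raises survivor load by half, which is the arithmetic behind adding a replica as a safety margin.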

Practice Questions

  1. Which sections does an experiment card need to contain?
  2. Why must a hypothesis include a reasoning clause?
  3. How do you determine the appropriate blast radius for an experiment?
  4. What metrics should you monitor during a pod-kill experiment?
  5. How do you decide whether an experiment hypothesis was confirmed or rejected?

Challenge

Design a complete experiment for a database connection pool exhaustion scenario. Write the hypothesis, define metric thresholds, set guardrails, implement the fault injection using a proxy, and create a Python script that analyzes the results. Run through the full cycle in a staging environment.

FAQ

What is an experiment card in Chaos Engineering?

An experiment card is a standardized document that captures the hypothesis, fault type, target, metrics, blast radius, duration, guardrails, and rollback command for a chaos experiment.

How long should a chaos experiment run?

Most experiments last between 30 seconds and 5 minutes. Duration should be long enough to observe system behavior but short enough to limit the blast radius if something goes wrong.

What is the most common beginner experiment?

Killing one pod in a multi-replica deployment is the safest and most informative first experiment. It teaches hypothesis formulation, metric monitoring, and result analysis with minimal risk.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.