Chaos Testing: Building Resilient Systems Through Experimentation

DodaTech Updated 2026-06-22 8 min read

In this tutorial, you'll learn about Chaos Testing: Building Resilient Systems Through Experimentation. We cover key concepts, practical examples, and best practices.

Chaos testing is the practice of intentionally injecting failures into a system to uncover weaknesses before they cause real outages, building confidence that the system can withstand unexpected disruptions.

What You'll Learn

In this tutorial, you'll learn chaos engineering principles, how to design and run chaos experiments, fault injection techniques, tools like Chaos Monkey and Litmus, and how to build resilience into distributed systems.

Why This Matters

Every system fails eventually. Servers crash, networks partition, databases time out. Chaos testing reveals how your system behaves when these failures happen, in a controlled experiment rather than during a customer-facing outage. Netflix's Chaos Monkey, which randomly terminates production instances, has become the standard example of proactive resilience testing. At DodaTech, Doda Browser's sync service regularly undergoes chaos experiments that kill backend instances to verify that the browser continues syncing bookmarks without data loss.

Learning Path

flowchart LR
  A[Testing Microservices] --> B["Chaos Testing<br/>You are here"]
  B --> C[Chaos Experiments]
  B --> D[Fault Injection]
  C --> E[Chaos Monkey]
  D --> F[Litmus Chaos]
  E --> G[Production Resilience]
  style B fill:#f90,color:#fff

Chaos Engineering Principles

Chaos engineering follows four principles based on the scientific method:

Principle	Description
Define steady state	Measure normal system behavior (latency, error rate, throughput)
Form a hypothesis	"If the database fails, the service returns cached data within 500ms"
Run the experiment	Inject the failure in a controlled environment
Verify or disprove	Compare results against the hypothesis
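
The four steps can be sketched in a few lines of Python. This is a minimal illustration, not part of any chaos tool: the `SteadyState` class, the choice of p95 latency and error rate as the steady-state measures, the thresholds, and the sample numbers are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class SteadyState:
    """Summary of system behavior built from raw request samples."""
    p95_latency_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def summarize(latencies_ms, errors):
    """Summarize latencies of successful requests plus a count of failed ones."""
    total = len(latencies_ms) + errors
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = quantiles(latencies_ms, n=20)[-1]
    return SteadyState(p95_latency_ms=p95, error_rate=errors / total)


def hypothesis_holds(observed, max_p95_ms, max_error_rate):
    """Compare observations taken during the fault with the hypothesis limits."""
    return (observed.p95_latency_ms <= max_p95_ms
            and observed.error_rate <= max_error_rate)


# Hypothetical samples: one set before the fault, one set during it.
baseline = summarize([40, 42, 45, 47, 50, 44, 43, 46, 41, 48], errors=0)
during_fault = summarize([180, 210, 250, 300, 220, 190, 240, 260, 230, 200], errors=1)
print(hypothesis_holds(during_fault, max_p95_ms=500, max_error_rate=0.10))
```

Writing the limits down before the experiment runs keeps the verification step honest: the result is compared with a threshold chosen in advance rather than judged after the fact.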

Running a Chaos Experiment

Here's a simple chaos experiment using Python to test how a service behaves when its database connection is lost:

# chaos_experiment.py
import time
import requests
import subprocess

STEADY_STATE_LATENCY = []

def measure_steady_state():
    """Establish baseline latency while all dependencies are healthy."""
    for _ in range(20):
        start = time.time()
        # A timeout keeps the baseline run from hanging on a stuck service.
        requests.get("http://localhost:8080/api/users/1", timeout=5)
        STEADY_STATE_LATENCY.append(time.time() - start)
    avg_latency = sum(STEADY_STATE_LATENCY) / len(STEADY_STATE_LATENCY)
    print(f"Steady state latency: {avg_latency*1000:.1f}ms")
    return avg_latency

def inject_failure():
    """Stop the database container."""
    # check=True fails fast if the container could not be stopped;
    # otherwise the experiment would silently measure a healthy system.
    subprocess.run(["docker", "stop", "postgres"], check=True)

def measure_during_failure():
    """Measure latency and success rate while the database is down."""
    failure_latency = []
    for _ in range(10):
        try:
            start = time.time()
            response = requests.get(
                "http://localhost:8080/api/users/1",
                timeout=5
            )
            # Treat HTTP 4xx/5xx responses as failures, not successes.
            response.raise_for_status()
            failure_latency.append(time.time() - start)
        except requests.RequestException:
            failure_latency.append(None)

    available = [latency for latency in failure_latency if latency is not None]
    success_rate = len(available) / len(failure_latency) * 100
    print(f"Success rate during failure: {success_rate:.0f}%")
    return failure_latency

def restore_service():
    """Restart the database."""
    subprocess.run(["docker", "start", "postgres"], check=False)

if __name__ == "__main__":
    hypothesis = "Service returns cached data within 1000ms when database is down"

    steady_state = measure_steady_state()
    inject_failure()
    time.sleep(2)

    try:
        failure_results = measure_during_failure()
    finally:
        restore_service()

    print(f"\nHypothesis: {hypothesis}")
    print("Experiment complete")

Expected output:

Steady state latency: 45.2ms
Success rate during failure: 100%

Hypothesis: Service returns cached data within 1000ms when database is down
Experiment complete

If the hypothesis is disproven (the service times out or returns errors), you've found a resilience gap.
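
The script above prints the hypothesis but leaves the comparison to the reader. A minimal sketch of that last step follows, assuming the same data shape as `failure_latency` (latencies in seconds, with `None` for a failed request). The function name `evaluate_hypothesis` and the sample values are hypothetical.

```python
def evaluate_hypothesis(failure_latency, max_latency_s=1.0):
    """Return (passed, reasons) for latencies in seconds; None marks a failure."""
    reasons = []
    failed = sum(1 for latency in failure_latency if latency is None)
    if failed:
        reasons.append(f"{failed} of {len(failure_latency)} requests failed")
    slow = [latency for latency in failure_latency
            if latency is not None and latency > max_latency_s]
    if slow:
        reasons.append(f"{len(slow)} requests exceeded {max_latency_s * 1000:.0f}ms")
    # The hypothesis holds only when no request failed and none was too slow.
    return (not reasons, reasons)


# Hypothetical measurements: two fast responses, one failure, one slow response.
passed, reasons = evaluate_hypothesis([0.12, 0.30, None, 1.40])
print("VALIDATED" if passed else "DISPROVEN: " + "; ".join(reasons))
```

Returning the reasons alongside the verdict makes the result easier to record in a runbook than a bare pass or fail.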

Fault Injection with Toxiproxy

Toxiproxy is a proxy that sits between services and introduces network faults:

# network_chaos.py
# Talks to Toxiproxy's HTTP API (default port 8474) using requests.
# The proxies named below must already exist in Toxiproxy.
import requests

TOXIPROXY_URL = "http://localhost:8474"

def add_toxic(proxy_name, toxic_type, attributes, toxicity=1.0, stream="upstream"):
    """Attach a toxic to an existing proxy and return Toxiproxy's JSON reply."""
    response = requests.post(
        f"{TOXIPROXY_URL}/proxies/{proxy_name}/toxics",
        json={
            "type": toxic_type,
            "stream": stream,
            "toxicity": toxicity,
            "attributes": attributes,
        },
        timeout=5,
    )
    response.raise_for_status()
    return response.json()

def add_latency_toxic():
    # Add 2000ms latency (plus or minus 500ms jitter) to database connections
    return add_toxic("postgres_proxy", "latency", {"latency": 2000, "jitter": 500})

def add_connection_errors():
    # Affect roughly 50% of connections: with timeout=0 the data is dropped
    # and the connection stays open until the toxic is removed
    return add_toxic("payment_service_proxy", "timeout", {"timeout": 0}, toxicity=0.5)

def add_bandwidth_limit():
    # Limit bandwidth to 10 KB/s
    return add_toxic("file_storage_proxy", "bandwidth", {"rate": 10})

Expected test results with Toxiproxy active:

test session starts
collecting ... collected 3 items

test_database_latency.py .                                    [33%]
PASS: Service returns results within 3000ms (actual: 2100ms)

test_payment_timeout.py .                                     [66%]
PASS: Service retries and falls back to alternative gateway

test_bandwidth_limits.py .                                    [100%]
PASS: File upload shows progress indicator during slow transfer
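
A toxic stays active until it is removed, so a test that adds one should also clean it up, even when the test fails. The sketch below wraps Toxiproxy's HTTP API (`POST /proxies/{proxy}/toxics` to add, `DELETE /proxies/{proxy}/toxics/{toxic}` to remove) in a context manager. The helper names `http_send` and `toxic`, and the injectable `send` parameter, are assumptions made for this example; the sketch uses only the standard library.

```python
import json
import urllib.request
from contextlib import contextmanager

TOXIPROXY_URL = "http://localhost:8474"  # Toxiproxy's default API address


def http_send(method, path, body=None):
    """Send one request to the Toxiproxy HTTP API and return the status code."""
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(
        TOXIPROXY_URL + path, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


@contextmanager
def toxic(proxy_name, toxic_name, toxic_type, attributes,
          toxicity=1.0, send=http_send):
    """Add a toxic for the duration of a with-block, then always remove it."""
    send("POST", f"/proxies/{proxy_name}/toxics", {
        "name": toxic_name,
        "type": toxic_type,
        "stream": "upstream",
        "toxicity": toxicity,
        "attributes": attributes,
    })
    try:
        yield
    finally:
        # Runs even if the body of the with-block raises.
        send("DELETE", f"/proxies/{proxy_name}/toxics/{toxic_name}")


# Usage (requires a running Toxiproxy with a proxy named "postgres_proxy"):
#
#     with toxic("postgres_proxy", "db_latency", "latency",
#                {"latency": 2000, "jitter": 500}):
#         ...  # run the test that expects slow database responses
```

Passing `send` as a parameter is a design choice that lets the cleanup logic be exercised with a fake sender, without a Toxiproxy server.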

Chaos Monkey-Style Pod Deletion with Litmus

Litmus is a Kubernetes-native chaos engineering platform. Here's a Litmus experiment that kills pods in a namespace:

# pod-delete-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "true"
        probe:
          - name: "check-service-health"
            type: "httpProbe"
            httpProbe/inputs:
              url: "http://nginx-service.default.svc.cluster.local"
              insecureSkipVerify: true
              # Pass criterion for the probe: a GET must return HTTP 200
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: "Continuous"
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 1

Apply the experiment to your Kubernetes cluster:

kubectl apply -f pod-delete-experiment.yaml

Expected experiment output (illustrative summary):

Experiment: pod-delete (Running)
  Pods targeted: nginx-deployment-7d8b9c6f4-abcde
  Pod killed: nginx-deployment-7d8b9c6f4-abcde
  Pod killed: nginx-deployment-7d8b9c6f4-fghij
  Health probe: All targets healthy
  Experiment status: Completed

Result: Hypothesis validated. Service remained healthy through 2 pod terminations.
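
Litmus records the outcome of a run in a ChaosResult custom resource, which can be fetched as JSON with `kubectl get chaosresult <name> -o json`. The sketch below pulls the verdict out of that JSON. It assumes the fields `status.experimentStatus.verdict` and `status.experimentStatus.probeSuccessPercentage`, and a result named `engine-nginx-pod-delete`; field and resource names can differ between Litmus versions, so check them against your cluster. The sample document is hand-written for illustration, not captured from a real run.

```python
import json


def read_verdict(chaosresult_json):
    """Extract the verdict and probe success percentage from ChaosResult JSON."""
    result = json.loads(chaosresult_json)
    status = result.get("status", {}).get("experimentStatus", {})
    return {
        "verdict": status.get("verdict", "Unknown"),
        "probe_success": status.get("probeSuccessPercentage", "Unknown"),
    }


# Hand-written sample standing in for real kubectl output.
sample = json.dumps({
    "kind": "ChaosResult",
    "metadata": {"name": "engine-nginx-pod-delete"},
    "status": {
        "experimentStatus": {
            "phase": "Completed",
            "verdict": "Pass",
            "probeSuccessPercentage": "100",
        }
    },
})
print(read_verdict(sample))
```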

GameDays: Scheduled Chaos

A GameDay is a scheduled, structured chaos experiment involving the whole team:

Phase	Activity	Duration
Planning	Define scope, hypothesis, rollback plan	1 hour
Preparation	Set up monitoring, notify stakeholders	30 min
Execution	Run the experiment	30-60 min
Analysis	Review results, update runbooks	1 hour
Remediation	Fix issues found, write new tests	As needed

Metrics to Track

Metric	Why It Matters
Time to detection	How fast does monitoring detect the failure?
Time to mitigation	How fast does the system recover?
Error rate impact	What percentage of requests fail?
User impact	Does the failure degrade the user experience?
Blast radius	How many services are affected by one failure?
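
Several of these metrics fall out of a handful of timestamps and counters recorded during the experiment. The sketch below is one way to compute them. The `ExperimentTimeline` class, the choice to measure time to mitigation from the moment of injection, and the sample numbers are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExperimentTimeline:
    injected_at: float    # seconds since experiment start: fault injected
    detected_at: float    # first alert fired
    recovered_at: float   # metrics back within steady-state limits
    total_requests: int
    failed_requests: int


def summarize_metrics(timeline):
    """Derive detection time, mitigation time, and error rate from a timeline."""
    return {
        "time_to_detection_s": timeline.detected_at - timeline.injected_at,
        # Measured from injection here; some teams measure from detection.
        "time_to_mitigation_s": timeline.recovered_at - timeline.injected_at,
        "error_rate_pct": 100.0 * timeline.failed_requests / timeline.total_requests,
    }


# Hypothetical run: fault at 10s, alert at 42s, recovery at 95s.
timeline = ExperimentTimeline(
    injected_at=10.0, detected_at=42.0, recovered_at=95.0,
    total_requests=2000, failed_requests=30,
)
print(summarize_metrics(timeline))
```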

Common Errors

1. Running Experiments in Production Without Preparation

Always start in staging. Even then, have a rollback plan. Netflix runs Chaos Monkey in production because they've spent years building resilience — don't start there.

2. No Rollback Plan

If an experiment reveals an unexpected failure, you must be able to restore normal operations immediately. Define the rollback before running the experiment.

3. Testing Only One Failure at a Time

Real incidents often involve multiple simultaneous failures. After mastering single-failure experiments, try combinations: database failure + network latency + pod restart.
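
When faults are combined, each one still needs its own rollback, and the rollbacks must run even if the experiment fails partway. One way to sketch this in Python is `contextlib.ExitStack`, which unwinds the faults in reverse order. The `fault` context manager below is a stand-in that only records what it would do; real injection and rollback calls would replace the log lines.

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def fault(name, log):
    """Stand-in for a real fault: records injection and rollback in a log."""
    log.append(f"inject {name}")
    try:
        yield
    finally:
        log.append(f"rollback {name}")


def run_combined(fault_names, experiment, log):
    """Inject several faults, run the experiment, and roll everything back."""
    with ExitStack() as stack:
        for name in fault_names:
            stack.enter_context(fault(name, log))
        experiment()
    # Leaving the with-block rolls back the faults in reverse order.


log = []
run_combined(
    ["database failure", "network latency", "pod restart"],
    lambda: log.append("measure"),
    log,
)
print("\n".join(log))
```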

4. Ignoring Blast Radius

A chaos experiment that brings down the entire system is not a controlled experiment. Limit the blast radius to a subset of instances or a single service.
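
A simple way to enforce a blast radius is to cap the share of instances an experiment may touch. The sketch below is illustrative: the function name `pick_targets`, the 25% default cap, and the instance names are assumptions.

```python
import math
import random


def pick_targets(instances, max_fraction=0.25, seed=None):
    """Choose a random subset of instances, capped at max_fraction of the fleet."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    # Round down so the cap is never exceeded, and always leave at least
    # one instance untouched so the service can keep serving traffic.
    limit = min(math.floor(len(instances) * max_fraction), len(instances) - 1)
    rng = random.Random(seed)
    return rng.sample(list(instances), max(limit, 0))


fleet = [f"web-{i}" for i in range(8)]  # hypothetical instance names
print(pick_targets(fleet, max_fraction=0.25, seed=1))
```

With this rounding, a fleet too small for the cap yields an empty list, which means the experiment is skipped rather than allowed to exceed its limit.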

5. Not Automating Experiments

Manual experiments don't scale. Automate your chaos experiments to run on a schedule (weekly, monthly) so resilience is continuously validated.
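
The scheduling itself usually lives in cron, a CI pipeline, or the chaos tool's own scheduler. What an automated run should add is a guard that refuses to inject faults when nobody is around to respond. The sketch below assumes a window of weekdays from 09:00 to 15:00; the function name and the hours are illustrative choices.

```python
from datetime import datetime


def in_chaos_window(now, start_hour=9, end_hour=15):
    """True on Monday to Friday between start_hour (inclusive) and end_hour (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour


if in_chaos_window(datetime.now()):
    print("Inside the chaos window: experiments may run")
else:
    print("Outside the chaos window: skipping experiments")
```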

Practice Questions

1. What is the first step in a chaos experiment? Define the steady state — measure normal system behavior including latency, error rate, and throughput. This provides the baseline for comparison during the experiment.

2. What is Chaos Monkey? Netflix's tool that randomly terminates instances in production to ensure the system survives instance failures without user impact.

3. What is a GameDay? A scheduled, structured chaos experiment involving the whole team, run like a fire drill to test system resilience and team response procedures.

4. What is the blast radius of a chaos experiment? The set of components affected by the experiment. A well-designed experiment limits the blast radius to a subset of instances or a single service.

5. How do you know if a chaos experiment succeeded? The system maintained steady state (metrics stayed within acceptable thresholds) throughout the failure injection, or the hypothesis was validated.

Challenge: Design and run a chaos experiment that tests what happens when the authentication service goes down. Define the hypothesis, set up the experiment, run it in a staging environment with monitoring, and document the results and any resilience gaps found.

Real-World Task: E-Commerce Resilience Testing

Build a chaos testing strategy for an e-commerce platform with these services:

ProductCatalog — read-heavy, can tolerate brief outages
ShoppingCart — must remain available during writes
OrderProcessing — needs database and payment gateway
PaymentGateway — external dependency, simulate timeout

Run experiments in this order:

1. Kill one instance of ProductCatalog — service should continue
2. Add latency to database connection — verify caching kicks in
3. Kill all instances of PaymentGateway — verify graceful error handling
4. Network partition between OrderProcessing and database — test retry logic

Each experiment should document the hypothesis, results, and any code changes needed to improve resilience.
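
One lightweight way to keep that documentation consistent is to hold the plan as data and fill in results as each experiment runs. The sketch below encodes the four experiments above. The `Experiment` class, the hypothesis wording, and the report layout are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    order: int
    target: str
    fault: str
    hypothesis: str
    result: str = "not run"


PLAN = [
    Experiment(1, "ProductCatalog", "kill one instance",
               "Service continues serving requests"),
    Experiment(2, "database", "add connection latency",
               "Caching kicks in"),
    Experiment(3, "PaymentGateway", "kill all instances",
               "Errors are handled gracefully"),
    Experiment(4, "OrderProcessing <-> database", "network partition",
               "Retry logic recovers"),
]


def report(plan):
    """Render the plan in execution order, one line per experiment."""
    lines = []
    for exp in sorted(plan, key=lambda e: e.order):
        lines.append(
            f"{exp.order}. {exp.target}: {exp.fault} | "
            f"hypothesis: {exp.hypothesis} | result: {exp.result}"
        )
    return "\n".join(lines)


print(report(PLAN))
```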

FAQ

Is chaos testing the same as chaos engineering?

Chaos testing is the activity of injecting failures. Chaos engineering is the broader discipline that includes designing experiments, forming hypotheses, and using results to build more resilient systems.

Can chaos testing be automated?

Yes. Tools like Litmus, Chaos Mesh, and Gremlin run experiments on schedules or triggered by events (like deployments). Automated chaos testing is a key part of a mature resilience strategy.

Do I need Kubernetes to do chaos testing?

No. Chaos testing works at any level: kill processes on a VM, disconnect network interfaces, fill up disk space, or throttle CPU. The tools vary by infrastructure type.
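
As a minimal sketch of process-level fault injection without any chaos platform, the example below starts a throwaway child process that stands in for a service and then kills it abruptly. The helper names are hypothetical. In a real experiment the target would be the service's own process, and the interesting part is what its supervisor and its callers do next.

```python
import subprocess
import sys


def start_victim():
    """Start a throwaway child process that stands in for a real service."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )


def kill_process(process):
    """Terminate the process abruptly and return its exit code."""
    process.kill()  # SIGKILL on POSIX, TerminateProcess on Windows
    return process.wait(timeout=10)


victim = start_victim()
exit_code = kill_process(victim)
print(f"victim exited with code {exit_code}")
```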

How often should chaos experiments run?

Start monthly. As your team gains confidence, increase to weekly for critical services. Some teams run small experiments continuously (Netflix's Chaos Monkey runs every weekday).

What if a chaos experiment breaks production?

This is why you start in staging and have rollback plans. Even experienced teams occasionally cause real incidents. The key is learning from them — document what went wrong and improve your safeguards.

What's Next

Tutorial	What You'll Learn
Testing Microservices Guide	Broader strategies for distributed system testing
CI/CD	Automated resilience testing in CI
Docker	Container-level fault injection

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
