Chaos Testing: Building Resilient Systems Through Experimentation
Chaos testing is the practice of intentionally injecting failures into a system to uncover weaknesses before they cause real outages, building confidence that the system can withstand unexpected disruptions.
What You'll Learn
In this tutorial, you'll learn chaos engineering principles, how to design and run chaos experiments, fault injection techniques, tools like Chaos Monkey and Litmus, and how to build resilience into distributed systems.
Why This Matters
Every system fails eventually. Servers crash, networks partition, databases time out. Chaos testing finds out how your system behaves when these happen — in a controlled experiment, not during a customer-facing outage. Netflix's Chaos Monkey, which randomly terminates production instances, has become the standard example of proactive resilience testing. At DodaTech, Doda Browser's sync service regularly undergoes chaos experiments that kill backend instances to verify the browser continues syncing bookmarks without data loss.
Learning Path
flowchart LR
    A[Testing Microservices] --> B["Chaos Testing<br/>You are here"]
    B --> C[Chaos Experiments]
    B --> D[Fault Injection]
    C --> E[Chaos Monkey]
    D --> F[Litmus Chaos]
    E --> G[Production Resilience]
    style B fill:#f90,color:#fff
Chaos Engineering Principles
Chaos engineering follows four principles based on the scientific method:
| Principle | Description |
|---|---|
| Define steady state | Measure normal system behavior (latency, error rate, throughput) |
| Form a hypothesis | "If the database fails, the service returns cached data within 500ms" |
| Run the experiment | Inject the failure in a controlled environment |
| Verify or disprove | Compare results against the hypothesis |
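As a rough sketch of the first principle, steady state can be reduced to a few summary numbers that a hypothesis can later be compared against. The `SteadyState` class, the `summarize` helper, and the sample latencies below are hypothetical, and the 95th percentile is a simple nearest-rank estimate rather than anything a particular monitoring tool prescribes.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    mean_ms: float
    p95_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def summarize(latencies_ms, errors):
    """Summarize latencies of successful requests plus a count of failed ones."""
    if not latencies_ms:
        raise ValueError("need at least one successful sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile, computed with integer math: ceil(0.95 * n).
    rank = (95 * len(ordered) + 99) // 100
    total = len(ordered) + errors
    return SteadyState(
        mean_ms=sum(ordered) / len(ordered),
        p95_ms=ordered[rank - 1],
        error_rate=errors / total,
    )


# Hypothetical baseline samples in milliseconds.
baseline = summarize([40, 42, 45, 47, 51, 44, 43, 46, 48, 120], errors=0)
print(baseline)
```

A hypothesis then becomes a comparison against such a summary taken during the failure, for example `during.p95_ms <= 500 and during.error_rate == 0`.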
Running a Chaos Experiment
Here's a simple chaos experiment using Python to test how a service behaves when its database connection is lost:
# chaos_experiment.py
import subprocess
import time

import requests

URL = "http://localhost:8080/api/users/1"
STEADY_STATE_LATENCY = []


def measure_steady_state():
    """Establish baseline metrics."""
    for _ in range(20):
        start = time.time()
        requests.get(URL, timeout=5)
        STEADY_STATE_LATENCY.append(time.time() - start)
    avg_latency = sum(STEADY_STATE_LATENCY) / len(STEADY_STATE_LATENCY)
    print(f"Steady state latency: {avg_latency*1000:.1f}ms")
    return avg_latency


def inject_failure():
    """Stop the database container."""
    subprocess.run(["docker", "stop", "postgres"], check=False)


def measure_during_failure():
    """Measure response when database is down."""
    failure_latency = []
    for _ in range(10):
        try:
            start = time.time()
            response = requests.get(URL, timeout=5)
            # Treat HTTP error responses (4xx/5xx) as failures, not successes.
            response.raise_for_status()
            failure_latency.append(time.time() - start)
        except requests.RequestException:
            failure_latency.append(None)
    available = [t for t in failure_latency if t is not None]
    success_rate = len(available) / len(failure_latency) * 100
    print(f"Success rate during failure: {success_rate:.0f}%")
    return failure_latency


def restore_service():
    """Restart the database."""
    subprocess.run(["docker", "start", "postgres"], check=False)


if __name__ == "__main__":
    hypothesis = "Service returns cached data within 1000ms when database is down"
    steady_state = measure_steady_state()
    try:
        # Inject inside the try block so the database is always restored.
        inject_failure()
        time.sleep(2)
        failure_results = measure_during_failure()
    finally:
        restore_service()
    print(f"\nHypothesis: {hypothesis}")
    print("Experiment complete")
Example output when the service copes with the failure (the latency value will vary):
Steady state latency: 45.2ms
Success rate during failure: 100%

Hypothesis: Service returns cached data within 1000ms when database is down
Experiment complete
If the hypothesis is disproven (the service times out or returns errors), you've found a resilience gap.
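The script above prints the hypothesis but leaves the comparison to the reader. A minimal sketch of that final step follows, assuming the same result format as `measure_during_failure()` (latency in seconds, or `None` for a failed request). The `evaluate` function and its default thresholds are illustrative and not part of the original script.

```python
def evaluate(failure_latency, max_latency_s=1.0, min_success_rate=1.0):
    """Return (passed, reasons) for latencies in seconds, where None marks a failed request."""
    if not failure_latency:
        return False, ["no requests were made"]
    reasons = []
    ok = [t for t in failure_latency if t is not None]
    success_rate = len(ok) / len(failure_latency)
    if success_rate < min_success_rate:
        reasons.append(f"success rate {success_rate:.0%} below {min_success_rate:.0%}")
    slow = [t for t in ok if t > max_latency_s]
    if slow:
        reasons.append(f"{len(slow)} response(s) slower than {max_latency_s * 1000:.0f}ms")
    # The hypothesis holds only if no reason for failure was recorded.
    return not reasons, reasons


print(evaluate([0.05, 0.07, 0.06]))
print(evaluate([0.05, None, 1.4]))
```

Returning the reasons alongside the verdict makes it easier to record in the experiment write-up which part of the hypothesis failed.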
Fault Injection with Toxiproxy
Toxiproxy is a TCP proxy that sits between services and injects network faults ("toxics") such as latency, timeouts, and bandwidth limits. The example below drives it through its HTTP API (port 8474 by default) using requests, and assumes the three named proxies have already been created:
# network_chaos.py
import requests

TOXIPROXY_API = "http://localhost:8474"


def create_toxic(proxy_name, toxic_type, attributes, stream="upstream", toxicity=1.0):
    """Attach a toxic to an existing proxy via POST /proxies/{proxy}/toxics."""
    response = requests.post(
        f"{TOXIPROXY_API}/proxies/{proxy_name}/toxics",
        json={
            "type": toxic_type,
            "stream": stream,
            "toxicity": toxicity,
            "attributes": attributes,
        },
        timeout=5,
    )
    response.raise_for_status()
    return response.json()


def add_latency_toxic():
    # Add 2000ms latency (+/- 500ms jitter) to database connections
    return create_toxic("postgres_proxy", "latency", {"latency": 2000, "jitter": 500})


def add_connection_errors():
    # Affect 50% of connections. With timeout 0, data is dropped and the
    # connection stays open until the toxic is removed.
    return create_toxic("payment_service_proxy", "timeout", {"timeout": 0}, toxicity=0.5)


def add_bandwidth_limit():
    # Limit bandwidth to 10 KB/s
    return create_toxic("file_storage_proxy", "bandwidth", {"rate": 10})
Expected test results with Toxiproxy active:
test session starts
collecting ... collected 3 items
test_database_latency.py . [33%]
PASS: Service returns results within 3000ms (actual: 2100ms)
test_payment_timeout.py . [66%]
PASS: Service retries and falls back to alternative gateway
test_bandwidth_limits.py . [100%]
PASS: File upload shows progress indicator during slow transfer
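A toxic stays active until it is removed, so a test that fails midway can leave the fault in place for every test that follows. One way to guard against that is a context manager that adds the toxic on entry and deletes it on exit. The sketch below uses Toxiproxy's HTTP endpoints `POST /proxies/{proxy}/toxics` and `DELETE /proxies/{proxy}/toxics/{toxic}`; the helper names, the toxic name `db_latency`, and the recording stub (used so the example runs without a Toxiproxy server) are assumptions made for illustration.

```python
import json
import urllib.request
from contextlib import contextmanager

TOXIPROXY_API = "http://localhost:8474"  # default Toxiproxy API address


def http_send(method, path, body=None):
    """Send one request to the Toxiproxy HTTP API."""
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(
        TOXIPROXY_API + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


@contextmanager
def toxic(proxy_name, name, toxic_type, attributes, toxicity=1.0, send=http_send):
    """Add a toxic for the duration of the with-block, then remove it."""
    send("POST", f"/proxies/{proxy_name}/toxics", {
        "name": name,
        "type": toxic_type,
        "stream": "upstream",
        "toxicity": toxicity,
        "attributes": attributes,
    })
    try:
        yield
    finally:
        # Runs even if the body raises, so the fault never outlives the test.
        send("DELETE", f"/proxies/{proxy_name}/toxics/{name}")


# Demonstration with a stub sender that records calls instead of using the network.
calls = []


def record(method, path, body=None):
    calls.append((method, path))


with toxic("postgres_proxy", "db_latency", "latency",
           {"latency": 2000, "jitter": 500}, send=record):
    pass  # exercise the service and assert on its behavior here

print(calls)
```

Against a real Toxiproxy instance, omit the `send` argument so the default `http_send` is used.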
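Kubernetes Chaos with Litmus
Litmus is a Kubernetes-native chaos engineering platform. Here's a Litmus ChaosEngine that repeatedly deletes pods of the nginx deployment in the default namespace while an HTTP probe checks that the service stays reachable: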
# pod-delete-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "true"
        probe:
          - name: "check-service-health"
            type: "httpProbe"
            httpProbe/inputs:
              url: "http://nginx-service.default.svc.cluster.local"
              insecureSkipVerify: true
            mode: "Continuous"
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 1
Apply the experiment to your Kubernetes cluster:
kubectl apply -f pod-delete-experiment.yaml
Illustrative summary of a successful run:
Experiment: pod-delete (Running)
Pods targeted: nginx-deployment-7d8b9c6f4-abcde
Pod killed: nginx-deployment-7d8b9c6f4-abcde
Pod killed: nginx-deployment-7d8b9c6f4-fghij
Health probe: All targets healthy
Experiment status: Completed
Result: Hypothesis validated. Service remained healthy through 2 pod terminations.
GameDays: Scheduled Chaos
A GameDay is a scheduled, structured chaos experiment involving the whole team:
| Phase | Activity | Duration |
|---|---|---|
| Planning | Define scope, hypothesis, rollback plan | 1 hour |
| Preparation | Set up monitoring, notify stakeholders | 30 min |
| Execution | Run the experiment | 30-60 min |
| Analysis | Review results, update runbooks | 1 hour |
| Remediation | Fix issues found, write new tests | As needed |
Metrics to Track
| Metric | Why It Matters |
|---|---|
| Time to detection | How fast does monitoring detect the failure? |
| Time to mitigation | How fast does the system recover? |
| Error rate impact | What percentage of requests fail? |
| User impact | Does the failure degrade the user experience? |
| Blast radius | How many services are affected by one failure? |
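The first three metrics in the table can be derived from a handful of timestamps and request counts recorded during an experiment. The sketch below is one possible way to do that. The `ExperimentTimeline` fields and the sample numbers are hypothetical, and time to mitigation is measured from the moment of injection here, although some teams measure it from detection instead.

```python
from dataclasses import dataclass


@dataclass
class ExperimentTimeline:
    injected_at: float    # seconds since experiment start: fault injected
    detected_at: float    # first alert fired
    recovered_at: float   # metrics back within steady-state thresholds
    requests_total: int
    requests_failed: int


def experiment_metrics(timeline):
    """Derive detection time, mitigation time, and error rate from a timeline."""
    if timeline.requests_total:
        error_rate_pct = 100 * timeline.requests_failed / timeline.requests_total
    else:
        error_rate_pct = 0.0
    return {
        "time_to_detection_s": timeline.detected_at - timeline.injected_at,
        "time_to_mitigation_s": timeline.recovered_at - timeline.injected_at,
        "error_rate_pct": error_rate_pct,
    }


# Hypothetical numbers from a single run.
timeline = ExperimentTimeline(
    injected_at=10, detected_at=42, recovered_at=130,
    requests_total=2000, requests_failed=50,
)
print(experiment_metrics(timeline))
```

Tracking these values across repeated runs shows whether detection and recovery are improving over time.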
Common Errors
1. Running Experiments in Production Without Preparation
Always start in staging. Even then, have a rollback plan. Netflix runs Chaos Monkey in production because they've spent years building resilience — don't start there.
2. No Rollback Plan
If an experiment reveals an unexpected failure, you must be able to restore normal operations immediately. Define the rollback before running the experiment.
3. Testing Only One Failure at a Time
Real incidents often involve multiple simultaneous failures. After mastering single-failure experiments, try combinations: database failure + network latency + pod restart.
4. Ignoring Blast Radius
A chaos experiment that brings down the entire system is not a controlled experiment. Limit the blast radius to a subset of instances or a single service.
5. Not Automating Experiments
Manual experiments don't scale. Automate your chaos experiments to run on a schedule (weekly, monthly) so resilience is continuously validated.
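To make the blast-radius advice from item 4 concrete, target selection can be capped in code before any fault is injected. The following is a minimal sketch. The `pick_targets` helper, its 25% default, and the instance names are assumptions for illustration, not taken from any particular chaos tool.

```python
import math
import random


def pick_targets(instances, max_fraction=0.25, rng=None):
    """Pick a random subset of instances to disrupt.

    Selects about max_fraction of the instances (at least one) and always
    leaves at least one instance untouched.
    """
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    if len(instances) < 2:
        return []  # refuse to target the only instance
    rng = rng or random.Random()
    limit = max(1, math.floor(len(instances) * max_fraction))
    limit = min(limit, len(instances) - 1)
    return rng.sample(list(instances), limit)


# Hypothetical instance names; a seeded generator keeps the run repeatable.
instances = [f"sync-backend-{n}" for n in range(8)]
print(pick_targets(instances, rng=random.Random(42)))
```

Passing a seeded `random.Random` makes a scheduled experiment reproducible, which helps when a run needs to be replayed during analysis.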
Practice Questions
1. What is the first step in a chaos experiment? Define the steady state — measure normal system behavior including latency, error rate, and throughput. This provides the baseline for comparison during the experiment.
2. What is Chaos Monkey? Netflix's tool that randomly terminates instances in production to ensure the system survives instance failures without user impact.
3. What is a GameDay? A scheduled, structured chaos experiment involving the whole team, run like a fire drill to test system resilience and team response procedures.
4. What is the blast radius of a chaos experiment? The set of components affected by the experiment. A well-designed experiment limits the blast radius to a subset of instances or a single service.
5. How do you know if a chaos experiment succeeded? The hypothesis was validated: the system stayed within its steady-state thresholds throughout the failure injection. A disproven hypothesis is still a useful result, because it exposes a resilience gap to fix.
Challenge: Design and run a chaos experiment that tests what happens when the authentication service goes down. Define the hypothesis, set up the experiment, run it in a staging environment with monitoring, and document the results and any resilience gaps found.
Real-World Task: E-Commerce Resilience Testing
Build a chaos testing strategy for an e-commerce platform with these services:
- ProductCatalog — read-heavy, can tolerate brief outages
- ShoppingCart — must remain available during writes
- OrderProcessing — needs database and payment gateway
- PaymentGateway — external dependency, simulate timeout
Run experiments in this order:
1. Kill one instance of ProductCatalog: the service should continue
2. Add latency to the database connection: verify caching kicks in
3. Kill all instances of PaymentGateway: verify graceful error handling
4. Network partition between OrderProcessing and database: test retry logic
Each experiment should document the hypothesis, results, and any code changes needed to improve resilience.
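One possible way to keep that documentation consistent is to describe the plan as data and record each outcome against it. The sketch below encodes the four experiments above. The `Experiment` class, the `run_plan` helper, and the stand-in `run_one` callback are hypothetical, and the actual fault injection is left to whichever tool you use.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    target: str
    fault: str
    hypothesis: str
    result: str = "not run"
    follow_ups: list = field(default_factory=list)  # code changes needed


PLAN = [
    Experiment("ProductCatalog", "kill one instance", "Service continues serving requests"),
    Experiment("database", "add connection latency", "Caching kicks in"),
    Experiment("PaymentGateway", "kill all instances", "Errors are handled gracefully"),
    Experiment("OrderProcessing", "partition from database", "Retry logic recovers"),
]


def run_plan(plan, run_one):
    """Run experiments in order and stop at the first disproven hypothesis."""
    for experiment in plan:
        passed = run_one(experiment)
        experiment.result = "validated" if passed else "disproven"
        if not passed:
            break  # fix the gap before widening the scope
    return [(e.target, e.result) for e in plan]


# Stand-in for real fault injection: pretend the PaymentGateway experiment fails.
report = run_plan(PLAN, lambda e: e.target != "PaymentGateway")
print(report)
```

Stopping at the first disproven hypothesis is a design choice: it keeps later, more disruptive experiments from running on a system already known to have a gap.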
What's Next
| Tutorial | What You'll Learn |
|---|---|
| Testing Microservices Guide | Broader strategies for distributed system testing |
| CI/CD | Automated resilience testing in CI |
| Docker | Container-level fault injection |
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.