Resilience Testing — Circuit Breakers, Retries & Timeouts
In this tutorial, you'll learn about Resilience Testing. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Resilience Testing in Chaos Engineering focuses on verifying that the protective patterns in your application — circuit breakers, retries, timeouts, and bulkheads — work correctly under real failure conditions. These patterns are designed to prevent failures from cascading, but they are often misconfigured or untested.
What You Will Learn
This tutorial teaches you how to test circuit breaker behavior, validate retry and backoff logic, and verify timeout configurations using chaos experiments.
Why It Matters
Misconfigured circuit breakers, unbounded retries, and missing timeouts are among the most common causes of cascading failures in distributed systems. A single slow upstream service can exhaust connection pools across the entire fleet if retries and timeouts are not configured correctly, because every retry adds load to a dependency that is already struggling.
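To make the fleet-wide effect concrete, the worst-case retry multiplication can be computed directly. This is a rough sketch that assumes every layer in a call chain retries independently and every attempt fails; `worst_case_attempts` is a hypothetical helper name, not part of any library.

```python
def worst_case_attempts(retries_per_layer: int, layers: int) -> int:
    """Upper bound on calls reaching the deepest service for one user request.

    Each layer makes 1 initial attempt plus `retries_per_layer` retries, and
    every one of those attempts triggers the full retry budget of the layer
    below it.
    """
    if retries_per_layer < 0 or layers < 1:
        raise ValueError("retries_per_layer must be >= 0 and layers >= 1")
    return (1 + retries_per_layer) ** layers


if __name__ == "__main__":
    for depth in (1, 2, 3):
        # With 3 retries per layer this prints 4, 16, 64.
        print(depth, worst_case_attempts(retries_per_layer=3, layers=depth))
```

Three layers with three retries each can turn one user request into 64 calls against the failing service, which is why retry budgets are worth testing under fault injection.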
Real-World Use
DodaTech tests every circuit breaker configuration with chaos experiments before deploying to production. In one experiment, the team discovered that a 500 ms timeout at the application layer combined with a 450 ms client timeout created a race condition that caused periodic 5xx errors.
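A mismatch like this can be caught before any chaos experiment runs. The sketch below is one possible check, assuming the client is the caller of the application layer and that timeouts are known per layer; `find_timeout_inversions` and the layer names are hypothetical.

```python
from typing import List, Tuple


def find_timeout_inversions(chain: List[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Return (outer, inner) pairs where the caller gives up no later than the callee.

    `chain` is ordered from the outermost caller to the innermost dependency;
    each entry is (layer name, timeout in milliseconds).
    """
    inversions = []
    for (outer, outer_ms), (inner, inner_ms) in zip(chain, chain[1:]):
        if outer_ms <= inner_ms:
            inversions.append((outer, inner))
    return inversions


if __name__ == "__main__":
    # The values from the experiment described above.
    print(find_timeout_inversions([("client", 450), ("application", 500)]))
    # A chain where every caller waits longer than the layer it calls.
    print(find_timeout_inversions([("client", 600), ("application", 500)]))
```

The first call reports the client/application pair as inverted; the second returns an empty list.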
Prerequisites
Before starting you should understand:
- Chaos Engineering basics (steady state, hypothesis, blast radius)
- How circuit breakers work (closed, open, half-open states)
- Application code patterns for retries and timeouts
- Basic Python or JavaScript for reading examples
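As a refresher on the three states, here is a minimal, single-threaded sketch of a circuit breaker state machine. It is an illustration only and not how pybreaker is implemented; `SimpleBreaker` and `CircuitOpenError` are hypothetical names, and the injectable `clock` exists so the reset timeout can be exercised without real waiting.

```python
import time
from typing import Callable


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class SimpleBreaker:
    def __init__(self, fail_max: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, call rejected")
            # Reset timeout elapsed: let exactly this call through as a trial.
            self.state = "half-open"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures while closed, opens the circuit.
            if self.state == "half-open" or self.failures >= self.fail_max:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.state = "closed"
        self.failures = 0
        return result
```

Passing a fake clock (for example `clock=lambda: now[0]`) lets a test jump past the reset timeout and observe the open, half-open, and closed transitions in milliseconds.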
Step 1: Create a Sample Application with Resilience Patterns
Build a simple service that uses a circuit breaker for resilience:
import pybreaker
import requests
from flask import Flask, jsonify

app = Flask(__name__)

# Open the circuit after 3 consecutive failures; allow a trial call after 30 s.
breaker = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=30)


@breaker
def call_external_api():
    # Port 5001 is where the Toxiproxy proxy listens in Step 2.
    result = requests.get("http://localhost:5001/data", timeout=2)
    return result.json()


@app.route("/api/data")
def get_data():
    try:
        data = call_external_api()
        return jsonify(data)
    except pybreaker.CircuitBreakerError:
        # The circuit is open (or this call just tripped it): fail fast.
        return jsonify({"error": "service_unavailable", "cached": True}), 503
    except requests.Timeout:
        return jsonify({"error": "timeout", "retry": True}), 504


if __name__ == "__main__":
    app.run(port=5000)
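The service above needs an upstream to call once the proxy is in place. The tutorial does not show one, so the following is a minimal stand-in using only the standard library, assuming the upstream listens on port 5002 and answers every GET with the success body used later in this tutorial. `DataHandler`, `make_server`, and the file name `upstream.py` are hypothetical.

```python
import json
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer


class DataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Same JSON body for every path, including /data.
        body = json.dumps({"result": "ok", "source": "api"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet during experiments


def make_server(port: int) -> HTTPServer:
    return HTTPServer(("127.0.0.1", port), DataHandler)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python upstream.py 5002
    make_server(int(sys.argv[1])).serve_forever()
```

Start it with `python upstream.py 5002` before creating the proxy, so that requests through port 5001 have somewhere to go.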
Step 2: Inject Latency to Trigger Timeouts
Use Toxiproxy to add latency that exceeds the timeout:
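# timeout-test.yaml
# Illustrative description of the experiment. Toxiproxy does not load this
# file: its own config file is JSON and describes proxies only, so the toxic
# is added with the CLI commands that follow.
proxies:
  - name: slow-api
    listen: "0.0.0.0:5001"
    upstream: "localhost:5002"
    toxics:
      - type: latency
        stream: downstream
        attributes:
          latency: 5000  # milliseconds, well above the 2 s client timeout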
# Depending on the toxiproxy-cli version, flags may need to come before the proxy name.
toxiproxy-cli create slow-api \
  --listen 0.0.0.0:5001 \
  --upstream localhost:5002 && \
toxiproxy-cli toxic add slow-api \
  --type latency \
  --attribute latency=5000
# Test the endpoint
curl -s http://localhost:5000/api/data | jq .
# Expected output:
# {
#   "error": "timeout",
#   "retry": true
# }
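Reading the curl output by eye works once, but a chaos experiment is easier to repeat when the hypothesis is checked in code. The sketch below assumes the Step 1 service is running on port 5000 with its 2-second upstream timeout; `probe`, `timeout_hypothesis_holds`, and the one-second slack are hypothetical choices.

```python
import json
import time
import urllib.error
import urllib.request


def probe(url, client_timeout=10.0):
    """Return (status, parsed JSON body, elapsed seconds) for one GET request."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=client_timeout) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses still carry a body worth inspecting.
        status, raw = err.code, err.read()
    elapsed = time.monotonic() - start
    return status, json.loads(raw), elapsed


def timeout_hypothesis_holds(status, body, elapsed, timeout_s=2.0, slack_s=1.0):
    """True if the service gave up at roughly its configured timeout."""
    return (
        status == 504
        and body.get("error") == "timeout"
        and timeout_s <= elapsed <= timeout_s + slack_s
    )


# Example (requires the Step 1 service and the latency toxic to be running):
#   status, body, elapsed = probe("http://localhost:5000/api/data")
#   print(status, body, round(elapsed, 2))
#   print("hypothesis holds:", timeout_hypothesis_holds(status, body, elapsed))
```

Checking the elapsed time matters as much as the status code: a 504 that arrives after 5 seconds would mean the 2-second timeout is not the one taking effect.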
Step 3: Test Circuit Breaker Opening
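Trigger the circuit breaker by making every upstream call fail:
# Make every upstream call fail: a timeout toxic with timeout=0 drops all
# data, so each request runs into the 2 s client timeout.
# The toxic is named "abort" so it can be removed by that name later.
toxiproxy-cli toxic add slow-api \
  --type timeout \
  --toxicName abort \
  --attribute timeout=0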
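# Make 5 requests to open the circuit breaker
for i in $(seq 1 5); do
  curl -s http://localhost:5000/api/data
  echo ""
done
# Expected output, assuming pybreaker's defaults and a failure count of zero
# (restart the app first: the failed request in Step 2 already counted once).
# The third failure reaches fail_max=3 and pybreaker raises CircuitBreakerError
# on that same call, so it already returns the 503 body:
# {"error": "timeout", "retry": true}
# {"error": "timeout", "retry": true}
# {"error": "service_unavailable", "cached": true}
# {"error": "service_unavailable", "cached": true}
# {"error": "service_unavailable", "cached": true}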
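An open breaker should do more than change the status code: rejected calls should return almost immediately instead of waiting for the upstream. One way to assert that is sketched below, assuming each request has been recorded as a (status, elapsed seconds) pair; `breaker_opened_cleanly` and the 0.5-second threshold are hypothetical.

```python
def breaker_opened_cleanly(samples, fast_s=0.5):
    """Check a sequence of (status, elapsed_seconds) pairs in request order.

    Passes when a 503 appears and every later response is also a 503 that
    returned in under `fast_s` seconds. The first 503 is allowed to be slow,
    because the call that trips the breaker still waits for the upstream.
    """
    statuses = [status for status, _ in samples]
    if 503 not in statuses:
        return False
    first = statuses.index(503)
    later = samples[first + 1:]
    return bool(later) and all(
        status == 503 and elapsed < fast_s for status, elapsed in later
    )


if __name__ == "__main__":
    recorded = [(504, 2.01), (504, 2.02), (503, 2.01), (503, 0.01), (503, 0.01)]
    print(breaker_opened_cleanly(recorded))
```

A run where 504 responses keep appearing after the first 503, or where 503 responses still take seconds, suggests the breaker is not actually short-circuiting the upstream call.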
Step 4: Verify Circuit Breaker Half-Open Recovery
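After the reset timeout, the circuit breaker moves to half-open and lets a single trial request through:
# Remove the faults: the toxic from Step 3 and the latency toxic from Step 2,
# which Toxiproxy names latency_downstream by default
toxiproxy-cli toxic remove slow-api --toxicName abort
toxiproxy-cli toxic remove slow-api --toxicName latency_downstream
# Wait for the circuit breaker reset timeout (30 seconds)
sleep 30
# The next request is the half-open trial call and should succeed
curl -s http://localhost:5000/api/data | jq .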
# Expected output (successful response):
# {"result": "ok", "source": "api"}
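A fixed `sleep 30` only shows that recovery happened at some point. Polling instead measures how long recovery actually took, which can then be compared with the configured reset timeout. The helper below is a sketch with hypothetical names; the `clock` and `sleep` parameters are injectable so the loop can be tested without waiting.

```python
import time


def wait_for_recovery(probe, max_wait_s=60.0, interval_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll `probe()` until it returns True.

    Returns the seconds waited before the first success, or None if
    `max_wait_s` elapsed without one.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= max_wait_s:
            return None
        sleep(interval_s)


# Example (requires the Step 1 service to be running and `requests` installed):
#   import requests
#   healthy = lambda: requests.get("http://localhost:5000/api/data").status_code == 200
#   print("recovered after", wait_for_recovery(healthy), "seconds")
```

With the faults removed, the measured time should land close to the 30-second reset timeout; a much larger value points at a failed trial call re-opening the circuit.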
Step 5: Test Retry with Exponential Backoff
Verify that retries use exponential backoff to avoid a thundering herd:
import time

import requests


def call_with_retry(url, max_retries=3):
    """GET `url`, retrying failed attempts with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=1)
            return response.json()
        except requests.RequestException:
            if attempt == max_retries - 1:
                break  # no point sleeping after the final attempt
            wait = 2 ** attempt  # 1s, then 2s, between the three attempts
            print(f"Attempt {attempt + 1} failed, waiting {wait}s")
            time.sleep(wait)
    raise RuntimeError("All retries exhausted")
# Test retry behavior with a failing endpoint
python -c "
import requests
import time

for attempt in range(3):
    try:
        requests.get('http://localhost:5001/slow', timeout=1)
        break  # stop retrying once a request succeeds
    except requests.RequestException:
        wait = 2 ** attempt
        time.sleep(wait)  # actually back off before the next attempt
        print(f'Attempt {attempt + 1} failed, waited {wait}s')
"
# Expected output:
# Attempt 1 failed, waited 1s
# Attempt 2 failed, waited 2s
# Attempt 3 failed, waited 4s
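Plain exponential backoff still lets clients that failed at the same moment retry at the same moment. A common refinement is to randomize each delay ("full jitter"). The sketch below is one way to generate such delays; `backoff_delays`, the 1-second base, and the 30-second cap are assumptions, not values from this tutorial.

```python
import random


def backoff_delays(attempts, base_s=1.0, cap_s=30.0, rng=None):
    """Return one randomized delay per retry.

    Each delay is drawn uniformly from 0 up to the capped exponential
    ceiling min(cap_s, base_s * 2 ** attempt).
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Seeded only so repeated runs print the same values.
    for i, delay in enumerate(backoff_delays(5, rng=random.Random(42)), start=1):
        print(f"retry {i}: sleep {delay:.2f}s")
```

When testing this under fault injection, the thing to verify is the spread: retries from many clients should no longer arrive in synchronized bursts.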
Learning Path
flowchart LR
    A[Dependency Testing] --> B[Resilience Testing]
    B --> C[Database Faults]
    C --> D[Network Partitioning]
    D --> E[Infrastructure Faults]
    style B fill:#f90,color:#fff
Common Errors
- Setting circuit breaker thresholds too high: If the breaker needs 100 failures before it opens, a real incident can exhaust connection pools long before the breaker trips, so it offers little protection.
- Using constant retry intervals without backoff: Constant retries create a thundering herd that overwhelms the failing service further.
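- Mixing timeout values across layers: Application, client, and infrastructure timeouts must be consistent. Each outer layer (the one closer to the caller) needs a slightly longer timeout than the layer it calls, so the inner call can fail and report its error before the caller gives up.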
- Not testing circuit breaker recovery: The half-open state is the most complex part of circuit breaker logic. Always test that recovery works correctly.
- Retrying on non-retryable errors: Not all errors should be retried. Most 4xx errors indicate a client mistake that a retry will not fix; 408 (Request Timeout) and 429 (Too Many Requests) are the usual exceptions.
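The last point can be turned into a small policy function that a retry wrapper consults before sleeping. This is a sketch under stated assumptions: the set of retryable status codes is a commonly used choice rather than a standard, and `should_retry` is a hypothetical name.

```python
from typing import Optional

# Assumption: a commonly used set, not a standard. Adjust it per API contract.
RETRYABLE_STATUS = frozenset({408, 429, 502, 503, 504})
# Methods whose repetition should not cause additional side effects.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "PUT", "DELETE"})


def should_retry(method: str, status: Optional[int] = None,
                 network_error: bool = False) -> bool:
    """Decide whether a failed request is worth retrying.

    Pass `network_error=True` when no HTTP response arrived at all
    (connection refused, reset, or timed out).
    """
    if method.upper() not in IDEMPOTENT_METHODS:
        return False  # repeating a non-idempotent call may duplicate side effects
    if network_error:
        return True
    return status in RETRYABLE_STATUS


if __name__ == "__main__":
    print(should_retry("GET", 503))   # True
    print(should_retry("GET", 404))   # False
    print(should_retry("POST", 503))  # False
```

A chaos experiment can then inject each class of failure and confirm that the number of upstream calls matches what this policy predicts.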
Practice Questions
- What are the three states of a circuit breaker?
- How does exponential backoff prevent the thundering herd problem?
- Why should you test the half-open state of a circuit breaker?
- What happens when application timeouts and client timeouts are inconsistent?
- How do you verify that a retry policy is working correctly?
Challenge
Build a service with a circuit breaker, retry with exponential backoff, and a configurable timeout. Use chaos experiments to find the minimum timeout value that keeps the system stable under 500ms of injected latency. Document the relationship between latency, timeout, and circuit breaker threshold.