
Resilience Testing — Circuit Breakers, Retries & Timeouts

DodaTech Updated 2026-06-21 5 min read

This tutorial shows how to test the resilience patterns in a service with chaos experiments. It covers the key concepts, works through practical examples with a small Python service and Toxiproxy, and closes with common mistakes to avoid.

Resilience Testing in Chaos Engineering focuses on verifying that the protective patterns in your application — circuit breakers, retries, timeouts, and bulkheads — work correctly under real failure conditions. These patterns are designed to prevent failures from cascading, but they are often misconfigured or untested.

What You Will Learn

This tutorial teaches you how to test circuit breaker behavior, validate retry and backoff logic, and verify timeout configurations using chaos experiments.

Why It Matters

Misconfigured circuit breakers, unbounded retries, and missing timeouts are among the most common causes of cascading failures in distributed systems. A single slow upstream service can exhaust connection pools across the entire fleet if retries and timeouts are not configured correctly.

Real-World Use

DodaTech tests every circuit breaker configuration with chaos experiments before deploying to production. In one experiment the team discovered that a 500ms timeout at the application layer combined with a 450ms client timeout created a race condition: the client could give up 50ms before the application did, which caused periodic 5xx errors.

Prerequisites

Before starting you should understand:

  • Chaos Engineering basics (steady state, hypothesis, blast radius)
  • How circuit breakers work (closed, open, half-open states)
  • Application code patterns for retries and timeouts
  • Basic Python or JavaScript for reading examples
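As a refresher on the three states listed above, here is a minimal sketch of a circuit breaker state machine. It is a teaching aid written for this tutorial, not pybreaker's implementation; the names `SimpleBreaker` and `CircuitOpenError`, the parameters, and the injectable `clock` are all assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleBreaker:
    """Hypothetical teaching sketch of the closed / open / half-open cycle."""

    def __init__(self, fail_max=3, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests do not need to sleep
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "half-open"  # reset timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures while closed, opens the circuit.
            if self.state == "half-open" or self.failures >= self.fail_max:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.state = "closed"
        self.failures = 0
        return result
```

Passing a fake `clock` lets a test jump past the reset timeout instantly, which is the same idea the later steps exercise against a real service with `sleep 30`.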

Step 1: Create a Sample Application with Resilience Patterns

Build a simple service that uses a circuit breaker for resilience:

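import pybreaker
import requests
from flask import Flask, jsonify

app = Flask(__name__)

# Open the circuit after 3 consecutive failures; allow a trial call after 30 s.
breaker = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=30)

@breaker
def call_external_api():
    # Port 5001 is the Toxiproxy listener from Step 2, not the upstream itself.
    result = requests.get("http://localhost:5001/data", timeout=2)
    return result.json()

@app.route("/api/data")
def get_data():
    try:
        data = call_external_api()
        return jsonify(data)
    except pybreaker.CircuitBreakerError:
        # Circuit is open: fail fast without calling the upstream.
        return jsonify({"error": "service_unavailable", "cached": True}), 503
    except requests.Timeout:
        return jsonify({"error": "timeout", "retry": True}), 504
    except requests.RequestException:
        # Refused or reset connections would otherwise surface as an unhandled 500.
        return jsonify({"error": "upstream_error", "retry": True}), 502

if __name__ == "__main__":
    app.run(port=5000)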
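The service above needs something to call. The tutorial does not show the upstream, so the following is a minimal stand-in built on the standard library, assuming it listens on port 5002 (the upstream address used for the proxy in Step 2) and returns the JSON body shown as the successful response in Step 4. The names `StubHandler` and `make_server` are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StubHandler(BaseHTTPRequestHandler):
    """Answers every GET with a fixed JSON body."""

    def do_GET(self):
        body = json.dumps({"result": "ok", "source": "api"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet during experiments


def make_server(port=5002):
    # Port 5002 is an assumption that matches the proxy upstream in Step 2.
    return ThreadingHTTPServer(("127.0.0.1", port), StubHandler)
```

To run the stub, call `make_server().serve_forever()` at the end of the file and start it in its own terminal before running the experiments.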

Step 2: Inject Latency to Trigger Timeouts

Use Toxiproxy to add latency that exceeds the timeout:

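# timeout-test.yaml
# Summary of the experiment for documentation; the commands below are what configure Toxiproxy.
proxies:
  - name: slow-api
    listen: "0.0.0.0:5001"
    upstream: "localhost:5002"
    toxics:
      - type: latency
        stream: downstream
        attributes:
          latency: 5000

# Create the proxy, then attach a 5000 ms latency toxic (the proxy name goes last)
toxiproxy-cli create \
  --listen 0.0.0.0:5001 \
  --upstream localhost:5002 \
  slow-api && \
toxiproxy-cli toxic add \
  --type latency \
  --attribute latency=5000 \
  slow-api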

# Test the endpoint
curl -s http://localhost:5000/api/data | jq .
# Expected output:
# {
#   "error": "timeout",
#   "retry": true
# }
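For automated experiments, the same setup can be driven from Python through Toxiproxy's HTTP API instead of the CLI. The sketch below assumes the API listens on its default address `http://localhost:8474` and accepts `POST /proxies` and `POST /proxies/<name>/toxics` with JSON bodies; the helper names are hypothetical, and the payload fields should be checked against the Toxiproxy version you run.

```python
import json
import urllib.request

TOXIPROXY_API = "http://localhost:8474"  # assumed default API address


def proxy_payload(name, listen, upstream):
    """Body for creating a proxy."""
    return {"name": name, "listen": listen, "upstream": upstream, "enabled": True}


def latency_toxic_payload(latency_ms, jitter_ms=0):
    """Body for a latency toxic applied to every downstream response."""
    return {
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }


def post_json(path, payload, base_url=TOXIPROXY_API):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())


def inject_latency(name="slow-api", latency_ms=5000):
    """Create the proxy from this step and attach the latency toxic."""
    post_json("/proxies", proxy_payload(name, "0.0.0.0:5001", "localhost:5002"))
    return post_json(f"/proxies/{name}/toxics", latency_toxic_payload(latency_ms))
```

Keeping the payload builders separate from the network call makes the experiment definition easy to review and to unit test without a running Toxiproxy server.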

Step 3: Test Circuit Breaker Opening

Trigger the circuit breaker by injecting multiple failures:

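# Make every upstream call fail: the timeout toxic with timeout=0 drops all data
# and keeps the connection open, so each request hits the 2-second client timeout
toxiproxy-cli toxic add \
  --type timeout \
  --toxicName fail-all \
  --attribute timeout=0 \
  slow-api

# Make 5 requests to open the circuit breaker
for i in $(seq 1 5); do
  curl -s http://localhost:5000/api/data
  echo ""
done
# Expected output (with fail_max=3, pybreaker by default raises CircuitBreakerError
# on the third failure, the call that trips the breaker, and fails fast afterwards):
# {"error": "timeout", "retry": true}
# {"error": "timeout", "retry": true}
# {"error": "service_unavailable", "cached": true}
# {"error": "service_unavailable", "cached": true}
# {"error": "service_unavailable", "cached": true}

Step 4: Verify Circuit Breaker Half-Open Recovery

After the reset timeout, the circuit breaker moves to the half-open state and lets a single trial request through. If that request succeeds the breaker closes; if it fails the breaker reopens for another reset period.

# Remove every fault, including the latency toxic from Step 2
# (a toxic added without a name gets the default name <type>_<stream>)
toxiproxy-cli toxic remove --toxicName fail-all slow-api
toxiproxy-cli toxic remove --toxicName latency_downstream slow-api

# Wait for the circuit breaker reset timeout (30 seconds)
sleep 30

# The next request is the half-open trial call and should succeed
curl -s http://localhost:5000/api/data | jq .
# Expected output (successful response):
# {"result": "ok", "source": "api"}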

Step 5: Test Retry with Exponential Backoff

Verify that retries use exponential backoff to avoid a thundering herd:

import time

import requests

def call_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=1)
            return response.json()
        except requests.RequestException:
            wait = 2 ** attempt  # 1s, 2s, 4s
            print(f"Attempt {attempt + 1} failed, waiting {wait}s")
            time.sleep(wait)
    raise RuntimeError("All retries exhausted")

# Test retry behavior with a failing endpoint
python -c "
import requests
import time

for attempt in range(3):
    try:
        requests.get('http://localhost:5001/slow', timeout=1)
        break  # stop retrying once a request succeeds
    except requests.RequestException:
        wait = 2 ** attempt
        time.sleep(wait)  # actually wait before reporting the delay
        print(f'Attempt {attempt+1} failed, waited {wait}s')
"
# Expected output:
# Attempt 1 failed, waited 1s
# Attempt 2 failed, waited 2s
# Attempt 3 failed, waited 4s
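Plain exponential backoff still lets clients that failed at the same moment retry at the same moment. A common refinement is to add random jitter to each delay. The sketch below uses the "full jitter" variant (a uniform random delay between zero and the exponential ceiling); the function names, the `cap` parameter, and the injectable `sleep` and `rng` arguments are assumptions made so the behavior can be tested without waiting.

```python
import random
import time


def backoff_delays(max_retries=3, base=1.0, cap=30.0, rng=random.random):
    """Full-jitter delays: uniform in [0, min(cap, base * 2**attempt)]."""
    return [rng() * min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def retry_with_jitter(func, max_retries=3, base=1.0, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call func until it succeeds, sleeping a jittered delay between attempts."""
    delays = backoff_delays(max_retries, base, cap, rng)
    for attempt, delay in enumerate(delays):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(delay)
```

With `rng=lambda: 1.0` the delays reduce to the plain 1s, 2s, 4s schedule, which makes it easy to compare against the output above; in production the default random source spreads clients across the whole interval.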

Learning Path

Dependency Testing → Resilience Testing (this tutorial) → Database Faults → Network Partitioning → Infrastructure Faults

Common Errors

  1. Setting circuit breaker thresholds too high: If the breaker only opens after 100 failures, it will open too late to protect the system during a real incident.
  2. Using constant retry intervals without backoff: Constant retries create a thundering herd that overwhelms the failing service further.
  3. Mismatched timeout values across layers: Application, client, and infrastructure timeouts must be consistent. Each calling layer needs a longer timeout than the layer it calls, including that layer's retries, so the inner call fails first and the caller can still handle the error.
  4. Not testing circuit breaker recovery: The half-open state is the most complex part of circuit breaker logic. Always test that recovery works correctly.
  5. Retrying on non-retryable errors: Not all errors should be retried. Most 4xx errors indicate client mistakes, and retrying them is wasted effort; 408 (Request Timeout) and 429 (Too Many Requests) are the usual exceptions.
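The timeout-chain rule in item 3 can be checked mechanically before an experiment runs. The following sketch is one way to do it, assuming each layer is described by a name, a per-attempt timeout in seconds, and a number of attempts, listed from the outermost caller inward. The function name and the tuple layout are hypothetical, and backoff sleeps between attempts are deliberately ignored to keep the example short.

```python
def check_timeout_chain(layers):
    """Return a list of problems for layers given outermost caller first.

    Each layer is a (name, timeout_seconds, attempts) tuple. An empty
    result means every caller waits longer than the layer it calls can take.
    """
    problems = []
    for (outer, outer_t, _), (inner, inner_t, inner_attempts) in zip(layers, layers[1:]):
        worst_case = inner_t * inner_attempts  # ignores sleeps between attempts
        if outer_t <= worst_case:
            problems.append(
                f"{outer} ({outer_t}s) can give up before {inner} finishes ({worst_case}s)"
            )
    return problems
```

For example, `check_timeout_chain([("client", 0.45, 1), ("application", 0.5, 1)])` reports one problem, which mirrors the 450ms versus 500ms mismatch described under Real-World Use, assuming the client is the caller in that setup.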

Practice Questions

  1. What are the three states of a circuit breaker?
  2. How does exponential backoff prevent the thundering herd problem?
  3. Why should you test the half-open state of a circuit breaker?
  4. What happens when application timeouts and client timeouts are inconsistent?
  5. How do you verify that a retry policy is working correctly?

Challenge

Build a service with a circuit breaker, retry with exponential backoff, and a configurable timeout. Use chaos experiments to find the minimum timeout value that keeps the system stable under 500ms of injected latency. Document the relationship between latency, timeout, and circuit breaker threshold.

FAQ

What is Resilience Testing in Chaos Engineering?

Resilience Testing uses controlled fault injection to verify that protective patterns like circuit breakers, retries, and timeouts work as expected.

How does a circuit breaker improve system resilience?

A circuit breaker prevents cascading failures by stopping requests to a failing service, allowing it time to recover, and gradually restoring traffic.

What is exponential backoff?

Exponential backoff is a retry strategy where the wait time between retries increases exponentially (1s, 2s, 4s, 8s) to reduce load on the recovering service.

How do you choose the right timeout value?

Measure the p99 response time of the dependency and add a buffer. A common starting point is p99 times 2 or p99 plus 500ms.
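The two starting points mentioned above are simple arithmetic once latency samples are available. Here is a small sketch, assuming the samples are response times in milliseconds collected during steady state; it uses the nearest-rank method for the percentile, and the function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_timeouts(latencies_ms):
    """Return the p99 and the two candidate timeouts derived from it."""
    p99 = percentile(latencies_ms, 99)
    return {
        "p99_ms": p99,
        "p99_times_2_ms": p99 * 2,
        "p99_plus_500_ms": p99 + 500,
    }
```

Both candidates are only starting points; the chaos experiments in this tutorial are what confirm whether the chosen value holds up under injected latency.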

What is the thundering herd problem?

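When many clients retry simultaneously after a failure, they create a flood of requests that overwhelms the recovering service. Exponential backoff, ideally combined with random jitter so that clients do not retry in lockstep, spreads the retries out and reduces this risk.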
