Reliability Patterns — Retries, Circuit Breakers, Timeouts
This tutorial covers reliability patterns through key concepts, practical examples, and best practices.
Reliability patterns such as retries, circuit breakers, and timeouts protect distributed services from cascading failures by anticipating and containing the faults that inevitably occur in networked systems.
What You'll Learn
In this tutorial, you will learn how to implement retries with exponential backoff and jitter, how circuit breakers prevent cascading failures, how to set appropriate timeouts for different service types, and how bulkheads isolate failures between components.
Why It Matters
In a distributed system, every network call can fail. Without reliability patterns, a single slow downstream service can exhaust connection pools, cause cascading timeouts, and bring down the entire system. These patterns contain failures and keep the rest of the system running.
Real-World Use
Doda Browser sync depends on three downstream services: authentication, storage, and notification. A circuit breaker on the notification service prevents a notification outage from affecting file synchronization. Retry logic with exponential backoff handles transient storage failures without user-visible errors.
graph LR
A[Client Request] --> B[Timeout]
B --> C{Response?}
C -->|Success| D[Return]
C -->|Timeout / Error| E[Circuit Breaker]
E --> F{Circuit State?}
F -->|Closed| G[Retry with Backoff]
F -->|Open| H[Fast Fail]
G --> B
G --> I{Too many failures?}
I -->|Yes| J[Open Circuit]
J --> K[Cooldown Period]
K --> L[Half-Open]
L --> C
Prerequisites
Understanding SRE for Microservices helps you see why these patterns matter in distributed architectures. Familiarity with Monitoring and Alerting for SRE is important for tracking pattern effectiveness.
Timeouts
A timeout is the maximum time a client waits for a response. Without timeouts, a slow server causes clients to hang indefinitely, exhausting threads and connection pools.
Setting Timeout Values
import random
import time

class TimeoutClient:
    def __init__(self, timeout_ms):
        self.timeout = timeout_ms / 1000  # store the timeout in seconds

    def call(self, service_name):
        # Simulated service latency between 10 ms and 1 s.
        duration = random.uniform(0.01, 1.0)
        # A real client stops waiting at the timeout, so never sleep longer than that.
        time.sleep(min(duration, self.timeout))
        if duration > self.timeout:
            print(f"TIMEOUT: {service_name} took {duration*1000:.0f}ms (timeout: {self.timeout*1000:.0f}ms)")
            return None
        print(f"OK: {service_name} responded in {duration*1000:.0f}ms")
        return "response"

client = TimeoutClient(200)
for _ in range(5):
    client.call("storage-service")
Example output (latencies are random, so each run differs):
OK: storage-service responded in 15ms
TIMEOUT: storage-service took 950ms (timeout: 200ms)
OK: storage-service responded in 20ms
OK: storage-service responded in 12ms
TIMEOUT: storage-service took 870ms (timeout: 200ms)
Timeout Best Practices
| Service Type | Recommended Timeout | Rationale |
|---|---|---|
| Internal RPC | 50-200ms | Low latency network |
| Database query | 100-500ms | Depends on query complexity |
| External API | 1-5s | Network latency + processing |
| File upload | 30-60s | Large payloads need longer |
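The table can be turned into a per-dependency timeout budget. The sketch below is one possible way to enforce such budgets with the standard library's `concurrent.futures`; the `TIMEOUTS` keys, the chosen values, and the helpers `call_with_timeout`, `fast_query`, and `slow_query` are hypothetical names introduced for illustration. Most HTTP and database clients expose their own timeout parameters, which are usually preferable when available.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical budgets in seconds, picked from the ranges in the table above.
TIMEOUTS = {
    "internal-rpc": 0.2,
    "database": 0.5,
    "external-api": 5.0,
    "file-upload": 60.0,
}

_executor = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(kind, func, *args):
    """Run func(*args) and stop waiting once the budget for `kind` is spent."""
    future = _executor.submit(func, *args)
    try:
        return future.result(timeout=TIMEOUTS[kind])
    except FutureTimeout:
        # The worker thread keeps running; the caller only stops waiting for it.
        return None

def fast_query():
    return "rows"

def slow_query():
    time.sleep(0.5)  # slower than the 0.2 s internal-rpc budget
    return "rows"

print(call_with_timeout("internal-rpc", fast_query))  # rows
print(call_with_timeout("internal-rpc", slow_query))  # None
```

Note that this approach only bounds how long the caller waits. It does not cancel the slow call, so the worker pool itself still needs a size limit.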
Retries with Exponential Backoff
A retry resends a failed request. Exponential backoff doubles the delay after each failed attempt so that retries do not overwhelm a server that is already struggling.
import random
import time

def retry_with_backoff(max_retries=3, base_delay=0.1):
    for attempt in range(1, max_retries + 1):
        print(f"Attempt {attempt}/{max_retries}...")
        # Simulate a call that succeeds about 40% of the time.
        success = random.random() > 0.6
        if success:
            print("  Success!")
            return "response"
        if attempt == max_retries:
            print("  Failed.")
            break  # no point in sleeping after the final attempt
        # The delay doubles after every failure: 0.1s, 0.2s, 0.4s, ...
        delay = base_delay * (2 ** (attempt - 1))
        print(f"  Failed. Retrying in {delay:.1f}s...")
        time.sleep(delay)
    print("All retries exhausted.")
    return None

result = retry_with_backoff(3, 0.1)
Example output (the outcome is random, so each run differs):
Attempt 1/3...
  Failed. Retrying in 0.1s...
Attempt 2/3...
  Failed. Retrying in 0.2s...
Attempt 3/3...
  Success!
Adding Jitter
Jitter adds randomness to the delay to prevent the thundering herd problem, in which many clients that failed at the same moment all retry at the same moment too.
import random
import time

def retry_with_jitter(max_retries=3, base_delay=0.1):
    for attempt in range(1, max_retries + 1):
        delay = base_delay * (2 ** (attempt - 1))
        # Random extra wait of up to 100% of the backoff delay.
        jitter = random.uniform(0, delay)
        total_delay = delay + jitter
        print(f"Attempt {attempt}: delay={delay:.2f}s, jitter={jitter:.2f}s, total={total_delay:.2f}s")
        success = random.random() > 0.5
        if success:
            print("  Success!")
            return
        # time.sleep takes seconds, so pass the delay unscaled.
        time.sleep(total_delay)
    print("All retries exhausted.")

retry_with_jitter(3, 0.1)
Example output (values are random, so each run differs):
Attempt 1: delay=0.10s, jitter=0.08s, total=0.18s
Success!
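The functions above simulate success with a random number. As a rough sketch of how the same backoff-plus-jitter schedule might wrap a real call, the helper below retries any callable that raises a retryable exception. The name `retry_call`, its parameters (`max_delay`, `retry_on`, `sleep`), and the `flaky` function are assumptions made for this example, not part of any library.

```python
import random
import time

def retry_call(func, max_retries=3, base_delay=0.1, max_delay=2.0,
               retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a retryable exception, back off and try again."""
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of attempts: let the caller see the last error
            # Exponential delay, capped at max_delay, plus up to 100% jitter.
            backoff = min(max_delay, base_delay * (2 ** (attempt - 1)))
            sleep(backoff + random.uniform(0, backoff))

calls = {"n": 0}

def flaky():
    # Hypothetical dependency that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "response"

waits = []  # record the delays instead of sleeping, to keep the demo fast
print(retry_call(flaky, sleep=waits.append))  # response
print(len(waits))                             # 2
```

Restricting `retry_on` to transient error types keeps the helper from retrying failures that will never succeed, such as a malformed request.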
Circuit Breaker
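A circuit breaker prevents calls to a failing service, allowing it time to recover. It has three states: closed (calls pass through), open (calls fail immediately), and half-open (a trial call tests whether the service has recovered).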
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds to stay open
        self.failure_count = 0
        self.state = "CLOSED"
        self.last_failure_time = None

    def call(self, service_name, func):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                # Cooldown is over: let one trial call through.
                self.state = "HALF-OPEN"
                print(f"Circuit {service_name}: HALF-OPEN (testing)")
            else:
                print(f"Circuit {service_name}: OPEN — fast failing")
                return None
        try:
            result = func()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                print(f"Circuit {service_name}: OPEN (threshold reached)")
            return None
        if self.state == "HALF-OPEN":
            self.state = "CLOSED"
            print(f"Circuit {service_name}: CLOSED (recovered)")
        # Any success resets the count, so only consecutive failures trip the breaker.
        self.failure_count = 0
        return result

breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=1)

def failing_call():
    raise Exception("Service unavailable")

for i in range(5):
    print(f"\nRequest {i+1}:")
    breaker.call("notification-service", failing_call)
Expected output:
Request 1:
Request 2:
Request 3:
Circuit notification-service: OPEN (threshold reached)
Request 4:
Circuit notification-service: OPEN — fast failing
Request 5:
Circuit notification-service: OPEN — fast failing
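The run above only shows the circuit opening. To illustrate the recovery path through the half-open state, here is a compact sketch with an injectable clock so that the cooldown can be simulated without sleeping. The class `RecoveringBreaker`, its `clock` parameter, and the `down` and `up` functions are hypothetical names for this example. In this variant, a failed trial call reopens the circuit immediately.

```python
import time

class RecoveringBreaker:
    """Compact three-state breaker with an injectable clock (illustrative only)."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        self.state = "CLOSED"

    def call(self, func):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                return None              # still cooling down: fast fail
            self.state = "HALF-OPEN"     # cooldown over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            return None
        self.failures = 0
        self.state = "CLOSED"
        return result

now = [0.0]  # fake clock so the demo needs no sleeping
breaker = RecoveringBreaker(failure_threshold=3, recovery_timeout=30.0,
                            clock=lambda: now[0])

def down():
    raise RuntimeError("Service unavailable")

def up():
    return "ok"

states = []
for _ in range(3):
    breaker.call(down)
states.append(breaker.state)   # OPEN after three failures
breaker.call(up)
states.append(breaker.state)   # still OPEN: the call was rejected
now[0] += 31                   # cooldown elapses
breaker.call(down)
states.append(breaker.state)   # trial call failed: OPEN again
now[0] += 31
result = breaker.call(up)
states.append(breaker.state)   # trial call succeeded: CLOSED
print(states, result)          # ['OPEN', 'OPEN', 'OPEN', 'CLOSED'] ok
```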
Bulkhead
A bulkhead isolates resources so that failure in one component does not exhaust resources for others. The pattern is named after the watertight compartments in a ship's hull, which keep a breach in one section from flooding the entire vessel.
from threading import Semaphore

class Bulkhead:
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.semaphore = Semaphore(max_concurrent)

    def execute(self, task_name, func):
        # Non-blocking acquire: reject immediately when every slot is in use
        # instead of queueing callers behind a slow dependency.
        acquired = self.semaphore.acquire(blocking=False)
        if not acquired:
            print(f"BULKHEAD: {task_name} rejected (no capacity)")
            return None
        try:
            print(f"BULKHEAD: {task_name} executing (limit: {self.max_concurrent})")
            return func()
        finally:
            self.semaphore.release()

# Separate pools: exhausting the auth slots leaves the storage slots untouched.
auth_bulkhead = Bulkhead(5)
storage_bulkhead = Bulkhead(10)
print("Auth bulkhead capacity: 5")
print("Storage bulkhead capacity: 10")
print("Failure in auth does not affect storage")
Expected output:
Auth bulkhead capacity: 5
Storage bulkhead capacity: 10
Failure in auth does not affect storage
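The output above only reports the configured capacities. The following sketch shows a rejection actually happening, assuming a bulkhead that turns extra callers away rather than queueing them. The class `RejectingBulkhead` and the helper names are hypothetical, and a barrier and an event are used purely to make the demo deterministic.

```python
import threading

class RejectingBulkhead:
    """Caps concurrent calls; extra callers are rejected instead of queued."""

    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def execute(self, func):
        if not self._slots.acquire(blocking=False):
            return "rejected"
        try:
            return func()
        finally:
            self._slots.release()

bulkhead = RejectingBulkhead(max_concurrent=2)
all_started = threading.Barrier(3)  # two workers plus the main thread
release = threading.Event()

def slow_task():
    all_started.wait()  # signal that this call holds a slot
    release.wait()      # simulate a slow dependency
    return "done"

results = []

def worker():
    results.append(bulkhead.execute(slow_task))

workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()
all_started.wait()                        # both slots are now taken
third = bulkhead.execute(lambda: "done")  # no free slot: rejected at once
release.set()
for w in workers:
    w.join()
after = bulkhead.execute(lambda: "done")  # slots are free again
print(third, sorted(results), after)      # rejected ['done', 'done'] done
```

Rejecting immediately keeps callers from piling up behind a slow dependency, which is the resource exhaustion the bulkhead is meant to prevent.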
Common Errors
| Error | Explanation |
|---|---|
| No timeout | Without timeouts, a slow server hangs clients indefinitely, exhausting resources. |
| Infinite retries | Retrying forever makes the problem worse. Always set a max retry count. |
| No jitter | All clients retry at the same time, causing the thundering herd problem. |
| Circuit breaker too sensitive | A breaker that trips on a handful of transient failures blocks healthy traffic and does more harm than good. Use a meaningful threshold. |
| No circuit breaker | Without a breaker, failures cascade from one service to another. |
| Same timeout for all calls | Different services need different timeouts. A database query needs less time than a file upload. |
Practice Questions
- Why do you need a timeout for every network call?
- What problem does jitter solve in retry logic?
- What are the three states of a circuit breaker?
- How does a bulkhead pattern prevent cascading failures?
- Why should different services have different timeout values?
Challenge
Design a reliability pattern implementation for DodaZIP cloud storage. The service calls three downstream dependencies: an authentication service (fast, critical), a file compression worker (slow, batch), and a notification service (non-critical). Define timeouts, retry policies with backoff, circuit breaker thresholds, and bulkhead limits for each dependency.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro