Reliability Patterns — Retries, Circuit Breakers, Timeouts

DodaTech Updated 2026-06-23 6 min read

This tutorial covers four reliability patterns (timeouts, retries, circuit breakers, and bulkheads) with runnable Python examples and best practices.

Reliability patterns like retries, circuit breakers, and timeouts protect distributed services from cascading failures by anticipating and handling the inevitable failures that occur in networked systems.

What You'll Learn

In this tutorial, you will learn how to implement retries with exponential backoff and jitter, how circuit breakers prevent cascading failures, how to set appropriate timeouts for different service types, and how bulkheads isolate failures between components.

Why It Matters

In a distributed system, every network call can fail. Without reliability patterns, a single slow downstream service can exhaust connection pools, cause cascading timeouts, and bring down the entire system. These patterns contain failures and keep the rest of the system running.

Real-World Use

Doda Browser sync depends on three downstream services: authentication, storage, and notification. A circuit breaker on the notification service prevents a notification outage from affecting file synchronization. Retry logic with exponential backoff handles transient storage failures without user-visible errors.

graph LR
    A[Client Request] --> B[Timeout]
    B --> C{Response?}
    C -->|Success| D[Return]
    C -->|Timeout / Error| E[Circuit Breaker]
    E --> F{Circuit State?}
    F -->|Closed| G[Retry with Backoff]
    F -->|Open| H[Fast Fail]
    G --> I{Too many failures?}
    I -->|No| B
    I -->|Yes| J[Open Circuit]
    J --> K[Cooldown Period]
    K --> L[Half-Open]
    L -->|Trial request| B

Prerequisites

Understanding SRE for Microservices helps you see why these patterns matter in distributed architectures. Familiarity with Monitoring and Alerting for SRE is important for tracking pattern effectiveness.

Timeouts

A timeout is the maximum time a client waits for a response. Without timeouts, a slow server causes clients to hang indefinitely, exhausting threads and connection pools.

Setting Timeout Values

import random
import time

class TimeoutClient:
    def __init__(self, timeout_ms):
        self.timeout = timeout_ms / 1000  # convert milliseconds to seconds

    def call(self, service_name):
        # Simulated service latency: anywhere from 10 ms to 1 s.
        duration = random.uniform(0.01, 1.0)
        # Wait for the response, but never longer than the timeout.
        time.sleep(min(duration, self.timeout))
        if duration > self.timeout:
            # The simulated latency is printed for illustration only;
            # a real client knows just that the deadline passed.
            print(f"TIMEOUT: {service_name} took {duration*1000:.0f}ms (timeout: {self.timeout*1000:.0f}ms)")
            return None
        print(f"OK: {service_name} responded in {duration*1000:.0f}ms")
        return "response"

client = TimeoutClient(200)
for _ in range(5):
    client.call("storage-service")

Example output (latencies are random, so the values differ on every run):

OK: storage-service responded in 15ms
TIMEOUT: storage-service took 950ms (timeout: 200ms)
OK: storage-service responded in 20ms
OK: storage-service responded in 12ms
TIMEOUT: storage-service took 870ms (timeout: 200ms)

Timeout Best Practices

| Service Type | Recommended Timeout | Rationale |
| --- | --- | --- |
| Internal RPC | 50-200 ms | Low latency network |
| Database query | 100-500 ms | Depends on query complexity |
| External API | 1-5 s | Network latency + processing |
| File upload | 30-60 s | Large payloads need longer |
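
The simulated client above only pretends to wait. As a minimal sketch of enforcing a real deadline, the snippet below runs a call in a worker thread and stops waiting once a per-dependency budget has passed. The `TIMEOUTS` mapping, `call_with_timeout`, and `fake_service` are hypothetical names chosen for illustration, and the budgets are simply picked from the ranges in the table.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical per-dependency budgets in seconds (assumed values, not measurements).
TIMEOUTS = {"internal-rpc": 0.2, "database": 0.5, "external-api": 5.0}

_pool = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(kind, func, *args):
    """Run func in a worker thread and stop waiting after the budget for `kind`."""
    future = _pool.submit(func, *args)
    try:
        return future.result(timeout=TIMEOUTS[kind])
    except FutureTimeout:
        # The worker thread keeps running in the background;
        # the caller only stops waiting for it.
        return None

def fake_service(latency_s):
    """Stand-in for a network call that takes latency_s seconds."""
    time.sleep(latency_s)
    return "response"

fast = call_with_timeout("internal-rpc", fake_service, 0.01)
slow = call_with_timeout("internal-rpc", fake_service, 0.6)
print(fast, slow)  # response None
```

Note that this approach bounds how long the caller waits, not how long the work runs. Where a client library offers its own timeout parameter, that is usually the better place to set the limit.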

Retries with Exponential Backoff

A retry resends a failed request. Exponential backoff increases the delay between retries to avoid overwhelming the server.

import random
import time

def retry_with_backoff(max_retries=3, base_delay=0.1):
    for attempt in range(1, max_retries + 1):
        print(f"Attempt {attempt}/{max_retries}...")
        success = random.random() > 0.6  # simulated call: succeeds about 40% of the time
        if success:
            print("  Success!")
            return "response"
        if attempt == max_retries:
            break  # no point in waiting after the final attempt
        # Double the delay after every failed attempt: 0.1s, 0.2s, 0.4s, ...
        delay = base_delay * (2 ** (attempt - 1))
        print(f"  Failed. Retrying in {delay:.1f}s...")
        time.sleep(delay)
    print("All retries exhausted.")
    return None

result = retry_with_backoff(3, 0.1)

Example output (outcomes are random, so runs differ):

Attempt 1/3...
  Failed. Retrying in 0.1s...
Attempt 2/3...
  Failed. Retrying in 0.2s...
Attempt 3/3...
  Success!
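
The example above simulates success with a random number. A sketch of the same backoff logic wrapped around a real callable might look like the following. `TransientError`, `retry`, and `flaky_storage_read` are hypothetical names introduced here, and the `sleep` parameter exists only so the waits can be recorded instead of actually slept.

```python
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a dropped connection)."""

def retry(func, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func until it succeeds or max_attempts is reached, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))

# Usage: a stand-in dependency that fails twice, then succeeds.
calls = {"count": 0}

def flaky_storage_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("storage temporarily unavailable")
    return "file-contents"

delays = []  # record the waits instead of sleeping
result = retry(flaky_storage_read, max_attempts=3, base_delay=0.1, sleep=delays.append)
print(result, delays)  # file-contents [0.1, 0.2]
```

Only `TransientError` is retried. Any other exception propagates immediately, which keeps the wrapper from retrying errors that will never succeed, such as a bad request.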

Adding Jitter

Jitter adds randomness to the delay to prevent the thundering herd problem where all clients retry at the same time.

import random
import time

def retry_with_jitter(max_retries=3, base_delay=0.1):
    for attempt in range(1, max_retries + 1):
        delay = base_delay * (2 ** (attempt - 1))
        # Add a random extra wait of up to one full delay so clients spread out.
        jitter = random.uniform(0, delay)
        total_delay = delay + jitter
        print(f"Attempt {attempt}: delay={delay:.2f}s, jitter={jitter:.2f}s, total={total_delay:.2f}s")
        success = random.random() > 0.5  # simulated call: succeeds about half the time
        if success:
            print("  Success!")
            return
        if attempt < max_retries:
            time.sleep(total_delay)  # wait the full jittered delay before the next attempt
    print("All retries exhausted.")

retry_with_jitter(3, 0.1)

Example output (random, so it varies per run):

Attempt 1: delay=0.10s, jitter=0.08s, total=0.18s
  Success!
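
The example above adds jitter on top of the exponential delay. A commonly used alternative, often called "full jitter", instead draws the whole wait uniformly between zero and the exponential ceiling, and caps that ceiling so delays cannot grow without bound. The sketch below shows that variant. The function name `backoff_delays` and the `max_delay` cap are assumptions made for illustration.

```python
import random

def backoff_delays(max_retries, base_delay, max_delay, rng=random):
    """Return one randomized wait per retry, each uniform in [0, min(max_delay, base_delay * 2**n)]."""
    delays = []
    for n in range(max_retries):
        ceiling = min(max_delay, base_delay * 2 ** n)  # exponential growth, capped
        delays.append(rng.uniform(0, ceiling))
    return delays

# A seeded generator makes the schedule reproducible for the demo.
rng = random.Random(42)
for n, wait in enumerate(backoff_delays(5, 0.1, 1.0, rng), start=1):
    print(f"retry {n}: wait {wait:.3f}s")
```

With `base_delay=0.1` and `max_delay=1.0`, the ceilings are 0.1, 0.2, 0.4, 0.8, and 1.0 seconds, and every wait falls somewhere below its ceiling.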

Circuit Breaker

A circuit breaker prevents calls to a failing service, allowing it time to recover. It has three states: closed, open, and half-open.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds to stay open before probing
        self.failure_count = 0  # consecutive failures
        self.state = "CLOSED"
        self.last_failure_time = None

    def call(self, service_name, func):
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                # Cooldown elapsed: let one trial request through.
                self.state = "HALF-OPEN"
                print(f"Circuit {service_name}: HALF-OPEN (testing)")
            else:
                print(f"Circuit {service_name}: OPEN (fast failing)")
                return None

        try:
            result = func()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed trial in HALF-OPEN also lands here, because the count
            # is already at or above the threshold.
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                print(f"Circuit {service_name}: OPEN (threshold reached)")
            return None

        if self.state == "HALF-OPEN":
            self.state = "CLOSED"
            print(f"Circuit {service_name}: CLOSED (recovered)")
        self.failure_count = 0  # any success resets the consecutive-failure count
        return result

breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=1)

def failing_call():
    raise Exception("Service unavailable")

for i in range(5):
    print(f"Request {i+1}:")
    breaker.call("notification-service", failing_call)

Expected output:

Request 1:
Request 2:
Request 3:
Circuit notification-service: OPEN (threshold reached)
Request 4:
Circuit notification-service: OPEN (fast failing)
Request 5:
Circuit notification-service: OPEN (fast failing)
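
The run above never leaves the open state, because all five requests happen within the cooldown. To see the full closed, open, half-open, closed cycle without real waiting, the sketch below uses a compact breaker with an injectable clock. `FakeClock`, `Breaker`, `down`, and `up` are hypothetical names for this illustration, not part of the class above.

```python
class FakeClock:
    """Manually advanced clock so the demo does not need real sleeps."""
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

class Breaker:
    def __init__(self, failure_threshold, recovery_timeout, clock):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def call(self, func):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                return None  # still cooling down: fail fast without calling func
            self.state = "HALF-OPEN"  # cooldown over: allow one trial request
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial reopens immediately; otherwise open at the threshold.
            if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            return None
        self.state = "CLOSED"
        self.failures = 0
        return result

def down():
    raise RuntimeError("service unavailable")

def up():
    return "sent"

clock = FakeClock()
breaker = Breaker(failure_threshold=3, recovery_timeout=30, clock=clock)
states = []

for _ in range(3):
    breaker.call(down)
states.append(breaker.state)   # threshold reached

breaker.call(up)               # rejected: still inside the cooldown
states.append(breaker.state)

clock.now = 31                 # pretend 31 seconds have passed
result = breaker.call(up)      # trial request succeeds
states.append(breaker.state)

print(states, result)  # ['OPEN', 'OPEN', 'CLOSED'] sent
```

Passing the clock in as a parameter is a design choice that makes the time-dependent transitions easy to test. Production code would typically pass `time.monotonic`.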

Bulkhead

A bulkhead isolates resources so that failure in one component does not exhaust resources for others. The pattern is named after the watertight compartments in a ship's hull, which keep flooding in one section from sinking the entire vessel.

from threading import Semaphore

class Bulkhead:
    def __init__(self, max_concurrent):
        # One permit per call that may run at the same time.
        self.semaphore = Semaphore(max_concurrent)

    def execute(self, task_name, func):
        # Non-blocking acquire: with blocking=True the call would wait for a
        # free permit and the rejection branch below could never run.
        acquired = self.semaphore.acquire(blocking=False)
        if not acquired:
            print(f"BULKHEAD: {task_name} rejected (no capacity)")
            return None
        try:
            print(f"BULKHEAD: {task_name} executing")
            return func()
        finally:
            self.semaphore.release()

# Separate bulkheads give each dependency its own pool of permits.
auth_bulkhead = Bulkhead(5)
storage_bulkhead = Bulkhead(10)

print("Auth bulkhead capacity: 5")
print("Storage bulkhead capacity: 10")
print("Failure in auth does not affect storage")

Expected output:

Auth bulkhead capacity: 5
Storage bulkhead capacity: 10
Failure in auth does not affect storage
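
The snippet above only constructs the bulkheads. To see one actually shed load, the sketch below sends five concurrent callers at a bulkhead with two slots while the dependency hangs. `run_demo`, `slow_call`, and the simplified `Bulkhead` here are hypothetical names for this illustration, and the 5-second wait is an arbitrary safety limit.

```python
import threading
import time

class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def execute(self, func):
        # Non-blocking acquire: reject immediately instead of queueing behind slow calls.
        if not self._slots.acquire(blocking=False):
            return "rejected"
        try:
            return func()
        finally:
            self._slots.release()

def run_demo(max_concurrent=2, callers=5):
    bulkhead = Bulkhead(max_concurrent)
    release = threading.Event()
    outcomes = []
    lock = threading.Lock()

    def slow_call():
        release.wait(timeout=5)  # simulate a dependency that is hanging
        return "ok"

    def caller():
        outcome = bulkhead.execute(slow_call)
        with lock:
            outcomes.append(outcome)

    threads = [threading.Thread(target=caller) for _ in range(callers)]
    for t in threads:
        t.start()
    # Wait until every caller beyond the limit has been turned away.
    while True:
        with lock:
            if len(outcomes) >= callers - max_concurrent:
                break
        time.sleep(0.01)
    release.set()  # let the hanging calls finish
    for t in threads:
        t.join()
    return sorted(outcomes)

print(run_demo())  # ['ok', 'ok', 'rejected', 'rejected', 'rejected']
```

Two callers occupy the slots and the other three are rejected right away, so the hanging dependency ties up at most two threads instead of all five.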

Common Errors

| Error | Explanation |
| --- | --- |
| No timeout | Without timeouts, a slow server hangs clients indefinitely, exhausting resources. |
| Infinite retries | Retrying forever makes the problem worse. Always set a max retry count. |
| No jitter | All clients retry at the same time, causing the thundering herd problem. |
| Circuit breaker too sensitive | A breaker that trips on a handful of transient failures cuts off a healthy service and does more harm than good. Use a meaningful threshold. |
| No circuit breaker | Without a breaker, failures cascade from one service to another. |
| Same timeout for all calls | Different services need different timeouts. A database query needs less time than a file upload. |

Practice Questions

  1. Why do you need a timeout for every network call?
  2. What problem does jitter solve in retry logic?
  3. What are the three states of a circuit breaker?
  4. How does a bulkhead pattern prevent cascading failures?
  5. Why should different services have different timeout values?

Challenge

Design a reliability pattern implementation for DodaZIP cloud storage. The service calls three downstream dependencies: an authentication service (fast, critical), a file compression worker (slow, batch), and a notification service (non-critical). Define timeouts, retry policies with backoff, circuit breaker thresholds, and bulkhead limits for each dependency.

FAQ

What is exponential backoff?

Exponential backoff is a retry strategy where the delay between retries grows by a constant factor (typically doubling) after each attempt, so that repeated retries do not overwhelm the failing service.

What is a circuit breaker pattern?

A circuit breaker monitors for failures. When failures exceed a threshold, it opens the circuit and subsequent calls fail fast without reaching the failing service.

When should I use a bulkhead?

Use a bulkhead when you need to guarantee that one slow or failing component does not consume all threads or connections in the pool.

What is the thundering herd problem?

When many clients retry simultaneously after a failure, the retry storm can overwhelm the recovering server. Jitter randomizes retry timing to prevent this.

How many retries should I use?

Three to five retries is a common starting point for most services. If a call still fails after five retries, the problem is usually not transient and needs human intervention.
