SRE for Microservices — Distributed Systems Reliability

DodaTech Updated 2026-06-23 8 min read

Microservices architectures distribute functionality across many small, independently deployable services. That also spreads failure modes, latency sources, and debugging complexity across a much larger surface area that SRE must manage.

What You'll Learn

In this tutorial, you will learn the unique reliability challenges of microservices compared to monoliths, how to use service mesh for traffic management and observability, how to implement distributed tracing for debugging and latency analysis, and how to design for graceful degradation when dependencies fail.

Why It Matters

A monolith fails in relatively few ways: broadly, the whole application is up, down, or slow. A microservices system has many more failure modes: any service can be slow, unavailable, or returning wrong data, and each failure can cascade to the services that call it. Without SRE practices designed for distributed systems, microservices become unmanageable.

Real-World Use

DodaTech runs over 40 microservices for the Doda Browser platform. A service mesh (Istio) handles traffic management, retries, and observability. Distributed tracing using OpenTelemetry tracks requests across services. Circuit breakers prevent cascading failures. The DodaTech SRE team manages this complexity through standardized patterns and centralized observability.

graph TD
    A[API Gateway] --> B[Auth Service]
    A --> C[Sync Service]
    A --> D[Storage Service]
    C --> E[Database]
    D --> F[Object Store]
    C --> G[Notification Service]
    G --> H[Push Notifications]
    C -.->|Circuit Breaker| G
    B --> I[User DB]
    style G fill:#f96

Prerequisites

Understanding Reliability Patterns is essential since retries, circuit breakers, and timeouts are the building blocks of microservices reliability. Familiarity with Monitoring and Alerting for SRE is needed for the observability layer.

Microservices Reliability Challenges

Challenge	Monolith	Microservices
Failure modes	One (up or down)	Many (partial failures)
Debugging	Single process	Distributed traces across services
Latency	Predictable	Variable due to network hops
Deployments	One deployment	Independent deployments per service
Dependencies	In-process calls	Network calls (can fail)

Service Mesh

A service mesh provides a dedicated infrastructure layer for service-to-service communication, handling retries, timeouts, circuit breaking, and observability without modifying application code.

class ServiceMeshConfig:
    def __init__(self, service_name):
        self.service = service_name
        self.rules = []

    def add_retry_rule(self, max_retries, timeout_ms):
        self.rules.append({
            "type": "retry",
            "max_retries": max_retries,
            "timeout_ms": timeout_ms
        })
        print(f"Added retry rule: {max_retries} retries, {timeout_ms}ms timeout")

    def add_circuit_breaker(self, consecutive_errors, cooldown_sec):
        self.rules.append({
            "type": "circuit_breaker",
            "consecutive_errors": consecutive_errors,
            "cooldown_sec": cooldown_sec
        })
        print(f"Added circuit breaker: {consecutive_errors} errors, {cooldown_sec}s cooldown")

    def add_timeout(self, timeout_ms):
        self.rules.append({
            "type": "timeout",
            "timeout_ms": timeout_ms
        })
        print(f"Added timeout: {timeout_ms}ms")

    def apply(self):
        print(f"\nApplying mesh config to {self.service}:")
        for rule in self.rules:
            print(f"  - {rule['type']}: {rule}")

mesh = ServiceMeshConfig("sync-service")
mesh.add_retry_rule(3, 200)
mesh.add_circuit_breaker(5, 30)
mesh.add_timeout(500)
mesh.apply()

Expected output:

Added retry rule: 3 retries, 200ms timeout
Added circuit breaker: 5 errors, 30s cooldown
Added timeout: 500ms

Applying mesh config to sync-service:
  - retry: {'type': 'retry', 'max_retries': 3, 'timeout_ms': 200}
  - circuit_breaker: {'type': 'circuit_breaker', 'consecutive_errors': 5, 'cooldown_sec': 30}
  - timeout: {'type': 'timeout', 'timeout_ms': 500}
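The circuit breaker rule configured above (5 consecutive errors, 30-second cooldown) only records the settings. How such a rule might behave at call time can be sketched as a small state machine. This is a minimal illustration, not a description of how Istio or any particular mesh implements it; the `CircuitBreaker` class, its `call` method, and the injectable `clock` are hypothetical names introduced for this example.

```python
import time

class CircuitBreaker:
    """Hypothetical sketch: opens after N consecutive errors, retries after a cooldown."""

    def __init__(self, consecutive_errors, cooldown_sec, clock=time.monotonic):
        self.threshold = consecutive_errors
        self.cooldown_sec = cooldown_sec
        self.clock = clock        # injectable so the example can fake time
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_sec:
                # Open: fail fast without contacting the dependency.
                raise RuntimeError("circuit open: call rejected")
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.threshold:
                self.opened_at = self.clock()   # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the error count.
        self.failures = 0
        self.opened_at = None
        return result

# Usage with a fake clock so the cooldown can be simulated instantly.
now = [0.0]
breaker = CircuitBreaker(consecutive_errors=5, cooldown_sec=30, clock=lambda: now[0])

def failing_call():
    raise ConnectionError("notification-service unreachable")

outcomes = []
for _ in range(6):
    try:
        breaker.call(failing_call)
    except ConnectionError:
        outcomes.append("error")
    except RuntimeError:
        outcomes.append("rejected")

now[0] = 31.0                      # pretend the cooldown has passed
outcomes.append(breaker.call(lambda: "ok"))
print(outcomes)
```

The first five calls reach the failing dependency, the sixth is rejected immediately, and the call after the cooldown succeeds and closes the circuit again. Failing fast on the sixth call is what keeps a struggling dependency from tying up its callers.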

Distributed Tracing

Distributed tracing tracks a single request as it traverses multiple services. Each trace has a trace ID, spans representing operations, and timing data.

import random
import time
import uuid

class DistributedTrace:
    def __init__(self, trace_id=None):
        self.trace_id = trace_id or str(uuid.uuid4())[:8]
        self.spans = []

    def start_span(self, service, operation):
        span = {
            "trace_id": self.trace_id,
            "span_id": str(uuid.uuid4())[:8],
            "service": service,
            "operation": operation,
            "start": time.time(),
            "end": None,
            "duration_ms": None
        }
        self.spans.append(span)
        print(f"[{self.trace_id}] START {service}/{operation}")
        return span

    def end_span(self, span):
        span["end"] = time.time()
        span["duration_ms"] = (span["end"] - span["start"]) * 1000
        print(f"[{self.trace_id}] END   {span['service']}/{span['operation']} ({span['duration_ms']:.1f}ms)")

    def trace_report(self):
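        # Spans are nested, so summing durations would double-count child time.
        # End-to-end latency is the earliest start to the latest end.
        done = [s for s in self.spans if s["end"] is not None]
        total_ms = (max(s["end"] for s in done) - min(s["start"] for s in done)) * 1000 if done else 0.0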
        print(f"\nTrace {self.trace_id} — Total: {total_ms:.1f}ms")
        for s in self.spans:
            if s["duration_ms"] is not None:
                pct = (s["duration_ms"] / total_ms) * 100 if total_ms > 0 else 0
                print(f"  {s['service']:20s} {s['operation']:30s} {s['duration_ms']:8.1f}ms ({pct:.0f}%)")

trace = DistributedTrace()
span1 = trace.start_span("api-gateway", "handle_request")
time.sleep(random.uniform(0.01, 0.05))
span2 = trace.start_span("sync-service", "sync_files")
time.sleep(random.uniform(0.02, 0.08))
span3 = trace.start_span("storage-service", "store_file")
time.sleep(random.uniform(0.05, 0.1))
trace.end_span(span3)
trace.end_span(span2)
trace.end_span(span1)
trace.trace_report()

Expected output (the trace ID and timings vary from run to run):

[abc123] START api-gateway/handle_request
[abc123] START sync-service/sync_files
[abc123] START storage-service/store_file
[abc123] END   storage-service/store_file (80.2ms)
[abc123] END   sync-service/sync_files (110.5ms)
[abc123] END   api-gateway/handle_request (150.8ms)

Trace abc123 — Total: 150.8ms
  api-gateway          handle_request                    150.8ms (100%)
  sync-service         sync_files                        110.5ms (73%)
  storage-service      store_file                         80.2ms (53%)
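Because the spans in this example are nested, each span's duration includes the time spent in the services it calls. One rough way to see where the time actually goes is to compute self time: a span's duration minus the durations of its direct children. The sketch below assumes each span records a `parent` field and that child calls run one after another, neither of which the `DistributedTrace` class above tracks; the `self_times` function is hypothetical, and the numbers are copied from the sample output.

```python
def self_times(spans):
    """Return {span name: duration minus the durations of its direct children}."""
    child_total = {}
    for span in spans:
        parent = span["parent"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0.0) + span["duration_ms"]
    return {
        span["name"]: round(span["duration_ms"] - child_total.get(span["name"], 0.0), 1)
        for span in spans
    }

# Durations taken from the sample trace; "parent" is an assumed extra field.
spans = [
    {"name": "api-gateway", "parent": None, "duration_ms": 150.8},
    {"name": "sync-service", "parent": "api-gateway", "duration_ms": 110.5},
    {"name": "storage-service", "parent": "sync-service", "duration_ms": 80.2},
]
result = self_times(spans)
print(result)
# {'api-gateway': 40.3, 'sync-service': 30.3, 'storage-service': 80.2}
```

Read this way, the storage service accounts for more than half of the request's latency, even though the gateway span is the longest one in the raw report.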

Graceful Degradation

When a dependency fails, the service should degrade gracefully rather than failing entirely.

import random

class GracefulDegradation:
    def __init__(self, critical_deps, non_critical_deps):
        self.critical = critical_deps
        self.non_critical = non_critical_deps

    def handle_request(self, feature, deps_needed):
        available = []
        unavailable = []

        for dep in deps_needed:
            is_healthy = random.random() > 0.3
            if is_healthy:
                available.append(dep)
            else:
                unavailable.append(dep)

        print(f"Feature: {feature}")
        print(f"  Dependencies needed: {deps_needed}")
        print(f"  Available: {available}")
        print(f"  Unavailable: {unavailable}")

        can_degrade = all(d in self.non_critical for d in unavailable)
        if len(available) == len(deps_needed):
            print(f"  Result: FULL SERVICE")
        elif can_degrade:
            print(f"  Result: DEGRADED SERVICE (non-critical deps missing)")
        else:
            print(f"  Result: SERVICE UNAVAILABLE (critical deps missing)")

dg = GracefulDegradation(
    critical_deps=["auth", "database"],
    non_critical_deps=["notifications", "analytics", "recommendations"]
)

dg.handle_request("File Sync", ["auth", "database", "notifications"])
dg.handle_request("Recommendations", ["auth", "recommendations"])

Expected output (one possible run; dependency health is simulated randomly):

Feature: File Sync
  Dependencies needed: ['auth', 'database', 'notifications']
  Available: ['auth', 'database']
  Unavailable: ['notifications']
  Result: DEGRADED SERVICE (non-critical deps missing)
Feature: Recommendations
  Dependencies needed: ['auth', 'recommendations']
  Available: ['recommendations']
  Unavailable: ['auth']
  Result: SERVICE UNAVAILABLE (critical deps missing)
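In application code, graceful degradation often takes the form of a fallback wrapped around the non-critical call, while failures of critical dependencies are allowed to propagate. The following is a minimal sketch under that assumption; `load_files`, `fetch_recommendations`, and `build_home_page` are hypothetical functions standing in for real service clients.

```python
def load_files(user_id):
    # Hypothetical critical dependency: a failure here should propagate.
    return ["report.pdf", "notes.txt"]

def fetch_recommendations(user_id):
    # Hypothetical non-critical dependency; in this sketch it always fails.
    raise TimeoutError("recommendations service timed out")

def build_home_page(user_id):
    page = {"files": load_files(user_id), "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except (TimeoutError, ConnectionError):
        # Non-critical: serve the page without recommendations and flag it.
        page["recommendations"] = []
        page["degraded"] = True
    return page

page = build_home_page("user-42")
print(page)
```

The `degraded` flag is a design choice for this sketch: it lets callers and dashboards distinguish a full response from a reduced one instead of hiding the failure.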

Dependency Management

In a microservices architecture, every service depends on other services. Managing these dependencies is critical for reliability.

Dependency Graph

Maintain a dependency graph that shows which services depend on which. This is essential for understanding blast radius when a service fails.

class DependencyGraph:
    def __init__(self):
        self.services = {}
        self.dependencies = {}

    def add_service(self, name, critical=True):
        self.services[name] = {"critical": critical}
        self.dependencies[name] = []

    def add_dependency(self, service, depends_on):
        self.dependencies[service].append(depends_on)

    def impact_analysis(self, failed_service):
        print(f"Impact analysis: {failed_service} failed")
        # Walk the graph transitively: a service is affected if it depends on
        # the failed service directly or through another affected service.
        affected = []
        queue = [failed_service]
        while queue:
            current = queue.pop(0)
            for svc, deps in self.dependencies.items():
                if current in deps and svc not in affected and svc != failed_service:
                    affected.append(svc)
                    queue.append(svc)
        for svc in affected:
            critical = self.services[svc]["critical"]
            print(f"  {svc} affected ({'CRITICAL' if critical else 'NON-CRITICAL'})")
        if not affected:
            print("  No dependent services affected")

graph = DependencyGraph()
graph.add_service("api-gateway", critical=True)
graph.add_service("sync-service", critical=True)
graph.add_service("notification-service", critical=False)
graph.add_service("storage-service", critical=True)

graph.add_dependency("api-gateway", "sync-service")
graph.add_dependency("sync-service", "storage-service")
graph.add_dependency("sync-service", "notification-service")

graph.impact_analysis("notification-service")
print()
graph.impact_analysis("storage-service")

Expected output:

Impact analysis: notification-service failed
  sync-service affected (CRITICAL)
  api-gateway affected (CRITICAL)

Impact analysis: storage-service failed
  sync-service affected (CRITICAL)
  api-gateway affected (CRITICAL)

Service Level Objectives Per Service

Each microservice should have its own SLOs. The SLOs of downstream services inform the SLOs of upstream services. If the storage service has a P99 latency SLO of 200ms, a sync service that calls it synchronously on every request cannot credibly commit to a P99 SLO of 100ms.

def validate_slo_chain(upstream_slo, downstream_slo):
    print(f"Upstream SLO (sync): P99 < {upstream_slo}ms")
    print(f"Downstream SLO (storage): P99 < {downstream_slo}ms")
    if downstream_slo > upstream_slo * 0.5:
        print("WARNING: Downstream latency consumes most of the upstream budget")
    elif downstream_slo > upstream_slo * 0.3:
        print("ADVISORY: Downstream latency is significant but manageable")
    else:
        print("OK: Downstream latency leaves reasonable headroom")

validate_slo_chain(200, 80)
validate_slo_chain(200, 150)

Expected output:

Upstream SLO (sync): P99 < 200ms
Downstream SLO (storage): P99 < 80ms
ADVISORY: Downstream latency is significant but manageable
Upstream SLO (sync): P99 < 200ms
Downstream SLO (storage): P99 < 150ms
WARNING: Downstream latency consumes most of the upstream budget
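The same reasoning applies to availability SLOs. Assuming a service needs every one of its dependencies on each request and that their failures are independent (both simplifying assumptions), its availability cannot exceed the product of its dependencies' availabilities. The function name and the figures below are illustrative, not DodaTech's actual targets.

```python
import math

def availability_ceiling(dependency_availabilities):
    """Upper bound on availability when every dependency is required
    and failures are independent (both are simplifying assumptions)."""
    return math.prod(dependency_availabilities)

# Three hard dependencies with illustrative availability targets.
ceiling = availability_ceiling([0.999, 0.999, 0.9995])
print(f"Availability ceiling: {ceiling:.4%}")
```

With these numbers the ceiling is roughly 99.75%, so a 99.9% availability SLO for the calling service would not be achievable unless some of those dependencies are made optional through graceful degradation.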

Observability in Microservices

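Observability is critical in microservices because requests cross many service boundaries. You need three pillars: metrics (Prometheus), logs (Loki or ELK), and traces (OpenTelemetry instrumentation with a backend such as Jaeger). Logs and traces should carry a correlation ID so you can connect data across services. Metrics are aggregates, and a per-request ID used as a metric label would create unbounded cardinality, so link metrics to individual requests through exemplars or matching timestamps instead.

The correlation ID pattern is simple: each incoming request gets a unique ID. That ID is passed to every downstream service call and included in every log line and trace span produced for that request.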
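The pattern can be sketched in a few lines. This is a simplified illustration, not a specific framework's API: the `X-Correlation-ID` header name is an assumed convention rather than a standard, and `ensure_correlation_id`, `outgoing_headers`, and `handle_request` are hypothetical helpers. A plain list stands in for a log pipeline.

```python
import uuid

HEADER = "X-Correlation-ID"  # assumed header name; a common convention, not a standard

def ensure_correlation_id(headers):
    """Reuse the caller's ID if present, otherwise mint one at the edge."""
    return headers.get(HEADER) or uuid.uuid4().hex

def outgoing_headers(correlation_id):
    """Headers to attach to every downstream call made for this request."""
    return {HEADER: correlation_id}

def handle_request(service, headers, log):
    cid = ensure_correlation_id(headers)
    log.append(f"correlation_id={cid} service={service} msg=handling request")
    return outgoing_headers(cid)

log = []
gateway_out = handle_request("api-gateway", {}, log)        # edge: mints the ID
sync_out = handle_request("sync-service", gateway_out, log)  # reuses it
handle_request("storage-service", sync_out, log)             # reuses it again
print("\n".join(log))
```

Only the first service generates an ID. Every later hop reuses the value it received, so searching the logs for that one ID returns the full path of the request.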

Common Errors

Error	Explanation
No service mesh	Without a service mesh (or an equivalent shared client library), every microservice must implement its own retry, timeout, and circuit breaker logic, and the implementations drift apart.
No distributed tracing	In a monolith, you can step through code. In microservices, a single request can span ten or more services. Without tracing, finding where a request failed or slowed down is largely guesswork.
Tight coupling	If every service depends on every other service, a single failure cascades everywhere. Design for loose coupling.
No graceful degradation	If a non-critical dependency (like notifications) fails and takes down the entire service, the system is poorly designed.
Synchronous calls for everything	Use async messaging and queues to decouple services. Synchronous calls create chain dependencies.
No dependency graph	You must know which services depend on which. Maintain a dependency graph and monitor dependency health.
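The "Synchronous calls for everything" row can be illustrated with a small producer and consumer. In this sketch the standard library's `queue.Queue` and a worker thread stand in for a real message broker, and `sync_file` and `notification_worker` are hypothetical names. It shows only the decoupling idea, not delivery guarantees or persistence.

```python
import queue
import threading

notifications = queue.Queue()   # stand-in for a message broker
delivered = []

def notification_worker():
    # Consumer: runs independently of the request path.
    while True:
        event = notifications.get()
        if event is None:       # sentinel value: shut down the worker
            break
        delivered.append(f"push sent for {event['file']}")

def sync_file(name):
    # Producer: enqueue the event and return without waiting for delivery.
    notifications.put({"file": name})
    return f"{name} synced"

worker = threading.Thread(target=notification_worker)
worker.start()
results = [sync_file("a.txt"), sync_file("b.txt")]
notifications.put(None)
worker.join()
print(results)
print(delivered)
```

Because `sync_file` returns as soon as the event is queued, a slow or unavailable notification consumer delays notifications but does not slow down or fail the sync request itself.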

Practice Questions

Why do microservices have more failure modes than monoliths?
What problem does a service mesh solve?
How does distributed tracing help debug microservices issues?
What is graceful degradation and why does it matter?
Why should you use asynchronous messaging to decouple microservices?

Challenge

You are designing the microservices architecture for a new DodaTech feature: real-time collaborative document editing. Identify the services needed, draw the dependency graph, define the service mesh configuration (retries, timeouts, circuit breakers), plan distributed tracing instrumentation, and design graceful degradation behavior for each non-critical dependency failure.

FAQ

What is a service mesh?

A service mesh is an infrastructure layer that manages service-to-service communication, providing built-in retries, timeouts, circuit breaking, and observability.

What is distributed tracing?

Distributed tracing tracks a request across multiple services, assigning a trace ID that correlates log entries, spans, and timing data for the entire request flow.

How many microservices is too many?

There is no fixed number, but each service should have a clear boundary. If a team ends up owning more services than it can operate reliably, the services have been split too finely.

What is graceful degradation?

Graceful degradation means the system continues to function, possibly with reduced features, when a non-critical dependency fails.

Should every microservice have its own database?

Yes, each microservice should own its data. Shared databases create tight coupling between services.
