SRE for Microservices — Distributed Systems Reliability
Microservices architectures distribute functionality across many small, independently deployable services. That also spreads failure modes, latency sources, and debugging complexity across a much larger surface that SRE must manage.
What You'll Learn
In this tutorial, you will learn the unique reliability challenges of microservices compared to monoliths, how to use service mesh for traffic management and observability, how to implement distributed tracing for debugging and latency analysis, and how to design for graceful degradation when dependencies fail.
Why It Matters
A monolith mostly fails as a unit: it is either up or down. A microservices system has many more failure modes: any service can be slow, unavailable, or returning wrong data, and each failure can cascade to other services. Without SRE practices designed for distributed systems, that complexity quickly becomes unmanageable.
Real-World Use
DodaTech runs over 40 microservices for the Doda Browser platform. A service mesh (Istio) handles traffic management, retries, and observability. Distributed tracing using OpenTelemetry tracks requests across services. Circuit breakers prevent cascading failures. The DodaTech SRE team manages this complexity through standardized patterns and centralized observability.
graph TD
A[API Gateway] --> B[Auth Service]
A --> C[Sync Service]
A --> D[Storage Service]
C --> E[Database]
D --> F[Object Store]
C --> G[Notification Service]
G --> H[Push Notifications]
C -.->|Circuit Breaker| G
B --> I[User DB]
style G fill:#f96
Prerequisites
Understanding Reliability Patterns is essential since retries, circuit breakers, and timeouts are the building blocks of microservices reliability. Familiarity with Monitoring and Alerting for SRE is needed for the observability layer.
Microservices Reliability Challenges
| Challenge | Monolith | Microservices |
|---|---|---|
| Failure modes | One (up or down) | Many (partial failures) |
| Debugging | Single process | Distributed traces across services |
| Latency | Predictable | Variable due to network hops |
| Deployments | One deployment | Independent deployments per service |
| Dependencies | In-process calls | Network calls (can fail) |
Service Mesh
A service mesh provides a dedicated infrastructure layer for service-to-service communication, handling retries, timeouts, circuit breaking, and observability without modifying application code.
class ServiceMeshConfig:
    """Toy model of per-service mesh rules; it only records and prints them."""

    def __init__(self, service_name):
        self.service = service_name
        self.rules = []

    def add_retry_rule(self, max_retries, timeout_ms):
        self.rules.append({
            "type": "retry",
            "max_retries": max_retries,
            "timeout_ms": timeout_ms
        })
        print(f"Added retry rule: {max_retries} retries, {timeout_ms}ms timeout")

    def add_circuit_breaker(self, consecutive_errors, cooldown_sec):
        self.rules.append({
            "type": "circuit_breaker",
            "consecutive_errors": consecutive_errors,
            "cooldown_sec": cooldown_sec
        })
        print(f"Added circuit breaker: {consecutive_errors} errors, {cooldown_sec}s cooldown")

    def add_timeout(self, timeout_ms):
        self.rules.append({
            "type": "timeout",
            "timeout_ms": timeout_ms
        })
        print(f"Added timeout: {timeout_ms}ms")

    def apply(self):
        # Print every rule in the order it was added.
        print(f"\nApplying mesh config to {self.service}:")
        for rule in self.rules:
            print(f"  - {rule['type']}: {rule}")


mesh = ServiceMeshConfig("sync-service")
mesh.add_retry_rule(3, 200)
mesh.add_circuit_breaker(5, 30)
mesh.add_timeout(500)
mesh.apply()
Expected output:
Added retry rule: 3 retries, 200ms timeout
Added circuit breaker: 5 errors, 30s cooldown
Added timeout: 500ms

Applying mesh config to sync-service:
  - retry: {'type': 'retry', 'max_retries': 3, 'timeout_ms': 200}
  - circuit_breaker: {'type': 'circuit_breaker', 'consecutive_errors': 5, 'cooldown_sec': 30}
  - timeout: {'type': 'timeout', 'timeout_ms': 500}
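To make the circuit breaker rule concrete, here is a minimal sketch of the state machine that a mesh sidecar or client library might apply. The `CircuitBreaker` class, `CircuitOpenError`, and the injectable `clock` parameter are hypothetical names invented for this illustration; this is not how Istio implements it. The thresholds mirror the `consecutive_errors` and `cooldown_sec` values used above.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical sketch: opens after N consecutive errors, retries after a cooldown."""

    def __init__(self, consecutive_errors=5, cooldown_sec=30, clock=time.monotonic):
        self.threshold = consecutive_errors
        self.cooldown_sec = cooldown_sec
        self.clock = clock  # injectable so the demo does not need to sleep
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_sec:
                raise CircuitOpenError("circuit open, failing fast")
            # Cooldown elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    # Stand-in for a call to an unhealthy dependency.
    raise ConnectionError("notification-service unreachable")


now = [0.0]  # fake clock, in seconds
breaker = CircuitBreaker(consecutive_errors=2, cooldown_sec=30, clock=lambda: now[0])
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError as exc:
        print(f"rejected: {exc}")
    except ConnectionError as exc:
        print(f"failed: {exc}")
```

The first two calls fail normally and the third is rejected without touching the dependency, which is what stops a struggling service from being hammered by its callers. Advancing the fake clock past the cooldown lets a single trial call through again.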
Distributed Tracing
Distributed tracing tracks a single request as it traverses multiple services. Each trace has a trace ID, spans representing operations, and timing data.
import random
import time
import uuid


class DistributedTrace:
    def __init__(self, trace_id=None):
        self.trace_id = trace_id or str(uuid.uuid4())[:8]
        self.spans = []

    def start_span(self, service, operation):
        span = {
            "trace_id": self.trace_id,
            "span_id": str(uuid.uuid4())[:8],
            "service": service,
            "operation": operation,
            "start": time.perf_counter(),  # monotonic clock, suited to measuring durations
            "end": None,
            "duration_ms": None
        }
        self.spans.append(span)
        print(f"[{self.trace_id}] START {service}/{operation}")
        return span

    def end_span(self, span):
        span["end"] = time.perf_counter()
        span["duration_ms"] = (span["end"] - span["start"]) * 1000
        print(f"[{self.trace_id}] END {span['service']}/{span['operation']} ({span['duration_ms']:.1f}ms)")

    def trace_report(self):
        finished = [s for s in self.spans if s["duration_ms"] is not None]
        # Spans in this example are nested, so the longest (root) span already
        # covers the whole request. Summing durations would count nested time twice.
        total_ms = max((s["duration_ms"] for s in finished), default=0.0)
        print(f"\nTrace {self.trace_id} - Total: {total_ms:.1f}ms")
        for s in finished:
            pct = (s["duration_ms"] / total_ms) * 100 if total_ms > 0 else 0
            print(f"  {s['service']:20s} {s['operation']:30s} {s['duration_ms']:8.1f}ms ({pct:.0f}%)")


trace = DistributedTrace()
span1 = trace.start_span("api-gateway", "handle_request")
time.sleep(random.uniform(0.01, 0.05))
span2 = trace.start_span("sync-service", "sync_files")
time.sleep(random.uniform(0.02, 0.08))
span3 = trace.start_span("storage-service", "store_file")
time.sleep(random.uniform(0.05, 0.1))
# Close spans innermost first, mirroring how nested calls return.
trace.end_span(span3)
trace.end_span(span2)
trace.end_span(span1)
trace.trace_report()
Example output (the trace ID and timings vary between runs):
[1a2b3c4d] START api-gateway/handle_request
[1a2b3c4d] START sync-service/sync_files
[1a2b3c4d] START storage-service/store_file
[1a2b3c4d] END storage-service/store_file (80.2ms)
[1a2b3c4d] END sync-service/sync_files (110.5ms)
[1a2b3c4d] END api-gateway/handle_request (150.8ms)

Trace 1a2b3c4d - Total: 150.8ms
  api-gateway          handle_request                    150.8ms (100%)
  sync-service         sync_files                        110.5ms (73%)
  storage-service      store_file                         80.2ms (53%)
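Because the durations in a nested trace are inclusive, the percentages overlap and do not show where the time is actually spent. A rough sketch of computing each span's self time (its own duration minus that of its direct children) follows. The `parent` field is an assumption added for illustration, since the spans in the tutorial class do not record their parent, and the figures are the sample timings from the example output. It also assumes child calls run one after another rather than concurrently.

```python
def self_times(spans):
    """Return {service: self_time_ms}: inclusive duration minus direct children's durations."""
    child_total = {}
    for span in spans:
        parent = span["parent"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0.0) + span["duration_ms"]
    return {
        span["service"]: round(span["duration_ms"] - child_total.get(span["service"], 0.0), 1)
        for span in spans
    }


# Sample spans; "parent" is a hypothetical field naming the calling service.
spans = [
    {"service": "api-gateway", "parent": None, "duration_ms": 150.8},
    {"service": "sync-service", "parent": "api-gateway", "duration_ms": 110.5},
    {"service": "storage-service", "parent": "sync-service", "duration_ms": 80.2},
]

for service, ms in self_times(spans).items():
    print(f"{service:20s} {ms:6.1f}ms")
```

With these sample numbers the gateway accounts for roughly 40.3 ms, the sync service for 30.3 ms, and the storage service for 80.2 ms, so storage is the place to look first when chasing latency.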
Graceful Degradation
When a dependency fails, the service should degrade gracefully rather than failing entirely.
import random


class GracefulDegradation:
    def __init__(self, critical_deps, non_critical_deps):
        self.critical = critical_deps
        self.non_critical = non_critical_deps

    def handle_request(self, feature, deps_needed):
        available = []
        unavailable = []
        for dep in deps_needed:
            # Simulated health check: each dependency is healthy about 70% of the time.
            is_healthy = random.random() > 0.3
            if is_healthy:
                available.append(dep)
            else:
                unavailable.append(dep)
        print(f"Feature: {feature}")
        print(f"  Dependencies needed: {deps_needed}")
        print(f"  Available: {available}")
        print(f"  Unavailable: {unavailable}")
        # Degrading is only safe if everything that is down is non-critical.
        can_degrade = all(d in self.non_critical for d in unavailable)
        if len(available) == len(deps_needed):
            print("  Result: FULL SERVICE")
        elif can_degrade:
            print("  Result: DEGRADED SERVICE (non-critical deps missing)")
        else:
            print("  Result: SERVICE UNAVAILABLE (critical deps missing)")


dg = GracefulDegradation(
    critical_deps=["auth", "database"],
    non_critical_deps=["notifications", "analytics", "recommendations"]
)
dg.handle_request("File Sync", ["auth", "database", "notifications"])
dg.handle_request("Recommendations", ["auth", "recommendations"])
Example output (dependency health is randomized, so results vary between runs):
Feature: File Sync
  Dependencies needed: ['auth', 'database', 'notifications']
  Available: ['auth', 'database']
  Unavailable: ['notifications']
  Result: DEGRADED SERVICE (non-critical deps missing)
Feature: Recommendations
  Dependencies needed: ['auth', 'recommendations']
  Available: ['recommendations']
  Unavailable: ['auth']
  Result: SERVICE UNAVAILABLE (critical deps missing)
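In a real request path, degrading usually means substituting a fallback for the missing data rather than only reporting a status. The following is a minimal sketch of that idea; `call_with_fallback` and `fetch_recommendations` are hypothetical names, and the failing call is simulated rather than made over the network.

```python
def call_with_fallback(primary, fallback_value, label):
    """Call primary(); on any exception return fallback_value and mark the response degraded."""
    try:
        return {"data": primary(), "degraded": False}
    except Exception as exc:
        print(f"{label} unavailable ({exc}); serving fallback")
        return {"data": fallback_value, "degraded": True}


def fetch_recommendations():
    # Stand-in for a network call to a hypothetical recommendations service.
    raise TimeoutError("recommendations timed out")


response = call_with_fallback(fetch_recommendations, fallback_value=[], label="recommendations")
print(response)
```

The `degraded` flag lets the caller (or the UI) know that it received a fallback, which is useful both for user messaging and for counting degraded responses in metrics. This pattern should only wrap non-critical dependencies; a failed `auth` call should still fail the request.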
Dependency Management
In a microservices architecture, every service depends on other services. Managing these dependencies is critical for reliability.
Dependency Graph
Maintain a dependency graph that shows which services depend on which. This is essential for understanding blast radius when a service fails.
from collections import deque


class DependencyGraph:
    def __init__(self):
        self.services = {}
        self.dependencies = {}  # service -> list of services it calls

    def add_service(self, name, critical=True):
        self.services[name] = {"critical": critical}
        self.dependencies[name] = []

    def add_dependency(self, service, depends_on):
        self.dependencies[service].append(depends_on)

    def impact_analysis(self, failed_service):
        """Print every service that depends on failed_service, directly or transitively."""
        print(f"Impact analysis: {failed_service} failed")
        affected = []
        to_visit = deque([failed_service])
        # Breadth-first walk from the failed service to its callers, then their callers.
        # Every edge is treated as a hard dependency, so soft dependencies guarded
        # by a circuit breaker still show up as affected.
        while to_visit:
            current = to_visit.popleft()
            for svc, deps in self.dependencies.items():
                if current in deps and svc != failed_service and svc not in affected:
                    affected.append(svc)
                    to_visit.append(svc)
        for svc in affected:
            critical = self.services[svc]["critical"]
            print(f"  {svc} affected ({'CRITICAL' if critical else 'NON-CRITICAL'})")
        if not affected:
            print("  No dependent services affected")


graph = DependencyGraph()
graph.add_service("api-gateway", critical=True)
graph.add_service("sync-service", critical=True)
graph.add_service("notification-service", critical=False)
graph.add_service("storage-service", critical=True)
graph.add_dependency("api-gateway", "sync-service")
graph.add_dependency("sync-service", "storage-service")
graph.add_dependency("sync-service", "notification-service")
graph.impact_analysis("notification-service")
print()
graph.impact_analysis("storage-service")
Expected output:
Impact analysis: notification-service failed
  sync-service affected (CRITICAL)
  api-gateway affected (CRITICAL)

Impact analysis: storage-service failed
  sync-service affected (CRITICAL)
  api-gateway affected (CRITICAL)
Service Level Objectives Per Service
Each microservice should have its own SLOs. The SLOs of downstream services inform the SLOs of upstream services. If the storage service has a P99 latency SLO of 200ms, the sync service that depends on it cannot have a P99 SLO of 100ms.
def validate_slo_chain(upstream_slo, downstream_slo):
    """Compare a dependency's P99 latency SLO with its caller's P99 budget (both in ms)."""
    print(f"Upstream SLO (sync): P99 < {upstream_slo}ms")
    print(f"Downstream SLO (storage): P99 < {downstream_slo}ms")
    if downstream_slo > upstream_slo * 0.5:
        print("WARNING: Downstream latency consumes most of the upstream budget")
    elif downstream_slo > upstream_slo * 0.3:
        print("ADVISORY: Downstream latency is significant but manageable")
    else:
        print("OK: Downstream latency leaves reasonable headroom")


validate_slo_chain(200, 80)
validate_slo_chain(200, 150)
Expected output:
Upstream SLO (sync): P99 < 200ms
Downstream SLO (storage): P99 < 80ms
ADVISORY: Downstream latency is significant but manageable
Upstream SLO (sync): P99 < 200ms
Downstream SLO (storage): P99 < 150ms
WARNING: Downstream latency consumes most of the upstream budget
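The same chaining argument applies to availability SLOs: a service that needs every one of its dependencies on each request cannot be more available than the product of their availabilities. The sketch below works through that arithmetic. The function name and the availability figures are illustrative assumptions rather than DodaTech numbers, and the calculation assumes that failures are independent, which real systems only approximate.

```python
import math


def max_serial_availability(own, dependency_availabilities):
    """Upper bound on availability when every listed dependency is required per request.

    Assumes independent failures; correlated failures change the result.
    """
    return own * math.prod(dependency_availabilities)


# Illustrative numbers: a service at 99.95% with two hard dependencies at 99.9% each.
sync = max_serial_availability(0.9995, [0.999, 0.999])
print(f"sync-service ceiling: {sync:.4%}")
```

Under these assumptions the ceiling comes out at about 99.75%, so a 99.9% availability SLO for the calling service would be unreachable unless one of the dependencies is made non-critical through graceful degradation.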
Observability in Microservices
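Observability is critical in microservices because requests cross many service boundaries. You need three pillars: metrics (for example Prometheus), logs (for example Loki or ELK), and traces (for example Jaeger, often instrumented with OpenTelemetry). Logs and traces should carry a correlation ID so you can connect data across services.
The correlation ID pattern is simple: each incoming request gets a unique ID. That ID is passed to every downstream service call and included in every log line and trace span for that request. Metrics are aggregates, so avoid using the ID as a metric label, since that would create one time series per request; link metrics to traces through exemplars instead.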
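The correlation ID pattern can be sketched with the Python standard library alone. In this illustration the header name `X-Correlation-ID`, the `handle_request` and `outgoing_headers` functions, and the logger name are assumptions chosen for the example; a production setup would more likely rely on the trace context propagated by its tracing library.

```python
import contextvars
import logging
import uuid

# Holds the current request's correlation ID; "-" when outside a request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copies the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


logger = logging.getLogger("sync-service")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s"))
handler.addFilter(CorrelationFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def outgoing_headers():
    """Headers to attach to every downstream call (the header name is an assumption)."""
    return {"X-Correlation-ID": correlation_id.get()}


def handle_request(incoming_headers):
    # Reuse the caller's ID if present; otherwise this service is the entry point.
    cid = incoming_headers.get("X-Correlation-ID") or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("handling request")
        return outgoing_headers()
    finally:
        correlation_id.reset(token)  # do not leak the ID into the next request


print(handle_request({"X-Correlation-ID": "req-42"}))
```

Using a context variable rather than a global keeps the ID separate per thread or async task, so concurrent requests do not overwrite each other's IDs.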
Common Errors
| Error | Explanation |
|---|---|
| No service mesh | Without a service mesh or a shared client library, every microservice must implement its own retry, timeout, and circuit breaker logic, and the implementations drift apart. |
| No distributed tracing | In a monolith, you can step through code. In microservices, a single request can span 10+ services. Without tracing, finding where a request failed or slowed down is extremely difficult. |
| Tight coupling | If every service depends on every other service, a single failure cascades everywhere. Design for loose coupling. |
| No graceful degradation | If a non-critical dependency (like notifications) fails and takes down the entire service, the system is poorly designed. |
| Synchronous calls for everything | Use async messaging and queues to decouple services. Synchronous calls create chain dependencies. |
| No dependency graph | You must know which services depend on which. Maintain a dependency graph and monitor dependency health. |
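The "synchronous calls for everything" row can be illustrated with a small sketch in which the request path only enqueues an event and a separate worker delivers it. An in-process `queue.Queue` stands in here for a real message broker, and the function and event names are hypothetical.

```python
import queue
import threading

notifications = queue.Queue()  # stand-in for a message broker topic
delivered = []


def notification_worker():
    """Consumes events independently of the request path."""
    while True:
        event = notifications.get()
        if event is None:  # sentinel value: shut the worker down
            break
        delivered.append(f"push:{event}")


def sync_files(user):
    # The request path only enqueues; it never waits for the push to be sent.
    notifications.put(f"sync-complete:{user}")
    return "synced"


worker = threading.Thread(target=notification_worker)
worker.start()
print(sync_files("alice"))
notifications.put(None)
worker.join()
print(delivered)
```

If the notification worker is slow or down, `sync_files` still returns immediately and events wait in the queue, which is the decoupling that a synchronous call chain cannot provide.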
Practice Questions
- Why do microservices have more failure modes than monoliths?
- What problem does a service mesh solve?
- How does distributed tracing help debug microservices issues?
- What is graceful degradation and why does it matter?
- Why should you use asynchronous messaging to decouple microservices?
Challenge
You are designing the microservices architecture for a new DodaTech feature: real-time collaborative document editing. Identify the services needed, draw the dependency graph, define the service mesh configuration (retries, timeouts, circuit breakers), plan distributed tracing instrumentation, and design graceful degradation behavior for each non-critical dependency failure.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro