Monitoring and Alerting for SRE
This tutorial introduces monitoring and alerting for site reliability engineering (SRE), covering key concepts, practical examples, and best practices.
Monitoring collects and visualizes service metrics, while alerting notifies the right person when those metrics indicate a problem — together they form the eyes and ears of every SRE team.
What You'll Learn
In this tutorial, you will learn the four golden signals of monitoring, how to design alerts that are actionable and reduce noise, how to set up alert routing based on severity and service ownership, and how to build dashboards that help you debug faster.
Why It Matters
Bad monitoring can be worse than no monitoring. Alerts that fire constantly but require no action train engineers to ignore them, and dashboards that show every metric make it hard to find the one that matters. A well-designed monitoring system reduces mean time to recovery (MTTR) and prevents alert fatigue.
Real-World Use
DodaTech runs Prometheus for metrics collection and Grafana for dashboards across all services. Doda Browser sync has four dashboards: one for the four golden signals, one for database health, one for deployment tracking, and one for business metrics. Alerts route through PagerDuty with severity-based escalation policies.
```mermaid
graph TD
    A[Service] --> B[Metrics Collection]
    B --> C[Time-Series Database]
    C --> D{Dashboard}
    C --> E{Alert Rule}
    E -->|Fire| F[Alert Manager]
    F --> G[Route by Severity]
    G --> H[PagerDuty SEV1]
    G --> I[Email SEV2]
    G --> J[Slack SEV3]
```
Prerequisites
Understanding SLIs and SLOs helps you define what to monitor. Familiarity with Incident Response is important since alerts trigger the incident response process.
The Four Golden Signals
Google's Site Reliability Engineering book defines four golden signals that every user-facing service should monitor.
| Signal | What It Measures | Why It Matters |
|---|---|---|
| Latency | Time to serve a request | Slow responses are as bad as errors |
| Traffic | How much demand the service sees | Identifies scaling needs |
| Errors | Rate of failed requests | Directly measures service health |
| Saturation | How close to capacity the service is | Predicts future problems |
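As a rough illustration, all four signals can be derived from one window of request records. The sketch below assumes a hypothetical `Request` record and uses in-flight requests against a made-up concurrency limit as the saturation measure; a real service would read these values from its metrics pipeline.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    ok: bool            # False if the request failed


def golden_signals(requests, window_seconds, in_flight, max_in_flight):
    """Summarize one observation window into the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank P95: ceil(0.95 * n), computed with integer arithmetic
    rank = max(1, -(-95 * len(durations) // 100))
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": sum(1 for r in requests if not r.ok) / len(requests),
        "saturation": in_flight / max_in_flight,
    }


# Hypothetical sample: 20 requests over a 10-second window, 2 of them failed
sample = [Request(100 + 10 * i, ok=(i % 10 != 0)) for i in range(20)]
signals = golden_signals(sample, window_seconds=10, in_flight=40, max_in_flight=50)
print(signals)
```

Saturation is the signal that varies most between services: CPU, memory, queue depth, or connection-pool usage may be a better fit than the concurrency ratio assumed here.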
Building a Metrics Collector
```python
import random
import time
from collections import deque


class MetricsCollector:
    def __init__(self, window_size=100):
        # A bounded deque keeps only the most recent samples
        self.window = deque(maxlen=window_size)

    def collect(self):
        # Simulated measurements; a real collector would read these from the service
        metric = {
            "timestamp": time.time(),
            "latency_ms": random.uniform(10, 500),
            "requests_per_sec": random.uniform(50, 200),
            "error_rate": random.uniform(0, 0.05),
            "cpu_utilization": random.uniform(30, 90),
        }
        self.window.append(metric)
        return metric

    def aggregate(self):
        if not self.window:
            return None
        latencies = [m["latency_ms"] for m in self.window]
        errors = [m["error_rate"] for m in self.window]
        # Index-based approximation of P95 over the sorted window
        sorted_lat = sorted(latencies)
        p95 = sorted_lat[int(len(sorted_lat) * 0.95)]
        return {
            "p95_latency": p95,
            "avg_error_rate": sum(errors) / len(errors),
            "avg_rps": sum(m["requests_per_sec"] for m in self.window) / len(self.window),
            "avg_cpu": sum(m["cpu_utilization"] for m in self.window) / len(self.window),
        }


collector = MetricsCollector(50)
for _ in range(60):
    collector.collect()  # only the last 50 samples are kept

agg = collector.aggregate()
print(f"P95 Latency: {agg['p95_latency']:.1f}ms")
print(f"Avg Error Rate: {agg['avg_error_rate']:.3%}")
print(f"Avg Throughput: {agg['avg_rps']:.0f} req/s")
print(f"Avg CPU: {agg['avg_cpu']:.1f}%")
```
Sample output (the metrics are random, so your numbers will differ):
```
P95 Latency: 475.0ms
Avg Error Rate: 2.300%
Avg Throughput: 124 req/s
Avg CPU: 60.2%
```
Designing Effective Alerts
An alert is actionable only if it requires a human response. If no action is needed, it should be a dashboard metric, not an alert.
Alert Severity Levels
| Level | Response Time | Action Required | Example |
|---|---|---|---|
| SEV1 | 5 minutes | Immediate human response | Service down |
| SEV2 | 30 minutes | Same-day investigation | Elevated error rate |
| SEV3 | 4 hours | Business-hours fix | Non-critical feature broken |
| WARN | No SLA | Informational | Approaching threshold |
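One way to keep these levels consistent across tools is to encode them once and look them up wherever alerts are created. The sketch below mirrors the table; `SeverityPolicy`, the `pages_on_call` flag, and `response_deadline` are illustrative names rather than part of any specific tool, and the assumption that only SEV1 pages a human follows the routing diagram earlier in this tutorial.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    response_time: Optional[timedelta]  # None means no response SLA
    action: str
    pages_on_call: bool  # assumption: only SEV1 pages a human immediately


POLICIES = {
    "SEV1": SeverityPolicy(timedelta(minutes=5), "Immediate human response", True),
    "SEV2": SeverityPolicy(timedelta(minutes=30), "Same-day investigation", False),
    "SEV3": SeverityPolicy(timedelta(hours=4), "Business-hours fix", False),
    "WARN": SeverityPolicy(None, "Informational", False),
}


def response_deadline(severity, fired_at):
    """Return the time by which someone must respond, or None if there is no SLA."""
    policy = POLICIES[severity]
    if policy.response_time is None:
        return None
    return fired_at + policy.response_time


fired = datetime(2024, 1, 1, 12, 0)
print(response_deadline("SEV2", fired))  # 2024-01-01 12:30:00
print(response_deadline("WARN", fired))  # None
```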
Alert Rule Definition
```python
class AlertRule:
    def __init__(self, name, metric, condition, threshold, severity):
        self.name = name
        self.metric = metric
        self.condition = condition  # "greater_than" or "less_than"
        self.threshold = threshold
        self.severity = severity
        self.firing = False

    def evaluate(self, value):
        """Update the firing state for a new sample and return it."""
        if self.condition == "greater_than":
            breached = value > self.threshold
        elif self.condition == "less_than":
            breached = value < self.threshold
        else:
            raise ValueError(f"unknown condition: {self.condition}")
        # Print only on state changes so a sustained breach does not repeat the alert
        if breached and not self.firing:
            print(f"ALERT FIRING: {self.name} ({self.severity})")
            print(f" {self.metric} = {value:.1f} (threshold: {self.threshold})")
        elif not breached and self.firing:
            print(f"ALERT RESOLVED: {self.name}")
        self.firing = breached
        return breached


rules = [
    AlertRule("HighLatency", "p95_latency", "greater_than", 500, "SEV2"),
    AlertRule("HighErrorRate", "error_rate", "greater_than", 0.05, "SEV1"),
    AlertRule("HighCPU", "cpu_utilization", "greater_than", 85, "SEV3"),
]

# Fixed, healthy readings for the non-latency metrics keep the demo deterministic
healthy = {"error_rate": 0.01, "cpu_utilization": 60}

for value in [120, 250, 550, 450, 92, 80]:
    print(f"\n--- Checking metrics at value={value} ---")
    for rule in rules:
        sample = value if rule.metric == "p95_latency" else healthy[rule.metric]
        rule.evaluate(sample)
```
Expected output:
```
--- Checking metrics at value=120 ---

--- Checking metrics at value=250 ---

--- Checking metrics at value=550 ---
ALERT FIRING: HighLatency (SEV2)
 p95_latency = 550.0 (threshold: 500)

--- Checking metrics at value=450 ---
ALERT RESOLVED: HighLatency

--- Checking metrics at value=92 ---

--- Checking metrics at value=80 ---
```
## Alert Routing
Route alerts to the right team based on service ownership and severity.
```python
class AlertRouter:
    def __init__(self):
        self.routes = {}

    def add_route(self, service, severity, channel):
        key = (service, severity)
        self.routes[key] = channel

    def route(self, service, severity, message):
        key = (service, severity)
        # Unregistered (service, severity) pairs fall back to a default channel
        channel = self.routes.get(key, "default-slack-channel")
        print(f"Routing alert to {channel}")
        print(f" Service: {service}")
        print(f" Severity: {severity}")
        print(f" Message: {message}")


router = AlertRouter()
router.add_route("doda-browser", "SEV1", "pagerduty-sre")
router.add_route("doda-browser", "SEV2", "slack-sre-team")
router.add_route("durga-antivirus", "SEV1", "pagerduty-security")

router.route("doda-browser", "SEV1", "Synchronization service is down")
router.route("durga-antivirus", "SEV1", "Virus definition update failing")
```
Expected output:
```
Routing alert to pagerduty-sre
 Service: doda-browser
 Severity: SEV1
 Message: Synchronization service is down
Routing alert to pagerduty-security
 Service: durga-antivirus
 Severity: SEV1
 Message: Virus definition update failing
```
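The exact-match lookup in `AlertRouter` sends every unregistered (service, severity) pair to a single default channel. A possible refinement, sketched here on the assumption that a severity-wide fallback is wanted, tries the specific route first, then a per-severity catch-all, then the global default. The `"*"` wildcard key, the `FallbackRouter` class, and the `pagerduty-catchall` channel are hypothetical.

```python
class FallbackRouter:
    """Route alerts by (service, severity), falling back to less specific routes."""

    def __init__(self, default_channel="default-slack-channel"):
        self.routes = {}
        self.default_channel = default_channel

    def add_route(self, service, severity, channel):
        self.routes[(service, severity)] = channel

    def resolve(self, service, severity):
        # Most specific first: exact match, then any-service match for the severity
        for key in ((service, severity), ("*", severity)):
            if key in self.routes:
                return self.routes[key]
        return self.default_channel


router = FallbackRouter()
router.add_route("doda-browser", "SEV1", "pagerduty-sre")
router.add_route("*", "SEV1", "pagerduty-catchall")  # hypothetical catch-all for unowned SEV1s

print(router.resolve("doda-browser", "SEV1"))  # pagerduty-sre
print(router.resolve("dodazip", "SEV1"))       # pagerduty-catchall
print(router.resolve("dodazip", "SEV3"))       # default-slack-channel
```

With a catch-all in place, a SEV1 from a service nobody registered still pages someone instead of landing in a chat channel.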
Dashboard Design
Good dashboards answer specific questions. Bad dashboards show everything.
| Dashboard Type | Purpose | Audience |
|---|---|---|
| Service overview | Are we healthy? | On-call engineer |
| Detailed debug | What changed? | Service owner |
| Business | What do users experience? | Product team |
| Capacity | Are we running out? | Infrastructure team |
Building a Service Overview Dashboard
A service overview dashboard should fit on a single screen and answer the question: "Is this service healthy?" The standard layout includes:
- Top row: Four golden signals — latency (P95 and P99), request rate, error rate, and saturation (CPU or memory).
- Middle row: Recent deployments and configuration changes overlaid on the metrics timeline.
- Bottom row: Key dependencies — database connections, cache hit rate, upstream service health.
Reducing Alert Noise
Alert fatigue happens when engineers receive too many alerts that do not require action. Strategies to reduce noise include:
| Technique | Description | Example |
|---|---|---|
| Aggregation | Combine similar alerts | Instead of 100 per-instance alerts, fire one alert when 20 percent of instances are affected |
| Flapping detection | Suppress alerts that fire and clear rapidly | Require alert to be firing for 5 minutes before notification |
| Maintenance windows | Suppress alerts during planned maintenance | Suppress deployment alerts during canary rollout |
| Runbook requirement | Every alert must have a runbook | If you cannot document the response, the alert is not actionable |
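Two of these techniques are easy to sketch in code. The example below is illustrative only: `fraction_affected_alert` and `SustainedCondition` are hypothetical helpers, and the 20 percent and five-evaluation values echo the examples in the table, assuming one evaluation per minute.

```python
def fraction_affected_alert(instance_unhealthy, min_fraction=0.2):
    """Aggregation: one alert when enough instances are unhealthy, not one per instance."""
    if not instance_unhealthy:
        return False
    affected = sum(1 for bad in instance_unhealthy.values() if bad)
    return affected / len(instance_unhealthy) >= min_fraction


class SustainedCondition:
    """Flap suppression: notify only after N consecutive failing evaluations."""

    def __init__(self, required_evaluations=5):
        self.required = required_evaluations
        self.streak = 0

    def update(self, condition_true):
        # Any healthy evaluation resets the streak, so brief blips never notify
        self.streak = self.streak + 1 if condition_true else 0
        return self.streak >= self.required


# 3 of 10 instances unhealthy is 30%, above the 20% bar, so one aggregated alert fires
fleet = {f"web-{i}": i < 3 for i in range(10)}
print(fraction_affected_alert(fleet))  # True

# A blip of three bad evaluations is ignored; five in a row triggers a notification
check = SustainedCondition(required_evaluations=5)
history = [True, True, True, False, True, True, True, True, True]
print([check.update(v) for v in history])
```

Only the final evaluation in `history` returns `True`, because the healthy sample in the fourth position resets the streak.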
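Multi-Window, Multi-Burn-Rate Alerting
A common pattern for latency alerts is to evaluate two windows at once, a simplified relative of the multi-window, multi-burn-rate alerts described in the Google SRE Workbook. The short window detects sudden spikes and the long window measures sustained degradation. The example below counts threshold breaches in each window rather than computing error-budget burn rates.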
```python
def multi_window_alert(short_window_values, long_window_values, threshold):
    # Fraction of samples in each window that exceed the latency threshold
    short_pct = sum(1 for v in short_window_values if v > threshold) / len(short_window_values)
    long_pct = sum(1 for v in long_window_values if v > threshold) / len(long_window_values)
    print(f"Short window (1 min): {short_pct:.0%} above threshold")
    print(f"Long window (5 min): {long_pct:.0%} above threshold")
    # Alert only when the spike is both recent and sustained
    if short_pct > 0.5 and long_pct > 0.1:
        print("ALERT: Sustained latency spike detected")
    elif short_pct > 0.5:
        print("WARN: Recent latency spike, monitoring")
    else:
        print("OK: Latency within normal range")


short = [120, 145, 600, 800, 350, 150, 130]
long = [110, 115, 130, 200, 450, 180, 120]
multi_window_alert(short, long, 300)
```
Expected output:
```
Short window (1 min): 43% above threshold
Long window (5 min): 14% above threshold
OK: Latency within normal range
```
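When the alert is tied to an SLO, the two windows typically compare error-budget burn rates instead of raw threshold breaches. The sketch below assumes a 99.9% availability SLO and borrows the 14.4 paging threshold (a 1-hour long window paired with a 5-minute short window) that the Google SRE Workbook suggests; the function names and sample error rates are made up for illustration.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    error_budget = 1.0 - slo_target  # about 0.001 for a 99.9% SLO
    return error_rate / error_budget


def should_page(long_window_error_rate, short_window_error_rate,
                slo_target=0.999, threshold=14.4):
    """Page only if both windows burn budget faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_error_rate, slo_target) > threshold
            and burn_rate(short_window_error_rate, slo_target) > threshold)


# Hypothetical error rates (fraction of failed requests) per window
print(should_page(0.02, 0.03))   # True: burn rates of roughly 20x and 30x
print(should_page(0.02, 0.001))  # False: the short window has already recovered
print(should_page(0.005, 0.03))  # False: the long window burns at only about 5x
```

A burn rate of 1 means the budget would be spent exactly at the end of the SLO period, so 14.4 corresponds to using up 2% of a 30-day budget in one hour.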
Common Errors
| Error | Explanation |
|---|---|
| Alert fatigue | Too many alerts that require no action. Engineers stop responding. |
| No runbook link | Every alert should link to a runbook. Without it, the responder has to figure out what to do. |
| Dashboard overload | A dashboard with 50 graphs communicates nothing. Each dashboard should answer 3-5 questions. |
| Using averages instead of percentiles | Averages hide outliers. Use P95 or P99 for latency monitoring. |
| No alert on saturation | Saturation alerts give advance warning of problems before users are affected. |
| Static thresholds | Thresholds should be reviewed and adjusted as the service evolves. |
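The point about averages is easy to demonstrate. In this made-up sample, 95 requests take 100 ms and 5 take 3 seconds; the `percentile` helper uses the nearest-rank method and is only a sketch, not a replacement for the percentile functions in a metrics system.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in ms: most requests are fast, a few are very slow
latencies = [100] * 95 + [3000] * 5

print(f"mean: {mean(latencies):.0f} ms")        # mean: 245 ms
print(f"p50:  {percentile(latencies, 50)} ms")  # p50:  100 ms
print(f"p99:  {percentile(latencies, 99)} ms")  # p99:  3000 ms
```

The mean of 245 ms describes no real request: the typical user sees 100 ms and one in twenty waits 3 seconds, which only the high percentile reveals.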
Practice Questions
- What are the four golden signals of monitoring?
- What is the difference between a dashboard metric and an alert?
- Why should you use percentiles instead of averages for latency?
- How does alert severity affect response time?
- Why should every alert have a runbook link?
Challenge
Design a monitoring and alerting system for the DodaZIP file compression API. Define four golden signal metrics, create three alert rules with appropriate severity levels, design a routing policy, and sketch a dashboard with five panels that answer the most important operational questions.