Monitoring and Alerting for SRE

DodaTech Updated 2026-06-23 8 min read

In this tutorial, you'll learn about Monitoring and Alerting for SRE. We cover key concepts, practical examples, and best practices.

Monitoring collects and visualizes service metrics, while alerting notifies the right person when those metrics indicate a problem — together they form the eyes and ears of every SRE team.

What You'll Learn

In this tutorial, you will learn the four golden signals of monitoring, how to design alerts that are actionable and reduce noise, how to set up alert routing based on severity and service ownership, and how to build dashboards that help you debug faster.

Why It Matters

Bad monitoring is worse than no monitoring. Alerts that fire constantly but require no action train engineers to ignore them. Dashboards that show every metric make it impossible to find the important one. A well-designed monitoring system reduces MTTR and prevents alert fatigue.

Real-World Use

DodaTech runs Prometheus for metrics collection and Grafana for dashboards across all services. Doda Browser sync has four dashboards: one for the four golden signals, one for database health, one for deployment tracking, and one for business metrics. Alerts route through PagerDuty with severity-based escalation policies.

graph TD
    A[Service] --> B[Metrics Collection]
    B --> C[Time-Series Database]
    C --> D{Dashboard}
    C --> E{Alert Rule}
    E -->|Fire| F[Alert Manager]
    F --> G[Route by Severity]
    G --> H[PagerDuty SEV1]
    G --> I[Email SEV2]
    G --> J[Slack SEV3]

Prerequisites

Understanding SLIs and SLOs helps you define what to monitor. Familiarity with Incident Response is important since alerts trigger the incident response process.

The Four Golden Signals

Google SRE defines four golden signals that every service should monitor.

Signal	What It Measures	Why It Matters
Latency	Time to serve a request	Slow responses are as bad as errors
Traffic	How much demand the service sees	Identifies scaling needs
Errors	Rate of failed requests	Directly measures service health
Saturation	How close to capacity the service is	Predicts future problems
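
As a rough illustration of how the four signals can be derived from raw request data, the sketch below computes each one from a small in-memory list of request records. The record fields (`duration_ms`, `status`), the 60-second window, and the `CAPACITY_RPS` figure are assumptions made for this example; they do not come from any particular monitoring tool.

```python
import math

# Hypothetical request records observed during one 60-second window.
requests = [
    {"duration_ms": 42, "status": 200},
    {"duration_ms": 55, "status": 200},
    {"duration_ms": 61, "status": 200},
    {"duration_ms": 48, "status": 500},
    {"duration_ms": 900, "status": 200},
    {"duration_ms": 47, "status": 200},
    {"duration_ms": 52, "status": 503},
    {"duration_ms": 58, "status": 200},
    {"duration_ms": 44, "status": 200},
    {"duration_ms": 50, "status": 200},
]
WINDOW_SECONDS = 60
CAPACITY_RPS = 0.5  # assumed maximum sustainable request rate


def golden_signals(records, window_seconds, capacity_rps):
    durations = sorted(r["duration_ms"] for r in records)
    # Nearest-rank P95: smallest value with at least 95% of samples at or below it.
    rank = math.ceil(95 * len(durations) / 100)
    traffic_rps = len(records) / window_seconds
    server_errors = sum(1 for r in records if r["status"] >= 500)
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": traffic_rps,
        "error_rate": server_errors / len(records),
        "saturation": traffic_rps / capacity_rps,
    }


signals = golden_signals(requests, WINDOW_SECONDS, CAPACITY_RPS)
for name, value in signals.items():
    print(f"{name}: {value:.3f}")
```

Saturation is modeled here as traffic divided by an assumed capacity; in practice it is often CPU, memory, or queue depth instead.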

Building a Metrics Collector

import random
import time
from collections import deque

class MetricsCollector:
    def __init__(self, window_size=100):
        self.window = deque(maxlen=window_size)

    def collect(self):
        metric = {
            "timestamp": time.time(),
            "latency_ms": random.uniform(10, 500),
            "requests_per_sec": random.uniform(50, 200),
            "error_rate": random.uniform(0, 0.05),
            "cpu_utilization": random.uniform(30, 90),
        }
        self.window.append(metric)
        return metric

    def aggregate(self):
        if not self.window:
            return None
        latencies = [m["latency_ms"] for m in self.window]
        errors = [m["error_rate"] for m in self.window]
        sorted_lat = sorted(latencies)
        p95 = sorted_lat[int(len(sorted_lat) * 0.95)]
        return {
            "p95_latency": p95,
            "avg_error_rate": sum(errors) / len(errors),
            "avg_rps": sum(m["requests_per_sec"] for m in self.window) / len(self.window),
            "avg_cpu": sum(m["cpu_utilization"] for m in self.window) / len(self.window),
        }

collector = MetricsCollector(50)
for _ in range(60):
    collector.collect()

agg = collector.aggregate()
print(f"P95 Latency:       {agg['p95_latency']:.1f}ms")
print(f"Avg Error Rate:    {agg['avg_error_rate']:.3%}")
print(f"Avg Throughput:    {agg['avg_rps']:.0f} req/s")
print(f"Avg CPU:           {agg['avg_cpu']:.1f}%")

Sample output (the metrics are randomly generated, so exact values differ on every run):

P95 Latency:       475.0ms
Avg Error Rate:    2.300%
Avg Throughput:    124 req/s
Avg CPU:           60.2%

Designing Effective Alerts

An alert is actionable only if it requires a human response. If no action is needed, it should be a dashboard metric, not an alert.

Alert Severity Levels

Level	Response Time	Action Required	Example
SEV1	5 minutes	Immediate human response	Service down
SEV2	30 minutes	Same-day investigation	Elevated error rate
SEV3	4 hours	Business-hours fix	Non-critical feature broken
WARN	No SLA	Informational	Approaching threshold
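
One way to make these response times enforceable is to encode them as data and check whether an unacknowledged alert has passed its target. The sketch below is a minimal example under that assumption; the function names `response_deadline` and `is_overdue` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Response-time targets copied from the severity table; WARN has no SLA.
RESPONSE_TARGETS = {
    "SEV1": timedelta(minutes=5),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=4),
    "WARN": None,
}


def response_deadline(severity, fired_at):
    """Return the time by which a human should have responded, or None."""
    target = RESPONSE_TARGETS[severity]
    return None if target is None else fired_at + target


def is_overdue(severity, fired_at, now):
    deadline = response_deadline(severity, fired_at)
    return deadline is not None and now > deadline


fired = datetime(2026, 6, 23, 9, 0)
now = datetime(2026, 6, 23, 9, 20)  # 20 minutes after the alert fired
for severity in RESPONSE_TARGETS:
    status = "overdue" if is_overdue(severity, fired, now) else "within target"
    print(f"{severity}: {status}")
```

Twenty minutes after firing, only the SEV1 alert is past its target; the others still have time, and WARN never becomes overdue.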

Alert Rule Definition

class AlertRule:
    def __init__(self, name, metric, condition, threshold, severity):
        self.name = name
        self.metric = metric
        self.condition = condition
        self.threshold = threshold
        self.severity = severity
        self.firing = False

    def evaluate(self, value):
        if self.condition == "greater_than" and value > self.threshold:
            if not self.firing:
                self.firing = True
                print(f"ALERT FIRING: {self.name} ({self.severity})")
                print(f"  {self.metric} = {value:.1f} (threshold: {self.threshold})")
            return True
        elif self.condition == "less_than" and value < self.threshold:
            if not self.firing:
                self.firing = True
                print(f"ALERT FIRING: {self.name} ({self.severity})")
                print(f"  {self.metric} = {value:.1f} (threshold: {self.threshold})")
            return True
        else:
            if self.firing:
                self.firing = False
                print(f"ALERT RESOLVED: {self.name}")
            return False

rules = [
    AlertRule("HighLatency", "p95_latency", "greater_than", 500, "SEV2"),
    AlertRule("HighErrorRate", "error_rate", "greater_than", 0.05, "SEV1"),
    AlertRule("HighCPU", "cpu_utilization", "greater_than", 85, "SEV3"),
]

# Only p95 latency varies in this demo. The other metrics are held at fixed,
# healthy values so the output is deterministic.
healthy = {"error_rate": 0.01, "cpu_utilization": 60}

for value in [120, 250, 550, 450, 92, 80]:
    print(f"\n--- Checking metrics at value={value} ---")
    for rule in rules:
        rule.evaluate(value if rule.metric == "p95_latency" else healthy[rule.metric])

Expected output:

--- Checking metrics at value=120 ---

--- Checking metrics at value=250 ---

--- Checking metrics at value=550 ---
ALERT FIRING: HighLatency (SEV2)
  p95_latency = 550.0 (threshold: 500)

--- Checking metrics at value=450 ---
ALERT RESOLVED: HighLatency

--- Checking metrics at value=92 ---

--- Checking metrics at value=80 ---


Alert Routing

Route alerts to the right team based on service ownership and severity.

class AlertRouter:
    def __init__(self):
        self.routes = {}

    def add_route(self, service, severity, channel):
        key = (service, severity)
        self.routes[key] = channel

    def route(self, service, severity, message):
        key = (service, severity)
        channel = self.routes.get(key, "default-slack-channel")
        print(f"Routing alert to {channel}")
        print(f"  Service: {service}")
        print(f"  Severity: {severity}")
        print(f"  Message: {message}")

router = AlertRouter()
router.add_route("doda-browser", "SEV1", "pagerduty-sre")
router.add_route("doda-browser", "SEV2", "slack-sre-team")
router.add_route("durga-antivirus", "SEV1", "pagerduty-security")

router.route("doda-browser", "SEV1", "Synchronization service is down")
router.route("durga-antivirus", "SEV1", "Virus definition update failing")

Expected output:

Routing alert to pagerduty-sre
  Service: doda-browser
  Severity: SEV1
  Message: Synchronization service is down
Routing alert to pagerduty-security
  Service: durga-antivirus
  Severity: SEV1
  Message: Virus definition update failing

Dashboard Design

Good dashboards answer specific questions. Bad dashboards show everything.

Dashboard Type	Purpose	Audience
Service overview	Are we healthy?	On-call engineer
Detailed debug	What changed?	Service owner
Business	What do users experience?	Product team
Capacity	Are we running out?	Infrastructure team

Building a Service Overview Dashboard

A service overview dashboard should fit on a single screen and answer the question: "Is this service healthy?" The standard layout includes:

Top row: Four golden signals — latency (P95 and P99), request rate, error rate, and saturation (CPU or memory).
Middle row: Recent deployments and configuration changes overlaid on the metrics timeline.
Bottom row: Key dependencies — database connections, cache hit rate, upstream service health.
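
The three-row layout above can also be kept as data, which makes it easy to review and to check against a panel budget. The sketch below is tool-agnostic and purely illustrative: the panel names, the `MAX_PANELS` limit, and the `validate` and `render` helpers are assumptions for this example, not a Grafana API.

```python
# Hypothetical layout spec: (row title, panels in that row).
OVERVIEW_DASHBOARD = [
    ("Golden signals", ["Latency P95/P99", "Request rate", "Error rate", "Saturation"]),
    ("Changes", ["Deployments", "Config changes"]),
    ("Dependencies", ["DB connections", "Cache hit rate", "Upstream health"]),
]
MAX_PANELS = 12  # assumed limit for what fits on a single screen


def validate(layout, max_panels=MAX_PANELS):
    """Return the panel count, or raise if the layout exceeds the budget."""
    total = sum(len(panels) for _, panels in layout)
    if total > max_panels:
        raise ValueError(f"{total} panels exceed the single-screen limit of {max_panels}")
    return total


def render(layout):
    """Render the layout as one text line per dashboard row."""
    return "\n".join(f"{row}: " + " | ".join(panels) for row, panels in layout)


print(render(OVERVIEW_DASHBOARD))
print(f"Total panels: {validate(OVERVIEW_DASHBOARD)}")
```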

Reducing Alert Noise

Alert fatigue happens when engineers receive too many alerts that do not require action. Strategies to reduce noise include:

Technique	Description	Example
Aggregation	Combine similar alerts	Instead of 100 per-instance alerts, fire one alert when 20 percent of instances are affected
Flapping detection	Suppress alerts that fire and clear rapidly	Require alert to be firing for 5 minutes before notification
Maintenance windows	Suppress alerts during planned maintenance	Suppress deployment alerts during canary rollout
Runbook requirement	Every alert must have a runbook	If you cannot document the response, the alert is not actionable
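
Two of the techniques in the table, flapping detection and aggregation, can be sketched in a few lines. The example below is a simplified illustration rather than how any specific alert manager implements them; the class name `DebouncedAlert`, the function `fleet_alert`, and the sample data are hypothetical.

```python
class DebouncedAlert:
    """Notify only after the condition has held for `hold_seconds`."""

    def __init__(self, threshold, hold_seconds):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self.breach_started = None  # timestamp of the first breaching sample
        self.notified = False

    def observe(self, timestamp, value):
        """Return "ok", "pending", "notify", "firing", or "resolved"."""
        if value > self.threshold:
            if self.breach_started is None:
                self.breach_started = timestamp
            held = timestamp - self.breach_started
            if held >= self.hold_seconds and not self.notified:
                self.notified = True
                return "notify"
            return "firing" if self.notified else "pending"
        # Condition cleared: reset, and report a resolution only if we notified.
        was_notified = self.notified
        self.breach_started = None
        self.notified = False
        return "resolved" if was_notified else "ok"


def fleet_alert(instance_unhealthy, min_fraction=0.2):
    """Fire one alert when at least `min_fraction` of instances are unhealthy."""
    affected = sum(1 for bad in instance_unhealthy.values() if bad)
    return affected / len(instance_unhealthy) >= min_fraction


# One error-rate sample per minute; the first breach clears before 5 minutes pass.
alert = DebouncedAlert(threshold=0.05, hold_seconds=300)
error_rates = [0.02, 0.08, 0.03, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.02]
states = [alert.observe(i * 60, rate) for i, rate in enumerate(error_rates)]
print(states)

fleet = {f"web-{i}": i < 3 for i in range(10)}  # 3 of 10 instances unhealthy
print(fleet_alert(fleet))
```

The one-minute blip at the second sample never leaves the pending state, so nobody is notified; only the breach that persists for five minutes produces a notification.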

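Multi-Window Alerting

A common pattern for latency alerts evaluates two windows at once: a short window to detect sudden spikes and a long window to confirm sustained degradation. An alert fires only when both windows agree, so a brief blip that has not affected the long window produces a warning at most. The pattern is a simplified relative of the multi-window, multi-burn-rate alerting described in Google's SRE Workbook.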

def multi_window_alert(short_window_values, long_window_values, threshold):
    short_pct = sum(1 for v in short_window_values if v > threshold) / len(short_window_values)
    long_pct = sum(1 for v in long_window_values if v > threshold) / len(long_window_values)
    print(f"Short window (1 min): {short_pct:.0%} above threshold")
    print(f"Long window (5 min):  {long_pct:.0%} above threshold")

    if short_pct > 0.5 and long_pct > 0.1:
        print("ALERT: Sustained latency spike detected")
    elif short_pct > 0.5:
        print("WARN: Recent latency spike, monitoring")
    else:
        print("OK: Latency within normal range")

short = [120, 145, 600, 800, 350, 150, 130]
long = [110, 115, 130, 200, 450, 180, 120]
multi_window_alert(short, long, 300)

Expected output:

Short window (1 min): 43% above threshold
Long window (5 min):  14% above threshold
OK: Latency within normal range
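
For comparison, the multi-window, multi-burn-rate approach from Google's SRE Workbook applies the same two-window idea to error-budget consumption instead of raw latency samples. The sketch below assumes a 99.9% SLO and the commonly cited paging threshold of a 14.4x burn rate measured over a 1-hour window and a 5-minute window; the function names and sample error ratios are hypothetical.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(long_error_ratio, short_error_ratio, slo_target=0.999, threshold=14.4):
    """Page only if both the long and the short window exceed the burn threshold."""
    long_burn = burn_rate(long_error_ratio, slo_target)
    short_burn = burn_rate(short_error_ratio, slo_target)
    return long_burn > threshold and short_burn > threshold


# 2% errors over the last hour and 3% over the last 5 minutes: still burning fast.
print(should_page(long_error_ratio=0.02, short_error_ratio=0.03))

# 2% errors over the last hour but 0.1% over the last 5 minutes: already recovering.
print(should_page(long_error_ratio=0.02, short_error_ratio=0.001))
```

Requiring the short window to agree means the page stops soon after the problem is fixed, even though the long window still reflects the earlier errors.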

Common Errors

Error	Explanation
Alert fatigue	Too many alerts that require no action. Engineers stop responding.
No runbook link	Every alert should link to a runbook. Without it, the responder has to figure out what to do.
Dashboard overload	A dashboard with 50 graphs communicates nothing. Each dashboard should answer 3-5 questions.
Using averages instead of percentiles	Averages hide outliers. Use P95 or P99 for latency monitoring.
No alert on saturation	Saturation alerts give advance warning of problems before users are affected.
Static thresholds	Thresholds should be reviewed and adjusted as the service evolves.
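
To see why averages mislead, consider a hypothetical service where 90 of every 100 requests are fast and 10 are very slow. The sketch below compares the mean with nearest-rank percentiles; the latency values and the `percentile` helper are assumptions made for this example.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in ms: 90 fast requests and 10 very slow ones.
latencies = [50] * 90 + [2000] * 10

mean = statistics.mean(latencies)
print(f"Mean: {mean:.0f} ms")
print(f"P50:  {percentile(latencies, 50)} ms")
print(f"P95:  {percentile(latencies, 95)} ms")
print(f"P99:  {percentile(latencies, 99)} ms")
```

The mean of 245 ms describes no real request: nine in ten users see 50 ms and one in ten waits 2 seconds, which only the high percentiles reveal.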

Practice Questions

What are the four golden signals of monitoring?
What is the difference between a dashboard metric and an alert?
Why should you use percentiles instead of averages for latency?
How does alert severity affect response time?
Why should every alert have a runbook link?

Challenge

Design a monitoring and alerting system for the DodaZIP file compression API. Define four golden signal metrics, create three alert rules with appropriate severity levels, design a routing policy, and sketch a dashboard with five panels that answer the most important operational questions.

FAQ

What is the difference between monitoring and alerting?

Monitoring collects and displays metrics. Alerting notifies humans when metrics indicate a problem that needs attention.

How many alerts is too many?

If your team receives more than 5-10 alerts per day, alert fatigue is a likely outcome. Aim for a small number of alerts, each of which requires action.

What is the best monitoring tool for SRE?

Prometheus with Grafana is the most common open-source stack. Commercial alternatives include Datadog, New Relic, and SignalFx.

How often should alert rules be reviewed?

Review alert rules quarterly. Remove rules that have not fired in 6 months or that consistently fire without requiring action.
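
A quarterly review along these lines can be partly automated. The sketch below assumes you can export, for each rule, the date it last fired, how often it fired, and how many of those firings led to action; the `history` data, the `review` function, and the 50 percent action-ratio cutoff are hypothetical choices for illustration.

```python
from datetime import date, timedelta

# Hypothetical export of alert history for the review period.
history = {
    "HighLatency": {"last_fired": date(2026, 6, 1), "fired": 12, "actioned": 10},
    "HighCPU": {"last_fired": date(2025, 9, 15), "fired": 3, "actioned": 3},
    "DiskSpaceWarn": {"last_fired": date(2026, 6, 20), "fired": 40, "actioned": 2},
}


def review(history, today, stale_after=timedelta(days=182), min_action_ratio=0.5):
    """Classify each rule as stale, noisy, or worth keeping."""
    findings = {}
    for name, stats in history.items():
        if today - stats["last_fired"] > stale_after:
            findings[name] = "stale: has not fired in about 6 months"
        elif stats["fired"] and stats["actioned"] / stats["fired"] < min_action_ratio:
            findings[name] = "noisy: most firings needed no action"
        else:
            findings[name] = "keep"
    return findings


findings = review(history, today=date(2026, 6, 23))
for name, verdict in findings.items():
    print(f"{name}: {verdict}")
```

A stale or noisy verdict is a prompt for discussion, not an automatic deletion: a rule that rarely fires may still guard against a severe failure.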

What is the four golden signals approach?

A monitoring strategy focusing on latency, traffic, errors, and saturation — the four metrics that give the most complete picture of service health.
