Incident Response with Monitoring: Runbooks and Escalation
This tutorial shows how to connect incident response to your monitoring stack, using runbooks and escalation policies so that every alert leads to a defined action.
What You Will Learn
This tutorial teaches you how to build an Incident Response process integrated with your monitoring stack -- including automated runbooks, escalation policies, on-call rotations, and post-incident reviews that drive continuous improvement.
Why It Matters
Alerting without a response process is noise. When a critical alert fires, every second counts. A well-designed Incident Response process reduces Mean Time to Resolve (MTTR) from hours to minutes by giving responders clear procedures, automated diagnostic data, and immediate escalation paths.
Real-World Use
The DodaTech security team received a PagerDuty alert for unusual traffic patterns. The runbook automatically collected network logs, recent deployment history, and affected endpoint lists. Within 3 minutes, the on-call engineer identified a DDoS attack and activated the mitigation runbook. Without the runbook, gathering the same information would have taken 20 minutes.
Incident Response is the process of detecting, responding to, and learning from service disruptions. When integrated with monitoring, alerts trigger runbooks that automate diagnosis, notify the right people, and track the incident lifecycle. PagerDuty and Opsgenie are popular incident management platforms that integrate with Prometheus alerting.
Prerequisites
- A Prometheus and Alertmanager instance (see Alerting with Alertmanager)
- An incident management account (PagerDuty, Opsgenie)
- Basic Python scripting ability
- Docker installed
Step-by-Step Tutorial
Step 1: Define Incident Severity Levels
severity_levels:
  critical:
    label: "SEV-1"
    response_time: "5 minutes"
    description: "Service unavailable or data loss"
    notification: "PagerDuty phone call + SMS"
  warning:
    label: "SEV-2"
    response_time: "15 minutes"
    description: "Degraded performance or partial outage"
    notification: "PagerDuty push + Slack"
  info:
    label: "SEV-3"
    response_time: "1 hour"
    description: "Non-urgent issue requiring investigation"
    notification: "Slack only"
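One way to make these levels actionable is to encode them in a small lookup that an alert handler can consult. The sketch below is illustrative only: `SeverityLevel`, `SEVERITY_LEVELS`, and `classify_alert` are hypothetical names, and it assumes each alert carries a `severity` label whose values match the keys above. Falling back to the lowest level for unknown severities is also an assumption, not a recommendation.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityLevel:
    label: str
    response_time: timedelta
    notification: str


# Mirrors the severity definitions above; the structure is illustrative.
SEVERITY_LEVELS = {
    "critical": SeverityLevel("SEV-1", timedelta(minutes=5), "PagerDuty phone call + SMS"),
    "warning": SeverityLevel("SEV-2", timedelta(minutes=15), "PagerDuty push + Slack"),
    "info": SeverityLevel("SEV-3", timedelta(hours=1), "Slack only"),
}


def classify_alert(labels: dict) -> SeverityLevel:
    """Map an alert's labels to a severity level.

    Missing or unrecognised severities fall back to "info" (an assumption
    made for this sketch; a stricter setup might raise instead).
    """
    return SEVERITY_LEVELS.get(labels.get("severity", "info"), SEVERITY_LEVELS["info"])


if __name__ == "__main__":
    level = classify_alert({"alertname": "HighCPUUsage", "severity": "critical"})
    print(level.label, level.response_time, level.notification)
```

Keeping the mapping in one place means the response-time targets used in dashboards and in paging rules cannot drift apart.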
Step 2: Configure Alertmanager for PagerDuty
receivers:
  - name: "pagerduty-critical"
    pagerduty_configs:
      - routing_key: "YOUR_PD_ROUTING_KEY"
        severity: critical
        description: '{{ template "pd.description" . }}'
        details:
          firing: '{{ template "pd.details" . }}'
          runbook: 'https://runbooks.dodatech.com/{{ .GroupLabels.alertname }}'
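Before relying on this receiver, it can help to confirm that the routing key actually pages someone. The following is a minimal sketch that builds a test event for the PagerDuty Events API v2 and, optionally, sends it. The helper names `build_test_event` and `send_event` are hypothetical, the `[TEST]` summary prefix is an arbitrary convention, and the runbook URL simply reuses the base URL from the config above.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
RUNBOOK_BASE_URL = "https://runbooks.dodatech.com"  # same base as the receiver config


def build_test_event(routing_key: str, alertname: str, instance: str) -> dict:
    """Build an Events API v2 'trigger' event that links to the alert's runbook."""
    runbook_url = f"{RUNBOOK_BASE_URL}/{alertname}"
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"[TEST] {alertname} on {instance}",
            "source": instance,
            "severity": "critical",
            "custom_details": {"runbook": runbook_url},
        },
        "links": [{"href": runbook_url, "text": "Runbook"}],
    }


def send_event(event: dict) -> int:
    """POST the event to PagerDuty and return the HTTP status code."""
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    event = build_test_event("YOUR_PD_ROUTING_KEY", "HighCPUUsage", "10.0.0.5")
    # Inspect the payload first; call send_event(event) only when a real page is intended.
    print(json.dumps(event, indent=2))
```

Printing the payload by default keeps the script safe to run during setup, since sending it pages whoever is on call.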
Step 3: Create a Runbook Template
Create runbooks/high-cpu-usage.md:
# High CPU Usage Runbook
## Alert
`HighCPUUsage` fires when CPU exceeds 80% for 10 minutes.
## Immediate Actions
1. Acknowledge the alert in PagerDuty
2. Check the Grafana dashboard: [CPU Dashboard](https://grafana.dodatech.com/d/cpu)
3. Run initial diagnostics:
```bash
ssh $INSTANCE
top -b -n 1 | head -20
free -m
df -h
```
## Diagnosis Steps
- Check for runaway processes: `ps aux --sort=-%cpu | head -10`
- Check memory pressure: `vmstat 1 5`
- Check disk I/O: `iostat -x 1 5`
- Check network connections: `ss -tuln`
## Resolution
- If a specific process is causing high CPU: restart or kill the process
- If the instance is undersized: add more resources or scale horizontally
- If it is a deployment regression: roll back the last deployment
## Escalation
If unresolved after 15 minutes: escalate to the infrastructure team lead.
Step 4: Build an Automated Runbook Script
```python
#!/usr/bin/env python3
"""Automated runbook executor for high CPU alerts."""
import json
import sys

import paramiko


def gather_info(instance_ip):
    """SSH into the instance and capture a snapshot of CPU usage."""
    info = {}
    ssh = paramiko.SSHClient()
    # AutoAddPolicy trusts unknown host keys; pin known hosts in production.
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        ssh.connect(instance_ip, username="admin", timeout=10)
        _, stdout, _ = ssh.exec_command("top -b -n 1 | head -5")
        info["top_output"] = stdout.read().decode()
        _, stdout, _ = ssh.exec_command("ps aux --sort=-%cpu | head -5")
        info["top_processes"] = stdout.read().decode()
    except Exception as e:
        # Report the failure in the output instead of crashing the runbook.
        info["error"] = str(e)
    finally:
        ssh.close()
    return info


if __name__ == "__main__":
    # Expects a single alert object (with a "labels" map) as JSON on stdin.
    alert_data = json.loads(sys.stdin.read())
    # Prometheus instance labels are usually "host:port"; keep only the host.
    instance = alert_data["labels"]["instance"].rsplit(":", 1)[0]
    info = gather_info(instance)
    print(json.dumps(info, indent=2))
```
Step 5: Set Up Escalation Policies in PagerDuty
escalation_policies:
  - name: "Infrastructure Critical"
    escalation_rules:
      - target: "primary-oncall"
        delay: 0    # minutes
      - target: "secondary-oncall"
        delay: 5
      - target: "infra-manager"
        delay: 15
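To sanity-check a policy like this before entering it in PagerDuty, the escalation logic can be simulated locally. This is a rough sketch under two assumptions: each `delay` is measured in minutes from the moment the incident triggers, and escalation stops as soon as someone acknowledges. `EscalationRule` and `notified_targets` are hypothetical names, not part of any PagerDuty client.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EscalationRule:
    target: str
    delay_minutes: int  # assumed: minutes after the incident triggers


# Mirrors the "Infrastructure Critical" policy above.
INFRA_CRITICAL = [
    EscalationRule("primary-oncall", 0),
    EscalationRule("secondary-oncall", 5),
    EscalationRule("infra-manager", 15),
]


def notified_targets(
    rules: List[EscalationRule],
    minutes_elapsed: float,
    acknowledged_at: Optional[float] = None,
) -> List[str]:
    """Return the targets paged so far, in escalation order.

    A rule fires only if its delay has passed and nobody acknowledged
    the incident before that delay was reached.
    """
    targets = []
    for rule in sorted(rules, key=lambda r: r.delay_minutes):
        if rule.delay_minutes > minutes_elapsed:
            break
        if acknowledged_at is not None and rule.delay_minutes >= acknowledged_at:
            break
        targets.append(rule.target)
    return targets


if __name__ == "__main__":
    for minute in (0, 7, 20):
        print(minute, notified_targets(INFRA_CRITICAL, minute))
    print("acked at 3:", notified_targets(INFRA_CRITICAL, 20, acknowledged_at=3))
```

Running the simulation for a few timestamps makes gaps visible, for example a long stretch where only one person has been paged.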
Step 6: Implement Status Page Integration
from flask import Flask, jsonify, request

app = Flask(__name__)

# Simple status page that reflects alert state
status = {
    "api": "operational",
    "database": "operational",
    "cdn": "operational",
}


@app.route("/api/status")
def get_status():
    all_ok = all(s == "operational" for s in status.values())
    return jsonify({
        "page": "DodaTech Status",
        "status": "all_good" if all_ok else "degraded",
        "components": status,
    })


# Called by Alertmanager webhook on alert fire/resolve
@app.route("/api/status/update", methods=["POST"])
def update_status():
    data = request.get_json()
    updated = {}
    # Alertmanager wraps the alerts of each notification in an "alerts" list.
    for alert in data.get("alerts", []):
        service = alert["labels"].get("service", "unknown")
        # Each alert's status is "firing" or "resolved".
        status[service] = "degraded" if alert["status"] == "firing" else "operational"
        updated[service] = status[service]
    return jsonify({"updated": updated})
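A service can have more than one alert firing at once, and a handler that marks a component operational on any single resolved alert would clear it too early. The sketch below shows one possible way around that: track the set of firing alerts per component and report "operational" only when the set is empty. `ComponentTracker` is a hypothetical helper; it assumes the Alertmanager webhook body carries an `alerts` list whose entries have `status`, `labels`, and `fingerprint` fields, and it falls back to the `alertname` label when no fingerprint is present.

```python
class ComponentTracker:
    """Tracks which alerts are currently firing for each status-page component."""

    def __init__(self, components):
        self._firing = {name: set() for name in components}

    def apply(self, payload: dict) -> None:
        """Update state from one Alertmanager webhook payload."""
        for alert in payload.get("alerts", []):
            labels = alert.get("labels", {})
            service = labels.get("service")
            if service not in self._firing:
                continue  # ignore alerts for components the page does not list
            key = alert.get("fingerprint") or labels.get("alertname", "")
            if alert.get("status") == "firing":
                self._firing[service].add(key)
            else:
                self._firing[service].discard(key)

    def snapshot(self) -> dict:
        """Return the component -> status mapping for the status page."""
        return {
            name: "degraded" if keys else "operational"
            for name, keys in self._firing.items()
        }


if __name__ == "__main__":
    tracker = ComponentTracker(["api", "database", "cdn"])
    tracker.apply({"alerts": [
        {"status": "firing", "fingerprint": "a1", "labels": {"service": "api"}},
        {"status": "firing", "fingerprint": "b2", "labels": {"service": "api"}},
    ]})
    tracker.apply({"alerts": [
        {"status": "resolved", "fingerprint": "a1", "labels": {"service": "api"}},
    ]})
    print(tracker.snapshot())  # "api" stays degraded while b2 is still firing
```

The state lives in memory here, so a restart resets every component to operational; a real status page would likely persist it.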
Step 7: Create a Post-Incident Review Template
# Post-Incident Review
## Incident Summary
- **Date**: {{ date }}
- **Duration**: {{ duration }}
- **Severity**: {{ severity }}
- **Services Affected**: {{ services }}
## Timeline
| Time | Event |
|------|-------|
| {{ time }} | Alert fired |
| {{ time }} | Engineer acknowledged |
| {{ time }} | Root cause identified |
| {{ time }} | Mitigation applied |
| {{ time }} | Alert resolved |
## Root Cause
{{ root_cause }}
## Action Items
- [ ] Add monitoring for the missing metric
- [ ] Update the runbook with the resolution steps
- [ ] Schedule a follow-up to implement the permanent fix
## Lessons Learned
{{ lessons }}
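The `{{ name }}` placeholders in this template can be filled with a few lines of standard-library Python. The following is a minimal sketch, not a full template engine: `render_review` is a hypothetical helper that substitutes known names and leaves unknown placeholders untouched so that missing sections remain visible. Because every timeline row uses the same `{{ time }}` placeholder, this approach would write one value into all rows; distinct names per row would be needed to fill the timeline individually.

```python
import re

# Matches placeholders such as "{{ date }}" or "{{root_cause}}".
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_review(template: str, values: dict) -> str:
    """Fill known placeholders; leave unknown ones as-is for manual completion."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(values.get(name, match.group(0)))

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    template = (
        "## Incident Summary\n"
        "- **Date**: {{ date }}\n"
        "- **Severity**: {{ severity }}\n"
        "## Root Cause\n"
        "{{ root_cause }}\n"
    )
    print(render_review(template, {"date": "2024-05-01", "severity": "SEV-1"}))
```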
Step 8: Automate Post-Incident Data Collection
import requests

PROMETHEUS = "http://prometheus:9090"


def query_range(query, start_time, end_time, step="60s"):
    """Run a PromQL range query and return the decoded JSON response."""
    params = {
        "query": query,
        # Unix timestamps avoid the missing-timezone problem of naive
        # isoformat() strings (naive datetimes are treated as local time).
        "start": start_time.timestamp(),
        "end": end_time.timestamp(),
        "step": step,
    }
    resp = requests.get(f"{PROMETHEUS}/api/v1/query_range", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()


def collect_incident_data(alert_name, start_time, end_time):
    # Collect metric data around the incident. The CPU metric is a counter,
    # so derive the busy fraction from its rate; available memory is a gauge
    # and can be averaged directly.
    queries = {
        "node_cpu_seconds_total": '1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))',
        "node_memory_MemAvailable_bytes": "avg(node_memory_MemAvailable_bytes)",
    }
    data = {metric: query_range(q, start_time, end_time) for metric, q in queries.items()}
    # Collect alert timeline
    data["alert_timeline"] = query_range(
        f'ALERTS{{alertname="{alert_name}"}}', start_time, end_time
    )
    return data
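The `alert_timeline` entry returned by `collect_incident_data` can be turned into the "alert fired" and "alert resolved" rows of the review. The sketch below assumes the standard matrix layout of a Prometheus `query_range` response and that `ALERTS` series carry an `alertstate` label; `firing_window` is a hypothetical helper. The result is only as precise as the query step (60 seconds above).

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def firing_window(alert_timeline: dict) -> Optional[Tuple[datetime, datetime]]:
    """Return (first, last) UTC sample times at which the alert was firing."""
    timestamps = [
        float(ts)
        for series in alert_timeline.get("data", {}).get("result", [])
        if series.get("metric", {}).get("alertstate") == "firing"
        for ts, _value in series.get("values", [])
    ]
    if not timestamps:
        return None  # the alert never reached the firing state in this range
    first = datetime.fromtimestamp(min(timestamps), tz=timezone.utc)
    last = datetime.fromtimestamp(max(timestamps), tz=timezone.utc)
    return first, last


if __name__ == "__main__":
    sample = {
        "status": "success",
        "data": {
            "resultType": "matrix",
            "result": [
                {
                    "metric": {"alertname": "HighCPUUsage", "alertstate": "firing"},
                    "values": [[1700000000, "1"], [1700000060, "1"], [1700000120, "1"]],
                }
            ],
        },
    }
    window = firing_window(sample)
    if window:
        fired, cleared = window
        print("fired:", fired.isoformat(), "last seen firing:", cleared.isoformat())
```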
Learning Path
flowchart LR
A[Alert Fires] --> B[PagerDuty/Opsgenie]
B --> C[Responder Acknowledges]
C --> D[Runbook Executes]
D --> E[Auto-Diagnostics]
E --> F[Mitigation]
F --> G[Incident Resolved]
G --> H[Post-Incident Review]
H --> I[Action Items]
I --> A
style A fill:#e74c3c,color:#fff
style G fill:#2ecc71,color:#fff
style H fill:#e67e22,color:#fff
Common Errors
Runbook is outdated and points to wrong dashboards -- Runbooks are not version-controlled. Store runbooks in a Git repository with the same review process as code.
Escalation does not happen because PagerDuty schedule is empty -- The on-call schedule is not configured or the user did not confirm their shifts. Verify schedules weekly.
Automated runbook script fails due to SSH key issues -- The SSH keys are not deployed to the runbook executor. Use a bastion host with proper key management.
Status page shows operational during an outage -- The webhook from Alertmanager to the status page API is failing. Monitor the webhook endpoint with a synthetic check.
Post-incident review is never written -- The review process is not enforced. Block feature deploys until the PIR is completed.
Multiple alerts for the same incident cause notification storm -- Alert grouping in Alertmanager is not configured. Group by `alertname` and `instance`.
On-call engineer misses the notification -- The only notification channel is Slack. Always use multiple channels (push, SMS, phone) for critical alerts.
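To see why grouping prevents a notification storm, the effect can be reproduced outside Alertmanager. This sketch only illustrates the idea; Alertmanager performs the grouping itself through the `group_by` setting on a route. `group_alerts` is a hypothetical helper, and the sample labels are made up.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "instance")):
    """Bucket alerts by the values of the group_by labels.

    Each bucket corresponds to one notification instead of one per alert.
    """
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


if __name__ == "__main__":
    alerts = [
        {"labels": {"alertname": "HighCPUUsage", "instance": "web-1", "cpu": "0"}},
        {"labels": {"alertname": "HighCPUUsage", "instance": "web-1", "cpu": "1"}},
        {"labels": {"alertname": "HighCPUUsage", "instance": "web-2", "cpu": "0"}},
        {"labels": {"alertname": "InstanceDown", "instance": "web-2"}},
    ]
    for key, members in group_alerts(alerts).items():
        print(key, "->", len(members), "alert(s), 1 notification")
```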
Practice Questions
What is the purpose of a runbook in Incident Response? Answer: A runbook provides step-by-step procedures for diagnosing and resolving a specific alert, reducing cognitive load during high-pressure situations.
How does escalation work in PagerDuty? Answer: If the primary responder does not acknowledge within a set delay, the alert escalates to the secondary responder, then to a manager.
What should a post-incident review include? Answer: Timeline of events, root cause analysis, action items, and lessons learned -- all focused on preventing recurrence.
How can monitoring help during Incident Response? Answer: Monitoring provides real-time metrics, trace data, and logs that help responders understand the scope and impact of the incident.
What is the difference between MTTR and MTTD? Answer: MTTR (Mean Time to Resolve) is the time from detection to resolution. MTTD (Mean Time to Detect) is the time from the incident occurring to the alert firing.
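The MTTD and MTTR definitions in the last answer translate directly into arithmetic over incident timestamps. The sketch below assumes each incident record has three known times (when the fault began, when the alert fired, and when it was resolved); `Incident`, `mttd`, and `mttr` are hypothetical names and the sample data is invented. Both functions raise `statistics.StatisticsError` on an empty list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the fault actually began
    detected: datetime  # when the alert fired
    resolved: datetime  # when service was restored


def _mean_delta(deltas: List[timedelta]) -> timedelta:
    return timedelta(seconds=mean([d.total_seconds() for d in deltas]))


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean Time to Detect: fault start until the alert fires."""
    return _mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean Time to Resolve: alert firing until resolution."""
    return _mean_delta([i.resolved - i.detected for i in incidents])


if __name__ == "__main__":
    history = [
        Incident(datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 2), datetime(2024, 5, 1, 10, 32)),
        Incident(datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 14, 4), datetime(2024, 5, 3, 14, 14)),
    ]
    print("MTTD:", mttd(history))
    print("MTTR:", mttr(history))
```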
Challenge
Build a complete Incident Response integration for a production service. Configure Alertmanager to send critical alerts to PagerDuty with an escalation policy: primary on-call (5-minute acknowledgement window), secondary (10-minute escalation), engineering manager (15-minute). Create runbooks for three alert types: InstanceDown, HighErrorRate, and HighLatency. Build an automated runbook script that, when triggered by a webhook, SSHes into the affected instance and collects top output, disk usage, and recent logs. Create a status page API that automatically updates component status based on alert state. Set up a post-incident review template that pre-populates the incident timeline from Prometheus data. Test the entire flow by simulating an instance failure and verifying each step.