Incident Response with Monitoring: Runbooks and Escalation

DodaTech Updated 2026-06-23 8 min read

In this tutorial, you'll learn about Incident Response with Monitoring: Runbooks and Escalation. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

What You Will Learn

This tutorial teaches you how to build an Incident Response process integrated with your monitoring stack -- including automated runbooks, escalation policies, on-call rotations, and post-incident reviews that drive continuous improvement.

Why It Matters

Alerting without a response process is noise. When a critical alert fires, every second counts. A well-designed Incident Response process can cut Mean Time to Resolve (MTTR) from hours to minutes by giving responders clear procedures, automated diagnostic data, and immediate escalation paths.

Real-World Use

The DodaTech security team received a PagerDuty alert for unusual traffic patterns. The runbook automatically collected network logs, recent deployment history, and affected endpoint lists. Within 3 minutes, the on-call engineer identified a DDoS attack and activated the mitigation runbook. Without the runbook, gathering the same information would have taken 20 minutes.

Incident Response is the process of detecting, responding to, and learning from service disruptions. When integrated with monitoring, alerts trigger runbooks that automate diagnosis, notify the right people, and track the incident lifecycle. PagerDuty and Opsgenie are popular incident management platforms that integrate with Prometheus alerting.


Prerequisites

  • A Prometheus and Alertmanager instance (see Alerting with Alertmanager)
  • An incident management account (PagerDuty, Opsgenie)
  • Basic Python scripting ability
  • Docker installed

Step-by-Step Tutorial

Step 1: Define Incident Severity Levels

severity_levels:
  critical:
    label: "SEV-1"
    response_time: "5 minutes"
    description: "Service unavailable or data loss"
    notification: "PagerDuty phone call + SMS"

  warning:
    label: "SEV-2"
    response_time: "15 minutes"
    description: "Degraded performance or partial outage"
    notification: "PagerDuty push + Slack"

  info:
    label: "SEV-3"
    response_time: "1 hour"
    description: "Non-urgent issue requiring investigation"
    notification: "Slack only"
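
One way to apply these levels in code is a lookup from an alert's `severity` label to its response target. The following is a minimal sketch, assuming alerts carry a `severity` label with the three values above. The `SeverityLevel` class and the `response_deadline` helper are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SeverityLevel:
    label: str
    response_time: timedelta
    notification: str


# Mirrors the severity_levels definition above.
SEVERITY_LEVELS = {
    "critical": SeverityLevel("SEV-1", timedelta(minutes=5), "PagerDuty phone call + SMS"),
    "warning": SeverityLevel("SEV-2", timedelta(minutes=15), "PagerDuty push + Slack"),
    "info": SeverityLevel("SEV-3", timedelta(hours=1), "Slack only"),
}


def response_deadline(severity: str, fired_at: datetime) -> datetime:
    """Return the time by which a responder should have acknowledged the alert."""
    # Assumption: unknown severities fall back to the least urgent level.
    level = SEVERITY_LEVELS.get(severity, SEVERITY_LEVELS["info"])
    return fired_at + level.response_time


fired = datetime(2026, 6, 23, 12, 0, tzinfo=timezone.utc)
print(SEVERITY_LEVELS["critical"].label, response_deadline("critical", fired).isoformat())
```

Keeping the mapping in one place means the paging rules and any reporting on missed response targets read from the same definition.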

Step 2: Configure Alertmanager for PagerDuty

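receivers:
  - name: "pagerduty-critical"
    pagerduty_configs:
      - routing_key: "YOUR_PD_ROUTING_KEY"
        severity: critical
        # "pd.description" and "pd.details" are custom templates. Define them
        # in a file listed under the top-level `templates:` setting.
        description: '{{ template "pd.description" . }}'
        details:
          firing: '{{ template "pd.details" . }}'
          # Links each alert to the runbook named after it.
          runbook: 'https://runbooks.dodatech.com/{{ .GroupLabels.alertname }}'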
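
To check that this receiver is actually reached, it can help to push a synthetic alert into Alertmanager. The sketch below only builds the request body for Alertmanager's v2 alerts endpoint (`POST /api/v2/alerts`). The `build_test_alert` helper, the label values, and the summary text are illustrative assumptions, and sending the request (for example with `requests.post`) is deliberately left out.

```python
import json
from datetime import datetime, timezone


def build_test_alert(alertname, instance, severity="critical"):
    """Build a one-alert request body for POST /api/v2/alerts."""
    return [
        {
            "labels": {
                "alertname": alertname,
                "instance": instance,
                "severity": severity,
            },
            "annotations": {
                "summary": f"Synthetic {alertname} alert for routing tests",
            },
            # RFC 3339 timestamp in UTC
            "startsAt": datetime.now(timezone.utc).isoformat(),
        }
    ]


body = build_test_alert("HighCPUUsage", "10.0.0.5:9100")
print(json.dumps(body, indent=2))
```

Whether the alert ends up at `pagerduty-critical` depends on the `route` section of the Alertmanager configuration, which is not shown in this tutorial.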

Step 3: Create a Runbook Template

Create runbooks/high-cpu-usage.md:

# High CPU Usage Runbook

## Alert
`HighCPUUsage` fires when CPU exceeds 80% for 10 minutes.

## Immediate Actions
1. Acknowledge the alert in PagerDuty
2. Check the Grafana dashboard: [CPU Dashboard](https://grafana.dodatech.com/d/cpu)
3. Run initial diagnostics:
   ```bash
   ssh $INSTANCE
   top -b -n 1 | head -20
   free -m
   df -h
   ```

## Diagnosis Steps
1. Check for runaway processes: `ps aux --sort=-%cpu | head -10`
2. Check memory pressure: `vmstat 1 5`
3. Check disk I/O: `iostat -x 1 5`
4. Check network connections: `ss -tuln`

## Resolution
- If a specific process is causing high CPU: restart or kill the process
- If the instance is undersized: add more resources or scale horizontally
- If it is a deployment regression: roll back the last deployment

## Escalation
If unresolved after 15 minutes: escalate to the infrastructure team lead.
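
Because every runbook follows the same template, a small check can flag runbooks that are missing a section before they are merged. This is a sketch only: the list of required headings is taken from the template above, and `missing_sections` is a hypothetical helper.

```python
import re

# Section headings taken from the runbook template above.
REQUIRED_SECTIONS = [
    "Alert",
    "Immediate Actions",
    "Diagnosis Steps",
    "Resolution",
    "Escalation",
]


def missing_sections(markdown_text):
    """Return required section names that have no level-2 heading in the runbook."""
    found = set(
        re.findall(r"^##[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE)
    )
    return [name for name in REQUIRED_SECTIONS if name not in found]


sample = "# High CPU Usage Runbook\n\n## Alert\n...\n\n## Resolution\n...\n"
print(missing_sections(sample))
```

Run against every file under `runbooks/` in a pre-merge check, this would catch incomplete runbooks during review rather than during an incident.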


Step 4: Build an Automated Runbook Script

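#!/usr/bin/env python3
"""Automated runbook executor for high CPU alerts.

Reads a single alert as JSON on stdin and prints diagnostics as JSON.
"""

import json
import sys

import paramiko

# Read-only diagnostic commands, keyed by the output field they populate.
COMMANDS = {
    "top_output": "top -b -n 1 | head -5",
    "top_processes": "ps aux --sort=-%cpu | head -5",
}


def gather_info(instance_ip):
    """Run the diagnostic commands over SSH and return their output."""
    info = {}
    ssh = paramiko.SSHClient()
    # AutoAddPolicy trusts unknown host keys; acceptable for a demo only.
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        ssh.connect(instance_ip, username="admin", timeout=10)
        for key, command in COMMANDS.items():
            _, stdout, _ = ssh.exec_command(command)
            info[key] = stdout.read().decode()
    except Exception as e:
        # Report the failure in the output instead of crashing the runbook.
        info["error"] = str(e)
    finally:
        ssh.close()
    return info


if __name__ == "__main__":
    alert_data = json.loads(sys.stdin.read())
    # Prometheus instance labels usually look like "host:port"; SSH needs the host.
    instance = alert_data["labels"]["instance"].split(":")[0]
    info = gather_info(instance)
    print(json.dumps(info, indent=2))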
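
The script above expects a single alert object on stdin, while an Alertmanager webhook delivers a batch of alerts in one payload. A small adapter could bridge the two. This sketch assumes the standard webhook fields (`alerts`, `status`, `labels`) and an `instance` label in `host:port` form. `firing_hosts` is a hypothetical helper name and the addresses are made up.

```python
def firing_hosts(payload):
    """Return the unique hosts of firing alerts, preserving first-seen order."""
    hosts = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no diagnostics
        instance = alert.get("labels", {}).get("instance")
        if not instance:
            continue  # nothing to connect to
        host = instance.split(":")[0]  # drop the scrape port
        if host not in hosts:
            hosts.append(host)
    return hosts


payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "HighCPUUsage", "instance": "10.0.0.5:9100"}},
        {"status": "resolved", "labels": {"alertname": "HighCPUUsage", "instance": "10.0.0.6:9100"}},
        {"status": "firing", "labels": {"alertname": "HighCPUUsage", "instance": "10.0.0.5:9100"}},
        {"status": "firing", "labels": {"alertname": "HighCPUUsage", "instance": "10.0.0.7:9100"}},
    ],
}
print(firing_hosts(payload))
```

De-duplicating hosts keeps the executor from opening several SSH sessions to the same machine when more than one alert fires for it.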

Step 5: Set Up Escalation Policies in PagerDuty

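# Illustrative layout only; PagerDuty policies are managed in its UI or API.
escalation_policies:
  - name: "Infrastructure Critical"
    escalation_rules:
      - target: "primary-oncall"
        delay: 0     # minutes after the incident triggers
      - target: "secondary-oncall"
        delay: 5
      - target: "infra-manager"
        delay: 15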
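
The effect of such a policy can be reasoned about with a few lines of code. The sketch below assumes each `delay` is the number of minutes since the incident was triggered without acknowledgement, which is one reading of the policy above. `notified_targets` is a hypothetical helper and does not call PagerDuty.

```python
# (target, minutes after the incident triggers), mirroring the policy above
ESCALATION_RULES = [
    ("primary-oncall", 0),
    ("secondary-oncall", 5),
    ("infra-manager", 15),
]


def notified_targets(minutes_unacknowledged):
    """Return every target that has been paged after the given number of minutes."""
    return [
        target
        for target, delay in ESCALATION_RULES
        if minutes_unacknowledged >= delay
    ]


for minutes in (0, 5, 20):
    print(minutes, notified_targets(minutes))
```

A table-driven check like this is useful when reviewing a policy change: it shows at a glance who would be paged at any point in an unacknowledged incident.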

Step 6: Implement Status Page Integration

from flask import Flask, jsonify, request

app = Flask(__name__)

# Simple in-memory status page that reflects alert state
status = {
    "api": "operational",
    "database": "operational",
    "cdn": "operational"
}

@app.route("/api/status")
def get_status():
    return jsonify({
        "page": "DodaTech Status",
        "status": "all_good" if all(
            s == "operational" for s in status.values()
        ) else "degraded",
        "components": status
    })

# Called by Alertmanager webhook on alert fire/resolve
@app.route("/api/status/update", methods=["POST"])
def update_status():
    data = request.get_json()
    updated = {}
    # The webhook payload carries a list of alerts, each with its own
    # labels and status ("firing" or "resolved").
    for alert in data.get("alerts", []):
        service = alert.get("labels", {}).get("service")
        if service not in status:
            continue  # ignore alerts that do not map to a known component
        status[service] = (
            "degraded" if alert.get("status") == "firing" else "operational"
        )
        updated[service] = status[service]
    return jsonify({"updated": updated})

Step 7: Create a Post-Incident Review Template

# Post-Incident Review

## Incident Summary
- **Date**: {{ date }}
- **Duration**: {{ duration }}
- **Severity**: {{ severity }}
- **Services Affected**: {{ services }}

## Timeline
| Time | Event |
|------|-------|
| {{ time }} | Alert fired |
| {{ time }} | Engineer acknowledged |
| {{ time }} | Root cause identified |
| {{ time }} | Mitigation applied |
| {{ time }} | Alert resolved |

## Root Cause
{{ root_cause }}

## Action Items
- [ ] Add monitoring for the missing metric
- [ ] Update the runbook with the resolution steps
- [ ] Schedule a follow-up to implement the permanent fix

## Lessons Learned
{{ lessons }}
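
The `{{ name }}` placeholders look like Jinja2 syntax, and a template engine would be the usual choice for filling them. As a dependency-free illustration, the sketch below substitutes placeholders with the standard library only. `render_template` is a hypothetical helper and the values are made up.

```python
import re


def render_template(template, values):
    """Replace {{ name }} placeholders; unknown names are left untouched."""

    def substitute(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = (
    "- **Date**: {{ date }}\n"
    "- **Duration**: {{ duration }}\n"
    "- **Severity**: {{ severity }}"
)
print(render_template(template, {"date": "2026-06-23", "severity": "SEV-1"}))
```

Leaving unknown placeholders in place makes gaps visible in the rendered review. Note that the timeline rows in the template all use `{{ time }}`, so they would need distinct names (or a loop in a real template engine) to receive different values.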

Step 8: Automate Post-Incident Data Collection

import requests

PROMETHEUS = "http://prometheus:9090"

# PromQL per metric: counters need rate(), gauges can be averaged directly.
QUERIES = {
    # Fraction of CPU time spent non-idle, averaged across all CPUs
    "node_cpu_seconds_total":
        '1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))',
    "node_memory_MemAvailable_bytes":
        "avg(node_memory_MemAvailable_bytes)",
}


def query_range(query, start_time, end_time, step="60s"):
    """Run a range query; start_time and end_time are datetime objects."""
    params = {
        "query": query,
        # Unix timestamps avoid RFC 3339 formatting issues with naive
        # datetimes, which timestamp() interprets as local time.
        "start": start_time.timestamp(),
        "end": end_time.timestamp(),
        "step": step,
    }
    resp = requests.get(f"{PROMETHEUS}/api/v1/query_range", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()


def collect_incident_data(alert_name, start_time, end_time):
    # Collect metric data around the incident
    data = {name: query_range(q, start_time, end_time) for name, q in QUERIES.items()}

    # Collect alert timeline
    data["alert_timeline"] = query_range(
        f'ALERTS{{alertname="{alert_name}"}}', start_time, end_time
    )
    return data
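
The alert timeline comes back as a Prometheus matrix result, which still has to be turned into timestamps for the review. The following sketch assumes the standard `query_range` response shape and the `alertstate` label that Prometheus attaches to `ALERTS` series. `firing_window` is a hypothetical helper and the sample data is made up.

```python
def firing_window(query_range_response):
    """Return (first, last) Unix timestamps at which the alert was firing, or None."""
    timestamps = []
    for series in query_range_response.get("data", {}).get("result", []):
        if series.get("metric", {}).get("alertstate") != "firing":
            continue  # skip "pending" series
        timestamps.extend(float(ts) for ts, _value in series.get("values", []))
    if not timestamps:
        return None
    return min(timestamps), max(timestamps)


sample = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"alertname": "HighCPUUsage", "alertstate": "pending"},
                "values": [[1000, "1"]],
            },
            {
                "metric": {"alertname": "HighCPUUsage", "alertstate": "firing"},
                "values": [[1060, "1"], [1120, "1"], [1180, "1"]],
            },
        ],
    },
}
print(firing_window(sample))
```

The timestamps are only as precise as the query `step` (60 seconds here), so they are a starting point for the timeline rather than exact fire and resolve times.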

Learning Path

flowchart LR
    A[Alert Fires] --> B[PagerDuty/Opsgenie]
    B --> C[Responder Acknowledges]
    C --> D[Runbook Executes]
    D --> E[Auto-Diagnostics]
    E --> F[Mitigation]
    F --> G[Incident Resolved]
    G --> H[Post-Incident Review]
    H --> I[Action Items]
    I --> A
    style A fill:#e74c3c,color:#fff
    style G fill:#2ecc71,color:#fff
    style H fill:#e67e22,color:#fff

Common Errors

  1. Runbook is outdated and points to wrong dashboards -- Runbooks are not version-controlled. Store runbooks in a Git repository and put changes through the same review process as code.

  2. Escalation does not happen because PagerDuty schedule is empty -- The on-call schedule is not configured or the user did not confirm their shifts. Verify schedules weekly.

  3. Automated runbook script fails due to SSH key issues -- The SSH keys are not deployed to the runbook executor. Use a bastion host with proper key management.

  4. Status page shows operational during an outage -- The webhook from Alertmanager to the status page API is failing. Monitor the webhook endpoint with a synthetic check.

  5. Post-incident review is never written -- The review process is not enforced. Block feature deploys until the PIR is completed.

  6. Multiple alerts for the same incident cause notification storm -- Alert grouping in Alertmanager is not configured. Group by alertname and instance.

  7. On-call engineer misses the notification -- The only notification channel is Slack. Always use multiple channels (push, SMS, phone) for critical alerts.
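
To make error 6 concrete, the sketch below shows what grouping by `alertname` and `instance` does to a burst of alerts. Alertmanager performs this itself through the `group_by` setting on a route; the `group_alerts` function only illustrates the effect and uses made-up alerts.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "instance")):
    """Bucket alerts by the values of the given labels."""
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "HighCPUUsage", "instance": "web-1", "cpu": "0"}},
    {"labels": {"alertname": "HighCPUUsage", "instance": "web-1", "cpu": "1"}},
    {"labels": {"alertname": "HighCPUUsage", "instance": "web-2", "cpu": "0"}},
]
for key, members in group_alerts(alerts).items():
    print(key, len(members))
```

Three alerts collapse into two notifications here. Grouping on fewer labels (for example `alertname` alone) reduces noise further at the cost of less specific pages.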


Practice Questions

  1. What is the purpose of a runbook in Incident Response? Answer: A runbook provides step-by-step procedures for diagnosing and resolving a specific alert, reducing cognitive load during high-pressure situations.

  2. How does escalation work in PagerDuty? Answer: If the primary responder does not acknowledge within a set delay, the alert escalates to the secondary responder, then to a manager.

  3. What should a post-incident review include? Answer: Timeline of events, root cause analysis, action items, and lessons learned -- all focused on preventing recurrence.

  4. How can monitoring help during Incident Response? Answer: Monitoring provides real-time metrics, trace data, and logs that help responders understand the scope and impact of the incident.

  5. What is the difference between MTTR and MTTD? Answer: MTTR (Mean Time to Resolve) is the time from detection to resolution. MTTD (Mean Time to Detect) is the time from the incident occurring to the alert firing.
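
The two metrics from question 5 can be computed from three timestamps per incident. This is a worked sketch with made-up incident data; the `occurred`, `detected`, and `resolved` field names are assumptions for illustration.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(intervals):
    """Average length, in minutes, of a series of (start, end) datetime pairs."""
    return mean((end - start).total_seconds() for start, end in intervals) / 60


# Made-up incidents for illustration
incidents = [
    {
        "occurred": datetime(2026, 6, 1, 10, 0),
        "detected": datetime(2026, 6, 1, 10, 4),
        "resolved": datetime(2026, 6, 1, 10, 34),
    },
    {
        "occurred": datetime(2026, 6, 8, 22, 0),
        "detected": datetime(2026, 6, 8, 22, 2),
        "resolved": datetime(2026, 6, 8, 22, 22),
    },
]

# MTTD: incident start to alert; MTTR: detection to resolution
mttd = mean_minutes((i["occurred"], i["detected"]) for i in incidents)
mttr = mean_minutes((i["detected"], i["resolved"]) for i in incidents)
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```

Here detection took 4 and 2 minutes (MTTD 3.0) and resolution took 30 and 20 minutes after detection (MTTR 25.0).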


Challenge

Build a complete Incident Response integration for a production service. Configure Alertmanager to send critical alerts to PagerDuty with an escalation policy: primary on-call (5-minute acknowledgement window), secondary (10-minute escalation), engineering manager (15-minute). Create runbooks for three alert types: InstanceDown, HighErrorRate, and HighLatency. Build an automated runbook script that, when triggered by a webhook, SSHes into the affected instance and collects top output, disk usage, and recent logs. Create a status page API that automatically updates component status based on alert state. Set up a post-incident review template that pre-populates the incident timeline from Prometheus data. Test the entire flow by simulating an instance failure and verifying each step.


FAQ

Do I need PagerDuty for Incident Response?

No, you can use Opsgenie, Splunk On-Call, or even a dedicated Slack channel with a rotation bot. PagerDuty is the most widely used but not required.

What is the ideal response time for critical alerts?

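5 minutes or less for critical (SEV-1) production outages, 15 minutes for warning-level (SEV-2) issues such as degraded performance, and 1 hour for informational (SEV-3) issues, matching the severity levels defined in Step 1.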

How often should runbooks be updated?

Review runbooks after every incident where the runbook was used. At minimum, review all runbooks quarterly.

Can I automate the entire Incident Response?

Many steps can be automated (diagnostics, data collection, status page updates), but human judgment is still needed for root cause analysis and complex mitigation.

What is a "war room" in Incident Response?

A war room is a dedicated communication channel (Slack, Zoom) where responders collaborate during an active incident. It should auto-create when a critical alert fires.
