# Runbooks: Documenting Operational Procedures
This tutorial explains what runbooks are and walks through key concepts, practical examples, and best practices for writing and maintaining them.
A runbook is a documented procedure that explains how to perform a specific operational task — from restarting a service to responding to a SEV1 incident — enabling consistent execution regardless of who is on call.
## What You'll Learn
In this tutorial, you will learn the anatomy of an effective runbook, how to write clear step-by-step procedures, how to integrate runbooks with alerting systems, and how to keep runbooks up to date through regular reviews and testing.
## Why It Matters
When an alert fires at 3 AM, the on-call engineer needs clear instructions, not guesswork. Without runbooks, every incident is a first-time experience: response times are slower, mistakes are more common, and knowledge stays locked in individual engineers' heads. Runbooks turn tribal knowledge into organizational capability.
## Real-World Use
DodaTech maintains a runbook repository for every service. When Prometheus fires an alert, the alert payload includes a direct link to the relevant runbook. Doda Browser sync has runbooks for database failover, cache warming, deployment rollback, and certificate renewal. Each runbook is tested quarterly in a staging environment.
```mermaid
graph TD
    A[Alert Fires] --> B[On-Call Gets Alert]
    B --> C[Opens Runbook]
    C --> D{Runbook Has Steps?}
    D -->|Yes| E[Follow Procedure]
    D -->|No| F[Debug Manually]
    F --> G[Write Runbook After]
    E --> H[Resolve Incident]
    H --> I[Update Runbook]
```
## Prerequisites
Understanding Incident Response is essential since runbooks are primarily used during incidents. Familiarity with Monitoring and Alerting for SRE helps you connect alerts to the right runbooks.
## Runbook Structure
Every runbook should follow a consistent template.
| Field | Description | Example |
|---|---|---|
| Title | Clear task name | "Database Failover Runbook" |
| Alert | Which alert triggers this | "PostgreSQLReplicationLag > 30s" |
| Severity | Expected severity level | SEV1 or SEV2 |
| Prerequisites | Tools and access needed | "Access to RDS console, VPN connected" |
| Steps | Numbered instructions | See below |
| Verification | How to confirm success | "Check replication lag < 5s" |
| Rollback | How to undo if needed | "Reverse the failover procedure" |
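
The template fields above lend themselves to a simple completeness check. The following is a minimal sketch, assuming a runbook is stored as a plain dictionary keyed by lowercase field names; `REQUIRED_FIELDS` and `missing_fields` are hypothetical names chosen for illustration and do not come from any particular tool.

```python
# Required fields, taken from the template table above.
REQUIRED_FIELDS = ("title", "alert", "severity", "prerequisites",
                   "steps", "verification", "rollback")


def missing_fields(runbook):
    """Return the required fields that are absent or empty in a runbook dict."""
    return [name for name in REQUIRED_FIELDS if not runbook.get(name)]


draft = {
    "title": "Database Failover Runbook",
    "alert": "PostgreSQLReplicationLag > 30s",
    "severity": "SEV1",
    "prerequisites": ["Access to RDS console", "VPN connected"],
    "steps": ["Promote the replica"],
    "verification": "Check replication lag < 5s",
    # "rollback" is deliberately left out to show the check firing.
}

print(missing_fields(draft))  # ['rollback']
```

A check like this could run in code review so that an incomplete runbook is caught before anyone depends on it during an incident.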
## Writing Clear Steps
Each step should be a single action with a clear expected outcome.
```python
class Runbook:
    """A titled, ordered list of steps tied to one alert."""

    def __init__(self, title, alert, severity):
        self.title = title
        self.alert = alert
        self.severity = severity
        self.steps = []

    def add_step(self, number, action, expected):
        # Each step pairs one action with the outcome that confirms it worked.
        self.steps.append({
            "step": number,
            "action": action,
            "expected": expected
        })

    def execute(self, interactive=False):
        print(f"Runbook: {self.title}")
        print(f"Alert: {self.alert}")
        print(f"Severity: {self.severity}")
        print("-" * 50)
        for s in self.steps:
            print(f"Step {s['step']}: {s['action']}")
            print(f" Expected: {s['expected']}")
            # Prompt only in interactive mode so the example runs unattended.
            confirm = input(" Done? (y/n): ") if interactive else "y"
            if confirm.lower() != "y":
                print(" WARNING: Step not confirmed. Escalate.")
                break  # do not continue past an unconfirmed step


rb = Runbook("Database Failover", "PostgreSQLReplicationLag > 30s", "SEV1")
rb.add_step(1, "SSH into bastion host", "Connected to bastion")
rb.add_step(2, "Run `pg_isready -h replica`", "Replica is accepting connections")
rb.add_step(3, "Promote replica: `pg_ctl promote`", "Replica becomes primary")
rb.add_step(4, "Update DNS to point to new primary", "DNS TTL propagated")
rb.add_step(5, "Verify application connectivity", "Application healthy")
rb.execute()
print("Runbook defined successfully")
```
Expected output:
```text
Runbook: Database Failover
Alert: PostgreSQLReplicationLag > 30s
Severity: SEV1
--------------------------------------------------
Step 1: SSH into bastion host
 Expected: Connected to bastion
Step 2: Run `pg_isready -h replica`
 Expected: Replica is accepting connections
Step 3: Promote replica: `pg_ctl promote`
 Expected: Replica becomes primary
Step 4: Update DNS to point to new primary
 Expected: DNS TTL propagated
Step 5: Verify application connectivity
 Expected: Application healthy
Runbook defined successfully
```
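
The rule that each step should be a single action with a clear expected outcome can be partly checked by a script. The sketch below is a rough heuristic, not a definitive linter: the marker list in `COMPOUND_MARKERS` and the function name `lint_step` are assumptions made for illustration, and a flagged step still needs a human to decide whether to split it.

```python
import re

# Heuristic markers that often signal more than one action in a step.
# The list is an assumption for illustration; tune it to your own runbooks.
COMPOUND_MARKERS = re.compile(r"\b(and then|then|and)\b|&&|;", re.IGNORECASE)


def lint_step(action, expected):
    """Return a list of warnings for a single runbook step."""
    warnings = []
    if COMPOUND_MARKERS.search(action):
        warnings.append("action may combine several actions; consider splitting it")
    if not expected.strip():
        warnings.append("expected outcome is missing")
    return warnings


# A well-formed step produces no warnings.
print(lint_step("Promote replica: `pg_ctl promote`", "Replica becomes primary"))
# A compound step without an expected outcome produces two.
print(lint_step("Restart the service and check the logs", ""))
```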
## Integrating Runbooks with Alerting
Every alert should have a runbook. The alert notification should include a direct link.
```python
class AlertWithRunbook:
    """An alert definition that always carries a link to its runbook."""

    def __init__(self, alert_name, severity, runbook_url):
        self.alert = alert_name
        self.severity = severity
        self.runbook_url = runbook_url

    def notify(self):
        print(f"ALERT: {self.alert} ({self.severity})")
        print(f"Runbook: {self.runbook_url}")
        print("Action: Open the runbook and follow its steps")


# Each alert links to the runbook that handles it.
alerts = [
    AlertWithRunbook("PostgreSQLReplicationLag", "SEV1",
                     "https://runbooks.dodatech.com/database-failover"),
    AlertWithRunbook("DiskSpaceLow", "SEV3",
                     "https://runbooks.dodatech.com/disk-cleanup"),
    AlertWithRunbook("CertExpiring", "SEV3",
                     "https://runbooks.dodatech.com/cert-renewal"),
]
for a in alerts:
    print("-" * 40)
    a.notify()
```
Expected output:
```text
----------------------------------------
ALERT: PostgreSQLReplicationLag (SEV1)
Runbook: https://runbooks.dodatech.com/database-failover
Action: Open the runbook and follow its steps
----------------------------------------
ALERT: DiskSpaceLow (SEV3)
Runbook: https://runbooks.dodatech.com/disk-cleanup
Action: Open the runbook and follow its steps
----------------------------------------
ALERT: CertExpiring (SEV3)
Runbook: https://runbooks.dodatech.com/cert-renewal
Action: Open the runbook and follow its steps
```
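
With Prometheus, a common convention is to store the link in an alert annotation, often named `runbook_url`, which Alertmanager forwards in its webhook payload. The sketch below assumes that annotation name and a trimmed-down payload. The `QueueBacklog` alert and the function `format_notification` are hypothetical, included only to show an alert arriving without a link.

```python
import json

# Trimmed example of an Alertmanager webhook body. Only the keys used below
# are shown; the alerts and the runbook_url annotation are illustrative.
payload = json.loads("""
{
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "DiskSpaceLow", "severity": "SEV3"},
      "annotations": {"runbook_url": "https://runbooks.dodatech.com/disk-cleanup"}
    },
    {
      "status": "firing",
      "labels": {"alertname": "QueueBacklog", "severity": "SEV2"},
      "annotations": {}
    }
  ]
}
""")


def format_notification(alert):
    """Build a one-line message; flag alerts that carry no runbook link."""
    labels = alert.get("labels", {})
    name = labels.get("alertname", "unknown")
    severity = labels.get("severity", "unknown")
    url = alert.get("annotations", {}).get("runbook_url")
    if url is None:
        return f"{name} ({severity}): NO RUNBOOK LINKED"
    return f"{name} ({severity}): {url}"


messages = [format_notification(a) for a in payload["alerts"]]
for m in messages:
    print(m)
```

Flagging the missing link loudly, rather than silently omitting it, makes gaps in runbook coverage visible the first time the alert fires.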
## Runbook Maintenance
Runbooks decay. Services change, commands change, and team members move on. A stale runbook is worse than no runbook because it gives false confidence.
### Version Control for Runbooks
Store runbooks in a version-controlled repository alongside your application code. This provides history, review, and attribution for every runbook change. When a procedure is updated, the team can see what changed and why.
```python
import time


class RunbookRepository:
    def __init__(self):
        self.runbooks = {}   # title -> latest runbook object
        self.versions = {}   # title -> list of version records

    def save(self, runbook, author):
        # Use the current Unix time as a simple, increasing version number.
        version = int(time.time())
        self.runbooks[runbook.title] = runbook
        self.versions.setdefault(runbook.title, []).append({
            "version": version,
            "author": author,
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
        })
        print(f"Saved '{runbook.title}' (v{version}) by {author}")

    def history(self, title):
        if title in self.versions:
            print(f"Version history for '{title}':")
            for v in self.versions[title]:
                print(f" v{v['version']} by {v['author']} on {v['timestamp']}")
        else:
            print(f"No versions found for '{title}'")


# rb is the Runbook object created in the earlier example.
repo = RunbookRepository()
repo.save(rb, "alice")
repo.history("Database Failover")
```
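Example output (the version number and timestamp depend on when the code runs; the values below assume a clock set to UTC):
```text
Saved 'Database Failover' (v1719123456) by alice
Version history for 'Database Failover':
 v1719123456 by alice on 2024-06-23 06:17:36
```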
### Runbook Review Process
Establish a regular review cadence. Each runbook should have an owner who reviews it quarterly for accuracy. The review process includes:
- Verify all command syntax is current
- Test each step in a staging environment
- Update screenshots and links
- Remove obsolete steps
- Add steps for new failure modes discovered since the last review
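
A quarterly cadence is easier to keep when something reports which runbooks are overdue. The sketch below assumes each runbook's last review date is tracked somewhere; here it is a plain dictionary. The 90-day interval, the sample dates, and the name `overdue_runbooks` are assumptions for illustration.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # roughly one quarter; an assumption


def overdue_runbooks(last_reviewed, today):
    """Return titles whose last review is older than REVIEW_INTERVAL, sorted."""
    return sorted(
        title
        for title, reviewed_on in last_reviewed.items()
        if today - reviewed_on > REVIEW_INTERVAL
    )


# Hypothetical review log: runbook title -> date of last review.
review_log = {
    "Database Failover": date(2025, 1, 10),
    "Cache Warming": date(2025, 5, 2),
    "Certificate Renewal": date(2024, 11, 20),
}

print(overdue_runbooks(review_log, today=date(2025, 6, 1)))
# ['Certificate Renewal', 'Database Failover']
```

Passing `today` explicitly keeps the function deterministic and easy to test. A scheduled job would pass `date.today()`.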
### From Runbooks to Automation
The ultimate goal of a runbook is to automate itself. Each time a runbook step is performed manually, ask: "Could this step be automated?" Over time, more steps are automated, and the runbook shrinks to cover only the steps that require human judgment.
| Runbook State | Description | Example |
|---|---|---|
| Manual | All steps done by human | SSH into server, check logs, restart |
| Semi-automated | Some steps automated | Script detects issue, human approves fix |
| Fully automated | No human needed | Auto-scaling handles high CPU without alert |
| Self-healing | System detects and fixes proactively | Database auto-failover before users notice |
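
The semi-automated state in the table above, where a script detects the issue and a human approves the fix, can be sketched as an approval gate. This is a minimal illustration under stated assumptions: `run_semi_automated` and its three callables are hypothetical, and the lambdas stand in for a real health check, remediation, and on-call prompt.

```python
def run_semi_automated(detect, fix, approve):
    """Run fix() only if detect() finds a problem and approve() agrees.

    All three arguments are callables supplied by the caller:
    detect() -> str or None, approve(problem) -> bool, fix() -> None.
    """
    problem = detect()
    if problem is None:
        return "healthy"
    if not approve(problem):
        return "escalated"
    fix()
    return "fixed"


# Stand-ins for a real health check and remediation.
actions = []
result = run_semi_automated(
    detect=lambda: "disk usage at 95%",
    fix=lambda: actions.append("old logs removed"),
    approve=lambda problem: True,  # in practice, prompt the on-call engineer
)
print(result, actions)  # fixed ['old logs removed']
```

Replacing `approve` with a function that always returns `True` is one way to move the same step from semi-automated to fully automated once the team trusts the fix.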
### Runbook Template
Standardize on a template so every runbook has the same structure, making it easy for on-call engineers to find what they need quickly.
```markdown
# [Runbook Title]
**Alert**: [Which alert triggers this]
**Severity**: [SEV1/SEV2/SEV3]
**Owner**: [Team or individual]
## Prerequisites
- Access to [system]
- VPN connected
- [Other prerequisites]
## Steps
1. [Action]. Expected: [Outcome]
2. [Action]. Expected: [Outcome]
## Verification
[How to confirm the procedure worked]
## Rollback
[How to undo the procedure if needed]
## Notes
[Any additional context or tips]
```
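
One way to keep every runbook on the template is to generate the Markdown from structured data instead of writing it by hand. The sketch below assumes steps are given as (action, expected outcome) pairs. `render_runbook`, the "Platform team" owner, and the disk cleanup procedure are hypothetical examples, and the punctuation between an action and its expected outcome is a choice of this sketch.

```python
def render_runbook(title, alert, severity, owner, prerequisites, steps,
                   verification, rollback, notes=""):
    """Render a runbook as Markdown using the section order of the template."""
    lines = [f"# {title}",
             f"**Alert**: {alert}",
             f"**Severity**: {severity}",
             f"**Owner**: {owner}",
             "## Prerequisites"]
    lines += [f"- {item}" for item in prerequisites]
    lines.append("## Steps")
    # steps is a list of (action, expected outcome) pairs, numbered from 1.
    lines += [f"{n}. {action}. Expected: {expected}"
              for n, (action, expected) in enumerate(steps, start=1)]
    lines += ["## Verification", verification,
              "## Rollback", rollback,
              "## Notes", notes]
    return "\n".join(lines)


doc = render_runbook(
    title="Disk Cleanup",
    alert="DiskSpaceLow",
    severity="SEV3",
    owner="Platform team",
    prerequisites=["SSH access to the affected host", "VPN connected"],
    steps=[("Check disk usage with `df -h`", "Usage per filesystem is listed"),
           ("Remove rotated logs older than 7 days", "Usage drops below 80%")],
    verification="DiskSpaceLow alert resolves",
    rollback="Restore removed logs from backup if they are needed",
)
print(doc)
```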
### Maintenance Activities
| Maintenance Activity | Frequency | Owner |
|---|---|---|
| Review for accuracy | Quarterly | Service owner |
| Test in staging | Quarterly | On-call team |
| Update after incident | Within 48 hours | Incident responder |
| Review for readability | Annually | Tech writer |
### Automated Runbook Testing
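```python
def simulate_step(step):
    # Placeholder: always succeeds. Replace with a real check that runs
    # the step against a staging environment.
    return True


def test_runbook(runbook):
    print(f"Testing: {runbook.title}")
    failures = []
    for step in runbook.steps:
        success = simulate_step(step)
        if not success:
            failures.append(step["step"])
    if failures:
        print(f"FAILED steps: {failures}")
    else:
        print("All steps passed")


# rb is the Runbook object created in the earlier example.
test_runbook(rb)
```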
Expected output:
```text
Testing: Database Failover
All steps passed
```
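
One concrete check that could go beyond a placeholder is verifying that the commands a runbook mentions are installed at all. The sketch below assumes commands are quoted in backticks inside each step's action, as in the earlier example. `commands_in` and `missing_commands` are hypothetical helpers, and the lookup function is injectable so the example stays deterministic.

```python
import re
import shutil


def commands_in(action):
    """Extract the executable name from each backtick-quoted command."""
    found = re.findall(r"`([^`]+)`", action)
    return [cmd.split()[0] for cmd in found if cmd.split()]


def missing_commands(steps, which=shutil.which):
    """Map step numbers to executables that `which` cannot find on PATH."""
    report = {}
    for step in steps:
        absent = [c for c in commands_in(step["action"]) if which(c) is None]
        if absent:
            report[step["step"]] = absent
    return report


steps = [
    {"step": 2, "action": "Run `pg_isready -h replica`"},
    {"step": 3, "action": "Promote replica: `pg_ctl promote`"},
]

# A fake lookup keeps the example deterministic: pretend only pg_isready exists.
fake_which = lambda name: "/usr/bin/pg_isready" if name == "pg_isready" else None
print(missing_commands(steps, which=fake_which))  # {3: ['pg_ctl']}
```

With the default `shutil.which`, the check looks at the machine running the test, so it is only meaningful when run on the host where the steps are executed, such as the bastion.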
## Common Errors
| Error | Explanation |
|---|---|
| Runbook too long | More than 15 steps is overwhelming. Break into sub-runbooks. |
| Missing rollback steps | Every procedure must say how to undo if something goes wrong. |
| No prerequisites listed | The on-call engineer may not have VPN access or the right permissions. List everything needed. |
| Stale runbooks | Runbooks that reference old infrastructure or outdated commands are dangerous. Review quarterly. |
| Assuming context | Do not assume the reader knows why a step exists. Explain briefly but clearly. |
| Runbook not linked to alert | If the engineer has to search for the runbook, the runbook is useless. Link directly. |
## Practice Questions
- What are the essential fields in a runbook?
- Why should every alert have a runbook?
- How often should runbooks be tested?
- What is the danger of a stale runbook?
- Why should each step include an expected outcome?
## Challenge
Write a complete runbook for the DodaZIP cloud storage service. The runbook should cover the procedure for handling a "StorageNodeUnreachable" alert. Include 8-12 steps, prerequisites, verification, and rollback. Test your runbook by mentally walking through each step and verifying the expected outcomes make sense.