Runbooks — Documenting Operational Procedures

DodaTech Updated 2026-06-23 8 min read

In this tutorial, you'll learn about runbooks: the key concepts, practical examples, and best practices.

A runbook is a documented procedure that explains how to perform a specific operational task — from restarting a service to responding to a SEV1 incident — enabling consistent execution regardless of who is on call.

What You'll Learn

In this tutorial, you will learn the anatomy of an effective runbook, how to write clear step-by-step procedures, how to integrate runbooks with alerting systems, and how to keep runbooks up to date through regular reviews and testing.

Why It Matters

When an alert fires at 3 AM, the on-call engineer needs clear instructions, not guesswork. Without runbooks, every incident is a first-time experience: response times are slower, mistakes are more common, and knowledge stays locked in individual engineers' heads. Runbooks turn tribal knowledge into organizational capability.

Real-World Use

DodaTech maintains a runbook repository for every service. When Prometheus fires an alert, the alert payload includes a direct link to the relevant runbook. Doda Browser sync has runbooks for database failover, cache warming, deployment rollback, and certificate renewal. Each runbook is tested quarterly in a staging environment.

graph TD
    A[Alert Fires] --> B[On-Call Gets Alert]
    B --> C[Opens Runbook]
    C --> D{Runbook Has Steps?}
    D -->|Yes| E[Follow Procedure]
    D -->|No| F[Debug Manually]
    F --> G[Write Runbook After]
    E --> H[Resolve Incident]
    H --> I[Update Runbook]

Prerequisites

Understanding Incident Response is essential since runbooks are primarily used during incidents. Familiarity with Monitoring and Alerting for SRE helps you connect alerts to the right runbooks.

Runbook Structure

Every runbook should follow a consistent template.

| Field | Description | Example |
| --- | --- | --- |
| Title | Clear task name | "Database Failover Runbook" |
| Alert | Which alert triggers this | "PostgreSQLReplicationLag > 30s" |
| Severity | Expected severity level | SEV1 or SEV2 |
| Prerequisites | Tools and access needed | "Access to RDS console, VPN connected" |
| Steps | Numbered instructions | See below |
| Verification | How to confirm success | "Check replication lag < 5s" |
| Rollback | How to undo if needed | "Reverse the failover procedure" |
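
One way to enforce the template is to model it as a small data structure and check that no field is left blank. The sketch below is illustrative only: `RunbookSpec` and `missing_fields` are hypothetical names chosen for this example, not part of any DodaTech tooling.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RunbookSpec:
    """Hypothetical container mirroring the template fields above."""
    title: str = ""
    alert: str = ""
    severity: str = ""
    prerequisites: list = field(default_factory=list)
    steps: list = field(default_factory=list)
    verification: str = ""
    rollback: str = ""

    def missing_fields(self):
        """Return the names of fields that are empty or blank."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str):
                value = value.strip()
            if not value:
                missing.append(f.name)
        return missing


# A draft that has not yet filled in prerequisites, verification, or rollback.
draft = RunbookSpec(
    title="Database Failover Runbook",
    alert="PostgreSQLReplicationLag > 30s",
    severity="SEV1",
    steps=["Promote replica"],
)
print(draft.missing_fields())  # ['prerequisites', 'verification', 'rollback']
```

A check like this could run during review so that an incomplete runbook is caught before anyone relies on it during an incident.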

Writing Clear Steps

Each step should be a single action with a clear expected outcome.

class Runbook:
    def __init__(self, title, alert, severity):
        self.title = title
        self.alert = alert
        self.severity = severity
        self.steps = []

    def add_step(self, number, action, expected):
        self.steps.append({
            "step": number,
            "action": action,
            "expected": expected
        })

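    def execute(self, interactive=False):
        # Print each step. When interactive is True, ask the operator to
        # confirm the step; the default auto-confirms so the example runs
        # unattended.
        print(f"Runbook: {self.title}")
        print(f"Alert: {self.alert}")
        print(f"Severity: {self.severity}")
        print("-" * 50)
        for s in self.steps:
            print(f"Step {s['step']}: {s['action']}")
            print(f"  Expected: {s['expected']}")
            confirm = input("  Done? (y/n): ") if interactive else "y"
            if confirm.strip().lower() != "y":
                print("  WARNING: Step not confirmed. Escalate.")
                break  # do not continue past an unconfirmed step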

rb = Runbook("Database Failover", "PostgreSQLReplicationLag > 30s", "SEV1")
rb.add_step(1, "SSH into bastion host", "Connected to bastion")
rb.add_step(2, "Run `pg_isready -h replica`", "Replica is accepting connections")
rb.add_step(3, "Promote replica: `pg_ctl promote`", "Replica becomes primary")
rb.add_step(4, "Update DNS to point to new primary", "DNS TTL propagated")
rb.add_step(5, "Verify application connectivity", "Application healthy")
rb.execute()
print("Runbook defined successfully")

Expected output:

Runbook: Database Failover
Alert: PostgreSQLReplicationLag > 30s
Severity: SEV1
--------------------------------------------------
Step 1: SSH into bastion host
  Expected: Connected to bastion
Step 2: Run `pg_isready -h replica`
  Expected: Replica is accepting connections
Step 3: Promote replica: `pg_ctl promote`
  Expected: Replica becomes primary
Step 4: Update DNS to point to new primary
  Expected: DNS TTL propagated
Step 5: Verify application connectivity
  Expected: Application healthy
Runbook defined successfully

Integrating Runbooks with Alerting

Every alert should have a runbook. The alert notification should include a direct link.

class AlertWithRunbook:
    def __init__(self, alert_name, severity, runbook_url):
        self.alert = alert_name
        self.severity = severity
        self.runbook_url = runbook_url

    def notify(self):
        print(f"ALERT: {self.alert} ({self.severity})")
        print(f"Runbook: {self.runbook_url}")
        print("Action: Open runbook and follow steps 1-5")

alerts = [
    AlertWithRunbook("PostgreSQLReplicationLag", "SEV1",
                     "https://runbooks.dodatech.com/database-failover"),
    AlertWithRunbook("DiskSpaceLow", "SEV3",
                     "https://runbooks.dodatech.com/disk-cleanup"),
    AlertWithRunbook("CertExpiring", "SEV3",
                     "https://runbooks.dodatech.com/cert-renewal"),
]

for a in alerts:
    print("-" * 40)
    a.notify()

Expected output:

----------------------------------------
ALERT: PostgreSQLReplicationLag (SEV1)
Runbook: https://runbooks.dodatech.com/database-failover
Action: Open runbook and follow steps 1-5
----------------------------------------
ALERT: DiskSpaceLow (SEV3)
Runbook: https://runbooks.dodatech.com/disk-cleanup
Action: Open runbook and follow steps 1-5
----------------------------------------
ALERT: CertExpiring (SEV3)
Runbook: https://runbooks.dodatech.com/cert-renewal
Action: Open runbook and follow steps 1-5
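
To enforce the rule that every alert has a runbook, a small check can flag alert definitions that lack a link. This is a sketch under assumptions: the alert definitions are plain dictionaries, and the `runbook_url` key, the `find_unlinked_alerts` helper, and the `QueueBacklog` and `MemoryPressure` alert names are all invented for illustration.

```python
def find_unlinked_alerts(alert_defs):
    """Return names of alerts whose runbook link is missing or blank."""
    unlinked = []
    for alert in alert_defs:
        url = (alert.get("runbook_url") or "").strip()
        if not url:
            unlinked.append(alert["name"])
    return unlinked


# Hypothetical alert definitions; only the first one has a runbook link.
alert_defs = [
    {"name": "DiskSpaceLow",
     "runbook_url": "https://runbooks.dodatech.com/disk-cleanup"},
    {"name": "QueueBacklog", "runbook_url": ""},
    {"name": "MemoryPressure"},
]

print(find_unlinked_alerts(alert_defs))  # ['QueueBacklog', 'MemoryPressure']
```

Running a check of this kind in code review or CI would stop a new alert from shipping without a runbook attached.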


Runbook Maintenance

Runbooks decay. Services change, commands change, and team members move on. A stale runbook is worse than no runbook because it gives false confidence.

Version Control for Runbooks

Store runbooks in a version-controlled repository alongside your application code. This provides history, review, and attribution for every runbook change. When a procedure is updated, the team can see what changed and why.

```python
import time


class RunbookRepository:
    """In-memory stand-in for a version-controlled runbook store."""

    def __init__(self):
        self.runbooks = {}   # title -> latest saved runbook
        self.versions = {}   # title -> list of version records

    def save(self, runbook, author):
        # Read the clock once so the version and timestamp agree.
        now = time.time()
        version = int(now)
        self.runbooks[runbook.title] = runbook
        self.versions.setdefault(runbook.title, []).append({
            "version": version,
            "author": author,
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now)),
        })
        print(f"Saved '{runbook.title}' (v{version}) by {author}")

    def history(self, title):
        if title in self.versions:
            print(f"Version history for '{title}':")
            for v in self.versions[title]:
                print(f"  v{v['version']} by {v['author']} on {v['timestamp']}")
        else:
            print(f"No versions found for '{title}'")


# rb is the Database Failover runbook defined earlier.
repo = RunbookRepository()
repo.save(rb, "alice")
repo.history("Database Failover")
```

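Expected output (the version number and timestamp come from the clock at run time, so yours will differ; this sample assumes a run at 14:00:00 on 2026-06-23 with the system clock set to UTC):

Saved 'Database Failover' (v1782223200) by alice
Version history for 'Database Failover':
  v1782223200 by alice on 2026-06-23 14:00:00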

Runbook Review Process

Establish a regular review cadence. Each runbook should have an owner who reviews it quarterly for accuracy. The review process includes:

  • Verify all command syntax is current
  • Test each step in a staging environment
  • Update screenshots and links
  • Remove obsolete steps
  • Add steps for new failure modes discovered since the last review
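
A quarterly cadence is easier to keep when overdue runbooks are listed automatically. The following is a minimal sketch, assuming a review log that maps each runbook title to the date of its last review and treating "quarterly" as 90 days; `overdue_runbooks`, `REVIEW_INTERVAL`, and the dates are hypothetical.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # assumed meaning of "quarterly"


def overdue_runbooks(last_reviewed, today):
    """Return titles not reviewed within REVIEW_INTERVAL, oldest review first."""
    overdue = [
        (reviewed_on, title)
        for title, reviewed_on in last_reviewed.items()
        if today - reviewed_on > REVIEW_INTERVAL
    ]
    return [title for _, title in sorted(overdue)]


# Hypothetical review log: runbook title -> date of last review.
last_reviewed = {
    "Database Failover": date(2026, 1, 10),
    "Cache Warming": date(2026, 5, 30),
    "Certificate Renewal": date(2025, 11, 2),
}

print(overdue_runbooks(last_reviewed, today=date(2026, 6, 23)))
# ['Certificate Renewal', 'Database Failover']
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check deterministic and easy to test.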

From Runbooks to Automation

The long-term goal for any runbook is to automate it away. Each time a step is performed manually, ask: "Could this step be automated?" Over time, more steps are automated, and the runbook shrinks to cover only the steps that require human judgment.

| Runbook State | Description | Example |
| --- | --- | --- |
| Manual | All steps done by human | SSH into server, check logs, restart |
| Semi-automated | Some steps automated | Script detects issue, human approves fix |
| Fully automated | No human needed | Auto-scaling handles high CPU without alert |
| Self-healing | System detects and fixes proactively | Database auto-failover before users notice |
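
The semi-automated state can be sketched as a detect, approve, fix sequence in which the script does the detection and the remediation but a human makes the decision. Everything in this example is a hypothetical stand-in: `run_semi_automated`, the detection function, the fix, and the approval callback would be replaced by real checks and a real prompt.

```python
def run_semi_automated(detect, fix, approve):
    """Run `fix` only if `detect` reports a problem and `approve` agrees.

    Returns "healthy", "declined", or "fixed".
    """
    problem = detect()
    if problem is None:
        return "healthy"
    if not approve(problem):
        return "declined"
    fix(problem)
    return "fixed"


actions = []  # records what the fix did, so the example is observable


def detect_full_disk():
    # Stand-in for a real check; returns a description or None.
    return "disk usage at 95%"


def clean_tmp(problem):
    # Stand-in for a real remediation.
    actions.append(f"cleaned temp files because {problem}")


def always_approve(problem):
    # In practice this would ask the on-call engineer, e.g. via input().
    return True


print(run_semi_automated(detect_full_disk, clean_tmp, always_approve))  # fixed
print(actions)  # ['cleaned temp files because disk usage at 95%']
```

Keeping the approval as a separate callback means the same procedure can move toward full automation later by swapping in a policy function, without rewriting the detection or the fix.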

Runbook Template

Standardize on a template so every runbook has the same structure, making it easy for on-call engineers to find what they need quickly.

# [Runbook Title]

**Alert**: [Which alert triggers this]
**Severity**: [SEV1/SEV2/SEV3]
**Owner**: [Team or individual]

## Prerequisites
- Access to [system]
- VPN connected
- [Other prerequisites]

## Steps
1. [Action] — Expected: [Outcome]
2. [Action] — Expected: [Outcome]

## Verification
[How to confirm the procedure worked]

## Rollback
[How to undo the procedure if needed]

## Notes
[Any additional context or tips]
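
If runbooks are stored as Markdown files that follow the template above, a short script can confirm that the expected sections are present. This is a sketch, assuming sections are written as level-2 headings and that Notes is optional; `missing_sections` and `REQUIRED_SECTIONS` are names invented for this example.

```python
import re

# Assumption: Notes is treated as optional, the rest as required.
REQUIRED_SECTIONS = ["Prerequisites", "Steps", "Verification", "Rollback"]


def missing_sections(markdown_text):
    """Return required level-2 headings that do not appear in the text."""
    found = set(re.findall(r"^##\s+(.+?)\s*$", markdown_text, flags=re.MULTILINE))
    return [name for name in REQUIRED_SECTIONS if name not in found]


# A hypothetical runbook draft that stops after the Steps section.
sample = """# Disk Cleanup

## Prerequisites
- VPN connected

## Steps
1. Run df -h. Expected: usage reported
"""

print(missing_sections(sample))  # ['Verification', 'Rollback']
```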

Maintenance Activities

| Maintenance Activity | Frequency | Owner |
| --- | --- | --- |
| Review for accuracy | Quarterly | Service owner |
| Test in staging | Quarterly | On-call team |
| Update after incident | Within 48 hours | Incident responder |
| Review for readability | Annually | Tech writer |

Automated Runbook Testing

def test_runbook(runbook):
    print(f"Testing: {runbook.title}")
    failures = []
    for step in runbook.steps:
        success = simulate_step(step)
        if not success:
            failures.append(step["step"])
    if failures:
        print(f"FAILED steps: {failures}")
    else:
        print("All steps passed")

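def simulate_step(step):
    # Placeholder that always succeeds. A real test would run the step's
    # action in staging and compare the result with step["expected"].
    return True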

test_runbook(rb)

Expected output:

Testing: Database Failover
All steps passed

Common Errors

| Error | Explanation |
| --- | --- |
| Runbook too long | More than 15 steps is overwhelming. Break into sub-runbooks. |
| Missing rollback steps | Every procedure must say how to undo if something goes wrong. |
| No prerequisites listed | The on-call engineer may not have VPN access or the right permissions. List everything needed. |
| Stale runbooks | Runbooks that reference old infrastructure or outdated commands are dangerous. Review quarterly. |
| Assuming context | Do not assume the reader knows why a step exists. Explain briefly but clearly. |
| Runbook not linked to alert | If the engineer has to search for the runbook, the runbook is useless. Link directly. |

Practice Questions

  1. What are the essential fields in a runbook?
  2. Why should every alert have a runbook?
  3. How often should runbooks be tested?
  4. What is the danger of a stale runbook?
  5. Why should each step include an expected outcome?

Challenge

Write a complete runbook for the DodaZIP cloud storage service. The runbook should cover the procedure for handling a "StorageNodeUnreachable" alert. Include 8-12 steps, prerequisites, verification, and rollback. Test your runbook by mentally walking through each step and verifying the expected outcomes make sense.

FAQ

What is the difference between a runbook and a playbook?

Runbooks are task-specific procedures. Playbooks are collections of runbooks organized by scenario or incident type.

Should runbooks be in markdown or a specialized tool?

Markdown in a version-controlled repository is a common default. Tools like Runbook.com and PagerDuty also support integrated runbooks.

Who is responsible for writing runbooks?

The service owner writes the initial runbook. The on-call team reviews and updates it after each incident.

How long should a runbook be?

Aim for 5 to 15 steps. If a procedure needs more steps, consider splitting it into multiple runbooks or adding automation.

Can a runbook be automated?

Yes. Many SRE teams automate runbooks step by step. A fully automated runbook becomes a self-healing action that does not require human intervention.
