PagerDuty — Incident Management & On-Call Scheduling Guide

DodaTech. Updated 2026-06-24.

This tutorial covers PagerDuty service and integration setup, on-call schedules, escalation policies, Prometheus Alertmanager and Grafana integration, incident response automation, and runbooks.

PagerDuty is an incident management platform that integrates with monitoring tools to automatically alert the right person at the right time, using on-call schedules, escalation policies, and automated response actions.

Why It Matters

When a critical alert fires in the middle of the night, manually finding the right on-call engineer wastes precious minutes. PagerDuty automatically routes alerts based on schedules and escalates if the first responder does not acknowledge. DodaTech reduced mean time to acknowledge (MTTA) from 12 minutes to under 2 minutes after implementing PagerDuty with layered escalation policies and automated incident response.
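
MTTA is the average gap between an incident being triggered and its first acknowledgement. The sketch below shows the arithmetic in Python; the function name and the sample timestamps are hypothetical and are not DodaTech's real incident data.

```python
from datetime import datetime
from statistics import mean


def mean_time_to_acknowledge(incidents):
    """Return MTTA in minutes for (triggered_at, acknowledged_at) ISO 8601 pairs."""
    delays = [
        (datetime.fromisoformat(acked) - datetime.fromisoformat(triggered)).total_seconds() / 60
        for triggered, acked in incidents
    ]
    return mean(delays)


# Hypothetical sample data for illustration only.
sample = [
    ("2026-06-01T02:00:00+00:00", "2026-06-01T02:01:30+00:00"),  # 1.5 min
    ("2026-06-03T14:10:00+00:00", "2026-06-03T14:12:00+00:00"),  # 2.0 min
    ("2026-06-07T23:45:00+00:00", "2026-06-07T23:47:30+00:00"),  # 2.5 min
]
print(f"MTTA: {mean_time_to_acknowledge(sample):.1f} minutes")  # MTTA: 2.0 minutes
```

Tracking this number before and after a change to schedules or escalation delays is one way to check whether the change helped.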

Real-World Use

DodaZIP's Prometheus Alertmanager fires a critical alert when the API error rate exceeds 5%. Alertmanager routes the alert to PagerDuty, which pages the primary on-call engineer via mobile push, SMS, and phone call. If unacknowledged after 5 minutes, it escalates to the secondary on-call and then to the engineering manager.

flowchart TD
    A[Monitoring Alert] --> B[PagerDuty Integration]
    B --> C[Service: DodaZIP API]
    C --> D[Escalation Policy]
    D --> E[Level 1: Primary On-Call]
    E --> F{Acknowledged?}
    F -->|YES within 5m| G[Incident in Progress]
    F -->|NO| H[Level 2: Secondary On-Call]
    H --> I{Acknowledged?}
    I -->|YES| G
    I -->|NO| J[Level 3: Engineering Manager]
    J --> K{Acknowledged?}
    K -->|NO| L[Stakeholder Notification]
    G --> M[Automation: Slack Channel]
    G --> N[Automation: Status Page]
    G --> O[Automation: Runbook Link]
    style B fill:#06AC38,color:#fff

Prerequisites: A PagerDuty account with a REST API token (used as YOUR_API_TOKEN below) and an existing monitoring setup (Prometheus, Grafana).

Service and Integration Setup

# PagerDuty API — Create a service
curl -X POST https://api.pagerduty.com/services \
  -H "Authorization: Token token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "service": {
      "name": "DodaZIP API",
      "description": "Production API service for DodaZIP",
      "escalation_policy": {
        "id": "ESC123",
        "type": "escalation_policy_reference"
      },
      "alert_creation": "create_alerts_and_incidents",
      "auto_resolve_timeout": 14400
    }
  }'

# Expected output:
# {"service":{"id":"SVC456","name":"DodaZIP API","status":"active"}}

# Create an integration for a service
curl -X POST https://api.pagerduty.com/services/SVC456/integrations \
  -H "Authorization: Token token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "integration": {
      "type": "events_api_v2_inbound_integration",
      "name": "Prometheus Alertmanager",
      "service": {
        "id": "SVC456",
        "type": "service_reference"
      }
    }
  }'

# Expected output:
# {"integration":{"id":"INT789","name":"Prometheus Alertmanager","integration_key":"abc123def456"}}
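
Once an integration key exists, a test event can confirm that the service pages someone. The following is a minimal sketch that assumes the key belongs to an Events API v2 integration. The helper names, summary text, and dedup key are hypothetical, and the key is the placeholder from the example output above. The network call is left commented out so the script only prints the payload.

```python
import json
import urllib.request

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source, severity="critical", dedup_key=None):
    """Build an Events API v2 trigger event as a plain dict."""
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are treated as the same alert.
        event["dedup_key"] = dedup_key
    return event


def send_event(event):
    """POST the event to PagerDuty; only call this with a real integration key."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())


event = build_trigger_event(
    routing_key="abc123def456",  # placeholder, replace with your integration key
    summary="Test alert: DodaZIP API error rate above 5%",
    source="prometheus.dodatech.com",
    dedup_key="dodazip-api-error-rate-test",
)
print(json.dumps(event, indent=2))
# send_event(event)  # uncomment once routing_key holds a real integration key
```

Sending a second event with the same dedup_key and "event_action": "resolve" should close the test alert again.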

On-Call Schedule Configuration

# Create a schedule
curl -X POST https://api.pagerduty.com/schedules \
  -H "Authorization: Token token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "schedule": {
      "name": "DodaTech Primary On-Call",
      "time_zone": "America/New_York",
      "description": "Primary on-call rotation for platform team",
      "schedule_layers": [
        {
          "name": "Weekly Rotation",
          "start": "2026-06-24T00:00:00-04:00",
          "rotation_virtual_start": "2026-06-24T00:00:00-04:00",
          "rotation_turn_length_seconds": 604800,
          "users": [
            {
              "user": {
                "id": "USER1",
                "type": "user_reference"
              }
            },
            {
              "user": {
                "id": "USER2",
                "type": "user_reference"
              }
            },
            {
              "user": {
                "id": "USER3",
                "type": "user_reference"
              }
            }
          ],
          "restrictions": [
            {
              "type": "weekly_restriction",
              "start_day_of_week": 6,
              "start_time_of_day": "08:00:00",
              "duration_in_seconds": 172800
            }
          ]
        }
      ]
    }
  }'
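
To see how the rotation fields interact, the sketch below works out whose turn it is at a given moment from the virtual start and the turn length. It is a simplified model that ignores restrictions, overrides, and additional layers, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def on_call_user(users, rotation_virtual_start, turn_length_seconds, at):
    """Return the user whose turn covers `at`, ignoring restrictions and overrides."""
    elapsed = (at - rotation_virtual_start).total_seconds()
    if elapsed < 0:
        raise ValueError("time is before the rotation starts")
    turn_index = int(elapsed // turn_length_seconds)
    # The rotation wraps around once every user has had a turn.
    return users[turn_index % len(users)]


users = ["USER1", "USER2", "USER3"]
start = datetime.fromisoformat("2026-06-24T00:00:00-04:00")
week = 604800  # rotation_turn_length_seconds from the schedule above

for weeks_later in range(4):
    moment = start + timedelta(weeks=weeks_later, days=1)
    print(moment.date(), on_call_user(users, start, week, moment))
```

With three users and weekly turns, each engineer is on call one week in three, and USER1 comes back around in the fourth week.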

Escalation Policy

# Create escalation policy
curl -X POST https://api.pagerduty.com/escalation_policies \
  -H "Authorization: Token token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "escalation_policy": {
      "name": "DodaTech Critical Service Policy",
      "description": "Escalation policy for production-critical services",
      "escalation_rules": [
        {
          "escalation_delay_in_minutes": 5,
          "targets": [
            {
              "id": "SCHEDULE1",
              "type": "schedule_reference"
            }
          ]
        },
        {
          "escalation_delay_in_minutes": 10,
          "targets": [
            {
              "id": "SCHEDULE2",
              "type": "schedule_reference"
            }
          ]
        },
        {
          "escalation_delay_in_minutes": 15,
          "targets": [
            {
              "id": "USER_MANAGER",
              "type": "user_reference"
            }
          ]
        }
      ],
      "num_loops": 2
    }
  }'
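
The delays and num_loops in this policy determine when each level is paged if nobody acknowledges. The sketch below prints that timeline under two assumptions: each rule's delay is the wait before moving on from that rule, and num_loops counts repeats after the first pass. The function name is hypothetical.

```python
def escalation_timeline(delays_minutes, num_loops):
    """Return (minute, level) pairs for an incident nobody acknowledges.

    Assumes each rule's delay is the wait before leaving that rule, and
    that num_loops counts repeats after the first pass through the rules.
    """
    timeline = []
    minute = 0
    for _ in range(num_loops + 1):
        for level, delay in enumerate(delays_minutes, start=1):
            timeline.append((minute, level))
            minute += delay
    return timeline


for minute, level in escalation_timeline([5, 10, 15], num_loops=2):
    print(f"t+{minute:>2} min: notify level {level}")
```

Under these assumptions the secondary is paged 5 minutes after the trigger and the manager 15 minutes after it, and the whole cycle restarts at the 30-minute mark.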

Prometheus Alertmanager Integration

# alertmanager.yml — PagerDuty receiver
route:
  receiver: pagerduty-critical
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
    - match:
        severity: warning
      receiver: slack-warnings

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: abc123def456
        severity: critical
        description: '{{ .GroupLabels.alertname }} - {{ .GroupLabels.service }}'
        client: Prometheus
        client_url: 'https://prometheus.dodatech.com'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          summary: '{{ .CommonAnnotations.summary }}'
  # Referenced by the warning route above; add slack_configs here
  - name: slack-warnings
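
The routing tree above can be read as "first matching child route wins, otherwise use the top-level receiver". The sketch below mirrors that logic in Python to make one consequence visible. It is a simplified model with a hypothetical function name, not Alertmanager's actual matcher, and it ignores options such as continue.

```python
def pick_receiver(labels):
    """Return the receiver name for an alert's labels; first matching route wins."""
    routes = [
        ({"severity": "critical"}, "pagerduty-critical"),
        ({"severity": "warning"}, "slack-warnings"),
    ]
    for match, receiver in routes:
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    # No child route matched, so the top-level receiver handles the alert.
    return "pagerduty-critical"


for labels in ({"severity": "critical"}, {"severity": "warning"}, {"severity": "info"}, {}):
    print(labels, "->", pick_receiver(labels))
```

Alerts with an unexpected or missing severity label fall through to the default receiver, which here is PagerDuty. If that is not intended, point the top-level receiver at a quieter destination.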

Grafana Integration

# Grafana contact point for PagerDuty
apiVersion: 1

contactPoints:
  - name: PagerDuty Critical
    receivers:
      - uid: pagerduty-critical
        type: pagerduty
        settings:
          integrationKey: abc123def456
          severity: critical
          class: infra
          group: devops

Incident Response Automation

# Outline of a PagerDuty Incident Workflow (action names are illustrative; configure the workflow in the UI or via the API)
name: DodaTech Incident Response
steps:
  - name: Create Slack Channel
    action: slack_create_channel
    params:
      channel_name: "inc-{{incident.id}}"
      team_id: T00XXXX
  - name: Post Incident Details
    action: slack_send_message
    params:
      channel: "inc-{{incident.id}}"
      message: |
        :rotating_light: *Incident {{incident.id}}*
        *Title:* {{incident.title}}
        *Service:* {{incident.service.name}}
        *Urgency:* {{incident.urgency}}
        *Assigned to:* {{incident.assigned_to}}
  - name: Update Status Page
    action: statuspage_update
    params:
      component: api
      status: degraded_performance
  - name: Run Diagnostic Playbook
    action: webhook
    params:
      url: https://automation.dodatech.com/runbook/diagnostics
      method: POST
      body: |
        {
          "incident_id": "{{incident.id}}",
          "service": "{{incident.service.name}}"
        }
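
The {{incident.*}} placeholders in the workflow are filled in from the incident when a step runs. The sketch below shows the substitution idea with a small renderer. The function name and the sample incident values are hypothetical, and this is not PagerDuty's own templating engine.

```python
import re


def render(template, context):
    """Replace {{dotted.path}} placeholders with values from a nested dict."""
    def lookup(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError if the field is missing
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


# Hypothetical incident data for illustration only.
context = {
    "incident": {
        "id": "Q1ABC",
        "title": "API error rate above 5%",
        "service": {"name": "DodaZIP API"},
        "urgency": "high",
    }
}

# Slack channel names must be lowercase, so normalize after rendering.
channel = render("inc-{{incident.id}}", context).lower()
print(channel)
print(render("*Service:* {{incident.service.name}} (*Urgency:* {{incident.urgency}})", context))
```

Failing loudly on a missing field, as the KeyError does here, is usually safer than posting a half-filled message into an incident channel.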

Runbooks

# Runbook: API High Error Rate

## 1. Acknowledge the incident in PagerDuty
## 2. Join the incident Slack channel (#inc-{id})
## 3. Check Grafana dashboard: DodaTech API Overview
## 4. Identify error pattern:
   - Are errors from a specific endpoint?
   - Are errors from a specific deployment?
   - Are errors correlated with database latency?
## 5. Check recent deployments:
   kubectl rollout status deployment/user-service -n production
## 6. If recent deployment caused issues:
   kubectl rollout undo deployment/user-service -n production
## 7. Check database query performance:
   - Open Jaeger for slow traces
   - Look for slow database spans
## 8. If database is slow:
   - Check pg_stat_activity for long-running queries
   - Check read replica lag
## 9. Resolve incident in PagerDuty
## 10. Post-mortem within 48 hours

Common Configuration Mistakes

Not using secondary notification methods: If the primary on-call has their phone on silent, mobile push notifications are missed. Configure SMS and phone call as secondary methods for critical alerts.
Single-person schedules without rotation: A single person on-call 24/7 causes burnout. Use weekly or bi-weekly rotations with at least 3 engineers per schedule.
Escalation delays that are too long: Waiting 30 minutes before escalation wastes incident response time. Use a 5-minute delay at the first level and 10 to 15 minutes at later levels.
Not silencing alerts during maintenance: Maintenance windows trigger false incidents. Configure PagerDuty maintenance windows when performing planned changes.
Ignoring incident response workflows: Manual incident response (creating Slack channels, updating status pages) wastes time. Automate these steps with PagerDuty Incident Workflows.
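
Two of the mistakes above, thin rotations and slow escalation, are easy to check mechanically. The sketch below is a small lint function that applies this section's guidelines (at least 3 engineers, no delay above 15 minutes) to a proposed setup. The function name and thresholds-as-parameters are hypothetical choices, not a PagerDuty feature.

```python
def lint_on_call_setup(schedule_users, escalation_delays_minutes,
                       min_users=3, max_delay_minutes=15):
    """Return a list of warnings based on the guidelines in this section."""
    warnings = []
    if len(schedule_users) < min_users:
        warnings.append(
            f"schedule has {len(schedule_users)} user(s); use at least {min_users}"
        )
    for level, delay in enumerate(escalation_delays_minutes, start=1):
        if delay > max_delay_minutes:
            warnings.append(
                f"level {level} waits {delay} min; keep delays at or below {max_delay_minutes}"
            )
    return warnings


print(lint_on_call_setup(["USER1", "USER2", "USER3"], [5, 10, 15]))  # []
print(lint_on_call_setup(["USER1"], [30, 10]))
```

A check like this could run in CI against exported schedule and policy definitions so that regressions are caught before the next incident.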

Practice Questions

What is the difference between an alert and an incident in PagerDuty? Answer: An alert is a raw notification from a monitoring tool. PagerDuty groups related alerts into incidents. An incident represents a problem requiring human response.
How do escalation policies work? Answer: Escalation policies define levels of responders. If the first level does not acknowledge within a set time, the incident escalates to the next level, and so on.
What is an on-call schedule? Answer: A schedule defines which user is on call at any given time, using daily or weekly rotations organized into layers, optional time restrictions, and overrides.
How does PagerDuty integrate with Prometheus? Answer: Prometheus Alertmanager sends events to PagerDuty through a pagerduty_configs receiver, identified by the integration key set as routing_key. PagerDuty opens an incident on the service that owns the key, and the configured severity is passed along with the event.

Challenge

Set up complete PagerDuty incident management: create a service for a critical application, configure a weekly on-call rotation with 3 engineers, create an escalation policy (5min primary, 10min secondary, 15min manager), integrate with Prometheus Alertmanager for critical alerts, configure Grafana contact points for PagerDuty, create an Incident Workflow that creates a Slack channel and posts incident details, write a runbook for common incident types, and test the entire flow by triggering a test alert.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
