PagerDuty: Incident Management & On-Call Scheduling Guide
This tutorial walks through PagerDuty's key concepts, with practical examples and best practices for services, on-call schedules, escalation policies, monitoring integrations, and automated incident response.
PagerDuty is an incident management platform that integrates with monitoring tools to automatically alert the right person at the right time, using on-call schedules, escalation policies, and automated response actions.
Why It Matters
When a critical alert fires in the middle of the night, manually finding the right on-call engineer wastes precious minutes. PagerDuty routes alerts automatically based on schedules and escalates if the first responder does not acknowledge. DodaTech reduced mean time to acknowledge (MTTA) from 12 minutes to under 2 minutes after implementing PagerDuty with layered escalation policies and automated incident response.
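To make the MTTA figure concrete, here is a minimal Python sketch of how the metric can be computed from incident timestamps. The `Incident` record and the sample times are hypothetical illustrations, not PagerDuty API objects or DodaTech data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record holding the two timestamps MTTA needs."""
    triggered_at: datetime
    acknowledged_at: datetime


def mean_time_to_acknowledge(incidents: list[Incident]) -> timedelta:
    """Average the gap between trigger and first acknowledgement."""
    if not incidents:
        raise ValueError("need at least one acknowledged incident")
    total = sum(
        (i.acknowledged_at - i.triggered_at for i in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Illustrative sample data only.
t0 = datetime(2026, 1, 1, 3, 0)
sample = [
    Incident(t0, t0 + timedelta(minutes=1)),
    Incident(t0, t0 + timedelta(minutes=3)),
]
print(mean_time_to_acknowledge(sample))  # 0:02:00
```

Unacknowledged incidents are left out of this average by construction, so in practice it is worth tracking them separately.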
Real-World Use
DodaZIP's Prometheus Alertmanager fires a critical alert when the API error rate exceeds 5%. Alertmanager routes the alert to PagerDuty, which pages the primary on-call engineer via mobile push, SMS, and phone call. If unacknowledged after 5 minutes, it escalates to the secondary on-call and then to the engineering manager.
flowchart TD
A[Monitoring Alert] --> B[PagerDuty Integration]
B --> C[Service: DodaZIP API]
C --> D[Escalation Policy]
D --> E[Level 1: Primary On-Call]
E --> F{Acknowledged?}
F -->|YES within 5m| G[Incident in Progress]
F -->|NO| H[Level 2: Secondary On-Call]
H --> I{Acknowledged?}
I -->|YES| G
I -->|NO| J[Level 3: Engineering Manager]
J --> K{Acknowledged?}
K -->|NO| L[Stakeholder Notification]
G --> M[Automation: Slack Channel]
G --> N[Automation: Status Page]
G --> O[Automation: Runbook Link]
style B fill:#06AC38,color:#fff
Prerequisites: a PagerDuty account with a REST API token, and an existing monitoring setup (Prometheus, Grafana).
Service and Integration Setup
# PagerDuty API: create a service
curl -X POST https://api.pagerduty.com/services \
-H "Authorization: Token token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"service": {
"name": "DodaZIP API",
"description": "Production API service for DodaZIP",
"escalation_policy": {
"id": "ESC123",
"type": "escalation_policy_reference"
},
"alert_creation": "create_alerts_and_incidents",
"auto_resolve_timeout": 14400
}
}'
# Expected output:
# {"service":{"id":"SVC456","name":"DodaZIP API","status":"active"}}
# Create an integration for a service
curl -X POST https://api.pagerduty.com/services/SVC456/integrations \
-H "Authorization: Token token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"integration": {
"type": "prometheus_alertmanager_inbound_integration",
"name": "Prometheus Alertmanager",
"service": {
"id": "SVC456",
"type": "service_reference"
}
}
}'
# Expected output:
# {"integration":{"id":"INT789","name":"Prometheus Alertmanager","integration_key":"abc123def456"}}
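Once an integration key exists, a quick way to check the path end to end is to send a test event by hand. The sketch below assumes the integration accepts PagerDuty Events API v2 events at `https://events.pagerduty.com/v2/enqueue`; the helper names, the summary text, and the `dodazip-api` source are hypothetical, and the key is the placeholder from the example output above.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "critical") -> dict:
    """Build an Events API v2 'trigger' payload (no network access)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


def send_event(event: dict) -> int:
    """POST the event and return the HTTP status code (network call)."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


event = build_trigger_event(
    "abc123def456",                # placeholder integration key
    "Test alert: API error rate",  # hypothetical summary
    "dodazip-api",                 # hypothetical source name
)
print(json.dumps(event, indent=2))
# send_event(event)  # uncomment to page the on-call engineer for real
```

Building the payload separately from sending it keeps the example safe to run: nothing is paged until `send_event` is called deliberately.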
On-Call Schedule Configuration
# Create a schedule
curl -X POST https://api.pagerduty.com/schedules \
-H "Authorization: Token token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"schedule": {
"name": "DodaTech Primary On-Call",
"time_zone": "America/New_York",
"description": "Primary on-call rotation for platform team",
"schedule_layers": [
{
"name": "Weekly Rotation",
"start": "2026-06-24T00:00:00-04:00",
"rotation_virtual_start": "2026-06-24T00:00:00-04:00",
"rotation_turn_length_seconds": 604800,
"users": [
{
"user": {
"id": "USER1",
"type": "user_reference"
}
},
{
"user": {
"id": "USER2",
"type": "user_reference"
}
},
{
"user": {
"id": "USER3",
"type": "user_reference"
}
}
],
"restrictions": [
{
"type": "weekly_restriction",
"start_day_of_week": 6,
"start_time_of_day": "08:00:00",
"duration_seconds": 172800
}
]
}
]
}
}'
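The rotation in this schedule comes down to simple arithmetic: count how many full turns have elapsed since the rotation start and index into the user list. The following is a simplified sketch of that idea, not PagerDuty's implementation; it ignores restrictions and overrides and uses a fixed UTC-4 offset in place of the `America/New_York` zone.

```python
from datetime import datetime, timedelta, timezone


def on_call_user(users, rotation_start, turn_length_seconds, at):
    """Return the user whose turn covers the moment `at`."""
    if at < rotation_start:
        raise ValueError("time is before the rotation starts")
    elapsed = (at - rotation_start).total_seconds()
    turn = int(elapsed // turn_length_seconds)
    return users[turn % len(users)]  # wrap around after the last user


EDT = timezone(timedelta(hours=-4))       # fixed offset, an assumption
start = datetime(2026, 6, 24, tzinfo=EDT)  # rotation_virtual_start
users = ["USER1", "USER2", "USER3"]
week = 604800                              # rotation_turn_length_seconds

print(on_call_user(users, start, week, start + timedelta(days=8)))  # USER2
```

With three users and one-week turns, each engineer is on call every third week, and the cycle returns to USER1 after 21 days.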
Escalation Policy
# Create escalation policy
curl -X POST https://api.pagerduty.com/escalation_policies \
-H "Authorization: Token token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"escalation_policy": {
"name": "DodaTech Critical Service Policy",
"description": "Escalation policy for production-critical services",
"escalation_rules": [
{
"escalation_delay_in_minutes": 5,
"targets": [
{
"id": "SCHEDULE1",
"type": "schedule_reference"
}
]
},
{
"escalation_delay_in_minutes": 10,
"targets": [
{
"id": "SCHEDULE2",
"type": "schedule_reference"
}
]
},
{
"escalation_delay_in_minutes": 15,
"targets": [
{
"id": "USER_MANAGER",
"type": "user_reference"
}
]
}
],
"num_loops": 2
}
}'
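To see what this policy means for an incident nobody acknowledges, the sketch below prints when each level would be notified. It assumes that `escalation_delay_in_minutes` is the wait before moving on from a rule, and that `num_loops` counts repeats after the first full pass; both readings should be checked against the PagerDuty API reference.

```python
def escalation_timeline(delays_minutes, num_loops):
    """Return (minute, pass_index, level) for an unacknowledged incident."""
    timeline = []
    minute = 0
    for pass_index in range(num_loops + 1):  # first pass plus repeats
        for level, delay in enumerate(delays_minutes, start=1):
            timeline.append((minute, pass_index, level))
            minute += delay  # wait this long before leaving the rule
    return timeline


# Delays and loop count taken from the policy above.
for minute, pass_index, level in escalation_timeline([5, 10, 15], num_loops=2):
    print(f"t+{minute:>2}m  pass {pass_index + 1}  level {level}")
```

Under these assumptions the manager is first notified 15 minutes after the incident triggers, and the policy stops notifying after the third pass.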
Prometheus Alertmanager Integration
# alertmanager.yml: PagerDuty receiver
route:
  receiver: pagerduty-critical
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
    - match:
        severity: warning
      receiver: slack-warnings
receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: abc123def456
        severity: critical
        description: '{{ .GroupLabels.alertname }} - {{ .GroupLabels.service }}'
        client: Prometheus
        client_url: 'https://prometheus.dodatech.com'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          summary: '{{ .CommonAnnotations.summary }}'
  # Placeholder so the slack-warnings route resolves; add slack_configs here.
  - name: slack-warnings
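The routing tree above can be read as "first matching child route wins, otherwise fall back to the top-level receiver". The sketch below models only that rule for the two child routes in this configuration; `pick_receiver` is a hypothetical helper, and real Alertmanager routing also supports regex matchers, nesting, and `continue`.

```python
def pick_receiver(labels, default_receiver, child_routes):
    """Return the receiver of the first child route whose labels all match."""
    for route in child_routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default_receiver  # no child matched: use the root receiver


ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
]

for severity in ("critical", "warning", "info"):
    receiver = pick_receiver({"severity": severity}, "pagerduty-critical", ROUTES)
    print(severity, "->", receiver)
```

Because the root receiver is `pagerduty-critical`, an alert with any other severity, or none at all, also pages the on-call engineer. If that is not intended, pointing the root route at a quieter receiver avoids it.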
Grafana Integration
# Grafana contact point for PagerDuty
apiVersion: 1
contactPoints:
  - name: PagerDuty Critical
    receivers:
      - uid: pagerduty-critical
        type: pagerduty
        settings:
          integrationKey: abc123def456
          severity: critical
          class: infra
          group: devops
Incident Response Automation
# PagerDuty Incident Workflows (configure in the UI or via the API)
name: DodaTech Incident Response
steps:
  - name: Create Slack Channel
    action: slack_create_channel
    params:
      channel_name: "inc-{{incident.id}}"
      team_id: T00XXXX
  - name: Post Incident Details
    action: slack_send_message
    params:
      channel: "inc-{{incident.id}}"
      message: |
        :rotating_light: *Incident {{incident.id}}*
        *Title:* {{incident.title}}
        *Service:* {{incident.service.name}}
        *Urgency:* {{incident.urgency}}
        *Assigned to:* {{incident.assigned_to}}
  - name: Update Status Page
    action: statuspage_update
    params:
      component: api
      status: degraded_performance
  - name: Run Diagnostic Playbook
    action: webhook
    params:
      url: https://automation.dodatech.com/runbook/diagnostics
      method: POST
      body: |
        {
          "incident_id": "{{incident.id}}",
          "service": "{{incident.service.name}}"
        }
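The `{{incident.…}}` placeholders in the workflow are filled in from the incident when each step runs. As a rough illustration of that substitution, and not of PagerDuty's actual template engine, the sketch below resolves dotted placeholders against a nested dictionary. The incident values are made up.

```python
import re

# Matches {{ dotted.path }} with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template, context):
    """Replace each {{a.b.c}} with context['a']['b']['c']."""
    def lookup(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError on unknown fields
        return str(value)
    return PLACEHOLDER.sub(lookup, template)


# Hypothetical incident data for illustration.
context = {
    "incident": {
        "id": "Q1ABC",
        "title": "API error rate above 5%",
        "service": {"name": "DodaZIP API"},
    }
}

# Slack channel names must be lowercase, so normalise after rendering.
print(render("inc-{{incident.id}}", context).lower())  # inc-q1abc
print(render("*Service:* {{incident.service.name}}", context))
```

Failing loudly on an unknown field is a deliberate choice here: a misspelled placeholder is easier to catch in a test than in a half-rendered message during an incident.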
Runbooks
# Runbook: API High Error Rate
## 1. Acknowledge the incident in PagerDuty
## 2. Join the incident Slack channel (#inc-{id})
## 3. Check Grafana dashboard: DodaTech API Overview
## 4. Identify error pattern:
- Are errors from a specific endpoint?
- Are errors from a specific deployment?
- Are errors correlated with database latency?
## 5. Check recent deployments:
kubectl rollout status deployment/user-service -n production
## 6. If recent deployment caused issues:
kubectl rollout undo deployment/user-service -n production
## 7. Check database query performance:
- Open Jaeger for slow traces
- Look for slow database spans
## 8. If database is slow:
- Check pg_stat_activity for long-running queries
- Check read replica lag
## 9. Resolve incident in PagerDuty
## 10. Post-mortem within 48 hours
Common Configuration Mistakes
Not using secondary notification methods: If the primary on-call has their phone on silent, mobile push notifications are missed. Configure SMS and phone call as secondary methods for critical alerts.
Single-person schedules without rotation: A single person on-call 24/7 causes burnout. Use weekly or bi-weekly rotations with at least 3 engineers per schedule.
Escalation delays that are too long: Waiting 30 minutes before escalating wastes response time. Use a 5-minute delay at the first level and 10 to 15 minutes at later levels.
Not silencing alerts during maintenance: Maintenance windows trigger false incidents. Configure PagerDuty maintenance windows when performing planned changes.
Ignoring incident response workflows: Manual incident response steps (creating Slack channels, updating status pages) waste time. Automate them with PagerDuty Incident Workflows.
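For the maintenance-window point above, planned work can be registered ahead of time through the REST API. The sketch below only builds the request body. It assumes the `POST /maintenance_windows` endpoint with a `maintenance_window` object and service references, and to my understanding the request also needs a `From` header carrying a valid user's email. Confirm both against the API reference. The helper name and description text are hypothetical, and `SVC456` is the placeholder service ID used earlier.

```python
import json
from datetime import datetime, timedelta, timezone


def build_maintenance_window(service_ids, start, duration, description):
    """Build the JSON body for creating a maintenance window."""
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    end = start + duration
    return {
        "maintenance_window": {
            "type": "maintenance_window",
            "start_time": start.isoformat(),
            "end_time": end.isoformat(),
            "description": description,
            "services": [
                {"id": sid, "type": "service_reference"}
                for sid in service_ids
            ],
        }
    }


start = datetime(2026, 7, 1, 2, 0, tzinfo=timezone.utc)
body = build_maintenance_window(
    ["SVC456"], start, timedelta(hours=2), "Planned database upgrade"
)
print(json.dumps(body, indent=2))
```

The body can then be sent with the same `Authorization` and `Content-Type` headers as the curl examples earlier in this guide.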
Practice Questions
What is the difference between an alert and an incident in PagerDuty? Answer: An alert is a raw notification from a monitoring tool. PagerDuty groups related alerts into incidents. An incident represents a problem requiring human response.
How do escalation policies work? Answer: Escalation policies define levels of responders. If the first level does not acknowledge within a set time, the incident escalates to the next level, and so on.
What is an on-call schedule? Answer: A schedule defines which user is on call at any given time, using layers with daily or weekly rotations, time restrictions, and overrides.
How does PagerDuty integrate with Prometheus? Answer: Prometheus Alertmanager sends alerts to PagerDuty through a pagerduty_configs receiver, authenticated with the integration's routing_key. PagerDuty then creates an incident on the service that owns that integration, carrying the severity set in the receiver.
Challenge
Set up complete PagerDuty incident management: create a service for a critical application, configure a weekly on-call rotation with 3 engineers, create an escalation policy (5min primary, 10min secondary, 15min manager), integrate with Prometheus Alertmanager for critical alerts, configure Grafana contact points for PagerDuty, create an Incident Workflow that creates a Slack channel and posts incident details, write a runbook for common incident types, and test the entire flow by triggering a test alert.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.