Alerting — Complete PagerDuty and Slack Integration Guide
This tutorial covers alerting for API monitoring: writing Prometheus alerting rules, sending notifications to Slack and PagerDuty, and applying practices that keep alerts actionable.
Alerting notifies your team when API metrics exceed thresholds. It is the bridge between monitoring data and human action, ensuring issues are addressed quickly.
What You'll Learn
You'll learn how to configure alerts for key API metrics, integrate with PagerDuty and Slack, and set up escalation policies.
Why It Matters
Metrics without alerts mean waiting for users to report issues. Proper alerting can reduce mean time to detection (MTTD) from hours to minutes.
Real-World Use
Prometheus alerts when the API error rate exceeds 5% for 5 minutes. The alert fires to Slack (#alerts channel) and creates a PagerDuty incident. The on-call engineer acknowledges within 2 minutes.
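The "exceeds 5% for 5 minutes" condition means a single spike is not enough: the threshold must stay breached for the whole window before the alert fires. The following is a simplified Python model of that behavior, not Prometheus's actual implementation; the class name `ForDurationAlert` and the timestamps are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForDurationAlert:
    """Tracks one alert that must breach its threshold for `for_seconds`."""

    threshold: float = 0.05      # 5% error rate, as in the scenario above
    for_seconds: float = 300.0   # 5 minutes
    pending_since: Optional[float] = None

    def evaluate(self, error_rate: float, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for this evaluation."""
        if error_rate <= self.threshold:
            # Condition cleared: reset, so a later breach starts a new timer.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"


alert = ForDurationAlert()
states = [
    alert.evaluate(0.08, now=0),     # breach starts
    alert.evaluate(0.09, now=150),   # still breaching, under 5 minutes
    alert.evaluate(0.02, now=200),   # recovered, timer resets
    alert.evaluate(0.07, now=260),   # new breach
    alert.evaluate(0.07, now=560),   # breaching for 300 seconds
]
print(states)  # ['pending', 'pending', 'inactive', 'pending', 'firing']
```

The reset on recovery is what stops brief error bursts from paging anyone.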
Implementation
# Prometheus alerting rules
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(api_requests_total{status=~"5.."}[5m])) /
          sum(rate(api_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, rate(
            api_request_duration_seconds_bucket[5m]
          )) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2 seconds"
          description: "p99 is {{ $value }}s"
      - alert: InstanceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API instance {{ $labels.instance }} down"
# Sending Slack alerts from Python
from datetime import datetime, timezone

import requests


def send_slack_alert(message, severity="warning"):
    """Post an alert to Slack through an incoming webhook."""
    webhook_url = "https://hooks.slack.com/services/xxx/yyy/zzz"
    color = {"critical": "danger", "warning": "warning", "info": "good"}
    payload = {
        "attachments": [{
            "color": color.get(severity, "good"),
            "text": message,
            "fields": [
                {"title": "Severity", "value": severity, "short": True},
                {"title": "Time", "value": datetime.now(timezone.utc).isoformat(), "short": True},
            ],
        }]
    }
    # A timeout keeps a slow webhook from blocking the caller.
    response = requests.post(webhook_url, json=payload, timeout=10)
    response.raise_for_status()


# Send PagerDuty alert
def send_pagerduty_alert(summary, severity="critical"):
    """Trigger a PagerDuty incident through the Events API v2."""
    payload = {
        "routing_key": "your-pagerduty-key",
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "severity": severity,
            "source": "api-monitoring",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    }
    response = requests.post("https://events.pagerduty.com/v2/enqueue", json=payload, timeout=10)
    response.raise_for_status()
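The two senders can be combined behind a small router that decides which channels an alert reaches based on its severity. This is a minimal sketch under an assumed policy (critical alerts go to both Slack and PagerDuty, everything else goes to Slack only); `route_alert` and the stand-in senders are hypothetical names used for illustration, and the stand-ins record calls instead of making HTTP requests.

```python
from typing import Callable, Dict, List

Sender = Callable[[str, str], None]


def route_alert(summary: str, severity: str, senders: Dict[str, Sender]) -> List[str]:
    """Send an alert to the channels chosen for its severity.

    Returns the names of the channels that were notified.
    """
    # Assumed policy: critical pages on-call and posts to Slack;
    # everything else only posts to Slack.
    channels = ["slack", "pagerduty"] if severity == "critical" else ["slack"]
    notified = []
    for name in channels:
        senders[name](summary, severity)
        notified.append(name)
    return notified


# Stand-in senders that record calls instead of making HTTP requests.
sent = []
fake_senders = {
    "slack": lambda summary, severity: sent.append(("slack", severity)),
    "pagerduty": lambda summary, severity: sent.append(("pagerduty", severity)),
}

print(route_alert("p99 latency above 2 seconds", "warning", fake_senders))  # ['slack']
print(route_alert("API error rate above 5%", "critical", fake_senders))     # ['slack', 'pagerduty']
```

In real use, `send_slack_alert` and `send_pagerduty_alert` from above could be passed in place of the stand-ins, since both accept a message and a severity.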
Alerting Best Practices
| Practice | Why |
|---|---|
| Alert on symptoms, not causes | "Error rate high" reflects user impact; "CPU at 90%" may not |
| Page for actionable alerts | Only page if someone must wake up |
| Include runbook links | Responders know what to do |
| Deduplicate repeat alerts | Reduces noise from an ongoing incident |
| Test alerts regularly | Ensure delivery works |
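One way to cut noise from repeat alerts is to send a stable `dedup_key` with each PagerDuty event, so that repeated triggers for the same problem are grouped into one open alert instead of creating new ones. The sketch below only builds the event body; the helper names `make_dedup_key` and `build_event`, the key format, and the example labels are assumptions made for illustration.

```python
import hashlib
from typing import Dict


def make_dedup_key(alert_name: str, labels: Dict[str, str]) -> str:
    """Build a stable key so repeats of one alert map to one incident."""
    # Sort labels so the key does not depend on dict ordering.
    canonical = alert_name + "|" + ",".join(
        f"{key}={value}" for key, value in sorted(labels.items())
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]


def build_event(alert_name: str, labels: Dict[str, str], summary: str) -> dict:
    """Build an Events API v2 trigger event carrying a dedup_key."""
    return {
        "routing_key": "your-pagerduty-key",  # placeholder, as in the code above
        "event_action": "trigger",
        "dedup_key": make_dedup_key(alert_name, labels),
        "payload": {
            "summary": summary,
            "severity": "critical",
            "source": "api-monitoring",
        },
    }


first = build_event("InstanceDown", {"job": "api", "instance": "api-1"}, "api-1 down")
repeat = build_event("InstanceDown", {"instance": "api-1", "job": "api"}, "api-1 still down")
other = build_event("InstanceDown", {"job": "api", "instance": "api-2"}, "api-2 down")
print(first["dedup_key"] == repeat["dedup_key"])  # True: same alert, same key
print(first["dedup_key"] == other["dedup_key"])   # False: different instance
```

Deriving the key from the alert name and labels, rather than from the summary text, keeps it stable even when the message wording changes between repeats.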
Common Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Too many alerts | Alert fatigue, ignored alerts | Only alert on actionable conditions |
| No escalation policy | Critical alerts ignored after hours | Set up PagerDuty escalation (dev -> senior -> manager) |
| Alerting without runbooks | Responders do not know what to do | Add runbook URL to every alert |
| Not silencing during maintenance | False alerts during deployments | Use maintenance windows |
| No alert testing | Alerts fail when needed most | Test alerts with chaos engineering |
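The dev -> senior -> manager escalation chain can be pictured as a list of levels, each notified after a delay unless someone acknowledges first. The following is a simplified model for reasoning about the behavior, not PagerDuty's API (real escalation policies are configured in PagerDuty itself); the 15 and 30 minute delays and the function name `notified_so_far` are assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical policy: (responder, minutes after trigger at which they are notified).
ESCALATION_POLICY: List[Tuple[str, int]] = [
    ("dev", 0),
    ("senior", 15),
    ("manager", 30),
]


def notified_so_far(minutes_since_trigger: int, acked_at: Optional[int] = None) -> List[str]:
    """Return who has been notified, stopping escalation once acknowledged."""
    notified = []
    for responder, delay in ESCALATION_POLICY:
        if delay > minutes_since_trigger:
            break  # this level has not been reached yet
        if acked_at is not None and delay > acked_at:
            break  # acknowledged before this level was reached
        notified.append(responder)
    return notified


print(notified_so_far(5))                # ['dev']
print(notified_so_far(20))               # ['dev', 'senior']
print(notified_so_far(45))               # ['dev', 'senior', 'manager']
print(notified_so_far(45, acked_at=10))  # ['dev']
```

An acknowledgement at minute 10 stops the chain before the senior engineer is paged, which is why fast acknowledgement matters even when the fix itself takes longer.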
Practice Questions
- What is the difference between warning and critical alerts?
- How do you prevent alert fatigue?
- What should be in an alert runbook?
- How does PagerDuty escalation work?
- What is a maintenance window?
Challenge
Configure Prometheus alerting rules for high error rate (>5%), high latency (p99 > 2s), and instance down. Route warning alerts to Slack and critical alerts to PagerDuty.
What's Next
Learn about SLOs, SLIs, and error budgets.
Built by the developers of DodaTech