Alerting — Complete PagerDuty and Slack Integration Guide

DodaTech Updated 2026-06-28 2 min read

This tutorial covers alerting for API monitoring: Prometheus alerting rules, Slack and PagerDuty delivery, and the practices that keep alerts actionable.

Alerting notifies your team when API metrics exceed thresholds. It is the bridge between monitoring data and human action, ensuring issues are addressed quickly.

What You'll Learn

You'll learn how to configure alerts for key API metrics, integrate with PagerDuty and Slack, and set up escalation policies.

Why It Matters

Without alerts, you learn about problems only when users report them. Well-tuned alerting can cut mean time to detection (MTTD) from hours to minutes.

Real-World Use

Prometheus alerts when the API error rate exceeds 5% for 5 minutes. The alert fires to Slack (#alerts channel) and creates a PagerDuty incident. The on-call engineer acknowledges within 2 minutes.
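
The "exceeds 5% for 5 minutes" condition can be pictured as a small state machine: the alert is pending while the threshold is breached and only fires once the breach has lasted the full hold time. The sketch below illustrates that idea in plain Python. The names `AlertState` and `observe` are hypothetical and are not part of Prometheus; it is a simplified model, not how Prometheus is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Simplified model of a threshold alert with a hold time (hypothetical helper)."""
    threshold: float                       # e.g. 0.05 for a 5% error rate
    hold_seconds: float                    # e.g. 300 for "for 5 minutes"
    pending_since: Optional[float] = None  # when the current breach started

    def observe(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if value <= self.threshold:
            # Condition cleared: forget any breach in progress.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            # First evaluation above the threshold starts the clock.
            self.pending_since = now
        if now - self.pending_since >= self.hold_seconds:
            return "firing"
        return "pending"
```

A short spike that clears before the hold time elapses never reaches "firing", which is what keeps brief blips from paging anyone.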

Implementation

# prometheus alerting rules
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(api_requests_total{status=~"5.."}[5m])) /
          sum(rate(api_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        # Aggregate buckets across instances (keeping "le") so this is the API-wide p99
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(
            api_request_duration_seconds_bucket[5m]
          ))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2 seconds"
          description: "p99 is {{ $value }}s"

      - alert: InstanceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API instance {{ $labels.instance }} down"
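
To make the HighLatency rule less of a black box, the sketch below shows roughly how a quantile can be estimated from cumulative histogram buckets using linear interpolation inside the bucket that contains the target rank. It is a simplified illustration in the spirit of `histogram_quantile`, not Prometheus's actual implementation. The function name `estimate_quantile` and the sample bucket counts are assumptions made for this example.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not buckets or buckets[-1][1] == 0:
        return math.nan  # no observations at all
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the open-ended bucket: report the last finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical bucket counts: 98 of 100 requests finished within 2 seconds.
sample = [(0.5, 90), (1.0, 95), (2.0, 98), (5.0, 100), (math.inf, 100)]
p99 = estimate_quantile(0.99, sample)
print(p99 > 2)  # this is the comparison the HighLatency rule makes
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries; a bucket edge at or near the 2 second threshold makes the alert more precise.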

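# Sending Slack alerts from Python
from datetime import datetime, timezone

import requests

def send_slack_alert(message, severity="warning"):
    # Placeholder incoming-webhook URL; replace with your own
    webhook_url = "https://hooks.slack.com/services/xxx/yyy/zzz"
    # Map alert severity to Slack's built-in attachment colors
    color = {"critical": "danger", "warning": "warning", "info": "good"}
    payload = {
        "attachments": [{
            "color": color.get(severity, "good"),
            "text": message,
            "fields": [
                {"title": "Severity", "value": severity, "short": True},
                {"title": "Time", "value": datetime.now(timezone.utc).isoformat(), "short": True}
            ]
        }]
    }
    # A timeout keeps a slow webhook from hanging the caller
    response = requests.post(webhook_url, json=payload, timeout=10)
    response.raise_for_status()

# Send PagerDuty alert
def send_pagerduty_alert(summary, severity="critical"):
    # Events API v2 accepts severity values: critical, error, warning, info
    payload = {
        "routing_key": "your-pagerduty-key",
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "severity": severity,
            "source": "api-monitoring",
            # Timezone-aware ISO 8601 timestamp
            "timestamp": datetime.now(timezone.utc).isoformat()
        }
    }
    response = requests.post("https://events.pagerduty.com/v2/enqueue", json=payload, timeout=10)
    response.raise_for_status()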
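
One way to tie the two senders together is a small router that decides which channels an alert reaches. The sketch below assumes every alert is posted to Slack and only critical alerts also page through PagerDuty. That policy and the function name `route_alert` are assumptions for illustration. The senders are passed in as callables so the routing logic can be tested without network access.

```python
from typing import Callable, List

# A sender takes (summary, severity) and delivers the alert somewhere.
Sender = Callable[[str, str], None]


def route_alert(summary: str, severity: str, slack: Sender, pagerduty: Sender) -> List[str]:
    """Deliver an alert and return the names of the channels used."""
    used = []
    # Every alert is visible in Slack (assumed policy).
    slack(summary, severity)
    used.append("slack")
    # Only critical alerts page the on-call engineer.
    if severity == "critical":
        pagerduty(summary, severity)
        used.append("pagerduty")
    return used
```

In real use the two arguments would be functions such as `send_slack_alert` and `send_pagerduty_alert`; in tests they can be simple stubs that record calls.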

Alerting Best Practices

| Practice | Why |
|----------|-----|
| Alert on symptoms, not causes | "Error rate high" not "CPU at 90%" |
| Page for actionable alerts | Only page if someone must wake up |
| Include runbook links | Responders know what to do |
| Auto-acknowledge repeat alerts | Reduce noise |
| Test alerts regularly | Ensure delivery works |
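
Two of these practices, runbook links and less noise from repeat alerts, can be built into the event itself. The sketch below assembles a PagerDuty Events API v2 payload with a stable `dedup_key`, so repeated triggers for the same alert and labels group into one incident, and a `links` entry pointing at a runbook. The helper name `build_event`, the hashing scheme, and the example URL are assumptions for illustration.

```python
import hashlib
from typing import Dict


def build_event(alert_name: str, labels: Dict[str, str], summary: str,
                runbook_url: str, routing_key: str = "your-pagerduty-key") -> dict:
    """Build a trigger event with a stable dedup key and a runbook link."""
    # Sort labels so the key does not depend on dictionary ordering.
    label_part = ",".join(f"{k}={labels[k]}" for k in sorted(labels))
    digest = hashlib.sha256(f"{alert_name}|{label_part}".encode("utf-8")).hexdigest()
    return {
        "routing_key": routing_key,  # placeholder integration key
        "event_action": "trigger",
        "dedup_key": digest[:32],    # same alert + labels -> same incident
        "payload": {
            "summary": summary,
            "severity": "critical",
            "source": "api-monitoring",
        },
        "links": [{"href": runbook_url, "text": "Runbook"}],
    }


event = build_event(
    "InstanceDown",
    {"job": "api", "instance": "10.0.0.5:8080"},  # hypothetical labels
    "API instance 10.0.0.5:8080 down",
    "https://example.com/runbooks/instance-down",  # hypothetical runbook URL
)
```

Sending a later event with `event_action` set to `resolve` and the same `dedup_key` closes the matching incident.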

Common Mistakes

| Mistake | Consequence | Fix |
|---------|-------------|-----|
| Too many alerts | Alert fatigue, ignored alerts | Only alert on actionable conditions |
| No escalation policy | Critical alerts ignored after hours | Set up PagerDuty escalation (dev -> senior -> manager) |
| Alerting without runbooks | Responders do not know what to do | Add runbook URL to every alert |
| Not silencing during maintenance | False alerts during deployments | Use maintenance windows |
| No alert testing | Alerts fail when needed most | Test alerts with chaos engineering |
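
The dev -> senior -> manager escalation can be thought of as a lookup from "minutes without acknowledgement" to "who is notified now". The sketch below models that in plain Python. The 15 and 30 minute delays and the name `current_responder` are assumptions for illustration; in practice the policy is configured in PagerDuty itself.

```python
from typing import List, Tuple

# (minutes without acknowledgement, who gets notified); delays are assumed values.
ESCALATION_POLICY: List[Tuple[int, str]] = [
    (0, "dev on-call"),
    (15, "senior engineer"),
    (30, "engineering manager"),
]


def current_responder(minutes_unacknowledged: float,
                      policy: List[Tuple[int, str]] = ESCALATION_POLICY) -> str:
    """Return the highest escalation level reached so far."""
    responder = policy[0][1]
    for start_minute, who in policy:
        if minutes_unacknowledged >= start_minute:
            responder = who
    return responder


print(current_responder(2))   # dev on-call
print(current_responder(20))  # senior engineer
```

Once someone acknowledges the incident, escalation stops, so the delays set how long a critical alert can sit unnoticed at each level.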

Practice Questions

What is the difference between warning and critical alerts?
How do you prevent alert fatigue?
What should be in an alert runbook?
How does PagerDuty escalation work?
What is a maintenance window?

Challenge

Configure Prometheus alerting rules for high error rate (>5%), high latency (p99 > 2s), and instance down. Route warning alerts to Slack and critical alerts to PagerDuty.

What's Next

Learn about SLOs, SLIs, and error budgets.
