
Site Reliability Engineering Tools — PagerDuty, Opsgenie, Incident.io

DodaTech Updated 2026-06-23 9 min read

This tutorial surveys Site Reliability Engineering tools: incident management platforms (PagerDuty, Opsgenie, Incident.io), the monitoring and observability stack, automation, and how to evaluate and integrate them.

The SRE tool ecosystem spans incident management, monitoring, automation, collaboration, and observability. Each category solves a specific operational problem, and the best SRE teams integrate these tools into a cohesive operational platform.

What You'll Learn

In this tutorial, you will learn the key categories of SRE tools, how to evaluate and choose an incident management platform (PagerDuty, Opsgenie, Incident.io), how to integrate monitoring with incident response, and how to build an SRE toolchain that reduces toil and improves MTTR.

Why It Matters

The right tools can reduce MTTR by as much as 40-60 percent and cut toil by automating routine tasks. The wrong tools create tool sprawl, alert fatigue, and integration debt. SRE teams should evaluate tools on reliability, integration quality, and how well they fit the team's workflow.

Real-World Use

DodaTech uses PagerDuty for incident alerting and on-call management, Prometheus and Grafana for monitoring, OpenTelemetry for distributed tracing, and custom automation scripts for common operational tasks. The toolchain is integrated so that an alert in Prometheus automatically creates a PagerDuty incident with a link to the relevant runbook.

graph TD
    A[Prometheus] -->|Alert| B[Alertmanager]
    B -->|SEV1/SEV2| C[PagerDuty]
    B -->|SEV3| D[Slack]
    C -->|Create Incident| E[Incident Response]
    E -->|Link Runbook| F[Runbook Repo]
    E -->|Postmortem| G[Postmortem Doc]
    G -->|Action Items| H[Jira]
    H -->|Track Fix| I[Deployment Pipeline]
    I -->|Deploy| J[Production]
    J --> A
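The routing step in the diagram can be sketched as a small function. This is a simplified illustration, not Alertmanager's configuration format: the `route_alert` function and the receiver names are hypothetical, and sending unrecognized severities to Slack is an assumed default.

```python
# Hypothetical sketch of the severity routing shown in the diagram.
# Receiver names ("pagerduty", "slack") are illustrative labels only.
PAGING_SEVERITIES = {"SEV1", "SEV2"}


def route_alert(alert: dict) -> str:
    """Return the receiver for an alert based on its severity label."""
    severity = alert.get("severity", "").upper()
    if severity in PAGING_SEVERITIES:
        return "pagerduty"  # page the on-call engineer
    return "slack"  # SEV3 and anything unrecognized goes to chat


alerts = [
    {"name": "HighErrorRate", "severity": "SEV1"},
    {"name": "DiskFilling", "severity": "SEV3"},
]
for a in alerts:
    print(f"{a['name']} -> {route_alert(a)}")
```

Keeping the paging severities in one set makes the policy easy to review when the team changes what counts as page-worthy.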

Prerequisites

Understanding Incident Response helps you evaluate incident management tools. Familiarity with Monitoring and Alerting for SRE gives context for the monitoring-to-alerting integration.

Incident Management Platforms

| Feature | PagerDuty | Opsgenie | Incident.io |
|---|---|---|---|
| On-call scheduling | Yes | Yes | Yes |
| Alert routing | Yes | Yes | Yes |
| Runbook integration | Yes | Yes | Native |
| Postmortem built-in | Add-on | Add-on | Yes |
| Status page | Separate product | Separate product | Built-in |
| SLA tracking | Yes | Yes | Yes |

PagerDuty Integration Example

class IncidentManager:
    """In-memory simulation of an incident lifecycle.

    Models the trigger -> acknowledge -> resolve flow only.
    It does not call the PagerDuty API.
    """

    def __init__(self, platform):
        self.platform = platform
        self.incidents = []

    def _find(self, incident_id):
        # Look up an incident by ID and fail with a clear error if it is missing.
        for incident in self.incidents:
            if incident["id"] == incident_id:
                return incident
        raise KeyError(f"No incident with id {incident_id}")

    def create_incident(self, title, severity, service, runbook_url):
        incident = {
            "id": len(self.incidents) + 1,  # sequential ID, starting at 1
            "title": title,
            "severity": severity,
            "service": service,
            "runbook": runbook_url,
            "status": "triggered",
            "platform": self.platform,
        }
        self.incidents.append(incident)
        print(f"[{self.platform}] Incident #{incident['id']}: {title}")
        print(f"  Severity: {severity}")
        print(f"  Service:  {service}")
        print(f"  Runbook:  {runbook_url}")
        return incident

    def acknowledge(self, incident_id, responder):
        inc = self._find(incident_id)
        inc["status"] = "acknowledged"
        inc["responder"] = responder
        print(f"[{self.platform}] Incident #{incident_id} acknowledged by {responder}")

    def resolve(self, incident_id, resolution_notes):
        inc = self._find(incident_id)
        inc["status"] = "resolved"
        inc["resolution"] = resolution_notes
        print(f"[{self.platform}] Incident #{incident_id} resolved: {resolution_notes}")


# Walk one incident through its full lifecycle.
pd = IncidentManager("PagerDuty")
inc = pd.create_incident(
    "High CPU on Doda Browser API servers",
    "SEV2",
    "doda-browser-api",
    "https://runbooks.dodatech.com/high-cpu",
)
pd.acknowledge(inc["id"], "alice@dodatech.com")
pd.resolve(inc["id"], "Added auto-scaling policy")

Expected output:

[PagerDuty] Incident #1: High CPU on Doda Browser API servers
  Severity: SEV2
  Service:  doda-browser-api
  Runbook:  https://runbooks.dodatech.com/high-cpu
[PagerDuty] Incident #1 acknowledged by alice@dodatech.com
[PagerDuty] Incident #1 resolved: Added auto-scaling policy
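The class above keeps incidents in memory and never contacts PagerDuty. As a rough sketch of what a real trigger might look like, the function below builds a request body in the shape used by PagerDuty's Events API v2. The `build_trigger_event` helper, the SEV-to-severity mapping, and the `dedup_key` scheme are assumptions made for illustration. The routing key is a placeholder, and nothing is sent over the network.

```python
import json

# Assumed mapping from this tutorial's SEV levels to the severity values
# accepted by the Events API v2 (critical, error, warning, info).
SEVERITY_MAP = {"SEV1": "critical", "SEV2": "error", "SEV3": "warning"}


def build_trigger_event(routing_key, title, sev, service, runbook_url):
    """Build the JSON body for a 'trigger' event; does not send anything."""
    return {
        "routing_key": routing_key,  # integration key of the PagerDuty service
        "event_action": "trigger",
        "dedup_key": f"{service}:{title}",  # hypothetical dedup scheme
        "payload": {
            "summary": title,
            "source": service,
            "severity": SEVERITY_MAP.get(sev, "info"),
        },
        "links": [{"href": runbook_url, "text": "Runbook"}],
    }


event = build_trigger_event(
    "YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    "High CPU on Doda Browser API servers",
    "SEV2",
    "doda-browser-api",
    "https://runbooks.dodatech.com/high-cpu",
)
print(json.dumps(event, indent=2))
```

In a real integration this body would be sent as an HTTP POST to the Events API endpoint (`https://events.pagerduty.com/v2/enqueue`). Attaching the runbook under `links` is one way to give the responder immediate guidance from the incident itself.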

Monitoring and Observability Stack

| Layer | Tool | Purpose |
|---|---|---|
| Metrics collection | Prometheus | Scrape and store time-series metrics |
| Visualization | Grafana | Dashboards and alerting |
| Logging | Loki / ELK | Log aggregation and search |
| Tracing | Jaeger / OpenTelemetry | Distributed tracing |
| Alerting | Alertmanager | Route and deduplicate alerts |
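To make "deduplicate alerts" concrete, here is a minimal sketch of the idea: two alerts with the same label set are treated as the same alert. This is not Alertmanager's implementation. The `fingerprint` and `deduplicate` helpers are hypothetical and only illustrate identifying alerts by their labels.

```python
import hashlib


def fingerprint(labels: dict) -> str:
    """Stable hash of an alert's labels, independent of key order."""
    canonical = ",".join(f"{k}={labels[k]}" for k in sorted(labels))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]


def deduplicate(alerts: list) -> list:
    """Keep the first alert seen for each distinct label set."""
    seen = set()
    unique = []
    for alert in alerts:
        fp = fingerprint(alert["labels"])
        if fp not in seen:
            seen.add(fp)
            unique.append(alert)
    return unique


incoming = [
    {"labels": {"alertname": "HighErrorRate", "service": "api"}},
    {"labels": {"service": "api", "alertname": "HighErrorRate"}},  # same labels
    {"labels": {"alertname": "HighErrorRate", "service": "sync"}},
]
print(len(deduplicate(incoming)))  # prints 2
```

Sorting the label keys before hashing means the order in which a sender lists its labels does not affect whether two alerts match.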

Prometheus-Style Alerting

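class PrometheusAlert:
    """Toy model of an alerting rule of the form '<expression> > <threshold>'.

    Prometheus evaluates the expression itself. Here the caller supplies the
    metric value and only the threshold comparison is simulated.
    """

    def __init__(self, alert_name, expr, severity, annotations):
        self.name = alert_name
        self.expr = expr
        self.severity = severity
        self.annotations = annotations

    def evaluate(self, metric_value):
        print(f"Evaluating: {self.expr}")
        # Split on the last '>' so any '>' inside the expression is left alone.
        threshold = float(self.expr.rsplit(">", 1)[1].strip())
        if metric_value > threshold:
            print(f"FIRING: {self.name} ({self.severity})")
            for k, v in self.annotations.items():
                print(f"  {k}: {v}")
            return True
        return False


# Fire when more than 1% of requests per second return a 5xx status.
alert = PrometheusAlert(
    "HighErrorRate",
    "rate(http_requests_total{status=~'5..'}[5m]) > 0.01",
    "critical",
    {"summary": "High HTTP 5xx error rate", "runbook": "https://runbooks.dodatech.com/errors"},
)
alert.evaluate(0.02)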

Expected output:

Evaluating: rate(http_requests_total{status=~'5..'}[5m]) > 0.01
FIRING: HighErrorRate (critical)
  summary: High HTTP 5xx error rate
  runbook: https://runbooks.dodatech.com/errors

Automation Tools

SRE teams use automation to reduce toil. Common automation approaches include:

| Tool | Purpose | Example Use Case |
|---|---|---|
| Terraform | Infrastructure provisioning | Create cloud resources |
| Ansible | Configuration management | Apply OS patches |
| Custom scripts | Ad hoc automation | Certificate renewal |
| ChatOps bots | Slack-based operations | Run commands from chat |

ChatOps Bot for Common Tasks

class ChatOpsBot:
    """Simulated chat bot: prints canned replies instead of running real operations."""

    def __init__(self, channel):
        self.channel = channel

    def handle_command(self, command, user):
        # Echo the user's message, then reply with the matching canned response.
        print(f"[{self.channel}] {user}: {command}")
        if command == "status":
            print(f"[{self.channel}] Bot: All services healthy")
        elif command == "deploy":
            print(f"[{self.channel}] Bot: Deploying v2.3.2 to staging...")
        elif command == "restart sync":
            print(f"[{self.channel}] Bot: Restarting sync-service... Done!")
        else:
            # The help text lists the commands exactly as they must be typed.
            print(f"[{self.channel}] Bot: Unknown command. Available: status, deploy, restart sync")


bot = ChatOpsBot("#sre-operations")
bot.handle_command("status", "alice")
bot.handle_command("restart sync", "bob")

Expected output:

[#sre-operations] alice: status
[#sre-operations] Bot: All services healthy
[#sre-operations] bob: restart sync
[#sre-operations] Bot: Restarting sync-service... Done!
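As the command list grows, an `if`/`elif` chain becomes hard to maintain. One common alternative is a command registry, sketched below. The `command` decorator, `COMMANDS` dictionary, and `dispatch` function are hypothetical names, and the handlers return canned strings rather than performing real operations.

```python
from typing import Callable, Dict

# Registry mapping command text to a handler function.
COMMANDS: Dict[str, Callable[[], str]] = {}


def command(name: str):
    """Decorator that registers a handler under the given command name."""
    def register(func: Callable[[], str]) -> Callable[[], str]:
        COMMANDS[name] = func
        return func
    return register


@command("status")
def status() -> str:
    return "All services healthy"


@command("restart sync")
def restart_sync() -> str:
    return "Restarting sync-service... Done!"


def dispatch(text: str) -> str:
    """Look up the command and run it, or list what is available."""
    handler = COMMANDS.get(text.strip().lower())
    if handler is None:
        return "Unknown command. Available: " + ", ".join(sorted(COMMANDS))
    return handler()


print(dispatch("status"))
print(dispatch("reboot"))
```

Because the help text is generated from the registry, it cannot drift out of sync with the commands the bot actually supports.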

Tool Evaluation Criteria

When choosing SRE tools, evaluate against these criteria:

| Criteria | Weight | Questions to Ask |
|---|---|---|
| Reliability | High | Does the tool itself have a good SLA? |
| Integration | High | Does it integrate with our existing stack? |
| API quality | High | Can we automate everything through APIs? |
| On-call support | Medium | Does it handle escalation and rotation? |
| Cost | Medium | Does the pricing scale with our usage? |
| Ease of use | Medium | How steep is the learning curve? |
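One way to apply these criteria is a weighted score. The sketch below assumes High counts as 3 and Medium as 2, and that each tool is rated 1 to 5 per criterion. Both the numeric weights and the ratings are invented for illustration and do not describe any real product.

```python
# Assumed numeric weights for the table's High/Medium labels.
WEIGHTS = {
    "reliability": 3, "integration": 3, "api_quality": 3,
    "on_call_support": 2, "cost": 2, "ease_of_use": 2,
}


def weighted_score(ratings: dict) -> float:
    """Weighted average of 1-5 ratings; every criterion must be rated."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    total = sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS)
    return total / sum(WEIGHTS.values())


# Made-up ratings for two hypothetical candidates, not real product scores.
tool_a = {"reliability": 5, "integration": 4, "api_quality": 4,
          "on_call_support": 5, "cost": 2, "ease_of_use": 3}
tool_b = {"reliability": 4, "integration": 3, "api_quality": 5,
          "on_call_support": 3, "cost": 4, "ease_of_use": 5}

print(f"tool_a: {weighted_score(tool_a):.2f}")  # 59 / 15 = 3.93
print(f"tool_b: {weighted_score(tool_b):.2f}")  # 60 / 15 = 4.00
```

A score like this is a starting point for discussion, not a verdict: a tool that fails a hard requirement should be ruled out regardless of its total.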

Building an SRE Toolchain

An effective SRE toolchain is not just a collection of tools but an integrated system. The key integration points are:

| Integration | What It Does | Benefit |
|---|---|---|
| Monitoring to alerting | Prometheus alerts trigger PagerDuty incidents | Fast notification |
| Alerting to runbooks | PagerDuty incident links to runbook | Immediate guidance |
| Incident to postmortem | PagerDuty incident creates postmortem doc | No data loss |
| Postmortem to project tracking | Action items go to Jira | Tracked to completion |
| Deployment to monitoring | Deployments annotated in Grafana | See deploy impact on metrics |
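The deployment-to-monitoring integration usually amounts to posting an annotation when a deploy finishes. The sketch below only builds the annotation body. The `deploy_annotation` helper is hypothetical, and the `time`, `tags`, and `text` fields are modeled on Grafana's annotations HTTP API (`POST /api/annotations`), so check the documentation for your Grafana version before relying on them.

```python
from datetime import datetime, timezone


def deploy_annotation(service: str, version: str, deployed_at: datetime) -> dict:
    """Build an annotation body marking a deployment; does not send it."""
    return {
        "time": int(deployed_at.timestamp() * 1000),  # epoch milliseconds
        "tags": ["deploy", service],
        "text": f"Deployed {service} {version}",
    }


# Example values; the timestamp is arbitrary.
when = datetime(2026, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
body = deploy_annotation("doda-browser-api", "v2.3.2", when)
print(body)
```

Tagging each annotation with the service name lets a dashboard filter deploy markers down to the service it displays.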

Opsgenie Features

Opsgenie is an alternative to PagerDuty with similar capabilities. Key differentiators include:

| Feature | Opsgenie |
|---|---|
| On-call schedules | Flexible schedules with overrides |
| Alert deduplication | Automatic grouping of related alerts |
| Integration ecosystem | 200+ integrations |
| Reporting | Built-in incident analytics |
| Pricing | Per-user pricing model |
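"Flexible schedules with overrides" can be illustrated with a small lookup: a weekly rotation decides who is on call unless an override exists for that day. This is a sketch of the concept, not Opsgenie's scheduling model. The `on_call` function, responder names, and dates are all hypothetical.

```python
from datetime import date


def on_call(day: date, rotation: list, start: date, overrides: dict) -> str:
    """Return who is on call for a day: an override wins, else weekly rotation."""
    if day in overrides:
        return overrides[day]
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


rotation = ["alice", "bob", "carol"]       # hypothetical responders
start = date(2026, 1, 5)                   # a Monday; rotation begins here
overrides = {date(2026, 1, 14): "carol"}   # carol covers one day for bob

print(on_call(date(2026, 1, 6), rotation, start, overrides))   # alice
print(on_call(date(2026, 1, 13), rotation, start, overrides))  # bob
print(on_call(date(2026, 1, 14), rotation, start, overrides))  # carol
```

Checking overrides first keeps the base rotation untouched, so a one-day swap never shifts anyone else's week.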

Incident.io Features

Incident.io is a newer platform that focuses on the full incident lifecycle, not just alert routing. Key differentiators:

  • Built-in postmortems: Postmortem creation is integrated into the incident resolution flow.
  • Status pages: Internal and external status pages are built in.
  • Incident timeline: Automatic timeline generation from Slack messages and tool integrations.
  • Severity-based workflows: Different workflows trigger based on incident severity.
  • Slack-native: Deep Slack integration for incident command and coordination.
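Automatic timeline generation comes down to merging events from several sources into chronological order. The sketch below shows that idea only. It is not Incident.io's implementation, and the `TimelineEvent` class, source labels, and sample messages are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime
    source: str  # e.g. "slack" or "pagerduty"; labels are illustrative
    text: str


def build_timeline(*streams):
    """Merge event streams from several sources into chronological order."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda e: e.at)


slack = [
    TimelineEvent(datetime(2026, 1, 15, 12, 4), "slack", "alice: looking at CPU graphs"),
    TimelineEvent(datetime(2026, 1, 15, 12, 20), "slack", "alice: scaling policy applied"),
]
paging = [
    TimelineEvent(datetime(2026, 1, 15, 12, 1), "pagerduty", "Incident triggered"),
    TimelineEvent(datetime(2026, 1, 15, 12, 3), "pagerduty", "Acknowledged by alice"),
]

timeline = build_timeline(slack, paging)
for event in timeline:
    print(f"{event.at:%H:%M} [{event.source}] {event.text}")
```

A merged timeline like this is what makes a postmortem cheap to write, because the sequence of events is captured while the incident is happening rather than reconstructed afterward.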

Choosing the Right Incident Management Platform

The best platform depends on your team size, existing toolchain, and budget. Consider these factors:

  • Team size: PagerDuty and Opsgenie work well for teams of any size. Incident.io is particularly strong for teams that want a unified incident lifecycle.
  • Existing tools: If you already use Slack heavily, Incident.io has the deepest Slack integration. If you use Jira, PagerDuty has a mature integration, and Opsgenie, as an Atlassian product, integrates with it closely.
  • Budget: Opsgenie is often more affordable for smaller teams. PagerDuty pricing scales with features.
  • Postmortem needs: If you want built-in postmortems, Incident.io has the strongest offering.

Open Source Alternatives

Not every team needs commercial tools. Open source alternatives exist for most SRE tool categories:

| Category | Commercial | Open Source |
|---|---|---|
| Monitoring | Datadog, New Relic | Prometheus + Grafana |
| Incident management | PagerDuty, Opsgenie | Cabot, Uptime Kuma |
| On-call scheduling | PagerDuty | Grafana OnCall, Zabbix |
| Log management | Splunk, Datadog | Loki, ELK Stack |
| Runbooks | PagerDuty Runbooks | Rundeck, StackStorm |

Common Errors

| Error | Explanation |
|---|---|
| Too many tools | Each tool adds integration complexity. Standardize on a minimal set. |
| No runbook integration | If the alert does not link to a runbook, the responder wastes time searching. |
| Monitoring without alerting | Collecting metrics without alerting on them misses incidents until users report them. |
| Ignoring API quality | A tool with a poor API will be harder to automate and integrate. |
| Not testing the toolchain | The full alert-to-incident-to-runbook pipeline should be tested regularly. |
| No SLA for the tool itself | If PagerDuty is down, can your team still respond to incidents? Have a backup. |
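The last row, having a backup when the paging tool is down, can be sketched as a notifier that tries channels in order. The `notify_with_fallback` function and the channel names are hypothetical. The primary channel here is a stub that always fails so the fallback path is exercised.

```python
def notify_with_fallback(message: str, channels: list) -> str:
    """Try each (name, send) channel in order; return the first that succeeds."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # any failure moves on to the next channel
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all notification channels failed: " + "; ".join(errors))


def primary_down(message: str) -> None:
    # Simulates the primary paging provider being unreachable.
    raise ConnectionError("primary paging provider unreachable")


sent = []  # stands in for a backup channel such as SMS or a phone tree

used = notify_with_fallback(
    "SEV1: checkout errors",
    [("primary", primary_down), ("backup", sent.append)],
)
print(used)  # backup
```

Running a drill like this on a schedule also covers the "Not testing the toolchain" row: the fallback path is only trustworthy if it is exercised before a real outage.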

Practice Questions

  1. Which three incident management platforms does this tutorial compare?
  2. How does the monitoring stack integrate with incident management?
  3. What criteria should you use to evaluate SRE tools?
  4. Why is API quality important for SRE tools?
  5. What is ChatOps and how does it help SRE teams?

Challenge

Design an SRE toolchain for a new DodaTech service. Choose an incident management platform, monitoring stack, automation approach, and collaboration tools. Draw the integration architecture showing how alerts flow from monitoring to incident management to runbook execution. Justify each tool choice based on the evaluation criteria.

FAQ

What is PagerDuty?

PagerDuty is an incident management platform that handles on-call scheduling, alert routing, escalation policies, and incident tracking.

How is Incident.io different from PagerDuty?

Incident.io includes native postmortem and status page features that are add-ons or separate products in PagerDuty. It focuses on the full incident lifecycle from alert to postmortem.

What is the Prometheus stack?

Prometheus collects time-series metrics, Alertmanager handles alert routing, and Grafana provides visualization dashboards.

Do I need all these tools?

Start with monitoring (Prometheus + Grafana), an incident management platform (PagerDuty or similar), and runbooks. Add tools as the team grows.

What is ChatOps?

ChatOps is the practice of running operational commands through chat platforms like Slack, enabling teams to manage incidents without switching tools.

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro