Site Reliability Engineering Tools — PagerDuty, Opsgenie, Incident.io

DodaTech Updated 2026-06-23 9 min read

This tutorial surveys Site Reliability Engineering (SRE) tools, covering the key concepts, practical examples, and best practices.

The SRE tool ecosystem spans incident management, monitoring, automation, collaboration, and observability. Each category solves a specific operational problem, and the best SRE teams integrate these tools into a cohesive operational platform.

What You'll Learn

In this tutorial, you will learn the key categories of SRE tools, how to evaluate and choose an incident management platform (PagerDuty, Opsgenie, Incident.io), how to integrate monitoring with incident response, and how to build an SRE toolchain that reduces toil and improves MTTR.

Why It Matters

The right tools can reduce MTTR by an estimated 40-60 percent and cut toil by automating routine tasks. The wrong tools create tool sprawl, alert fatigue, and integration debt. SRE teams should evaluate tools on reliability, integration quality, and how well they fit the team's workflow.

Real-World Use

DodaTech uses PagerDuty for incident alerting and on-call management, Prometheus and Grafana for monitoring, OpenTelemetry for distributed tracing, and custom automation scripts for common operational tasks. The toolchain is integrated so that an alert in Prometheus automatically creates a PagerDuty incident with a link to the relevant runbook.

graph TD
    A[Prometheus] -->|Alert| B[Alertmanager]
    B -->|SEV1/SEV2| C[PagerDuty]
    B -->|SEV3| D[Slack]
    C -->|Create Incident| E[Incident Response]
    E -->|Link Runbook| F[Runbook Repo]
    E -->|Postmortem| G[Postmortem Doc]
    G -->|Action Items| H[Jira]
    H -->|Track Fix| I[Deployment Pipeline]
    I -->|Deploy| J[Production]
    J --> A
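
The severity split in the diagram (SEV1/SEV2 to PagerDuty, SEV3 to Slack) can be sketched as a small routing function. This is a minimal illustration, not Alertmanager's actual routing engine; the `ROUTES` table, the `route_alert` helper, the sample alert names, and the choice of Slack as the default receiver are all assumptions made for the example.

```python
# Hypothetical routing table mirroring the diagram:
# SEV1 and SEV2 page the on-call engineer, SEV3 goes to chat.
ROUTES = {
    "SEV1": "pagerduty",
    "SEV2": "pagerduty",
    "SEV3": "slack",
}


def route_alert(alert: dict, default: str = "slack") -> str:
    """Return the receiver name for an alert based on its severity label."""
    severity = alert.get("labels", {}).get("severity", "")
    # Unknown or missing severities fall back to the default receiver (an assumption).
    return ROUTES.get(severity.upper(), default)


alerts = [
    {"labels": {"alertname": "HighErrorRate", "severity": "SEV1"}},
    {"labels": {"alertname": "DiskFilling", "severity": "sev3"}},
    {"labels": {"alertname": "Unlabeled"}},
]
for a in alerts:
    print(a["labels"]["alertname"], "->", route_alert(a))
```

Falling back to the lower-urgency channel keeps unlabeled alerts visible without paging anyone; a team that prefers to fail loudly could pass `default="pagerduty"` instead.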

Prerequisites

Understanding Incident Response helps you evaluate incident management tools. Familiarity with Monitoring and Alerting for SRE gives context for the monitoring-to-alerting integration.

Incident Management Platforms

Feature	PagerDuty	Opsgenie	Incident.io
On-call scheduling	Yes	Yes	Yes
Alert routing	Yes	Yes	Yes
Runbook integration	Yes	Yes	Native
Postmortem built-in	Add-on	Add-on	Yes
Status page	Separate product	Separate product	Built-in
SLA tracking	Yes	Yes	Yes

PagerDuty Integration Example

class IncidentManager:
    """In-memory simulation of an incident lifecycle; it makes no PagerDuty API calls."""

    def __init__(self, platform):
        self.platform = platform
        self.incidents = []

    def create_incident(self, title, severity, service, runbook_url):
        # IDs are sequential and start at 1.
        incident = {
            "id": len(self.incidents) + 1,
            "title": title,
            "severity": severity,
            "service": service,
            "runbook": runbook_url,
            "status": "triggered",
            "platform": self.platform
        }
        self.incidents.append(incident)
        print(f"[{self.platform}] Incident #{incident['id']}: {title}")
        print(f"  Severity: {severity}")
        print(f"  Service:  {service}")
        print(f"  Runbook:  {runbook_url}")
        return incident

    def _find(self, incident_id):
        # Raise a clear error for unknown IDs instead of a bare StopIteration.
        for inc in self.incidents:
            if inc["id"] == incident_id:
                return inc
        raise KeyError(f"No incident with id {incident_id}")

    def acknowledge(self, incident_id, responder):
        inc = self._find(incident_id)
        inc["status"] = "acknowledged"
        inc["responder"] = responder
        print(f"[{self.platform}] Incident #{incident_id} acknowledged by {responder}")

    def resolve(self, incident_id, resolution_notes):
        inc = self._find(incident_id)
        inc["status"] = "resolved"
        inc["resolution"] = resolution_notes
        print(f"[{self.platform}] Incident #{incident_id} resolved: {resolution_notes}")

pd = IncidentManager("PagerDuty")
inc = pd.create_incident(
    "High CPU on Doda Browser API servers",
    "SEV2",
    "doda-browser-api",
    "https://runbooks.dodatech.com/high-cpu"
)
pd.acknowledge(inc["id"], "alice@dodatech.com")
pd.resolve(inc["id"], "Added auto-scaling policy")

Expected output:

[PagerDuty] Incident #1: High CPU on Doda Browser API servers
  Severity: SEV2
  Service:  doda-browser-api
  Runbook:  https://runbooks.dodatech.com/high-cpu
[PagerDuty] Incident #1 acknowledged by alice@dodatech.com
[PagerDuty] Incident #1 resolved: Added auto-scaling policy
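
The class above only simulates the lifecycle in memory. As a rough sketch of what a real trigger could look like, the block below builds a request body in the shape of PagerDuty's Events API v2 (`routing_key`, `event_action`, `payload`, `links`, `dedup_key`), which would be sent as JSON in a POST to `https://events.pagerduty.com/v2/enqueue`. It deliberately stops short of sending anything. The `SEVERITY_MAP` from SEV levels to API severities, the `build_trigger_event` helper, the placeholder routing key, and the dedup key format are assumptions for illustration.

```python
import json

# Assumption: how this tutorial's SEV levels map onto the severity values
# the Events API v2 accepts (critical, error, warning, info).
SEVERITY_MAP = {"SEV1": "critical", "SEV2": "error", "SEV3": "warning"}


def build_trigger_event(routing_key, title, severity, service, runbook_url, dedup_key=None):
    """Build (but do not send) an Events API v2 style trigger event."""
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": title,
            "source": service,
            "severity": SEVERITY_MAP.get(severity, "info"),
        },
        # The runbook link travels with the incident so responders see it immediately.
        "links": [{"href": runbook_url, "text": "Runbook"}],
    }
    if dedup_key:
        # Events sharing a dedup_key are treated as the same incident.
        event["dedup_key"] = dedup_key
    return event


event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    title="High CPU on Doda Browser API servers",
    severity="SEV2",
    service="doda-browser-api",
    runbook_url="https://runbooks.dodatech.com/high-cpu",
    dedup_key="doda-browser-api/high-cpu",  # hypothetical key format
)
print(json.dumps(event, indent=2))
```

Acknowledging or resolving would reuse the same `dedup_key` with `event_action` set to `acknowledge` or `resolve`.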

Monitoring and Observability Stack

Layer	Tool	Purpose
Metrics collection	Prometheus	Scrape and store time-series metrics
Visualization	Grafana	Dashboards and alerting
Logging	Loki / ELK	Log aggregation and search
Tracing	Jaeger / OpenTelemetry	Distributed tracing
Alerting	Alertmanager	Route and deduplicate alerts
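
To make the "deduplicate" part of the Alertmanager row concrete, here is a minimal sketch that collapses alerts with identical label sets. It illustrates the idea only and is not Alertmanager's implementation; the `fingerprint` and `deduplicate` helpers and the sample alerts are hypothetical.

```python
import hashlib


def fingerprint(labels: dict) -> str:
    """Stable hash of a label set; identical label sets give the same value."""
    # Sorting the keys makes the result independent of label order.
    canonical = ",".join(f"{k}={labels[k]}" for k in sorted(labels))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]


def deduplicate(alerts: list) -> list:
    """Keep the first alert seen for each distinct label set."""
    seen = {}
    for alert in alerts:
        seen.setdefault(fingerprint(alert["labels"]), alert)
    return list(seen.values())


incoming = [
    {"labels": {"alertname": "HighErrorRate", "service": "doda-browser-api"}},
    {"labels": {"service": "doda-browser-api", "alertname": "HighErrorRate"}},
    {"labels": {"alertname": "HighErrorRate", "service": "sync-service"}},
]
unique = deduplicate(incoming)
print(len(incoming), "alerts in,", len(unique), "alerts out")
```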

Prometheus-Style Alerting

class PrometheusAlert:
    """Toy model of an alert rule; supports only expressions of the form '<query> > <number>'."""

    def __init__(self, alert_name, expr, severity, annotations):
        self.name = alert_name
        self.expr = expr
        self.severity = severity
        self.annotations = annotations

    def evaluate(self, metric_value):
        print(f"Evaluating: {self.expr}")
        # Split on the last '>' so a '>' inside the query itself does not break parsing.
        threshold = float(self.expr.rsplit(">", 1)[1].strip())
        if metric_value > threshold:
            print(f"FIRING: {self.name} ({self.severity})")
            for k, v in self.annotations.items():
                print(f"  {k}: {v}")
            return True
        return False

alert = PrometheusAlert(
    "HighErrorRate",
    "rate(http_requests_total{status=~'5..'}[5m]) > 0.01",
    "critical",
    {"summary": "High HTTP 5xx error rate", "runbook": "https://runbooks.dodatech.com/errors"}
)
alert.evaluate(0.02)

Expected output:

Evaluating: rate(http_requests_total{status=~'5..'}[5m]) > 0.01
FIRING: HighErrorRate (critical)
  summary: High HTTP 5xx error rate
  runbook: https://runbooks.dodatech.com/errors
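
The example above fires as soon as one sample crosses the threshold. Real Prometheus alerting rules can also carry a `for` duration, so an alert stays pending until the condition has held that long. The sketch below models that inactive, pending, firing progression under simplifying assumptions; the `AlertState` class, the 300-second duration, and the sample values are invented for illustration.

```python
class AlertState:
    """Track whether a threshold breach has lasted long enough to fire."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.active_since = None  # time the condition first became true

    def observe(self, value, now):
        if value <= self.threshold:
            # Condition cleared: reset the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


state = AlertState(threshold=0.01, for_seconds=300)
# (timestamp in seconds, observed error rate), hypothetical samples
samples = [(0, 0.02), (60, 0.03), (300, 0.02), (360, 0.005), (420, 0.02)]
states = [state.observe(value, t) for t, value in samples]
for (t, _), s in zip(samples, states):
    print(t, s)
```

Requiring the condition to persist trades a slower page for fewer alerts on short spikes, which is one way to limit alert fatigue.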

Automation Tools

SRE teams use automation to reduce toil. Common automation approaches include:

Tool	Purpose	Example Use Case
Terraform	Infrastructure provisioning	Create cloud resources
Ansible	Configuration management	Apply OS patches
Custom scripts	Ad hoc automation	Certificate renewal
ChatOps bots	Slack-based operations	Run commands from chat

ChatOps Bot for Common Tasks

class ChatOpsBot:
    def __init__(self, channel):
        self.channel = channel

    def handle_command(self, command, user):
        print(f"[{self.channel}] {user}: {command}")
        if command == "status":
            print(f"[{self.channel}] Bot: All services healthy")
        elif command == "deploy":
            print(f"[{self.channel}] Bot: Deploying v2.3.2 to staging...")
        elif command == "restart sync":
            print(f"[{self.channel}] Bot: Restarting sync-service... Done!")
        else:
            print(f"[{self.channel}] Bot: Unknown command. Available: status, deploy, restart sync")

bot = ChatOpsBot("#sre-operations")
bot.handle_command("status", "alice")
bot.handle_command("restart sync", "bob")

Expected output:

[#sre-operations] alice: status
[#sre-operations] Bot: All services healthy
[#sre-operations] bob: restart sync
[#sre-operations] Bot: Restarting sync-service... Done!
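
As the command list grows, an if/elif chain becomes hard to maintain and the help text drifts out of date. One possible alternative, sketched here under assumptions, is a small registry that maps command strings to handler functions and generates the "available commands" message from what is actually registered. `CommandRegistry` and its handlers are hypothetical and return canned strings rather than touching any real service.

```python
from typing import Callable, Dict


class CommandRegistry:
    """Map chat command strings to handler functions."""

    def __init__(self):
        self._handlers: Dict[str, Callable[[], str]] = {}

    def register(self, name: str):
        # Decorator that records the handler under the given command name.
        def decorator(func):
            self._handlers[name] = func
            return func
        return decorator

    def dispatch(self, command: str) -> str:
        handler = self._handlers.get(command.strip().lower())
        if handler is None:
            # The help text is derived from the registry, so it cannot go stale.
            available = ", ".join(sorted(self._handlers))
            return f"Unknown command. Available: {available}"
        return handler()


registry = CommandRegistry()


@registry.register("status")
def status():
    return "All services healthy"  # canned reply for the sketch


@registry.register("restart sync")
def restart_sync():
    return "Restarting sync-service... Done!"  # canned reply for the sketch


print(registry.dispatch("status"))
print(registry.dispatch("rollback"))
```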

Tool Evaluation Criteria

When choosing SRE tools, evaluate against these criteria:

Criteria	Weight	Questions to Ask
Reliability	High	Does the tool itself have a good SLA?
Integration	High	Does it integrate with our existing stack?
API quality	High	Can we automate everything through APIs?
On-call support	Medium	Does it handle escalation and rotation?
Cost	Medium	Does the pricing scale with our usage?
Ease of use	Medium	How steep is the learning curve?
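
The criteria table can be turned into a simple weighted score when comparing candidates. The sketch below assumes numeric weights of 3 for High and 2 for Medium and a 1-5 rating per criterion; those numbers, the `weighted_score` helper, and the two placeholder tools with their ratings are illustrative assumptions, not measurements of any real product.

```python
# Assumption: numeric values for the table's High/Medium weights.
WEIGHTS = {"High": 3, "Medium": 2}

# Criteria and weights copied from the table above.
CRITERIA = {
    "Reliability": "High",
    "Integration": "High",
    "API quality": "High",
    "On-call support": "Medium",
    "Cost": "Medium",
    "Ease of use": "Medium",
}


def weighted_score(ratings: dict) -> float:
    """Return a 0-1 score from per-criterion ratings on a 1-5 scale."""
    total = sum(WEIGHTS[CRITERIA[c]] * ratings[c] for c in CRITERIA)
    max_total = sum(WEIGHTS[w] * 5 for w in CRITERIA.values())
    return round(total / max_total, 3)


# Placeholder candidates with made-up ratings.
candidates = {
    "Tool A": {c: 4 for c in CRITERIA},
    "Tool B": {
        "Reliability": 5, "Integration": 3, "API quality": 3,
        "On-call support": 5, "Cost": 5, "Ease of use": 5,
    },
}
scores = {name: weighted_score(r) for name, r in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, score)
```

A score like this is a conversation starter rather than a verdict; a hard requirement such as a minimum SLA is better treated as a pass/fail gate before scoring.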

Building an SRE Toolchain

An effective SRE toolchain is not just a collection of tools but an integrated system. The key integration points are:

Integration	What It Does	Benefit
Monitoring to alerting	Prometheus alerts trigger PagerDuty incidents	Fast notification
Alerting to runbooks	PagerDuty incident links to runbook	Immediate guidance
Incident to postmortem	PagerDuty incident creates postmortem doc	No data loss
Postmortem to project tracking	Action items go to Jira	Tracked to completion
Deployment to monitoring	Deployments annotated in Grafana	See deploy impact on metrics
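
For the last row, a deployment pipeline can mark each release on Grafana dashboards by creating an annotation. The sketch below only builds a request body in the shape used by Grafana's annotations HTTP API (`time` in epoch milliseconds, `tags`, `text`), which a pipeline would POST to `/api/annotations` with an API token. The `build_deploy_annotation` helper, the tag names, and the deployment timestamp are assumptions for illustration; nothing is sent.

```python
from datetime import datetime, timezone


def build_deploy_annotation(service: str, version: str, deployed_at: datetime) -> dict:
    """Build (but do not send) a Grafana-style annotation for a deployment."""
    return {
        # Grafana expects annotation times as epoch milliseconds.
        "time": int(deployed_at.timestamp() * 1000),
        "tags": ["deployment", service],
        "text": f"Deployed {service} {version}",
    }


annotation = build_deploy_annotation(
    "doda-browser-api",
    "v2.3.2",
    datetime(2026, 6, 23, 12, 0, tzinfo=timezone.utc),  # hypothetical deploy time
)
print(annotation["tags"], annotation["text"])
```

Tagging by service lets a dashboard filter annotations so that only relevant deployments appear next to a given service's metrics.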

Opsgenie Features

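Opsgenie, Atlassian's incident management product, is an alternative to PagerDuty with similar capabilities. Atlassian has announced that it is retiring Opsgenie as a standalone product, so confirm its current availability and support timeline before adopting it. Key features include: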

Feature	Opsgenie
On-call schedules	Flexible schedules with overrides
Alert deduplication	Automatic grouping of related alerts
Integration ecosystem	200+ integrations
Reporting	Built-in incident analytics
Pricing	Per-user pricing model

Incident.io Features

Incident.io is a newer platform that focuses on the full incident lifecycle, not just alert routing. Key differentiators:

Built-in postmortems: Postmortem creation is integrated into the incident resolution flow.
Status pages: Internal and external status pages are built in.
Incident timeline: Automatic timeline generation from Slack messages and tool integrations.
Severity-based workflows: Different workflows trigger based on incident severity.
Slack-native: Deep Slack integration for incident command and coordination.
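
Automatic timeline generation amounts to merging timestamped events from several sources into one chronological record. The sketch below shows that idea in its simplest form; it is not how Incident.io implements the feature, and the `build_timeline` helper and all sample events are invented for illustration.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge event lists from several sources into one chronological list."""
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda e: e["at"])


# Hypothetical events from chat and from an alerting tool.
chat_events = [
    {"at": datetime(2026, 6, 23, 12, 5), "source": "slack", "text": "alice: looking at CPU graphs"},
    {"at": datetime(2026, 6, 23, 12, 20), "source": "slack", "text": "alice: added auto-scaling policy"},
]
tool_events = [
    {"at": datetime(2026, 6, 23, 12, 0), "source": "pagerduty", "text": "Incident triggered"},
    {"at": datetime(2026, 6, 23, 12, 3), "source": "pagerduty", "text": "Acknowledged by alice"},
]

timeline = build_timeline(chat_events, tool_events)
for e in timeline:
    print(e["at"].strftime("%H:%M"), f"[{e['source']}]", e["text"])
```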

Choosing the Right Incident Management Platform

The best platform depends on your team size, existing toolchain, and budget. Consider these factors:

Team size: PagerDuty and Opsgenie work well for teams of any size. Incident.io is particularly strong for teams that want a unified incident lifecycle.
Existing tools: If you already use Slack heavily, Incident.io has the deepest Slack integration. If you use Jira, PagerDuty has a mature Jira integration.
Budget: Opsgenie is often more affordable for smaller teams. PagerDuty pricing scales with features.
Postmortem needs: If you want built-in postmortems, Incident.io has the strongest offering.

Open Source Alternatives

Not every team needs commercial tools. Open source alternatives exist for most SRE tool categories:

Category	Commercial	Open Source
Monitoring	Datadog, New Relic	Prometheus + Grafana
Incident management	PagerDuty, Opsgenie	Cabot, Uptime Kuma (monitoring and alerting only)
On-call scheduling	PagerDuty	Grafana OnCall, Zabbix (escalation rules only)
Log management	Splunk, Datadog	Loki, ELK Stack
Runbooks	PagerDuty Runbook Automation	Rundeck, StackStorm

Common Errors

Error	Explanation
Too many tools	Each tool adds integration complexity. Standardize on a minimal set.
No runbook integration	If the alert does not link to a runbook, the responder wastes time searching.
Monitoring without alerting	Collecting metrics without alerting on them misses incidents until users report them.
Ignoring API quality	A tool with a poor API will be harder to automate and integrate.
Not testing the toolchain	The full alert-to-incident-to-runbook pipeline should be tested regularly.
No SLA for the tool itself	If PagerDuty is down, can your team still respond to incidents? Have a backup.
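
The "No runbook integration" error is easy to catch automatically. As a minimal sketch, the check below scans alert rule definitions and reports any rule whose annotations lack a usable runbook URL, so it could run in CI before rules are deployed. The `find_missing_runbooks` helper, the dictionary shape of the rules, and the `HighLatency` rule are assumptions for illustration.

```python
def find_missing_runbooks(rules):
    """Return the names of alert rules without a usable runbook URL."""
    missing = []
    for rule in rules:
        url = rule.get("annotations", {}).get("runbook", "")
        # Treat anything that is not an http(s) URL as missing.
        if not url.startswith(("http://", "https://")):
            missing.append(rule["name"])
    return missing


rules = [
    {
        "name": "HighErrorRate",
        "annotations": {
            "summary": "High HTTP 5xx error rate",
            "runbook": "https://runbooks.dodatech.com/errors",
        },
    },
    # Hypothetical rule that was added without a runbook link.
    {"name": "HighLatency", "annotations": {"summary": "p99 latency above target"}},
]

missing = find_missing_runbooks(rules)
print("Rules missing runbooks:", missing)
```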

Practice Questions

What are the three incident management platforms covered in this tutorial, and how do they differ?
How does the monitoring stack integrate with incident management?
What criteria should you use to evaluate SRE tools?
Why is API quality important for SRE tools?
What is ChatOps and how does it help SRE teams?

Challenge

Design an SRE toolchain for a new DodaTech service. Choose an incident management platform, monitoring stack, automation approach, and collaboration tools. Draw the integration architecture showing how alerts flow from monitoring to incident management to runbook execution. Justify each tool choice based on the evaluation criteria.

FAQ

What is PagerDuty?

PagerDuty is an incident management platform that handles on-call scheduling, alert routing, escalation policies, and incident tracking.

How is Incident.io different from PagerDuty?

Incident.io includes native postmortem and status page features that are add-ons or separate products in PagerDuty. It focuses on the full incident lifecycle from alert to postmortem.

What is the Prometheus stack?

Prometheus collects time-series metrics, Alertmanager handles alert routing, and Grafana provides visualization dashboards.

Do I need all these tools?

Start with monitoring (Prometheus + Grafana), an incident management platform (PagerDuty or similar), and runbooks. Add tools as the team grows.

What is ChatOps?

ChatOps is the practice of running operational commands through chat platforms like Slack, enabling teams to manage incidents without switching tools.
