Site Reliability Engineering Tools — PagerDuty, Opsgenie, Incident.io
This tutorial surveys Site Reliability Engineering (SRE) tools, with a focus on the incident management platforms PagerDuty, Opsgenie, and Incident.io.
The SRE tool ecosystem spans incident management, monitoring, automation, collaboration, and observability. Each category solves a specific operational problem, and the best SRE teams integrate these tools into a cohesive operational platform.
What You'll Learn
You will learn the key categories of SRE tools, how to evaluate and choose an incident management platform (PagerDuty, Opsgenie, Incident.io), how to integrate monitoring with incident response, and how to build an SRE toolchain that reduces toil and improves mean time to recovery (MTTR).
Why It Matters
The right tools can reduce MTTR by 40-60 percent and cut toil by automating routine tasks. The wrong tools create tool sprawl, alert fatigue, and integration debt. SRE teams must evaluate tools based on reliability, integration quality, and how well they fit the team's workflow.
Real-World Use
DodaTech uses PagerDuty for incident alerting and on-call management, Prometheus and Grafana for monitoring, OpenTelemetry for distributed tracing, and custom automation scripts for common operational tasks. The toolchain is integrated so that an alert in Prometheus automatically creates a PagerDuty incident with a link to the relevant runbook.
graph TD
A[Prometheus] -->|Alert| B[Alertmanager]
B -->|SEV1/SEV2| C[PagerDuty]
B -->|SEV3| D[Slack]
C -->|Create Incident| E[Incident Response]
E -->|Link Runbook| F[Runbook Repo]
E -->|Postmortem| G[Postmortem Doc]
G -->|Action Items| H[Jira]
H -->|Track Fix| I[Deployment Pipeline]
I -->|Deploy| J[Production]
J --> A
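The severity split in the diagram (SEV1/SEV2 page through PagerDuty, SEV3 goes to Slack) can be sketched as a small routing function. This is only an illustration: in a real setup the rule would live in Alertmanager's routing configuration, and the receiver names `pagerduty` and `slack` and the `route_alert` helper are hypothetical.

```python
# Severities that should page a human; taken from the diagram above.
PAGE_SEVERITIES = {"SEV1", "SEV2"}


def route_alert(alert: dict) -> str:
    """Return the (hypothetical) receiver name for an alert based on its severity label."""
    severity = alert.get("labels", {}).get("severity", "")
    if severity in PAGE_SEVERITIES:
        return "pagerduty"
    # SEV3 and anything unlabelled goes to Slack in this sketch so nothing is dropped silently.
    return "slack"


print(route_alert({"labels": {"severity": "SEV1"}}))
print(route_alert({"labels": {"severity": "SEV3"}}))
```

Sending unknown severities to Slack is a design choice of this sketch; a stricter team might page on anything unlabelled instead.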
Prerequisites
Understanding Incident Response helps you evaluate incident management tools. Familiarity with Monitoring and Alerting for SRE gives context for the monitoring-to-alerting integration.
Incident Management Platforms
| Feature | PagerDuty | Opsgenie | Incident.io |
|---|---|---|---|
| On-call scheduling | Yes | Yes | Yes |
| Alert routing | Yes | Yes | Yes |
| Runbook integration | Yes | Yes | Native |
| Postmortem built-in | Add-on | Add-on | Yes |
| Status page | Separate product | Separate product | Built-in |
| SLA tracking | Yes | Yes | Yes |
PagerDuty Integration Example
class IncidentManager:
    """In-memory simulation of an incident platform; it makes no API calls."""

    def __init__(self, platform):
        self.platform = platform
        self.incidents = []

    def create_incident(self, title, severity, service, runbook_url):
        # Store the runbook link on the incident so responders get guidance immediately.
        incident = {
            "id": len(self.incidents) + 1,
            "title": title,
            "severity": severity,
            "service": service,
            "runbook": runbook_url,
            "status": "triggered",
            "platform": self.platform,
        }
        self.incidents.append(incident)
        print(f"[{self.platform}] Incident #{incident['id']}: {title}")
        print(f" Severity: {severity}")
        print(f" Service: {service}")
        print(f" Runbook: {runbook_url}")
        return incident

    def _find(self, incident_id):
        # Raise a clear error instead of StopIteration when the ID is unknown.
        for inc in self.incidents:
            if inc["id"] == incident_id:
                return inc
        raise KeyError(f"No incident with id {incident_id}")

    def acknowledge(self, incident_id, responder):
        inc = self._find(incident_id)
        inc["status"] = "acknowledged"
        inc["responder"] = responder
        print(f"[{self.platform}] Incident #{incident_id} acknowledged by {responder}")

    def resolve(self, incident_id, resolution_notes):
        inc = self._find(incident_id)
        inc["status"] = "resolved"
        inc["resolution"] = resolution_notes
        print(f"[{self.platform}] Incident #{incident_id} resolved: {resolution_notes}")


# Walk one incident through its lifecycle: triggered -> acknowledged -> resolved.
pd = IncidentManager("PagerDuty")
inc = pd.create_incident(
    "High CPU on Doda Browser API servers",
    "SEV2",
    "doda-browser-api",
    "https://runbooks.dodatech.com/high-cpu",
)
pd.acknowledge(inc["id"], "alice@dodatech.com")
pd.resolve(inc["id"], "Added auto-scaling policy")
Expected output:
[PagerDuty] Incident #1: High CPU on Doda Browser API servers
 Severity: SEV2
 Service: doda-browser-api
 Runbook: https://runbooks.dodatech.com/high-cpu
[PagerDuty] Incident #1 acknowledged by alice@dodatech.com
[PagerDuty] Incident #1 resolved: Added auto-scaling policy
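The class above only simulates incidents in memory. As a rough sketch of what a real trigger could look like, the following builds the JSON body for a PagerDuty Events API v2 "trigger" event without sending it. The `YOUR_INTEGRATION_KEY` placeholder, the `SEVERITY_MAP` from the tutorial's SEV levels to API severities, and the `dedup_key` format are assumptions made for illustration.

```python
import json

# Assumed mapping from SEV levels to the severities Events API v2 accepts
# (critical, error, warning, info).
SEVERITY_MAP = {"SEV1": "critical", "SEV2": "error", "SEV3": "warning"}


def build_trigger_event(routing_key, title, severity, service, runbook_url):
    """Build the body of a 'trigger' event; this function does not send anything."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Events that share a dedup_key are treated as the same alert.
        "dedup_key": f"{service}:{title}",
        "payload": {
            "summary": title,
            "source": service,
            "severity": SEVERITY_MAP.get(severity, "info"),
        },
        # Attach the runbook so the responder sees it on the incident.
        "links": [{"href": runbook_url, "text": "Runbook"}],
    }


event = build_trigger_event(
    "YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    "High CPU on Doda Browser API servers",
    "SEV2",
    "doda-browser-api",
    "https://runbooks.dodatech.com/high-cpu",
)
print(json.dumps(event, indent=2))
```

Sending the event would be an HTTP POST of this body to `https://events.pagerduty.com/v2/enqueue`. Check the current PagerDuty documentation before relying on the exact field names.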
Monitoring and Observability Stack
| Layer | Tool | Purpose |
|---|---|---|
| Metrics collection | Prometheus | Scrape and store time-series metrics |
| Visualization | Grafana | Dashboards and alerting |
| Logging | Loki / ELK | Log aggregation and search |
| Tracing | Jaeger / OpenTelemetry | Distributed tracing |
| Alerting | Alertmanager | Route and deduplicate alerts |
Prometheus-Style Alerting
class PrometheusAlert:
    """Toy model of an alerting rule; it does not query Prometheus."""

    def __init__(self, alert_name, expr, severity, annotations):
        self.name = alert_name
        self.expr = expr
        self.severity = severity
        self.annotations = annotations

    def evaluate(self, metric_value):
        print(f"Evaluating: {self.expr}")
        # Take the number after the last ">" as the threshold.
        # This only works for simple "<query> > <number>" expressions.
        threshold = float(self.expr.rsplit(">", 1)[1].strip())
        if metric_value > threshold:
            print(f"FIRING: {self.name} ({self.severity})")
            for k, v in self.annotations.items():
                print(f" {k}: {v}")
            return True
        return False


alert = PrometheusAlert(
    "HighErrorRate",
    "rate(http_requests_total{status=~'5..'}[5m]) > 0.01",
    "critical",
    {"summary": "High HTTP 5xx error rate", "runbook": "https://runbooks.dodatech.com/errors"},
)
# 0.02 exceeds the 0.01 threshold, so the alert fires.
alert.evaluate(0.02)
Expected output:
Evaluating: rate(http_requests_total{status=~'5..'}[5m]) > 0.01
FIRING: HighErrorRate (critical)
 summary: High HTTP 5xx error rate
 runbook: https://runbooks.dodatech.com/errors
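The example above fires as soon as the value crosses the threshold. Prometheus alerting rules can also carry a `for` duration, so an alert stays pending until the condition has held long enough. Below is a minimal sketch of that behaviour, assuming timestamps are passed in as plain seconds; the `PendingAlert` class is hypothetical and is not how Prometheus is implemented.

```python
class PendingAlert:
    """Sketch of the inactive -> pending -> firing states of an alert with a 'for' duration."""

    def __init__(self, name, threshold, for_seconds):
        self.name = name
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.pending_since = None  # time the condition first became true

    def evaluate(self, value, now):
        if value <= self.threshold:
            # Condition cleared: reset the timer.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"


alert = PendingAlert("HighErrorRate", threshold=0.01, for_seconds=300)
states = [
    alert.evaluate(0.02, now=0),
    alert.evaluate(0.03, now=150),
    alert.evaluate(0.02, now=300),
    alert.evaluate(0.005, now=360),
]
print(states)
```

Requiring the condition to hold for a while filters out short spikes, which is one way to reduce alert fatigue.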
Automation Tools
SRE teams use automation to reduce toil. Common automation approaches include:
| Tool | Purpose | Example Use Case |
|---|---|---|
| Terraform | Infrastructure provisioning | Create cloud resources |
| Ansible | Configuration management | Apply OS patches |
| Custom scripts | Ad hoc automation | Certificate renewal |
| ChatOps bots | Slack-based operations | Run commands from chat |
ChatOps Bot for Common Tasks
class ChatOpsBot:
    """Toy bot that echoes canned replies; it performs no real operations."""

    def __init__(self, channel):
        self.channel = channel

    def handle_command(self, command, user):
        # Echo the user's message, then print the bot's reply.
        print(f"[{self.channel}] {user}: {command}")
        if command == "status":
            print(f"[{self.channel}] Bot: All services healthy")
        elif command == "deploy":
            print(f"[{self.channel}] Bot: Deploying v2.3.2 to staging...")
        elif command == "restart sync":
            print(f"[{self.channel}] Bot: Restarting sync-service... Done!")
        else:
            # List the commands exactly as the bot matches them.
            print(f"[{self.channel}] Bot: Unknown command. Available: status, deploy, restart sync")


bot = ChatOpsBot("#sre-operations")
bot.handle_command("status", "alice")
bot.handle_command("restart sync", "bob")
Expected output:
[#sre-operations] alice: status
[#sre-operations] Bot: All services healthy
[#sre-operations] bob: restart sync
[#sre-operations] Bot: Restarting sync-service... Done!
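A chain of `if`/`elif` branches gets hard to maintain as commands grow, and it has no notion of who may run what. One possible alternative is a dispatch table with an optional per-command allowlist. The `CommandRouter` class and the user names here are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional, Set


class CommandRouter:
    """Map command strings to handler functions, with an optional allowlist per command."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[], str]] = {}
        self._allowed: Dict[str, Set[str]] = {}

    def register(self, command: str, handler: Callable[[], str],
                 allowed_users: Optional[Iterable[str]] = None) -> None:
        self._handlers[command] = handler
        if allowed_users is not None:
            self._allowed[command] = set(allowed_users)

    def handle(self, command: str, user: str) -> str:
        handler = self._handlers.get(command)
        if handler is None:
            # Build the help text from the registered commands so it cannot drift.
            available = ", ".join(sorted(self._handlers))
            return f"Unknown command. Available: {available}"
        allowed = self._allowed.get(command)
        if allowed is not None and user not in allowed:
            return f"{user} is not allowed to run '{command}'"
        return handler()


router = CommandRouter()
router.register("status", lambda: "All services healthy")
router.register("restart sync", lambda: "Restarting sync-service... Done!", allowed_users={"bob"})

print(router.handle("status", "alice"))
print(router.handle("restart sync", "alice"))
print(router.handle("restart sync", "bob"))
```

Because the help text is generated from the registered handlers, the list of available commands always matches what the bot actually accepts.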
Tool Evaluation Criteria
When choosing SRE tools, evaluate against these criteria:
| Criteria | Weight | Questions to Ask |
|---|---|---|
| Reliability | High | Does the tool itself have a good SLA? |
| Integration | High | Does it integrate with our existing stack? |
| API quality | High | Can we automate everything through APIs? |
| On-call support | Medium | Does it handle escalation and rotation? |
| Cost | Medium | Does the pricing scale with our usage? |
| Ease of use | Medium | How steep is the learning curve? |
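One way to apply the table is a weighted score per candidate tool. The sketch below assumes numeric weights (High = 3, Medium = 2) and 1-to-5 ratings. Both the weights and the sample ratings for "Tool A" are invented for illustration and are not measurements of any real product.

```python
# Assumed numeric weights for the table's High / Medium levels.
WEIGHTS = {"High": 3, "Medium": 2}

# Criteria and weight levels copied from the table above.
CRITERIA = {
    "Reliability": "High",
    "Integration": "High",
    "API quality": "High",
    "On-call support": "Medium",
    "Cost": "Medium",
    "Ease of use": "Medium",
}


def weighted_score(ratings: dict) -> float:
    """Return the weighted average of 1-5 ratings, one rating per criterion."""
    total_weight = sum(WEIGHTS[level] for level in CRITERIA.values())
    weighted_sum = sum(WEIGHTS[CRITERIA[name]] * ratings[name] for name in CRITERIA)
    return weighted_sum / total_weight


# Hypothetical ratings for an unnamed candidate.
tool_a = {
    "Reliability": 5,
    "Integration": 4,
    "API quality": 4,
    "On-call support": 5,
    "Cost": 2,
    "Ease of use": 3,
}
print(f"Tool A: {weighted_score(tool_a):.2f}")
```

A score like this helps compare candidates side by side, but it does not replace a trial with the team that will carry the pager.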
Building an SRE Toolchain
An effective SRE toolchain is not just a collection of tools but an integrated system. The key integration points are:
| Integration | What It Does | Benefit |
|---|---|---|
| Monitoring to alerting | Prometheus alerts trigger PagerDuty incidents | Fast notification |
| Alerting to runbooks | PagerDuty incident links to runbook | Immediate guidance |
| Incident to postmortem | PagerDuty incident creates postmortem doc | No data loss |
| Postmortem to project tracking | Action items go to Jira | Tracked to completion |
| Deployment to monitoring | Deployments annotated in Grafana | See deploy impact on metrics |
Opsgenie Features
Opsgenie, an Atlassian product, is an alternative to PagerDuty with similar capabilities. Its main features include:
| Feature | Opsgenie |
|---|---|
| On-call schedules | Flexible schedules with overrides |
| Alert deduplication | Automatic grouping of related alerts |
| Integration ecosystem | 200+ integrations |
| Reporting | Built-in incident analytics |
| Pricing | Per-user pricing model |
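To make the idea of alert deduplication concrete, here is a minimal sketch that groups alerts sharing the same key fields. It is not Opsgenie's actual algorithm; the `group_alerts` helper and the choice of `alertname` and `service` as grouping keys are assumptions.

```python
from collections import defaultdict


def group_alerts(alerts, keys=("alertname", "service")):
    """Group alerts whose values for the given keys are identical."""
    groups = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(alert.get(key) for key in keys)
        groups[fingerprint].append(alert)
    return dict(groups)


# Hypothetical alerts: two hosts report the same problem for the same service.
alerts = [
    {"alertname": "HighCPU", "service": "doda-browser-api", "host": "api-1"},
    {"alertname": "HighCPU", "service": "doda-browser-api", "host": "api-2"},
    {"alertname": "HighErrorRate", "service": "doda-browser-api", "host": "api-1"},
]
grouped = group_alerts(alerts)
for fingerprint, members in grouped.items():
    print(fingerprint, len(members))
```

The responder then gets one notification per group instead of one per host.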
Incident.io Features
Incident.io is a newer platform that focuses on the full incident lifecycle, not just alert routing. Key differentiators:
- Built-in postmortems: Postmortem creation is integrated into the incident resolution flow.
- Status pages: Internal and external status pages are built in.
- Incident timeline: Automatic timeline generation from Slack messages and tool integrations.
- Severity-based workflows: Different workflows trigger based on incident severity.
- Slack-native: Deep Slack integration for incident command and coordination.
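Automatic timeline generation can be pictured as merging timestamped events from several sources into chronological order. The sketch below illustrates that idea only. It does not use Incident.io's API, and the event shape (`at`, `source`, `text`) and the sample events are hypothetical.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge event lists from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event["at"])


# Hypothetical chat messages and tool events from one incident.
slack_messages = [
    {"at": datetime(2024, 1, 1, 10, 5), "source": "slack", "text": "alice: rolling back deploy"},
    {"at": datetime(2024, 1, 1, 10, 1), "source": "slack", "text": "alice: investigating"},
]
tool_events = [
    {"at": datetime(2024, 1, 1, 10, 0), "source": "pager", "text": "incident triggered"},
    {"at": datetime(2024, 1, 1, 10, 12), "source": "pager", "text": "incident resolved"},
]

timeline = build_timeline(slack_messages, tool_events)
for event in timeline:
    print(f"{event['at']:%H:%M} [{event['source']}] {event['text']}")
```

A timeline assembled this way gives the postmortem a starting point without anyone copying timestamps by hand.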
Choosing the Right Incident Management Platform
The best platform depends on your team size, existing toolchain, and budget. Consider these factors:
- Team size: PagerDuty and Opsgenie work well for teams of any size. Incident.io is particularly strong for teams that want a unified incident lifecycle.
- Existing tools: If you already use Slack heavily, Incident.io has the deepest Slack integration. If you use Jira, PagerDuty has mature integration.
- Budget: Opsgenie is often more affordable for smaller teams. PagerDuty pricing scales with features.
- Postmortem needs: If you want built-in postmortems, Incident.io has the strongest offering.
Open Source Alternatives
Not every team needs commercial tools. Open source alternatives exist for most SRE tool categories:
| Category | Commercial | Open Source |
|---|---|---|
| Monitoring | Datadog, New Relic | Prometheus + Grafana |
| Incident management | PagerDuty, Opsgenie | Cabot, Uptime Kuma |
| On-call scheduling | PagerDuty | Grafana OnCall, Zabbix |
| Log management | Splunk, Datadog | Loki, ELK Stack |
| Runbooks | PagerDuty Runbook Automation | Rundeck, StackStorm |
Common Errors
| Error | Explanation |
|---|---|
| Too many tools | Each tool adds integration complexity. Standardize on a minimal set. |
| No runbook integration | If the alert does not link to a runbook, the responder wastes time searching. |
| Monitoring without alerting | Collecting metrics without alerting on them misses incidents until users report them. |
| Ignoring API quality | A tool with a poor API will be harder to automate and integrate. |
| Not testing the toolchain | The full alert-to-incident-to-runbook pipeline should be tested regularly. |
| No SLA for the tool itself | If PagerDuty is down, can your team still respond to incidents? Have a backup. |
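The last row asks what happens when the paging tool itself is down. One possible backup pattern is to try notification channels in order and fall back when one fails. The sketch below uses stand-in functions instead of real integrations; `notify_with_fallback`, `broken_pager`, and `sms_gateway` are hypothetical names.

```python
def notify_with_fallback(message, channels):
    """Try each (name, send) pair in order; return the name of the first that succeeds."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # any failure moves on to the next channel
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all notification channels failed: " + "; ".join(errors))


# Stand-ins for real integrations: the primary fails, the backup records the message.
def broken_pager(message):
    raise ConnectionError("pager unreachable")


sent = []


def sms_gateway(message):
    sent.append(message)


used = notify_with_fallback(
    "SEV1: API down",
    [("pager", broken_pager), ("sms", sms_gateway)],
)
print(used)
```

Running a drill like this on a schedule also covers the "Not testing the toolchain" row, since it exercises the backup path before a real outage does.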
Practice Questions
- What are the three major incident management platforms for SRE?
- How does the monitoring stack integrate with incident management?
- What criteria should you use to evaluate SRE tools?
- Why is API quality important for SRE tools?
- What is ChatOps and how does it help SRE teams?
Challenge
Design an SRE toolchain for a new DodaTech service. Choose an incident management platform, monitoring stack, automation approach, and collaboration tools. Draw the integration architecture showing how alerts flow from monitoring to incident management to runbook execution. Justify each tool choice based on the evaluation criteria.