SLO, SLI, Error Budget — Complete API Reliability Guide
This tutorial covers SLOs, SLIs, and error budgets: the key concepts, a working example, and best practices for applying them to an API.
Service Level Indicators (SLIs) measure how your API actually performs, for example the fraction of requests that succeed. Service Level Objectives (SLOs) set the target an SLI must meet over a time window. The error budget is the unreliability the SLO permits: one minus the SLO target, so a 99.9% SLO leaves a 0.1% budget.
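To make the three terms concrete, here is a minimal sketch that computes an availability SLI from request counts, compares it with an SLO target, and derives the error budget in requests. The function name `summarize` and the sample counts are illustrative assumptions, not part of any library.

```python
def summarize(total_requests: int, failed_requests: int, slo_target: float) -> dict:
    """Relate SLI, SLO, and error budget for one measurement window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # SLI: the measured fraction of good requests.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_used": failed_requests / allowed_failures,
    }


# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
report = summarize(1_000_000, 400, 0.999)
print(report)
```

In this made-up window the SLO tolerates about 1,000 failed requests, so 400 failures use roughly 40% of the budget.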
What You'll Learn
You'll learn how to define SLIs, set SLOs, and use error budgets to balance reliability and feature development.
Why It Matters
SLOs align engineering teams on reliability goals. Error budgets provide a data-driven framework for deciding when to prioritize reliability over features.
Real-World Use
An API has a 99.9% uptime SLO, which allows about 8.76 hours of downtime per year. When less than 50% of the error budget remains, the team halts feature deploys and focuses on reliability improvements.
flowchart LR
A[Define SLIs] --> B[Set SLO Targets]
B --> C[Measure SLIs]
C --> D{Within Budget?}
D -->|Yes| E[Continue Feature Work]
D -->|No| F[Freeze Features]
F --> G[Reliability Work]
G --> C
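The policy described above can be sketched in a few lines. This is a simplified illustration, assuming downtime is tracked in hours and that the freeze threshold is the 50% figure from the example; the function names `remaining_fraction` and `deploy_decision` are hypothetical.

```python
def remaining_fraction(slo_target: float, period_hours: float, downtime_hours: float) -> float:
    """Share of the error budget still unspent (negative if overspent)."""
    allowed_hours = (1 - slo_target) * period_hours
    return 1 - downtime_hours / allowed_hours


def deploy_decision(remaining: float, freeze_threshold: float = 0.5) -> str:
    """Map the remaining budget fraction to the action in the flowchart."""
    if remaining < freeze_threshold:
        return "freeze features, do reliability work"
    return "continue feature work"


# A 99.9% SLO over a 365-day year allows about 8.76 hours of downtime.
year_hours = 365 * 24
for downtime in (2.0, 5.0):
    remaining = remaining_fraction(0.999, year_hours, downtime)
    print(f"{downtime} h down: {remaining:.1%} left, {deploy_decision(remaining)}")
```

Keeping the threshold as a parameter makes the policy explicit and easy to revisit when the team renegotiates it.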
Implementation
# SLO calculation
import time
from collections import deque


class SLOMonitor:
    """Track request outcomes over a rolling window and report error budget."""

    def __init__(self, slo_target=0.999, window_seconds=30 * 24 * 3600):
        self.slo_target = slo_target
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, success) pairs, oldest first

    def record_request(self, success):
        now = time.time()
        self.events.append((now, success))
        # Drop events that have aged out of the rolling window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def current_availability(self):
        if not self.events:
            return 1.0  # no traffic yet, so nothing has failed
        successes = sum(1 for _, s in self.events if s)
        return successes / len(self.events)

    def error_budget_remaining(self):
        # The allowed error rate is (1 - slo_target). The remaining budget is
        # the unused share of it; a negative value means the SLO is violated.
        allowed_error_rate = 1 - self.slo_target
        actual_error_rate = 1 - self.current_availability()
        return (allowed_error_rate - actual_error_rate) / allowed_error_rate


slo = SLOMonitor(slo_target=0.999)
for _ in range(100000):
    slo.record_request(success=True)
for _ in range(100):
    slo.record_request(success=False)
print(f"Availability: {slo.current_availability():.5%}")
print(f"Budget remaining: {slo.error_budget_remaining():.1%}")
Example SLOs
| SLI | SLO | Error Budget / Month (30 days) |
|---|---|---|
| API uptime | 99.9% | 43 minutes |
| API uptime | 99.99% | 4.3 minutes |
| Latency p99 < 500ms | 99.5% | 3.6 hours |
| Error rate < 1% | 99% | 7.2 hours |
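The time budgets in a table like this follow directly from the SLO target and the window length. The sketch below shows the arithmetic, assuming a 30-day window; the helper name `error_budget` is illustrative.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Time the SLI may be out of target within the window."""
    return window * (1 - slo_target)


for target in (0.999, 0.9999, 0.995, 0.99):
    budget = error_budget(target)
    print(f"{target:.2%} target: {budget.total_seconds() / 60:.1f} minutes per 30 days")
```

The result depends on the window length: 99% over 30 days is 7.2 hours, while over an average calendar month (about 30.4 days) it is closer to 7.3 hours.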
Common Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| 100% uptime SLO | Impossible; leads to burnout | Set realistic SLOs (99.9% or 99.99%) |
| No error budget policy | Budget meaningless without action | Define what happens when the budget is exhausted |
| Too many SLOs | No focus | Start with 3-5 key SLOs |
| SLO without SLI measurement | Cannot track progress | Implement SLI measurement first |
| Different SLO per environment | Inconsistent expectations | Same SLO targets for all environments |
Practice Questions
- What is the difference between SLI and SLO?
- How does an error budget work?
- What is a realistic SLO for a public API?
- What happens when the error budget is exhausted?
- How do you measure latency SLI?
Challenge
Implement SLO monitoring for an API: track uptime (99.9%), p99 latency (<500ms, 99.5%), and error rate (<1%, 99%). Display budget remaining on a dashboard.
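As a possible starting point for the latency part, one common way to express a latency SLI is the fraction of requests that finish faster than a threshold. The sketch below takes that request-based approach, which is an assumption and differs from computing a p99 per time window; the function name `latency_sli` and the sample latencies are made up for illustration.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float = 500.0) -> float:
    """Fraction of requests that completed faster than threshold_ms."""
    if not latencies_ms:
        return 1.0  # no traffic: treat as compliant, like current_availability()
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


# Hypothetical request latencies in milliseconds.
samples = [120.0, 340.0, 95.0, 610.0, 480.0, 220.0, 505.0, 130.0]
sli = latency_sli(samples)
print(f"Latency SLI: {sli:.1%}, meets 99.5% SLO: {sli >= 0.995}")
```

The same ratio shape (good events divided by total events) works for the uptime and error-rate SLIs, so all three can share one error budget calculation.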
What's Next
Learn about anomaly detection for API monitoring.