Service Level Objectives: SLIs and Error Budgets Explained
This tutorial explains Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets through key concepts, practical examples, and best practices.
What You Will Learn
This tutorial teaches you how to define meaningful Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs), manage error budgets, and use SLOs to drive engineering decisions.
Why It Matters
Uptime is not enough. A service that is up but responding slowly is failing your users. SLOs give you a data-driven way to define what "good enough" means and a framework for deciding when to prioritize reliability over features.
Real-World Use
The DodaTech sync service has an availability SLO of 99.9% and a latency SLO of p99 < 500ms. When the remaining error budget dropped to 10%, the team paused feature development for two weeks and focused entirely on reliability improvements. The error budget was restored, and the SLO was met for the next quarter.
An SLI is a quantifiable measure of service performance (like request latency or error rate). An SLO is the target for that SLI over a measurement window (like "99.9% of requests succeed over 30 days" or "p99 latency below 500ms"). The error budget is the acceptable amount of failure: 100% - SLO target, so a 99.9% SLO leaves a 0.1% budget. Site Reliability Engineering uses these concepts to balance reliability with velocity.
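To make the three terms concrete, here is a minimal Python sketch. The request counts are made up for illustration, and treating a window with no traffic as fully available is an assumption, not a rule from this tutorial.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Return the measured availability SLI as a percentage."""
    if total_requests == 0:
        return 100.0  # assumption: no traffic counts as fully available
    return 100.0 * successful_requests / total_requests


SLO_TARGET = 99.9                  # the target, in percent
ERROR_BUDGET = 100.0 - SLO_TARGET  # percent of requests allowed to fail

# Hypothetical measurement: 999,500 of 1,000,000 requests succeeded.
sli = availability_sli(successful_requests=999_500, total_requests=1_000_000)
print(f"SLI: {sli:.2f}%  SLO: {SLO_TARGET}%  met: {sli >= SLO_TARGET}")
```

The SLI is what you measured, the SLO is what you promised, and the error budget is the gap the promise leaves open.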
Prerequisites
- A service with observable metrics (see Prometheus Introduction)
- Basic understanding of percentiles and statistics
- Familiarity with Grafana Dashboards for visualization
Step-by-Step Tutorial
Step 1: Identify Your SLIs
Good SLIs measure what matters to users. Common SLIs include:
| SLI | Definition | Measurement |
|---|---|---|
| Availability | Fraction of requests that succeed | successful / total requests |
| Latency | Time to respond | p50, p95, p99 percentiles |
| Throughput | Requests per second | rate(http_requests_total[5m]) |
| Freshness | How recent the data is | Age of latest data point |
| Correctness | Fraction of correct responses | Application-specific |
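As a rough illustration of the availability and latency rows, the sketch below computes both SLIs from a small in-memory request log. The log contents are hypothetical, and the nearest-rank percentile is just one of several common percentile definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request log: (HTTP status, latency in seconds)
requests_log = [
    (200, 0.12), (200, 0.08), (500, 0.95), (200, 0.20), (200, 0.15),
    (200, 0.11), (200, 0.30), (200, 0.09), (200, 0.18), (200, 0.10),
]

latencies = [latency for _, latency in requests_log]
successes = sum(1 for status, _ in requests_log if 200 <= status < 300)

availability = successes / len(requests_log)  # successful / total requests
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

print(f"availability={availability:.1%} p50={p50}s p99={p99}s")
```

With only ten samples the p99 is simply the slowest request, which is why percentile SLIs need a reasonable amount of traffic to be meaningful.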
Step 2: Define SLIs in PromQL
# Availability SLI
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Latency SLI (p99)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Freshness SLI
time() - max(metric_last_updated_timestamp)
Step 3: Choose SLO Targets
Use the "rule of 3" to set SLO targets:
# Example SLO targets for a web service
availability:
  target: 99.9%
  measurement_window: 30d
latency_p99:
  target: 500ms
  measurement_window: 30d
latency_p95:
  target: 200ms
  measurement_window: 30d
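One way to use targets like these is to compare them against measured SLIs. The sketch below is an assumption about how that check might look: the `SLO_TARGETS` structure, the `slo_report` helper, and the measured values are all hypothetical, and latency targets are treated as "at or below" limits.

```python
# Targets mirror the example above; the measured values are made up for illustration.
SLO_TARGETS = {
    "availability": {"target": 0.999, "higher_is_better": True},
    "latency_p99": {"target": 0.500, "higher_is_better": False},  # seconds
    "latency_p95": {"target": 0.200, "higher_is_better": False},  # seconds
}


def slo_report(measured: dict) -> dict:
    """Return {slo_name: True/False} for every target that has a measurement."""
    report = {}
    for name, spec in SLO_TARGETS.items():
        if name not in measured:
            continue  # nothing measured for this SLO, so no verdict
        value = measured[name]
        if spec["higher_is_better"]:
            report[name] = value >= spec["target"]
        else:
            report[name] = value <= spec["target"]
    return report


measured = {"availability": 0.9994, "latency_p99": 0.620, "latency_p95": 0.180}
print(slo_report(measured))
```

In this made-up measurement the availability and p95 targets are met while the p99 target is missed, which shows why each SLO needs its own verdict.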
Step 4: Calculate the Error Budget
The error budget is the maximum allowed failure:
- 99.9% SLO = 0.1% error budget
- In 30 days (2,592,000 seconds): 2,592 seconds (43 minutes) of allowed downtime
- In 30 days of requests at 1000 req/s: 2,592,000 allowed errors
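def error_budget(slo_percent, total_requests):
    """Return the number of failed requests the SLO allows."""
    allowed_failure_rate = 1 - (slo_percent / 100)
    # round() rather than int(): float error can leave the product just under
    # a whole number (e.g. 9999.999...), which int() would truncate.
    return round(total_requests * allowed_failure_rate)

# Example: 99.9% SLO with 10 million requests
print(error_budget(99.9, 10_000_000))
# Output: 10000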
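The same budget can be expressed as time instead of requests. The following sketch reproduces the downtime arithmetic from the list above; the `allowed_downtime` helper is a hypothetical name introduced here.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Return how much full downtime the SLO tolerates within the window."""
    failure_fraction = (100 - slo_percent) / 100
    return timedelta(seconds=window.total_seconds() * failure_fraction)


window = timedelta(days=30)
for slo in (99.0, 99.9, 99.99):
    minutes = allowed_downtime(slo, window).total_seconds() / 60
    print(f"{slo}% over 30 days -> {minutes:.1f} minutes of downtime")
```

Each additional nine cuts the allowance by a factor of ten, which is a useful check before committing to a tighter target.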
Step 5: Track SLO Burn Rate
Burn rate is how fast you are consuming the error budget:
# Burn rate over 1 hour
(
sum(rate(http_requests_total{status!~"2.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) / (1 - 0.999) # 0.999 = 99.9% SLO
A burn rate of 1 means you will exactly exhaust the budget over the window. A burn rate of 2 means you will exhaust it in half the window.
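The relationship between burn rate and time to exhaustion can be sketched in a few lines of Python. The function names are hypothetical, and the calculation assumes a 30-day window, a full budget at the start, and a constant burn rate.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1 - slo)


def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_hours / rate


# Hypothetical reading: 0.2% of requests are failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.002, slo=0.999)
print(f"burn rate {rate:.1f}, budget gone in {hours_to_exhaustion(rate):.0f} hours")
```

A 0.2% error ratio is twice the 0.1% allowance, so the budget lasts 15 days instead of 30.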
Step 6: Create Multi-Window, Multi-Burn-Rate Alerts
groups:
  - name: slo_alerts
    rules:
      - alert: SLOViolationCritical
        # Error ratio above 14.4x the 0.1% budget rate, measured over 1 hour
        expr: |
          (
            sum(rate(http_requests_total{status!~"2.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 14.4 * 0.001
        for: 5m
        labels:
          severity: critical
          slo: availability_99.9
        annotations:
          summary: "SLO burn rate critical: 14.4x over 1 hour"
          description: "Error budget exhausted in ~2 days at current rate"
      - alert: SLOViolationWarning
        # Error ratio above 6x the 0.1% budget rate, measured over 6 hours
        expr: |
          (
            sum(rate(http_requests_total{status!~"2.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > 6 * 0.001
        for: 15m
        labels:
          severity: warning
          slo: availability_99.9
        annotations:
          summary: "SLO burn rate warning: 6x over 6 hours"
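The thresholds 14.4 and 6 are not arbitrary. Under the common convention of paging when 2% of a 30-day budget is spent in 1 hour, and warning when 5% is spent in 6 hours, they fall out of a simple ratio. The sketch below shows that arithmetic, plus one way to require a short window to agree with the long one before alerting. The function names and the idea of passing both error ratios in directly are assumptions for illustration, not part of the rules above.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_hours: float = 30 * 24) -> float:
    """Burn rate that spends budget_fraction of the budget within alert_window_hours."""
    return budget_fraction * slo_window_hours / alert_window_hours


def should_alert(long_window_error_ratio: float, short_window_error_ratio: float,
                 threshold: float, slo: float = 0.999) -> bool:
    """Fire only when both the long and the short window exceed the threshold.

    The short window confirms the problem is still happening, so the alert
    stops firing soon after the errors stop.
    """
    limit = threshold * (1 - slo)
    return long_window_error_ratio > limit and short_window_error_ratio > limit


print(burn_rate_threshold(0.02, 1))  # about 14.4
print(burn_rate_threshold(0.05, 6))  # about 6.0
print(should_alert(0.02, 0.02, threshold=14.4))  # both windows burning
print(should_alert(0.02, 0.0, threshold=14.4))   # short window has recovered
```

In PromQL, the same idea is usually written by joining a long-window and a short-window expression with `and`.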
Step 7: Visualize SLOs in Grafana
Create an SLO dashboard with:
- SLO Status Panel: A Stat panel showing current SLO Compliance (green/red)
- Error Budget Panel: A Gauge showing remaining error budget percentage
- Burn Rate Panel: Time series showing burn rate over 1h, 6h, and 30d windows
- SLI Trend Panel: Time series of the actual SLI value vs the SLO target line
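The value behind the Error Budget Panel is a single percentage. A minimal sketch of that calculation follows, assuming request-based budgeting; the function name and the sample counts are hypothetical.

```python
def budget_remaining_percent(total_requests: int, failed_requests: int,
                             slo_percent: float) -> float:
    """Percentage of the error budget still unspent (negative once overspent)."""
    allowed_failures = total_requests * (100 - slo_percent) / 100
    if allowed_failures == 0:
        # No budget at all (100% SLO or no traffic): any failure overspends it.
        return 0.0 if failed_requests else 100.0
    return 100.0 * (1 - failed_requests / allowed_failures)


# Hypothetical window: 10 million requests, 2,500 failures, 99.9% SLO.
print(f"{budget_remaining_percent(10_000_000, 2_500, 99.9):.1f}% of budget left")
```

Letting the value go negative rather than clamping it at zero makes it visible on the dashboard how far past the budget the service has gone.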
Step 8: Implement SLO-Based Decision Making
def can_deploy(error_budget_remaining_percent):
    if error_budget_remaining_percent > 50:
        return True   # Full deploy velocity
    elif error_budget_remaining_percent > 20:
        return True   # But require monitoring review
    elif error_budget_remaining_percent > 0:
        return False  # Freeze feature deploys, focus on reliability
    else:
        return False  # Incident mode, all hands on deck
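Because `can_deploy` returns only a boolean, the difference between "full velocity" and "deploy with review" lives only in the comments. One possible variant, sketched here with the same thresholds, returns a named policy instead. The `DeployPolicy` enum and `deploy_policy` function are hypothetical names.

```python
from enum import Enum


class DeployPolicy(Enum):
    NORMAL = "full deploy velocity"
    REVIEW = "deploy, but require monitoring review"
    FREEZE = "freeze feature deploys, focus on reliability"
    INCIDENT = "incident mode, all hands on deck"


def deploy_policy(error_budget_remaining_percent: float) -> DeployPolicy:
    """Map remaining error budget to a policy, using the thresholds from can_deploy."""
    if error_budget_remaining_percent > 50:
        return DeployPolicy.NORMAL
    if error_budget_remaining_percent > 20:
        return DeployPolicy.REVIEW
    if error_budget_remaining_percent > 0:
        return DeployPolicy.FREEZE
    return DeployPolicy.INCIDENT


for remaining in (75, 35, 10, -5):
    print(f"{remaining}% left: {deploy_policy(remaining).value}")
```

A deployment pipeline could then block on `FREEZE` and `INCIDENT` while attaching a review checklist on `REVIEW`.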
Learning Path
flowchart LR
A[Define SLIs] --> B[Set SLO Targets]
B --> C[Calculate Error Budget]
C --> D[Track Burn Rate]
D --> E[Multi-Window Alerts]
E --> F[Visualize in Grafana]
F --> G[Decision Framework]
G --> H[Improve Reliability]
style A fill:#4a90d9,color:#fff
style G fill:#e67e22,color:#fff
Common Errors
SLI does not reflect user experience -- The SLI measures infrastructure metrics (like CPU) instead of user-facing metrics (like request latency).
SLO target is too aggressive -- A 99.99% SLO leaves only 4.3 minutes of downtime per month. Most applications do not need this level unless they are critical infrastructure.
Error budget is exhausted due to a single bad deployment -- The burn rate alert did not trigger because the window was too long. Use shorter windows (1h, 6h) in addition to the 30d window.
Multiple SLOs conflict -- Latency SLO and availability SLO can conflict: retrying failed requests improves availability but increases latency. Balance trade-offs consciously.
SLO dashboard shows 100% Compliance constantly -- The SLI measurement is not granular enough. Measure at the request level, not the minute level.
Team ignores the error budget -- The error budget is not visible in the team's daily workflow. Embed SLO data in Pull Request pages and deployment dashboards.
SLO target is changed to avoid paging -- The team is relaxing targets instead of improving reliability. SLOs should be stable and reviewed quarterly, not changed reactively.
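The granularity problem behind a dashboard that always shows 100% can be demonstrated with made-up numbers. In the sketch below, the traffic shape and the rule that a minute is "good" when at least 99% of its requests succeed are both assumptions chosen for illustration.

```python
# Hypothetical traffic: 60 minutes, 1000 requests per minute, 5 failures in each.
minutes = [(1000, 5)] * 60  # (total requests, failed requests) per minute

MINUTE_GOOD_THRESHOLD = 0.99  # assumed: a minute is "good" if >= 99% of requests succeed
SLO = 0.999

# Minute-level SLI: fraction of minutes classified as good.
good_minutes = sum(
    1 for total, failed in minutes
    if (total - failed) / total >= MINUTE_GOOD_THRESHOLD
)
minute_level_sli = good_minutes / len(minutes)

# Request-level SLI: fraction of individual requests that succeeded.
total_requests = sum(total for total, _ in minutes)
failed_requests = sum(failed for _, failed in minutes)
request_level_sli = (total_requests - failed_requests) / total_requests

print(f"minute-level:  {minute_level_sli:.3%} (meets SLO: {minute_level_sli >= SLO})")
print(f"request-level: {request_level_sli:.3%} (meets SLO: {request_level_sli >= SLO})")
```

Every minute passes the coarse check, so the minute-level view reports full compliance, while the request-level view shows a steady 0.5% failure rate that breaks a 99.9% SLO.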
Practice Questions
What is the difference between an SLI and an SLO? Answer: An SLI is the actual measured value (e.g., 99.95% availability). An SLO is the target value (e.g., 99.9% availability).
What is an error budget? Answer: The allowable amount of failure: 100% - SLO target. For a 99.9% SLO, the error budget is 0.1% of total requests or time.
How does SLO burn rate help with alerting? Answer: Burn rate measures how fast the error budget is being consumed. A burn rate of 2 means the budget will be exhausted in half the window, triggering early alerts.
Why use multi-window, multi-burn-rate alerts? Answer: Short windows catch fast burn rates immediately. Long windows confirm persistent problems. Together they reduce false positives and alert fatigue.
How should a team use the error budget to prioritize work? Answer: When the error budget is full, the team can prioritize features. When it is low, the team should pause features and focus on reliability improvements.
Challenge
Define SLIs and SLOs for an e-commerce API with the following requirements: checkout must succeed 99.95% of the time, product search must return results within 200ms (p95), and the product catalog must be no more than 5 minutes stale. Implement PromQL queries for each SLI. Set up multi-window burn rate alerts for the checkout availability SLO with windows of 1h, 6h, and 30d. Create a Grafana SLO dashboard showing: current SLI vs SLO for all three metrics, remaining error budget as a percentage, and burn rate heatmap. Write a team decision-making policy document based on the error budget. Verify the alert fires when you simulate a 5-minute outage.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro