SLIs and SLOs — Defining Service Reliability Goals
In this tutorial, you'll learn about SLIs and SLOs. We cover key concepts, practical examples, and best practices.
Service Level Indicators (SLIs) measure a specific aspect of service behavior, and Service Level Objectives (SLOs) set the target that each SLI must meet. Together they form the quantitative foundation of Site Reliability Engineering practice.
What You'll Learn
In this tutorial, you will learn how to identify meaningful SLIs for latency, availability, throughput, and durability; how to set realistic SLO targets using historical data and business requirements; and how to use SLOs to drive engineering decisions without causing burnout.
Why It Matters
Without SLIs and SLOs, reliability is subjective. One team thinks the service is fine; another thinks it is broken. Engineers cannot improve what they cannot measure. SLOs turn vague reliability goals into concrete numbers that teams can track, alert on, and budget against.
Real-World Use
DodaTech services serve millions of users daily. Doda Browser needs a page-load SLO of under 2 seconds at the 95th percentile. Durga Antivirus Pro requires a virus-definition update SLO of 99.9 percent availability. These targets let each team know exactly when they need to respond and when they can focus on feature work.
graph LR
A[User Request] --> B[SLI Measurement]
B --> C{SLO Met?}
C -->|Yes| D[Within Budget]
C -->|No| E[Consuming Error Budget]
E --> F[Alert Team]
F --> G[Improve Reliability]
G --> A
Prerequisites
Before starting this tutorial, you should understand basic Monitoring and Alerting for SRE concepts. Familiarity with Error Budgets will also help since SLOs and error budgets go hand in hand.
What Is an SLI?
A Service Level Indicator (SLI) is a quantitative measurement of a specific aspect of service performance. Common SLIs include request latency, error rate, throughput, and availability. Each SLI answers a concrete question: How fast are responses? How many requests fail? How much traffic can the service handle?
An SLI must be:
- Measurable: Collected automatically from production traffic or synthetic probes.
- Specific: Tied to a single dimension like latency, not a composite score.
- Meaningful: Correlated with user experience, not internal implementation details.
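One common way to satisfy all three properties is to express an SLI as the ratio of good events to total events. The sketch below assumes a hypothetical `Request` record with `status` and `duration_s` fields. The field names and the 0.3-second threshold are illustrative choices, not part of any particular monitoring system.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code (hypothetical field name)
    duration_s: float  # time taken to serve the request, in seconds

def availability_sli(requests):
    """Fraction of requests that did not fail with a server error."""
    if not requests:
        return None  # no traffic: the SLI is undefined, not 100 percent
    # Client errors such as 404 count as good: the service behaved correctly.
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)

def latency_sli(requests, threshold_s=0.3):
    """Fraction of requests served faster than threshold_s."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.duration_s < threshold_s)
    return fast / len(requests)

sample = [
    Request(200, 0.12),
    Request(200, 0.45),
    Request(503, 0.08),
    Request(404, 0.90),
]
print(f"Availability SLI: {availability_sli(sample):.2%}")  # 75.00%
print(f"Latency SLI:      {latency_sli(sample):.2%}")       # 50.00%
```

Each function measures a single dimension, and both are computed from the requests users actually sent, which keeps the numbers tied to user experience.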
Choosing the Four Golden Signals
Google SRE popularized the four golden signals of monitoring: latency, traffic, errors, and saturation. These map directly to SLIs:
| Signal | SLI Definition | Example |
|---|---|---|
| Latency | Time to respond to a valid request | 95th percentile request duration |
| Traffic | Volume of requests | Requests per second |
| Errors | Rate of failed requests | HTTP 5xx responses / total |
| Saturation | How full the service is | CPU utilization percentage |
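As a rough illustration of the table, the sketch below derives all four signals from one measurement window. The function names, the dictionary keys, and the sample data are assumptions made for this example. The percentile uses the nearest-rank method, and saturation is taken as peak CPU utilization, which is only one of several reasonable definitions.

```python
import math

def nearest_rank_percentile(values, pct):
    """Smallest sample that at least pct percent of samples do not exceed."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def golden_signals(durations_s, statuses, window_s, cpu_samples):
    """Summarize one measurement window as the four golden signals."""
    total = len(statuses)
    errors = sum(1 for s in statuses if 500 <= s <= 599)
    return {
        "latency_p95_s": nearest_rank_percentile(durations_s, 95),
        "traffic_rps": total / window_s,
        "error_ratio": errors / total,
        "saturation_cpu": max(cpu_samples),  # peak utilization, 0.0 to 1.0
    }

# Hypothetical 5-second window: ten requests and three CPU readings.
durations = [0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.30, 0.35, 0.40, 1.50]
statuses = [200] * 8 + [500, 503]
signals = golden_signals(durations, statuses, window_s=5,
                         cpu_samples=[0.40, 0.55, 0.70])
for name, value in signals.items():
    print(f"{name}: {value}")
```

For this sample the window reports a P95 latency of 1.5 seconds, 2.0 requests per second, an error ratio of 0.2, and a peak CPU utilization of 0.7.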
Latency SLI
Latency measures how long a service takes to respond. You must decide between mean, median, and percentile measurements. The mean hides outliers: a 1-second average is consistent with half of your users waiting 2 seconds while the other half get near-instant responses. Percentiles expose the tail.
import random
import statistics

def simulate_latency_sli(sample_size=1000):
    latencies = []
    for _ in range(sample_size):
        base = random.uniform(0.05, 0.3)  # typical response time in seconds
        # About 5 percent of requests pick up an extra tail delay.
        tail = random.uniform(0, 0.8) if random.random() < 0.05 else 0
        latencies.append(base + tail)
    ordered = sorted(latencies)  # sort once and reuse for every percentile
    p50 = statistics.median(ordered)
    p95 = ordered[int(len(ordered) * 0.95)]
    p99 = ordered[int(len(ordered) * 0.99)]
    print(f"P50 latency: {p50:.3f}s")
    print(f"P95 latency: {p95:.3f}s")
    print(f"P99 latency: {p99:.3f}s")

simulate_latency_sli()
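Representative output (the simulation is random and unseeded, so exact values vary between runs but cluster around these):
P50 latency: 0.181s
P95 latency: 0.298s
P99 latency: 0.815s
Because only about 5 percent of the simulated requests receive the extra delay, the tail shows up mainly in the P99 value.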
The P95 is the latency that 95 percent of requests beat, so it marks where the experience of your slowest 5 percent of requests begins. This sensitivity to the tail is why SRE teams almost always use percentiles for latency SLOs.
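To see how the mean can hide the tail without relying on random data, consider a small deterministic sketch. The sample (90 fast requests and 10 slow ones) and the `nearest_rank_percentile` helper are made up for illustration, and nearest-rank is only one of several percentile conventions.

```python
import math
import statistics

def nearest_rank_percentile(values, pct):
    """Smallest sample that at least pct percent of samples do not exceed."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical sample: 90 requests take 0.1 s, 10 requests take 2.0 s.
latencies = [0.1] * 90 + [2.0] * 10

mean = statistics.mean(latencies)
p50 = nearest_rank_percentile(latencies, 50)
p95 = nearest_rank_percentile(latencies, 95)

print(f"Mean: {mean:.2f}s")  # Mean: 0.29s
print(f"P50:  {p50:.2f}s")   # P50:  0.10s
print(f"P95:  {p95:.2f}s")   # P95:  2.00s
```

A mean of 0.29 seconds looks healthy, yet one request in ten takes 2 seconds. Only the P95 reveals it.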
What Is an SLO?
A Service Level Objective (SLO) is a target value or range for an SLI. If your measured P95 latency is about 0.9 seconds, for example, you might set an SLO of "P95 latency under 1.0 second over a 30-day rolling window."
SLOs should be:
- Aspirational but achievable: A target of 100 percent reliability is neither realistic nor desirable.
- Business-aligned: Tied to what customers actually need.
- Measured over a window: Usually 28 or 30 days to smooth out spikes.
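An SLO can be written down as a small record: which SLI it covers, the target, the window, and the direction of comparison. The sketch below is one possible shape, using the two DodaTech targets mentioned earlier. The `SLO` class, its field names, and the 30-day windows are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SLO:
    service: str
    sli: str
    target: float
    window_days: int        # assumed 30-day window for both examples
    higher_is_better: bool  # True for ratios, False for latencies

    def is_met(self, measured):
        """Compare the SLI value measured over the window with the target."""
        if self.higher_is_better:
            return measured >= self.target
        return measured < self.target

page_load = SLO("Doda Browser", "P95 page-load time (s)", 2.0, 30,
                higher_is_better=False)
updates = SLO("Durga Antivirus Pro", "definition-update availability",
              0.999, 30, higher_is_better=True)

print(page_load.is_met(1.8))   # True: 1.8 s is under the 2.0 s target
print(updates.is_met(0.9985))  # False: 99.85 percent is below 99.9 percent
```

Recording the direction explicitly avoids a common slip: a latency SLO is met when the measurement is below the target, while an availability SLO is met when it is at or above it.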
SLO Tiers
| Tier | Target | Description | Example Service |
|---|---|---|---|
| Platinum | 99.99 percent | Four nines — critical infrastructure | Payment processing |
| Gold | 99.9 percent | Three nines — core customer facing | Doda Browser sync |
| Silver | 99 percent | Two nines — internal tools | Admin dashboard |
| Bronze | 95 percent | Non-critical services | Staging environments |
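To make the tiers concrete, each target can be converted into the amount of downtime it tolerates. The sketch below assumes a time-based availability SLI (minutes up divided by minutes in the window) and a 30-day window. The function name is hypothetical.

```python
def allowed_downtime_minutes(target, window_days=30):
    """Minutes of full outage a window can absorb while still meeting target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - target)

tiers = {"Platinum": 0.9999, "Gold": 0.999, "Silver": 0.99, "Bronze": 0.95}
for name, target in tiers.items():
    minutes = allowed_downtime_minutes(target)
    print(f"{name}: {minutes:.1f} minutes per 30 days")
```

Under these assumptions the tiers allow about 4.3, 43.2, 432, and 2,160 minutes of downtime per 30 days. Each extra nine cuts the allowance by a factor of ten, which is why higher tiers cost so much more to operate.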
Calculating SLO Compliance
import random

def calculate_slo_compliance(sli_values, slo_target=0.999):
    # slo_target does double duty in this simplified example: it is the
    # value each window's SLI must reach, and also the fraction of windows
    # that must reach it.
    total = len(sli_values)
    good = sum(1 for v in sli_values if v >= slo_target)
    compliance = good / total
    print(f"Total windows: {total}")
    print(f"Good windows: {good}")
    print(f"Compliance: {compliance:.4%}")
    print(f"SLO met: {compliance >= slo_target}")
    return compliance

# Simulated availability for 100 measurement windows.
sli_data = [random.uniform(0.997, 1.0) for _ in range(100)]
calculate_slo_compliance(sli_data, 0.999)
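Representative output (the data is random and unseeded, so the counts vary between runs; roughly one third of the windows pass because the values are drawn uniformly between 0.997 and 1.0):
Total windows: 100
Good windows: 33
Compliance: 33.0000%
SLO met: False
This example shows a service failing its 99.9 percent SLO because far fewer than 99.9 percent of the windows met the target.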
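Counting windows is one way to measure compliance. Another widely used approach counts individual events and compares the ratio of good events with the target, which also shows how much of the error budget has been spent. The sketch below is a minimal version of that idea. The function name, the dictionary keys, and the request counts are hypothetical.

```python
def error_budget_report(good_events, total_events, slo_target=0.999):
    """Event-based SLO check plus the share of the error budget used."""
    sli = good_events / total_events
    allowed_bad = total_events * (1 - slo_target)  # the error budget
    actual_bad = total_events - good_events
    consumed = actual_bad / allowed_bad if allowed_bad else float("inf")
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "budget_consumed": consumed,
    }

# Hypothetical 30-day totals: one million requests, 600 of them failed.
report = error_budget_report(good_events=999_400, total_events=1_000_000)
print(f"SLI: {report['sli']:.3%}")                               # 99.940%
print(f"SLO met: {report['slo_met']}")                           # True
print(f"Error budget consumed: {report['budget_consumed']:.0%}")  # 60%
```

With a 99.9 percent target, one million requests allow about 1,000 failures. The 600 observed failures use roughly 60 percent of that budget, so the SLO is met with room to spare.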
Setting Your First SLO
Start with one SLI for one service. Choose the signal most critical to user experience, typically latency or error rate. Collect 30 days of historical data. Set the initial SLO at a level your service already meets in about 90 percent of measurement windows, then tighten it over time.
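import random

# Simulated history: one success ratio per minute for 30 days.
historical_data = [random.uniform(0.95, 1.0) for _ in range(30 * 24 * 60)]

def suggest_slo(data, percentile=90):
    # A higher success ratio is better, so the target that `percentile`
    # percent of past windows met or beat sits (100 - percentile) percent
    # of the way up the ascending sort.
    sorted_data = sorted(data)
    idx = int(len(sorted_data) * (100 - percentile) / 100)
    suggested = sorted_data[idx]
    print(f"Suggested SLO: {suggested:.2%}")
    print(f"{percentile}% of past windows met or beat this target")

suggest_slo(historical_data, 90)
Representative output (the data is random and unseeded, so the last digit can vary between runs):
Suggested SLO: 95.50%
90% of past windows met or beat this target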
Common Errors
| Error | Explanation |
|---|---|
| Setting SLO to 100 percent | Perfect reliability is impossible and extremely expensive. Target 99.9 percent and use the error budget for innovation. |
| Using mean instead of percentile | The mean hides the tail. Always use P95 or P99 for latency SLOs. |
| Too many SLIs | Start with 3-5 SLIs per service. More than 10 creates noise and alert fatigue. |
| No SLI for the user journey | Measure what the user experiences, not just internal health checks. |
| Changing SLOs too often | Review SLOs quarterly. Changing them weekly destroys their value as a decision-making tool. |
| Ignoring error budget | An SLO without an error budget is just a number. Use the budget to decide when to stop shipping features and focus on reliability. |
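The last row of the table can be turned into a simple decision rule. The sketch below compares how much of the error budget is spent with how much of the SLO window has elapsed. The function name, the thresholds, and the wording of each action are assumptions for illustration. Real error budget policies are agreed between engineering and product teams.

```python
def budget_action(budget_consumed, window_elapsed):
    """Suggest a course of action from error budget use so far.

    budget_consumed: fraction of the window's error budget already spent
    window_elapsed:  fraction of the SLO window that has passed
                     (greater than 0, at most 1)
    """
    if budget_consumed >= 1.0:
        return "freeze feature releases and prioritize reliability work"
    # A burn rate above 1 means the budget runs out before the window ends.
    burn_rate = budget_consumed / window_elapsed
    if burn_rate > 1.0:
        return "slow down: budget will run out before the window ends"
    return "on track: keep shipping features"

print(budget_action(0.30, 0.50))  # on track: keep shipping features
print(budget_action(0.80, 0.50))  # slow down: budget will run out ...
print(budget_action(1.10, 0.90))  # freeze feature releases ...
```

The value of writing the rule down is that the decision to pause feature work follows from the numbers rather than from whoever argues loudest.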
Practice Questions
- What is the difference between an SLI and an SLO?
- Why do SRE teams use percentiles instead of averages for latency SLIs?
- What is the recommended starting number of SLIs per service?
- How does a 30-day rolling window help SLO measurement?
- Why is a 100 percent SLO considered undesirable?
Challenge
Your task: Choose a service from DodaTech (Doda Browser sync, Durga Antivirus Pro update service, or DodaZIP cloud storage). Define three SLIs for that service and set realistic SLOs based on industry standards. Write a short paragraph explaining why you chose each SLI and how your SLO balances reliability with development velocity.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro