SLIs and SLOs — Defining Service Reliability Goals

DodaTech Updated 2026-06-23 6 min read

In this tutorial, you'll learn about SLIs and SLOs. We cover key concepts, practical examples, and best practices.

Service Level Indicators (SLIs) measure a specific aspect of service behavior, and Service Level Objectives (SLOs) set the target that each SLI must meet. Together they form the quantitative foundation of Site Reliability Engineering practice.

What You'll Learn

In this tutorial, you will learn how to identify meaningful SLIs for latency, availability, throughput, and durability; how to set realistic SLO targets using historical data and business requirements; and how to use SLOs to drive engineering decisions without causing burnout.

Why It Matters

Without SLIs and SLOs, reliability is subjective. One team thinks the service is fine; another thinks it is broken. Engineers cannot improve what they cannot measure. SLOs turn vague reliability goals into concrete numbers that teams can track, alert on, and budget against.

Real-World Use

DodaTech services serve millions of users daily. Doda Browser needs a page-load SLO of under 2 seconds at the 95th percentile. Durga Antivirus Pro requires a virus-definition update SLO of 99.9 percent availability. These targets let each team know exactly when they need to respond and when they can focus on feature work.

graph LR
    A[User Request] --> B[SLI Measurement]
    B --> C{SLO Met?}
    C -->|Yes| D[Within Budget]
    C -->|No| E[Consuming Error Budget]
    E --> F[Alert Team]
    F --> G[Improve Reliability]
    G --> A

Prerequisites

Before starting this tutorial, you should understand basic Monitoring and Alerting for SRE concepts. Familiarity with Error Budgets will also help since SLOs and error budgets go hand in hand.

What Is an SLI?

A Service Level Indicator (SLI) is a quantitative measurement of a specific aspect of service performance. Common SLIs include request latency, error rate, throughput, and availability. Each SLI answers a concrete question: How fast are responses? How many requests fail? How much traffic can the service handle?

An SLI must be:

Measurable: Collected automatically from production traffic or synthetic probes.
Specific: Tied to a single dimension like latency, not a composite score.
Meaningful: Correlated with user experience, not internal implementation details.
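
One common way to satisfy all three properties is to express the SLI as the ratio of good events to total events. The sketch below is a minimal illustration, not a production collector: the `availability_sli` name and the sample status codes are hypothetical, and it assumes that only HTTP 5xx responses count as failures.

```python
def availability_sli(status_codes):
    """Return the fraction of requests that did not fail with a 5xx status."""
    if not status_codes:
        raise ValueError("need at least one request to compute an SLI")
    # Assumption: client errors (4xx) are the caller's fault and count as good.
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)

# Hypothetical sample: 990 successes, 7 client errors, 3 server errors.
sample = [200] * 990 + [404] * 7 + [503] * 3
print(f"Availability SLI: {availability_sli(sample):.3%}")
```

With this sample, 997 of 1000 requests count as good, so the SLI is 99.700 percent.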

Choosing the Four Golden Signals

Google SRE popularized the four golden signals of monitoring: latency, traffic, errors, and saturation. These map directly to SLIs:

Signal	SLI Definition	Example
Latency	Time to respond to a valid request	95th percentile request duration
Traffic	Volume of requests	Requests per second
Errors	Rate of failed requests	HTTP 5xx responses / total
Saturation	How full the service is	CPU utilization percentage
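
As a rough sketch of how the table translates into numbers, the snippet below derives all four signals from one measurement window. The `Request` record, the `golden_signals` helper, and the CPU figures are hypothetical names and values chosen for illustration; a real service would read them from its metrics pipeline.

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_s: float  # time to respond, in seconds
    status: int        # HTTP status code

def golden_signals(requests, window_s, cpu_busy_s, cpu_capacity_s):
    """Summarize one non-empty measurement window into the four golden signals."""
    durations = sorted(r.duration_s for r in requests)
    # Index-based P95: position at 95% of the sorted list, clamped to the last item.
    p95_index = min(len(durations) * 95 // 100, len(durations) - 1)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_s": durations[p95_index],
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": cpu_busy_s / cpu_capacity_s,
    }

# Hypothetical 10-second window: 18 fast requests, 1 slow one, 1 server error.
window = [Request(0.1, 200)] * 18 + [Request(0.9, 200), Request(0.2, 500)]
signals = golden_signals(window, window_s=10, cpu_busy_s=6.0, cpu_capacity_s=10.0)
for name, value in signals.items():
    print(f"{name}: {value}")
```

Each returned value is a candidate SLI. Which of them deserves an SLO depends on what users of the service actually notice.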

Latency SLI

Latency measures how long a service takes to respond. You must decide between mean, median, and percentile measurements. The mean hides outliers: a 1-second average could mean half of users see near-instant responses while the other half wait 2 seconds. Percentiles expose the tail.

import random
import statistics

def simulate_latency_sli(sample_size=1000):
    latencies = []
    for _ in range(sample_size):
        # Baseline response time between 50 ms and 300 ms.
        base = random.uniform(0.05, 0.3)
        # About 5% of requests pick up an extra delay of up to 800 ms.
        tail = random.uniform(0, 0.8) if random.random() < 0.05 else 0
        latencies.append(base + tail)

    # Sort once, then read the percentiles by index.
    ordered = sorted(latencies)
    p50 = statistics.median(ordered)
    p95 = ordered[int(len(ordered) * 0.95)]
    p99 = ordered[int(len(ordered) * 0.99)]

    print(f"P50 latency:  {p50:.3f}s")
    print(f"P95 latency:  {p95:.3f}s")
    print(f"P99 latency:  {p99:.3f}s")

simulate_latency_sli()

Example output (exact values vary between runs because the input is random):

P50 latency:  0.181s
P95 latency:  0.297s
P99 latency:  0.812s

The P95 is the latency that 95 percent of requests beat, so the slowest 5 percent of requests are at least this slow. This is why SRE teams almost always use percentiles for latency SLOs.
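
The claim that a mean can hide the tail is easy to check on a small, deliberately extreme sample. The data below is hypothetical: half the requests are nearly instant and half take almost 2 seconds. The sketch uses `statistics.quantiles` from the standard library (Python 3.8 or later).

```python
import statistics

# Hypothetical sample: 50 requests at 0.01s and 50 requests at 1.99s.
latencies = [0.01] * 50 + [1.99] * 50

mean = statistics.mean(latencies)
# quantiles(n=100) returns the 99 cut points P1..P99; index 94 is P95.
p95 = statistics.quantiles(latencies, n=100)[94]

print(f"Mean latency: {mean:.2f}s")
print(f"P95 latency:  {p95:.2f}s")
```

The mean comes out at 1.00s, a latency no request in the sample actually had, while the P95 of 1.99s reflects what the slow half of users experienced.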

What Is an SLO?

A Service Level Objective (SLO) is a target value or range for an SLI. For example, if your measured P95 latency is 0.943s, you might set an SLO of "P95 latency under 1.0 second over a 30-day rolling window."

SLOs should be:

Aspirational but achievable: A target of 100 percent reliability is neither realistic nor desirable.
Business-aligned: Tied to what customers actually need.
Measured over a window: Usually 28 or 30 days to smooth out spikes.
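
To illustrate what "measured over a window" can look like in practice, here is a minimal sketch of a rolling-window check for a ratio-based SLO. The `RollingSLO` class and the daily request counts are hypothetical. Treating an empty window as compliant is an assumption made for the example, not a rule.

```python
from collections import deque

class RollingSLO:
    """Evaluate a good/total ratio SLO over the most recent `window_days` days."""

    def __init__(self, target, window_days=30):
        self.target = target
        # A deque with maxlen drops the oldest day automatically.
        self.days = deque(maxlen=window_days)

    def record_day(self, good, total):
        self.days.append((good, total))

    def compliance(self):
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        # Assumption: no traffic in the window counts as fully compliant.
        return good / total if total else 1.0

    def is_met(self):
        return self.compliance() >= self.target

slo = RollingSLO(target=0.999, window_days=30)
for _ in range(29):
    slo.record_day(good=10_000, total=10_000)  # 29 clean days
slo.record_day(good=9_500, total=10_000)       # one bad day with 500 failures
print(f"Compliance: {slo.compliance():.4%}, SLO met: {slo.is_met()}")
```

A single bad day keeps the SLO in violation until that day ages out of the window, which is why the window length is as much a part of the SLO as the target itself.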

SLO Tiers

Tier	Target	Description	Example Service
Platinum	99.99 percent	Four nines — critical infrastructure	Payment processing
Gold	99.9 percent	Three nines — core customer facing	Doda Browser sync
Silver	99 percent	Two nines — internal tools	Admin dashboard
Bronze	95 percent	Non-critical services	Staging environments
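
It can help to translate each tier into the downtime it permits. The sketch below assumes a 30-day window and treats the target as pure time-based availability; the `allowed_downtime` helper is a hypothetical name.

```python
from datetime import timedelta

TIERS = {"Platinum": 0.9999, "Gold": 0.999, "Silver": 0.99, "Bronze": 0.95}

def allowed_downtime(target, window=timedelta(days=30)):
    """Return how much downtime an availability target permits per window."""
    return window * (1 - target)

for name, target in TIERS.items():
    minutes = allowed_downtime(target).total_seconds() / 60
    print(f"{name:<9}{target:.2%}  {minutes:8.1f} min")
```

Over 30 days this works out to roughly 4.3 minutes for Platinum, 43.2 minutes for Gold, 7.2 hours for Silver, and 36 hours for Bronze. Each extra nine cuts the allowance by a factor of ten.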

Calculating SLO Compliance

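import random

def calculate_slo_compliance(sli_values, slo_target=0.999):
    # Each value is the SLI for one measurement window (for example, one hour).
    total = len(sli_values)
    # A window is "good" when its SLI meets the target.
    good = sum(1 for v in sli_values if v >= slo_target)
    compliance = good / total
    print(f"Total windows:   {total}")
    print(f"Good windows:    {good}")
    print(f"Compliance:      {compliance:.4%}")
    # For simplicity, the same target is reused as the required share of good windows.
    print(f"SLO met:         {compliance >= slo_target}")
    return compliance

# Simulated per-window availability between 99.7% and 100%.
sli_data = [random.uniform(0.997, 1.0) for _ in range(100)]
calculate_slo_compliance(sli_data, 0.999)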

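Example output (exact counts vary between runs because the data is random):

Total windows:   100
Good windows:    33
Compliance:      33.0000%
SLO met:         False

This example shows a service failing its 99.9 percent SLO: the simulated values are spread evenly between 99.7 and 100 percent, so only about a third of the windows reach the 99.9 percent target.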

Setting Your First SLO

Start with one SLI for one service. Choose the signal most critical to user experience, typically latency or error rate. Collect 30 days of historical data. Set the initial SLO at a level your service already met in about 90 percent of historical windows, so the target is achievable from day one. Then tighten it over time.

import random

# Simulated history: one availability ratio per minute for 30 days.
historical_data = [random.uniform(0.95, 1.0) for _ in range(30 * 24 * 60)]

def suggest_slo(data, met_percent=90):
    # Sort from worst to best, then pick the value that met_percent
    # of historical windows met or exceeded.
    sorted_data = sorted(data)
    idx = int(len(sorted_data) * (100 - met_percent) / 100)
    suggested = sorted_data[idx]
    print(f"Suggested SLO: {suggested:.2%}")
    print(f"{met_percent}% of historical windows met or exceeded this value")
    return suggested

suggest_slo(historical_data, 90)

Example output (the exact value varies slightly between runs because the data is random):

Suggested SLO: 95.50%
90% of historical windows met or exceeded this value

Common Errors

Error	Explanation
Setting SLO to 100 percent	Perfect reliability is impossible and extremely expensive. Target 99.9 percent and use the error budget for innovation.
Using mean instead of percentile	The mean hides the tail. Always use P95 or P99 for latency SLOs.
Too many SLIs	Start with 3-5 SLIs per service. More than 10 creates noise and alert fatigue.
No SLI for the user journey	Measure what the user experiences, not just internal health checks.
Changing SLOs too often	Review SLOs quarterly. Changing them weekly destroys their value as a decision-making tool.
Ignoring error budget	An SLO without an error budget is just a number. Use the budget to decide when to stop shipping features and focus on reliability.
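
The error budget mentioned in the last row is simply the share of requests the SLO allows to fail. The sketch below shows the arithmetic under stated assumptions: the `error_budget_report` helper and the request counts are hypothetical, and the budget is counted in failed requests rather than in downtime.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Compare observed failures against the failures the SLO allows."""
    if not 0 < slo_target < 1:
        # A 100 percent target leaves no budget to spend.
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = total_requests * (1 - slo_target)
    consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": max(0.0, 1 - consumed),
    }

report = error_budget_report(slo_target=0.999,
                             total_requests=1_000_000,
                             failed_requests=400)
print(f"Allowed failures: {report['allowed_failures']:.0f}")
print(f"Budget consumed:  {report['budget_consumed']:.1%}")
print(f"Budget remaining: {report['budget_remaining']:.1%}")
```

With a 99.9 percent target and one million requests, about 1000 failures are allowed, so 400 failures consume 40 percent of the budget and leave 60 percent for releases and experiments.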

Practice Questions

What is the difference between an SLI and an SLO?
Why do SRE teams use percentiles instead of averages for latency SLIs?
What is the recommended starting number of SLIs per service?
How does a 30-day rolling window help SLO measurement?
Why is a 100 percent SLO considered undesirable?

Challenge

Your task: Choose a service from DodaTech (Doda Browser sync, Durga Antivirus Pro update service, or DodaZIP cloud storage). Define three SLIs for that service and set realistic SLOs based on industry standards. Write a short paragraph explaining why you chose each SLI and how your SLO balances reliability with development velocity.

FAQ

What is the difference between an SLI and an SLO?

An SLI is the measurement itself — like latency at P95. An SLO is the target for that measurement — like P95 latency under 500ms.

How many SLIs should I track per service?

Start with three to five. The four golden signals (latency, traffic, errors, saturation) are a good starting point.

Can an SLO be 99.999 percent or higher?

Five nines is achievable for some services with sufficient redundancy and investment. For most services, three or four nines is the practical maximum.

How often should I review SLOs?

Review SLOs quarterly. If your team consistently exceeds the target by a wide margin, tighten it. If they frequently miss, investigate whether the target is realistic.
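
The review rule above can be written down as a small decision helper. This is only a sketch: the `review_slo` name, the `headroom` margin that defines "a wide margin", and the `max_misses` limit that defines "frequently" are all assumptions you would tune for your own service.

```python
def review_slo(target, monthly_compliance, headroom=0.0005, max_misses=1):
    """Suggest a review action from recent monthly compliance figures."""
    misses = sum(1 for c in monthly_compliance if c < target)
    if misses > max_misses:
        # Frequent misses: check whether the target is realistic.
        return "investigate"
    if all(c >= target + headroom for c in monthly_compliance):
        # Consistently well above target: there is room to tighten.
        return "tighten"
    return "keep"

print(review_slo(0.999, [0.9997, 0.9998, 0.9996]))  # prints "tighten"
print(review_slo(0.999, [0.9985, 0.9981, 0.9992]))  # prints "investigate"
print(review_slo(0.999, [0.9991, 0.9993, 0.9992]))  # prints "keep"
```

The helper only suggests a direction. The decision itself still belongs in the quarterly review with the people who own the service.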

What happens when we miss an SLO?

Missing an SLO means the error budget is depleted. The team should stop shipping new features and focus on reliability improvements until the budget recovers.
