Service Level Objectives: SLIs and Error Budgets Explained

DodaTech Updated 2026-06-23 7 min read

This tutorial explains Service Level Objectives (SLOs), the Service Level Indicators (SLIs) they are built on, and the error budgets derived from them. It covers the key concepts, practical examples, and best practices for applying them.

What You Will Learn

This tutorial teaches you how to define meaningful Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs), manage error budgets, and use SLOs to drive engineering decisions.

Why It Matters

Uptime is not enough. A service that is up but responding slowly is failing your users. SLOs give you a data-driven way to define what "good enough" means and a framework for deciding when to prioritize reliability over features.

Real-World Use

The DodaTech sync service has an availability SLO of 99.9% and a latency SLO of p99 < 500ms. When the remaining error budget dropped to 10%, the team paused feature development for two weeks and focused entirely on reliability improvements. The error budget was restored, and the SLO was met for the next quarter.

An SLI is a quantifiable measure of service performance (like request latency or error rate). An SLO is the target value for that SLI (like "p99 latency below 500ms 99.9% of the time"). The error budget is the acceptable amount of failure: 100% - SLO target. Site Reliability Engineering uses these concepts to balance reliability with velocity.

Prerequisites

A service with observable metrics (see Prometheus Introduction)
Basic understanding of percentiles and statistics
Familiarity with Grafana Dashboards for visualization

Step-by-Step Tutorial

Step 1: Identify Your SLIs

Good SLIs measure what matters to users. Common SLIs include:

SLI	Definition	Measurement
Availability	Fraction of requests that succeed	`successful / total requests`
Latency	Time to respond	p50, p95, p99 percentiles
Throughput	Requests per second	`rate(http_requests_total[5m])`
Freshness	How recent the data is	Age of latest data point
Correctness	Fraction of correct responses	Application-specific
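
To make the first two rows of the table concrete, here is a minimal Python sketch that computes an availability SLI and a latency percentile from a list of request records. The `Request` class, the sample data, and the nearest-rank percentile method are assumptions made for illustration; a real setup would normally compute these from metrics, as shown in Step 2.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # response time in seconds


def availability(requests):
    """Fraction of requests with a 2xx status (successful / total)."""
    if not requests:
        return None  # no traffic: the SLI is undefined, not 100%
    ok = sum(1 for r in requests if 200 <= r.status < 300)
    return ok / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of response time, e.g. pct=99 for p99."""
    if not requests:
        return None
    durations = sorted(r.duration_s for r in requests)
    rank = math.ceil(pct / 100 * len(durations))  # 1-based rank
    return durations[max(rank, 1) - 1]


# Hypothetical sample: three successes and one slow server error.
sample = [
    Request(200, 0.12),
    Request(200, 0.08),
    Request(500, 0.90),
    Request(200, 0.20),
]
print(availability(sample))            # 0.75
print(latency_percentile(sample, 50))  # 0.12
print(latency_percentile(sample, 99))  # 0.9
```

Returning `None` for an empty window is a deliberate choice in this sketch: reporting 100% when there is no traffic would hide the fact that nothing was measured.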

Step 2: Define SLIs in PromQL

# Availability SLI
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI (p99)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Freshness SLI
time() - max(metric_last_updated_timestamp)

Step 3: Choose SLO Targets

Use the "rule of 3" to set SLO targets. The example below defines three targets, each measured over a 30-day window:

# Example SLO targets for a web service
availability:
  target: 99.9%
  measurement_window: 30d

latency_p99:
  target: 500ms
  measurement_window: 30d

latency_p95:
  target: 200ms
  measurement_window: 30d
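
Once targets exist, checking compliance is a simple comparison per SLO. The sketch below mirrors the three example targets in a Python dictionary and evaluates a set of measured values against them. The dictionary layout, the key names, and the measured numbers are all hypothetical.

```python
# Hypothetical targets mirroring the example above (latencies in seconds).
SLO_TARGETS = {
    "availability": {"target": 0.999, "higher_is_better": True},
    "latency_p99_s": {"target": 0.500, "higher_is_better": False},
    "latency_p95_s": {"target": 0.200, "higher_is_better": False},
}


def evaluate(measured):
    """Return {slo_name: True/False} telling whether each SLO is met."""
    results = {}
    for name, spec in SLO_TARGETS.items():
        value = measured[name]
        if spec["higher_is_better"]:
            results[name] = value >= spec["target"]
        else:
            results[name] = value < spec["target"]
    return results


# Made-up measurements over the 30-day window.
measured = {"availability": 0.9995, "latency_p99_s": 0.620, "latency_p95_s": 0.180}
print(evaluate(measured))
# {'availability': True, 'latency_p99_s': False, 'latency_p95_s': True}
```

In this made-up case the service meets its availability and p95 targets but misses the p99 target, which is the kind of result a single "uptime" number would not reveal.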

Step 4: Calculate the Error Budget

The error budget is the maximum allowed failure:

99.9% SLO = 0.1% error budget
In 30 days (2,592,000 seconds): 2,592 seconds (43 minutes) of allowed downtime
In 30 days of requests at 1000 req/s: 2,592,000 allowed errors

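def error_budget(slo_percent, total_requests):
    """Return the number of failed requests the SLO allows."""
    allowed_failure_rate = 1 - (slo_percent / 100)
    # Use round(), not int(): floating-point error can leave the product
    # just below a whole number (9999.99...), which int() would truncate.
    return round(total_requests * allowed_failure_rate)

# Example: 99.9% SLO with 10 million requests
print(error_budget(99.9, 10_000_000))
# Output: 10000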
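
The same budget can be expressed as time instead of requests. The following sketch reproduces the downtime figure quoted above; the function name `allowed_downtime_seconds` is introduced here only for illustration.

```python
def allowed_downtime_seconds(slo_percent, window_days):
    """Seconds of full downtime the SLO tolerates within the window."""
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (100 - slo_percent) / 100


budget = allowed_downtime_seconds(99.9, 30)
print(f"{budget:.0f} s = {budget / 60:.1f} min")
# 2592 s = 43.2 min
```

The values are formatted rather than printed raw because floating-point arithmetic can leave a result such as 2591.9999999998 instead of exactly 2592.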

Step 5: Track SLO Burn Rate

Burn rate is how fast you are consuming the error budget:

# Burn rate over 1 hour
(
  sum(rate(http_requests_total{status!~"2.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / (1 - 0.999)  # 0.999 = 99.9% SLO

A burn rate of 1 means you will exactly exhaust the budget over the window. A burn rate of 2 means you will exhaust it in half the window.
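
The relationship between burn rate and time to exhaustion can be sketched in a few lines of Python. This assumes a full budget at the start and a constant burn rate; both function names are hypothetical.

```python
def burn_rate(error_ratio, slo_percent):
    """Observed error ratio divided by the error ratio the SLO allows."""
    budget_ratio = (100 - slo_percent) / 100
    return error_ratio / budget_ratio


def hours_to_exhaustion(rate, window_days=30):
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # no errors: the budget is never exhausted
    return window_days * 24 / rate


# 0.2% of requests failing against a 99.9% SLO is a burn rate of about 2.
rate = burn_rate(0.002, 99.9)
print(f"burn rate {rate:.1f}, budget gone in {hours_to_exhaustion(rate):.0f} hours")
# burn rate 2.0, budget gone in 360 hours
```

360 hours is 15 days, half of the 30-day window, which matches the rule stated above.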

Step 6: Create Multi-Window, Multi-Burn-Rate Alerts

groups:
  - name: slo_alerts
    rules:
      - alert: SLOViolationCritical
        expr: |
          (
            sum(rate(http_requests_total{status!~"2.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 14.4 * 0.001
        for: 5m
        labels:
          severity: critical
          slo: availability_99.9
        annotations:
          summary: "SLO burn rate critical: 14.4x over 1 hour"
          description: "At this rate a full 30-day error budget is exhausted in ~50 hours"

      - alert: SLOViolationWarning
        expr: |
          (
            sum(rate(http_requests_total{status!~"2.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > 6 * 0.001
        for: 15m
        labels:
          severity: warning
          slo: availability_99.9
        annotations:
          summary: "SLO burn rate warning: 6x over 6 hours"
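
The thresholds 14.4 and 6 are easier to reason about as a share of the 30-day budget spent during the alert window. The sketch below works that out, and also shows the usual multi-window refinement in which an alert fires only when both a long and a short window exceed the threshold. The short window is an assumption of this sketch: the rules above each evaluate a single window. All function names are hypothetical.

```python
def budget_consumed(burn_rate, alert_window_hours, slo_window_days=30):
    """Fraction of the whole error budget spent during the alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)


def fires(long_window_error_ratio, short_window_error_ratio, factor, slo_percent=99.9):
    """True when both windows exceed factor x the budgeted error ratio."""
    threshold = factor * (100 - slo_percent) / 100
    return (long_window_error_ratio > threshold
            and short_window_error_ratio > threshold)


for rate, hours in [(14.4, 1), (6, 6)]:
    share = budget_consumed(rate, hours)
    print(f"{rate}x for {hours}h -> {share:.0%} of the 30d budget")
# 14.4x for 1h -> 2% of the 30d budget
# 6x for 6h -> 5% of the 30d budget

print(fires(0.02, 0.02, 14.4))  # True: still failing right now
print(fires(0.02, 0.0, 14.4))   # False: the problem has already stopped
```

Requiring the short window as well means the alert stops firing soon after the errors stop, rather than staying active until the long window has aged out.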

Step 7: Visualize SLOs in Grafana

Create an SLO dashboard with:

SLO Status Panel: A Stat panel showing current SLO Compliance (green/red)
Error Budget Panel: A Gauge showing remaining error budget percentage
Burn Rate Panel: Time series showing burn rate over 1h, 6h, and 30d windows
SLI Trend Panel: Time series of the actual SLI value vs the SLO target line

Step 8: Implement SLO-Based Decision Making

def can_deploy(error_budget_remaining_percent):
    if error_budget_remaining_percent > 50:
        return True  # Full deploy velocity
    elif error_budget_remaining_percent > 20:
        return True  # But require monitoring review
    elif error_budget_remaining_percent > 0:
        return False # Freeze feature deploys, focus on reliability
    else:
        return False # Incident mode, all hands on deck
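
The input to `can_deploy` has to come from somewhere. A minimal way to derive it from request counts is sketched below; the function name and the sample numbers are illustrative assumptions, not values from a real system.

```python
def error_budget_remaining_percent(slo_percent, total_requests, failed_requests):
    """Percentage of the error budget still unspent in the current window.

    The result goes negative once more errors have occurred than the
    SLO allows, which signals that the SLO itself has been missed.
    """
    allowed = total_requests * (100 - slo_percent) / 100
    if allowed == 0:
        return 0.0  # a 100% SLO (or no traffic) leaves no budget to spend
    return 100 * (1 - failed_requests / allowed)


# 9,000 failures against a budget of 10,000 leaves about 10% remaining.
remaining = error_budget_remaining_percent(99.9, 10_000_000, 9_000)
print(f"{remaining:.1f}% of the error budget remains")
# 10.0% of the error budget remains
```

Passing that value to `can_deploy` returns `False`, because 10% falls in the band where feature deploys are frozen.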

Learning Path

flowchart LR
    A[Define SLIs] --> B[Set SLO Targets]
    B --> C[Calculate Error Budget]
    C --> D[Track Burn Rate]
    D --> E[Multi-Window Alerts]
    E --> F[Visualize in Grafana]
    F --> G[Decision Framework]
    G --> H[Improve Reliability]
    style A fill:#4a90d9,color:#fff
    style G fill:#e67e22,color:#fff

Common Errors

SLI does not reflect user experience -- The SLI measures infrastructure metrics (like CPU) instead of user-facing metrics (like request latency).
SLO target is too aggressive -- A 99.99% SLO leaves only 4.3 minutes of downtime per month. Most applications do not need this level unless they are critical infrastructure.
Error budget is exhausted due to a single bad deployment -- The burn rate alert did not trigger because the window was too long. Use shorter windows (1h, 6h) in addition to the 30d window.
Multiple SLOs conflict -- Latency SLO and availability SLO can conflict: retrying failed requests improves availability but increases latency. Balance trade-offs consciously.
SLO dashboard shows 100% Compliance constantly -- The SLI measurement is not granular enough. Measure at the request level, not the minute level.
Team ignores the error budget -- The error budget is not visible in the team's daily workflow. Embed SLO data in Pull Request pages and deployment dashboards.
SLO target is changed to avoid paging -- The team is relaxing targets instead of improving reliability. SLOs should be stable and reviewed quarterly, not changed reactively.

Practice Questions

What is the difference between an SLI and an SLO? Answer: An SLI is the actual measured value (e.g., 99.95% availability). An SLO is the target value (e.g., 99.9% availability).
What is an error budget? Answer: The allowable amount of failure: 100% - SLO target. For a 99.9% SLO, the error budget is 0.1% of total requests or time.
How does SLO burn rate help with alerting? Answer: Burn rate measures how fast the error budget is being consumed. A burn rate of 2 means the budget will be exhausted in half the window, so alerting on burn rate warns the team before the budget is gone.
Why use multi-window, multi-burn-rate alerts? Answer: Short windows catch fast burn rates immediately. Long windows confirm persistent problems. Together they reduce false positives and alert fatigue.
How should a team use the error budget to prioritize work? Answer: When the error budget is full, the team can prioritize features. When it is low, the team should pause features and focus on reliability improvements.

Challenge

Define SLIs and SLOs for an e-commerce API with the following requirements: checkout must succeed 99.95% of the time, product search must return results within 200ms (p95), and the product catalog must be no more than 5 minutes stale. Implement PromQL queries for each SLI. Set up multi-window burn rate alerts for the checkout availability SLO with windows of 1h, 6h, and 30d. Create a Grafana SLO dashboard showing: current SLI vs SLO for all three metrics, remaining error budget as a percentage, and burn rate heatmap. Write a team decision-making policy document based on the error budget. Verify the alert fires when you simulate a 5-minute outage.

FAQ

What is the difference between SLO and SLA?

An SLO is an internal target you set for yourself. An SLA is a contractual commitment to a customer. SLAs are typically less aggressive than SLOs to provide a buffer.

How many SLOs should a service have?

Start with 2-3 SLOs per service, such as availability and latency. Each SLO carries its own error budget and its own alerts, so too many SLOs dilute the team's focus and cause alert fatigue.

Can I have SLOs for internal services?

Yes, internal services should have SLOs too. A database that is slow affects all downstream services. SLOs help internal teams prioritize reliability.

How often should SLO targets be reviewed?

Quarterly is a good cadence. Review whether the targets are still appropriate, and adjust based on the previous quarter's performance.

What is the "one-nine" difference?

Each additional "nine" is an order of magnitude improvement: 99% (3.65 days/year downtime) vs 99.9% (8.76 hours/year) vs 99.99% (52.6 minutes/year).
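
These downtime figures follow directly from the availability percentage. A short sketch of the arithmetic, assuming a 365-day year (the function name is hypothetical):

```python
def downtime_per_year_hours(availability_percent):
    """Hours of downtime per year allowed at a given availability."""
    hours_per_year = 365 * 24  # 8760, ignoring leap years
    return hours_per_year * (100 - availability_percent) / 100


for target in (99.0, 99.9, 99.99):
    hours = downtime_per_year_hours(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

For 99% this gives 87.6 hours, which is the 3.65 days quoted above, and each extra nine divides the allowance by ten.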
