Service Level Agreements (SLAs) vs SLOs vs SLIs
This tutorial compares Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs), covering definitions, practical examples, and best practices.
SLAs, SLOs, and SLIs form a hierarchy of reliability measurements: SLIs measure what the service does, SLOs set internal targets for those measurements, and SLAs make contractual promises to customers based on those targets — each layer serves a different audience and purpose.
What You'll Learn
In this tutorial, you will learn the distinct definitions of SLAs, SLOs, and SLIs, how they interrelate in a reliability framework, how to set each one for different service tiers, and the common mistakes teams make when confusing these three concepts.
Why It Matters
Confusing SLAs with SLOs is one of the most expensive mistakes an SRE team can make. An SLA is a legal contract with financial penalties. An SLO is an internal engineering target. If you set them at the same level, you have no warning before you breach your contract. Understanding the difference protects both the business and the engineering team.
Real-World Use
DodaTech guarantees 99.9 percent uptime in its customer SLAs for DodaZIP cloud storage. The internal SRE team sets the SLO at 99.99 percent for the same service. The SLO carries its own error budget of 0.01 percent, and the 0.09 percentage point gap between the SLO and the SLA is a safety buffer: the team can spend its error budget on deployments and maintenance, and can even overrun it, well before contractual penalties come into play.
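To put the DodaZIP numbers in concrete terms, the sketch below converts each availability target into minutes of permitted downtime. The 30-day measurement window and the function name `allowed_downtime_minutes` are assumptions made for illustration; a real SLA defines its own window.

```python
def allowed_downtime_minutes(target_percent, window_days=30):
    """Minutes of downtime an availability target permits over the window.

    window_days=30 is an assumed measurement window, not a DodaTech term.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_percent) / 100


sla_minutes = allowed_downtime_minutes(99.9)
slo_minutes = allowed_downtime_minutes(99.99)

print(f"SLA 99.9%:  {sla_minutes:.2f} minutes per 30 days")
print(f"SLO 99.99%: {slo_minutes:.2f} minutes per 30 days")
print(f"Headroom between SLO and SLA: {sla_minutes - slo_minutes:.2f} minutes")
```

Under these assumptions the SLA tolerates roughly 43 minutes of downtime per 30 days, while the SLO tolerates a little over 4 minutes, so an SLO miss leaves about 39 minutes of headroom before the contract is at risk.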
graph TD
A[SLI: Measured Value] --> B[SLO: Internal Target]
B --> C[Error Budget]
C --> D[SLA: Customer Contract]
D --> E[Penalties if Breached]
B --> F[Engineering Decisions]
D --> G[Business/Legal]
Prerequisites
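You should understand SLIs and SLOs before reading this comparison tutorial. Familiarity with error budgets also helps, since the error budget is derived from the SLO, and the gap between SLO and SLA determines how far that budget can be overrun before the contract is at risk.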
Definitions
SLI — Service Level Indicator
An SLI is a raw measurement of a specific aspect of service behavior. It answers the question: "What is the current value of this metric?"
Examples:
- Request latency at P95 over the last 5 minutes
- Number of HTTP 5xx responses per minute
- Percentage of successful backups completed today
- Storage utilization as a percentage of total capacity
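Several of these SLIs are ratios of good events to total events. As a minimal sketch, assuming the raw counts are already available from monitoring, such an SLI could be computed like this. The function name `ratio_sli` and the backup counts are hypothetical.

```python
def ratio_sli(good_events, total_events):
    """Return good events as a percentage of total events.

    Returns None when there were no events, because the SLI is
    undefined in that case rather than 0 or 100.
    """
    if total_events == 0:
        return None
    return 100 * good_events / total_events


# Hypothetical counts: 287 of 288 scheduled backups succeeded today.
backup_success = ratio_sli(good_events=287, total_events=288)
print(f"Backup success rate: {backup_success:.2f}%")
```

Returning `None` for an empty window is a design choice: it forces the caller to decide how to treat periods with no traffic instead of silently reporting a perfect or failed score.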
SLO — Service Level Objective
An SLO is a target value or range for an SLI. It answers the question: "What should this metric be?"
Examples:
- P95 latency under 500ms over a 30-day window
- Error rate below 0.1 percent of requests
- Backup success rate of 99.9 percent
- Storage utilization below 80 percent
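The examples above mix two kinds of targets: some SLIs must stay under a limit (latency, error rate, utilization) and others must stay at or above one (success rate). The sketch below shows one way to represent that. The `SLO` class, its field names, and the measured values are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    sli_name: str
    target: float
    higher_is_better: bool  # True for success rates, False for latency or error rate

    def is_met(self, measured: float) -> bool:
        """Compare a measured SLI value against the target."""
        if self.higher_is_better:
            return measured >= self.target
        # "under" / "below" targets are treated as strict limits here.
        return measured < self.target


slos = [
    SLO("P95 latency (ms)", 500, higher_is_better=False),
    SLO("Error rate (%)", 0.1, higher_is_better=False),
    SLO("Backup success rate (%)", 99.9, higher_is_better=True),
    SLO("Storage utilization (%)", 80, higher_is_better=False),
]

# Hypothetical measurements, one per SLI.
measured = {
    "P95 latency (ms)": 420,
    "Error rate (%)": 0.25,
    "Backup success rate (%)": 99.95,
    "Storage utilization (%)": 71,
}

for slo in slos:
    value = measured[slo.sli_name]
    status = "met" if slo.is_met(value) else "missed"
    print(f"{slo.sli_name}: {value} vs target {slo.target} -> {status}")
```

With these sample readings only the error rate objective is missed, which is the kind of signal an SLO is meant to surface to engineering.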
SLA — Service Level Agreement
An SLA is a contractual commitment to a customer that includes specific service levels and penalties for failing to meet them. It answers: "What have we promised the customer?"
Examples:
- 99.9 percent uptime guarantee with 5 percent service credit per hour of downtime
- Maximum 1-second response time at P95 measured monthly
- 99.99 percent data durability guarantee
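The first example combines an uptime guarantee with a credit schedule. The sketch below shows how such a clause might translate into a credit, under several assumptions that a real contract would spell out: a 30-day window, credits that accrue only for downtime beyond what the guarantee allows, every started hour counting as a full hour, and a cap at 100 percent. The function name and parameters are hypothetical.

```python
import math


def service_credit_percent(downtime_minutes, uptime_guarantee=99.9,
                           credit_per_hour=5.0, window_days=30, cap=100.0):
    """Estimate the service credit owed, as a percentage of the bill.

    All defaults are illustrative assumptions, not real contract terms.
    """
    window_minutes = window_days * 24 * 60
    # Rounded to avoid floating-point noise at the exact boundary.
    allowed = round(window_minutes * (100 - uptime_guarantee) / 100, 6)
    excess = max(0.0, downtime_minutes - allowed)
    billable_hours = math.ceil(excess / 60)  # each started hour counts in full
    return min(cap, billable_hours * credit_per_hour)


for minutes in (30, 150, 5000):
    credit = service_credit_percent(minutes)
    print(f"{minutes} minutes of downtime -> {credit:.0f}% credit")
```

Whether credits start from the first minute of downtime or only after the allowance is used up is a contractual detail, so treat the accrual rule here as one possible reading.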
Comparison
| Dimension | SLI | SLO | SLA |
|---|---|---|---|
| Type | Raw measurement | Internal target | Contractual promise |
| Audience | Engineering | Engineering + Management | Customers + Legal |
| Penalty | None | None (error budget) | Financial credits |
| Typical value | Varies continuously | 99% to 99.99% | 99% to 99.9% |
| Review cadence | Real-time | Quarterly | Annually |
| Tightness | No target | Tight (aspirational) | Loose (with buffer) |
Why SLO Must Be Tighter Than SLA
If your SLA promises 99.9 percent uptime and your SLO is also 99.9 percent, then the moment you miss your internal target you have also breached the contract. There is no warning period. The right approach is to set the SLO tighter than the SLA, so that an SLO miss acts as an early warning.
def check_sla_slo_relationship(sla, slo):
    """Print whether the SLO leaves a buffer above the SLA (both in percent)."""
    print(f"SLA: {sla}%")
    print(f"SLO: {slo}%")
    if slo > sla:
        # A stricter SLO leaves room to react before the contract is breached.
        buffer = slo - sla
        print(f"Buffer: {buffer:.2f}%")
        print(f"Status: SAFE (SLO is {buffer:.2f}% tighter than SLA)")
    else:
        print("Status: DANGER (SLO is not tighter than SLA)")

check_sla_slo_relationship(99.9, 99.99)
check_sla_slo_relationship(99.9, 99.9)
Expected output:
SLA: 99.9%
SLO: 99.99%
Buffer: 0.09%
Status: SAFE (SLO is 0.09% tighter than SLA)
SLA: 99.9%
SLO: 99.9%
Status: DANGER (SLO is not tighter than SLA)
Setting SLIs, SLOs, and SLAs
Step 1: Define SLIs
Identify what matters to users. For a file storage service, durability and availability matter more than latency. For a real-time chat service, latency matters more than durability.
class ServiceLevels:
    """Collects the SLIs, SLOs, and SLA defined for one service."""

    def __init__(self, service_name):
        self.name = service_name
        self.slis = []   # list of dicts: name, measurement, unit
        self.slos = {}   # SLI name -> target
        self.sla = None  # customer-facing commitment, if any

    def add_sli(self, name, measurement, unit):
        self.slis.append({"name": name, "measurement": measurement, "unit": unit})

    def set_slo(self, sli_name, target):
        self.slos[sli_name] = target

    def set_sla(self, sla_value):
        self.sla = sla_value

    def report(self):
        print(f"Service: {self.name}")
        print("\nSLIs:")
        for sli in self.slis:
            print(f" - {sli['name']}: {sli['measurement']} ({sli['unit']})")
        print("\nSLOs:")
        for name, target in self.slos.items():
            print(f" - {name}: {target}")
        print(f"\nSLA: {self.sla}")

levels = ServiceLevels("DodaZIP Cloud Storage")
levels.add_sli("Availability", "Uptime percentage", "%")
levels.add_sli("Durability", "Data integrity check pass rate", "%")
levels.add_sli("Upload latency", "P95 upload completion time", "seconds")
levels.set_slo("Availability", "99.99%")
levels.set_slo("Durability", "99.9999999%")
levels.set_sla("99.9% uptime (SLA)")
levels.report()
Expected output:
Service: DodaZIP Cloud Storage

SLIs:
 - Availability: Uptime percentage (%)
 - Durability: Data integrity check pass rate (%)
 - Upload latency: P95 upload completion time (seconds)

SLOs:
 - Availability: 99.99%
 - Durability: 99.9999999%

SLA: 99.9% uptime (SLA)
Step 2: Set SLOs Using Historical Data
Look at the last 30 days of SLI data. Set the SLO at a level that the service meets most of the time but requires effort to sustain.
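One simple way to turn that advice into a starting number is to pick the highest target the service met on most days of the window. The sketch below assumes one availability reading per day and a 90 percent "most of the time" threshold. The function name, the threshold, and the sample history are all hypothetical.

```python
import math


def suggest_slo(daily_sli_values, meet_percent=90):
    """Return the highest target met on at least meet_percent of the days."""
    if not daily_sli_values:
        raise ValueError("need at least one day of SLI data")
    if not 0 < meet_percent <= 100:
        raise ValueError("meet_percent must be in (0, 100]")
    ordered = sorted(daily_sli_values, reverse=True)  # best day first
    # Number of days that must meet the target, rounded up.
    days_required = math.ceil(meet_percent * len(ordered) / 100)
    return ordered[days_required - 1]


# Hypothetical 30 days of daily availability readings (percent).
history = [100.0] * 20 + [99.99] * 5 + [99.97] * 2 + [99.95, 99.9, 99.2]

print(f"Suggested starting SLO: {suggest_slo(history)}%")
```

This yields a candidate to discuss, not a final answer: daily readings are not the same as availability over the whole window, and the team may still choose a stricter or looser target based on user expectations.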
Step 3: Negotiate SLAs with the Business
SLAs are business decisions, not engineering decisions. The SRE team provides data on what availability levels are achievable. The product and legal teams decide what to promise customers.
Common Errors
| Error | Explanation |
|---|---|
| Setting SLO equal to SLA | Removes the buffer between the internal target and the contract. Any miss of the SLO is also an SLA breach. |
| Having only SLAs without SLOs | Without internal SLOs, the SLA is the only reliability target — and by the time you breach it, you are already paying penalties. |
| Too many SLIs | Each SLI needs monitoring, alerting, and an SLO. Keep the set small and meaningful. |
| Ignoring SLIs that are not in the SLA | Just because something is not in the SLA does not mean it should not be measured. Track it internally. |
| Treating SLOs as guarantees | SLOs are not promises. They are internal targets that can be missed without penalty. SLA breaches have consequences. |
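The first two errors in the table come down to losing the early warning that a tighter SLO provides. A minimal sketch of that three-state view, assuming availability-style percentages where higher is better, might look like this. The function name and status labels are hypothetical.

```python
def reliability_status(measured, slo, sla):
    """Classify a measured availability against the SLO and the SLA.

    All three values are percentages where higher is better.
    """
    if slo <= sla:
        raise ValueError("SLO must be tighter (higher) than the SLA")
    if measured < sla:
        return "SLA BREACHED"  # contractual penalties may apply
    if measured < slo:
        return "SLO MISSED"    # internal warning, contract still intact
    return "OK"


for measured in (99.995, 99.95, 99.8):
    print(f"{measured}% -> {reliability_status(measured, slo=99.99, sla=99.9)}")
```

The middle state is the one that disappears when the SLO equals the SLA or when no SLO exists at all, which is why the function rejects that configuration.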
Practice Questions
- What is the difference between SLI, SLO, and SLA?
- Why must the SLO be tighter than the SLA?
- Who owns each of SLI, SLO, and SLA?
- What happens when an SLA is breached?
- How many SLIs should you track per service?
Challenge
You are the SRE lead for a new DodaTech service: a real-time document collaboration tool. Define three SLIs, set SLOs for each, and recommend an SLA to the business team. Explain why you chose each SLI and how much buffer exists between the SLO and SLA.