Error Budgets — Balancing Reliability and Velocity
In this tutorial, you'll learn about Error Budgets. We cover key concepts, practical examples, and best practices.
An error budget is the permissible amount of unreliability your service can experience over a defined window — when the budget is spent, the team shifts from shipping features to improving reliability.
What You'll Learn
In this tutorial, you will learn how error budgets are calculated from SLO targets, how to track budget consumption in real time, and how to use budget alerts to make data-driven decisions about feature freezes and reliability investments.
Why It Matters
Without error budgets, reliability and feature velocity are in constant conflict. Product teams push for faster releases. SRE teams push for stability. An error budget provides a clear, agreed-upon mechanism for deciding when to stop shipping and start stabilizing.
Real-World Use
The Doda Browser sync service runs with a 99.9 percent SLO, giving it a 0.1 percent error budget over 30 days. That works out to about 43 minutes of allowed downtime per month. When the remaining budget drops below 50 percent, the on-call team investigates. When it hits zero, all non-critical deployments stop until the budget recovers.
graph TD
A[SLO Target: 99.9%] --> B[Error Budget: 0.1%]
B --> C[Track Consumption]
C --> D{Budget Remaining?}
D -->|Green: >50%| E[Ship Features]
D -->|Yellow: 10-50%| F[Investigate Incidents]
D -->|Red: <10%| G[Stop Deployments]
E --> H[Reliability Focus]
F --> H
G --> H
H --> B
Prerequisites
You should understand SLIs and SLOs before working with error budgets. Knowing how Error Budgets interact with Incident Response is also helpful.
How Error Budgets Work
An error budget is simply 100 percent minus your SLO. If your SLO is 99.9 percent availability, your error budget is 0.1 percent of the measurement window, or 0.1 percent of requests if availability is measured per request.
Budget Calculation
Over a 30-day window at 99.9 percent SLO:
Total seconds in 30 days: 2,592,000
Allowed downtime: 2,592,000 * 0.001 = 2,592 seconds (about 43 minutes)
def calculate_error_budget(slo_percent, window_days):
    # Length of the measurement window in seconds
    total_seconds = window_days * 24 * 60 * 60
    # The budget is the fraction of the window the SLO allows to be unavailable
    budget_seconds = total_seconds * (1 - slo_percent / 100)
    print(f"SLO: {slo_percent}%")
    print(f"Window: {window_days} days")
    print(f"Total seconds: {total_seconds:,}")
    print(f"Error budget (sec): {budget_seconds:,.0f}")
    print(f"Error budget (min): {budget_seconds/60:.1f}")
    return budget_seconds

calculate_error_budget(99.9, 30)
Expected output:
SLO: 99.9%
Window: 30 days
Total seconds: 2,592,000
Error budget (sec): 2,592
Error budget (min): 43.2
Budget Consumption Rate
Track your actual downtime or failure rate against the budget. Each incident consumes a portion.
class ErrorBudgetTracker:
    def __init__(self, slo_percent, window_days):
        # Compute the budget inline. Calling calculate_error_budget() here
        # would also print its five-line summary before the incident output.
        self.budget = window_days * 24 * 60 * 60 * (1 - slo_percent / 100)
        self.consumed = 0
        self.events = []

    def record_incident(self, downtime_seconds, description):
        # Each incident subtracts its downtime from the remaining budget
        self.consumed += downtime_seconds
        remaining = self.budget - self.consumed
        remaining_pct = remaining / self.budget * 100
        self.events.append({
            'downtime': downtime_seconds,
            'desc': description,
            'remaining_pct': remaining_pct
        })
        print(f"Incident: {description}")
        print(f" Downtime: {downtime_seconds}s")
        print(f" Budget remaining: {remaining_pct:.1f}%")
        if remaining_pct < 10:
            print(" WARNING: Budget critical. Freeze deployments!")

tracker = ErrorBudgetTracker(99.9, 30)
tracker.record_incident(300, "Database failover")
tracker.record_incident(900, "CDN outage")
tracker.record_incident(600, "Deployment rollback")
Expected output:
Incident: Database failover
 Downtime: 300s
 Budget remaining: 88.4%
Incident: CDN outage
 Downtime: 900s
 Budget remaining: 53.7%
Incident: Deployment rollback
 Downtime: 600s
 Budget remaining: 30.6%
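Knowing how much budget is left is only half the picture; how fast it is being spent matters too. The sketch below is one possible way to express that as a burn rate (budget fraction used divided by window fraction elapsed) and to project when the budget would run out. The function names and the 10 elapsed days are assumptions for illustration; the 2,592-second budget and the three incident durations come from the example above.

```python
def burn_rate(consumed_seconds, budget_seconds, elapsed_days, window_days):
    """Ratio of budget spent to window elapsed; 1.0 means exactly on pace."""
    budget_fraction_used = consumed_seconds / budget_seconds
    window_fraction_elapsed = elapsed_days / window_days
    return budget_fraction_used / window_fraction_elapsed


def days_until_exhausted(consumed_seconds, budget_seconds, elapsed_days):
    """Project days until the budget hits zero if the average pace so far continues."""
    if consumed_seconds <= 0:
        return None  # nothing consumed yet, so no projection is possible
    remaining = budget_seconds - consumed_seconds
    if remaining <= 0:
        return 0.0  # budget already spent
    seconds_per_day = consumed_seconds / elapsed_days
    return remaining / seconds_per_day


# Budget and incidents from the tracker example; elapsed_days=10 is assumed.
budget = 2592
consumed = 300 + 900 + 600
rate = burn_rate(consumed, budget, elapsed_days=10, window_days=30)
left = days_until_exhausted(consumed, budget, elapsed_days=10)
print(f"Burn rate: {rate:.2f}x")
print(f"Days until exhausted at this pace: {left:.1f}")
```

Under these assumptions the script prints a burn rate of 2.08x and about 4.4 days of budget left. A rate above 1.0 means the budget would be gone before the window ends if nothing changes.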
Error Budget Policies
A clear policy defines what happens at each consumption level.
| Budget Remaining | Action |
|---|---|
| Greater than 50 percent | Normal operations. Ship features. |
| 25 to 50 percent | Increased monitoring. Review recent incidents. |
| 10 to 25 percent | Alert the team. Consider slowing deployments. |
| Less than 10 percent | Deployment freeze. Focus exclusively on reliability. |
| Zero or negative | Emergency incident response. Full team mobilization. |
Automating Policy Enforcement
def check_budget(remaining_pct):
    # Thresholds mirror the policy table above
    if remaining_pct > 50:
        return "GREEN", "Normal operations"
    elif remaining_pct > 25:
        return "YELLOW", "Increased monitoring"
    elif remaining_pct > 10:
        return "ORANGE", "Slow deployments"
    elif remaining_pct > 0:
        return "RED", "Deployment freeze"
    else:
        return "CRITICAL", "Emergency response"

for pct in [60, 30, 15, 5, 0]:
    status, action = check_budget(pct)
    print(f"{pct:3d}% -> {status:10s} | {action}")
Expected output:
 60% -> GREEN      | Normal operations
 30% -> YELLOW     | Increased monitoring
 15% -> ORANGE     | Slow deployments
  5% -> RED        | Deployment freeze
  0% -> CRITICAL   | Emergency response
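A status check becomes more useful when it only raises an alert as the budget crosses a threshold, rather than on every reading. The following is a minimal sketch of that idea, not a production alerting setup. `status_for` and `BudgetAlerter` are hypothetical names, the cutoffs copy the ones in `check_budget`, and the sample readings are made up for illustration.

```python
# Same cutoffs as check_budget: (lower bound, status)
THRESHOLDS = [(50, "GREEN"), (25, "YELLOW"), (10, "ORANGE"), (0, "RED")]


def status_for(remaining_pct):
    """Map remaining budget to a status, using the check_budget cutoffs."""
    for floor, status in THRESHOLDS:
        if remaining_pct > floor:
            return status
    return "CRITICAL"


class BudgetAlerter:
    """Hypothetical helper: records an alert only when the status changes."""

    def __init__(self):
        self.last_status = "GREEN"  # assume the window starts healthy
        self.alerts = []

    def update(self, remaining_pct):
        status = status_for(remaining_pct)
        if status != self.last_status:
            self.alerts.append(
                f"{self.last_status} -> {status} at {remaining_pct:.1f}%"
            )
            self.last_status = status
        return status


alerter = BudgetAlerter()
for pct in [88.4, 53.7, 30.6, 28.0, 8.5]:  # illustrative readings
    alerter.update(pct)
for line in alerter.alerts:
    print(line)
```

With these readings only two alerts are recorded: one when the budget falls into YELLOW at 30.6% and one when it jumps straight to RED at 8.5%. In a real system the `alerts` list would likely be replaced by a call to whatever paging or chat tool the team uses.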
Error Budgets vs SLAs
An SLA (Service Level Agreement) is a contractual commitment with penalties. An SLO is an internal target, and the error budget is derived from it. Your SLO should always be tighter than your SLA. That way, you exhaust your error budget before you breach the contract.
For example, if your SLA requires 99.95 percent uptime, set your internal SLO at 99.99 percent. Your error budget is then 0.01 percent, and the remaining 0.04 percentage points between the SLO and the SLA act as a buffer before the contract is breached.
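To make the relationship concrete, the sketch below converts the two example targets into minutes of downtime. The 30-day window is an assumption carried over from the earlier budget calculation, and `allowed_downtime_seconds` is a hypothetical helper name.

```python
def allowed_downtime_seconds(target_percent, window_days=30):
    """Downtime permitted by an availability target over the window."""
    return window_days * 24 * 60 * 60 * (100 - target_percent) / 100


sla_allowance = allowed_downtime_seconds(99.95)  # contractual limit
slo_budget = allowed_downtime_seconds(99.99)     # internal error budget
buffer = sla_allowance - slo_budget              # margin left once the budget is spent

print(f"SLA allowance: {sla_allowance / 60:.1f} min")
print(f"SLO budget:    {slo_budget / 60:.1f} min")
print(f"Buffer:        {buffer / 60:.1f} min")
```

Under that assumption the SLA allows about 21.6 minutes of downtime, the SLO budget is about 4.3 minutes, and roughly 17.3 minutes of margin remain between exhausting the budget and breaching the contract.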
Common Errors
| Error | Explanation |
|---|---|
| Setting SLO equal to SLA | If your SLO matches your SLA, the error budget gives no warning before a contractual breach. Set the SLO tighter than the SLA. |
| Ignoring the budget | The error budget only works if the team respects the policy. If you deploy through a red budget, the budget has no meaning. |
| Using too short a window | A 7-day window is too noisy. Use 28 or 30 days to smooth out natural variance. |
| No automated alerts | Tracking budget manually in a spreadsheet is error-prone. Automate alerts at each threshold. |
| One budget for all services | Different services have different criticality. Set separate budgets for different tiers. |
| Forgetting to refill | The budget resets each window. Even if you ended the previous window in the red, the new window starts with a full budget, so reset consumption tracking when the window rolls over. |
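The window-length row in the table can be checked with a little arithmetic. This sketch compares the budget at a 99.9 percent SLO across 7-, 28-, and 30-day windows and shows how much of each a single incident would consume. The 10-minute incident is an assumed figure for illustration, and `budget_minutes` is a hypothetical helper name.

```python
def budget_minutes(slo_percent, window_days):
    """Error budget in minutes for a given SLO and window length."""
    return window_days * 24 * 60 * (100 - slo_percent) / 100


incident_minutes = 10  # assumed incident length, for illustration only
for days in (7, 28, 30):
    budget = budget_minutes(99.9, days)
    share = incident_minutes / budget * 100
    print(f"{days:2d}-day window: {budget:5.1f} min budget, "
          f"one {incident_minutes}-min incident uses {share:.0f}%")
```

With a 7-day window the budget is only about 10 minutes, so one such incident uses nearly all of it, while the same incident uses roughly a quarter of a 28- or 30-day budget.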
Practice Questions
- How is an error budget calculated from an SLO?
- What is the recommended action when the error budget drops below 10 percent?
- Why should the SLO be tighter than the SLA?
- What is a reasonable window length for error budget tracking?
- What happens to the error budget at the end of the measurement window?
Challenge
Your DodaTech team runs a file synchronization service with a 99.95 percent SLO. Assume a 30-day window. Over the last 15 days, you have experienced three incidents: a 10-minute DNS outage, a 25-minute database migration issue, and a 5-minute certificate expiration. Calculate the remaining error budget, determine the current policy status, and write a brief recommendation for the next team meeting.