Error Budgets — Balancing Reliability and Velocity

DodaTech Updated 2026-06-23 6 min read

In this tutorial, you'll learn about Error Budgets. We cover key concepts, practical examples, and best practices.

An error budget is the permissible amount of unreliability your service can experience over a defined window — when the budget is spent, the team shifts from shipping features to improving reliability.

What You'll Learn

In this tutorial, you will learn how error budgets are calculated from SLO targets, how to track budget consumption in real time, and how to use budget alerts to make data-driven decisions about feature freezes and reliability investments.

Why It Matters

Without error budgets, reliability and feature velocity are in constant conflict. Product teams push for faster releases. SRE teams push for stability. An error budget provides a clear, agreed-upon mechanism for deciding when to stop shipping and start stabilizing.

Real-World Use

The Doda Browser sync service runs with a 99.9 percent SLO, giving it a 0.1 percent error budget over 30 days. That works out to about 43 minutes of allowed downtime per month. When the remaining budget drops below 50 percent, the on-call team investigates. When it hits zero, all non-critical deployments stop until the budget recovers.

graph TD
    A[SLO Target: 99.9%] --> B[Error Budget: 0.1%]
    B --> C[Track Consumption]
    C --> D{Budget Remaining?}
    D -->|Green: >50%| E[Ship Features]
    D -->|Yellow: 10-50%| F[Investigate Incidents]
    D -->|Red: <10%| G[Stop Deployments]
    E --> H[Reliability Focus]
    F --> H
    G --> H
    H --> B

Prerequisites

You should understand SLIs and SLOs before working with error budgets. Knowing how Error Budgets interact with Incident Response is also helpful.

How Error Budgets Work

An error budget is simply 100 percent minus your SLO. If your SLO is 99.9 percent availability, your error budget is 0.1 percent: 0.1 percent of the measurement window if you measure uptime, or 0.1 percent of all requests if you measure success rate.
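The same subtraction applies when availability is measured in requests rather than time. The following is a minimal sketch, assuming a hypothetical helper named `allowed_failures` and an illustrative traffic figure of 10 million requests per window (neither comes from the Doda Browser example):

```python
def allowed_failures(slo_percent, total_requests):
    """Return how many requests may fail in the window without missing the SLO.

    Hypothetical helper for illustration only.
    """
    budget_fraction = 1 - slo_percent / 100  # roughly 0.001 for a 99.9% SLO
    # round() absorbs floating-point noise such as 10000.000000000009
    return round(total_requests * budget_fraction)


# Illustrative traffic volume, not a measured value
failures = allowed_failures(99.9, 10_000_000)
print(f"Allowed failed requests: {failures:,}")
```

With these assumed numbers, the service could fail 10,000 requests in the window and still meet a 99.9 percent SLO.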

Budget Calculation

Over a 30-day window at 99.9 percent SLO:

Total seconds in 30 days: 2,592,000
Allowed downtime: 2,592,000 * 0.001 = 2,592 seconds (about 43 minutes)

def calculate_error_budget(slo_percent, window_days):
    total_seconds = window_days * 24 * 60 * 60
    budget_seconds = total_seconds * (1 - slo_percent / 100)
    print(f"SLO:                  {slo_percent}%")
    print(f"Window:               {window_days} days")
    print(f"Total seconds:        {total_seconds:,}")
    print(f"Error budget (sec):   {budget_seconds:,.0f}")
    print(f"Error budget (min):   {budget_seconds/60:.1f}")
    return budget_seconds

calculate_error_budget(99.9, 30)

Expected output:

SLO:                  99.9%
Window:               30 days
Total seconds:        2,592,000
Error budget (sec):   2,592
Error budget (min):   43.2

Budget Consumption Rate

Track your actual downtime or failure rate against the budget. Each incident consumes a portion.

class ErrorBudgetTracker:
    def __init__(self, slo_percent, window_days):
        # Compute the budget directly so that creating a tracker prints nothing
        self.budget = window_days * 24 * 60 * 60 * (1 - slo_percent / 100)
        self.consumed = 0
        self.events = []

    def record_incident(self, downtime_seconds, description):
        self.consumed += downtime_seconds
        remaining = self.budget - self.consumed
        remaining_pct = remaining / self.budget * 100
        self.events.append({
            'downtime': downtime_seconds,
            'desc': description,
            'remaining_pct': remaining_pct
        })
        print(f"Incident: {description}")
        print(f"  Downtime: {downtime_seconds}s")
        print(f"  Budget remaining: {remaining_pct:.1f}%")
        if remaining_pct < 10:
            print("  WARNING: Budget critical — freeze deployments!")

tracker = ErrorBudgetTracker(99.9, 30)
tracker.record_incident(300, "Database failover")
tracker.record_incident(900, "CDN outage")
tracker.record_incident(600, "Deployment rollback")

Expected output:

Incident: Database failover
  Downtime: 300s
  Budget remaining: 88.4%
Incident: CDN outage
  Downtime: 900s
  Budget remaining: 53.7%
Incident: Deployment rollback
  Downtime: 600s
  Budget remaining: 30.6%
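The remaining percentage alone does not show how fast the budget is being spent. One way to express that is a burn rate: the fraction of the budget consumed divided by the fraction of the window elapsed. A value above 1.0 means the budget would run out before the window ends at the current pace. The sketch below assumes a hypothetical `burn_rate` helper and assumes the three incidents above all happened in the first 10 days of the window (the day count is illustrative, not part of the example above):

```python
def burn_rate(consumed_seconds, budget_seconds, elapsed_days, window_days):
    """Ratio of budget spent to window elapsed (hypothetical helper).

    1.0 means the budget is on pace to last exactly the full window.
    """
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    consumed_fraction = consumed_seconds / budget_seconds
    elapsed_fraction = elapsed_days / window_days
    return consumed_fraction / elapsed_fraction


# A 99.9% SLO over 30 days allows 2,592 seconds; the incidents above
# total 300 + 900 + 600 = 1,800 seconds.
elapsed = 10  # assumed days into the window
rate = burn_rate(1800, 2592, elapsed_days=elapsed, window_days=30)
print(f"Burn rate: {rate:.2f}x")
if rate > 1:
    # At a constant pace, the budget runs out at window_days / rate
    print(f"Budget exhausted around day {30 / rate:.1f}")
```

Under these assumptions the burn rate is about 2.08, so the budget would be gone around day 14.4 if incidents continued at the same pace.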

Error Budget Policies

A clear policy defines what happens at each consumption level.

Budget Remaining	Action
Greater than 50 percent	Normal operations. Ship features.
25 to 50 percent	Increased monitoring. Review recent incidents.
10 to 25 percent	Alert the team. Consider slowing deployments.
Less than 10 percent	Deployment freeze. Focus exclusively on reliability.
Zero or negative	Emergency incident response. Full team mobilization.

Automating Policy Enforcement

def check_budget(remaining_pct):
    if remaining_pct > 50:
        return "GREEN", "Normal operations"
    elif remaining_pct > 25:
        return "YELLOW", "Increased monitoring"
    elif remaining_pct > 10:
        return "ORANGE", "Slow deployments"
    elif remaining_pct > 0:
        return "RED", "Deployment freeze"
    else:
        return "CRITICAL", "Emergency response"

for pct in [60, 30, 15, 5, 0]:
    status, action = check_budget(pct)
    print(f"{pct:3d}% -> {status:10s} | {action}")

Expected output:

 60% -> GREEN      | Normal operations
 30% -> YELLOW     | Increased monitoring
 15% -> ORANGE     | Slow deployments
  5% -> RED        | Deployment freeze
  0% -> CRITICAL   | Emergency response
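One place to apply these thresholds is a gate in the deployment pipeline. The sketch below assumes a hypothetical `can_deploy` function and a hypothetical `critical` flag for urgent fixes. It uses the same 10 percent boundary as `check_budget` and lets critical changes through, since the freeze described earlier covers non-critical deployments only.

```python
def can_deploy(remaining_pct, critical=False):
    """Decide whether a deployment may proceed (hypothetical gate).

    Above 10% remaining, deployments are allowed. At 10% or below,
    only changes flagged as critical (for example, reliability fixes)
    may go out.
    """
    if remaining_pct > 10:
        return True
    return critical


for pct, critical in [(60, False), (5, False), (5, True)]:
    verdict = "allow" if can_deploy(pct, critical) else "block"
    print(f"remaining={pct}% critical={critical} -> {verdict}")
```

How a change gets flagged as critical is a team decision that this sketch leaves open.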

Error Budgets vs SLAs

An SLA (Service Level Agreement) is a contractual commitment with penalties. An SLO is an internal target, and the error budget is derived from the SLO, not from the SLA. Your SLO should always be tighter than your SLA. That way, you exhaust your error budget, and trigger your policy, before you breach the contract.

For example, if your SLA requires 99.95 percent uptime, set your internal SLO at 99.99 percent. Your error budget is then 0.01 percent, and the remaining 0.04 percentage points between the SLO and the SLA act as a safety buffer.
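To make the arithmetic concrete, the sketch below converts both targets into minutes of downtime. `allowed_downtime_minutes` is a hypothetical helper, and the 30-day window is an assumption carried over from the earlier examples.

```python
def allowed_downtime_minutes(target_percent, window_days=30):
    """Downtime an availability target allows over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_percent) / 100


slo_budget = allowed_downtime_minutes(99.99)  # internal target
sla_limit = allowed_downtime_minutes(99.95)   # contractual limit
print(f"SLO error budget:    {slo_budget:.2f} min")
print(f"SLA breach at:       {sla_limit:.2f} min")
print(f"Buffer past the SLO: {sla_limit - slo_budget:.2f} min")
```

Under these assumptions the SLO allows about 4.3 minutes of downtime, while the SLA is not breached until about 21.6 minutes, which leaves roughly 17 minutes of margin after the error budget is spent.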

Common Errors

Error	Explanation
Setting SLO equal to SLA	If your SLO matches your SLA, an error budget gives no warning before contractual breach. Set SLO tighter.
Ignoring the budget	The error budget only works if the team respects the policy. If you deploy through a red budget, the budget has no meaning.
Using too short a window	A 7-day window is often too noisy, because a single incident can consume most of the budget. Use 28 or 30 days to smooth out natural variance.
No automated alerts	Tracking budget manually in a spreadsheet is error-prone. Automate alerts at each threshold.
One budget for all services	Different services have different criticality. Set separate budgets for different tiers.
Forgetting to refill	With a fixed window, the budget resets at the start of each window. Even if you ended the previous window in the red, the new window starts with a full budget.
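The reset in the last row applies to a fixed window. An alternative is a rolling window, where budget comes back gradually as old incidents age out of the last 30 days. The sketch below is one way to compute rolling consumption. The `consumed_in_window` helper, the `(day, seconds)` tuple layout, and the incident days are all assumptions made for illustration.

```python
def consumed_in_window(incidents, today, window_days=30):
    """Sum downtime from incidents that fall inside the rolling window.

    `incidents` is a list of (day_number, downtime_seconds) tuples,
    a data shape assumed for this example.
    """
    window_start = today - window_days
    return sum(seconds for day, seconds in incidents
               if window_start < day <= today)


incidents = [(3, 300), (12, 900), (20, 600)]  # illustrative data
for today in (25, 35, 45):
    used = consumed_in_window(incidents, today)
    print(f"day {today}: {used}s consumed in the last 30 days")
```

With this data, consumption falls from 1,800 seconds on day 25 to 1,500 on day 35 and 600 on day 45, without any explicit reset.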

Practice Questions

How is an error budget calculated from an SLO?
What is the recommended action when the error budget drops below 10 percent?
Why should the SLO be tighter than the SLA?
What is a reasonable window length for error budget tracking?
What happens to the error budget at the end of the measurement window?

Challenge

Your DodaTech team runs a file synchronization service with a 99.95 percent SLO measured over a 30-day window. Over the last 15 days, you have experienced three incidents: a 10-minute DNS outage, a 25-minute database migration issue, and a 5-minute certificate expiration. Calculate the remaining error budget, determine the current policy status, and write a brief recommendation for the next team meeting.

FAQ

What is an error budget in SRE?

An error budget is the amount of unreliability, measured as downtime or failed requests, that a service can incur within a defined window while still meeting its SLO. Once it is spent, the team stops shipping features and works on reliability.

How is the error budget calculated?

The error budget is 100 percent minus the SLO percentage, multiplied by the total time in the measurement window. For a 99.9 percent SLO over 30 days, the budget is 43.2 minutes.

What happens when the error budget is depleted?

When the budget is depleted, the team shifts focus from feature development to reliability improvements. Non-critical deployments are frozen until the budget recovers.

Can the error budget be negative?

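Yes. If downtime in the window exceeds the allowance, the remaining budget goes negative, which means the SLO has already been missed for that window. At zero or below, the team enters emergency response mode.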

Does the error budget reset?

Yes. The error budget resets at the end of each measurement window, typically every 28 or 30 days.
