Toil Automation — Reducing Manual Operations
In this tutorial, you'll learn about Toil Automation. We cover key concepts, practical examples, and best practices.
Toil is manual, repetitive, automatable work that does not provide lasting value. Every SRE team should measure its toil and systematically automate it away.
What You'll Learn
In this tutorial, you will learn how to identify toil using the five characteristics defined by Google SRE, how to measure toil as a percentage of engineering time, how to prioritize automation projects by return on investment, and how to build automation that eliminates common operational tasks.
Why It Matters
When SRE teams spend more than 50 percent of their time on toil, they have little capacity left for reliability improvements. Manual operations are also error-prone: every manual step is an opportunity for human error. Automation reduces both toil and incident risk.
Real-World Use
DodaTech automated certificate renewal after a manual renewal was missed, causing a 45-minute outage for Doda Browser sync. The automation now renews 200 certificates monthly without human involvement. The team reduced toil from 60 percent to 25 percent over 18 months through systematic automation.
graph LR
A[Manual Task] --> B{Can It Be Automated?}
B -->|Yes| C[Build Automation]
B -->|No| D[Document Procedure]
C --> E[Deploy Automation]
E --> F[Monitor for Errors]
F --> G[Remove Manual Steps]
G --> A
Prerequisites
You should understand Runbooks since automated procedures often replace runbook steps. Familiarity with Incident Response helps you prioritize automation that reduces MTTR.
What Counts as Toil?
Google SRE defines five characteristics of toil:
| Characteristic | Description | Example |
|---|---|---|
| Manual | Requires human action | SSH into a server and restart a process |
| Repetitive | Done over and over | Weekly database index rebuild |
| Automatable | Could be done by a script | User account provisioning |
| Tactical | Reactive, not strategic | Responding to the same alert daily |
| No lasting value | Fixes symptoms, not causes | Clearing disk space every week |
Toil Assessment
class ToilTask:
    def __init__(self, name, frequency_per_week, time_per_task_min):
        self.name = name
        self.frequency = frequency_per_week
        self.time_min = time_per_task_min
        self.weekly_minutes = frequency_per_week * time_per_task_min

    def is_toil(self, manual, repetitive, automatable, tactical, no_value):
        # Treat a task as toil when it shows at least four of the five traits
        score = sum([manual, repetitive, automatable, tactical, no_value])
        return score >= 4

# Each task is paired with its flags, in the order
# (manual, repetitive, automatable, tactical, no_value).
# The flag values are illustrative judgment calls.
tasks = [
    (ToilTask("Restart crashed worker", 3, 10), (True, True, True, True, True)),
    (ToilTask("Deploy new release", 2, 30), (True, True, True, False, True)),
    (ToilTask("Clear temp disk space", 5, 5), (True, True, True, True, True)),
    (ToilTask("Investigate new bug", 1, 120), (True, False, False, True, False)),
]

total_toil_min = 0
for t, flags in tasks:
    if t.is_toil(*flags):
        total_toil_min += t.weekly_minutes
        print(f"TOIL: {t.name} ({t.weekly_minutes} min/week)")
print(f"\nTotal toil: {total_toil_min} min/week ({total_toil_min/60:.1f} hours)")
Expected output:
TOIL: Restart crashed worker (30 min/week)
TOIL: Deploy new release (60 min/week)
TOIL: Clear temp disk space (25 min/week)

Total toil: 115 min/week (1.9 hours)
Measuring Toil
Track how the team spends its time. Google's SRE guidance is to keep toil below 50 percent of each engineer's time.
class TimeTracker:
    def __init__(self, total_hours):
        self.total = total_hours
        self.categories = {}

    def add_category(self, name, hours):
        # Store the hours and their share of the total engineering time
        self.categories[name] = {
            "hours": hours,
            "percent": (hours / self.total) * 100
        }

    def report(self):
        print(f"Total engineering hours: {self.total}")
        print(f"{'Category':25s} {'Hours':8s} {'Percent':10s}")
        print("-" * 45)
        for name, data in self.categories.items():
            print(f"{name:25s} {data['hours']:<8.1f} {data['percent']:<10.1f}")
        if "Toil" in self.categories and self.categories["Toil"]["percent"] > 50:
            print("\nWARNING: Toil exceeds 50 percent. Team needs automation.")

tracker = TimeTracker(160)
tracker.add_category("Toil", 65)
tracker.add_category("Feature work", 45)
tracker.add_category("Incident response", 25)
tracker.add_category("Improvements", 25)
tracker.report()
Expected output:
Total engineering hours: 160
Category                  Hours    Percent
---------------------------------------------
Toil                      65.0     40.6
Feature work              45.0     28.1
Incident response         25.0     15.6
Improvements              25.0     15.6
Automation Prioritization
Prioritize automation by calculating time saved versus effort to build.
def automation_roi(task_name, weekly_minutes, build_hours, team_size=1):
    # Hours spent on the task per year, across everyone who performs it
    yearly_minutes = weekly_minutes * 52
    yearly_hours = (yearly_minutes / 60) * team_size
    savings = yearly_hours
    # Hours saved in the first year per hour invested in building
    ratio = savings / build_hours
    print(f"Task: {task_name}")
    print(f" Current time: {weekly_minutes} min/week")
    print(f" Yearly cost: {yearly_hours:.1f} hours")
    print(f" Build effort: {build_hours} hours")
    print(f" ROI ratio: {ratio:.1f}x")
    print(f" Priority: {'HIGH' if ratio > 5 else 'MEDIUM' if ratio > 1 else 'LOW'}")

automation_roi("Certificate renewal", 30, 8)
automation_roi("User provisioning", 60, 40)
automation_roi("Log rotation", 15, 2)
Expected output:
Task: Certificate renewal
 Current time: 30 min/week
 Yearly cost: 26.0 hours
 Build effort: 8 hours
 ROI ratio: 3.2x
 Priority: MEDIUM
Task: User provisioning
 Current time: 60 min/week
 Yearly cost: 52.0 hours
 Build effort: 40 hours
 ROI ratio: 1.3x
 Priority: MEDIUM
Task: Log rotation
 Current time: 15 min/week
 Yearly cost: 13.0 hours
 Build effort: 2 hours
 ROI ratio: 6.5x
 Priority: HIGH
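The ratio above counts only the build effort. A minimal sketch that also subtracts ongoing upkeep and estimates a payback period could look like the following. The function name, the `maintenance_hours_per_year` parameter, and the sample figure of 4 maintenance hours are assumptions for illustration, not values from this tutorial.

```python
def roi_with_maintenance(weekly_minutes, build_hours, maintenance_hours_per_year):
    """Return (net first-year savings in hours, payback period in weeks or None)."""
    yearly_hours_saved = weekly_minutes * 52 / 60
    # Net benefit in the first year after paying for the build and the upkeep
    net_first_year = yearly_hours_saved - build_hours - maintenance_hours_per_year
    # Hours recovered per week once upkeep is spread evenly across the year
    weekly_net = (weekly_minutes / 60) - (maintenance_hours_per_year / 52)
    if weekly_net <= 0:
        return net_first_year, None  # upkeep cancels the savings; it never pays back
    return net_first_year, build_hours / weekly_net


# Certificate renewal figures from above, plus an assumed 4 hours of upkeep per year
net, payback_weeks = roi_with_maintenance(30, 8, 4)
print(f"Net first-year savings: {net:.1f} hours")
print(f"Payback period: {payback_weeks:.1f} weeks")
```

A task whose upkeep exceeds the time it saves returns `None` for the payback period, which is a signal to leave it manual or simplify the automation.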
Building an Automation Script
import os
import shutil
import time

def automate_disk_cleanup(path, threshold_gb, max_age_days=7):
    # max_age_days is an assumed retention period; tune it for your logs
    usage = shutil.disk_usage(path)
    used_gb = usage.used / (1024 ** 3)
    if used_gb > threshold_gb:
        print(f"Disk usage {used_gb:.1f}GB exceeds {threshold_gb}GB threshold")
        print("Cleaning old log files...")
        cleaned = 0
        for f in os.listdir(path):
            filepath = os.path.join(path, f)
            if f.endswith(".log") and os.path.isfile(filepath):
                # Age in days since the file was last modified
                age_days = (time.time() - os.path.getmtime(filepath)) / 86400
                if age_days > max_age_days:
                    os.remove(filepath)
                    cleaned += 1
        print(f"Cleaned {cleaned} files")
    else:
        print(f"Disk usage {used_gb:.1f}GB is within threshold")

automate_disk_cleanup("/var/log", 10)
Example output (the figure depends on the host's disk usage):
Disk usage 3.2GB is within threshold
Building an Automation Pipeline
An automation pipeline takes a manual task through stages from identification to full automation.
Stage 1: Identify and Measure
Find the manual task, then measure how often it occurs and how long it takes. These numbers are the baseline for the ROI calculation.
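As a rough sketch of this measurement step, the snippet below derives frequency and duration from a log of task occurrences. The log format, the sample entries, and the four-week window are hypothetical; in practice the data might come from tickets or a time-tracking tool.

```python
from collections import defaultdict

# Hypothetical log: (task name, minutes spent) for each occurrence in a 4-week window
occurrences = [
    ("Restart crashed worker", 10),
    ("Restart crashed worker", 12),
    ("Clear temp disk space", 5),
    ("Restart crashed worker", 8),
    ("Clear temp disk space", 7),
]
WEEKS_OBSERVED = 4


def baseline(log, weeks):
    """Summarize runs per week, average minutes per run, and minutes per week."""
    totals = defaultdict(lambda: {"count": 0, "minutes": 0})
    for name, minutes in log:
        totals[name]["count"] += 1
        totals[name]["minutes"] += minutes
    return {
        name: {
            "per_week": t["count"] / weeks,
            "avg_minutes": t["minutes"] / t["count"],
            "weekly_minutes": t["minutes"] / weeks,
        }
        for name, t in totals.items()
    }


summary = baseline(occurrences, WEEKS_OBSERVED)
for name, stats in summary.items():
    print(f"{name}: {stats['per_week']:.2f}/week, "
          f"{stats['avg_minutes']:.1f} min each, "
          f"{stats['weekly_minutes']:.1f} min/week")
```

The `weekly_minutes` figure is the input an ROI calculation needs.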
Stage 2: Document
Write a runbook for the manual procedure. This ensures you understand every step and serves as a fallback when automation fails.
Stage 3: Script
Write a script that automates the manual steps. Start with a single step and expand. Run the script alongside the manual process until you trust it.
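One common way to run a script alongside the manual process is a dry-run mode that reports what it would do without doing it. The sketch below assumes a hypothetical `remove_files` helper and demonstrates it on a temporary file so nothing real is touched.

```python
import os
import tempfile


def remove_files(paths, dry_run=True):
    """Delete the given files, or only report what would be deleted."""
    acted_on = []
    for path in paths:
        if dry_run:
            print(f"[dry-run] would remove {path}")
        else:
            os.remove(path)
            print(f"removed {path}")
        acted_on.append(path)
    return acted_on


# Demonstration on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "old.log")
    open(target, "w").close()
    remove_files([target], dry_run=True)    # the file is still there afterwards
    still_there = os.path.exists(target)
    remove_files([target], dry_run=False)   # now it is actually deleted
    gone = not os.path.exists(target)
```

Defaulting `dry_run` to `True` means a destructive run has to be requested explicitly, which suits the period before the script is trusted.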
Stage 4: Integrate
Integrate the script into your alerting and incident response pipeline. When an alert fires, the script runs automatically.
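A minimal sketch of this integration is a table that maps alert names to remediation functions. The alert names, the alert dictionary shape, and the handler functions are all hypothetical; a real alerting system would deliver alerts through its own webhook or API.

```python
def restart_worker(alert):
    # Stand-in for a real restart action
    return f"restarted {alert['service']}"


def clear_temp_space(alert):
    # Stand-in for a real cleanup action
    return f"cleared temp files on {alert['host']}"


# Hypothetical alert names mapped to the remediation that handles them
HANDLERS = {
    "WorkerCrashed": restart_worker,
    "DiskSpaceLow": clear_temp_space,
}


def handle_alert(alert):
    """Run the registered remediation, or hand off to a human if none exists."""
    handler = HANDLERS.get(alert["name"])
    if handler is None:
        return f"no automation for {alert['name']}; paging on-call"
    return handler(alert)


print(handle_alert({"name": "WorkerCrashed", "service": "sync-worker"}))
print(handle_alert({"name": "DatabaseDown", "host": "db-1"}))
```

Alerts without a registered handler still reach a person, so adding automation never hides an alert.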
Stage 5: Monitor
Monitor the automation for errors. If it fails, alert the on-call team. Track the number of successful automated runs versus failures.
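The counting described here can be sketched with a small wrapper that records each run and raises a flag when failures become too frequent. The `RunStats` class, the 10 percent threshold, and the printed alert are assumptions; a real setup would send the alert to the paging system.

```python
class RunStats:
    """Count automated runs and flag when the failure rate gets too high."""

    def __init__(self, max_failure_rate=0.1):
        self.successes = 0
        self.failures = 0
        self.max_failure_rate = max_failure_rate  # assumed threshold

    def record(self, task, *args):
        """Run task, record the outcome, and return True on success."""
        try:
            task(*args)
        except Exception as exc:
            self.failures += 1
            print(f"ALERT on-call: {task.__name__} failed: {exc}")
            return False
        self.successes += 1
        return True

    def failure_rate(self):
        total = self.successes + self.failures
        return self.failures / total if total else 0.0

    def needs_attention(self):
        return self.failure_rate() > self.max_failure_rate


def flaky_cleanup(should_fail):
    # Stand-in for an automated task that sometimes fails
    if should_fail:
        raise RuntimeError("disk not mounted")


stats = RunStats()
for outcome in [False, False, True, False]:
    stats.record(flaky_cleanup, outcome)
print(f"{stats.successes} ok, {stats.failures} failed, "
      f"failure rate {stats.failure_rate():.0%}")
```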
Stage 6: Remove Manual Path
Once you are confident the automation works, make it the default path and retire the manual steps from day-to-day runbooks. Keep the documented procedure available as a fallback.
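To illustrate keeping a fallback, the sketch below tries the automated path first and points the operator at the runbook only when it fails. The function names and the runbook identifier are hypothetical.

```python
def run_with_fallback(automation, runbook_name):
    """Try the automated path; on failure, point the operator at the runbook."""
    try:
        automation()
        return "automated"
    except Exception as exc:
        print(f"Automation failed ({exc}); follow runbook: {runbook_name}")
        return "manual"


def renew_ok():
    # Stand-in for a renewal that succeeds
    pass


def renew_broken():
    # Stand-in for a renewal that cannot reach its dependency
    raise ConnectionError("certificate authority unreachable")


print(run_with_fallback(renew_ok, "cert-renewal-manual"))
print(run_with_fallback(renew_broken, "cert-renewal-manual"))
```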
Example: Automating Certificate Renewal
Certificate renewal is a common toil task. Let us build the automation stages for this example.
import datetime

class CertificateRenewalAutomation:
    def __init__(self, domain):
        self.domain = domain
        # Simulated certificate that expires 30 days from creation
        self.expiry = datetime.datetime.now() + datetime.timedelta(days=30)

    def check_expiry(self):
        # .days truncates to whole days, so a certificate created a moment
        # ago with 30 days of validity reports 29
        days_remaining = (self.expiry - datetime.datetime.now()).days
        print(f"Certificate: {self.domain}")
        print(f"Days until expiry: {days_remaining}")
        return days_remaining

    def _issue(self):
        # Placeholder for the real renewal call; here it only extends the expiry
        print("Renewing certificate...")
        self.expiry = datetime.datetime.now() + datetime.timedelta(days=365)
        print(f"Renewed. New expiry: {self.expiry:%Y-%m-%d %H:%M:%S}")
        return True

    def renew(self):
        days = self.check_expiry()
        if days < 7:
            print("CRITICAL: Certificate expiring soon, immediate renewal required")
            return self._issue()
        elif days < 30:
            print("Expiring within 30 days, automated renewal triggered")
            return self._issue()
        else:
            print("Certificate is current, no action needed")
            return False

cert = CertificateRenewalAutomation("sync.dodatech.com")
cert.renew()
Expected output (the renewal timestamp depends on when you run it):
Certificate: sync.dodatech.com
Days until expiry: 29
Expiring within 30 days, automated renewal triggered
Renewing certificate...
Renewed. New expiry: 2027-06-23 14:00:00
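The class above simulates the expiry date. With a real certificate, the date would come from the certificate's `notAfter` field, which Python's `ssl` module reports as a string such as `Jan 31 00:00:00 2030 GMT`. The sketch below converts such a string to days remaining using `ssl.cert_time_to_seconds`; the `days_until_expiry` helper and the sample dates are assumptions, and fetching the certificate from a live server is left out.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Whole days between now and a certificate's notAfter timestamp."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return int((expiry_epoch - now) // 86400)


# Fixed reference point so the example is reproducible
reference = ssl.cert_time_to_seconds("Jan  1 00:00:00 2030 GMT")
print(days_until_expiry("Jan 31 00:00:00 2030 GMT", now=reference))  # prints 30
```

Passing `now` explicitly keeps the function easy to test; in production it defaults to the current time.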
Toil Budget
Just as you have an error budget for reliability, you should have a toil budget for operational work. If toil exceeds 50 percent of team time, the team is not sustainable. Use the toil budget to decide when to invest in automation.
| Toil Percentage | Team Health | Action |
|---|---|---|
| Under 25 percent | Healthy | Focus on new automation |
| 25 to 50 percent | Manageable | Dedicate one sprint per quarter to automation |
| 50 to 75 percent | Stressed | Stop feature work, prioritize automation |
| Over 75 percent | Burning out | Escalate to management — team needs relief |
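The table can be turned into a small lookup so a time-tracking report can print the recommended action. This is a sketch: the function name is hypothetical, and because the bands in the table share their boundary values, assigning exactly 50 and 75 percent to the lower band is an assumption.

```python
def toil_health(toil_percent):
    """Map a toil percentage to the health band and action from the table."""
    if toil_percent < 25:
        return "Healthy", "Focus on new automation"
    if toil_percent <= 50:
        return "Manageable", "Dedicate one sprint per quarter to automation"
    if toil_percent <= 75:
        return "Stressed", "Stop feature work, prioritize automation"
    return "Burning out", "Escalate to management"


for pct in (20, 40.6, 60, 80):
    health, action = toil_health(pct)
    print(f"{pct}% toil: {health} -> {action}")
```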
Common Errors
| Error | Explanation |
|---|---|
| Automating without measuring | You cannot know if automation is reducing toil unless you measure before and after. |
| Building perfect automation | A script that handles 90 percent of cases is better than waiting for 100 percent automation. |
| Ignoring maintenance cost | Automation needs monitoring and updates. Factor this into ROI calculations. |
| Automating the wrong thing | Automate high-frequency tasks that are cheap to automate first; they give the best ROI. |
| No fallback for automation | When automation fails, the team needs a manual procedure to fall back to. |
| Automating without testing | Untested automation can cause more damage than the manual process it replaces. |
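As an example of testing automation before trusting it, the sketch below exercises a log-cleanup function against a temporary directory with backdated files. The `remove_old_logs` function and the 7-day cutoff are hypothetical stand-ins for whatever cleanup logic a team actually runs.

```python
import os
import tempfile
import time


def remove_old_logs(directory, max_age_days):
    """Delete *.log files older than max_age_days; return the names removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        is_old_log = (name.endswith(".log") and os.path.isfile(path)
                      and os.path.getmtime(path) < cutoff)
        if is_old_log:
            os.remove(path)
            removed.append(name)
    return removed


def check_remove_old_logs():
    """Build a scratch directory, run the cleanup, and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("old.log", "new.log", "old.txt"):
            open(os.path.join(tmp, name), "w").close()
        ten_days_ago = time.time() - 10 * 86400
        # Backdate two files so they look ten days old
        os.utime(os.path.join(tmp, "old.log"), (ten_days_ago, ten_days_ago))
        os.utime(os.path.join(tmp, "old.txt"), (ten_days_ago, ten_days_ago))
        removed = remove_old_logs(tmp, max_age_days=7)
        remaining = sorted(os.listdir(tmp))
    return removed, remaining


removed, remaining = check_remove_old_logs()
print("removed:", removed)
print("remaining:", remaining)
```

The check covers the two ways such a script usually goes wrong: deleting files that are too new, and deleting files that are not logs.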
Practice Questions
- What are the five characteristics of toil?
- What is the recommended maximum percentage of time spent on toil?
- How do you calculate automation ROI?
- Why should you measure toil before automating?
- What is the danger of untested automation?
Challenge
Identify three manual tasks your team performs weekly. For each task, measure the time spent per week, calculate the annual cost, estimate the build effort for automation, compute the ROI ratio, and prioritize. Write a brief proposal for the highest-ROI automation project.