Toil Automation — Reducing Manual Operations

DodaTech Updated 2026-06-23 8 min read

In this tutorial, you'll learn about Toil Automation. We cover key concepts, practical examples, and best practices.

Toil is manual, repetitive, automatable work that provides no lasting value. Every SRE team should measure its toil and systematically automate it away wherever the return justifies the effort.

What You'll Learn

In this tutorial, you will learn how to identify toil using the five characteristics defined by Google SRE, how to measure toil as a percentage of engineering time, how to prioritize automation projects by return on investment, and how to build automation that eliminates common operational tasks.

Why It Matters

When SRE teams spend more than 50 percent of their time on toil, they have little capacity left for reliability improvements. Manual operations are also error-prone: every manual step is an opportunity for human error. Automation reduces both toil and incident risk.

Real-World Use

DodaTech automated certificate renewal after a manual renewal was missed, causing a 45-minute outage for Doda Browser sync. The automation now renews 200 certificates monthly without human involvement. The team reduced toil from 60 percent to 25 percent over 18 months through systematic automation.

graph LR
    A[Manual Task] --> B{Can It Be Automated?}
    B -->|Yes| C[Build Automation]
    B -->|No| D[Document Procedure]
    C --> E[Deploy Automation]
    E --> F[Monitor for Errors]
    F --> G[Remove Manual Steps]
    G --> A

Prerequisites

You should understand Runbooks since automated procedures often replace runbook steps. Familiarity with Incident Response helps you prioritize automation that reduces MTTR.

What Counts as Toil?

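This tutorial uses five characteristics of toil from the Google SRE book (the book also lists a sixth: work that grows linearly as the service grows):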

| Characteristic | Description | Example |
| --- | --- | --- |
| Manual | Requires human action | SSH into a server and restart a process |
| Repetitive | Done over and over | Weekly database index rebuild |
| Automatable | Could be done by a script | User account provisioning |
| Tactical | Reactive, not strategic | Responding to the same alert daily |
| No lasting value | Fixes symptoms, not causes | Clearing disk space every week |

Toil Assessment

class ToilTask:
    def __init__(self, name, frequency_per_week, time_per_task_min):
        self.name = name
        self.frequency = frequency_per_week
        self.time_min = time_per_task_min
        self.weekly_minutes = frequency_per_week * time_per_task_min

    def is_toil(self, manual, repetitive, automatable, tactical, no_value):
        # A task counts as toil when at least four of the five characteristics apply
        score = sum([manual, repetitive, automatable, tactical, no_value])
        return score >= 4

# Each task is paired with its own assessment:
# (manual, repetitive, automatable, tactical, no_value)
tasks = [
    (ToilTask("Restart crashed worker", 3, 10), (True, True, True, True, True)),
    (ToilTask("Deploy new release", 2, 30), (True, True, True, False, True)),
    (ToilTask("Clear temp disk space", 5, 5), (True, True, True, True, True)),
    (ToilTask("Investigate new bug", 1, 120), (True, False, False, True, False)),
]

total_toil_min = 0
for t, flags in tasks:
    if t.is_toil(*flags):
        total_toil_min += t.weekly_minutes
        print(f"TOIL: {t.name} ({t.weekly_minutes} min/week)")

print(f"\nTotal toil: {total_toil_min} min/week ({total_toil_min/60:.1f} hours)")

Expected output:

TOIL: Restart crashed worker (30 min/week)
TOIL: Deploy new release (60 min/week)
TOIL: Clear temp disk space (25 min/week)

Total toil: 115 min/week (1.9 hours)

Measuring Toil

Track how the team spends its time. Google's SRE guidance is to keep toil below 50 percent of each engineer's time.

class TimeTracker:
    def __init__(self, total_hours):
        self.total = total_hours
        self.categories = {}

    def add_category(self, name, hours):
        self.categories[name] = {
            "hours": hours,
            "percent": (hours / self.total) * 100
        }

    def report(self):
        print(f"Total engineering hours: {self.total}")
        print(f"{'Category':25s} {'Hours':8s} {'Percent':10s}")
        print("-" * 45)
        for name, data in self.categories.items():
            print(f"{name:25s} {data['hours']:<8.1f} {data['percent']:<10.1f}")
        if "Toil" in self.categories and self.categories["Toil"]["percent"] > 50:
            print("\nWARNING: Toil exceeds 50 percent. Team needs automation.")

tracker = TimeTracker(160)
tracker.add_category("Toil", 65)
tracker.add_category("Feature work", 45)
tracker.add_category("Incident response", 25)
tracker.add_category("Improvements", 25)
tracker.report()

Expected output:

Total engineering hours: 160
Category                  Hours    Percent
---------------------------------------------
Toil                      65.0     40.6
Feature work              45.0     28.1
Incident response         25.0     15.6
Improvements              25.0     15.6
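
The category hours passed to a tracker like this usually come from raw time entries. The following is a minimal sketch of that aggregation step, assuming a hypothetical list of (category, hours) entries exported from whatever time-tracking tool the team uses. The function name and sample data are illustrative only.

```python
from collections import defaultdict

def summarize_time_log(entries):
    """Sum hours per category and return {category: (hours, percent)}."""
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {
        category: (hours, hours / grand_total * 100)
        for category, hours in totals.items()
    }

# Hypothetical entries for one engineer (made-up numbers)
entries = [
    ("Toil", 3.0),
    ("Feature work", 4.0),
    ("Toil", 5.0),
    ("Incident response", 2.0),
    ("Improvements", 6.0),
]

summary = summarize_time_log(entries)
for category, (hours, percent) in summary.items():
    print(f"{category:25s} {hours:<8.1f} {percent:<10.1f}")
```

Repeated categories are merged, so entries can be logged as they happen rather than pre-summed.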

Automation Prioritization

Prioritize automation by calculating time saved versus effort to build.

def automation_roi(task_name, weekly_minutes, build_hours, team_size=1):
    yearly_minutes = weekly_minutes * 52
    yearly_hours = (yearly_minutes / 60) * team_size
    savings = yearly_hours
    ratio = savings / build_hours

    print(f"Task: {task_name}")
    print(f"  Current time: {weekly_minutes} min/week")
    print(f"  Yearly cost: {yearly_hours:.1f} hours")
    print(f"  Build effort: {build_hours} hours")
    print(f"  ROI ratio: {ratio:.1f}x")
    print(f"  Priority: {'HIGH' if ratio > 5 else 'MEDIUM' if ratio > 1 else 'LOW'}")

automation_roi("Certificate renewal", 30, 8)
automation_roi("User provisioning", 60, 40)
automation_roi("Log rotation", 15, 2)

Expected output:

Task: Certificate renewal
  Current time: 30 min/week
  Yearly cost: 26.0 hours
  Build effort: 8 hours
  ROI ratio: 3.2x
  Priority: MEDIUM
Task: User provisioning
  Current time: 60 min/week
  Yearly cost: 52.0 hours
  Build effort: 40 hours
  ROI ratio: 1.3x
  Priority: MEDIUM
Task: Log rotation
  Current time: 15 min/week
  Yearly cost: 13.0 hours
  Build effort: 2 hours
  ROI ratio: 6.5x
  Priority: HIGH
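
To compare candidates side by side, the same ratio can be returned as a value and used as a sort key. The sketch below reuses the three example tasks. The payback figure (weeks of saved time needed to recover the build effort) is an added illustration and not part of the function above.

```python
def roi_ratio(weekly_minutes, build_hours, team_size=1):
    """Yearly hours saved divided by hours needed to build the automation."""
    yearly_hours = weekly_minutes * 52 / 60 * team_size
    return yearly_hours / build_hours

def payback_weeks(weekly_minutes, build_hours, team_size=1):
    """Weeks of saved time needed to recover the build effort."""
    weekly_hours = weekly_minutes / 60 * team_size
    return build_hours / weekly_hours

# (task name, minutes spent per week, estimated build hours)
candidates = [
    ("Certificate renewal", 30, 8),
    ("User provisioning", 60, 40),
    ("Log rotation", 15, 2),
]

# Highest ROI first
ranked = sorted(candidates, key=lambda c: roi_ratio(c[1], c[2]), reverse=True)
for name, weekly_minutes, build_hours in ranked:
    print(f"{name:22s} ROI {roi_ratio(weekly_minutes, build_hours):.2f}x  "
          f"payback {payback_weeks(weekly_minutes, build_hours):.0f} weeks")
```

Sorting puts log rotation first, which matches the HIGH priority it receives above.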

Building an Automation Script

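import os
import shutil
import time

def automate_disk_cleanup(path, threshold_gb, max_age_days=7):
    # disk_usage reports on the whole filesystem that contains path
    usage = shutil.disk_usage(path)
    used_gb = usage.used / (1024 ** 3)
    if used_gb > threshold_gb:
        print(f"Disk usage {used_gb:.1f}GB exceeds {threshold_gb}GB threshold")
        print("Cleaning old log files...")
        # Files last modified before this timestamp count as old
        # (the 7-day default is an illustrative choice)
        cutoff = time.time() - max_age_days * 86400
        cleaned = 0
        for f in os.listdir(path):
            filepath = os.path.join(path, f)
            # Only delete regular .log files older than the cutoff
            if f.endswith(".log") and os.path.isfile(filepath):
                if os.path.getmtime(filepath) < cutoff:
                    os.remove(filepath)
                    cleaned += 1
        print(f"Cleaned {cleaned} files")
    else:
        print(f"Disk usage {used_gb:.1f}GB is within threshold")

automate_disk_cleanup("/var/log", 10)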

Example output (the figure depends on the machine):

Disk usage 3.2GB is within threshold

Building an Automation Pipeline

An automation pipeline takes a manual task through stages from identification to full automation.

Stage 1: Identify and Measure

Find the manual task, then measure how often it occurs and how long it takes. These two numbers are the baseline for the ROI calculation.
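
One way to get that baseline is to log each occurrence of the task and derive the weekly frequency and average duration from the log. This is a rough sketch, assuming a hypothetical list of (date, minutes) records kept by hand or exported from a ticket system. The record format and sample data are made up for illustration.

```python
import datetime

def baseline(records):
    """Return (occurrences per week, average minutes, weekly minutes)."""
    if not records:
        return 0.0, 0.0, 0.0
    dates = [day for day, _ in records]
    # Observation window in weeks, counting both the first and the last day
    span_days = (max(dates) - min(dates)).days + 1
    weeks = max(span_days / 7, 1.0)
    per_week = len(records) / weeks
    avg_minutes = sum(minutes for _, minutes in records) / len(records)
    return per_week, avg_minutes, per_week * avg_minutes

# Hypothetical log: six worker restarts over a two-week window
records = [
    (datetime.date(2026, 6, 1), 12),
    (datetime.date(2026, 6, 3), 8),
    (datetime.date(2026, 6, 5), 10),
    (datetime.date(2026, 6, 8), 9),
    (datetime.date(2026, 6, 11), 11),
    (datetime.date(2026, 6, 14), 10),
]

per_week, avg_minutes, weekly_minutes = baseline(records)
print(f"{per_week:.1f} occurrences/week, {avg_minutes:.1f} min each, "
      f"{weekly_minutes:.1f} min/week")
```

The weekly minutes figure is the input an ROI calculation needs.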

Stage 2: Document

Write a runbook for the manual procedure. This ensures you understand every step and serves as a fallback when automation fails.

Stage 3: Script

Write a script that automates the manual steps. Start with a single step and expand. Run the script alongside the manual process until you trust it.
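
A dry-run flag is one way to run a script alongside the manual process: the script reports what it would do while a human still performs the step. The sketch below is illustrative. The function name and the use of a temporary directory are assumptions made for the example.

```python
import os
import tempfile

def remove_files(paths, dry_run=True):
    """Delete the given files, or only report them when dry_run is True."""
    removed = []
    for path in paths:
        if dry_run:
            print(f"[dry-run] would remove {path}")
        else:
            os.remove(path)
            print(f"removed {path}")
            removed.append(path)
    return removed

# Demonstrate against a throwaway directory so nothing real is touched
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "old.log")
    with open(target, "w") as handle:
        handle.write("stale data\n")

    remove_files([target], dry_run=True)
    still_there = os.path.exists(target)   # the dry run leaves the file in place

    remove_files([target], dry_run=False)
    gone = not os.path.exists(target)      # the real run deletes it

print(still_there, gone)
```

Defaulting dry_run to True means a careless invocation reports instead of deleting.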

Stage 4: Integrate

Integrate the script into your alerting and incident response pipeline. When an alert fires, the script runs automatically.
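
In practice the integration often comes down to a mapping from alert names to remediation functions. This is a minimal sketch, assuming hypothetical alert names, alert fields, and handlers. A real setup would connect this to whatever webhook or hook mechanism the alerting system provides.

```python
def restart_worker(alert):
    # Placeholder remediation; a real handler would call the service manager
    return f"restarted worker on {alert['host']}"

def clean_temp_files(alert):
    # Placeholder remediation for a disk-space alert
    return f"cleaned temp files on {alert['host']}"

# Hypothetical alert names mapped to their automated remediation
HANDLERS = {
    "WorkerCrashed": restart_worker,
    "DiskSpaceLow": clean_temp_files,
}

def handle_alert(alert):
    """Run the matching remediation, or escalate when none is registered."""
    handler = HANDLERS.get(alert["name"])
    if handler is None:
        return f"no automation for {alert['name']}, paging on-call"
    return handler(alert)

print(handle_alert({"name": "WorkerCrashed", "host": "worker-3"}))
print(handle_alert({"name": "ReplicationLag", "host": "db-1"}))
```

Alerts without a registered handler still reach a human, so adding automation never silences an unknown problem.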

Stage 5: Monitor

Monitor the automation for errors. If it fails, alert the on-call team. Track the number of successful automated runs versus failures.
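
A thin wrapper around the automation can do both jobs: count outcomes and notify on failure. The sketch below assumes a hypothetical notify callable standing in for the team's paging system. The class and function names are illustrative.

```python
class MonitoredAutomation:
    """Wrap an automation callable and track successes and failures."""

    def __init__(self, name, action, notify):
        self.name = name
        self.action = action
        self.notify = notify      # called with a message when a run fails
        self.successes = 0
        self.failures = 0

    def run(self, *args, **kwargs):
        try:
            result = self.action(*args, **kwargs)
        except Exception as exc:
            self.failures += 1
            self.notify(f"{self.name} failed: {exc}")
            return None
        self.successes += 1
        return result

    def success_rate(self):
        total = self.successes + self.failures
        return self.successes / total if total else 0.0

def flaky_cleanup(should_fail):
    # Stand-in for a real automation step
    if should_fail:
        raise RuntimeError("disk not mounted")
    return "ok"

pages = []   # stand-in for the on-call paging system
job = MonitoredAutomation("disk-cleanup", flaky_cleanup, pages.append)
for should_fail in (False, False, True, False):
    job.run(should_fail)

print(f"{job.successes} ok, {job.failures} failed, "
      f"success rate {job.success_rate():.0%}")
print(pages)
```

Catching the exception inside run keeps one failed run from stopping the scheduler that calls it, while the notification makes sure the failure is not silent.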

Stage 6: Remove Manual Path

Once you are confident the automation works, retire the manual procedure as the default path in your runbooks. Keep it documented and clearly marked as the fallback for when automation fails.

Example: Automating Certificate Renewal

Certificate renewal is a common toil task. Let us build the automation stages for this example.

import datetime

class CertificateRenewalAutomation:
    def __init__(self, domain, days_until_expiry=30):
        self.domain = domain
        # Simulated expiry; a real check would read the certificate's expiry date
        self.expiry = datetime.datetime.now() + datetime.timedelta(days=days_until_expiry)

    def check_expiry(self):
        # timedelta.days truncates, so a certificate created with 30 days
        # left reports 29 once any time has passed
        days_remaining = (self.expiry - datetime.datetime.now()).days
        print(f"Certificate: {self.domain}")
        print(f"Days until expiry: {days_remaining}")
        return days_remaining

    def _issue_new_certificate(self):
        # Placeholder for the real renewal call
        print("Renewing certificate...")
        self.expiry = datetime.datetime.now() + datetime.timedelta(days=365)
        print(f"Renewed. New expiry: {self.expiry:%Y-%m-%d}")

    def renew(self):
        days = self.check_expiry()
        if days < 7:
            print("CRITICAL: Certificate expiring soon, immediate renewal required")
        elif days <= 30:
            print("Expiring within 30 days, automated renewal triggered")
        else:
            print("Certificate is current, no action needed")
            return False
        self._issue_new_certificate()
        return True

cert = CertificateRenewalAutomation("sync.dodatech.com")
cert.renew()

Expected output (for a run on 2026-06-23; the new expiry is the run date plus 365 days):

Certificate: sync.dodatech.com
Days until expiry: 29
Expiring within 30 days, automated renewal triggered
Renewing certificate...
Renewed. New expiry: 2027-06-23
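
The example above simulates the expiry date. For a real certificate, the expiry comes from its notAfter field, and Python's standard ssl module can parse that timestamp format with ssl.cert_time_to_seconds. The sketch below applies a 30-day threshold to a hypothetical inventory. The domains and dates are made up, and the reference time is fixed so the result is repeatable.

```python
import datetime
import ssl

def days_until(not_after, now):
    """Whole days between `now` (UTC) and a certificate notAfter string."""
    expiry_ts = ssl.cert_time_to_seconds(not_after)
    expiry = datetime.datetime.fromtimestamp(expiry_ts, tz=datetime.timezone.utc)
    return (expiry - now).days

def due_for_renewal(inventory, now, threshold_days=30):
    """Return the domains whose certificate expires within threshold_days."""
    return sorted(
        domain for domain, not_after in inventory.items()
        if days_until(not_after, now) <= threshold_days
    )

# Hypothetical inventory; the strings use the notAfter format found in certificates
inventory = {
    "sync.example.com": "Jul 10 12:00:00 2026 GMT",
    "api.example.com": "Dec 31 23:59:59 2026 GMT",
    "www.example.com": "Jun 25 00:00:00 2026 GMT",
}

# Fixed reference time; a scheduled job would use the current UTC time instead
now = datetime.datetime(2026, 6, 23, 12, 0, 0, tzinfo=datetime.timezone.utc)
print(due_for_renewal(inventory, now))
```

Running a check like this on a schedule over the whole inventory is what removes the risk of a single forgotten renewal.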

Toil Budget

Just as you have an error budget for reliability, you should have a toil budget for operational work. If toil exceeds 50 percent of team time, the team is not sustainable. Use the toil budget to decide when to invest in automation.

| Toil Percentage | Team Health | Action |
| --- | --- | --- |
| Under 25 percent | Healthy | Focus on new automation |
| 25 to 50 percent | Manageable | Dedicate one sprint per quarter to automation |
| 50 to 75 percent | Stressed | Stop feature work, prioritize automation |
| Over 75 percent | Burning out | Escalate to management; team needs relief |
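
The budget table can be turned into a small check that runs against each period's time-tracking numbers. This is a sketch under one stated assumption: the table's ranges meet at 50 and 75 percent, so exactly 50 is treated here as Manageable and exactly 75 as Stressed. The function name is illustrative.

```python
def toil_budget_status(toil_hours, total_hours):
    """Map a toil percentage to the health level and action from the table."""
    percent = toil_hours / total_hours * 100
    if percent < 25:
        status, action = "Healthy", "Focus on new automation"
    elif percent <= 50:   # assumption: the 50 percent boundary counts as Manageable
        status, action = "Manageable", "Dedicate one sprint per quarter to automation"
    elif percent <= 75:   # assumption: the 75 percent boundary counts as Stressed
        status, action = "Stressed", "Stop feature work, prioritize automation"
    else:
        status, action = "Burning out", "Escalate to management"
    return percent, status, action

# 65 toil hours out of 160, the figures used in the time-tracking example
percent, status, action = toil_budget_status(65, 160)
print(f"{percent:.1f}% toil: {status}. {action}")
```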

Common Errors

| Error | Explanation |
| --- | --- |
| Automating without measuring | You cannot know if automation is reducing toil unless you measure before and after. |
| Building perfect automation | A script that handles 90 percent of cases is better than waiting for 100 percent automation. |
| Ignoring maintenance cost | Automation needs monitoring and updates. Factor this into ROI calculations. |
| Automating the wrong thing | Automate high-frequency, low-effort tasks first for the best ROI. |
| No fallback for automation | When automation fails, the team needs a manual procedure to fall back to. |
| Automating without testing | Untested automation can cause more damage than the manual process it replaces. |
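
One way to factor maintenance into ROI is to subtract the expected yearly upkeep from the yearly savings before dividing by the build effort. The sketch below shows that adjustment. The maintenance figures are made-up illustrations, and the function name is an assumption.

```python
def net_roi(weekly_minutes, build_hours, yearly_maintenance_hours, team_size=1):
    """ROI ratio after subtracting yearly maintenance from yearly savings."""
    yearly_saved = weekly_minutes * 52 / 60 * team_size
    net_saved = yearly_saved - yearly_maintenance_hours
    return net_saved / build_hours

# Maintenance hours per year are hypothetical
print(f"Log rotation:      {net_roi(15, 2, 1):.1f}x")
print(f"User provisioning: {net_roi(60, 40, 20):.1f}x")
```

A ratio below 1 means the automation does not pay back its build effort within the first year once upkeep is counted.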

Practice Questions

  1. What are the five characteristics of toil?
  2. What is the recommended maximum percentage of time spent on toil?
  3. How do you calculate automation ROI?
  4. Why should you measure toil before automating?
  5. What is the danger of untested automation?

Challenge

Identify three manual tasks your team performs weekly. For each task, measure the time spent per week, calculate the annual cost, estimate the build effort for automation, compute the ROI ratio, and prioritize. Write a brief proposal for the highest-ROI automation project.

FAQ

What is toil in SRE?

Toil is manual, repetitive, automatable work that does not provide lasting value. Examples include restarting services, clearing disk space, and manual deployments.

How much toil is acceptable?

Google SRE recommends teams spend less than 50 percent of time on toil. Below 25 percent is ideal.

How do you start reducing toil?

Measure current toil, identify the highest-frequency tasks, automate the simplest ones first, and iterate.

Should all toil be automated?

Not necessarily. Some tasks are too rare or too complex to automate. Focus on high-frequency, low-complexity tasks.

What is the risk of automation?

Automation can fail silently, introduce bugs, or become outdated. Always monitor automated processes and maintain fallback procedures.
