Cost Efficiency in SRE — Balancing Spend and Reliability
Cost efficiency in SRE means spending infrastructure money where it most improves reliability and reducing spending where excess capacity provides no meaningful reliability benefit — it is about optimizing the cost-to-reliability ratio, not minimizing either one in isolation.
What You'll Learn
In this tutorial, you will learn how to calculate the cost of reliability, how to use cost-aware SLOs to make trade-off decisions, how to right-size resources without sacrificing SLO targets, and how to implement auto-scaling and spot instances to reduce waste.
Why It Matters
Cloud infrastructure is typically one of the largest costs for a technology company, often second only to personnel. SRE teams control many of the infrastructure decisions that drive it. Without cost awareness, teams over-provision for peak traffic, run idle capacity, and can waste millions. With cost awareness, they can often reduce spending by 30-50 percent while maintaining or improving reliability.
Real-World Use
DodaTech reduced its annual cloud infrastructure cost by 40 percent by implementing right-sizing and intelligent auto-scaling. DodaZIP storage uses spot instances for compression workers, saving 60 percent compared to on-demand instances. The Doda Browser sync team identified that running at 99.99 percent availability instead of 99.999 percent saved 35 percent in infrastructure costs while meeting customer needs.
graph LR
A[Service] --> B[Set SLO]
B --> C[Required Capacity]
C --> D[Cost Calculation]
D --> E{Can We Optimize?}
E -->|Right-size| F[Reduce Waste]
E -->|Auto-scale| G[Match Demand]
E -->|Spot/Reserved| H[Lower Unit Cost]
F --> I[Optimized Cost]
G --> I
H --> I
I --> J{Within Budget?}
J -->|Yes| K[Maintain]
J -->|No| A
Prerequisites
Understanding SLIs and SLOs helps you connect cost decisions to reliability targets. Familiarity with Capacity Planning is essential since capacity decisions directly affect cost.
The Cost of Reliability
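Reliability has diminishing returns. Moving from 99 percent to 99.9 percent is relatively cheap. Moving from 99.99 percent to 99.999 percent can cost 10 times more for an improvement of only 0.009 percentage points, which is roughly 47 fewer minutes of allowed downtime per year. The example below uses a simpler toy model in which each additional nine doubles the cost.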
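import math


def reliability_cost(availability_target):
    # Toy model: each nine beyond 99% doubles the relative cost.
    base_cost = 1000
    # Count nines beyond 99%: 99.9 -> 1, 99.99 -> 2, 99.999 -> 3.
    # (Subtracting 99 from the percentage would give 0.9, 0.99, ... instead.)
    nines = round(-math.log10(1 - availability_target / 100)) - 2
    multiplier = 2 ** nines
    cost = base_cost * multiplier
    print(f"Target: {availability_target}% ({nines} nines)")
    print(f"Relative cost: ${cost:,}/month")
    return cost


reliability_cost(99.0)
reliability_cost(99.9)
reliability_cost(99.99)
reliability_cost(99.999)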
Expected output:
Target: 99.0% (0 nines)
Relative cost: $1,000/month
Target: 99.9% (1 nines)
Relative cost: $2,000/month
Target: 99.99% (2 nines)
Relative cost: $4,000/month
Target: 99.999% (3 nines)
Relative cost: $8,000/month
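To make the diminishing-returns point concrete, it can help to translate each availability target into the downtime it allows. The following is a minimal sketch; the function name `allowed_downtime_minutes` and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget (in minutes) implied by an availability target."""
    unavailability = 1 - availability_percent / 100
    return period_minutes * unavailability


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:,.1f} minutes of downtime per year")
```

Each additional nine removes 90 percent of the remaining downtime budget, so the absolute number of minutes recovered shrinks with every step while the cost keeps growing.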
Cost-Aware SLOs
Different services need different reliability levels. A payment service may justify 99.99 percent or higher, while an internal reporting dashboard can run at 99 percent. The SRE team must help product owners understand the cost implications of each additional nine so they can make informed trade-offs between customer experience, budget, and engineering effort. As a rough rule of thumb, each additional nine approximately doubles the infrastructure cost for that service.
import math


class CostAwareSLO:
    def __init__(self, service_name, slo, monthly_cost):
        self.service = service_name
        self.slo = slo
        self.monthly_cost = monthly_cost

    def cost_per_nine(self):
        # Nines beyond 99%: 99.9 -> 1, 99.99 -> 2.
        # At 99% there are none, so report the full monthly cost.
        nines = round(-math.log10(1 - self.slo / 100)) - 2
        return self.monthly_cost / nines if nines > 0 else self.monthly_cost

    def report(self):
        print(f"Service: {self.service}")
        print(f" SLO: {self.slo}%")
        print(f" Monthly cost: ${self.monthly_cost:,}")
        print(f" Cost per nine: ${self.cost_per_nine():,.0f}")


services = [
    CostAwareSLO("Payment API", 99.99, 15000),
    CostAwareSLO("Admin Dashboard", 99.0, 2000),
    CostAwareSLO("Notification Service", 99.9, 5000),
]

for s in services:
    s.report()
    print()
Expected output:
Service: Payment API
SLO: 99.99%
Monthly cost: $15,000
Cost per nine: $7,500
Service: Admin Dashboard
SLO: 99.0%
Monthly cost: $2,000
Cost per nine: $2,000
Service: Notification Service
SLO: 99.9%
Monthly cost: $5,000
Cost per nine: $5,000
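One way to frame the trade-off for product owners is to compare the extra infrastructure cost of a stricter SLO with the downtime cost it avoids. The sketch below assumes a 30-day month and a known cost per minute of downtime; the function name and all figures are hypothetical planning inputs, not measured data.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def extra_nine_is_worth_it(current_slo, target_slo, extra_monthly_cost,
                           downtime_cost_per_minute):
    """Compare the cost of a stricter SLO with the downtime cost it avoids."""
    current_budget = MINUTES_PER_MONTH * (1 - current_slo / 100)
    target_budget = MINUTES_PER_MONTH * (1 - target_slo / 100)
    avoided_minutes = current_budget - target_budget
    avoided_cost = avoided_minutes * downtime_cost_per_minute
    return avoided_cost > extra_monthly_cost, avoided_minutes, avoided_cost


# Hypothetical: moving from 99.9% to 99.99% costs an extra $6,000/month.
for cost_per_minute in (500, 100):
    worth_it, minutes, avoided = extra_nine_is_worth_it(99.9, 99.99, 6000, cost_per_minute)
    print(f"Downtime at ${cost_per_minute}/min: avoids {minutes:.2f} min "
          f"(${avoided:,.0f}) -> worth it: {worth_it}")
```

With these assumed numbers the upgrade pays off only when a minute of downtime is expensive, which is the conversation a cost-aware SLO is meant to prompt.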
Resource Right-Sizing
Right-sizing means matching instance types and counts to actual workload requirements rather than a theoretical peak. A server averaging 20 percent utilization leaves 80 percent of its paid capacity idle. Right-sizing recovers most of that spending without affecting performance, provided enough headroom remains for normal traffic peaks.
def right_sizing_recommendation(current_instance, current_cost, avg_utilization):
    print(f"Current instance: {current_instance} (${current_cost}/month)")
    print(f"Average utilization: {avg_utilization}%")
    estimated_savings = 0
    if avg_utilization < 30:
        recommendation = "DOWNSCALE to smaller instance"
        # Assumes the next size down costs roughly half as much.
        estimated_savings = current_cost * 0.5
    elif avg_utilization < 60:
        recommendation = "OPTIMAL size — no change needed"
    elif avg_utilization < 85:
        recommendation = "MONITOR — approaching capacity limits"
    else:
        recommendation = "UPSCALE — risk of capacity exhaustion"
    print(f"Recommendation: {recommendation}")
    # Always report savings so that "no change" cases show $0 explicitly.
    print(f"Estimated monthly savings: ${estimated_savings:.0f}")
    print()


right_sizing_recommendation("m5.large", 120, 25)
right_sizing_recommendation("m5.xlarge", 240, 55)
Expected output:
Current instance: m5.large ($120/month)
Average utilization: 25%
Recommendation: DOWNSCALE to smaller instance
Estimated monthly savings: $60
Current instance: m5.xlarge ($240/month)
Average utilization: 55%
Recommendation: OPTIMAL size — no change needed
Estimated monthly savings: $0
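Average utilization can hide short peaks, so a right-sizing decision is often based on a high percentile instead. The following sketch sizes a machine so that its observed 95th-percentile CPU load lands at a target utilization. The helper names, the nearest-rank percentile method, and the 60 percent target are assumptions chosen for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommended_vcpus(current_vcpus, cpu_samples_percent, target_utilization=60):
    """Size to the observed p95 load so that it lands at the target utilization."""
    p95 = percentile(cpu_samples_percent, 95)
    return max(1, math.ceil(current_vcpus * p95 / target_utilization))


# Hypothetical CPU utilization samples (percent) from an 8-vCPU instance.
samples = [12, 15, 18, 20, 22, 25, 19, 17, 30, 16]
print(f"p95 utilization: {percentile(samples, 95)}%")
print(f"Recommended vCPUs: {recommended_vcpus(8, samples)}")
```

Sizing to a percentile plus headroom keeps the recommendation tied to observed demand while leaving room for normal peaks.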
Auto-Scaling Cost Optimization
Auto-scaling saves money by matching capacity to demand. The key metric is the ratio of peak to baseline traffic.
def autoscaling_savings(baseline_instances, peak_instances, hours_at_peak, hourly_rate=0.10):
    hours_per_month = 720  # 30 days x 24 hours
    # hours_at_peak is a daily figure, so convert it to hours per month.
    peak_hours_per_month = hours_at_peak * 30
    always_on_cost = peak_instances * hours_per_month * hourly_rate
    baseline_cost = baseline_instances * hours_per_month * hourly_rate
    # Extra instances run only during peak hours.
    burst_cost = (peak_instances - baseline_instances) * peak_hours_per_month * hourly_rate
    auto_scaled_cost = baseline_cost + burst_cost
    savings = always_on_cost - auto_scaled_cost
    print(f"Baseline: {baseline_instances} instances")
    print(f"Peak: {peak_instances} instances ({hours_at_peak}h/day at peak)")
    print(f"Cost (always on): ${always_on_cost:,.0f}/month")
    print(f"Cost (auto-scale): ${auto_scaled_cost:,.0f}/month")
    print(f"Savings: ${savings:,.0f}/month ({savings / always_on_cost:.1%})")


autoscaling_savings(5, 20, 4)
Expected output:
Baseline: 5 instances
Peak: 20 instances (4h/day at peak)
Cost (always on): $1,440/month
Cost (auto-scale): $540/month
Savings: $900/month (62.5%)
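Matching capacity to demand usually comes down to a scaling rule. A common one is proportional scaling toward a target utilization, which is the same basic formula the Kubernetes Horizontal Pod Autoscaler documents. The sketch below is a simplified version under that assumption; the function name and bounds are illustrative, and real autoscalers add tolerances and cooldown periods.

```python
import math


def desired_instances(current_instances, current_utilization, target_utilization,
                      min_instances, max_instances):
    """Proportional scaling rule, clamped to the configured bounds."""
    raw = math.ceil(current_instances * current_utilization / target_utilization)
    return max(min_instances, min(max_instances, raw))


# Hypothetical fleet bounded between 5 and 20 instances, targeting 60% CPU.
print(desired_instances(5, 90, 60, 5, 20))    # scale out under load
print(desired_instances(10, 20, 60, 5, 20))   # scale in, but not below the minimum
print(desired_instances(10, 150, 60, 5, 20))  # capped at the maximum
```

The minimum bound protects the SLO during quiet periods, and the maximum bound caps spending if a metric misbehaves.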
Spot and Reserved Instances
| Purchase Model | Discount | Risk | Best For |
|---|---|---|---|
| On-demand | 0 percent | None | Variable workloads |
| Reserved (1yr) | 30-40 percent | Commitment | Steady-state workloads |
| Reserved (3yr) | 50-60 percent | Commitment | Long-term stable workloads |
| Spot | 60-90 percent | Can be interrupted | Fault-tolerant batch workloads |
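To see how the purchase models combine, the sketch below prices a fleet by instance-hours per model. The discount values are assumed midpoints of the ranges in the table above, and the hourly rate and workload split are hypothetical.

```python
# Assumed discounts: midpoints of the ranges in the table above.
DISCOUNTS = {
    "on_demand": 0.0,
    "reserved_1yr": 0.35,
    "reserved_3yr": 0.55,
    "spot": 0.75,
}


def blended_monthly_cost(instance_hours_by_model, on_demand_hourly_rate):
    """Sum the monthly cost across purchase models, given instance-hours per model."""
    total = 0.0
    for model, hours in instance_hours_by_model.items():
        total += hours * on_demand_hourly_rate * (1 - DISCOUNTS[model])
    return total


# Hypothetical: 5 steady instances (3,600 h) plus 1,800 h of burst capacity.
all_on_demand = blended_monthly_cost({"on_demand": 5400}, 0.10)
mixed = blended_monthly_cost({"reserved_1yr": 3600, "on_demand": 1800}, 0.10)
print(f"All on-demand: ${all_on_demand:,.0f}/month")
print(f"Reserved baseline + on-demand burst: ${mixed:,.0f}/month")
```

The steady baseline is the natural candidate for reserved pricing, while the variable portion stays on a model without commitment.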
Unit Economics of Reliability
Understanding unit economics helps SRE teams make data-driven cost decisions. The key question is: how much does it cost to serve one request reliably?
For the Doda Browser sync service, the calculation might be:
- Monthly infrastructure cost: $12,000
- Monthly requests: 40 million
- Cost per request: $0.0003
- Cost to add one nine of reliability: approximately $6,000/month extra
With this data, the team can decide whether the business value of moving from 99.9 percent to 99.99 percent availability justifies the additional $72,000 per year in infrastructure costs.
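The arithmetic above can be reproduced in a few lines. This is a minimal sketch using the figures from the Doda Browser sync example; the function name `unit_economics` is an assumption.

```python
def unit_economics(monthly_cost, monthly_requests, extra_cost_per_nine):
    """Return current and projected cost per request plus the annual cost of one more nine."""
    cost_per_request = monthly_cost / monthly_requests
    projected_cost_per_request = (monthly_cost + extra_cost_per_nine) / monthly_requests
    annual_extra = extra_cost_per_nine * 12
    return cost_per_request, projected_cost_per_request, annual_extra


current, projected, annual_extra = unit_economics(12_000, 40_000_000, 6_000)
print(f"Current cost per request: ${current:.5f}")
print(f"Cost per request with one more nine: ${projected:.5f}")
print(f"Additional annual cost: ${annual_extra:,}")
```

Expressing the extra nine as a change in cost per request makes it easier to compare against the revenue or value of each request.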
Building a Cost Allocation Strategy
Cost allocation ensures that infrastructure spending is correctly attributed to the teams and services that generate it. Without allocation, you cannot identify which services are over-spending or which optimizations provide the best return.
The standard approach is to tag every cloud resource with service name, environment, team, and cost center. Automated policies enforce tagging at deployment time. Reports are generated weekly showing cost by service and team.
class CostAllocation:
    def __init__(self):
        self.resources = []

    def add_resource(self, resource_id, service, env, monthly_cost):
        self.resources.append({
            "id": resource_id,
            "service": service,
            "env": env,
            "cost": monthly_cost
        })

    def report_by_service(self):
        # Sum monthly cost per service, then list the most expensive first.
        costs = {}
        for r in self.resources:
            costs.setdefault(r["service"], 0)
            costs[r["service"]] += r["cost"]
        print("Cost by Service:")
        for svc, cost in sorted(costs.items(), key=lambda x: -x[1]):
            print(f" {svc}: ${cost:,.0f}/month")

    def report_by_environment(self):
        # Same aggregation, grouped by environment instead of service.
        costs = {}
        for r in self.resources:
            costs.setdefault(r["env"], 0)
            costs[r["env"]] += r["cost"]
        print("\nCost by Environment:")
        for env, cost in sorted(costs.items(), key=lambda x: -x[1]):
            print(f" {env}: ${cost:,.0f}/month")


alloc = CostAllocation()
alloc.add_resource("web-1", "doda-browser", "prod", 1200)
alloc.add_resource("web-2", "doda-browser", "prod", 1200)
alloc.add_resource("worker-1", "durga-antivirus", "prod", 800)
alloc.add_resource("cache-1", "doda-browser", "staging", 200)
alloc.report_by_service()
alloc.report_by_environment()
Expected output:
Cost by Service:
doda-browser: $2,600/month
durga-antivirus: $800/month
Cost by Environment:
prod: $3,200/month
staging: $200/month
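Allocation reports are only as good as the tags behind them, so the tagging policy described earlier is usually checked before deployment. The sketch below shows one possible check. The tag keys mirror the four named in the text, while the key spellings, function names, and sample resources are hypothetical.

```python
REQUIRED_TAGS = ("service", "environment", "team", "cost_center")


def missing_tags(resource_tags):
    """Return the required tag keys that are absent or empty."""
    return [key for key in REQUIRED_TAGS if not resource_tags.get(key)]


def validate_deployment(resources):
    """Map resource id -> missing tags for every non-compliant resource."""
    violations = {}
    for resource_id, tags in resources.items():
        missing = missing_tags(tags)
        if missing:
            violations[resource_id] = missing
    return violations


planned = {
    "web-3": {"service": "doda-browser", "environment": "prod",
              "team": "sync", "cost_center": "cc-100"},
    "worker-2": {"service": "durga-antivirus", "environment": "prod"},
}
print(validate_deployment(planned))
```

A deployment pipeline could refuse to proceed while the returned dictionary is non-empty.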
FinOps and SRE Collaboration
FinOps is the practice of bringing financial accountability to cloud spending. SRE teams work with FinOps by providing engineering data about capacity needs, scaling behavior, and reliability requirements. FinOps provides cost data that helps SRE teams make trade-off decisions.
The key FinOps principles that apply to SRE include:
- Visibility: Every engineer can see the cost impact of their infrastructure decisions.
- Optimization: Teams actively reduce waste through right-sizing, auto-scaling, and choosing the right purchasing model.
- Governance: Policies prevent uncontrolled spending, such as deploying expensive instance types without approval.
- Continuous improvement: Cost optimization is not a one-time project but an ongoing practice reviewed monthly.
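The visibility and governance principles above can be combined in a single check that reports the monthly cost of a request and flags whether it needs approval. This is a sketch only: the threshold, instance type names, and approver name are hypothetical, and a 720-hour month is assumed.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours
APPROVAL_THRESHOLD_HOURLY = 1.00  # hypothetical: pricier instance types need sign-off


def review_request(instance_type, hourly_rate, count, approved_by=None):
    """Apply a simple governance rule and report the monthly cost impact."""
    monthly_cost = hourly_rate * count * HOURS_PER_MONTH
    needs_approval = hourly_rate > APPROVAL_THRESHOLD_HOURLY
    allowed = not needs_approval or approved_by is not None
    return {
        "instance_type": instance_type,
        "monthly_cost": monthly_cost,
        "needs_approval": needs_approval,
        "allowed": allowed,
    }


print(review_request("small-general", 0.10, 4))
print(review_request("large-gpu", 3.00, 2))
print(review_request("large-gpu", 3.00, 2, approved_by="finops-lead"))
```

Returning the monthly cost alongside the decision gives the engineer the visibility the first principle calls for, even when the request is allowed.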
Elastic vs Provisioned Capacity
Understanding the difference between elastic and provisioned capacity helps you choose the right cost model.
| Model | Description | Cost Characteristic | Best For |
|---|---|---|---|
| Elastic (serverless) | Pay per invocation | High per-unit cost, zero idle cost | Variable or low traffic |
| Provisioned (instances) | Pay per hour | Low per-unit cost, pays for idle | Steady or high traffic |
For example, AWS Lambda bills per request and for execution duration rather than for idle hours. A service handling 1 million requests per month costs very little. The same service handling 1 billion requests per month would likely be cheaper on provisioned EC2 instances with reserved pricing.
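The crossover point between the two models can be estimated with a break-even calculation. The sketch below uses hypothetical prices, not published rates: an all-in serverless price of $5.00 per million requests and provisioned instances at $0.10 per hour. Fleet sizes at each volume are also assumed.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours


def serverless_monthly_cost(requests, price_per_million):
    """Elastic model: cost scales directly with request volume."""
    return requests / 1_000_000 * price_per_million


def provisioned_monthly_cost(instances, hourly_rate, hours=HOURS_PER_MONTH):
    """Provisioned model: cost depends on fleet size, not on traffic."""
    return instances * hourly_rate * hours


def break_even_requests(instances, hourly_rate, price_per_million):
    """Monthly request volume at which both models cost the same."""
    return provisioned_monthly_cost(instances, hourly_rate) / price_per_million * 1_000_000


# Assumed fleets: 2 instances suffice at 1M requests, 20 are needed at 1B.
for requests, instances in ((1_000_000, 2), (1_000_000_000, 20)):
    elastic = serverless_monthly_cost(requests, 5.00)
    fixed = provisioned_monthly_cost(instances, 0.10)
    print(f"{requests:,} requests: serverless ${elastic:,.0f} vs provisioned ${fixed:,.0f}")

print(f"Break-even for a 2-instance fleet: {break_even_requests(2, 0.10, 5.00):,.0f} requests")
```

Below the break-even volume the elastic model wins because idle hours cost nothing, and above it the lower per-unit cost of provisioned capacity dominates.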
Cost Optimization Without Reliability Impact
Not all cost optimization reduces reliability. These optimizations are safe:
- Right-sizing instances: Running 25 percent utilized instances is wasteful. Downsize without affecting performance.
- Eliminating idle resources: Stale load balancers, unattached storage volumes, and unused IP addresses cost money without providing any reliability benefit.
- Choosing the right storage tier: Frequently accessed data belongs on SSD. Infrequently accessed data belongs on cold storage tiers.
- Optimizing data transfer: Data transfer between regions is expensive. Keep data in the same region where possible.
def savings_without_risk(downsize_candidates, idle_resources, tier_optimizations):
    # Each argument is an estimated monthly saving in dollars for that category.
    total = downsize_candidates + idle_resources + tier_optimizations
    print("Risk-Free Cost Optimization Opportunities")
    print(f" Right-size candidates: {downsize_candidates}")
    print(f" Idle resources: {idle_resources}")
    print(f" Tier optimizations: {tier_optimizations}")
    print(f" Total monthly savings: ${total:,}")


savings_without_risk(1200, 450, 600)
Expected output:
Risk-Free Cost Optimization Opportunities
Right-size candidates: 1200
Idle resources: 450
Tier optimizations: 600
Total monthly savings: $2,250
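The idle-resource figure in such a summary has to come from somewhere, typically an inventory export. The sketch below flags resources that are not attached to anything and totals their cost. The inventory structure, field names, and prices are hypothetical.

```python
def find_idle_resources(inventory):
    """Return the unattached resources and their combined monthly cost."""
    idle = [r for r in inventory if r["attached_to"] is None]
    monthly_waste = sum(r["monthly_cost"] for r in idle)
    return idle, monthly_waste


inventory = [
    {"id": "vol-a", "kind": "volume", "attached_to": "web-1", "monthly_cost": 40},
    {"id": "vol-b", "kind": "volume", "attached_to": None, "monthly_cost": 40},
    {"id": "ip-a", "kind": "ip_address", "attached_to": None, "monthly_cost": 4},
    {"id": "lb-a", "kind": "load_balancer", "attached_to": None, "monthly_cost": 18},
]

idle, waste = find_idle_resources(inventory)
for resource in idle:
    print(f"Idle {resource['kind']}: {resource['id']} (${resource['monthly_cost']}/month)")
print(f"Total idle spend: ${waste}/month")
```

In practice an unattached resource may still be needed, for example a volume kept as a backup, so a list like this is best treated as a set of candidates for review.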
Measuring Cost per Request
The most useful metric for SRE cost optimization is cost per request. It normalizes spending against traffic volume and reveals whether efficiency is improving or degrading over time.
def cost_per_request(monthly_cost, requests_per_month):
    # Normalize spend by traffic so services of different sizes can be compared.
    cpr = monthly_cost / requests_per_month
    print(f"Monthly cost: ${monthly_cost:,.0f}")
    print(f"Requests: {requests_per_month:,}/month")
    print(f"Cost per request: ${cpr:.6f}")


cost_per_request(15000, 50000000)
cost_per_request(3000, 5000000)
Expected output:
Monthly cost: $15,000
Requests: 50,000,000/month
Cost per request: $0.000300
Monthly cost: $3,000
Requests: 5,000,000/month
Cost per request: $0.000600
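Because the value of this metric lies in its trend, it helps to track the month-over-month change rather than a single snapshot. The following sketch does that for a hypothetical three-month history; the function name and all figures are illustrative.

```python
def cost_per_request_trend(history):
    """history: list of (month, monthly_cost, requests) tuples.

    Returns (month, cost_per_request, percent_change) rows, where percent_change
    is None for the first month.
    """
    rows = []
    previous = None
    for month, cost, requests in history:
        cpr = cost / requests
        change = None if previous is None else (cpr - previous) / previous * 100
        rows.append((month, cpr, change))
        previous = cpr
    return rows


history = [
    ("Jan", 15_000, 50_000_000),
    ("Feb", 16_000, 64_000_000),
    ("Mar", 18_000, 60_000_000),
]
for month, cpr, change in cost_per_request_trend(history):
    trend = "n/a" if change is None else f"{change:+.1f}%"
    print(f"{month}: ${cpr:.6f} per request ({trend})")
```

In this example total spend rises every month, yet efficiency improves in February and degrades in March, which is the kind of signal raw cost alone would hide.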
Common Errors
| Error | Explanation |
|---|---|
| Over-provisioning for peak | Provisioning for theoretical maximum peak wastes money. Use auto-scaling. |
| Ignoring spot instances | Spot instances offer massive savings for fault-tolerant workloads like batch processing. |
| One-size-fits-all SLO | Different services need different reliability levels. Do not over-engineer non-critical services. |
| No cost visibility | If engineers do not see the cost of their decisions, they will over-provision. Show cost in dashboards. |
| Reserved instances for spiky workloads | Reserved instances are best for steady-state workloads. Spot or on-demand is better for variable loads. |
| Not tagging resources | Untagged resources cannot be tracked to a team or service. Tag everything. |
Practice Questions
- Why does reliability have diminishing returns on investment?
- What is right-sizing and how does it reduce costs?
- How can auto-scaling reduce infrastructure costs?
- When should you use spot instances vs reserved instances?
- Why should different services have different SLOs?
Challenge
You manage the DodaTech infrastructure budget. The team runs three services: Doda Browser sync (99.99 percent SLO, $20,000/month), internal admin tools (99 percent SLO, $3,000/month), and batch analytics (no SLO, $8,000/month). Recommend cost optimization strategies for each service, estimate potential savings, and write a brief justification for each recommendation.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro