Blast Radius — Minimizing Impact of Chaos Experiments
The Blast Radius in Chaos Engineering is the scope of impact a Chaos Experiment could have on users, services, or data if the experiment goes wrong. Minimizing the Blast Radius is the third core principle and the most important safety practice in Chaos Engineering.
What You Will Learn
This tutorial teaches you how to design chaos experiments with tight Blast Radius controls, how to use guardrails, and how to expand safely over time.
Why It Matters
A Chaos Experiment that takes down production for thousands of users defeats the purpose of Chaos Engineering. The goal is to find weaknesses safely. Controlling the Blast Radius ensures that when something goes wrong — and it will — the damage stays contained.
Real-World Use
At DodaTech, every chaos experiment targets a single replica of a single service in a pool of at least five replicas. The experiment stops automatically if the error rate rises more than 5 percent above its baseline. This guardrail has prevented real incidents on three separate occasions.
Prerequisites
Before starting you should know:
- The core chaos engineering principles from the previous tutorials
- How Kubernetes deployments and services route traffic
- Basic monitoring and alerting concepts
- How to read YAML configuration files
Step 1: Understand Blast Radius Dimensions
Blast Radius has three dimensions you must control:
- Scope: How many instances or services are affected (one pod, one node, one availability zone).
- Duration: How long the fault lasts (seconds, minutes, hours).
- Intensity: How severe the fault is (mild latency, packet loss, complete failure).
# blast-radius-controls.yaml
# Each dimension is explicitly set before the experiment
experiment:
  scope:
    max_instances: 1
    max_services: 1
    allowed_namespaces: ["staging"]
  duration:
    max_seconds: 120
    auto_rollback: true
  intensity:
    max_latency_ms: 1000
    max_packet_loss_percent: 50
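One way to enforce these limits is to check every proposed experiment against them before it starts. The sketch below is illustrative only: the `Limits` and `Plan` classes and the `violations` function are hypothetical names, not part of any chaos tool, and the default values simply mirror the YAML above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Upper bounds mirroring blast-radius-controls.yaml."""
    max_instances: int = 1
    max_services: int = 1
    allowed_namespaces: tuple = ("staging",)
    max_seconds: int = 120
    max_latency_ms: int = 1000
    max_packet_loss_percent: int = 50


@dataclass(frozen=True)
class Plan:
    """What a proposed experiment intends to do (hypothetical shape)."""
    instances: int
    services: int
    namespace: str
    seconds: int
    latency_ms: int = 0
    packet_loss_percent: int = 0


def violations(plan: Plan, limits: Limits) -> list:
    """Return human-readable limit violations; an empty list means OK."""
    found = []
    # Scope
    if plan.instances > limits.max_instances:
        found.append("scope: too many instances")
    if plan.services > limits.max_services:
        found.append("scope: too many services")
    if plan.namespace not in limits.allowed_namespaces:
        found.append("scope: namespace not allowed")
    # Duration
    if plan.seconds > limits.max_seconds:
        found.append("duration: exceeds max_seconds")
    # Intensity
    if plan.latency_ms > limits.max_latency_ms:
        found.append("intensity: latency too high")
    if plan.packet_loss_percent > limits.max_packet_loss_percent:
        found.append("intensity: packet loss too high")
    return found
```

A plan that kills three pods in production for ten minutes would be rejected on scope and duration, while a 60-second, single-pod staging plan returns an empty list.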
Step 2: Use Guardrails and Circuit Breakers
Guardrails automatically halt experiments when predefined conditions are violated. This is your safety net.
# guardrails.yaml
# These metrics are monitored during the experiment.
# If any threshold is breached the experiment stops.
guardrails:
  error_rate_increase_max: 5
  latency_p99_increase_ms: 500
  cpu_utilization_max_percent: 85
  alert_on_breach: true
  action: abort_experiment
Expected behavior: If the error rate rises more than 5 percent above its pre-experiment baseline, or any other threshold is breached, the chaos tooling terminates the fault automatically.
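The guardrail check itself can be sketched as a comparison between a baseline snapshot and the current metrics. This is a minimal illustration, not the logic of any specific tool: the function name, the metric keys (`error_rate`, `latency_p99_ms`, `cpu_percent`), and the reading of `error_rate_increase_max` as percentage points over baseline are all assumptions.

```python
# Thresholds copied from guardrails.yaml.
LIMITS = {
    "error_rate_increase_max": 5,
    "latency_p99_increase_ms": 500,
    "cpu_utilization_max_percent": 85,
}


def breached_guardrails(baseline: dict, current: dict, limits: dict) -> list:
    """Return the names of the guardrails that the current metrics violate.

    Assumed units: error rate and CPU are percentages, latency is in ms.
    The two "increase" guardrails are measured against the baseline;
    the CPU guardrail is an absolute ceiling.
    """
    breached = []
    if current["error_rate"] - baseline["error_rate"] > limits["error_rate_increase_max"]:
        breached.append("error_rate_increase_max")
    if current["latency_p99_ms"] - baseline["latency_p99_ms"] > limits["latency_p99_increase_ms"]:
        breached.append("latency_p99_increase_ms")
    if current["cpu_percent"] > limits["cpu_utilization_max_percent"]:
        breached.append("cpu_utilization_max_percent")
    return breached


def should_abort(baseline: dict, current: dict, limits: dict) -> bool:
    """Abort as soon as any single guardrail is breached."""
    return bool(breached_guardrails(baseline, current, limits))
```

Recording the baseline before the fault is injected matters: without it, an "increase" threshold has nothing to be measured against.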
Step 3: Start Small and Expand
Progression path for Blast Radius:
- Local: Chaos experiments on your development machine using Docker Compose.
- Staging: Single pod in a staging namespace with minimal traffic.
- Canary: A single production instance behind a Canary Deployment.
- Production: Full production experiment with guardrails and team notification.
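The progression can be made explicit with a small promotion gate that moves up one level at a time. The sketch below is an assumption-laden illustration: `next_stage` is a hypothetical helper, and the default of ten consecutive clean runs is an arbitrary placeholder that each team would set for itself.

```python
# The four levels from the progression path, in order.
STAGES = ["local", "staging", "canary", "production"]


def next_stage(current: str, clean_runs: int, required_clean_runs: int = 10) -> str:
    """Return the stage the next experiment should run in.

    Promotion happens one level at a time, and only after enough
    consecutive runs have finished without a guardrail breach.
    """
    index = STAGES.index(current)  # raises ValueError for an unknown stage
    if clean_runs >= required_clean_runs and index < len(STAGES) - 1:
        return STAGES[index + 1]
    return current
```

Because the function only ever returns the current stage or the one directly above it, skipping from staging straight to production is impossible by construction.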
# Check which namespaces are safe for experiments
kubectl get namespaces --show-labels
# Expected output (abridged: the AGE column and automatically added labels are omitted):
# NAME         STATUS   LABELS
# default      Active   <none>
# staging      Active   environment=staging
# production   Active   environment=production,experiment-allowed=false
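Notice that the production namespace has experiment-allowed=false. A label enforces nothing by itself: it blocks experiment execution only if your chaos tooling or an admission policy checks it before injecting a fault. With that check in place, production stays off limits until someone explicitly changes the label.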
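A label check of this kind could look like the following sketch. It parses the LABELS column as printed by `kubectl get namespaces --show-labels` and denies by default. The function names and the policy (only namespaces labelled `environment=staging` and not opted out are allowed) are assumptions made for illustration.

```python
def parse_labels(raw: str) -> dict:
    """Parse a LABELS column such as 'a=1,b=2' into a dict."""
    raw = raw.strip()
    if raw in ("", "<none>"):
        return {}
    return dict(pair.split("=", 1) for pair in raw.split(","))


def experiment_allowed(raw_labels: str) -> bool:
    """Deny by default; allow only staging namespaces that are not opted out."""
    labels = parse_labels(raw_labels)
    if labels.get("experiment-allowed") == "false":
        return False
    return labels.get("environment") == "staging"
```

Denying unlabelled namespaces means a newly created namespace cannot become an experiment target by accident.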
Step 4: Implement Blast Radius in Chaos Mesh
Chaos Mesh allows you to constrain experiments using selectors and conditions.
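# safe-experiment.yaml
# Kills exactly one matching pod, once. Written for Chaos Mesh 2.x.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: safe-pod-kill
  namespace: staging
spec:
  action: pod-kill
  mode: one              # exactly one pod from the selected set
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: payment-service
      tier: non-critical
# Notes:
# - pod-kill is a one-shot action, so no duration is set here.
# - value is only read by the fixed, fixed-percent, and random-max-percent modes.
# - Chaos Mesh 2.x has no inline scheduler field; to repeat this daily,
#   wrap the spec in a Schedule resource.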
Apply the manifest and confirm that the resource was created:
kubectl apply -f safe-experiment.yaml
# Expected output:
# podchaos.chaos-mesh.org/safe-pod-kill created
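Before applying a manifest like this, it can help to lint the fields that bound the Blast Radius. The sketch below works on the manifest as an already-parsed Python dict, so it needs no YAML library. The function `manifest_problems` and its rules are hypothetical review checks, not something Chaos Mesh provides.

```python
def manifest_problems(manifest: dict, allowed_namespaces=("staging",)) -> list:
    """Return Blast Radius concerns found in a PodChaos-style manifest dict."""
    spec = manifest.get("spec", {})
    selector = spec.get("selector", {})
    problems = []

    # Scope: only a single pod may be targeted.
    if spec.get("mode") != "one":
        problems.append("mode is not 'one'")

    # Scope: namespaces must be listed explicitly and be on the allow list.
    namespaces = selector.get("namespaces", [])
    if not namespaces:
        problems.append("selector has no namespaces")
    elif any(ns not in allowed_namespaces for ns in namespaces):
        problems.append("selector targets a namespace outside the allowed set")

    # Scope: without label selectors, any pod in the namespace could match.
    if not selector.get("labelSelectors"):
        problems.append("selector has no labelSelectors")
    return problems


# The manifest from this step, as a parsed dict.
SAFE = {
    "kind": "PodChaos",
    "spec": {
        "action": "pod-kill",
        "mode": "one",
        "selector": {
            "namespaces": ["staging"],
            "labelSelectors": {"app": "payment-service", "tier": "non-critical"},
        },
    },
}
```

Running `manifest_problems(SAFE)` returns an empty list; changing `mode` to `all` or the namespace to `production` produces a finding that a reviewer or CI job can block on.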
Step 5: Monitor the Blast Radius in Real Time
During the experiment, watch the Blast Radius metrics dashboard and spot-check the logs of the affected service.
# Tail recent requests on the affected service's pods in the staging namespace
kubectl logs -n staging -l app=payment-service --tail=10
# Expected output showing normal request handling on remaining pods:
# [INFO] POST /payment 200 45ms
# [INFO] POST /payment 200 52ms
# [INFO] POST /payment 200 48ms
If you see 5xx errors on the remaining pods, the Blast Radius has leaked beyond the targeted pod and the guardrail should abort the experiment.
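Reading logs by eye does not scale, so the same check can be automated. The following sketch computes the share of 5xx responses from log lines in the format shown above. The line format is taken from the sample output; the function name and the choice to ignore lines that do not match are assumptions.

```python
import re

# Matches lines such as: [INFO] POST /payment 200 45ms
LINE = re.compile(
    r"^\[(?P<level>\w+)\] (?P<method>[A-Z]+) (?P<path>\S+) "
    r"(?P<status>\d{3}) (?P<ms>\d+)ms$"
)


def server_error_percent(lines) -> float:
    """Return the percentage of parsed requests that ended in a 5xx status.

    Lines that do not match the expected format are skipped.
    Returns 0.0 when no request lines were parsed.
    """
    statuses = []
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            statuses.append(int(match.group("status")))
    if not statuses:
        return 0.0
    errors = sum(1 for status in statuses if 500 <= status <= 599)
    return 100.0 * errors / len(statuses)
```

The result can be fed into the guardrail comparison as the current error rate, with a pre-experiment sample of the same logs serving as the baseline.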
Learning Path
flowchart LR
    A[Chaos Principles] --> B[Blast Radius]
    B --> C[Steady State Hypothesis]
    C --> D[Designing Experiments]
    D --> E[Game Days]
    style B fill:#f90,color:#fff
Common Errors
- Not setting a maximum duration: Without a timeout, a Fault Injection that hangs can become a permanent outage.
- Experimenting on services with fewer than two replicas: If you kill the only replica, you cause a real outage, not a controlled experiment.
- Running experiments without guardrails: Always define automated abort conditions. Manual monitoring is not reliable enough.
- Expanding Blast Radius too quickly: Spend weeks at each level (local, staging, canary, production) before moving up.
- Ignoring the Blast Radius of monitoring itself: The experiment might overwhelm your monitoring system. Ensure monitoring is resilient too.
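The replica rule above comes down to simple arithmetic: killing one pod out of five removes 20 percent of capacity, while killing one out of one removes all of it. A minimal sketch, with hypothetical function names and an assumed default of at least one surviving replica:

```python
def capacity_lost_percent(replicas: int, targeted: int = 1) -> float:
    """Share of a service's replicas removed by the experiment, in percent."""
    if replicas < 1 or not 0 <= targeted <= replicas:
        raise ValueError("need replicas >= 1 and 0 <= targeted <= replicas")
    return 100.0 * targeted / replicas


def safe_to_target(replicas: int, targeted: int = 1, min_survivors: int = 1) -> bool:
    """True if enough replicas remain to keep serving traffic."""
    return replicas - targeted >= min_survivors
```

Raising `min_survivors` is one way to encode a stricter policy, for example requiring that most of a pool stays untouched.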
Practice Questions
- What are the three dimensions of Blast Radius in chaos experiments?
- Why should you start experiments in staging rather than production?
- What is a guardrail and how does it protect against runaway experiments?
- Describe the progression path from local to production chaos experiments.
- How does Chaos Mesh constrain the Blast Radius using selectors?
Challenge
Design a Blast Radius control plan for a multi-region service with six replicas across three availability zones. Specify the guardrails, progression path, and conditions under which you would abort a production experiment.