Chaos Engineering Principles — Steady State & Hypothesis
This tutorial covers the principles of Chaos Engineering: the key concepts, practical examples, and best practices you need to apply them.
Chaos Engineering principles provide the scientific framework that separates genuine resilience testing from random system abuse. The two most important concepts are the steady state (what normal looks like) and the hypothesis (what you expect to happen when a fault is introduced).
What You Will Learn
This tutorial walks through the foundational principles of Chaos Engineering in five steps, with a focus on steady state measurement and hypothesis formulation.
Why It Matters
Without a clear steady state definition, your chaos experiments have no baseline for comparison. Without a hypothesis, you cannot prove or disprove anything. These principles turn chaos from vandalism into science.
Real-World Use
The DodaTech infrastructure team defines steady state for Durga Antivirus Pro as p99 latency under 200ms, error rate below 0.1 percent, and CPU utilization under 70 percent. Every chaos experiment is measured against these three metrics.
Prerequisites
Before starting, you should be familiar with:
- The basics of chaos engineering from the overview tutorial
- How microservices communicate over a network
- Basic monitoring concepts (latency, throughput, error budgets)
- Prometheus or a similar metrics system
Step 1: Define Steady State
Steady state is a measurable indicator that the system is operating normally. Choose metrics that reflect user experience, not just infrastructure health.
# steady-state-metrics.yaml
# Define the SLIs (Service Level Indicators) you will measure
slis:
  latency:
    metric: http_request_duration_seconds
    p99_target: 0.5
  errors:
    metric: http_requests_total{status=~"5.."}
    rate_target: 0.001
  throughput:
    metric: http_requests_total
    min_rate: 100
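Expected baseline values, assuming http_request_duration_seconds is exported as a Prometheus histogram and both queries use a 5-minute window:
# p99 latency in seconds
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# => 0.342
# share of requests that returned a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# => 0.0003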
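The comparison of measured values against the SLI targets can be sketched in a few lines of Python. This is a minimal illustration, not part of any monitoring tool: the names `SteadyStateTargets` and `steady_state_violations` are hypothetical, the targets are copied from steady-state-metrics.yaml, and the throughput sample of 250 requests per second is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyStateTargets:
    """Targets copied from steady-state-metrics.yaml (hypothetical container)."""
    p99_latency_s: float = 0.5         # p99_target
    max_error_rate: float = 0.001      # rate_target
    min_throughput_rps: float = 100.0  # min_rate


def steady_state_violations(p99_latency_s, error_rate, throughput_rps,
                            targets=SteadyStateTargets()):
    """Return a list of violations; an empty list means steady state holds."""
    violations = []
    if p99_latency_s > targets.p99_latency_s:
        violations.append(
            f"p99 latency {p99_latency_s}s exceeds {targets.p99_latency_s}s")
    if error_rate > targets.max_error_rate:
        violations.append(
            f"error rate {error_rate} exceeds {targets.max_error_rate}")
    if throughput_rps < targets.min_throughput_rps:
        violations.append(
            f"throughput {throughput_rps} req/s is below "
            f"{targets.min_throughput_rps} req/s")
    return violations


# Baseline latency and error rate from the example above;
# the throughput value is an assumed sample.
print(steady_state_violations(0.342, 0.0003, 250.0))  # prints []
```

Returning the list of violations rather than a single boolean makes it clear which metric left its steady state range when an experiment goes wrong.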
Step 2: Form a Hypothesis
A good hypothesis predicts a measurable outcome and gives the reason for it, so that the experiment can prove it wrong: "If I inject 500ms latency into the payment service, then the checkout flow will experience p99 latency under 2 seconds because the timeout is configured at 3 seconds."
# Example hypothesis statement template
cat <<EOF
Hypothesis: If we kill one pod of the payment service,
then the error rate will stay below 1 percent
because the deployment runs three replicas with a
readiness probe that removes unhealthy pods from the
load balancer pool.
EOF
Expected output: the command prints the hypothesis text. Document it and share it with the team before the experiment.
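One way to keep a hypothesis falsifiable is to record it as data with an explicit threshold. The sketch below is an assumption about how a team might do this; the `Hypothesis` class and its fields are hypothetical names, and the values restate the pod-kill hypothesis from the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    fault: str        # what we inject
    metric: str       # what we observe
    threshold: float  # the observed value must stay below this
    rationale: str    # why we expect the system to cope

    def statement(self) -> str:
        """Render the hypothesis in the if/then/because form."""
        return (f"If we {self.fault}, then {self.metric} will stay below "
                f"{self.threshold} because {self.rationale}.")

    def is_refuted_by(self, observed: float) -> bool:
        """True when the observation contradicts the prediction."""
        return observed >= self.threshold


pod_kill = Hypothesis(
    fault="kill one pod of the payment service",
    metric="the error rate",
    threshold=0.01,  # 1 percent
    rationale="the deployment runs three replicas behind a readiness probe",
)
print(pod_kill.statement())
print(pod_kill.is_refuted_by(0.004))  # prints False
print(pod_kill.is_refuted_by(0.03))   # prints True
```

Because the threshold is a number rather than a phrase, there is always some observation that would refute the hypothesis.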
Step 3: Minimize Blast Radius
The blast radius should be proportional to your confidence level. Early experiments target a single instance in a non-critical service. As confidence grows, you can expand the scope.
# Check replica count to understand blast radius
kubectl get deployment payment-service -n production -o jsonpath='{.spec.replicas}'
# Expected output:
# 3
With three replicas, killing one pod affects at most one third of traffic. The remaining two replicas should absorb the load.
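The arithmetic behind this estimate can be written out as a small helper. This is a rough sketch that assumes traffic is spread evenly across replicas; `blast_radius` is a hypothetical function name.

```python
def blast_radius(replicas: int, killed: int = 1):
    """Return (fraction of capacity removed, load multiplier per survivor).

    Assumes the load balancer spreads traffic evenly across replicas.
    """
    if replicas <= 0 or not 0 <= killed < replicas:
        raise ValueError("need at least one surviving replica")
    affected_fraction = killed / replicas
    survivor_load_factor = replicas / (replicas - killed)
    return affected_fraction, survivor_load_factor


affected, load = blast_radius(3)
print(f"{affected:.0%} of capacity removed, survivors carry {load:.2f}x load")
# prints: 33% of capacity removed, survivors carry 1.50x load
```

The second number matters as much as the first: each surviving replica has to handle 1.5 times its normal load, so the experiment is only safe if the replicas have that much headroom.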
Step 4: Run the Experiment
Execute the experiment while monitoring the steady state metrics. Automate the process using a chaos platform if possible.
# Using Chaos Mesh to inject a pod kill
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-pod-kill-test
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: ["production"]
    labelSelectors:
      app: payment-service
  duration: 60s
EOF
Expected output:
podchaos.chaos-mesh.org/payment-pod-kill-test created
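Monitoring during the run is easier to trust when it is automated and can abort the experiment. The following is a minimal sketch of such a guard, not a feature of Chaos Mesh: `run_with_guard` and its parameters are hypothetical, and the inject, rollback, and metric-reading steps are passed in as plain callables. In a real setup they might apply and delete the PodChaos manifest and query Prometheus.

```python
import time
from typing import Callable


def run_with_guard(inject: Callable[[], None],
                   rollback: Callable[[], None],
                   read_error_rate: Callable[[], float],
                   abort_above: float,
                   checks: int,
                   interval_s: float = 1.0) -> bool:
    """Inject a fault, poll the error rate, and abort early on a breach.

    Returns True if every check stayed at or below abort_above.
    rollback() always runs, even if inject() or a check raises.
    """
    try:
        inject()
        for _ in range(checks):
            if read_error_rate() > abort_above:
                return False  # breach: stop polling, rollback runs in finally
            time.sleep(interval_s)
        return True
    finally:
        rollback()


# Usage with stand-in callables and made-up error rate samples.
samples = iter([0.001, 0.004, 0.02])
events = []
ok = run_with_guard(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    read_error_rate=lambda: next(samples),
    abort_above=0.01,
    checks=3,
    interval_s=0.0,
)
print(ok, events)  # prints: False ['inject', 'rollback']
```

Putting the rollback in a `finally` block means the fault is removed whether the experiment passes, aborts, or crashes.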
Step 5: Analyze and Learn
Compare the metrics recorded during and after the experiment against the steady state baseline. If your hypothesis was correct, document the confirmation. If the system behaved unexpectedly, investigate the root cause.
# Compare metrics before vs during the experiment
# (p99_latency is assumed to be a recording rule defined in your Prometheus)
# Before:
curl -s 'http://prometheus:9090/api/v1/query?query=p99_latency'
# {"status":"success","data":{"result":[{"value":[1718970000,"0.342"]}]}}
#
# During (expected within hypothesis):
curl -s 'http://prometheus:9090/api/v1/query?query=p99_latency'
# {"status":"success","data":{"result":[{"value":[1718970060,"0.412"]}]}}
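The before-and-during comparison can also be done in code instead of by eye. This sketch parses the two sample responses shown above with Python's standard `json` module; `instant_value` is a hypothetical helper, and the 0.5 second limit is the `p99_target` from the Step 1 steady state file.

```python
import json


def instant_value(response_body: str) -> float:
    """Extract the first sample value from a Prometheus instant query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    results = payload["data"]["result"]
    if not results:
        raise ValueError("query returned no series")
    # Each sample is [unix_timestamp, "value-as-string"].
    _timestamp, value = results[0]["value"]
    return float(value)


before = instant_value(
    '{"status":"success","data":{"result":[{"value":[1718970000,"0.342"]}]}}')
during = instant_value(
    '{"status":"success","data":{"result":[{"value":[1718970060,"0.412"]}]}}')

LIMIT_S = 0.5  # p99_target from steady-state-metrics.yaml
print(f"p99 rose by {during - before:.3f}s; within limit: {during <= LIMIT_S}")
# prints: p99 rose by 0.070s; within limit: True
```

Here the latency rose during the fault but stayed inside the steady state target, which is the kind of result you would document as a confirmation.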
Learning Path
flowchart LR
    A[Chaos Engineering Overview] --> B[Chaos Principles]
    B --> C[Blast Radius]
    C --> D[Steady State Hypothesis]
    D --> E[Designing Experiments]
    style B fill:#f90,color:#fff
Common Errors
- Defining steady state too broadly: Use specific quantifiable metrics like p99 latency, not vague terms like "system feels fast."
- Confirmation bias in hypotheses: State a falsifiable hypothesis. If you cannot be proven wrong you are not doing science.
- Forgetting to revert the fault: Always include an automatic rollback or duration limit. A stuck fault becomes a real outage.
- Skipping the baseline measurement: You need before-and-after data. Without a baseline you have no reference point.
- Running experiments without monitoring: If you cannot observe the impact in real time you cannot learn from the experiment.
Practice Questions
- What three metrics would you choose to define the steady state of an API gateway?
- Write a hypothesis for a network latency injection experiment on a database service.
- Why is minimizing blast radius the third principle and not the first?
- How do you determine whether a chaos experiment passed or failed?
- How does the scientific method relate to Chaos Engineering principles?
Challenge
Write a complete Chaos Engineering experiment plan for a Redis cache layer. Define the steady state metrics, state a falsifiable hypothesis, specify the fault injection, and describe what result would confirm or reject the hypothesis.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro