Steady State Hypothesis — Defining Normal Behavior
In this tutorial, you'll learn about the Steady State hypothesis: the key concepts, practical examples, and best practices you need to apply it effectively.
The Steady State hypothesis is the core of every Chaos Engineering experiment. It defines what normal looks like for your system and predicts what will happen when you inject a fault. Without a Steady State hypothesis, you cannot determine whether your experiment passed or failed.
What You Will Learn
This tutorial teaches you how to write effective Steady State hypotheses, select appropriate metrics, and interpret experiment results against your hypothesis.
Why It Matters
A well-written hypothesis turns Chaos Engineering from random breaking into a scientific discipline. It forces you to think carefully about what your system should do under stress and makes your experiments reproducible and testable.
Real-World Use
Before every experiment, DodaTech engineers write a hypothesis card with four parts: the fault to inject, the expected system behavior, the reasoning, and the metrics that will confirm or reject the hypothesis. A peer reviews this card before the experiment runs.
Prerequisites
Before starting, you should understand:
- The core chaos engineering principles
- How to define blast radius for experiments
- Basic Prometheus query language (PromQL)
- How microservices handle failures with retries and timeouts
Step 1: Identify Measurable Metrics
Choose metrics that reflect real user experience. These are your Service Level Indicators (SLIs).
# sli-definitions.yaml
# These metrics represent the user-facing health of the system
sli:
  - name: request_latency_p99
    query: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
    normal_range: "0.1 to 0.5"   # seconds
  - name: error_rate
    query: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
    normal_range: "0 to 0.01"    # ratio of failed requests
  - name: throughput
    query: sum(rate(http_requests_total[5m]))
    normal_range: "100 to 500"   # requests per second
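Expected Prometheus output for the baseline p99 latency query (the value is in seconds; the labels shown depend on how your service is instrumented):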
{quantile="0.99"} 0.342
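The normal_range values above are plain strings, so a baseline sample has to be parsed before it can be checked against them. The following is a minimal Python sketch, assuming the "LOW to HIGH" string format used in sli-definitions.yaml. The helper names parse_range and in_normal_range are hypothetical and not part of any tool.

```python
def parse_range(text: str) -> tuple[float, float]:
    """Parse a 'LOW to HIGH' string such as '0.1 to 0.5' into two floats."""
    low, high = text.split(" to ")
    return float(low), float(high)


def in_normal_range(value: float, normal_range: str) -> bool:
    """Return True when value lies inside the inclusive normal range."""
    low, high = parse_range(normal_range)
    return low <= value <= high


# Baseline p99 latency from the example output, in seconds.
print(in_normal_range(0.342, "0.1 to 0.5"))  # True
# An error ratio of 2% falls outside the "0 to 0.01" range.
print(in_normal_range(0.02, "0 to 0.01"))    # False
```

Treating the range as inclusive at both ends is an assumption; adjust the comparison if your team defines the boundaries differently.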
Step 2: Write a Falsifiable Hypothesis
A hypothesis must be structured so it can be proven wrong. Use this template:
If we inject [FAULT] into [SERVICE/TARGET],
then [METRIC] will stay within [THRESHOLD]
because [REASONING].
# Hypothesis template for documentation
cat <<'HYPOTHESIS'
If we kill 1 of 3 replicas of the payment-service,
then the p99 latency will stay below 500ms and error rate below 1%
because the remaining 2 replicas can handle the full traffic load
and the load balancer removes unhealthy pods within 5 seconds.
HYPOTHESIS
Expected output: The hypothesis is printed to stdout and can be saved to a file for team review.
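If you prefer to keep hypotheses as structured data rather than free text, the template can be filled in programmatically. This is a rough Python sketch under that assumption; the Hypothesis class and its field names are hypothetical and only mirror the placeholders in the template. It rejects empty fields so that no part of the template can be skipped.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    fault: str      # [FAULT]: what is injected
    target: str     # [SERVICE/TARGET]: where it is injected
    metric: str     # [METRIC]: what is measured
    threshold: str  # [THRESHOLD]: the bound the metric must respect
    reasoning: str  # [REASONING]: why the system should cope

    def __post_init__(self) -> None:
        # Every part of the template is mandatory.
        for name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"{name} must not be empty")

    def render(self) -> str:
        """Fill the If / then / because template."""
        return (
            f"If we inject {self.fault} into {self.target},\n"
            f"then {self.metric} will stay within {self.threshold}\n"
            f"because {self.reasoning}."
        )


card = Hypothesis(
    fault="a pod kill of 1 of 3 replicas",
    target="the payment-service",
    metric="p99 latency",
    threshold="500ms",
    reasoning="the remaining 2 replicas can handle the full traffic load",
)
print(card.render())
```

Keeping the metric and threshold as separate fields makes it harder to write a hypothesis without a number to test against.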
Step 3: Set Thresholds
Thresholds define the boundary between normal and degraded. Set them based on historical data, not guesswork.
# thresholds.yaml
thresholds:
  p99_latency:
    warning: 400ms
    critical: 1000ms
  error_rate:
    warning: 1%
    critical: 5%
  throughput_degradation:
    warning: 20%
    critical: 50%
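To show how the warning and critical levels split observations into bands, here is a small Python sketch. It assumes the "400ms" and "1%" notation from thresholds.yaml and assumes that a value equal to a threshold already counts as crossing it. The helpers to_number and classify are hypothetical names.

```python
def to_number(text: str) -> float:
    """Convert '400ms' to seconds and '1%' to a ratio (hypothetical helper)."""
    if text.endswith("ms"):
        return float(text[:-2]) / 1000.0
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)


def classify(value: float, warning: str, critical: str) -> str:
    """Return 'normal', 'warning', or 'critical' for an observed value."""
    if value >= to_number(critical):
        return "critical"
    if value >= to_number(warning):
        return "warning"
    return "normal"


# p99 latency of 0.412 seconds against the 400ms / 1000ms thresholds.
print(classify(0.412, "400ms", "1000ms"))  # warning
# Error ratio of 0.003 (0.3%) against the 1% / 5% thresholds.
print(classify(0.003, "1%", "5%"))         # normal
```

All three thresholds in the file are of the higher-is-worse kind, so one comparison direction covers them; a metric where lower is worse would need the comparisons reversed.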
Step 4: Run the Experiment and Compare
Execute the experiment, give the fault time to take effect, and then compare the observed metrics against the hypothesis.
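# Run the experiment
kubectl apply -f payment-pod-kill.yaml
# Check latency after 30 seconds.
# -G with --data-urlencode sends the PromQL as an encoded query string,
# so curl does not try to interpret the [1m] brackets as a URL range.
sleep 30
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))' \
  | jq '.data.result[].value[1]'
# Expected output if hypothesis holds:
# "0.412"
# 0.412 seconds is 412ms, below the 500ms threshold, so the hypothesis is confirmed.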
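The comparison itself can also be automated instead of read off the terminal. The sketch below is one possible approach in Python: it takes the JSON text returned by the Prometheus instant query endpoint and checks every returned series against a threshold in seconds. The function name hypothesis_holds and the sample payload are illustrative assumptions, and the sample is hard-coded so the example runs without a Prometheus server.

```python
import json


def hypothesis_holds(response_text: str, threshold_seconds: float) -> bool:
    """Return True when every series in the response is below the threshold."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    results = payload["data"]["result"]
    if not results:
        # No data is not the same as a passing experiment.
        raise ValueError("query returned no series; cannot evaluate hypothesis")
    # Each sample is [timestamp, "value"]; the value arrives as a string.
    return all(float(series["value"][1]) < threshold_seconds for series in results)


# Illustrative response with the value from the expected output above.
sample = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1700000000,"0.412"]}]}}'
)
print(hypothesis_holds(sample, 0.5))  # True
```

A value of "NaN", which histogram_quantile can return when there are no observations, fails the comparison and is therefore treated as a rejected hypothesis rather than a confirmed one.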
Step 5: Document the Result
Record whether the hypothesis was confirmed or rejected and what the team learned.
# experiment-report.yaml
experiment:
  hypothesis: "Kill 1 of 3 payment-service pods, p99 stays under 500ms"
  result: confirmed
  observed_p99: 412ms
  observed_error_rate: 0.3%
  findings: |
    The system handled the failure correctly. No cascading issues detected.
    Recommended: Reduce the readiness probe interval from 10s to 5s
    for faster recovery.
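Deriving the result field from the observed numbers, rather than typing it by hand, keeps the report consistent with the data. A minimal Python sketch of that idea follows. The build_report function and its parameters are hypothetical, the limits come from the example hypothesis (500ms and 1%), and the report is printed as JSON only to avoid a third-party YAML dependency.

```python
import json


def build_report(hypothesis: str, observed_p99_ms: float, observed_error_pct: float,
                 p99_limit_ms: float, error_limit_pct: float, findings: str) -> dict:
    """Build a report dict whose result follows from the observed values."""
    confirmed = observed_p99_ms < p99_limit_ms and observed_error_pct < error_limit_pct
    return {
        "experiment": {
            "hypothesis": hypothesis,
            "result": "confirmed" if confirmed else "rejected",
            "observed_p99": f"{observed_p99_ms:g}ms",
            "observed_error_rate": f"{observed_error_pct:g}%",
            "findings": findings,
        }
    }


report = build_report(
    hypothesis="Kill 1 of 3 payment-service pods, p99 stays under 500ms",
    observed_p99_ms=412,
    observed_error_pct=0.3,
    p99_limit_ms=500,
    error_limit_pct=1,
    findings="The system handled the failure correctly. No cascading issues detected.",
)
print(json.dumps(report, indent=2))
```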
Learning Path
flowchart LR
    A[Blast Radius] --> B[Steady State Hypothesis]
    B --> C[Designing Experiments]
    C --> D[Game Days]
    D --> E[Chaos Mesh]
    style B fill:#f90,color:#fff
Common Errors
- Writing a hypothesis that cannot be falsified: "The system might be impacted" is not a hypothesis. "Error rate stays below 1 percent" is testable.
- Using non-quantifiable metrics: "The system feels slow" is not measurable. "p99 latency exceeds 500ms" is measurable.
- Forgetting the reasoning clause: The "because" part is crucial. It documents your architectural understanding.
- Setting thresholds too loosely: A threshold of "error rate under 100 percent" passes everything and reveals nothing.
- Ignoring the baseline: Always collect baseline data before the experiment. Historical data alone may not reflect current traffic patterns.
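Some of these errors can be caught with a rough automated check before peer review. The Python sketch below is a heuristic only, not a substitute for a reviewer: it flags hypothesis text that has no number, no "because" clause, or vague wording. The function lint_hypothesis and the VAGUE_WORDS list are hypothetical, and a text can pass while still having a threshold that is too loose.

```python
import re

# Words that usually signal an unfalsifiable statement (illustrative list).
VAGUE_WORDS = ("might", "may", "feels", "probably")


def lint_hypothesis(text: str) -> list[str]:
    """Return a list of problems found in a hypothesis statement."""
    problems = []
    lowered = text.lower()
    if not re.search(r"\d", text):
        problems.append("no numeric threshold")
    if "because" not in lowered:
        problems.append("missing reasoning clause")
    vague = [word for word in VAGUE_WORDS if re.search(rf"\b{word}\b", lowered)]
    if vague:
        problems.append("vague wording: " + ", ".join(vague))
    return problems


print(lint_hypothesis("The system might be impacted"))
# ['no numeric threshold', 'missing reasoning clause', 'vague wording: might']
```

The digit check only confirms that some number is present, so a sentence such as "kill 1 of 3 replicas" with no metric bound would still pass it.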
Practice Questions
- What are the four parts of a Steady State hypothesis?
- Why must a hypothesis be falsifiable?
- How do you choose appropriate metric thresholds?
- What does it mean if an experiment confirms the hypothesis?
- How should you document a rejected hypothesis?
Challenge
Write a Steady State hypothesis for a database connection pool exhaustion experiment. Define the SLIs, set specific thresholds, include the reasoning clause, and describe what data would confirm versus reject the hypothesis.