Steady State Hypothesis — Defining Normal Behavior

DodaTech Updated 2026-06-21 4 min read

In this tutorial, you'll learn how to define a Steady State Hypothesis for a chaos experiment. It walks through the metrics, thresholds, and reports involved, with a worked example at each step.

The Steady State hypothesis is the core of every Chaos Engineering experiment. It defines what normal looks like for your system and predicts what will happen when you inject a fault. Without a Steady State hypothesis, you cannot determine whether your experiment passed or failed.

What You Will Learn

This tutorial teaches you how to write effective Steady State hypotheses, select appropriate metrics, and interpret experiment results against your hypothesis.

Why It Matters

A well-written hypothesis turns Chaos Engineering from random breaking into a scientific discipline. It forces you to think carefully about what your system should do under stress and makes your experiments reproducible and testable.

Real-World Use

Before every experiment, DodaTech engineers write a hypothesis card with four parts: the fault to inject, the expected system behavior, the reasoning, and the metrics that will confirm or reject the hypothesis. A peer reviews the card before the experiment runs.

Prerequisites

Before starting, you should understand:

The core chaos engineering principles
How to define blast radius for experiments
Basic Prometheus query language (PromQL)
How microservices handle failures with retries and timeouts

Step 1: Identify Measurable Metrics

Choose metrics that reflect real user experience. These are your Service Level Indicators (SLIs).

# sli-definitions.yaml
# These metrics represent the user-facing health of the system
sli:
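  - name: request_latency_p99
    # Aggregate buckets by le so the quantile covers all instances (result in seconds)
    query: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
    normal_range: "0.1 to 0.5"
  - name: error_rate
    # Fraction of requests answered with a 5xx status (0.01 = 1%)
    query: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
    normal_range: "0 to 0.01"
  - name: throughput
    # Requests per second across all instances
    query: sum(rate(http_requests_total[5m]))
    normal_range: "100 to 500"

Expected Prometheus output for the baseline p99 latency query (value in seconds):

{} 0.342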
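As a rough illustration of how the normal_range strings above could be checked against a measured value, here is a minimal Python sketch. The Sli class and the parse_range helper are hypothetical names invented for this example; only the metric names and ranges come from sli-definitions.yaml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sli:
    """One Service Level Indicator with its inclusive normal range (hypothetical helper)."""
    name: str
    low: float
    high: float

    def in_normal_range(self, value: float) -> bool:
        """Return True when the value lies inside the inclusive normal range."""
        return self.low <= value <= self.high


def parse_range(text: str) -> tuple[float, float]:
    """Parse a 'LOW to HIGH' string such as '0.1 to 0.5' into two floats."""
    low, high = (float(part) for part in text.split(" to "))
    return low, high


# Ranges copied from sli-definitions.yaml.
NORMAL_RANGES = {
    "request_latency_p99": "0.1 to 0.5",  # seconds
    "error_rate": "0 to 0.01",            # fraction of requests
    "throughput": "100 to 500",           # requests per second
}

SLIS = {name: Sli(name, *parse_range(text)) for name, text in NORMAL_RANGES.items()}

if __name__ == "__main__":
    # The baseline reading of 0.342 s sits inside the 0.1 to 0.5 s range.
    print(SLIS["request_latency_p99"].in_normal_range(0.342))
```

Keeping the unit next to each range is deliberate: Prometheus reports latency in seconds, while the hypothesis text talks in milliseconds, and mixing the two is an easy mistake.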

Step 2: Write a Falsifiable Hypothesis

A hypothesis must be structured so it can be proven wrong. Use this template:

If we inject [FAULT] into [SERVICE/TARGET],
then [METRIC] will stay within [THRESHOLD]
because [REASONING].

# Hypothesis template for documentation
cat <<'HYPOTHESIS'
If we kill 1 of 3 replicas of the payment-service,
then the p99 latency will stay below 500ms and error rate below 1%
because the remaining 2 replicas can handle the full traffic load
and the load balancer removes unhealthy pods within 5 seconds.
HYPOTHESIS

Expected output: The hypothesis is printed to stdout and can be saved to a file for team review.
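If you prefer to keep hypotheses as structured data rather than free text, the template can be sketched as a small Python class. HypothesisCard and its field names are assumptions made for this example, not part of any chaos tooling. The sketch follows the template wording literally and refuses to render a card with an empty field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HypothesisCard:
    """Hypothetical container for the fields of the hypothesis template."""
    fault: str
    target: str
    metric: str
    threshold: str
    reasoning: str

    def render(self) -> str:
        """Fill the template, rejecting cards that leave any field blank."""
        missing = [name for name, value in vars(self).items() if not value.strip()]
        if missing:
            raise ValueError(f"hypothesis card is missing: {', '.join(missing)}")
        return (
            f"If we inject {self.fault} into {self.target},\n"
            f"then {self.metric} will stay within {self.threshold}\n"
            f"because {self.reasoning}."
        )


if __name__ == "__main__":
    card = HypothesisCard(
        fault="a pod kill on 1 of 3 replicas",
        target="the payment-service",
        metric="p99 latency",
        threshold="500ms",
        reasoning="the remaining 2 replicas can handle the full traffic load",
    )
    print(card.render())
```

Raising an error on an empty reasoning field is a design choice: it makes it hard to skip the "because" clause by accident.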

Step 3: Set Thresholds

Thresholds define the boundary between normal and degraded. Set them based on historical data, not guesswork.

# thresholds.yaml
thresholds:
  p99_latency:
    warning: 400ms
    critical: 1000ms
  error_rate:
    warning: 1%
    critical: 5%
  throughput_degradation:
    warning: 20%
    critical: 50%
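To show how these two-level thresholds might be applied to a reading, here is a minimal Python sketch. The numbers are copied from thresholds.yaml, with the units moved into the key names. The classify function is a hypothetical helper, and treating a value exactly on a threshold as a breach is an assumption made here.

```python
# Values copied from thresholds.yaml; units are encoded in the key names.
THRESHOLDS = {
    "p99_latency_ms": {"warning": 400.0, "critical": 1000.0},
    "error_rate_pct": {"warning": 1.0, "critical": 5.0},
    "throughput_degradation_pct": {"warning": 20.0, "critical": 50.0},
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a reading of the given metric.

    Assumption: a value equal to a threshold counts as reaching it.
    """
    limits = THRESHOLDS[metric]
    if value >= limits["critical"]:
        return "critical"
    if value >= limits["warning"]:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for metric, value in [("p99_latency_ms", 412.0), ("error_rate_pct", 0.3)]:
        print(metric, value, classify(metric, value))
```

A p99 of 412ms classifies as a warning here, even though it is below the 500ms limit used in the Step 2 hypothesis. Alerting thresholds and the pass/fail limit of a hypothesis are separate numbers, so state explicitly which one an experiment is judged against.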

Step 4: Run the Experiment and Compare

Execute the experiment and immediately compare the observed metrics against the hypothesis.

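# Run the experiment
kubectl apply -f payment-pod-kill.yaml

# Wait 30 seconds, then check p99 latency over the last minute.
# -G with --data-urlencode encodes the PromQL and avoids curl's [] URL globbing.
sleep 30
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[1m])))' \
  | jq '.data.result[].value[1]'
# Expected output if hypothesis holds:
# "0.412"
# 0.412 s is 412ms, below the 500ms threshold, so the latency part of the hypothesis holds.
# Run the error_rate query the same way before calling the hypothesis confirmed.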
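The same comparison can be scripted instead of read by eye. The sketch below parses a Prometheus instant-query response and checks the p99 value against the 500ms limit. SAMPLE_RESPONSE is a hand-written payload in the shape the HTTP API returns, with a made-up timestamp. It is not captured output, and extract_seconds and latency_holds are hypothetical helper names.

```python
import json

# Hand-written example in the shape of a Prometheus instant-query response.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {}, "value": [1750500000.0, "0.412"]}
    ]
  }
}
"""


def extract_seconds(body: str) -> float:
    """Return the single sample value (in seconds) from an instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected exactly one series, got {len(result)}")
    _timestamp, raw_value = result[0]["value"]
    return float(raw_value)  # Prometheus encodes sample values as strings


def latency_holds(body: str, limit_ms: float = 500.0) -> bool:
    """True when the observed p99 is strictly below the limit in milliseconds."""
    observed_ms = extract_seconds(body) * 1000.0
    return observed_ms < limit_ms


if __name__ == "__main__":
    print(latency_holds(SAMPLE_RESPONSE))
```

The sketch insists on exactly one series, so a query that returns one result per instance fails loudly instead of silently checking only the first. A NaN value, which histogram_quantile produces when there is no traffic, compares as not below the limit, so the check fails rather than passing on missing data.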

Step 5: Document the Result

Record whether the hypothesis was confirmed or rejected and what the team learned.

# experiment-report.yaml
experiment:
  hypothesis: "Kill 1 of 3 payment-service pods, p99 stays under 500ms"
  result: confirmed
  observed_p99: 412ms
  observed_error_rate: 0.3%
  findings: |
    The system handled the failure correctly. No cascading issues detected.
    Recommended: Reduce the readiness probe interval from 10s to 5s
    for faster recovery.
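One way to keep the confirmed or rejected verdict consistent with the numbers is to derive it from the observations instead of typing it by hand. The following Python sketch does that for upper-limit metrics. The build_report function and its key names (p99_ms, error_rate_pct, breaches) are assumptions for this example and do not mirror the YAML report field for field.

```python
import json


def build_report(hypothesis: str, limits: dict, observed: dict) -> dict:
    """Compare each observed value with its upper limit and summarize the outcome."""
    breaches = {}
    for name, limit in limits.items():
        value = observed.get(name)
        # A metric that was never measured cannot confirm anything,
        # so a missing value is treated as a breach (an assumption of this sketch).
        if value is None or not value < limit:
            breaches[name] = value
    return {
        "hypothesis": hypothesis,
        "result": "rejected" if breaches else "confirmed",
        "observed": dict(observed),
        "breaches": breaches,
    }


if __name__ == "__main__":
    report = build_report(
        "Kill 1 of 3 payment-service pods, p99 stays under 500ms",
        limits={"p99_ms": 500.0, "error_rate_pct": 1.0},
        observed={"p99_ms": 412.0, "error_rate_pct": 0.3},
    )
    print(json.dumps(report, indent=2))
```

The findings text still has to be written by a person. The sketch only guarantees that the verdict matches the recorded measurements.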

Learning Path

flowchart LR
  A[Blast Radius] --> B[Steady State Hypothesis]
  B --> C[Designing Experiments]
  C --> D[Game Days]
  D --> E[Chaos Mesh]
  style B fill:#f90,color:#fff

Common Errors

Writing a hypothesis that cannot be falsified: "The system might be impacted" is not a hypothesis. "Error rate stays below 1 percent" is testable.
Using non-quantifiable metrics: "The system feels slow" is not measurable. "p99 latency exceeds 500ms" is measurable.
Forgetting the reasoning clause: The "because" part is crucial. It documents your architectural understanding.
Setting thresholds too loosely: A threshold of "error rate under 100 percent" passes everything and reveals nothing.
Ignoring the baseline: Always collect baseline data before the experiment. Historical data alone may not reflect current traffic patterns.
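Several of these errors can be caught by a crude text check before peer review. The Python sketch below is a heuristic only: the vague-word list, the unit list, and the lint_hypothesis name are all assumptions chosen for illustration, and a clean result does not prove a hypothesis is well designed.

```python
import re

# A number followed by a unit, e.g. "500ms", "5 s", "1%", "1 percent".
NUMBER_WITH_UNIT = re.compile(r"\d+(?:\.\d+)?\s*(?:ms\b|s\b|percent\b|%)", re.IGNORECASE)
PERCENTAGE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:percent\b|%)", re.IGNORECASE)
VAGUE_WORDS = re.compile(r"\b(might|maybe|probably|feels)\b")


def lint_hypothesis(text: str) -> list[str]:
    """Return a list of problems found in a hypothesis sentence (empty if none)."""
    problems = []
    if not NUMBER_WITH_UNIT.search(text):
        problems.append("no numeric threshold with a unit")
    if not re.search(r"\bbecause\b", text, re.IGNORECASE):
        problems.append("no 'because' reasoning clause")
    vague = sorted(set(VAGUE_WORDS.findall(text.lower())))
    if vague:
        problems.append("vague wording: " + ", ".join(vague))
    if any(float(value) >= 100 for value in PERCENTAGE.findall(text)):
        problems.append("a percentage threshold of 100 or more cannot fail")
    return problems


if __name__ == "__main__":
    print(lint_hypothesis("The system might be impacted"))
```

Running it on the payment-service hypothesis from Step 2 returns an empty list, while "The system might be impacted" is flagged on three counts.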

Practice Questions

What are the four parts of a Steady State hypothesis?
Why must a hypothesis be falsifiable?
How do you choose appropriate metric thresholds?
What does it mean if an experiment confirms the hypothesis?
How should you document a rejected hypothesis?

Challenge

Write a Steady State hypothesis for a database connection pool exhaustion experiment. Define the SLIs, set specific thresholds, include the reasoning clause, and describe what data would confirm versus reject the hypothesis.

FAQ

What is the Steady State hypothesis?

It is a falsifiable prediction about how your system will behave when a specific fault is introduced. It includes the fault, expected outcome, metric thresholds, and reasoning.

How do I choose which metrics to include?

Select metrics that directly reflect user experience: latency, error rate, and throughput. Infrastructure metrics such as CPU or memory usage are secondary and should not be the primary basis of the hypothesis.

Can a hypothesis include multiple metrics?

Yes. A robust hypothesis often monitors two or three metrics simultaneously. For example: "p99 latency stays under 500ms and error rate stays under 1 percent."

What if my hypothesis is rejected?

That is valuable information. It means you found a weakness. Document what happened, fix the root cause, and rerun the experiment.

How often should I update my hypotheses?

Update them whenever the system architecture changes significantly or when you discover that assumptions in the "because" clause are no longer valid.
