Chaos Engineering Principles — Steady State & Hypothesis

DodaTech Updated 2026-06-21 5 min read

In this tutorial, you'll learn about Chaos Engineering Principles. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Chaos Engineering principles provide the scientific framework that separates genuine resilience testing from random system abuse. The two most important concepts are the steady state (what normal looks like) and the hypothesis (what you expect to happen when a fault is introduced).

What You Will Learn

This tutorial explains the foundational principles of Chaos Engineering in five steps, with a focus on steady state measurement and hypothesis formulation.

Why It Matters

Without a clear steady state definition, your chaos experiments have no baseline for comparison. Without a hypothesis, you cannot prove or disprove anything. These principles turn chaos from vandalism into science.

Real-World Use

The DodaTech infrastructure team defines steady state for Durga Antivirus Pro as: p99 latency under 200 ms, error rate below 0.1 percent, and CPU utilization under 70 percent. Every chaos experiment is measured against these three metrics.

Prerequisites

Before starting you should understand:

The basics of chaos engineering from the overview tutorial
How microservices communicate over a network
Basic monitoring concepts (latency, throughput, error budgets)
Familiarity with Prometheus or similar metrics systems

Step 1: Define Steady State

Steady state is a measurable indicator that the system is operating normally. Choose metrics that reflect user experience, not just infrastructure health.

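# steady-state-metrics.yaml
# Define the SLIs (Service Level Indicators) you will measure
slis:
  latency:
    metric: http_request_duration_seconds
    p99_target: 0.5        # seconds
  errors:
    # Quoted so YAML tooling always reads the PromQL selector as a string
    metric: 'http_requests_total{status=~"5.."}'
    rate_target: 0.001     # 5xx share of all requests (0.1 percent)
  throughput:
    metric: http_requests_total
    min_rate: 100          # assumed unit: requests per second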

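Expected baseline values from Prometheus. The queries assume that http_request_duration_seconds is exported as a histogram and that the error rate is measured as the 5xx share of all requests:

histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# => 0.342
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# => 0.0003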
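The comparison between measured values and targets can be automated. The following is a minimal sketch, not a definitive implementation: the targets come from steady-state-metrics.yaml, the latency and error values are the baseline figures above, and the SLI key names, the `check_steady_state` helper, and the throughput reading of 250 requests per second are hypothetical.

```python
# Targets taken from steady-state-metrics.yaml.
# "max" means the value must not exceed the limit, "min" means it must reach it.
TARGETS = {
    "p99_latency_s": ("max", 0.5),
    "error_rate": ("max", 0.001),
    "throughput_rps": ("min", 100),
}


def check_steady_state(measured, targets=TARGETS):
    """Return a dict mapping each SLI name to True if it meets its target."""
    results = {}
    for name, (kind, limit) in targets.items():
        value = measured[name]
        results[name] = value <= limit if kind == "max" else value >= limit
    return results


# Latency and error rate are the baseline values from the tutorial;
# the throughput figure is a made-up example value.
baseline = {"p99_latency_s": 0.342, "error_rate": 0.0003, "throughput_rps": 250.0}

report = check_steady_state(baseline)
in_steady_state = all(report.values())
print(report, in_steady_state)
```

If any SLI is already outside its target before the fault is injected, the experiment should wait: a baseline that is not in steady state cannot tell you anything about the fault.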

Step 2: Form a Hypothesis

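A good hypothesis names the fault, predicts a measurable outcome, and gives the reasoning: "If I inject 500 ms of latency into the payment service, then the checkout flow will keep p99 latency under 2 seconds, because the added delay stays well below the 3-second timeout and requests complete without timing out."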

# Example hypothesis statement template
cat <<EOF
Hypothesis: If we kill one pod of the payment service,
then the error rate will stay below 1 percent
because the deployment runs three replicas with a
readiness probe that removes unhealthy pods from the
load balancer pool.
EOF

Expected outcome: The hypothesis is documented and shared with the team before the experiment runs.
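A hypothesis is easier to test when its parts are written down as data rather than free text. The sketch below is one possible way to do that, assuming a hypothetical `Hypothesis` class; the fault, metric, threshold, and rationale are the ones from the template above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    fault: str        # what will be injected
    metric: str       # what will be measured
    threshold: float  # the metric must stay below this value
    rationale: str    # why we expect the system to cope

    def is_refuted_by(self, observed: float) -> bool:
        """Falsifiable: any observation at or above the threshold refutes it."""
        return observed >= self.threshold

    def statement(self) -> str:
        return (
            f"If we {self.fault}, then {self.metric} will stay below "
            f"{self.threshold:g} because {self.rationale}."
        )


pod_kill = Hypothesis(
    fault="kill one pod of the payment service",
    metric="the error rate",
    threshold=0.01,  # 1 percent, as in the template
    rationale=(
        "the deployment runs three replicas with a readiness probe that "
        "removes unhealthy pods from the load balancer pool"
    ),
)

print(pod_kill.statement())
print(pod_kill.is_refuted_by(0.004))  # observation below 1 percent
print(pod_kill.is_refuted_by(0.02))   # observation above 1 percent
```

Fixing the threshold before the experiment runs is what keeps the hypothesis falsifiable: the pass or fail line cannot be moved after the results are in.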

Step 3: Minimize Blast Radius

The blast radius should be proportional to the confidence level. Early experiments target a single instance in a non-critical service. As confidence grows, you can expand the scope.

# Check replica count to understand blast radius
kubectl get deployment payment-service -o jsonpath='{.spec.replicas}'
# Expected output:
# 3

With three replicas, killing one pod affects at most about one third of traffic, assuming the load balancer spreads requests evenly. The remaining two replicas should absorb the load.
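The arithmetic behind that estimate is worth making explicit, because the load on the surviving replicas grows faster than the share of traffic that was hit. This is a small sketch that assumes requests are spread evenly across replicas; the `blast_radius` function is hypothetical.

```python
from fractions import Fraction


def blast_radius(replicas: int, killed: int = 1):
    """Return (share of traffic on killed pods, extra load per survivor)."""
    if not 0 < killed < replicas:
        raise ValueError("killed must be at least 1 and less than replicas")
    affected_share = Fraction(killed, replicas)
    survivors = replicas - killed
    # Each survivor goes from 1/replicas of the traffic to 1/survivors of it.
    load_increase = Fraction(replicas, survivors) - 1
    return affected_share, load_increase


share, extra = blast_radius(3)
print(f"traffic share on the killed pod: {share}")                # 1/3
print(f"extra load per surviving replica: {float(extra):.0%}")    # 50%
```

With three replicas, each survivor takes on 50 percent more traffic. If CPU scaled linearly with traffic, a replica running at 50 percent CPU would rise to about 75 percent, so the headroom of each replica matters as much as the replica count.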

Step 4: Run the Experiment

Execute the experiment while monitoring the steady state metrics. Automate the process using a chaos platform if possible.

# Using Chaos Mesh to inject a pod kill
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-pod-kill-test
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: ["production"]
    labelSelectors:
      app: payment-service
  # pod-kill is a one-shot action; the Deployment recreates the pod, so no duration is set
EOF

Expected output:

podchaos.chaos-mesh.org/payment-pod-kill-test created
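Monitoring during the experiment is most useful when it can also stop the experiment. The following sketch shows one way to poll a steady state metric and abort when it crosses a limit. Everything here is an assumption for illustration: `watch_experiment`, its parameters, and the simulated readings are hypothetical, and in practice `read_error_rate` would query your metrics system while `abort` might delete the chaos resource, for example with `kubectl delete podchaos payment-pod-kill-test`.

```python
import time
from typing import Callable


def watch_experiment(
    read_error_rate: Callable[[], float],
    abort: Callable[[], None],
    abort_threshold: float = 0.01,
    checks: int = 12,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the metric; abort and return False if it reaches the threshold."""
    for _ in range(checks):
        if read_error_rate() >= abort_threshold:
            abort()
            return False
        sleep(interval_s)
    return True


# Simulated run: the third reading breaches the 1 percent abort threshold.
readings = iter([0.0003, 0.002, 0.03])
aborted = []

completed = watch_experiment(
    read_error_rate=lambda: next(readings),
    abort=lambda: aborted.append(True),
    checks=3,
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(completed, aborted)
```

The abort threshold does not have to equal the hypothesis threshold. A looser abort limit lets the experiment refute the hypothesis while still protecting users from a runaway failure.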

Step 5: Analyze and Learn

Compare post-experiment metrics against the steady state baseline. If your hypothesis was correct, document the confirmation. If the system behaved unexpectedly, investigate the root cause.

# Compare metrics before vs during the experiment
# (p99_latency is assumed to be a recording rule defined in your Prometheus)
# The URL is quoted so the shell does not treat "?" as a glob character
# Before:
curl -s 'http://prometheus:9090/api/v1/query?query=p99_latency'
# {"status":"success","data":{"result":[{"value":[1718970000,"0.342"]}]}}
#
# During (expected within hypothesis):
curl -s 'http://prometheus:9090/api/v1/query?query=p99_latency'
# {"status":"success","data":{"result":[{"value":[1718970060,"0.412"]}]}}
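The two responses can be compared programmatically instead of by eye. This sketch parses the abbreviated JSON shown above and checks the result against a limit. The `instant_value` helper is hypothetical, and the 2-second limit is borrowed from the example hypothesis in Step 2 as an assumption, since this experiment does not state its own latency threshold.

```python
import json


def instant_value(response_text: str) -> float:
    """Extract the single sample value from a Prometheus instant query response."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('status')}")
    results = body["data"]["result"]
    if len(results) != 1:
        raise ValueError(f"expected exactly one series, got {len(results)}")
    _timestamp, value = results[0]["value"]
    return float(value)  # Prometheus encodes sample values as strings


before = instant_value(
    '{"status":"success","data":{"result":[{"value":[1718970000,"0.342"]}]}}'
)
during = instant_value(
    '{"status":"success","data":{"result":[{"value":[1718970060,"0.412"]}]}}'
)

P99_LIMIT_S = 2.0  # assumed limit, taken from the Step 2 example hypothesis

relative_change = (during - before) / before
hypothesis_holds = during < P99_LIMIT_S
print(f"p99 changed by {relative_change:.1%}, hypothesis holds: {hypothesis_holds}")
```

Reporting the relative change alongside the pass or fail result is useful: a hypothesis can hold while latency still drifts enough to deserve a note in the experiment record.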

Learning Path

flowchart LR
  A[Chaos Engineering Overview] --> B[Chaos Principles]
  B --> C[Blast Radius]
  C --> D[Steady State Hypothesis]
  D --> E[Designing Experiments]
  style B fill:#f90,color:#fff

Common Errors

Defining steady state too vaguely: Use specific, quantifiable metrics like p99 latency, not loose terms like "system feels fast."
Confirmation bias in hypotheses: State a falsifiable hypothesis. If you cannot be proven wrong, you are not doing science.
Forgetting to revert the fault: Always include an automatic rollback or duration limit. A stuck fault becomes a real outage.
Skipping the baseline measurement: You need before-and-after data. Without a baseline, you have no reference.
Running experiments without monitoring: If you cannot observe the impact in real time, you cannot learn from the experiment.
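The "forgetting to revert the fault" error can be guarded against in code by tying the rollback to the injection. Below is a minimal sketch using a Python context manager; `injected_fault` and the two callbacks are hypothetical stand-ins for whatever actually applies and removes the fault in your environment.

```python
from contextlib import contextmanager


@contextmanager
def injected_fault(inject, revert):
    """Apply a fault and guarantee it is reverted, even if the experiment fails."""
    inject()
    try:
        yield
    finally:
        revert()


log = []

try:
    with injected_fault(
        inject=lambda: log.append("inject"),
        revert=lambda: log.append("revert"),
    ):
        # Simulate the experiment crashing partway through.
        raise RuntimeError("lost connection to monitoring")
except RuntimeError:
    log.append("experiment aborted")

print(log)  # ['inject', 'revert', 'experiment aborted']
```

The revert step runs before the error propagates, so the fault is removed even when the experiment script itself breaks. This does not replace a duration limit on the fault, which still protects you if the script's host dies.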

Practice Questions

What three metrics would you choose to define the steady state of an API gateway?
Write a hypothesis for a network latency injection experiment on a database service.
Why is minimizing blast radius the third principle and not the first?
How do you determine if a chaos experiment passed or failed?
How does the scientific method relate to Chaos Engineering principles?

Challenge

Write a complete Chaos Engineering experiment plan for a Redis cache layer. Define the steady state metrics, state a falsifiable hypothesis, specify the fault injection, and describe what result would confirm or reject the hypothesis.

FAQ

Why is steady state important in Chaos Engineering?

Steady state gives you a quantifiable baseline to compare against. Without it, you cannot objectively determine whether the system handled the fault successfully.

What makes a good Chaos Engineering hypothesis?

A good hypothesis is falsifiable, specific, and tied to measurable metrics. It should state the fault, the expected outcome, and the reasoning.

How do I choose which metrics to monitor for steady state?

Choose metrics that directly reflect user experience: latency, error rate, and throughput. Infrastructure metrics like CPU and memory are secondary.

Can I have multiple steady state indicators?

Yes. In fact you should. A combination of latency, error rate, and saturation metrics gives a complete picture of system health.

What happens if the hypothesis is disproven?

That is the best outcome. It means you discovered a weakness before it caused a production outage. Document the finding and fix the issue.
