Observability in Chaos Engineering — Metrics, Traces & Logs

DodaTech Updated 2026-06-23 7 min read

This tutorial covers observability in chaos engineering: the key concepts, practical examples, and best practices you need to apply it to your own experiments.

Observability is the foundation of chaos engineering. Without metrics, traces, and logs, a chaos experiment produces no actionable results. Observing an experiment means capturing the system state before, during, and after fault injection, then comparing the data against your hypothesis.

What You Will Learn

This tutorial teaches you how to instrument chaos experiments with Prometheus metrics, OpenTelemetry distributed traces, structured logging, and Grafana dashboards designed specifically for experiment impact analysis and hypothesis validation.

Why It Matters

Running a chaos experiment without observability is like performing surgery blindfolded. You cannot tell whether the system is behaving as expected, whether the fault is actually being applied, or whether the recovery mechanisms are working. Observability turns raw experiment data into validated hypotheses and actionable resilience insights.

Real-World Use

DodaTech maintains a dedicated "Chaos Observability" Grafana dashboard that displays real-time metrics during active experiments. The dashboard includes before-and-after comparison panels for latency, error rate, throughput, and resource utilization, so engineers can immediately see the impact of any fault injection.

Prerequisites

Before starting you should understand:

  • Prometheus and basic metric types (counter, gauge, histogram)
  • Grafana dashboard concepts
  • Chaos Engineering experiment design
  • OpenTelemetry for distributed tracing

Step 1: Set Up Experiment-Specific Metrics

Create Prometheus metrics that capture experiment impact in real time:

# prometheus-experiment-metrics.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-experiment-monitor
spec:
  selector:
    matchLabels:
      app: chaos-exporter
  endpoints:
    - port: metrics
      interval: 5s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_experiment_id]
          targetLabel: experiment_id
        - sourceLabels: [__meta_kubernetes_pod_label_fault_type]
          targetLabel: fault_type

A small custom exporter can then publish the experiment metrics for Prometheus to scrape:

#!/usr/bin/env python3
"""Custom Prometheus exporter for chaos experiment metrics."""
from prometheus_client import start_http_server, Gauge, Histogram, Counter
import time
import sys

EXPERIMENT_ID = sys.argv[1] if len(sys.argv) > 1 else "unknown"

p99_latency = Gauge(
    "chaos_p99_latency_ms",
    "P99 latency during experiment",
    ["experiment_id", "service"]
)
error_rate = Gauge(
    "chaos_error_rate_percent",
    "Error rate during experiment",
    ["experiment_id", "service"]
)
throughput = Gauge(
    "chaos_throughput_rps",
    "Requests per second during experiment",
    ["experiment_id", "service"]
)
guardrail_breaches = Counter(
    "chaos_guardrail_breaches_total",
    "Total guardrail breaches during experiment",
    ["experiment_id", "guardrail_name"]
)

def collect_metrics():
    # Placeholder values; replace them with readings from your own system.
    p99_latency.labels(experiment_id=EXPERIMENT_ID, service="payment").set(450)
    error_rate.labels(experiment_id=EXPERIMENT_ID, service="payment").set(2.1)
    throughput.labels(experiment_id=EXPERIMENT_ID, service="payment").set(850)

if __name__ == "__main__":
    start_http_server(8000)
    print("Metrics collected and exported on :8000")
    while True:
        # Refresh the gauges on every cycle so each scrape sees current values.
        collect_metrics()
        time.sleep(15)

# Expected output:
# Metrics collected and exported on :8000
# Access at http://localhost:8000 to see raw metrics
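
The exporter defines chaos_guardrail_breaches_total but never increments it. The following is a minimal sketch of how a breach check might look, assuming simple upper limits per metric. The limit values and the breached_guardrails helper are hypothetical and not part of the original exporter; each returned name could then be passed to guardrail_breaches.labels(experiment_id=EXPERIMENT_ID, guardrail_name=name).inc().

```python
"""Sketch: evaluate guardrails against the latest experiment readings."""

# Hypothetical guardrail limits; pick values that match your own SLOs.
GUARDRAILS = {
    "p99_latency_ms": 1000.0,    # breach if p99 latency exceeds 1 s
    "error_rate_percent": 5.0,   # breach if more than 5% of requests fail
}


def breached_guardrails(readings, limits=GUARDRAILS):
    """Return the sorted names of guardrails whose limit is exceeded."""
    return sorted(
        name
        for name, limit in limits.items()
        if readings.get(name, 0.0) > limit
    )


if __name__ == "__main__":
    # Readings from the exporter example: 450 ms p99, 2.1% errors.
    print(breached_guardrails({"p99_latency_ms": 450, "error_rate_percent": 2.1}))
    # A degraded reading that crosses the latency limit.
    print(breached_guardrails({"p99_latency_ms": 1234, "error_rate_percent": 2.1}))
```

Keeping the check as a pure function makes it easy to unit test separately from the Prometheus client.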

Step 2: Compare Before-and-After Metrics

Query Prometheus to compare system state before and after an experiment:

# Query baseline metrics (run this before injecting the fault)
PROMETHEUS="http://prometheus:9090"
# GNU date syntax; on macOS/BSD use: date -v-5M +%s
START_TIME=$(date -d "5 minutes ago" +%s)

# Get baseline p99 latency
curl -s "$PROMETHEUS/api/v1/query_range" \
  --data-urlencode "query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))" \
  --data-urlencode "start=$START_TIME" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode "step=15" | jq '.data.result[0].values[] | .[1]'
# Expected output:
# "0.342"
# "0.351"
# "0.338"
# (Baseline p99 around 340ms)

# Query during-experiment metrics (same query, time range during experiment)
# Expected output:
# "0.342"
# "1.234"
# "1.567"
# (p99 spiked to 1.5s during the experiment)
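
Once both windows have been queried, the comparison itself is simple arithmetic. The sketch below reuses the sample values shown above and computes the relative change in p99 latency. The 1.0 s limit used for the hypothesis check is an assumed example, not a value from this tutorial, and the helper names are illustrative.

```python
"""Sketch: compare baseline and during-experiment p99 samples."""
from statistics import mean


def to_floats(values):
    """Prometheus returns sample values as strings; convert them to floats."""
    return [float(v) for v in values]


def compare(baseline, during):
    """Return the mean of each window and the relative change in percent."""
    base_mean = mean(to_floats(baseline))
    during_mean = mean(to_floats(during))
    change_pct = (during_mean - base_mean) / base_mean * 100
    return base_mean, during_mean, change_pct


def hypothesis_holds(during, limit_seconds):
    """True if no during-experiment sample exceeds the allowed p99."""
    return max(to_floats(during)) <= limit_seconds


if __name__ == "__main__":
    baseline = ["0.342", "0.351", "0.338"]  # values from the baseline query
    during = ["0.342", "1.234", "1.567"]    # values from the experiment window
    base_mean, during_mean, change_pct = compare(baseline, during)
    print(f"baseline={base_mean:.3f}s during={during_mean:.3f}s change={change_pct:+.0f}%")
    # Assumed hypothesis: p99 stays at or below 1.0 s during the fault.
    print("hypothesis holds:", hypothesis_holds(during, limit_seconds=1.0))
```

Comparing window means smooths out single noisy samples, while the hypothesis check looks at the worst sample because a guardrail is normally about peaks.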

Step 3: Trace Request Paths During Fault Injection

Use OpenTelemetry to trace how requests behave under fault conditions:

#!/usr/bin/env python3
"""OpenTelemetry tracing during chaos experiment."""
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
import requests
import time

provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

EXPERIMENT_ID = "exp-payment-kill-001"

def trace_request():
    with tracer.start_as_current_span("payment-checkout") as span:
        span.set_attribute("experiment.id", EXPERIMENT_ID)
        span.set_attribute("experiment.fault", "pod-kill")

        # Make a request to the payment service
        start = time.time()
        try:
            response = requests.get(
                "http://payment-service:8080/health",
                timeout=5
            )
            latency_ms = (time.time() - start) * 1000
            span.set_attribute("http.status_code", response.status_code)
            span.set_attribute("http.latency_ms", latency_ms)

            if response.status_code != 200:
                span.set_attribute("error", True)
                span.set_attribute("error.message", f"Status {response.status_code}")

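        except requests.RequestException as e:
            # Timeouts and connection errors are the expected failures here.
            span.set_attribute("error", True)
            span.set_attribute("error.message", str(e))
            span.record_exception(e)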

        return span

# Run trace before, during, and after experiment
for phase in ["before", "during", "after"]:
    print(f"Tracing {phase} experiment...")
    trace_request()
    time.sleep(5)

print("Traces exported to OpenTelemetry collector")

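# Expected output (the details in parentheses are recorded on the exported spans, not printed):
# Tracing before experiment...   (span: status=200, latency=45ms)
# Tracing during experiment...   (span: status=200, latency=1234ms) OR (span: error, timeout)
# Tracing after experiment...    (span: status=200, latency=48ms)
# Traces exported to OpenTelemetry collector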
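
Because every span carries experiment.id, spans can be grouped by experiment after export. The sketch below assumes the span attributes are already available as plain dictionaries (for example from a tracing backend export); the sample data and the summarize helper are hypothetical.

```python
"""Sketch: summarize exported span attributes per experiment."""
from collections import defaultdict


def summarize(spans):
    """Group span attribute dicts by experiment.id; count spans and errors."""
    summary = defaultdict(lambda: {"spans": 0, "errors": 0})
    for attrs in spans:
        exp_id = attrs.get("experiment.id", "none")
        summary[exp_id]["spans"] += 1
        if attrs.get("error"):
            summary[exp_id]["errors"] += 1
    return dict(summary)


if __name__ == "__main__":
    # Hypothetical span attributes, shaped like those set in trace_request().
    spans = [
        {"experiment.id": "exp-payment-kill-001", "http.status_code": 200},
        {"experiment.id": "exp-payment-kill-001", "error": True},
        {"http.status_code": 200},  # traffic unrelated to any experiment
    ]
    for exp_id, stats in summarize(spans).items():
        print(exp_id, stats)
```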

Step 4: Build a Chaos Experiment Dashboard

Create a Grafana dashboard for visualizing experiment impact:

{
  "dashboard": {
    "title": "Chaos Experiment Impact",
    "panels": [
      {
        "title": "Latency Comparison (Before vs During)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{experiment_id=\"$experiment\"}[1m])) by (le))",
            "legendFormat": "P99 Latency"
          }
        ]
      },
      {
        "title": "Error Rate During Experiment",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\",experiment_id=\"$experiment\"}[1m])) / sum(rate(http_requests_total{experiment_id=\"$experiment\"}[1m])) * 100",
            "legendFormat": "Error Rate"
          }
        ]
      },
      {
        "title": "Guardrail Status",
        "type": "stat",
        "targets": [
          {
            "expr": "chaos_guardrail_breaches_total{experiment_id=\"$experiment\"}"
          }
        ]
      }
    ],
    "templating": {
      "list": [
        {
          "name": "experiment",
          "type": "query",
          "query": "label_values(chaos_p99_latency_ms, experiment_id)"
        }
      ]
    }
  }
}
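
To make the experiment window visible on the panels, the start and end times can be recorded as a Grafana annotation. The sketch below only builds the JSON body for Grafana's annotations endpoint (POST /api/annotations), which takes time and timeEnd in epoch milliseconds. The tag names, the text, and the example window are assumptions; sending the request and authenticating are left out.

```python
"""Sketch: build a Grafana annotation payload for an experiment window."""
import json
from datetime import datetime, timezone


def annotation_payload(experiment_id, fault_type, start, end):
    """Return a dict for Grafana's annotations endpoint (times in epoch ms)."""
    return {
        "time": int(start.timestamp() * 1000),
        "timeEnd": int(end.timestamp() * 1000),
        "tags": ["chaos", experiment_id, fault_type],
        "text": f"Chaos experiment {experiment_id} ({fault_type})",
    }


if __name__ == "__main__":
    # Hypothetical experiment window; use the real start and end times.
    start = datetime(2026, 6, 23, 10, 0, tzinfo=timezone.utc)
    end = datetime(2026, 6, 23, 10, 5, tzinfo=timezone.utc)
    payload = annotation_payload("exp-payment-kill-001", "pod-kill", start, end)
    print(json.dumps(payload, indent=2))
```

Using timezone-aware datetimes avoids an off-by-hours annotation when the script runs on a machine whose local time zone differs from the dashboard's.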

Step 5: Implement Structured Logging for Experiments

Add experiment context to all application logs:

#!/usr/bin/env python3
"""Structured logging with experiment context."""
import structlog

EXPERIMENT_ID = "exp-payment-kill-001"

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
)

logger = structlog.get_logger()

def handle_request():
    log = logger.bind(
        experiment_id=EXPERIMENT_ID,
        service="payment",
        trace_id="trace-abc-123"
    )

    log.info("request_started", path="/api/charge", method="POST")
    # ... process request ...
    log.info("request_completed",
        status_code=200,
        latency_ms=45,
        database_healthy=True
    )

    # During an experiment, log deviations from the baseline.
    # experiment_id is already bound to the logger above.
    log.warning("latency_increased",
        current_p99=1234,
        baseline_p99=342
    )

handle_request()

# Expected JSON log output (abridged; key order may differ):
# {"event": "request_started", "level": "info", "experiment_id": "exp-payment-kill-001", ...}
# {"event": "request_completed", "level": "info", "status_code": 200, "latency_ms": 45, ...}
# {"event": "latency_increased", "level": "warning", "current_p99": 1234, "baseline_p99": 342, ...}
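
With experiment_id present in every entry, isolating one experiment's logs becomes a simple filter. The sketch below assumes one JSON object per line, as in the output above; the sample lines and the lines_for_experiment helper are hypothetical.

```python
"""Sketch: filter JSON log lines by experiment ID."""
import json


def lines_for_experiment(lines, experiment_id):
    """Yield parsed log entries whose experiment_id matches."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (for example stack traces)
        if isinstance(entry, dict) and entry.get("experiment_id") == experiment_id:
            yield entry


if __name__ == "__main__":
    # Hypothetical log lines in the shape produced by the structlog example.
    logs = [
        '{"event": "request_started", "experiment_id": "exp-payment-kill-001"}',
        '{"event": "request_started", "experiment_id": "exp-other-002"}',
        "plain text line without JSON",
    ]
    for entry in lines_for_experiment(logs, "exp-payment-kill-001"):
        print(entry["event"])
```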

Learning Path

flowchart LR
  A[Chaos Engineering Pipeline] --> B[Chaos Observability]
  B --> C[Advanced Experiments]
  C --> D[Game Days]
  D --> E[Chaos Mesh Advanced]
  style B fill:#f90,color:#fff

Common Errors

  1. Not collecting baseline metrics before the experiment starts: Without baseline data you cannot quantify the impact of a fault. Always capture 5-10 minutes of pre-experiment metrics.
  2. Sampling traces at too low a rate during experiments: During a fault many requests fail, and if your trace sampler only captures 1 percent of requests you may miss the failed ones entirely.
  3. Grafana dashboards with time ranges that exclude the experiment window: Set the dashboard time range to cover the experiment period. Use annotations to mark experiment start and end times.
  4. Logging without experiment correlation IDs: If logs do not include the experiment ID, you cannot filter for experiment-related events. Add experiment context to every log entry.
  5. Alerting on experiment metrics as if they were production SLOs: Experiment metrics include fault effects and will trigger false alarms. Use separate alerting rules for experiment data.
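
For the sampling problem in item 2, one common idea is to keep every failed trace and sample only the successful ones. The sketch below illustrates that decision rule in plain Python. It is not an OpenTelemetry API: the keep_trace function and the 1 percent default are assumptions, and because the rule needs to know whether the trace failed, it fits a tail-based stage (after the trace completes) rather than a head sampler.

```python
"""Sketch: keep every failed trace, sample successful ones."""
import hashlib


def keep_trace(trace_id, is_error, success_rate=0.01):
    """Decide whether to keep a finished trace.

    Errors are always kept. Successful traces are kept when a hash of the
    trace ID falls below success_rate, so the decision is deterministic
    for a given trace ID.
    """
    if is_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < success_rate


if __name__ == "__main__":
    print(keep_trace("trace-abc-123", is_error=True))
    print(keep_trace("trace-abc-123", is_error=False, success_rate=1.0))
```

Hashing the trace ID instead of calling a random generator means every service that evaluates the same trace reaches the same decision.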

Practice Questions

  1. Why is it important to collect baseline metrics before running a Chaos Experiment?
  2. How do OpenTelemetry traces help during Fault Injection experiments?
  3. What Prometheus metric types are most useful for experiment impact analysis?
  4. How do you add experiment correlation IDs to application logs?
  5. What is the difference between a Grafana annotation and a regular data point?

Challenge

Build a complete chaos observability stack for a microservice application. Set up a Prometheus exporter that captures p99 latency and error rate with experiment labels, configure OpenTelemetry to trace requests through three services during a fault, create a Grafana dashboard that compares before-and-after metrics with an experiment selector variable, and implement structured logging that includes the experiment ID in every log entry.

FAQ

What is observability in Chaos Engineering?

Observability in chaos engineering means collecting metrics, traces, and logs with experiment correlation data so that you can precisely measure the impact of fault injection on system behavior.

Why do I need baseline metrics for chaos experiments?

Baseline metrics establish normal system behavior before fault injection. Without a baseline you cannot quantify the impact of the fault or validate your hypothesis about acceptable degradation.

How do I correlate traces with specific chaos experiments?

Add experiment ID and fault type as span attributes in OpenTelemetry traces. This allows you to filter traces by experiment and see how specific requests were affected.

What should a Chaos Experiment dashboard show?

A good dashboard shows before-and-after comparisons of latency, error rate, throughput, and resource utilization, with annotations marking experiment start and end times and a selector for choosing which experiment to analyze.

How do I prevent Chaos Experiment metrics from triggering false alarms?

Use separate Prometheus recording rules or alerting configurations for experiment metrics. Tag experiment metrics with an experiment_id label and exclude them from production alerting queries.
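
One way to express that exclusion in PromQL is the matcher experiment_id="", which selects only series where the label is absent or empty. The sketch below builds such a selector as a string. It assumes production series carry no experiment_id label, and the production_only helper is hypothetical.

```python
"""Sketch: build a PromQL selector that excludes experiment-tagged series."""


def production_only(metric, matchers=None):
    """Return a selector for `metric` restricted to series without experiment_id."""
    parts = list(matchers or [])
    # In PromQL, label="" matches series where the label is missing or empty.
    parts.append('experiment_id=""')
    return f"{metric}{{{', '.join(parts)}}}"


if __name__ == "__main__":
    print(production_only("http_requests_total", ['status=~"5.."']))
```

The resulting selector can be dropped into an existing alerting expression in place of the bare metric name.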
