Observability in Chaos Engineering — Metrics, Traces & Logs
This tutorial covers Observability in Chaos Engineering: the key concepts, practical examples, and best practices you need to instrument experiments and interpret their results.
Observability is the foundation of Chaos Engineering. Without proper metrics, traces, and logs, a Chaos Experiment produces no actionable results. Observing experiments means capturing the system state before, during, and after Fault Injection, then comparing the data against your hypothesis.
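As a rough illustration of comparing captured data against a hypothesis, the sketch below checks each phase of an experiment against fixed thresholds. The `SteadyStateHypothesis` class, the threshold values, and the sample observations are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class SteadyStateHypothesis:
    """Hypothetical limits the system is expected to stay within."""
    max_p99_latency_ms: float
    max_error_rate_percent: float


def evaluate(hypothesis, observations):
    """Map each phase name to True if its metrics stayed within the limits."""
    results = {}
    for phase, metrics in observations.items():
        results[phase] = (
            metrics["p99_latency_ms"] <= hypothesis.max_p99_latency_ms
            and metrics["error_rate_percent"] <= hypothesis.max_error_rate_percent
        )
    return results


# Illustrative numbers only; real values would come from your metrics backend.
hypothesis = SteadyStateHypothesis(max_p99_latency_ms=500, max_error_rate_percent=1.0)
observations = {
    "before": {"p99_latency_ms": 342, "error_rate_percent": 0.2},
    "during": {"p99_latency_ms": 1234, "error_rate_percent": 2.1},
    "after": {"p99_latency_ms": 348, "error_rate_percent": 0.3},
}
verdict = evaluate(hypothesis, observations)
print(verdict)  # {'before': True, 'during': False, 'after': True}
```

A failed "during" phase with a passing "after" phase would suggest the fault had a visible impact but the system recovered, which is itself a useful finding.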
What You Will Learn
This tutorial teaches you how to instrument chaos experiments with Prometheus metrics, OpenTelemetry distributed traces, structured logging, and Grafana dashboards designed specifically for experiment impact analysis and hypothesis validation.
Why It Matters
Running a Chaos Experiment without Observability is like performing surgery blindfolded. You cannot tell if the system is behaving as expected, if the fault is actually being applied, or if the recovery mechanisms are working. Observability turns raw experiment data into validated hypotheses and actionable resilience insights.
Real-World Use
DodaTech maintains a dedicated "Chaos Observability" Grafana dashboard that displays real-time metrics during active experiments. The dashboard includes before-and-after comparison panels for latency, error rate, throughput, and resource utilization so engineers can immediately see the impact of any Fault Injection.
Prerequisites
Before starting you should understand:
- Prometheus and basic metric types (counter, gauge, histogram)
- Grafana dashboard concepts
- Chaos Engineering experiment design
- OpenTelemetry for distributed tracing
Step 1: Set Up Experiment-Specific Metrics
Create Prometheus metrics that capture experiment impact in real time:
# prometheus-experiment-metrics.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-experiment-monitor
spec:
  selector:
    matchLabels:
      app: chaos-exporter
  endpoints:
    - port: metrics
      interval: 5s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_experiment_id]
          targetLabel: experiment_id
        - sourceLabels: [__meta_kubernetes_pod_label_fault_type]
          targetLabel: fault_type
#!/usr/bin/env python3
"""Custom Prometheus exporter for chaos experiment metrics."""
import sys
import time

from prometheus_client import Counter, Gauge, start_http_server

# The experiment ID is passed as the first command-line argument.
EXPERIMENT_ID = sys.argv[1] if len(sys.argv) > 1 else "unknown"

p99_latency = Gauge(
    "chaos_p99_latency_ms",
    "P99 latency during experiment",
    ["experiment_id", "service"],
)
error_rate = Gauge(
    "chaos_error_rate_percent",
    "Error rate during experiment",
    ["experiment_id", "service"],
)
throughput = Gauge(
    "chaos_throughput_rps",
    "Requests per second during experiment",
    ["experiment_id", "service"],
)
guardrail_breaches = Counter(
    "chaos_guardrail_breaches_total",
    "Total guardrail breaches during experiment",
    ["experiment_id", "guardrail_name"],
)


def collect_metrics():
    # Placeholder values; a real exporter would read these from the system under test.
    p99_latency.labels(experiment_id=EXPERIMENT_ID, service="payment").set(450)
    error_rate.labels(experiment_id=EXPERIMENT_ID, service="payment").set(2.1)
    throughput.labels(experiment_id=EXPERIMENT_ID, service="payment").set(850)
    print("Metrics collected and exported on :8000")


if __name__ == "__main__":
    start_http_server(8000)
    while True:
        # Refresh the gauges on every cycle so each scrape sees current values.
        collect_metrics()
        time.sleep(15)

# Expected output (repeated every 15 seconds):
# Metrics collected and exported on :8000
# Access at http://localhost:8000 to see raw metrics
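The exporter above declares `chaos_guardrail_breaches_total` but never increments it. One possible way to decide when a breach has occurred is sketched below; the `GUARDRAILS` limits and the `breached_guardrails` helper are assumptions made for illustration, not part of the original exporter.

```python
# Hypothetical guardrail limits; tune these to your own service-level targets.
GUARDRAILS = {
    "p99_latency_ms": 1000,
    "error_rate_percent": 5.0,
}


def breached_guardrails(sample, guardrails=GUARDRAILS):
    """Return the sorted names of guardrails whose limit the sample exceeds."""
    return sorted(
        name for name, limit in guardrails.items()
        if sample.get(name, 0) > limit
    )


# Illustrative sample: latency is over its limit, error rate is not.
sample = {"p99_latency_ms": 1234, "error_rate_percent": 2.1}
breaches = breached_guardrails(sample)
print(breaches)  # ['p99_latency_ms']
```

Inside the exporter, each returned name could then be recorded with `guardrail_breaches.labels(experiment_id=EXPERIMENT_ID, guardrail_name=name).inc()`.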
Step 2: Compare Before-and-After Metrics
Query Prometheus to compare system state before and after an experiment:
# Query baseline metrics (before experiment)
PROMETHEUS="http://prometheus:9090"
START_TIME=$(date -d "5 minutes ago" +%s)

# Get baseline p99 latency
curl -s "$PROMETHEUS/api/v1/query_range" \
  --data-urlencode "query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))" \
  --data-urlencode "start=$START_TIME" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode "step=15" | jq '.data.result[0].values[] | .[1]'

# Expected output:
# "0.342"
# "0.351"
# "0.338"
# (Baseline p99 around 340ms)

# Query during-experiment metrics (same query, time range during experiment)
# Expected output:
# "0.342"
# "1.234"
# "1.567"
# (p99 spiked to 1.5s during the experiment)
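The queries above return raw samples, so the comparison itself still has to be computed. The sketch below parses two `query_range` responses and reports the relative change in mean p99 latency. The helper names are hypothetical, and the embedded responses simply reuse the sample values shown above with made-up timestamps.

```python
import json
from statistics import mean


def series_values(response_text):
    """Extract float samples from the first series of a query_range response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success" or not payload["data"]["result"]:
        return []
    return [float(value) for _timestamp, value in payload["data"]["result"][0]["values"]]


def percent_change(baseline, experiment):
    """Relative change of the experiment mean versus the baseline mean, in percent."""
    base = mean(baseline)
    return (mean(experiment) - base) / base * 100


def fake_response(values):
    """Build a response in the Prometheus matrix format (timestamps are made up)."""
    samples = [[1700000000 + 15 * i, v] for i, v in enumerate(values)]
    return json.dumps({
        "status": "success",
        "data": {"resultType": "matrix", "result": [{"metric": {}, "values": samples}]},
    })


baseline = series_values(fake_response(["0.342", "0.351", "0.338"]))
experiment = series_values(fake_response(["0.342", "1.234", "1.567"]))
change = percent_change(baseline, experiment)
print(f"p99 latency changed by {change:+.0f}% during the experiment")
```

With these sample values the mean p99 roughly triples. Comparing means over a window is only one option; comparing the maximum or a specific percentile of the window may suit spiky faults better.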
Step 3: Trace Request Paths During Fault Injection
Use OpenTelemetry to trace how requests behave under fault conditions:
#!/usr/bin/env python3
"""OpenTelemetry tracing during chaos experiment."""
import time

import requests
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

EXPERIMENT_ID = "exp-payment-kill-001"


def trace_request(phase):
    """Send one traced health check and tag the span with the experiment phase."""
    with tracer.start_as_current_span("payment-checkout") as span:
        span.set_attribute("experiment.id", EXPERIMENT_ID)
        span.set_attribute("experiment.fault", "pod-kill")
        span.set_attribute("experiment.phase", phase)
        # Make a request to the payment service
        start = time.time()
        try:
            response = requests.get(
                "http://payment-service:8080/health",
                timeout=5,
            )
            latency_ms = (time.time() - start) * 1000
            span.set_attribute("http.status_code", response.status_code)
            span.set_attribute("http.latency_ms", latency_ms)
            if response.status_code != 200:
                span.set_attribute("error", True)
                span.set_attribute("error.message", f"Status {response.status_code}")
            return f"status={response.status_code}, latency={latency_ms:.0f}ms"
        except requests.RequestException as e:
            span.set_attribute("error", True)
            span.set_attribute("error.message", str(e))
            return f"error, {type(e).__name__}"


# Run trace before, during, and after experiment
for phase in ["before", "during", "after"]:
    result = trace_request(phase)
    print(f"Tracing {phase} experiment... (span: {result})")
    time.sleep(5)

# Flush spans still buffered by the BatchSpanProcessor before exiting
provider.shutdown()
print("Traces exported to OpenTelemetry collector")

# Expected output (values are illustrative):
# Tracing before experiment... (span: status=200, latency=45ms)
# Tracing during experiment... (span: status=200, latency=1234ms) OR (span: error, ReadTimeout)
# Tracing after experiment... (span: status=200, latency=48ms)
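To follow a request across several services, the trace context has to travel with it, typically in the W3C `traceparent` HTTP header. OpenTelemetry instrumentation normally injects and extracts this header for you; the standard-library sketch below only illustrates the header's layout, and the helper names are hypothetical.

```python
import re
import secrets

# Layout: version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Create a version-00 traceparent value with random trace and span IDs."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Split a traceparent value into its fields, or return None if malformed."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": int(flags, 16) & 1 == 1,
    }


header = new_traceparent()
fields = parse_traceparent(header)
print(header)
print(fields["trace_id"], "sampled:", fields["sampled"])
```

The `trace_id` field is the value worth copying into log entries, because it lets you jump from a log line recorded during the fault to the full trace of that request.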
Step 4: Build a Chaos Experiment Dashboard
Create a Grafana dashboard for visualizing experiment impact:
{
  "dashboard": {
    "title": "Chaos Experiment Impact",
    "panels": [
      {
        "title": "Latency Comparison (Before vs During)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{experiment_id=\"$experiment\"}[1m])) by (le))",
            "legendFormat": "P99 Latency"
          }
        ]
      },
      {
        "title": "Error Rate During Experiment",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\",experiment_id=\"$experiment\"}[1m])) / sum(rate(http_requests_total{experiment_id=\"$experiment\"}[1m])) * 100",
            "legendFormat": "Error Rate"
          }
        ]
      },
      {
        "title": "Guardrail Status",
        "type": "stat",
        "targets": [
          {
            "expr": "chaos_guardrail_breaches_total{experiment_id=\"$experiment\"}"
          }
        ]
      }
    ],
    "templating": {
      "list": [
        {
          "name": "experiment",
          "type": "query",
          "query": "label_values(chaos_p99_latency_ms, experiment_id)"
        }
      ]
    }
  }
}
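A dashboard like this is easier to read when the experiment window is marked directly on the graphs. The sketch below builds the JSON body for a region annotation, assuming Grafana's annotations HTTP endpoint (`/api/annotations`), which takes `time` and `timeEnd` in epoch milliseconds. The `build_annotation` helper, the tag names, and the timestamps are illustrative; sending the request and authenticating are left out.

```python
import json
from datetime import datetime, timezone


def build_annotation(experiment_id, fault_type, start, end):
    """Build a region annotation body; times are converted to epoch milliseconds."""
    return {
        "time": int(start.timestamp() * 1000),
        "timeEnd": int(end.timestamp() * 1000),
        "tags": ["chaos", f"experiment:{experiment_id}", f"fault:{fault_type}"],
        "text": f"Chaos experiment {experiment_id} ({fault_type})",
    }


# Illustrative five-minute experiment window.
start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 12, 5, 0, tzinfo=timezone.utc)
annotation = build_annotation("exp-payment-kill-001", "pod-kill", start, end)
print(json.dumps(annotation, indent=2))
```

Tagging annotations with the experiment ID makes it possible to filter them per experiment on the dashboard.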
Step 5: Implement Structured Logging for Experiments
Add experiment context to all application logs:
#!/usr/bin/env python3
"""Structured logging with experiment context."""
import structlog

EXPERIMENT_ID = "exp-payment-kill-001"

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
)
logger = structlog.get_logger()


def handle_request():
    # Bind the experiment context once so every entry below carries it.
    log = logger.bind(
        experiment_id=EXPERIMENT_ID,
        service="payment",
        trace_id="trace-abc-123",
    )
    log.info("request_started", path="/api/charge", method="POST")
    # ... process request ...
    log.info(
        "request_completed",
        status_code=200,
        latency_ms=45,
        database_healthy=True,
    )
    # During experiment, log differences
    log.warning(
        "latency_increased",
        current_p99=1234,
        baseline_p99=342,
        experiment_id=EXPERIMENT_ID,
    )


handle_request()

# Expected JSON log output (abridged; key order may differ):
# {"event": "request_started", "level": "info", "experiment_id": "exp-payment-kill-001", ...}
# {"event": "request_completed", "level": "info", "status_code": 200, "latency_ms": 45, ...}
# {"event": "latency_increased", "level": "warning", "current_p99": 1234, "baseline_p99": 342, ...}
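If adding a dependency is not an option, a similar effect can be approximated with the standard `logging` module. This is a minimal sketch, not a drop-in replacement for the structlog setup: the `ExperimentContextFilter` and `JsonFormatter` classes and the `chaos_demo` logger name are hypothetical.

```python
import io
import json
import logging


class ExperimentContextFilter(logging.Filter):
    """Attach experiment context to every record passing through the handler."""

    def __init__(self, experiment_id, service):
        super().__init__()
        self.experiment_id = experiment_id
        self.service = service

    def filter(self, record):
        record.experiment_id = self.experiment_id
        record.service = self.service
        return True  # never drop records, only enrich them


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "experiment_id": getattr(record, "experiment_id", None),
            "service": getattr(record, "service", None),
        })


# Write to an in-memory stream so the output is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
handler.addFilter(ExperimentContextFilter("exp-payment-kill-001", "payment"))

logger = logging.getLogger("chaos_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("request_started")
logger.warning("latency_increased")

log_lines = stream.getvalue().splitlines()
print("\n".join(log_lines))
```

Because the filter is attached to the handler, call sites do not need to pass the experiment ID explicitly, which reduces the chance of an entry being logged without it.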
Learning Path
flowchart LR
    A[Chaos Engineering Pipeline] --> B[Chaos Observability]
    B --> C[Advanced Experiments]
    C --> D[Game Days]
    D --> E[Chaos Mesh Advanced]
    style B fill:#f90,color:#fff
Common Errors
- Not collecting baseline metrics before the experiment starts: Without baseline data you cannot quantify the impact of a fault. Always capture 5-10 minutes of pre-experiment metrics.
- Sampling traces at too low a rate during experiments: During a fault many requests fail, and if your trace sampler only captures 1 percent of requests you may miss the failed ones entirely.
- Grafana dashboards with time ranges that exclude the experiment window: Set the dashboard time range to cover the experiment period. Use annotations to mark experiment start and end times.
- Logging without experiment correlation IDs: If logs do not include the experiment ID, you cannot filter for experiment-related events. Add experiment context to every log entry.
- Alerting on experiment metrics as if they were production SLOs: Experiment metrics include fault effects and will trigger false alarms. Use separate alerting rules for experiment data.
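The trace sampling problem in the list above can be illustrated with a small simulation: keep every errored trace and sample healthy ones at the base rate. The `keep_trace` function is a conceptual sketch, not an OpenTelemetry API. Because the decision depends on the outcome of the request, it is a tail-based decision, which in practice is usually made in a collector rather than in application code.

```python
import random


def keep_trace(is_error, base_rate, rng=random.random):
    """Keep every errored trace; keep healthy traces with probability base_rate."""
    if is_error:
        return True
    return rng() < base_rate


# Simulate 1000 requests where every tenth one fails (illustrative data).
rng = random.Random(42)
traces = [{"id": i, "error": i % 10 == 0} for i in range(1000)]

kept = [t for t in traces if keep_trace(t["error"], 0.01, rng.random)]
kept_errors = sum(1 for t in kept if t["error"])
print(f"kept {len(kept)} of {len(traces)} traces, including {kept_errors} errors")
```

All 100 failed requests are retained regardless of the 1 percent base rate, whereas uniform 1 percent sampling would be expected to keep only about one of them.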
Practice Questions
- Why is it important to collect baseline metrics before running a Chaos Experiment?
- How do OpenTelemetry traces help during Fault Injection experiments?
- What Prometheus metric types are most useful for experiment impact analysis?
- How do you add experiment correlation IDs to application logs?
- What is the difference between a Grafana annotation and a regular data point?
Challenge
Build a complete chaos Observability stack for a microservice application. Set up a Prometheus exporter that captures p99 latency and error rate with experiment labels, configure OpenTelemetry to trace requests through three services during a fault, create a Grafana dashboard that compares before-and-after metrics with an experiment selector variable, and implement structured logging that includes experiment IDs in every log entry.