Database Chaos Engineering — PostgreSQL, MySQL & Redis Resilience
In this tutorial, you'll learn about Database Chaos Engineering. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Chaos Engineering for databases extends beyond simple connection drops to include Replication lag, failover timing, cache stampedes, query timeouts, and data consistency under partition. Databases are the stateful heart of most applications, and their failure modes are uniquely destructive.
What You Will Learn
This tutorial teaches you how to run chaos experiments on PostgreSQL, MySQL, and Redis databases: testing Connection Pool behavior, Replication lag tolerance, automated failover timing, cache invalidation storms, and read-after-write consistency.
Why It Matters
Database failures cascade faster and further than any other type of infrastructure failure. A single slow query can exhaust the Connection Pool, blocking every service that depends on that database. Testing these scenarios with controlled chaos experiments reveals connection leaks, misconfigured timeouts, and missing fallback logic before they cause production incidents.
Real-World Use
DodaTech runs a weekly PostgreSQL failover experiment on the Durga Antivirus Pro update database. The experiment triggers a manual failover from primary to replica and measures how long the update service is unable to serve new virus definition requests. The SLO target is under 10 seconds of write unavailability.
Prerequisites
Before starting you should understand:
- PostgreSQL or MySQL Replication concepts
- Redis cache patterns and eviction policies
- Docker for local database setup
- Basic Chaos Engineering principles
Step 1: Test PostgreSQL Connection Pool Behavior
Set up a Connection Pool and test exhaustion scenarios:
# pg-pool-chaos.yaml
services:
postgres:
image: postgres:16
environment:
POSTGRES_PASSWORD: secret
POSTGRES_DB: testdb
ports:
- "5432:5432"
app:
image: python:3.11
command: >
sh -c "pip install psycopg2-binary &&
python /scripts/pool-test.py"
depends_on:
- postgres
environment:
DB_HOST: postgres
DB_PASSWORD: secret
volumes:
- ./scripts:/scripts
# Reduce max_connections to trigger exhaustion quickly
docker exec postgres psql -U postgres -c "ALTER SYSTEM SET max_connections = 10;"
docker exec postgres psql -U postgres -c "SELECT pg_reload_conf();"
# Open 15 connections to exhaust the pool
python3 pool-exhaust.py
#!/usr/bin/env python3
"""Connection pool exhaustion test."""
import psycopg2
import threading
import time
import sys
connections = []
exhausted = False
def open_connection(worker_id):
global exhausted
try:
conn = psycopg2.connect(
host="localhost",
dbname="testdb",
user="postgres",
password="secret",
connect_timeout=5
)
connections.append(conn)
print(f"Worker {worker_id}: Connection opened successfully")
except Exception as e:
exhausted = True
print(f"Worker {worker_id}: Connection FAILED - {e}")
# Open connections in parallel
threads = []
for i in range(15):
t = threading.Thread(target=open_connection, args=(i,))
threads.append(t)
t.start()
time.sleep(0.1)
for t in threads:
t.join()
print(f"\nTotal connections opened: {len(connections)}")
print(f"Pool exhausted: {exhausted}")
for conn in connections:
conn.close()
# Expected output:
# Worker 0: Connection opened successfully
# Worker 1: Connection opened successfully
# ...
# Worker 9: Connection opened successfully
# Worker 10: Connection FAILED - FATAL: sorry, too many clients already
# Worker 11: Connection FAILED - FATAL: sorry, too many clients already
# Total connections opened: 10
# Pool exhausted: True
Step 2: Test PostgreSQL Replication Failover
Simulate a primary database failure and observe replica promotion:
# Set up streaming replication
docker exec postgres-primary psql -U postgres \
-c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'repl_secret';"
# Add 500ms of latency between primary and replica
docker exec postgres-replica tc qdisc add dev eth0 root netem delay 500ms
# Check replication lag
docker exec postgres-primary psql -U postgres -c "
SELECT application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
"
# Expected output:
# application_name | lag_bytes
# walreceiver | 31457280
# Stop the primary to trigger failover
docker stop postgres-primary
# Promote the replica to primary (simulating automated failover)
docker exec postgres-replica psql -U postgres -c "SELECT pg_promote();"
# Expected output:
# pg_promote
# t
# (The replica is now the new primary)
# Verify write capability on the promoted replica
docker exec postgres-replica psql -U postgres -c "CREATE TABLE failover_test (id int); INSERT INTO failover_test VALUES (1); SELECT * FROM failover_test;"
# Expected output:
# id
# 1
# The promoted replica accepts writes successfully
Step 3: Test Redis Cache Failures
Simulate Redis cache node failures and observe application behavior:
# Start a Redis cluster
docker run -d --name redis-cache -p 6379:6379 redis:7.2
# Set a cache entry
docker exec redis-cache redis-cli SET session:user1 "active" EX 3600
# Expected output:
# OK
# Simulate cache node failure
docker stop redis-cache
# Verify that the application falls back to the database
python3 cache-fallback-test.py
#!/usr/bin/env python3
"""Redis cache fallback test with chaos."""
import redis
import time
def get_user_session(user_id):
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
try:
# Try cache first
session = cache.get(f"session:{user_id}")
if session:
print(f"Cache HIT for user {user_id}: {session}")
return session
except redis.ConnectionError:
print(f"Cache MISS - Redis unavailable, falling back to database")
# Fallback to database
session = query_database(user_id)
print(f"Database fallback for user {user_id}: {session}")
return session
def query_database(user_id):
# Simulate database query
time.sleep(0.1)
return "active"
# Test 1: Cache is up
get_user_session("user1")
# Expected output:
# Cache HIT for user user1: active
# Test 2: Cache is down
get_user_session("user2")
# Expected output:
# Cache MISS - Redis unavailable, falling back to database
# Database fallback for user user2: active
Step 4: Test Read-After-Write Consistency
Verify consistency guarantees during replication lag:
# Start a primary and replica pair
docker compose -f db-pair.yaml up -d
# Write to primary
docker exec db-primary psql -U postgres -c "INSERT INTO orders VALUES (1, 'completed');"
# Immediately read from replica (may have lag)
docker exec db-replica psql -U postgres -c "SELECT * FROM orders WHERE id = 1;"
# Expected output (if no lag):
# id | status
# 1 | completed
# Add 10 seconds of replication lag
docker exec db-replica tc qdisc add dev eth0 root netem delay 10000ms
# Write to primary
docker exec db-primary psql -U postgres -c "INSERT INTO orders VALUES (2, 'pending');"
# Read from replica immediately
docker exec db-replica psql -U postgres -c "SELECT * FROM orders;"
# Expected output:
# id | status
# 1 | completed
# Order 2 is not visible on the replica yet due to replication lag
Learning Path
flowchart LR A[Database Faults] --> B[Database Chaos] B --> C[Network Chaos] C --> D[Kubernetes Chaos Testing] D --> E[Chaos Observability] style B fill:#f90,color:#fff
Common Errors
- Not testing Connection Pool recovery after database restart: A database restart can leave stale connections in the pool. Test that the pool reconnects gracefully after a restart.
- Assuming synchronous Replication in PostgreSQL: The default is asynchronous. Writes committed on the primary may not be visible on the replica for several seconds.
- Redis cache experiments without measuring the database load spike: When the cache empties, the database receives a sudden traffic spike. Measure whether the database can handle it.
- Testing failover without verifying the application reconnection logic: The application may still point to the failed primary. Test that the application retries with the new primary address.
- Ignoring cascading effects of a database failure: A database failure may trigger retry loops that increase load on other services. Monitor the entire system during database experiments.
Practice Questions
- What happens when a PostgreSQL Connection Pool is exhausted?
- How does Replication lag affect read-after-write consistency?
- What should an application do when Redis is unavailable?
- How do you test automated database failover?
- Why is it important to measure cache miss rates during chaos experiments?
Challenge
Set up a three-tier application with PostgreSQL primary-replica, Redis cache, and an application server. Run a five-stage database Chaos Experiment: exhaust the Connection Pool, trigger primary failover, kill the Redis cache, induce 30 seconds of Replication lag, and restart the primary. After each stage, measure the application response time, error rate, and cache hit ratio. Document which stages caused SLO breaches.
FAQ
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro