Skip to content

Database Chaos Engineering — PostgreSQL, MySQL & Redis Resilience

DodaTech Updated 2026-06-23 7 min read

In this tutorial, you'll learn about Database Chaos Engineering. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Chaos Engineering for databases extends beyond simple connection drops to include Replication lag, failover timing, cache stampedes, query timeouts, and data consistency under partition. Databases are the stateful heart of most applications, and their failure modes are uniquely destructive.

What You Will Learn

This tutorial teaches you how to run chaos experiments on PostgreSQL, MySQL, and Redis databases: testing Connection Pool behavior, Replication lag tolerance, automated failover timing, cache invalidation storms, and read-after-write consistency.

Why It Matters

Database failures cascade faster and further than any other type of infrastructure failure. A single slow query can exhaust the Connection Pool, blocking every service that depends on that database. Testing these scenarios with controlled chaos experiments reveals connection leaks, misconfigured timeouts, and missing fallback logic before they cause production incidents.

Real-World Use

DodaTech runs a weekly PostgreSQL failover experiment on the Durga Antivirus Pro update database. The experiment triggers a manual failover from primary to replica and measures how long the update service is unable to serve new virus definition requests. The SLO target is under 10 seconds of write unavailability.

Prerequisites

Before starting you should understand:

  • PostgreSQL or MySQL Replication concepts
  • Redis cache patterns and eviction policies
  • Docker for local database setup
  • Basic Chaos Engineering principles

Step 1: Test PostgreSQL Connection Pool Behavior

Set up a Connection Pool and test exhaustion scenarios:

# pg-pool-chaos.yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: testdb
    ports:
      - "5432:5432"
  app:
    image: python:3.11
    command: >
      sh -c "pip install psycopg2-binary &&
             python /scripts/pool-test.py"
    depends_on:
      - postgres
    environment:
      DB_HOST: postgres
      DB_PASSWORD: secret
    volumes:
      - ./scripts:/scripts
# Reduce max_connections to trigger exhaustion quickly
docker exec postgres psql -U postgres -c "ALTER SYSTEM SET max_connections = 10;"
docker exec postgres psql -U postgres -c "SELECT pg_reload_conf();"

# Open 15 connections to exhaust the pool
python3 pool-exhaust.py
#!/usr/bin/env python3
"""Connection pool exhaustion test."""
import psycopg2
import threading
import time
import sys

connections = []
exhausted = False

def open_connection(worker_id):
    global exhausted
    try:
        conn = psycopg2.connect(
            host="localhost",
            dbname="testdb",
            user="postgres",
            password="secret",
            connect_timeout=5
        )
        connections.append(conn)
        print(f"Worker {worker_id}: Connection opened successfully")
    except Exception as e:
        exhausted = True
        print(f"Worker {worker_id}: Connection FAILED - {e}")

# Open connections in parallel
threads = []
for i in range(15):
    t = threading.Thread(target=open_connection, args=(i,))
    threads.append(t)
    t.start()
    time.sleep(0.1)

for t in threads:
    t.join()

print(f"\nTotal connections opened: {len(connections)}")
print(f"Pool exhausted: {exhausted}")

for conn in connections:
    conn.close()

# Expected output:
# Worker 0: Connection opened successfully
# Worker 1: Connection opened successfully
# ...
# Worker 9: Connection opened successfully
# Worker 10: Connection FAILED - FATAL: sorry, too many clients already
# Worker 11: Connection FAILED - FATAL: sorry, too many clients already
# Total connections opened: 10
# Pool exhausted: True

Step 2: Test PostgreSQL Replication Failover

Simulate a primary database failure and observe replica promotion:

# Set up streaming replication
docker exec postgres-primary psql -U postgres \
  -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'repl_secret';"

# Add 500ms of latency between primary and replica
docker exec postgres-replica tc qdisc add dev eth0 root netem delay 500ms

# Check replication lag
docker exec postgres-primary psql -U postgres -c "
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
"
# Expected output:
# application_name | lag_bytes
# walreceiver      | 31457280

# Stop the primary to trigger failover
docker stop postgres-primary

# Promote the replica to primary (simulating automated failover)
docker exec postgres-replica psql -U postgres -c "SELECT pg_promote();"
# Expected output:
# pg_promote
# t
# (The replica is now the new primary)

# Verify write capability on the promoted replica
docker exec postgres-replica psql -U postgres -c "CREATE TABLE failover_test (id int); INSERT INTO failover_test VALUES (1); SELECT * FROM failover_test;"
# Expected output:
# id
# 1
# The promoted replica accepts writes successfully

Step 3: Test Redis Cache Failures

Simulate Redis cache node failures and observe application behavior:

# Start a Redis cluster
docker run -d --name redis-cache -p 6379:6379 redis:7.2

# Set a cache entry
docker exec redis-cache redis-cli SET session:user1 "active" EX 3600
# Expected output:
# OK

# Simulate cache node failure
docker stop redis-cache

# Verify that the application falls back to the database
python3 cache-fallback-test.py
#!/usr/bin/env python3
"""Redis cache fallback test with chaos."""
import redis
import time

def get_user_session(user_id):
    cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

    try:
        # Try cache first
        session = cache.get(f"session:{user_id}")
        if session:
            print(f"Cache HIT for user {user_id}: {session}")
            return session
    except redis.ConnectionError:
        print(f"Cache MISS - Redis unavailable, falling back to database")

    # Fallback to database
    session = query_database(user_id)
    print(f"Database fallback for user {user_id}: {session}")
    return session

def query_database(user_id):
    # Simulate database query
    time.sleep(0.1)
    return "active"

# Test 1: Cache is up
get_user_session("user1")
# Expected output:
# Cache HIT for user user1: active

# Test 2: Cache is down
get_user_session("user2")
# Expected output:
# Cache MISS - Redis unavailable, falling back to database
# Database fallback for user user2: active

Step 4: Test Read-After-Write Consistency

Verify consistency guarantees during replication lag:

# Start a primary and replica pair
docker compose -f db-pair.yaml up -d

# Write to primary
docker exec db-primary psql -U postgres -c "INSERT INTO orders VALUES (1, 'completed');"

# Immediately read from replica (may have lag)
docker exec db-replica psql -U postgres -c "SELECT * FROM orders WHERE id = 1;"
# Expected output (if no lag):
# id | status
# 1  | completed

# Add 10 seconds of replication lag
docker exec db-replica tc qdisc add dev eth0 root netem delay 10000ms

# Write to primary
docker exec db-primary psql -U postgres -c "INSERT INTO orders VALUES (2, 'pending');"

# Read from replica immediately
docker exec db-replica psql -U postgres -c "SELECT * FROM orders;"
# Expected output:
# id | status
# 1  | completed
# Order 2 is not visible on the replica yet due to replication lag

Learning Path

flowchart LR
  A[Database Faults] --> B[Database Chaos]
  B --> C[Network Chaos]
  C --> D[Kubernetes Chaos Testing]
  D --> E[Chaos Observability]
  style B fill:#f90,color:#fff

Common Errors

  1. Not testing Connection Pool recovery after database restart: A database restart can leave stale connections in the pool. Test that the pool reconnects gracefully after a restart.
  2. Assuming synchronous Replication in PostgreSQL: The default is asynchronous. Writes committed on the primary may not be visible on the replica for several seconds.
  3. Redis cache experiments without measuring the database load spike: When the cache empties, the database receives a sudden traffic spike. Measure whether the database can handle it.
  4. Testing failover without verifying the application reconnection logic: The application may still point to the failed primary. Test that the application retries with the new primary address.
  5. Ignoring cascading effects of a database failure: A database failure may trigger retry loops that increase load on other services. Monitor the entire system during database experiments.

Practice Questions

  1. What happens when a PostgreSQL Connection Pool is exhausted?
  2. How does Replication lag affect read-after-write consistency?
  3. What should an application do when Redis is unavailable?
  4. How do you test automated database failover?
  5. Why is it important to measure cache miss rates during chaos experiments?

Challenge

Set up a three-tier application with PostgreSQL primary-replica, Redis cache, and an application server. Run a five-stage database Chaos Experiment: exhaust the Connection Pool, trigger primary failover, kill the Redis cache, induce 30 seconds of Replication lag, and restart the primary. After each stage, measure the application response time, error rate, and cache hit ratio. Document which stages caused SLO breaches.

FAQ

What is database Chaos Engineering?

Database Chaos Engineering tests how applications behave when databases experience failures like connection drops, Replication lag, failover, cache misses, and consistency violations.

How do you test PostgreSQL Replication failover?

Stop the primary database, promote the replica to primary using pg_promote, and verify that the application reconnects to the new primary and resumes normal operations.

What is the impact of Redis cache failure on applications?

When Redis fails, all requests hit the database directly. This can cause a sudden traffic spike that overwhelms the database, leading to cascading failures.

How do you measure read-after-write consistency during Replication lag?

Write a record to the primary database and immediately try to read it from the replica. If the replica does not show the new record, there is Replication lag affecting consistency.

What is Connection Pool exhaustion and how do you test it?

Connection Pool exhaustion occurs when all available database connections are in use. Test it by opening more connections than max_connections allows and observing how the application handles the connection failures.

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro