Database Chaos — Connection Drops, Replication Lag & Corruption
This tutorial walks through database chaos experiments against a Dockerized PostgreSQL setup, with the commands to run and the failure output to expect at each step.
Database chaos is a chaos engineering discipline that tests how applications behave when databases fail or degrade. Databases are a frequent single point of failure in distributed systems because they hold state, and state is hard to recover.
What You Will Learn
This tutorial teaches you how to simulate database connection drops, replication lag, data corruption, and connection pool exhaustion to validate your application's database resilience.
Why It Matters
Database failures are devastating because they affect every service that reads or writes data. A single corrupted row, a connection pool leak, or a replication lag spike can take down an entire application. Testing these scenarios before they happen in production is critical.
Real-World Use
DodaTech runs a monthly database chaos day where each microservice team must demonstrate that their service can survive the loss of its primary database connection for at least 60 seconds without losing data or corrupting state.
Prerequisites
Before starting, you should understand:
- Basic chaos engineering concepts
- How PostgreSQL or MySQL database connections work
- Docker and Docker Compose for database setup
- Application connection pooling concepts
Step 1: Set Up a Test Database
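Create two PostgreSQL servers with Docker for the experiments. The Compose file only starts the servers; streaming replication between them must be configured separately before Step 3 can report any lag:
# docker-compose-db-chaos.yaml
services:
  postgres-primary:
    image: postgres:16
    # Fixed name so the docker stop/exec commands in later steps resolve
    container_name: postgres-primary
    # Needed for the tc/netem latency injection in Step 3
    cap_add:
      - NET_ADMIN
    environment:
      POSTGRES_PASSWORD: secret
    ports:
      - "5432:5432"
  postgres-replica:
    image: postgres:16
    container_name: postgres-replica
    environment:
      POSTGRES_PASSWORD: secret
    ports:
      - "5433:5432"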
docker compose -f docker-compose-db-chaos.yaml up -d
# Expected output:
# [+] Running 3/3
# - Container postgres-primary Started
# - Container postgres-replica Started
Step 2: Simulate a Connection Drop
Kill the database connection and observe application behavior:
# Kill the database container
docker stop postgres-primary
# Call the application while the database is down
curl -s http://localhost:5000/api/users
# Expected output (with connection pool configured):
# {
#   "error": "database_unavailable",
#   "pool": "degraded"
# }
# The application should use a fallback or return cached data
# Bring the primary back when the experiment is done
docker start postgres-primary
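Once the primary is started again, it helps to measure how long the application takes to recover. The helper below is a minimal sketch of that check: `wait_for_recovery` is a hypothetical name, and it simply reruns whatever health-check command it is given until the command succeeds or the attempt budget runs out.

```shell
#!/usr/bin/env bash
# wait_for_recovery MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Reruns COMMAND until it exits 0 or MAX_ATTEMPTS is reached.
# Prints the number of attempts made; returns 1 if COMMAND never succeeded.
wait_for_recovery() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if "$@" >/dev/null 2>&1; then
      echo "$attempt"
      return 0
    fi
    sleep "$delay"
  done
  echo "$max_attempts"
  return 1
}
```

For example, `wait_for_recovery 60 1 curl -sf http://localhost:5000/api/users` polls the endpoint from this step once per second for up to 60 attempts. The `-f` flag makes curl exit non-zero on HTTP error responses, so an error status counts as "not recovered", assuming the application returns one while the database is down.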
Step 3: Simulate Replication Lag
Introduce network latency between primary and replica:
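# Add latency on the primary's network interface, which also delays WAL streaming to the replica
# Requires the NET_ADMIN capability on the container and the tc binary (iproute2), which the postgres image may not ship
docker exec postgres-primary tc qdisc add dev eth0 root netem delay 5000ms
# Check replication status (returns no rows unless a replica is streaming from this primary)
docker exec postgres-primary psql -U postgres -c "SELECT pg_current_wal_lsn() - pg_stat_replication.sent_lsn AS lag_bytes FROM pg_stat_replication;"
# Expected output:
# lag_bytes
# 52428800
# 52428800 bytes is 50 MiB of WAL not yet sent to the replica
# Remove the latency when the experiment is done
docker exec postgres-primary tc qdisc del dev eth0 root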
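The raw byte count is easier to act on once it is converted and compared against a threshold. The following is a small sketch of that arithmetic; `lag_to_mib` and `lag_exceeds` are hypothetical helper names, and the threshold is whatever your service can tolerate.

```shell
#!/usr/bin/env bash
# Convert a WAL lag in bytes to whole mebibytes (1 MiB = 1048576 bytes).
# An empty argument (no replica row returned) is treated as 0.
lag_to_mib() {
  local lag_bytes=${1:-0}
  echo $(( lag_bytes / 1048576 ))
}

# Succeed (exit 0) only when the lag is strictly above THRESHOLD_MIB.
lag_exceeds() {
  local lag_bytes=${1:-0} threshold_mib=$2
  (( lag_bytes > threshold_mib * 1048576 ))
}
```

With the value from the expected output, `lag_to_mib 52428800` prints `50`. To feed it live data, run the replication query with `psql -At` so that only the bare number is returned; this sketch assumes a single replica, so the query yields one row.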
Step 4: Test Connection Pool Exhaustion
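Open more connections than the server allows to see what an exhausted connection pool looks like from the database side:
# Set max_connections to a low value
# max_connections only takes effect after a server restart; pg_reload_conf() is not enough
docker exec postgres-primary psql -U postgres \
  -c "ALTER SYSTEM SET max_connections = 5;"
docker restart postgres-primary
# Wait until the server accepts connections again
until docker exec postgres-primary pg_isready -U postgres; do sleep 1; done
# Open many connections in parallel
for i in $(seq 1 10); do
  (PGPASSWORD=secret psql -h localhost -U postgres -c "SELECT pg_sleep(30)") &
done
wait
# Once 5 connections are held, further attempts fail
# Expected output:
# psql: error: connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  sorry, too many clients already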
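The experiment above connects as the postgres superuser, which can use every slot. Application roles are usually not superusers, and PostgreSQL holds back `superuser_reserved_connections` slots (3 by default) from them. The sketch below shows the resulting arithmetic; `usable_slots` is a hypothetical helper, not a PostgreSQL tool.

```shell
#!/usr/bin/env bash
# usable_slots MAX_CONNECTIONS RESERVED USED
# Prints how many more connections a non-superuser role can still open.
usable_slots() {
  local max=$1 reserved=$2 used=$3
  local free=$(( max - reserved - used ))
  # Never report a negative number of free slots
  if (( free < 0 )); then
    free=0
  fi
  echo "$free"
}
```

With `max_connections = 5` and the default reservation, `usable_slots 5 3 0` prints `2`, so an application role would hit "too many clients" sooner than the superuser test suggests.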
Step 5: Simulate Data Corruption
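Corrupt a table and observe the application's response. Run this only against a disposable test instance:
# Find the file that backs the table; base/1/12345 below is a placeholder for that path
docker exec postgres-primary psql -U postgres -c "SELECT pg_relation_filepath('users');"
# Simulate disk-level corruption by overwriting the first 1 KB of the table file
docker exec postgres-primary bash -c "dd if=/dev/urandom of=/var/lib/postgresql/data/base/1/12345 bs=1024 count=1 conv=notrunc"
# Restart so the page is re-read from disk instead of served from shared buffers
docker restart postgres-primary
# Query the corrupted table
docker exec postgres-primary psql -U postgres -c "SELECT * FROM users;"
# Expected output (the checksum warning appears only if the cluster was initialized with data checksums):
# WARNING: page verification failed, calculated checksum ...
# ERROR: invalid page in block 0 of relation base/1/12345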
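One pitfall with the `dd` command is that it silently creates the target file if the path is wrong, so nothing gets corrupted. The wrapper below is a sketch that refuses to run against a missing file; `corrupt_first_kib` is a hypothetical name, and the `dd` flags are the same ones used in this step.

```shell
#!/usr/bin/env bash
# corrupt_first_kib FILE
# Overwrites the first 1024 bytes of FILE with random data.
# conv=notrunc keeps the rest of the file and its size unchanged.
corrupt_first_kib() {
  local target=$1
  if [ ! -f "$target" ]; then
    echo "no such file: $target" >&2
    return 1
  fi
  dd if=/dev/urandom of="$target" bs=1024 count=1 conv=notrunc 2>/dev/null
}
```

Trying it on a scratch file first (for example an 8 KB file of zeros) shows that only the first kilobyte changes, which is roughly what happens to the first page of a table file.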
Learning Path
flowchart LR
  A[Resilience Testing] --> B[Database Faults]
  B --> C[Network Partitioning]
  C --> D[Infrastructure Faults]
  D --> E[Kubernetes Chaos]
  style B fill:#f90,color:#fff
Common Errors
- Not configuring connection pool limits: Without pool limits, every query attempt creates a new connection, eventually exhausting database resources.
- Ignoring read replica failover: Applications that use read replicas often hardcode the replica endpoint. Test what happens when the replica becomes unavailable.
- Setting query timeouts too high: A query that hangs for 30 seconds blocks a connection from the pool for the entire duration, exhausting the pool quickly.
- Not testing with WAL (Write-Ahead Log) corruption: WAL corruption is rare but catastrophic. Test that your backup and recovery procedures work.
- Forgetting to restore max_connections after testing: Leaving max_connections at an artificially low value will cause application errors in production. Run ALTER SYSTEM RESET max_connections and restart the server once the experiment ends.
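The query-timeout error above can be illustrated from the client side. This is a sketch that bounds any command with the GNU coreutils `timeout` utility; `with_deadline` is a hypothetical wrapper name, and a server-side limit such as PostgreSQL's `statement_timeout` setting is the more usual place to enforce this.

```shell
#!/usr/bin/env bash
# with_deadline SECONDS COMMAND [ARGS...]
# Runs COMMAND and stops it if it runs longer than SECONDS.
# Returns the command's exit status, or 124 when the deadline was hit.
with_deadline() {
  local seconds=$1
  shift
  local status=0
  timeout "$seconds" "$@" || status=$?
  if (( status == 124 )); then
    echo "timed out after ${seconds}s: $*" >&2
  fi
  return "$status"
}
```

Wrapping the `SELECT pg_sleep(30)` call from Step 4 in `with_deadline 5` gives up after five seconds instead of waiting the full thirty.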
Practice Questions
- How does connection pooling protect against database overload?
- What is replication lag and how can you simulate it?
- Why should you test applications with corrupted database blocks?
- What happens when the connection pool is exhausted and a new request arrives?
- How do you verify that an application handles a database connection drop gracefully?
Challenge
Set up a PostgreSQL primary-replica pair and an application that reads from the replica. Run chaos experiments that stop the primary, inject 10 seconds of replication lag, exhaust the connection pool, and corrupt a single row. Verify that the application degrades gracefully in each scenario and recovers fully when the fault is removed.
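One possible way to structure the challenge is a small harness that injects a fault, checks the application, and always removes the fault afterwards. The sketch below assumes each phase can be expressed as a shell command string; `run_experiment` and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# run_experiment NAME INJECT_CMD CHECK_CMD ROLLBACK_CMD
# Injects a fault, runs the check, then rolls back even if the check failed.
# Returns 0 on PASS, 1 on FAIL, 2 if the fault could not be injected.
run_experiment() {
  local name=$1 inject=$2 check=$3 rollback=$4
  local result=PASS
  echo "== $name"
  if ! bash -c "$inject"; then
    echo "$name: inject failed" >&2
    return 2
  fi
  bash -c "$check" || result=FAIL
  # The rollback runs regardless of the check result
  bash -c "$rollback" || echo "$name: rollback failed" >&2
  echo "$name: $result"
  [ "$result" = PASS ]
}
```

A first scenario could pass `docker stop postgres-primary` as the inject command, `docker start postgres-primary` as the rollback, and a check that expects your application's degraded response; the exact check depends on how your service reports a database outage.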
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro