Database Chaos — Connection Drops, Replication Lag & Corruption

DodaTech Updated 2026-06-21 5 min read

This tutorial covers database chaos: the main failure modes, hands-on experiments that reproduce them, and the mistakes to avoid when you run those experiments.

Database chaos is a chaos engineering discipline that tests how applications behave when databases fail or degrade. Databases are a common single point of failure in distributed systems because they hold state, and state is hard to recover.

What You Will Learn

This tutorial teaches you how to simulate database connection drops, replication lag, data corruption, and connection pool exhaustion to validate your application's database resilience.

Why It Matters

Database failures are devastating because they affect every service that reads or writes data. A single corrupted row, a connection pool leak, or a replication lag spike can take down an entire application. Testing these scenarios before they happen in production is critical.

Real-World Use

DodaTech runs a monthly database chaos day where each microservice team must demonstrate that their service can survive the loss of its primary database connection for at least 60 seconds without losing data or corrupting state.
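
A survival criterion like this could be checked with a small probe loop. The sketch below is an assumption about how such a check might look, not DodaTech's actual tooling: `probe_window` is a hypothetical helper that runs any command a fixed number of times and counts successes and failures.

```shell
# probe_window COUNT INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND COUNT times, sleeping INTERVAL_SECONDS after each run,
# and prints "ok=<n> failed=<m>".
probe_window() {
  local count=$1 interval=$2 ok=0 failed=0 i=0
  shift 2
  while [ "$i" -lt "$count" ]; do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    else
      failed=$((failed + 1))
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "ok=$ok failed=$failed"
}

# Example: probe the application once per second for 60 seconds while the
# primary is down (the endpoint is this tutorial's example URL):
# probe_window 60 1 curl -sf http://localhost:5000/api/users
```

Counting failures only shows availability. Checking for lost or corrupted data still requires comparing what was written before and after the outage.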

Prerequisites

Before starting you should understand:

  • Basic Chaos Engineering concepts
  • How PostgreSQL or MySQL database connections work
  • Docker and Docker Compose for database setup
  • Application connection pooling concepts

Step 1: Set Up a Test Database

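Create two PostgreSQL containers with Docker for the experiments. The fixed container names let the later docker stop and docker exec commands find them, and the NET_ADMIN capability is needed for the tc command in Step 3:

# docker-compose-db-chaos.yaml
services:
  postgres-primary:
    image: postgres:16
    container_name: postgres-primary
    cap_add:
      - NET_ADMIN   # allows tc/netem inside the container (Step 3)
    environment:
      POSTGRES_PASSWORD: secret
    ports:
      - "5432:5432"
  postgres-replica:
    image: postgres:16
    container_name: postgres-replica
    environment:
      POSTGRES_PASSWORD: secret
    ports:
      - "5433:5432"

This file starts two independent servers. Streaming replication from the primary to the replica must be configured separately before the replication query in Step 3 returns any rows.

Start the containers:

docker compose -f docker-compose-db-chaos.yaml up -d
# Expected output:
# [+] Running 3/3
#  - Container postgres-primary Started
#  - Container postgres-replica Started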

Step 2: Simulate a Connection Drop

Stop the primary and observe application behavior:

# Stop the database container (graceful shutdown; use "docker kill" to simulate a crash)
docker stop postgres-primary

# Call the application (example endpoint)
curl -s http://localhost:5000/api/users
# Expected output (with connection pool configured):
# {
#   "error": "database_unavailable",
#   "pool": "degraded"
# }

# The application should use a fallback or return cached data

# Bring the primary back before the next step
docker start postgres-primary
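
One way to watch the application recover once the database is back is to retry the request with exponential backoff. This is a minimal sketch, not part of the original tutorial: `retry_with_backoff` is a hypothetical helper, and the attempt count and delays are placeholder values.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after every failed attempt.
retry_with_backoff() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example: retry the tutorial's example endpoint up to 5 times (1s, 2s, 4s, 8s):
# retry_with_backoff 5 1 curl -sf http://localhost:5000/api/users
```

The same pattern applies inside the application: bounded retries with growing delays avoid hammering a database that is still starting up.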

Step 3: Simulate Replication Lag

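Introduce network latency between primary and replica:

# If tc is missing from the image, install it first (the container also needs the NET_ADMIN capability)
docker exec postgres-primary bash -c "apt-get update && apt-get install -y iproute2"

# Add 5 seconds of latency to all traffic leaving the primary, including the replication stream
docker exec postgres-primary tc qdisc add dev eth0 root netem delay 5000ms

# Check replication status
docker exec postgres-primary psql -U postgres -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes FROM pg_stat_replication;"
# Example output (the value depends on the write load):
# lag_bytes
# 52428800
# The lag shows 50 MB of WAL that the replica has not replayed yet

# Remove the latency when you are done
docker exec postgres-primary tc qdisc del dev eth0 root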
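
To turn the lag_bytes value into a pass/fail check, a small helper can compare it against a threshold. The sketch below is illustrative: `lag_status` and the threshold values are assumptions, and it expects a single integer such as the one returned by the query above when exactly one replica is attached.

```shell
# lag_status LAG_BYTES THRESHOLD_BYTES
# Prints the lag in MiB and returns non-zero when it exceeds the threshold.
lag_status() {
  local lag_bytes=$1 threshold_bytes=$2
  local lag_mib=$((lag_bytes / 1024 / 1024))
  if [ "$lag_bytes" -gt "$threshold_bytes" ]; then
    echo "lag ${lag_mib} MiB: ABOVE threshold"
    return 1
  fi
  echo "lag ${lag_mib} MiB: within threshold"
}

# Example: 52428800 bytes against a 10 MiB threshold
# lag_status 52428800 10485760
```

A non-zero return code makes the helper easy to use as a gate in a chaos script or a CI job.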

Step 4: Test Connection Pool Exhaustion

Open more connections than the server allows. This is what the database sees when application pools leak or are sized too large:

# Set max_connections to a low value (this setting only takes effect after a restart)
docker exec postgres-primary psql -U postgres -c "ALTER SYSTEM SET max_connections = 5;"
docker restart postgres-primary

# Open many connections in parallel
for i in $(seq 1 10); do
  (PGPASSWORD=secret psql -h localhost -U postgres -c "SELECT pg_sleep(30)") &
done
wait

# As a superuser, the 6th connection will fail
# Expected output:
# psql: error: connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL: sorry, too many clients already
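
Reading ten interleaved psql outputs is error-prone, so it can help to count the failures instead. The following sketch assumes a hypothetical helper, `run_parallel`, that starts several copies of a command at once and reports how many exited with an error.

```shell
# run_parallel N COMMAND [ARGS...]
# Starts N copies of COMMAND concurrently, waits for all of them,
# and prints "started=<n> failed=<m>".
run_parallel() {
  local n=$1 i=0 failed=0 pids="" pid
  shift
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1 &
    pids="$pids $!"
    i=$((i + 1))
  done
  for pid in $pids; do
    wait "$pid" || failed=$((failed + 1))
  done
  echo "started=$n failed=$failed"
}

# Example, mirroring the loop above:
# export PGPASSWORD=secret
# run_parallel 10 psql -h localhost -U postgres -c "SELECT pg_sleep(30)"
```

With a limit of 5 connections and a superuser login, about half of the ten attempts would be expected to fail, although the exact count depends on timing.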

Step 5: Simulate Data Corruption

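Corrupt a table and observe the application's response:

# Flush dirty pages, then find the file that backs the users table (the path differs per database and table)
docker exec postgres-primary psql -U postgres -c "CHECKPOINT;"
REL=$(docker exec postgres-primary psql -U postgres -Atc "SELECT pg_relation_filepath('users');")

# Simulate disk-level corruption by overwriting the first 1 KB of that file
docker exec postgres-primary dd if=/dev/urandom of="/var/lib/postgresql/data/$REL" bs=1024 count=1 conv=notrunc

# Restart so the page is read from disk instead of shared buffers, then query the table
docker restart postgres-primary
docker exec postgres-primary psql -U postgres -c "SELECT * FROM users;"
# Expected output (the WARNING appears only if the cluster was initialized with data checksums):
# WARNING: page verification failed, calculated checksum ...
# ERROR: invalid page in block 0 of relation base/<database_oid>/<filenode>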
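
Before pointing dd at a data directory, it may be worth rehearsing the overwrite on a scratch file to see exactly what the flags do. This sketch touches no database; the temporary files and variable names are illustrative.

```shell
# Rehearsal on a scratch file; nothing here touches a database.
scratch=$(mktemp)
original=$(mktemp)

# One 8 KiB block of zeros (8 KiB is PostgreSQL's default page size).
dd if=/dev/zero of="$scratch" bs=8192 count=1 2>/dev/null
cp "$scratch" "$original"

# Overwrite the first 1 KiB with random bytes, as in the step above.
# conv=notrunc keeps the rest of the file instead of truncating it.
dd if=/dev/urandom of="$scratch" bs=1024 count=1 conv=notrunc 2>/dev/null

size=$(wc -c < "$scratch" | tr -d ' ')
if cmp -s "$scratch" "$original"; then changed=no; else changed=yes; fi
echo "size=$size changed=$changed"

rm -f "$scratch" "$original"
```

The file keeps its original size while its content changes, which is why this kind of damage goes unnoticed until the page is read and validated.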

Learning Path

flowchart LR
  A[Resilience Testing] --> B[Database Faults]
  B --> C[Network Partitioning]
  C --> D[Infrastructure Faults]
  D --> E[Kubernetes Chaos]
  style B fill:#f90,color:#fff

Common Errors

  1. Not configuring connection pool limits: Without pool limits, every query attempt can open a new connection, eventually exhausting database resources.
  2. Ignoring read replica failover: Applications that use read replicas often hardcode the replica endpoint. Test what happens when the replica becomes unavailable.
  3. Setting query timeouts too high: A query that hangs for 30 seconds holds a pooled connection for the entire duration, so a handful of slow queries can exhaust the pool.
  4. Not testing WAL (Write-Ahead Log) corruption: WAL corruption is rare but catastrophic. Test that your backup and recovery procedures work.
  5. Forgetting to restore max_connections after testing: Leaving max_connections at an artificially low value causes application errors after the experiment ends. Reset it with ALTER SYSTEM RESET max_connections and restart the server.
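
Forgotten cleanup, as in the last item, can be made harder to miss by tying the rollback to the experiment itself. The sketch below is one possible approach, not a prescribed method: `with_rollback` is a hypothetical helper that runs the experiment in a subshell and uses an EXIT trap so the rollback runs even when the experiment fails.

```shell
# with_rollback ROLLBACK_COMMAND EXPERIMENT_COMMAND
# Runs the experiment in a subshell whose EXIT trap always runs the rollback,
# even when the experiment fails part-way through.
with_rollback() {
  local rollback=$1 experiment=$2
  (
    trap "$rollback" EXIT
    eval "$experiment"
  )
}

# Example: inject latency for 60 seconds, then always remove it again:
# with_rollback \
#   'docker exec postgres-primary tc qdisc del dev eth0 root' \
#   'docker exec postgres-primary tc qdisc add dev eth0 root netem delay 5000ms && sleep 60'
```

The helper returns the exit status of the experiment, so a failed injection is still visible to the calling script.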

Practice Questions

  1. How does connection pooling protect against database overload?
  2. What is replication lag and how can you simulate it?
  3. Why should you test applications with corrupted database blocks?
  4. What happens when the connection pool is exhausted and a new request arrives?
  5. How do you verify that an application handles a database connection drop gracefully?

Challenge

Set up a PostgreSQL primary-replica pair and an application that reads from the replica. Run chaos experiments that: stop the primary, inject 10 seconds of replication lag, exhaust the connection pool, and corrupt a single row. Verify that the application degrades gracefully in each scenario and recovers fully when the fault is removed.

FAQ

What is database chaos engineering?

Database chaos engineering tests how applications behave when databases experience failures like connection drops, replication lag, corruption, or pool exhaustion.

How do I simulate database connection drops?

Stop the database process (docker stop, systemctl stop) or block the port with iptables. Observe how the application and connection pool react.

What is replication lag?

Replication lag is the delay between a write on the primary database and its appearance on read replicas. High lag means replicas serve stale data.

How do I protect against connection pool exhaustion?

Keep the combined pool size of all application instances below the database max_connections, leaving room for reserved connections. Also configure queue timeouts and monitor pool utilization with alerts.
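
The relationship between pool limits and max_connections comes down to simple arithmetic. This sketch assumes a hypothetical helper, `pool_size_per_instance`, and an even split across application instances. The example numbers use PostgreSQL's default max_connections (100) and superuser_reserved_connections (3), while the instance count of 4 is made up.

```shell
# pool_size_per_instance MAX_CONNECTIONS RESERVED INSTANCES
# Prints the largest per-instance pool size that keeps the total
# number of pooled connections within the server's limit.
pool_size_per_instance() {
  local max_connections=$1 reserved=$2 instances=$3
  echo $(( (max_connections - reserved) / instances ))
}

# Example: 100 connections, 3 reserved, 4 application instances
pool_size_per_instance 100 3 4   # prints 24
```

Integer division rounds down, so the total (4 x 24 = 96) stays under the 97 connections available to the application.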

Can I test database corruption safely?

Yes, as long as you use a dedicated test database with non-production data. Corruption operations can permanently damage the data directory, so never run them against an instance you need to keep.
