Data Reliability — Backups, Replication, Consistency
This tutorial covers the key concepts, practical examples, and best practices of data reliability.
Data reliability means that data is durable, consistent, and recoverable. Backup strategies, replication models, checksums, and integrity verification protect it against corruption, accidental deletion, and storage failures.
What You'll Learn
In this tutorial, you will learn the difference between durability and availability in data systems, how to choose between synchronous and asynchronous replication, how to implement backup strategies with verification, and how consistency models from ACID to eventual consistency affect reliability.
Why It Matters
For most companies, data is among their most valuable assets. Losing customer data is not just a technical failure; it can end the business. Every SRE team must understand how its storage systems protect against loss and how to verify that the protection actually works.
Real-World Use
DodaZIP cloud storage uses three levels of data protection: synchronous replication within a data center, asynchronous replication across regions, and daily encrypted backups to a separate provider. Each backup is verified with checksum validation. The system has maintained 99.9999999 percent durability for three years with zero data loss.
graph TD
A[Application Write] --> B[Primary Database]
B --> C[Synchronous Replica]
C --> D[Asynchronous Replica]
D --> E[Daily Backup]
E --> F[Backup Verification]
F --> G[Cross-Region Copy]
B --> H[Read Your Write Consistency]
D --> I[Eventual Consistency]
Prerequisites
Understanding Disaster Recovery gives context for why data durability matters. Familiarity with SLIs and SLOs helps you set data reliability targets.
Durability vs Availability
| Property | Definition | Example |
|---|---|---|
| Durability | Data will not be lost once written | Survives disk failure, data center outage |
| Availability | Data can be read when requested | Survives replica failure, network partition |
Durability Calculation
For a system with 11 nines of durability (99.999999999 percent), the annual probability of losing a given object is 1 in 100 billion.
def durability_probability(nines):
    # N nines of durability means an annual loss probability of 10**-N per object.
    annual_failure_rate = 10.0 ** -nines
    durability = 1 - annual_failure_rate
    # Use the exact integer 10**nines to avoid floating-point rounding in the reciprocal.
    years_per_failure = 10 ** nines
    print(f"Durability: {durability:.11%}")
    print(f"Annual failure rate: {annual_failure_rate:.2e}")
    print(f"Expected years between failures: {years_per_failure}")

durability_probability(11)
Expected output:
Durability: 99.99999999900%
Annual failure rate: 1.00e-11
Expected years between failures: 100000000000
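A durability figure is usually quoted per object per year, so the expected loss grows with the number of objects stored. The sketch below shows that scaling. It assumes objects are lost independently, which is a simplification, and the helper name `expected_annual_losses` and the fleet sizes are hypothetical.

```python
def expected_annual_losses(nines: int, object_count: int) -> float:
    """Expected objects lost per year, assuming each object is lost
    independently with probability 10**-nines per year."""
    return object_count * 10.0 ** -nines


# Hypothetical fleet sizes, chosen only for illustration.
for count in (1_000_000, 10_000_000_000):
    losses = expected_annual_losses(11, count)
    print(f"{count:>14,} objects -> {losses:.2e} expected losses per year")
```

Even at 11 nines, a store holding ten billion objects should expect to lose about one object every ten years, which is why durability targets are set with fleet size in mind.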
Replication Models
Synchronous Replication
Writes are confirmed only after all replicas acknowledge. Provides strong consistency but higher latency.
Asynchronous Replication
Writes are confirmed immediately on the primary. Replicas receive updates later. Provides better performance but possible data loss on primary failure.
| Model | Consistency | Latency | Data Loss Risk | Use Case |
|---|---|---|---|---|
| Synchronous | Strong | Higher | None for acknowledged writes | Financial transactions |
| Asynchronous | Eventual | Lower | Some | Analytics, logging |
| Quorum | Configurable | Moderate | Minimal | Distributed databases |
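The quorum row can be illustrated with a small sketch. With N replicas, a write quorum W, and a read quorum R, choosing W + R > N means every read quorum overlaps every write quorum, so a read consults at least one replica that holds the latest acknowledged write. The dict layout, function names, and version numbers below are assumptions made for illustration, not the API of any particular database.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when any read quorum must share a replica with any write quorum."""
    return w + r > n


def quorum_write(replicas, value, version, w):
    """Write to every reachable replica; fail if fewer than w are reachable."""
    reachable = [rep for rep in replicas if rep["up"]]
    if len(reachable) < w:
        return False
    for rep in reachable:
        rep["value"], rep["version"] = value, version
    return True


def quorum_read(replicas, r):
    """Consult r reachable replicas and return the value with the newest version."""
    reachable = [rep for rep in replicas if rep["up"]]
    if len(reachable) < r:
        return None
    consulted = reachable[:r]
    return max(consulted, key=lambda rep: rep["version"])["value"]


# N=3, W=2, R=2 (illustrative values)
replicas = [
    {"name": f"replica-{i}", "up": True, "value": "v1", "version": 1}
    for i in range(1, 4)
]
replicas[2]["up"] = False                  # replica-3 misses the write
write_ok = quorum_write(replicas, "v2", 2, w=2)
replicas[2]["up"] = True                   # replica-3 returns, still stale
replicas[0]["up"] = False                  # replica-1 fails
read_value = quorum_read(replicas, r=2)    # consults replica-2 and replica-3
print(write_ok, read_value)
```

The read still returns the new value even though one of the two consulted replicas is stale, because the other one took part in the write quorum.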
Replication Checker
class ReplicationChecker:
    """Toy model: each replica is a dict with "name", "data", and "lag" keys."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def check_sync_replication(self, write_value):
        # Synchronous: the write is confirmed only after every replica has applied it.
        print(f"Writing value: {write_value}")
        acks = 0
        for replica in self.replicas:
            replica["data"] = write_value
            replica["lag"] = 0
            acks += 1
        print(f"All {acks}/{len(self.replicas)} replicas acknowledged")
        return acks == len(self.replicas)

    def check_async_replication(self, write_value):
        # Asynchronous: the primary applies and acknowledges the write first.
        print(f"Writing value: {write_value}")
        self.replicas[0]["data"] = write_value
        primary_ack = True
        print("Primary acknowledged (async)")
        # The remaining replicas receive the update later, so they carry lag.
        for replica in self.replicas[1:]:
            replica["data"] = write_value
            replica["lag"] = 1
        return primary_ack

replicas = [
    {"name": "primary", "data": "", "lag": 0},
    {"name": "replica-1", "data": "", "lag": 0},
    {"name": "replica-2", "data": "", "lag": 0},
]
checker = ReplicationChecker(replicas)
checker.check_sync_replication("file_v2_content")
Expected output:
Writing value: file_v2_content
All 3/3 replicas acknowledged
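The data-loss risk of asynchronous replication comes from writes that the primary has acknowledged but not yet shipped to a replica. The following sketch makes that window visible. The class `AsyncPair` and its methods are hypothetical names for a toy in-memory model, not a real replication API.

```python
from collections import deque


class AsyncPair:
    """Toy primary/replica pair; names and structure are illustrative only."""

    def __init__(self):
        self.primary = []
        self.replica = []
        self.pending = deque()  # acknowledged writes not yet on the replica

    def write(self, value):
        self.primary.append(value)
        self.pending.append(value)
        return True  # acknowledged before the replica has the write

    def ship(self, count=1):
        """Deliver up to `count` pending writes to the replica, in order."""
        for _ in range(min(count, len(self.pending))):
            self.replica.append(self.pending.popleft())

    def failover(self):
        """Promote the replica; whatever was still pending is lost."""
        lost = list(self.pending)
        self.pending.clear()
        self.primary = list(self.replica)
        return lost


pair = AsyncPair()
for order in ("order-1", "order-2", "order-3"):
    pair.write(order)
pair.ship(count=2)        # the replica lags one write behind
lost = pair.failover()    # the primary fails before the last write ships
print(f"Lost on failover: {lost}")
```

The client was told that all three writes succeeded, yet the third does not survive the failover. The length of the pending queue at the moment of failure is the amount of data at risk.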
Backup Verification
A backup that has never been restored is not a backup. Verification must include checksum validation and full restore testing.
import hashlib

class BackupVerification:
    def __init__(self):
        self.backups = []

    def create_backup(self, data, backup_id):
        # Record a SHA-256 checksum at backup time for later comparison.
        checksum = hashlib.sha256(data.encode()).hexdigest()
        self.backups.append({
            "id": backup_id,
            "data": data,
            "checksum": checksum,
            "verified": False
        })
        print(f"Backup {backup_id}: checksum={checksum[:16]}...")
        return checksum

    def verify_backup(self, backup_id):
        # Recompute the checksum and compare it with the recorded value.
        backup = next(b for b in self.backups if b["id"] == backup_id)
        current_checksum = hashlib.sha256(backup["data"].encode()).hexdigest()
        if current_checksum == backup["checksum"]:
            backup["verified"] = True
            print(f"Backup {backup_id}: VERIFIED (checksums match)")
            return True
        else:
            print(f"Backup {backup_id}: CORRUPTED (checksums differ)")
            return False

    def restore_test(self, backup_id):
        backup = next(b for b in self.backups if b["id"] == backup_id)
        if not backup["verified"]:
            print(f"Backup {backup_id} not verified. Running verification...")
            self.verify_backup(backup_id)
        # The restore test passes only if verification succeeded.
        status = "PASSED" if backup["verified"] else "FAILED"
        print(f"Restore test: {backup_id} — {status}")
        return backup["verified"]

bv = BackupVerification()
backup_id = "backup-2026-06-23"
bv.create_backup("customer_data_v42", backup_id)
# Look the backup up by its ID, not by the checksum that create_backup returns.
bv.restore_test(backup_id)
Expected output (the checksum prefix below is a placeholder; a real run prints the first 16 hex digits of the actual SHA-256):
Backup backup-2026-06-23: checksum=a1b2c3d4e5f67890...
Backup backup-2026-06-23 not verified. Running verification...
Backup backup-2026-06-23: VERIFIED (checksums match)
Restore test: backup-2026-06-23 — PASSED
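A fuller restore test copies the backup back out and compares the restored bytes with the checksum recorded at backup time. The sketch below does this with temporary files. The file names are hypothetical, and `shutil.copyfile` stands in for whatever backup and restore tooling you actually use.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_restore_test(source: Path, workdir: Path) -> bool:
    backup = workdir / "backup.bin"
    restored = workdir / "restored.bin"
    recorded = sha256_of(source)        # checksum recorded at backup time
    shutil.copyfile(source, backup)     # stand-in for the real backup step
    shutil.copyfile(backup, restored)   # stand-in for the real restore step
    return sha256_of(restored) == recorded


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    source = workdir / "customers.db"
    source.write_bytes(b"customer_data_v42")
    restore_ok = backup_and_restore_test(source, workdir)
    print(f"Restore test passed: {restore_ok}")
```

In a real system the restore should land on separate infrastructure and be followed by application-level checks, since matching checksums only prove the bytes survived the round trip.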
Consistency Models
| Model | Guarantee | Performance | Example |
|---|---|---|---|
| Strict | Reads see most recent write | Lowest | Single-node DB |
| Strong | Reads see writes after acknowledgment | Moderate | MySQL with sync replication |
| Eventual | Reads eventually see writes | Highest | DNS, CDN caches |
| Read-your-writes | Client reads own writes | High | Session consistency |
Consistency Test
def test_read_your_writes(write_value):
    written = [write_value]
    read_back = written[-1]
    print(f"Wrote: '{write_value}'")
    print(f"Read: '{read_back}'")
    print(f"Consistent: {write_value == read_back}")

test_read_your_writes("updated_file.txt")
Expected output:
Wrote: 'updated_file.txt'
Read: 'updated_file.txt'
Consistent: True
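The test above reads from the same in-memory list it wrote to, so it always passes. Read-your-writes only becomes interesting when reads can hit a lagging replica. One common approach, sketched here with hypothetical `Store` and `Session` classes, is for each session to remember the version of its last write and to fall back to the primary whenever the replica has not caught up.

```python
class Store:
    """Toy primary plus one replica that lags until sync() is called."""

    def __init__(self):
        self.primary = {"version": 0, "value": None}
        self.replica = {"version": 0, "value": None}

    def write(self, value):
        self.primary = {"version": self.primary["version"] + 1, "value": value}
        return self.primary["version"]

    def sync(self):
        self.replica = dict(self.primary)


class Session:
    """Tracks the newest version this client wrote."""

    def __init__(self, store):
        self.store = store
        self.last_written = 0

    def write(self, value):
        self.last_written = self.store.write(value)

    def read(self):
        replica = self.store.replica
        if replica["version"] >= self.last_written:
            return replica["value"], "replica"
        # Replica is behind this session's own write: read from the primary.
        return self.store.primary["value"], "primary"


store = Store()
writer, other = Session(store), Session(store)
writer.write("updated_file.txt")
before_sync = writer.read()   # served by the primary
stale_read = other.read()     # another client may still see the old value
store.sync()
after_sync = writer.read()    # the replica has caught up
print(before_sync, stale_read, after_sync)
```

The writer always sees its own write, while the other session sees it only after the replica syncs. That difference is the gap between read-your-writes and plain eventual consistency.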
Data Integrity Monitoring
Data integrity monitoring detects corruption, drift, or unexpected changes in your data. It is a critical practice that many SRE teams neglect until data loss occurs.
Checksum-Based Integrity Checks
Periodically compute checksums of your data and compare them against known-good values. This detects silent corruption that backups and replication alone cannot catch.
import hashlib
import json

class DataIntegrityMonitor:
    def __init__(self):
        self.checksums = {}  # dataset name -> baseline SHA-256 hex digest

    @staticmethod
    def _checksum(data):
        # sort_keys=True keeps the JSON encoding, and so the hash, independent of key order.
        return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()

    def register_dataset(self, name, data):
        checksum = self._checksum(data)
        self.checksums[name] = checksum
        print(f"Registered {name}: checksum={checksum[:16]}...")

    def verify(self, name, data):
        if name not in self.checksums:
            print(f"UNKNOWN: {name} — no baseline checksum")
            return None
        expected = self.checksums[name]
        actual = self._checksum(data)
        if expected == actual:
            print(f"PASS: {name} — integrity verified")
            return True
        else:
            print(f"FAIL: {name} — DATA CORRUPTION DETECTED")
            print(f"  Expected: {expected[:16]}...")
            print(f"  Actual: {actual[:16]}...")
            return False

monitor = DataIntegrityMonitor()
monitor.register_dataset("user-bookmarks", {"user1": ["url1", "url2"], "user2": ["url3"]})
monitor.verify("user-bookmarks", {"user1": ["url1", "url2"], "user2": ["url3"]})
monitor.verify("user-bookmarks", {"user1": ["url1", "url2"], "user2": ["url4"]})
Expected output (the checksum prefixes below are placeholders; a real run prints the actual SHA-256 prefixes):
Registered user-bookmarks: checksum=a1b2c3d4e5f67890...
PASS: user-bookmarks — integrity verified
FAIL: user-bookmarks — DATA CORRUPTION DETECTED
  Expected: a1b2c3d4e5f67890...
  Actual: 0987fedcba543210...
Replication Lag Monitoring
For asynchronous replication, you must monitor replication lag. If lag grows beyond acceptable thresholds, it indicates a problem that could lead to data loss during a failover.
| Replication Type | Acceptable Lag | Alert Threshold |
|---|---|---|
| Synchronous | 0 ms | Not applicable |
| Cross-datacenter async | Under 1 second | 5 seconds |
| Cross-region async | Under 5 seconds | 30 seconds |
| Cross-continent async | Under 30 seconds | 2 minutes |
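The thresholds in the table can be turned into a simple classifier for monitoring. In this sketch the threshold values come from the table, while the function name `classify_lag` and the intermediate "warn" band between the acceptable lag and the alert threshold are assumptions added for illustration.

```python
# (acceptable lag, alert threshold) in seconds, taken from the table above.
LAG_THRESHOLDS = {
    "cross-datacenter": (1.0, 5.0),
    "cross-region": (5.0, 30.0),
    "cross-continent": (30.0, 120.0),
}


def classify_lag(replication_type: str, lag_seconds: float) -> str:
    """Return "ok", "warn", or "alert" for a measured replication lag."""
    acceptable, alert = LAG_THRESHOLDS[replication_type]
    if lag_seconds < acceptable:
        return "ok"
    if lag_seconds < alert:
        return "warn"   # assumed band: above acceptable, below the alert threshold
    return "alert"


# Hypothetical measurements.
for lag in (2.0, 12.0, 45.0):
    print(f"cross-region lag {lag:>5.1f}s -> {classify_lag('cross-region', lag)}")
```

Synchronous replication is left out on purpose: the write path itself enforces zero lag, so there is nothing to threshold.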
Point-in-Time Recovery
Point-in-Time Recovery (PITR) allows restoring a database to any point within a retention window, not just the last full backup. This is essential for recovering from accidental data deletion or corruption.
Most managed database services support PITR with configurable retention periods. A typical configuration is 7 to 35 days of PITR retention. The trade-off is storage cost against recovery granularity.
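Conceptually, PITR combines a base snapshot with a log of later changes and replays the log up to the chosen moment. The sketch below models that idea with plain Python data. The snapshot and log formats, the integer timestamps, and the function name `restore_to_point` are all hypothetical and do not mirror any specific database.

```python
def restore_to_point(snapshot, log, target_ts):
    """Rebuild state as of target_ts.

    snapshot: (timestamp, dict of key -> value)
    log: list of (timestamp, key, value) sorted by timestamp;
         a value of None records a deletion.
    """
    snap_ts, state = snapshot
    if target_ts < snap_ts:
        raise ValueError("target is older than the base snapshot")
    restored = dict(state)
    for ts, key, value in log:
        if ts <= snap_ts:
            continue          # already reflected in the snapshot
        if ts > target_ts:
            break             # stop replay at the recovery point
        if value is None:
            restored.pop(key, None)
        else:
            restored[key] = value
    return restored


snapshot = (100, {"alice": "a@example.com"})
log = [
    (110, "bob", "b@example.com"),
    (120, "alice", None),             # accidental deletion
    (130, "carol", "c@example.com"),
]
# Recover to just before the accidental deletion at timestamp 120.
restored = restore_to_point(snapshot, log, target_ts=119)
print(restored)
```

Recovering to timestamp 119 keeps the write at 110 and drops everything from the deletion onward. The retention window is bounded by how far back the snapshots and log are kept.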
Data Lifecycle Management
Not all data needs the same level of protection. Define data tiers based on criticality:
| Tier | Examples | Backup Frequency | Retention | Replication |
|---|---|---|---|---|
| Critical | User data, transactions | Continuous | 7 years | Synchronous |
| Important | Application logs, analytics | Hourly | 90 days | Asynchronous |
| Ephemeral | Cache data, session state | None | None | None |
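Expressing the tiers as data lets tooling look up the policy for a dataset instead of relying on convention. This is a minimal sketch: the policy values mirror the table, but the `TierPolicy` class, the dataset names, and the choice to treat unknown datasets as critical are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierPolicy:
    backup_frequency: Optional[str]
    retention_days: Optional[int]
    replication: Optional[str]


TIER_POLICIES = {
    "critical": TierPolicy("continuous", 7 * 365, "synchronous"),  # about 7 years
    "important": TierPolicy("hourly", 90, "asynchronous"),
    "ephemeral": TierPolicy(None, None, None),
}

# Hypothetical dataset-to-tier assignments.
DATASET_TIERS = {
    "transactions": "critical",
    "application-logs": "important",
    "session-state": "ephemeral",
}


def policy_for(dataset: str) -> TierPolicy:
    # Assumption: unclassified datasets get the most protective tier.
    return TIER_POLICIES[DATASET_TIERS.get(dataset, "critical")]


for name in ("transactions", "application-logs", "session-state", "new-dataset"):
    print(name, policy_for(name))
```

Defaulting unknown datasets to the critical tier costs more storage, but it avoids silently leaving new data unprotected until someone classifies it.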
Common Errors
| Error | Explanation |
|---|---|
| No backup verification | A backup that has never been restored is not a backup. Always verify. |
| Single region replication | A regional disaster destroys both primary and replicas. Replicate across regions. |
| Ignoring consistency model | Choosing eventual consistency for a service that needs read-your-writes causes user-visible bugs. |
| No checksum validation | Silent data corruption happens. Checksums detect it. |
| Same backup location | Storing backups in the same data center as production defeats the purpose. |
| No backup for configuration | Infrastructure-as-code state files and database configuration should also be backed up. |
Practice Questions
- What is the difference between durability and availability?
- Why is synchronous replication safer than asynchronous replication?
- What does it mean when we say a backup must be verified?
- When would you choose eventual consistency over strong consistency?
- Why should backups be stored in a different geographical region?
Challenge
Design a data reliability strategy for the Doda Browser sync service. The service stores user bookmarks, browsing history, and preferences. Define the replication model, backup schedule, consistency requirements, and data integrity verification process. Calculate the expected durability and explain how you would test the strategy.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro