Data Reliability — Backups, Replication, Consistency

DodaTech Updated 2026-06-23 8 min read

In this tutorial, you'll learn about Data Reliability. We cover key concepts, practical examples, and best practices.

Data reliability means that data is durable, consistent, and recoverable. Backup strategies, replication models, checksums, and integrity verification protect it against corruption, accidental deletion, and storage failures.

What You'll Learn

In this tutorial, you will learn the difference between durability and availability in data systems, how to choose between synchronous and asynchronous replication, how to implement backup strategies with verification, and how consistency models from ACID to eventual consistency affect reliability.

Why It Matters

Data is the most valuable asset most companies have. Losing customer data is not just a technical failure; it can be a business-ending event. Every SRE team must understand how their data storage systems protect against loss and how to verify that the protection actually works.

Real-World Use

DodaZIP cloud storage uses three levels of data protection: synchronous replication within a data center, asynchronous replication across regions, and daily encrypted backups to a separate provider. Each backup is verified with checksum validation. The system has maintained 99.9999999 percent durability for three years with zero data loss.

graph TD
    A[Application Write] --> B[Primary Database]
    B --> C[Synchronous Replica]
    C --> D[Asynchronous Replica]
    D --> E[Daily Backup]
    E --> F[Backup Verification]
    F --> G[Cross-Region Copy]
    B --> H[Read Your Write Consistency]
    D --> I[Eventual Consistency]

Prerequisites

Understanding Disaster Recovery gives context for why data durability matters. Familiarity with SLIs and SLOs helps you set data reliability targets.

Durability vs Availability

Property	Definition	Example
Durability	Data will not be lost once written	Survives disk failure, data center outage
Availability	Data can be read when requested	Survives replica failure, network partition

Durability Calculation

For a system with 11 nines of durability (99.999999999 percent), the probability of data loss is extremely low.

def durability_probability(nines):
    # Annual probability of losing a given object, e.g. 11 nines -> 1e-11.
    # Computed directly to avoid the rounding error of 1 - (1 - 10**-nines).
    annual_failure_rate = 10.0 ** -nines
    durability = 1 - annual_failure_rate
    # Integer arithmetic keeps the reciprocal exact.
    years_per_failure = 10 ** nines
    print(f"Durability: {durability:.11%}")
    print(f"Annual failure rate: {annual_failure_rate:.2e}")
    print(f"Expected years between failures: {years_per_failure}")

durability_probability(11)

Expected output:

Durability: 99.99999999900%
Annual failure rate: 1.00e-11
Expected years between failures: 100000000000
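The per-object figure above scales with how much you store. As a rough sketch, assuming every object is lost independently with the same annual probability (a simplification, and the function names here are made up for this example), the expected number of objects lost per year is the object count multiplied by that probability:

```python
def expected_annual_losses(nines: int, object_count: int) -> float:
    """Expected objects lost per year, assuming independent losses."""
    annual_loss_probability = 10.0 ** -nines
    return object_count * annual_loss_probability


def years_per_lost_object(nines: int, object_count: int) -> float:
    """Average years between single-object losses across all stored objects."""
    return 1 / expected_annual_losses(nines, object_count)


losses = expected_annual_losses(11, 10_000_000)
print(f"Expected objects lost per year: {losses:.1e}")
print(f"Years per lost object: {years_per_lost_object(11, 10_000_000):,.0f}")
```

With 10 million objects at 11 nines, this works out to roughly one lost object every 10,000 years. The same arithmetic shows why fewer nines or far larger object counts change the picture quickly.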

Replication Models

Synchronous Replication

Writes are confirmed only after all replicas acknowledge. Provides strong consistency but higher latency.

Asynchronous Replication

Writes are confirmed as soon as the primary applies them. Replicas receive updates later. This gives lower write latency, but writes that have not yet reached a replica can be lost if the primary fails.

Model	Consistency	Latency	Data Loss Risk	Use Case
Synchronous	Strong	Higher	None for acknowledged writes	Financial transactions
Asynchronous	Eventual	Lower	Some	Analytics, logging
Quorum	Configurable	Moderate	Minimal	Distributed databases
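The Quorum row is "configurable" because the trade-off depends on how many replicas must take part in each operation. With N replicas, a write quorum of W, and a read quorum of R, every read overlaps the latest acknowledged write when W + R > N. The sketch below checks that condition; the function names and example values are illustrative and not taken from any particular database.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """Return True if every read quorum intersects every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return w + r > n


def tolerated_write_failures(n: int, w: int) -> int:
    """Replicas that can be down while writes still reach quorum."""
    return n - w


for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3)]:
    print(f"N={n} W={w} R={r}: overlap={quorum_overlaps(n, w, r)}, "
          f"writes tolerate {tolerated_write_failures(n, w)} failure(s)")
```

Lowering W or R reduces latency but, once W + R is no longer greater than N, a read may miss the most recent write.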

Replication Checker

class ReplicationChecker:
    def __init__(self, replicas):
        self.replicas = list(replicas)

    def check_sync_replication(self, write_value):
        print(f"Writing value: {write_value}")
        acks = 0
        for replica in self.replicas:
            replica["data"] = write_value
            replica["lag"] = 0
            acks += 1
        print(f"All {acks}/{len(self.replicas)} replicas acknowledged")
        return acks == len(self.replicas)

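    def check_async_replication(self, write_value):
        print(f"Writing value: {write_value}")
        # Assumes the first entry is the primary: it applies the write
        # and acknowledges without waiting for the followers.
        primary = self.replicas[0]
        primary["data"] = write_value
        primary["lag"] = 0
        print("Primary acknowledged (async)")
        # Followers have not received the write yet, so they keep their
        # old data and are marked as one write behind.
        for replica in self.replicas[1:]:
            replica["lag"] = 1
        return True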

replicas = [
    {"name": "primary", "data": "", "lag": 0},
    {"name": "replica-1", "data": "", "lag": 0},
    {"name": "replica-2", "data": "", "lag": 0},
]
checker = ReplicationChecker(replicas)
checker.check_sync_replication("file_v2_content")

Expected output:

Writing value: file_v2_content
All 3/3 replicas acknowledged

Backup Verification

A backup that has never been restored is not a backup. Verification must include checksum validation and full restore testing.

import hashlib

class BackupVerification:
    def __init__(self):
        self.backups = []

    def create_backup(self, data, backup_id):
        checksum = hashlib.sha256(data.encode()).hexdigest()
        self.backups.append({
            "id": backup_id,
            "data": data,
            "checksum": checksum,
            "verified": False
        })
        print(f"Backup {backup_id}: checksum={checksum[:16]}...")
        return checksum

    def verify_backup(self, backup_id):
        backup = next(b for b in self.backups if b["id"] == backup_id)
        current_checksum = hashlib.sha256(backup["data"].encode()).hexdigest()
        if current_checksum == backup["checksum"]:
            backup["verified"] = True
            print(f"Backup {backup_id}: VERIFIED (checksums match)")
            return True
        else:
            print(f"Backup {backup_id}: CORRUPTED (checksums differ)")
            return False

    def restore_test(self, backup_id):
        backup = next(b for b in self.backups if b["id"] == backup_id)
        if not backup["verified"]:
            print(f"Backup {backup_id} not verified. Running verification...")
            if not self.verify_backup(backup_id):
                # A corrupted backup must not go on to report PASSED.
                return
        print(f"Restore test: {backup_id} — PASSED")

bv = BackupVerification()
# create_backup returns the checksum, so restore_test is called with the backup ID.
bv.create_backup("customer_data_v42", "backup-2026-06-23")
bv.restore_test("backup-2026-06-23")

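Expected output (the checksum prefix depends on the SHA-256 digest of the data, so a placeholder is shown):

Backup backup-2026-06-23: checksum=<first 16 hex characters>...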
Backup backup-2026-06-23 not verified. Running verification...
Backup backup-2026-06-23: VERIFIED (checksums match)
Restore test: backup-2026-06-23 — PASSED
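The class above keeps everything in memory, so its restore test never actually restores anything. A closer approximation, sketched here with only the standard library, writes a backup file, copies it into a scratch directory as a stand-in for a restore, and compares SHA-256 digests. The helper names (`sha256_of_file`, `restore_and_verify`) and the file layout are assumptions for illustration; a real restore test would load the data into a database or service and run application-level checks.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_and_verify(backup_path: Path, expected_checksum: str) -> bool:
    """Copy the backup into a scratch directory and compare checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / backup_path.name
        shutil.copy2(backup_path, restored)  # stand-in for a real restore
        return sha256_of_file(restored) == expected_checksum


with tempfile.TemporaryDirectory() as workdir:
    backup = Path(workdir) / "backup-2026-06-23.bin"
    backup.write_bytes(b"customer_data_v42")
    recorded = sha256_of_file(backup)  # would be stored alongside the backup

    print("Clean backup restores:", restore_and_verify(backup, recorded))

    backup.write_bytes(b"customer_data_v4X")  # simulate silent corruption
    print("Corrupted backup restores:", restore_and_verify(backup, recorded))
```

The checksum is recorded when the backup is created and compared after the restore, so corruption introduced at rest or in transit shows up as a failed test.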

Consistency Models

Model	Guarantee	Performance	Example
Strict	Reads see most recent write	Lowest	Single-node DB
Strong	Reads see writes after acknowledgment	Moderate	MySQL with sync replication
Eventual	Reads eventually see writes	Highest	DNS, CDN caches
Read-your-writes	Client reads own writes	High	Session consistency

Consistency Test

def test_read_your_writes(write_value):
    written = [write_value]
    read_back = written[-1]
    print(f"Wrote: '{write_value}'")
    print(f"Read:  '{read_back}'")
    print(f"Consistent: {write_value == read_back}")

test_read_your_writes("updated_file.txt")

Expected output:

Wrote: 'updated_file.txt'
Read:  'updated_file.txt'
Consistent: True
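The test above reads from the same list it wrote to, so it always passes. Read-your-writes only becomes interesting when reads can hit a replica that lags the primary. The toy model below (the class and method names are invented for this sketch) sends a client's read to the primary until the replica has caught up to that client's last write:

```python
class LaggingStore:
    """Toy primary/replica pair where the replica applies writes on demand."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.primary_version = 0
        self.replica_version = 0
        self.pending = []  # writes not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.primary_version += 1
        self.pending.append((key, value))
        return self.primary_version  # the client remembers this version

    def replicate(self):
        """Apply all pending writes to the replica."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()
        self.replica_version = self.primary_version

    def read_eventual(self, key):
        return self.replica.get(key)

    def read_your_writes(self, key, last_written_version):
        # Fall back to the primary if the replica is behind this client's write.
        if self.replica_version >= last_written_version:
            return self.replica.get(key)
        return self.primary.get(key)


store = LaggingStore()
version = store.write("file", "updated_file.txt")
print("Eventual read before replication:", store.read_eventual("file"))
print("Read-your-writes before replication:", store.read_your_writes("file", version))
store.replicate()
print("Eventual read after replication:", store.read_eventual("file"))
```

Before replication the eventual read returns `None` while the read-your-writes path still returns the new value, which is the user-visible difference between the two models.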

Data Integrity Monitoring

Data integrity monitoring detects corruption, drift, or unexpected changes in your data. It is a critical practice that many SRE teams neglect until data loss occurs.

Checksum-Based Integrity Checks

Periodically compute checksums of your data and compare them against known-good values. This detects silent corruption that backups and replication alone cannot catch.

import hashlib
import json

class DataIntegrityMonitor:
    def __init__(self):
        self.checksums = {}

    def register_dataset(self, name, data):
        checksum = hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()
        self.checksums[name] = checksum
        print(f"Registered {name}: checksum={checksum[:16]}...")

    def verify(self, name, data):
        if name not in self.checksums:
            print(f"UNKNOWN: {name} — no baseline checksum")
            return False
        expected = self.checksums[name]
        actual = hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()
        if expected == actual:
            print(f"PASS: {name} — integrity verified")
            return True
        else:
            print(f"FAIL: {name} — DATA CORRUPTION DETECTED")
            print(f"  Expected: {expected[:16]}...")
            print(f"  Actual:   {actual[:16]}...")
            return False

monitor = DataIntegrityMonitor()
monitor.register_dataset("user-bookmarks", {"user1": ["url1", "url2"], "user2": ["url3"]})
monitor.verify("user-bookmarks", {"user1": ["url1", "url2"], "user2": ["url3"]})
monitor.verify("user-bookmarks", {"user1": ["url1", "url2"], "user2": ["url4"]})

Expected output (checksum prefixes are shown as placeholders because the hex digits depend on the SHA-256 digest):

Registered user-bookmarks: checksum=<expected prefix>...
PASS: user-bookmarks — integrity verified
FAIL: user-bookmarks — DATA CORRUPTION DETECTED
  Expected: <expected prefix>...
  Actual:   <actual prefix>...

Replication Lag Monitoring

For asynchronous replication, you must monitor replication lag. If lag grows beyond acceptable thresholds, it indicates a problem that could lead to data loss during a failover.

Replication Type	Acceptable Lag	Alert Threshold
Synchronous	0 ms	Not applicable
Cross-datacenter async	Under 1 second	5 seconds
Cross-region async	Under 5 seconds	30 seconds
Cross-continent async	Under 30 seconds	2 minutes
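A check against the table above might look like the following sketch. The threshold values come from the table; the dictionary keys, the `classify_lag` function, and the intermediate "warn" status are assumptions made for this example. How lag is actually measured depends on your database.

```python
# (acceptable_lag_seconds, alert_threshold_seconds) per replication type,
# taken from the table above. Synchronous replication is omitted because
# it has no alert threshold.
LAG_THRESHOLDS = {
    "cross-datacenter": (1.0, 5.0),
    "cross-region": (5.0, 30.0),
    "cross-continent": (30.0, 120.0),
}


def classify_lag(replication_type: str, lag_seconds: float) -> str:
    """Return 'ok', 'warn', or 'alert' for a measured replication lag."""
    acceptable, alert = LAG_THRESHOLDS[replication_type]
    if lag_seconds < acceptable:
        return "ok"
    if lag_seconds < alert:
        return "warn"
    return "alert"


samples = [("cross-datacenter", 0.4), ("cross-region", 12.0), ("cross-continent", 180.0)]
for kind, lag in samples:
    print(f"{kind}: lag={lag}s -> {classify_lag(kind, lag)}")
```

The three samples classify as ok, warn, and alert respectively. The lag at the moment of failover is also an upper bound on how much recently written data could be lost.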

Point-in-Time Recovery

Point-in-Time Recovery (PITR) allows restoring a database to any point within a retention window, not just the last full backup. This is essential for recovering from accidental data deletion or corruption.

Most managed database services support PITR with configurable retention periods. A typical configuration is 7 to 35 days of PITR retention. The trade-off is storage cost against recovery granularity.
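Conceptually, PITR combines a base backup with an ordered log of later changes and replays the log up to the chosen moment. The sketch below models that with plain dictionaries and timestamps. It is a simplified illustration, not how any specific database implements PITR, and every name in it is hypothetical.

```python
from datetime import datetime


def restore_to_point(base_snapshot, change_log, target_time):
    """Replay logged changes onto a copy of the snapshot up to target_time.

    change_log is a list of (timestamp, key, value) tuples sorted by time;
    a value of None represents a delete.
    """
    state = dict(base_snapshot)
    for timestamp, key, value in change_log:
        if timestamp > target_time:
            break  # stop before changes made after the recovery point
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


snapshot = {"alice": "v1", "bob": "v1"}
log = [
    (datetime(2026, 6, 23, 9, 0), "alice", "v2"),
    (datetime(2026, 6, 23, 10, 0), "bob", None),  # accidental delete
    (datetime(2026, 6, 23, 11, 0), "carol", "v1"),
]

# Recover to just before the accidental delete at 10:00.
recovered = restore_to_point(snapshot, log, datetime(2026, 6, 23, 9, 59))
print(recovered)
```

Recovering to 09:59 keeps the 09:00 update to `alice` and still contains `bob`, because the delete at 10:00 is never replayed. The retention window in a real system is bounded by how long the change log is kept.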

Data Lifecycle Management

Not all data needs the same level of protection. Define data tiers based on criticality:

Tier	Examples	Backup Frequency	Retention	Replication
Critical	User data, transactions	Continuous	7 years	Synchronous
Important	Application logs, analytics	Hourly	90 days	Asynchronous
Ephemeral	Cache data, session state	None	None	None
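One way to make the tiers enforceable is to encode them as data that tooling can look up. The sketch below mirrors the table; the `ProtectionPolicy` dataclass, its field names, and the `policy_for` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProtectionPolicy:
    backup_frequency: Optional[str]  # None means no backups
    retention_days: Optional[int]    # None means nothing is retained
    replication: Optional[str]       # None means no replication


# Values mirror the tier table; 7 years is approximated as 7 * 365 days.
POLICIES = {
    "critical": ProtectionPolicy("continuous", 7 * 365, "synchronous"),
    "important": ProtectionPolicy("hourly", 90, "asynchronous"),
    "ephemeral": ProtectionPolicy(None, None, None),
}


def policy_for(tier: str) -> ProtectionPolicy:
    """Look up a tier's policy, failing loudly on unknown tiers."""
    try:
        return POLICIES[tier.lower()]
    except KeyError:
        raise ValueError(f"Unknown data tier: {tier!r}") from None


print(policy_for("Critical"))
print(policy_for("ephemeral").backup_frequency is None)
```

Raising on an unknown tier is a deliberate choice in this sketch: a dataset with no assigned tier should be flagged rather than silently left unprotected.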

Common Errors

Error	Explanation
No backup verification	A backup that has never been restored is not a backup. Always verify.
Single region replication	A regional disaster destroys both primary and replicas. Replicate across regions.
Ignoring consistency model	Choosing eventual consistency for a service that needs read-your-writes causes user-visible bugs.
No checksum validation	Silent data corruption happens. Checksums detect it.
Same backup location	Storing backups in the same data center as production defeats the purpose.
No backup for configuration	Infrastructure-as-code state files and database configuration should also be backed up.

Practice Questions

What is the difference between durability and availability?
Why is synchronous replication safer than asynchronous replication?
What does it mean when we say a backup must be verified?
When would you choose eventual consistency over strong consistency?
Why should backups be stored in a different geographical region?

Challenge

Design a data reliability strategy for the Doda Browser sync service. The service stores user bookmarks, browsing history, and preferences. Define the replication model, backup schedule, consistency requirements, and data integrity verification process. Calculate the expected durability and explain how you would test the strategy.

FAQ

What is data durability?

Data durability is the guarantee that once data is written, it will not be lost. It is typically measured in nines, like 99.999999999 percent.

What is the difference between synchronous and asynchronous replication?

Synchronous replication waits for all replicas to confirm a write before acknowledging. Asynchronous replication acknowledges immediately and updates replicas later.

What is checksum verification?

Checksum verification compares a hash of the backed-up data against the original hash to detect corruption or alteration.

How often should backups be tested?

Test full restores quarterly and incremental restores monthly. Automated verification should run after every backup.

What consistency model should I use?

Use strong consistency for transactional data, eventual consistency for high-volume read-heavy data, and read-your-writes consistency for user-facing applications.
