Network Chaos Testing — Latency, Packet Loss & Bandwidth Limits

DodaTech Updated 2026-06-23 7 min read

Network chaos engineering tests how distributed systems behave when the network degrades: latency spikes, packet loss, bandwidth constraints, and DNS failures. These are among the most common failure modes in cloud environments and the hardest to predict without active testing.

What You Will Learn

This tutorial shows how to inject five kinds of network fault (latency, packet loss, bandwidth throttling, DNS manipulation, and asymmetric partitions) using tc (traffic control), iptables, and Chaos Mesh.

Why It Matters

Network failures are invisible until they cause an outage. A 200ms latency increase between services may go unnoticed for weeks until a traffic spike pushes request timeouts over the edge. Proactive network chaos testing reveals timeouts, retry storms, and connection pool leaks before they trigger cascading failures.
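
The arithmetic behind that failure mode can be sketched in a few lines. The figures used here (a 1000 ms client timeout and three sequential hops at 250 ms each) are assumptions chosen for illustration, not measurements from a real system.

```python
def remaining_budget_ms(timeout_ms, hop_latencies_ms, added_latency_ms=0):
    """Return the timeout headroom left after a chain of sequential calls."""
    total = sum(latency + added_latency_ms for latency in hop_latencies_ms)
    return timeout_ms - total


hops = [250, 250, 250]  # hypothetical baseline latency per hop, in ms

print(remaining_budget_ms(1000, hops))       # 250: the chain fits in the timeout
print(remaining_budget_ms(1000, hops, 200))  # -350: every request now times out
```

A negative result means the request cannot complete inside the timeout even before queueing or retries are considered.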

Real-World Use

DodaTech runs a monthly "network degradation day" where each microservice team must demonstrate that their service can operate with 500ms added latency to its three most critical dependencies. This exercise has uncovered 12 timeout-related bugs in the Doda Browser backend services.

Prerequisites

Before starting you should understand:

  • Chaos Engineering fundamentals
  • Basic TCP and DNS concepts
  • Linux networking tools (tc, iptables)
  • Chaos Mesh for Kubernetes network faults

Step 1: Inject Network Latency with tc

Use Linux traffic control to add latency to a network interface:

# Add 300ms latency to the eth0 interface
sudo tc qdisc add dev eth0 root netem delay 300ms

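# Verify the latency against a host reached through eth0
# (localhost uses the loopback interface, so the eth0 rule does not affect it)
ping -c 4 google.com
# Expected output (values shown assume a baseline RTT close to 0 ms;
# in practice the result is your normal RTT plus about 300 ms):
# rtt min/avg/max/mdev = 300.123/301.456/302.789/0.987 ms

# Add jitter (variation) to make it realistic
sudo tc qdisc change dev eth0 root netem delay 300ms 50ms distribution normal

# Test with jitter
ping -c 4 google.com
# Expected output:
# rtt min/avg/max/mdev = 275.234/301.567/335.890/12.345 ms
# (Most samples fall between 250ms and 350ms; with a normal distribution,
# occasional samples land outside that range)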

# Remove the latency
sudo tc qdisc del dev eth0 root
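
Because a forgotten qdisc keeps degrading traffic until it is deleted, it can help to wrap the add and delete commands so cleanup always runs. The following is a minimal sketch, not an established tool: `netem_commands` and `netem_delay` are hypothetical helper names, and the `run` parameter exists only so the commands can be swapped for a fake during testing. Running it for real requires root privileges.

```python
import subprocess
from contextlib import contextmanager


def netem_commands(dev, delay_ms, jitter_ms=0):
    """Build the tc commands that add and remove a netem delay on dev."""
    add = ["tc", "qdisc", "add", "dev", dev, "root", "netem", "delay", f"{delay_ms}ms"]
    if jitter_ms:
        add += [f"{jitter_ms}ms", "distribution", "normal"]
    remove = ["tc", "qdisc", "del", "dev", dev, "root"]
    return add, remove


@contextmanager
def netem_delay(dev, delay_ms, jitter_ms=0, run=subprocess.check_call):
    """Apply the delay, then always remove it, even if the body raises."""
    add, remove = netem_commands(dev, delay_ms, jitter_ms)
    run(add)
    try:
        yield
    finally:
        run(remove)
```

A test would then be written as `with netem_delay("eth0", 300, 50): ...`, and the delay is removed when the block exits, whether it finishes normally or raises.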

Step 2: Simulate Packet Loss

Packet loss causes TCP retransmissions and application-level retries:

# Add 5% packet loss to eth0
sudo tc qdisc add dev eth0 root netem loss 5%

# Test packet loss with ping
ping -c 20 google.com
# Expected output (loss is random, so the exact count varies between runs):
# 20 packets transmitted, 19 received, 5% packet loss

# Add correlated packet loss (bursts of loss)
sudo tc qdisc change dev eth0 root netem loss 5% 25%

# The 25% correlation means loss happens in bursts rather than randomly
# This simulates a flapping network interface more realistically

# Remove packet loss
sudo tc qdisc del dev eth0 root
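
A short calculation shows why even 5% loss is disruptive: the more packets a request needs, the more likely it is that at least one of them is dropped and must be retransmitted. This sketch assumes each packet is lost independently, which is a simplification.

```python
def p_any_loss(loss_rate, packets):
    """Probability that at least one of `packets` independent packets is lost."""
    return 1 - (1 - loss_rate) ** packets


for n in (1, 10, 50):
    print(f"{n:>3} packets: {p_any_loss(0.05, n):.1%} chance of at least one loss")
```

With a 5% loss rate, a 10-packet exchange hits at least one loss about 40% of the time, and a 50-packet exchange about 92% of the time.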
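
With the loss rule still active (run this before removing it), measure how the application responds:

#!/usr/bin/env python3
"""Measure application behavior under packet loss."""
import statistics
import time

import requests

# The endpoint must be reached through the impaired interface;
# a localhost URL would use loopback and bypass the eth0 rule.
ENDPOINT = "http://test-server:8080/api/health"
NUM_REQUESTS = 50
timeouts = 0
errors = 0
latencies = []

for i in range(NUM_REQUESTS):
    try:
        # perf_counter is monotonic, so clock adjustments cannot skew timings
        start = time.perf_counter()
        response = requests.get(ENDPOINT, timeout=5)
        elapsed = (time.perf_counter() - start) * 1000
        latencies.append(elapsed)
        print(f"Request {i+1}: {response.status_code} in {elapsed:.0f}ms")
    except requests.Timeout:
        timeouts += 1
        print(f"Request {i+1}: TIMEOUT")
    except requests.ConnectionError as exc:
        # Resets and refused connections would otherwise crash the run
        errors += 1
        print(f"Request {i+1}: CONNECTION ERROR ({exc})")

print("\nResults with 5% packet loss:")
print(f"Successful requests: {len(latencies)}")
print(f"Timeouts: {timeouts}")
if errors:
    print(f"Connection errors: {errors}")
if latencies:
    print(f"Average latency: {statistics.mean(latencies):.0f}ms")
    print(f"Max latency: {max(latencies):.0f}ms")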

# Expected output (with 5% packet loss):
# Request 1: 200 in 45ms
# Request 2: 200 in 1203ms
# Request 3: 200 in 52ms
# Request 4: TIMEOUT
# ...
# Successful requests: 47
# Timeouts: 3
# Average latency: 234ms
# Max latency: 5120ms
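
Under packet loss, most requests stay fast and a few become very slow, so the average hides the problem. A sketch of a percentile summary using the standard library is shown here; the sample values are hypothetical and only mimic the shape of the output above.

```python
import statistics


def summarize(latencies_ms):
    """Return mean, median, p95, and max for latencies in milliseconds."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.mean(latencies_ms),
        "p50": statistics.median(latencies_ms),
        "p95": cuts[94],  # cuts[i] is the (i + 1)th percentile
        "max": max(latencies_ms),
    }


# Hypothetical sample: 18 fast requests and 2 delayed by retransmissions
sample = [50] * 18 + [1200, 5100]
print(summarize(sample))
```

For this sample the median stays at 50 ms while the 95th percentile is well over a second, which is the pattern to look for when judging whether timeouts and retries are tuned sensibly.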

Step 3: Throttle Bandwidth

Limit bandwidth to simulate network congestion:

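# Limit bandwidth to 1Mbps on eth0
# "default 10" sends unclassified traffic to class 1:10; pointing the default
# at a class that does not exist would leave that traffic unshaped
sudo tc qdisc add dev eth0 root handle 1: htb default 10
sudo tc class add dev eth0 parent 1: classid 1:1 htb rate 1mbit
sudo tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1mbit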

# Test bandwidth
curl -o /dev/null -w "Speed: %{speed_download} bytes/sec\n" http://test-server/large-file.iso
# Expected output:
# Speed: 125000 bytes/sec
# (Approximately 1 Mbps)

# Remove bandwidth limit
sudo tc qdisc del dev eth0 root
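
The expected speed above follows from a unit conversion: 1 Mbit/s is 1,000,000 bits per second, or 125,000 bytes per second. A small sketch of that conversion, plus the ideal transfer time it implies, is shown here. Protocol overhead is ignored, so real transfers take somewhat longer.

```python
def mbit_to_bytes_per_sec(mbit):
    """Convert megabits per second (10**6 bits) to bytes per second."""
    return mbit * 1_000_000 / 8


def transfer_seconds(size_bytes, mbit):
    """Ideal transfer time at the given rate, ignoring protocol overhead."""
    return size_bytes / mbit_to_bytes_per_sec(mbit)


print(mbit_to_bytes_per_sec(1))                    # 125000.0
print(round(transfer_seconds(10 * 1024**2, 1), 1))  # 83.9 seconds for a 10 MiB file
```

Estimating the transfer time in advance helps choose a limit that slows the service down without pushing every request past its timeout.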
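
In Kubernetes, Chaos Mesh can apply the same limit with a NetworkChaos resource:

# bandwidth-chaos-mesh.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bandwidth-limit
spec:
  action: bandwidth
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: file-upload-service
  bandwidth:
    # tc-style units: mbit is megabits per second, mbps is megabytes per second
    rate: "1mbit"
    limit: 20000
    buffer: 10000
  # "to" shapes traffic leaving the selected pods; from/both also need a target selector
  direction: to
  duration: 120s

kubectl apply -f bandwidth-chaos-mesh.yaml
# Expected output:
# networkchaos.chaos-mesh.org/bandwidth-limit created

# kubectl exec takes a pod name, not a label selector, so look one up first
POD=$(kubectl get pods -n production -l app=file-upload-service -o jsonpath='{.items[0].metadata.name}')

# Verify the limit by sending data out of the pod, which is the shaped direction.
# Assumes curl exists in the image; /upload is a placeholder for any endpoint that accepts a PUT.
kubectl exec -n production "$POD" -- sh -c 'head -c 2000000 /dev/zero | curl -s -o /dev/null -T - -w "Speed: %{speed_upload} bytes/sec\n" http://storage-service:8080/upload'
# Expected output (roughly):
# Speed: 125000 bytes/sec
# (Approximately 1 Mbps)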

Step 4: Manipulate DNS Responses

DNS chaos tests how services handle resolution failures and wrong addresses:

# dns-manipulation.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-manipulation
spec:
  action: random
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: web-service
  # Chaos Mesh accepts the * wildcard only at the end of a pattern
  patterns:
    - "payment-provider.com"
    - "email-service.com"
  duration: 60s

kubectl apply -f dns-manipulation.yaml
# Expected output:
# dnschaos.chaos-mesh.org/dns-manipulation created

# kubectl exec takes a pod name, not a label selector, so look one up first
POD=$(kubectl get pods -n staging -l app=web-service -o jsonpath='{.items[0].metadata.name}')

# From inside the affected pod
kubectl exec -n staging "$POD" -- nslookup payment-provider.com
# Expected output (DNSChaos returns random IP):
# Name:   payment-provider.com
# Address: 203.0.113.42
# (This is not the real payment provider IP)

# Confirm that requests to the provider now fail
# (--max-time keeps curl from hanging on an unreachable address)
kubectl exec -n staging "$POD" -- curl -s -o /dev/null -w "%{http_code}" --max-time 10 https://payment-provider.com/api/charge
# Expected output:
# 000
# (Connection failed because the DNS resolved to a wrong IP)
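
On the application side, one way to detect a wrong DNS answer is to compare resolved addresses against an allow-list before connecting. This is a minimal sketch under stated assumptions: `unexpected_addresses` is a hypothetical helper, the allow-listed address is a documentation-range placeholder, and the `resolve` parameter is there so a fake resolver can stand in for real DNS.

```python
import socket


def unexpected_addresses(hostname, allowed_ips, resolve=socket.gethostbyname_ex):
    """Return resolved IPv4 addresses for hostname that are not in allowed_ips."""
    _name, _aliases, addresses = resolve(hostname)
    return sorted(set(addresses) - set(allowed_ips))


def fake_resolver(hostname):
    """Stand-in for DNS that mimics the random answer shown above."""
    return (hostname, [], ["203.0.113.42"])


ALLOWED = {"198.51.100.10"}  # hypothetical known-good provider address
print(unexpected_addresses("payment-provider.com", ALLOWED, resolve=fake_resolver))
# ['203.0.113.42']
```

An empty list means every resolved address was expected. Pinning addresses is only practical for providers with stable IPs, so treat this as one possible check rather than a general recommendation.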

Step 5: Test Asymmetric Network Conditions

Real networks often have asymmetric failures where one direction works but the reverse does not:

# Block incoming traffic on port 8080 but allow outgoing
sudo iptables -A INPUT -p tcp --dport 8080 -j DROP

# From the target machine, check if it can reach services
curl -s -o /dev/null -w "%{http_code}" http://other-service:8080/health
# Expected output:
# 200
# (Outgoing connections work)

# From another machine, check if it can reach the target
# (--max-time stops curl from hanging while its packets are silently dropped)
curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://target-machine:8080/health
# Expected output:
# 000
# (Incoming connections are blocked)

# Remove the iptables rule
sudo iptables -D INPUT -p tcp --dport 8080 -j DROP
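
The two curl checks above can also be scripted so the same probe runs from each side of the partition. The sketch below uses only the standard library; `can_connect` is a hypothetical helper name, and the host names in the usage note come from the example above.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (dropped packets), refusals, and resolution failures
        return False
```

Running `can_connect("other-service", 8080)` on the target machine and `can_connect("target-machine", 8080)` on a second machine should return True and False respectively while the iptables rule is active, and True on both once it is removed.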

Learning Path

flowchart LR
  A[Network Partitioning] --> B[Network Chaos Testing]
  B --> C[Database Chaos]
  C --> D[Kubernetes Chaos Testing]
  D --> E[Chaos Observability]
  style B fill:#f90,color:#fff

Common Errors

  1. Applying tc rules to the wrong interface: On multi-homed systems, applying latency to the wrong interface may have no effect or affect the wrong traffic. Double-check interface names.
  2. DNS chaos patterns that are too broad: A pattern that matches in-cluster names such as kubernetes.default.svc.cluster.local breaks service discovery for every selected pod. Scope patterns to external services only.
  3. Bandwidth limits that make the service completely unusable: A 1Kbps limit may cause connections to time out rather than slow down. Start with realistic limits and adjust.
  4. Forgetting to clean up tc qdisc and iptables rules: Network faults persist until explicitly removed. Automate cleanup with experiment wrappers.
  5. Asymmetric partition tests that block Kubernetes health checks: If you block traffic on the health check port, kubelet probes fail. A failed liveness probe makes kubelet restart the container, and a failed readiness probe removes the pod from Service endpoints.
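
The second error in this list can be caught before an experiment is applied by checking each pattern against names that must keep resolving. This is a rough sketch: it uses Python's shell-style `fnmatch` matching as an approximation, which may not match Chaos Mesh's own pattern rules exactly, and the protected names are a hypothetical starting list.

```python
from fnmatch import fnmatchcase

# Hypothetical list of names that chaos patterns must never match
PROTECTED_NAMES = [
    "kubernetes.default.svc.cluster.local",
    "kube-dns.kube-system.svc.cluster.local",
]


def risky_patterns(patterns, protected=PROTECTED_NAMES):
    """Return (pattern, name) pairs where a pattern matches a protected name."""
    return [(p, name) for p in patterns for name in protected if fnmatchcase(name, p)]


print(risky_patterns(["payment-provider.com", "*"]))
```

An empty result means none of the patterns touch the protected names. Here the bare `*` pattern is reported once for each protected name, while the external domain is not reported at all.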

Practice Questions

  1. How does the tc netem tool add latency and packet loss to a network interface?
  2. What is the difference between random and correlated packet loss?
  3. How do you limit network bandwidth using tc?
  4. What does DNSChaos action: random do in Chaos Mesh?
  5. Why is asymmetric network testing important for distributed systems?

Challenge

Create a network chaos experiment suite that tests a three-service architecture (web, API, database) under four network conditions: 500ms latency between web and API, 10% packet loss between API and database, a 5Mbps bandwidth limit on the web service egress, and DNS errors for the external payment provider. Verify that each service degrades gracefully and recovers fully when the network faults are removed.

FAQ

What is network chaos testing?

Network chaos testing injects latency, packet loss, bandwidth limits, DNS failures, and asymmetric network conditions to validate that distributed systems handle network degradation gracefully.

How do you inject network latency in Linux?

Use the tc (traffic control) tool with the netem module: tc qdisc add dev eth0 root netem delay 300ms.

What is packet loss correlation?

Correlation means packet loss happens in bursts rather than randomly. A 5% loss with 25% correlation means loss events cluster together, simulating a flapping interface.
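
The effect of correlation can be illustrated with a small simulation. The model below is a deliberate simplification and not netem's actual algorithm: with probability `correlation` a packet repeats the previous packet's fate, and otherwise it is lost with probability `loss_rate`. The function names are hypothetical.

```python
import random


def simulate_loss(n, loss_rate, correlation, seed=0):
    """Simplified model: repeat the previous outcome with probability
    `correlation`, otherwise draw a fresh outcome at `loss_rate`."""
    rng = random.Random(seed)
    lost, outcomes = False, []
    for _ in range(n):
        if rng.random() >= correlation:
            lost = rng.random() < loss_rate
        outcomes.append(lost)
    return outcomes


def mean_burst_length(outcomes):
    """Average length of runs of consecutive lost packets."""
    bursts, run = [], 0
    for lost in outcomes:
        if lost:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:
        bursts.append(run)
    return sum(bursts) / len(bursts) if bursts else 0.0


for corr in (0.0, 0.25):
    outcomes = simulate_loss(200_000, 0.05, corr)
    rate = sum(outcomes) / len(outcomes)
    print(f"correlation={corr}: loss rate {rate:.3f}, mean burst {mean_burst_length(outcomes):.2f}")
```

In this model the overall loss rate stays near 5% in both runs, but the average burst grows from roughly 1.05 packets to roughly 1.4 packets once correlation is added, which is what stresses retry logic.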

How does DNS chaos work in Chaos Mesh?

DNSChaos intercepts DNS queries from the selected pods and, for domains that match the configured patterns, returns either an error (action: error) or a random IP address (action: random).

What is asymmetric network failure?

Asymmetric failure means traffic flows in one direction but not the other. For example, service A can open connections to service B, but connections from service B to service A are dropped.
