Infrastructure Faults — CPU, Memory, Disk & Node Failures
This tutorial covers infrastructure faults: the key concepts, practical examples, and best practices for applying them.
Infrastructure faults in Chaos Engineering target the underlying hardware and operating system resources: CPU, memory, disk, and the host itself. These faults simulate what happens when a server is overwhelmed by traffic, a disk fills up with logs, or a node fails entirely.
What You Will Learn
This tutorial teaches you how to inject CPU stress, memory pressure, disk space exhaustion, and node failures using standard Linux tools and Chaos Engineering platforms.
Why It Matters
Infrastructure failures are inevitable in any production environment. CPU throttling, memory swapping, and disk-full conditions are among the most common real-world incidents. Proactively testing these scenarios ensures your application degrades gracefully instead of crashing or corrupting data.
Real-World Use
DodaTech runs weekly CPU stress tests on Durga Antivirus Pro scanning nodes. The test verifies that the antivirus process yields CPU to other system processes when CPU usage exceeds 90 percent, preventing the scanner from starving critical system services.
Prerequisites
Before starting you should understand:
- Chaos Engineering fundamentals (hypothesis, blast radius)
- Linux system administration basics (CPU, memory, disk commands)
- Kubernetes node and pod concepts
Step 1: Inject CPU Stress
Use the stress tool to saturate CPU cores:
# Install stress tool
sudo apt-get install -y stress
# Saturate 2 CPU cores for 60 seconds
stress --cpu 2 --timeout 60
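# In another terminal, observe CPU usage
# (top prints one combined %Cpu(s) line by default; press 1 in interactive
# top, or run `mpstat -P ALL 1 1` from sysstat, for the per-core view)
top -bn1 | head -10
# Illustrative per-core view on a 4-core host (the scheduler picks the busy cores):
# %Cpu0 :100.0 us
# %Cpu1 :100.0 us
# %Cpu2 :  0.0 us
# %Cpu3 :  0.0 us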
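Saturating every core can leave no CPU for monitoring or recovery, so it may help to cap the worker count before launching stress. The sketch below is one way to do that. The function name `safe_worker_count` and the leave-one-core-free policy are assumptions made for illustration, not part of the stress tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: cap the number of CPU stress workers so that at
# least one core stays free for the shell, monitoring, and recovery.
# Usage: safe_worker_count <total_cores> <requested_workers>
safe_worker_count() {
    local total="$1" requested="$2" max
    max=$(( total - 1 ))
    if [ "$max" -lt 1 ]; then
        # Single-core host: no safe worker count exists.
        echo 0
        return 1
    fi
    if [ "$requested" -gt "$max" ]; then
        echo "$max"
    else
        echo "$requested"
    fi
}
```

A possible invocation is `workers=$(safe_worker_count "$(nproc)" 2) && stress --cpu "$workers" --timeout 60`, which refuses to start the load on a single-core host.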
Step 2: Simulate Memory Pressure
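Consume memory to create pressure. Swapping and OOM kills occur only when the allocation exceeds the memory the host has available, so size --vm-bytes to the machine under test:
# Allocate 512MB of memory with stress
stress --vm 1 --vm-bytes 512M --timeout 30
# In another terminal, monitor memory usage while stress is running
free -h
# Illustrative output on a host that was already close to its memory limit:
#        total  used  free
# Mem:    7.6G  5.8G  1.8G
# Swap:   2.0G  0.5G  1.5G
# Non-zero swap usage indicates that memory pressure has triggered swapping
# Check whether the OOM killer has run (dmesg may require sudo)
dmesg | grep -iE 'oom|out of memory' | tail -5
# Illustrative output, present only if a process was killed:
# [12345.678] oom-kill: constraint=CONSTRAINT_NONE ...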
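A fixed 512M allocation may be too small to create pressure on a large host and too large for a small one. One hedged alternative is to derive the size from the `MemAvailable` field of `/proc/meminfo`. The helper name `vm_megabytes_for_percent` is hypothetical, and the optional second argument exists only so the function can be tried against a sample file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of megabytes equal to <percent>
# of MemAvailable, read from a meminfo-style file (default /proc/meminfo).
# Usage: vm_megabytes_for_percent <percent> [meminfo_file]
vm_megabytes_for_percent() {
    local percent="$1" file="${2:-/proc/meminfo}"
    # MemAvailable is reported in kB; scale by the percentage, then convert to MB.
    awk -v pct="$percent" '
        $1 == "MemAvailable:" { printf "%d\n", $2 * pct / 100 / 1024; found = 1 }
        END { if (!found) exit 1 }
    ' "$file"
}
```

It could then feed stress directly, for example `stress --vm 1 --vm-bytes "$(vm_megabytes_for_percent 80)M" --timeout 30`. The function fails with a non-zero status when the field is missing, so the command substitution should be checked before use on unfamiliar kernels.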
Step 3: Fill Disk Space
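Simulate a full disk scenario. The example writes to /tmp, which is assumed here to be on the root filesystem; where /tmp is a tmpfs, the file consumes RAM instead, so point of= at a path on the disk under test:
# Create a large file to consume disk space
dd if=/dev/zero of=/tmp/fill-disk bs=1M count=1024
# Expected output:
# 1024+0 records in
# 1024+0 records out
# 1073741824 bytes (1.1 GB) copied
# Check disk usage
df -h /
# Illustrative output:
# Filesystem  Size  Used  Avail  Use%  Mounted on
# /dev/sda1    20G   19G   1.0G   95%  /
# An application trying to write to the disk
touch /tmp/test-file
# Output once the filesystem has no free space left (Avail at 0):
# touch: cannot touch '/tmp/test-file': No space left on device
# Clean up
rm -f /tmp/fill-disk /tmp/test-file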
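To avoid filling a filesystem completely, the size of the fill file can be computed from the free space minus a reserve, such as the 1GB recommended under Common Errors. The following is a minimal sketch under that assumption. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for sizing a disk-fill file.

# Print the free space, in whole megabytes, of the filesystem holding <path>.
avail_megabytes() {
    # df -P prints one stable line per filesystem; -k reports 1024-byte blocks.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 }'
}

# Print how many megabytes to write so that <reserve_mb> stays free.
# Prints 0 when the free space is already at or below the reserve.
# Usage: fill_megabytes <avail_mb> <reserve_mb>
fill_megabytes() {
    local avail_mb="$1" reserve_mb="$2"
    if [ "$avail_mb" -le "$reserve_mb" ]; then
        echo 0
    else
        echo $(( avail_mb - reserve_mb ))
    fi
}
```

For example, `dd if=/dev/zero of=/tmp/fill-disk bs=1M count="$(fill_megabytes "$(avail_megabytes /tmp)" 1024)"` would write until roughly 1GB remains, assuming nothing else is writing to the same filesystem at the time.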
Step 4: Simulate Disk I/O Throttling
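Use dd to generate disk I/O contention and iostat to observe it:
# Create heavy disk I/O load (two concurrent writers)
dd if=/dev/zero of=/tmp/io-test bs=1M count=4096 &
dd if=/dev/zero of=/tmp/io-test2 bs=1M count=4096 &
# Monitor device utilization and I/O latency (iostat is part of the sysstat package)
iostat -x 2 3
# Illustrative output (columns abbreviated; newer sysstat versions report r_await and w_await instead of await):
# Device   r/s    w/s  rkB/s   wkB/s  await  %util
# sda     12.3  456.7   98.5  4567.8   12.3   99.5
# Disk utilization at 99.5% indicates saturation
# Wait for the writers to finish, then remove the test files
wait
rm -f /tmp/io-test /tmp/io-test2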
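Reading %util by eye works for a single run, but a scripted experiment may want to flag saturated devices automatically. The sketch below filters `iostat -x` output by a threshold. The function name `saturated_devices` is hypothetical, and the %util column is located by its header name on the assumption that its position differs between sysstat versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `iostat -x` output on stdin and print every
# device whose %util is at or above <threshold>.
# Usage: iostat -x 2 3 | saturated_devices <threshold>
saturated_devices() {
    awk -v limit="$1" '
        # A blank line ends a report; forget the column until the next header.
        NF == 0 { col = 0; next }
        # Header line: remember which field holds %util.
        $1 == "Device" || $1 == "Device:" {
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # Device line: print the name and %util when over the limit.
        col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }
    '
}
```

Run as `iostat -x 2 3 | saturated_devices 90`, it prints one line per saturated device per report. The first iostat report covers the time since boot, so later reports are the ones that reflect the injected load.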
Step 5: Use Chaos Mesh for Infrastructure Faults
Use a Chaos Mesh StressChaos resource for managed infrastructure fault injection:
# infrastructure-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-test
spec:
  mode: one
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: payment-service
  stressors:
    cpu:
      workers: 2
      load: 100
  duration: 120s
kubectl apply -f infrastructure-stress.yaml
# Expected output:
# stresschaos.chaos-mesh.org/cpu-stress-test created
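Once the experiment is created, it is useful to be able to inspect it and end it early. The wrapper below is a sketch that uses standard `kubectl describe` and `kubectl delete` calls on the StressChaos resource. The function name, the `KUBECTL` override variable, and the `default` namespace fallback are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around kubectl for ending a StressChaos experiment.
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Usage: stop_stress_experiment <name> [namespace]
# The namespace is where the StressChaos object lives (assumed "default"
# when omitted), which may differ from the namespace of the target pods.
stop_stress_experiment() {
    local name="$1" namespace="${2:-default}"
    # Show the current state first so the result is recorded before deletion.
    $KUBECTL describe stresschaos "$name" --namespace "$namespace" || return 1
    # Deleting the resource asks Chaos Mesh to stop the stressors early.
    $KUBECTL delete stresschaos "$name" --namespace "$namespace"
}
```

With `KUBECTL=echo` set, `stop_stress_experiment cpu-stress-test` prints the two commands instead of running them, which is a cheap way to review what would happen against a shared cluster.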
Learning Path
flowchart LR
    A[Network Partitioning] --> B[Infrastructure Faults]
    B --> C[Kubernetes Chaos]
    C --> D[Chaos Engineering Pipeline]
    style B fill:#f90,color:#fff
Common Errors
- Running CPU stress on a single-core production server: Saturating the only CPU core will also prevent the chaos agent from recovering the system. Use multi-core servers or limit stress to fewer cores.
- Filling the root partition completely: A completely full root partition can prevent SSH logins and system recovery. Leave at least 1GB free.
- Ignoring swap space exhaustion: Memory pressure that triggers swapping can degrade performance far more than the application slowdown you intended to test.
- Not cleaning up stress test files: The dd command creates large files that consume disk space permanently if not deleted with rm /tmp/fill-disk.
- Using OOM-killable stress processes without monitoring: The kernel OOM killer may terminate the wrong process. Always monitor which process gets killed.
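Forgotten test files are easier to avoid when cleanup is tied to the experiment itself. The following sketch runs a command in a subshell and removes a scratch file when that subshell exits, whether the command succeeds, fails, or is interrupted. The function name `with_cleanup` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and always remove <scratch_file>
# afterwards. The body runs in a subshell so the traps stay local to it.
# Usage: with_cleanup <scratch_file> <command> [args...]
with_cleanup() (
    scratch="$1"
    shift
    # Remove the scratch file whenever the subshell exits.
    trap 'rm -f -- "$scratch"' EXIT
    # Turn Ctrl-C or a kill into a normal exit so the EXIT trap still runs.
    trap 'exit 130' INT TERM
    "$@"
    status=$?
    # Pass the command's own exit status back to the caller.
    exit "$status"
)
```

The wrapped command should cover the whole experiment, not only the write. For example, `with_cleanup /tmp/fill-disk sh -c 'dd if=/dev/zero of=/tmp/fill-disk bs=1M count=1024 && df -h /'` keeps the file in place for the observation and deletes it afterwards.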
Practice Questions
- How do you saturate CPU cores using the stress command?
- What happens when the OOM killer is triggered?
- How do you create a disk-full scenario safely?
- What is I/O wait and how does it affect application performance?
- How does Chaos Mesh StressChaos differ from running stress directly?
Challenge
Design an infrastructure fault experiment plan that tests how your application responds to: CPU saturation at 50 percent, 75 percent, and 100 percent; memory pressure at 80 percent of available RAM; and disk space at 90 percent full. For each scenario document the application behavior and whether the SLOs were maintained.
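As a starting point for the CPU part of the plan, note that the stress tool used above only saturates cores fully. A partial load needs a different mechanism, such as the `load` field of StressChaos or stress-ng, a separate tool not used elsewhere in this tutorial. The dry-run sketch below assumes stress-ng is installed and only prints the commands. The function name `print_cpu_plan` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run planner: print one stress-ng command per CPU load level.
# Nothing is executed; review the lines, then run them one at a time.
# Usage: print_cpu_plan <seconds> <load_percent>...
print_cpu_plan() {
    local seconds="$1" load
    shift
    for load in "$@"; do
        # --cpu 0 asks stress-ng for one worker per online CPU;
        # --cpu-load sets the target load percentage for each worker.
        echo "stress-ng --cpu 0 --cpu-load $load --timeout ${seconds}s"
    done
}

print_cpu_plan 60 50 75 100
```

Printing the plan first keeps the blast radius explicit: each line can be reviewed, run on its own, and paired with a note on application behavior and SLO status before moving to the next level.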
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro