Infrastructure Faults — CPU, Memory, Disk & Node Failures

DodaTech Updated 2026-06-21 5 min read

This tutorial covers infrastructure faults in Chaos Engineering, with hands-on commands for each fault type and the mistakes to avoid when running them.

Infrastructure faults in Chaos Engineering target the underlying hardware and operating system resources: CPU, memory, disk, and the host itself. These faults simulate what happens when a server is overwhelmed by traffic, a disk fills up with logs, or a node fails entirely.

What You Will Learn

This tutorial teaches you how to inject CPU stress, memory pressure, disk space exhaustion, and node failures using built-in Linux tools and Chaos Engineering platforms.

Why It Matters

Infrastructure failures are inevitable in any production environment. CPU throttling, memory swapping, and disk-full conditions are among the most common real-world incidents. Proactively testing these scenarios ensures your application degrades gracefully instead of crashing or corrupting data.

Real-World Use

DodaTech runs weekly CPU stress tests on Durga Antivirus Pro scanning nodes. The test verifies that the antivirus process yields CPU to other system processes when CPU usage exceeds 90 percent, so that the scanner does not starve critical system services.

Prerequisites

Before starting, you should understand:

  • Chaos Engineering fundamentals (hypothesis, blast radius)
  • Linux system administration basics (CPU, memory, disk commands)
  • Kubernetes node and pod concepts

Step 1: Inject CPU Stress

Use the stress tool to saturate CPU cores:

# Install stress tool
sudo apt-get install -y stress

# Saturate 2 CPU cores for 60 seconds
stress --cpu 2 --timeout 60

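# In another terminal, observe per-core CPU usage
# (mpstat is part of the sysstat package; plain "top -bn1" shows only the aggregate)
mpstat -P ALL 1 1
# Illustrative result on a 4-core host:
# two cores at about 100 %usr, the other two close to idle
# (the scheduler decides which cores run the workers, so it is not always cores 0 and 1)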
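
A fixed worker count such as 2 can be too much for a small host. The following is a minimal sketch that derives the worker count from the number of cores while holding some cores back for the operating system and monitoring. The helper name stress_workers and the idea of a "reserve" are assumptions made for illustration; they are not part of the stress tool.

```shell
#!/bin/sh
# Hypothetical helper: number of stress workers for a host with $1 cores
# while keeping $2 cores free. Prints 0 when no core can be spared.
stress_workers() {
  total=$1
  reserve=$2
  workers=$(( total - reserve ))
  if [ "$workers" -lt 0 ]; then
    workers=0
  fi
  echo "$workers"
}

# Example: a 4-core host with 2 cores held back
stress_workers 4 2   # prints 2
```

On a real host you could pass the output of nproc as the first argument and start stress only when the result is greater than 0, for example: stress --cpu "$(stress_workers "$(nproc)" 1)" --timeout 60.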

Step 2: Simulate Memory Pressure

Consume memory to trigger swapping and OOM conditions:

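# Allocate 512 MB with one worker for 30 seconds.
# On a host with plenty of free RAM this alone will not cause swapping;
# raise --vm-bytes toward the available memory to create real pressure.
stress --vm 1 --vm-bytes 512M --timeout 30

# In another terminal, monitor memory usage while stress is running
free -h
# Illustrative output on a host that is already under pressure:
#               total        used        free
# Mem:           7.6G        5.8G        1.8G
# Swap:          2.0G        0.5G        1.5G
# Non-zero swap usage shows that the kernel has started swapping

# Check whether the OOM killer has fired (reading dmesg may require root)
sudo dmesg | grep -i oom | tail -5
# Illustrative output if a process was killed:
# [12345.678] oom-kill: constraint=CONSTRAINT_NONE ...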
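
Whether a given allocation causes pressure depends on how much memory the host has available at that moment. The sketch below sizes --vm-bytes as a percentage of the MemAvailable value in /proc/meminfo. The helper names vm_bytes_mb and read_mem_available_kb are hypothetical, and the 80 percent figure is only an example.

```shell
#!/bin/sh
# Hypothetical helper: pct percent of avail_kb kilobytes, in whole megabytes.
vm_bytes_mb() {
  avail_kb=$1
  pct=$2
  echo $(( avail_kb * pct / 100 / 1024 ))
}

# MemAvailable is the kernel's estimate of memory that can be used
# without swapping (Linux only).
read_mem_available_kb() {
  awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# Example with a fixed figure so the arithmetic is easy to check:
# 8,000,000 kB available, target 80 percent
vm_bytes_mb 8000000 80   # prints 6250
```

A run against the live host might then look like: stress --vm 1 --vm-bytes "$(vm_bytes_mb "$(read_mem_available_kb)" 80)M" --timeout 30.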

Step 3: Fill Disk Space

Simulate a full disk scenario:

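# Create a 1 GB file to consume disk space
dd if=/dev/zero of=/tmp/fill-disk bs=1M count=1024
# Expected output:
# 1024+0 records in
# 1024+0 records out
# 1073741824 bytes (1.1 GB) copied

# Check usage of the filesystem that holds the file
# (/tmp may be a separate partition or a RAM-backed tmpfs)
df -h /tmp
# Illustrative output when /tmp is on the root filesystem:
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/sda1        20G   19G  1.0G  95% /

# With 1.0G still available, writes succeed. Once the filesystem has
# no space left, an application trying to write can fail like this:
touch /tmp/test-file
# touch: cannot touch '/tmp/test-file': No space left on device

# Clean up after the experiment
rm -f /tmp/fill-disk /tmp/test-file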
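
A fixed count=1024 does not take the current free space into account. As a rough sketch, the fill size can be computed from the space that df reports, minus a safety reserve, and the file can be removed automatically when the script exits. The function names and the hold-time parameter below are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: megabytes to write so that reserve_mb stays free.
fill_size_mb() {
  avail_kb=$1
  reserve_mb=$2
  size=$(( avail_kb / 1024 - reserve_mb ))
  if [ "$size" -lt 0 ]; then
    size=0
  fi
  echo "$size"
}

# Available space in kB on the filesystem that holds path $1
# (POSIX df output: the fourth column of the second line).
avail_kb_for() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Fill the filesystem holding directory $1, leave $2 MB free, hold the
# condition for $3 seconds, then let the EXIT trap delete the file.
fill_disk() {
  target=$1/fill-disk
  count=$(fill_size_mb "$(avail_kb_for "$1")" "$2")
  trap 'rm -f "$target"' EXIT
  if [ "$count" -gt 0 ]; then
    dd if=/dev/zero of="$target" bs=1M count="$count"
    sleep "$3"
  fi
}
```

For example, fill_disk /tmp 1024 60 would keep about 1 GB free for one minute. Nothing runs until fill_disk is called.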

Step 4: Simulate Disk I/O Throttling

Use dd to generate disk I/O contention and iostat to observe it:

# Create heavy disk I/O load
dd if=/dev/zero of=/tmp/io-test bs=1M count=4096 &
dd if=/dev/zero of=/tmp/io-test2 bs=1M count=4096 &

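# Monitor device utilization and wait times (iostat is part of the sysstat package)
iostat -x 2 3
# Illustrative output (column names vary between sysstat versions):
# Device            r/s    w/s    rkB/s    wkB/s  await  %util
# sda             12.3  456.7    98.5   4567.8   12.3   99.5
# Disk utilization at 99.5% indicates saturation

# When both dd jobs have finished, remove the test files
wait
rm -f /tmp/io-test /tmp/io-test2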
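
The two background writers leave large files behind. One way to keep load generation and cleanup together is a small wrapper, sketched here under the assumption that two parallel writers are enough to create contention. The function name io_load is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run two concurrent writers in directory $1, each
# writing $2 MB, wait for both to finish, then delete their files.
io_load() {
  dir=$1
  size_mb=$2
  dd if=/dev/zero of="$dir/io-test" bs=1M count="$size_mb" 2>/dev/null &
  pid1=$!
  dd if=/dev/zero of="$dir/io-test2" bs=1M count="$size_mb" 2>/dev/null &
  pid2=$!
  wait "$pid1" "$pid2"
  rm -f "$dir/io-test" "$dir/io-test2"
}
```

Running io_load /tmp 4096 & in one terminal and iostat -x 2 3 in another reproduces the experiment without leaving the files on disk.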

Step 5: Use Chaos Mesh for Infrastructure Faults

Use StressChaos for managed infrastructure fault injection:

# infrastructure-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-test
spec:
  mode: one
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: payment-service
  stressors:
    cpu:
      workers: 2
      load: 100
  duration: 120s

# Apply the experiment
kubectl apply -f infrastructure-stress.yaml
# Expected output:
# stresschaos.chaos-mesh.org/cpu-stress-test created
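
After applying the manifest it helps to have a way to inspect the experiment and to stop it before the 120 seconds are up. The wrappers below are a sketch. The function names are hypothetical, while the resource kind, experiment name, and file name are taken from the manifest above.

```shell
#!/bin/sh
# Hypothetical wrapper: show the state and events of a StressChaos
# experiment named $1 in the current namespace.
experiment_status() {
  kubectl describe stresschaos "$1"
}

# Hypothetical wrapper: remove the experiment defined in manifest $1.
# --ignore-not-found makes the call safe to repeat.
stop_experiment() {
  kubectl delete -f "$1" --ignore-not-found
}
```

Typical use would be experiment_status cpu-stress-test while the test runs, and stop_experiment infrastructure-stress.yaml to abort it early.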

Learning Path

flowchart LR
  A[Network Partitioning] --> B[Infrastructure Faults]
  B --> C[Kubernetes Chaos]
  C --> D[Chaos Engineering Pipeline]
  style B fill:#f90,color:#fff

Common Errors

  1. Running CPU stress on a single-core production server: Saturating the only CPU core will also prevent the chaos agent from recovering the system. Use multi-core servers or limit stress to fewer cores.
  2. Filling the root partition completely: A completely full root partition can prevent SSH logins and system recovery. Leave at least 1GB free.
  3. Ignoring swap activity: Once memory pressure triggers swapping, every process on the host slows down, which can degrade performance far more than the application slowdown you intended to test. Watch swap usage during the experiment.
  4. Not cleaning up stress test files: The dd commands create large files that keep consuming disk space until you delete them, for example with rm /tmp/fill-disk /tmp/io-test /tmp/io-test2.
  5. Running memory stress without monitoring the OOM killer: The kernel OOM killer may terminate a process other than the stress worker. Always check which process was killed.

Practice Questions

  1. How do you saturate CPU cores using the stress command?
  2. What happens when the OOM killer is triggered?
  3. How do you create a disk-full scenario safely?
  4. What is I/O wait and how does it affect application performance?
  5. How does Chaos Mesh StressChaos differ from running stress directly?

Challenge

Design an infrastructure fault experiment plan that tests how your application responds to: CPU saturation at 50 percent, 75 percent, and 100 percent; memory pressure at 80 percent of available RAM; and disk space at 90 percent full. For each scenario document the application behavior and whether the SLOs were maintained.
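
As a starting point for the CPU part of the plan, the sketch below prints one stress command per load level. It assumes that "50 percent saturation" means keeping half of the cores fully busy, because each stress CPU worker occupies one core. That interpretation and the function names are assumptions. The commands are only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: workers needed to load pct percent of cores,
# rounded up to a whole worker.
workers_for_load() {
  cores=$1
  pct=$2
  echo $(( (cores * pct + 99) / 100 ))
}

# Print the CPU experiments for a host with $1 cores.
print_cpu_plan() {
  for level in 50 75 100; do
    echo "cpu ${level}%: stress --cpu $(workers_for_load "$1" "$level") --timeout 60"
  done
}

print_cpu_plan 4
# cpu 50%: stress --cpu 2 --timeout 60
# cpu 75%: stress --cpu 3 --timeout 60
# cpu 100%: stress --cpu 4 --timeout 60
```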

FAQ

What are infrastructure faults in Chaos Engineering?

Infrastructure faults target hardware and OS resources: CPU stress, memory pressure, disk exhaustion, I/O throttling, and node failures.

How do I safely simulate CPU stress?

Use tools like stress, stress-ng, or StressChaos with a limited number of workers and a set duration. Never stress all CPU cores on a production server without redundancy.

What happens when memory is exhausted?

The kernel OOM killer terminates processes to free memory. It picks its victim by OOM score, which favors the largest memory consumers, so the process that gets killed may not be the one you expected.
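
On Linux the current score of a process can be read from /proc/<pid>/oom_score, and it can be biased through /proc/<pid>/oom_score_adj (range -1000 to 1000). The sketch below reads the score before an experiment. The function name oom_score_of is hypothetical.

```shell
#!/bin/sh
# Print the kernel's current OOM score for PID $1, or "unknown" when
# the file cannot be read (non-Linux host, or the process has exited).
oom_score_of() {
  file=/proc/$1/oom_score
  if [ -r "$file" ]; then
    cat "$file"
  else
    echo unknown
  fi
}

# Example: the score of the current shell
oom_score_of $$
```

Comparing the scores of the stress worker and of the services you care about gives a hint about which process the kernel is likely to pick first.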

How do I simulate a full disk without causing permanent issues?

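Create a large file with dd on the filesystem you want to test and remove it immediately after the experiment. Check first where /tmp lives: it is often a separate partition or a RAM-backed tmpfs, in which case a file there does not fill the root disk. Set up monitoring to alert when disk usage exceeds a threshold during the test.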

Can node failures be simulated safely?

Yes. In Kubernetes, use kubectl drain to evict workloads from a node and kubectl delete node to remove it from the cluster, or stop the kubelet service on the node in a controlled test environment. Ensure workloads are replicated across multiple nodes.
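
A drain-based node failure can be sketched as two small wrappers, one to evict workloads and one to bring the node back. The function names and the 120 second timeout are assumptions. kubectl drain cordons the node itself, so no separate cordon step is shown.

```shell
#!/bin/sh
# Hypothetical wrapper: evict all evictable pods from node $1.
# DaemonSet pods are skipped and emptyDir data on the node is discarded.
evict_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data --timeout=120s
}

# Hypothetical wrapper: allow scheduling on node $1 again.
restore_node() {
  kubectl uncordon "$1"
}
```

Calling evict_node with a node name from kubectl get nodes starts the experiment, and restore_node with the same name ends it.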
