High CPU Usage Troubleshooting -- Process Analysis, Thread Dumps & Performance Tuning
High CPU usage slows applications, increases cloud costs, and can trigger autoscaling events unexpectedly -- this guide shows you how to identify the specific process, thread, and code path consuming CPU and apply targeted fixes in Linux environments.
What You'll Learn
Why It Matters
A runaway process consuming 100% CPU on a single core might go unnoticed until it affects other services. At on-demand cloud rates, one extra vCPU running 24/7 can cost several hundred dollars per year (the $700+ figure corresponds to roughly $0.08 per vCPU-hour over 8,760 hours). Knowing how to pinpoint exactly which function is burning CPU is essential for any systems engineer.
Real-World Use
When your web server's CPU is at 100% but request traffic is normal, a background job consumes all cores during peak hours, or a system process uses 50% CPU for no apparent reason, these techniques isolate the root cause.
Common High CPU Issues Table
| Issue | Symptom | Cause | Diagnostic Tool |
|---|---|---|---|
| Runaway process | Single process at 100% CPU | Infinite loop, busy wait | top, htop, ps |
| Thread contention | High CPU across many threads | Excessive locking, context switching | pidstat, perf top |
| Kernel module leak | High system CPU time | Device driver or filesystem bug | dmesg, perf top -U |
| I/O wait | High iowait CPU | Disk bottleneck causing wait | iostat, iotop |
| Interrupt storm | High si (softirq) CPU | Network or hardware interrupt flood | /proc/interrupts, mpstat |
| Memory pressure | High kswapd CPU usage | Insufficient RAM, constant swapping | vmstat 1, sar -S |
Step-by-Step Fixes
Fix 1: Find the CPU-Hungry Process
# Real-time process viewer (sorted by CPU)
top -o %CPU
# One-shot snapshot of top CPU consumers
ps aux --sort=-%cpu | head -10
# Show only processes using >50% CPU
ps aux | awk '$3 > 50 {print $0}'
# Sample CPU usage for a specific PID (10 samples, 1 second apart)
pidstat -p <PID> 1 10
# Show threads of a process sorted by CPU
top -H -p <PID>
Expected output:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
appuser 1234 98.3 2.5 1258292 262144 ? R 10:00 45:23 node app.js
appuser 5678 2.1 0.5 460800 91136 ? S 10:00 1:12 python worker.py
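The awk filter in Fix 1 drops the ps header row, which makes the output harder to read once it is saved to a log. A minimal sketch of a reusable filter follows; the function name hot_processes is hypothetical, and it assumes %CPU is the third column, as in ps aux.

```shell
# Keep the ps header plus every row whose %CPU (column 3) exceeds the limit.
# The default limit of 50 mirrors the threshold used in Fix 1.
hot_processes() {
  awk -v limit="${1:-50}" 'NR == 1 || ($3 + 0) > (limit + 0)'
}

# Usage: ps aux | hot_processes 80
```

Forcing both sides to numbers with "+ 0" avoids an accidental string comparison when a field does not look numeric.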
Fix 2: Profile CPU Usage with Perf
# Sample call stacks on all CPUs at 99 Hz for 10 seconds (-a = system-wide; without it perf profiles only the sleep command)
sudo perf record -a -g -F 99 -- sleep 10
# Generate a flamegraph-compatible report
sudo perf script > perf.stacks
# Show the hottest functions in the kernel
sudo perf top -U
# Record only a specific PID
sudo perf record -g -F 99 -p <PID> -- sleep 20
# Show the call graph summary
sudo perf report --stdio -g
# Install and use flamegraph scripts
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
# Generate flamegraph SVG (perf.data was written in the parent directory before the cd)
sudo perf script -i ../perf.data | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > cpu-flamegraph.svg
Expected output:
+ 100.00% 0x0
+ 99.50% node
+ 68.20% LazyCompile:~processRequest /app/server.js:45
+ 22.10% LazyCompile:~parseData /app/parser.js:120
+ 9.20% Builtin:~Stringify
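A flamegraph is easiest to read visually, but the folded stack file can also be summarized as text. The sketch below assumes the folded format written by stackcollapse-perf.pl (frames joined by ";", followed by a space and a sample count) and sums samples by leaf frame, meaning the function that was actually on-CPU. The function name top_leaf_frames is hypothetical.

```shell
# Print the N leaf frames with the most samples from folded stacks on stdin.
# Output columns: sample count, share of all samples, frame name.
top_leaf_frames() {
  awk '
    NF == 0 { next }                     # skip blank lines
    {
      count = $NF                        # last field is the sample count
      stack = $0
      sub(/[ \t]+[0-9]+$/, "", stack)    # drop the trailing count
      n = split(stack, frames, ";")
      leaf[frames[n]] += count
      total += count
    }
    END {
      for (f in leaf) printf "%d %.1f%% %s\n", leaf[f], 100 * leaf[f] / total, f
    }' | sort -k1,1nr | head -n "${1:-3}"
}

# Usage: top_leaf_frames 3 < out.perf-folded
```

Leaf frames show self time only; a function that is slow because of its callees will appear further up the stack in the flamegraph rather than in this list.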
Fix 3: Analyze Thread Contention
// ThreadContention.java -- Simulate excessive locking
import java.util.concurrent.*;
import java.util.concurrent.locks.*;

public class ThreadContention {
    private static final ReentrantLock lock = new ReentrantLock(true);
    private static int sharedCounter = 0;

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(32);
        for (int i = 0; i < 50; i++) {
            executor.submit(() -> {
                for (int j = 0; j < 100000; j++) {
                    lock.lock();
                    try {
                        sharedCounter++; // Heavy contention on this lock
                        Thread.sleep(1); // Holding lock unnecessarily long
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        lock.unlock();
                    }
                }
                return null;
            });
        }
        executor.shutdown();
    }
}
# Compile and run; the JVM writes SIGQUIT thread dumps to its own stdout, so redirect it to a file
javac ThreadContention.java
java ThreadContention > /tmp/threaddump.txt 2>&1 &
# Take thread dump (3 samples, 2 seconds apart)
for i in 1 2 3; do
    kill -3 $(pgrep -f ThreadContention)
    sleep 2
done
# Threads queued on a ReentrantLock are parked, so they show as WAITING (parking);
# BLOCKED appears only for threads waiting on a synchronized monitor
grep -A5 "WAITING (parking)" /tmp/threaddump.txt
# Alternatively, capture a dump with jstack and count the parked threads
jstack -l <PID> > threaddump.txt
grep -c "WAITING (parking)" threaddump.txt
Expected output:
"pool-1-thread-12" #22 prio=5 os_prio=0 cpu=1234.56ms
java.lang.Thread.State: WAITING (parking)
- parking to wait for <0x0000000123456789> (a java.util.concurrent.locks.ReentrantLock$FairSync)
at ThreadContention.lambda$main$0(ThreadContention.java:15)
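To connect a hot thread from top -H to its stack in a thread dump, the thread ID has to match the nid field in the dump. The sketch below assumes a HotSpot dump that prints nid in hexadecimal, which many JDK versions do; if your JDK prints nid in decimal, search for the TID directly instead. The function name tid_to_nid and the example numbers are hypothetical.

```shell
# Convert a decimal Linux thread ID (as shown by top -H) to the hex form
# used by the nid= field in hex-style thread dumps.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Usage, assuming top -H -p <PID> showed TID 4660 as the hottest thread:
#   grep -A 10 "nid=$(tid_to_nid 4660)" threaddump.txt
```

Repeating the lookup across the three dumps shows whether the same thread stays in the same frame, which points to a loop rather than a one-off burst.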
Fix 4: Diagnose Kernel Module CPU Usage
# Check if high CPU is in kernel or user space
top -p <PID>
# %CPU: 99.3, but check:
# us (user) vs sy (system) ratio in top header
# Profile kernel functions only (-U hides user-space symbols)
sudo perf top -U
# Aggregate kernel-mode samples by shared object to see which modules use CPU
sudo perf top -e cycles:k --sort dso
# Check network interrupt distribution
grep -E "CPU|eth" /proc/interrupts
# Confirm the suspect module is loaded and check its use count
lsmod | grep <module>
cat /proc/modules
Expected output:
Samples: 1M of event 'cycles:k', 4000 Hz
Event count (approx.): 450000000000
Overhead Shared Object Symbol
45.2% [kernel] [k] _raw_spin_lock
12.3% [kernel] [k] do_syscall_64
8.7% [e1000] [k] e1000_clean_rx_irq # High driver CPU
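The us versus sy split that top shows in its header can also be computed from two samples of the first line of /proc/stat, which is handy in scripts and alerts. This is a minimal sketch: the function name cpu_breakdown is hypothetical, and it assumes the standard field order on the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# Print the percentage of CPU time per category between two /proc/stat samples.
# $1 and $2 are the aggregate "cpu ..." lines, taken a few seconds apart.
cpu_breakdown() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) prev[i] = $i; next }
    {
      for (i = 2; i <= 9; i++) { d[i] = $i - prev[i]; total += d[i] }
      if (total <= 0) { print "no CPU time elapsed between samples"; exit 1 }
      printf "us=%.1f ni=%.1f sy=%.1f id=%.1f wa=%.1f hi=%.1f si=%.1f st=%.1f\n",
        100 * d[2] / total, 100 * d[3] / total, 100 * d[4] / total,
        100 * d[5] / total, 100 * d[6] / total, 100 * d[7] / total,
        100 * d[8] / total, 100 * d[9] / total
    }'
}

# Usage:
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   cpu_breakdown "$a" "$b"
```

A high sy or si share here is the cue to move on to the kernel-side tools in this fix rather than profiling application code.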
Fix 5: Reduce I/O Wait with Process Prioritization
# Check I/O wait percentage
iostat -x 1 5
# Find processes with high I/O
sudo iotop -o
# Renice a CPU-hungry process to lower priority
# Niceness range: -20 (highest) to 19 (lowest)
renice -n 10 -p <PID>
# Set I/O scheduling class to idle (class 3: gets disk time only when no other process needs it)
sudo ionice -c 3 -p <PID>
# Limit CPU usage with cpulimit
sudo apt install cpulimit
cpulimit -p <PID> -l 50 # Limit to 50% of one core
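# Use cgroups to set hard CPU limits (cgroup v1 paths, libcgroup tools)
sudo cgcreate -g cpu:/limited-app
echo 50000 | sudo tee /sys/fs/cgroup/cpu/limited-app/cpu.cfs_quota_us # 50% of one core with the default 100000 us period
echo <PID> | sudo tee /sys/fs/cgroup/cpu/limited-app/tasks
# On cgroup v2 hosts, write "50000 100000" to cpu.max and the PID to cgroup.procs instead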
Expected output:
avg-cpu: %user %nice %system %iowait %steal %idle
12.5 0.0 3.2 35.5 0.0 48.8 # High iowait
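The quota written to cpu.cfs_quota_us is the CPU time, in microseconds, that the group may use per scheduling period, so 50000 against the default 100000 us period is 50% of one core. A small sketch of that arithmetic follows; the function name cfs_quota_us is hypothetical, and the default period is assumed to be unchanged.

```shell
# Compute a CFS quota in microseconds.
# $1 = percent of one core (values above 100 span multiple cores)
# $2 = scheduling period in microseconds (defaults to 100000)
cfs_quota_us() {
  echo $(( $1 * ${2:-100000} / 100 ))
}

# Usage: cfs_quota_us 50      prints 50000
#        cfs_quota_us 150     prints 150000 (one and a half cores)
```

Integer arithmetic is enough here because the kernel expects whole microseconds.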
High CPU Troubleshooting Flowchart
flowchart TD
A[High CPU Alert] --> B{Which metric is high?}
B -->|User CPU| C[Find top process with top/ps]
C --> D[Profile with perf record]
D --> E[Generate flamegraph]
E --> F[Identify hot function in code]
F --> G[Optimize algorithm or add caching]
B -->|System CPU| H[Check kernel functions with perf top -U]
H --> I[Look for driver or filesystem module]
I --> J[Update driver, disable module, or tune kernel]
B -->|I/O Wait| K[Find I/O-heavy process with iotop]
K --> L[Move process to faster storage or apply ionice]
L --> M[Check disk health and RAID status]
B -->|SoftIRQ| N[Check /proc/interrupts for flood]
N --> O[Check network driver and NIC health]
O --> P[Tune interrupt coalescing or spread IRQs]
G --> Q[CPU Normalized]
J --> Q
M --> Q
P --> Q
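The first decision in the flowchart can be sketched as a small helper that maps the four header values from top to a branch. Picking the largest value is a naive assumption made for this sketch; in practice you would compare each value against its normal baseline. The function name next_step is hypothetical.

```shell
# Suggest a flowchart branch from top header percentages.
# $1 = us, $2 = sy, $3 = wa, $4 = si
next_step() {
  awk -v us="$1" -v sy="$2" -v wa="$3" -v si="$4" 'BEGIN {
    best = "us"; max = us + 0
    if (sy + 0 > max) { best = "sy"; max = sy + 0 }
    if (wa + 0 > max) { best = "wa"; max = wa + 0 }
    if (si + 0 > max) { best = "si"; max = si + 0 }
    if (best == "us") print "user: find the top process, profile it with perf record, build a flamegraph"
    if (best == "sy") print "system: inspect kernel symbols with perf top -U, look for a driver or filesystem module"
    if (best == "wa") print "iowait: find the I/O-heavy process with iotop, then check disk health"
    if (best == "si") print "softirq: check /proc/interrupts for a flood, then the NIC driver and IRQ spreading"
  }'
}

# Usage: next_step 12.5 3.2 35.5 0.0
```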
Prevention Tips
- Set CPU limits on all containers and VMs with cgroups or Docker --cpus flags
- Profile applications under load during development with perf and address hotspots before production
- Monitor CPU usage trends with Prometheus and set alerts on sustained >80% utilization
- Use nice and ionice for background batch jobs so they do not starve interactive processes
- Spread hardware interrupts across CPU cores with irqbalance or manual /proc/irq/ SMP affinity
- Implement backpressure mechanisms in queue consumers to prevent runaway processing
Practice Questions
What is the difference between top and perf top for debugging CPU usage?
Answer: top shows CPU usage at the process level -- which process is consuming CPU cycles. perf top shows CPU usage at the function level -- which specific function or kernel symbol is burning CPU. Use top to find the culprit process, then perf to drill into what that process is doing.
How do you determine whether high CPU is caused by user code, kernel code, or interrupts?
Answer: Run top and check the header line -- us (user), sy (system), si (softirq), hi (hardirq), wa (iowait). User code shows as us, kernel modules or system calls show as sy, interrupt handling shows as si/hi, and disk-bound processes show as wa.
What is the relationship between high CPU usage and I/O wait, and how do you distinguish them?
Answer: High user/system CPU means the CPU is actively executing code (compute-bound). High iowait means the CPU is idle while processes are blocked waiting for disk I/O to complete. A process can also be compute-heavy and cause other processes to experience iowait if the filesystem or storage cannot keep up.
Challenge: Write a bash script that takes a PID, samples CPU usage with perf for 30 seconds, generates a flamegraph, and identifies the top 3 functions consuming CPU.
Answer:
#!/bin/bash
PID=${1:?"Usage: $0 <PID>"}
DURATION=${2:-30}
# -o must come before "--"; everything after "--" is the command perf runs while sampling
sudo perf record -g -F 99 -p "$PID" -o /tmp/perf.data -- sleep "$DURATION"
sudo perf script -i /tmp/perf.data | /opt/FlameGraph/stackcollapse-perf.pl > /tmp/out.folded
/opt/FlameGraph/flamegraph.pl /tmp/out.folded > /tmp/cpu-${PID}.svg
# Top 3 functions by self time (call chains suppressed, comment and blank lines dropped)
sudo perf report -i /tmp/perf.data --stdio --no-children -g none --sort symbol | grep -v '^#' | awk 'NF' | head -3
echo "Flamegraph: /tmp/cpu-${PID}.svg"
Quick Reference
| Issue | Diagnostic | Resolution |
|---|---|---|
| Runaway process | ps aux --sort=-%cpu | Kill or fix infinite loop |
| Thread contention | kill -3 <PID> (thread dump) | Reduce lock scope, use lock-free structures |
| Kernel module high CPU | perf top -U | Update driver, disable module |
| I/O wait | iostat -x 1 | Move load, ionice, upgrade storage |
| Interrupt storm | cat /proc/interrupts | Spread IRQs, update NIC driver |
FAQ
What is a "runaway process" and how do you safely stop it?
A runaway process is a program that enters an infinite loop or busy-wait, consuming 100% of a CPU core indefinitely. First try kill -TERM <PID> for a graceful shutdown. If it ignores SIGTERM, use kill -KILL <PID> (cannot be caught). Use renice -n 19 -p <PID> to reduce impact first if you need to investigate before killing.
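The TERM-then-KILL sequence described above can be wrapped in a small helper. This is a sketch under stated assumptions: the function name stop_gracefully and the 10-second default grace period are hypothetical, and the caller must have permission to signal the process.

```shell
# Send SIGTERM, wait up to N seconds for the process to exit, then SIGKILL.
# $1 = PID, $2 = grace period in seconds (defaults to 10)
# Returns 1 if SIGTERM could not be sent (no such process or no permission).
stop_gracefully() {
  sg_pid=$1
  sg_timeout=${2:-10}
  kill -TERM "$sg_pid" 2>/dev/null || return 1
  sg_waited=0
  while [ "$sg_waited" -lt "$sg_timeout" ]; do
    kill -0 "$sg_pid" 2>/dev/null || return 0   # process is gone
    sleep 1
    sg_waited=$((sg_waited + 1))
  done
  kill -KILL "$sg_pid" 2>/dev/null              # still alive after the grace period
}

# Usage: stop_gracefully <PID> 15
```

kill -0 sends no signal; it only checks whether the PID can still be signalled, which makes it a cheap liveness probe inside the wait loop.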
What causes high si (softirq) CPU usage and how do you reduce it?
High softirq CPU is typically caused by a flood of network packets or disk interrupts that the kernel cannot process fast enough. Solutions: (1) enable interrupt coalescing with ethtool -C eth0 rx-usecs 100, (2) spread interrupts across cores using irqbalance or manual /proc/irq/ affinity, (3) upgrade the NIC driver, (4) reduce network packet rates with rate limiting or better application batching.
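Before tuning coalescing or affinity, it helps to see which interrupt lines carry the most traffic. The sketch below sums the per-CPU counters of each row in /proc/interrupts and prints the busiest ones. The function name irq_totals is hypothetical, and it assumes the usual layout: a header with one column per CPU, then one row per IRQ.

```shell
# Print the N busiest interrupt lines from /proc/interrupts-style input on stdin.
# Output columns: total count across CPUs, IRQ label, description.
irq_totals() {
  awk '
    NR == 1 { ncpu = NF; next }          # header: one column per CPU
    {
      irq = $1
      sub(/:$/, "", irq)
      sum = 0
      for (i = 2; i <= ncpu + 1 && i <= NF; i++) sum += $i
      desc = ""
      for (i = ncpu + 2; i <= NF; i++) desc = desc (desc == "" ? "" : " ") $i
      printf "%.0f %s %s\n", sum, irq, desc
    }' | sort -k1,1nr | head -n "${1:-5}"
}

# Usage: irq_totals 5 < /proc/interrupts
```

The counters are cumulative since boot, so run it twice a few seconds apart and compare the totals to see the current rate rather than the historical one.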
How do you distinguish between a CPU performance issue and an application-level bottleneck?
If CPU is at 100% and requests per second are below expected throughput, it is a CPU performance issue (inefficient code, missing indexes, no caching). If CPU is below 100% but throughput is below expected, the bottleneck is elsewhere -- I/O, network, external API latency, or lock contention. Use the USE method (Utilization, Saturation, Errors) for every resource to find the actual bottleneck.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.