High CPU Usage Troubleshooting -- Process Analysis, Thread Dumps & Performance Tuning

DodaTech Updated 2026-06-23 9 min read

High CPU usage slows applications, increases cloud costs, and can trigger unexpected autoscaling events. This guide shows you how to identify the specific process, thread, and code path consuming CPU on Linux, and how to apply targeted fixes.

What You'll Learn

Why It Matters

A runaway process consuming 100% of a single core can go unnoticed until it affects other services. In the cloud, one extra vCPU running 24/7 can add hundreds of dollars per year, depending on the provider and instance type. Knowing how to pinpoint exactly which function is burning CPU is essential for any systems engineer.

Real-World Use

When your web server's CPU is at 100% but request traffic is normal, a background job consumes all cores during peak hours, or a system process uses 50% CPU for no apparent reason, these techniques isolate the root cause.

Common High CPU Issues Table

Issue	Symptom	Cause	Diagnostic Tool
Runaway process	Single process at 100% CPU	Infinite loop, busy wait	`top`, `htop`, `ps`
Thread contention	High CPU across many threads	Excessive locking, context switching	`pidstat -wt`, `perf top`
Kernel module high CPU	High `sy` (system) CPU time	Device driver or filesystem bug	`dmesg`, `perf top -U`
I/O wait	High `iowait` CPU	Disk bottleneck causing wait	`iostat`, `iotop`
Interrupt storm	High `si` (softirq) CPU	Network or hardware interrupt flood	`/proc/interrupts`, `mpstat`
Memory pressure	High `kswapd` CPU usage	Insufficient RAM, constant swapping	`vmstat 1`, `sar -S`

Step-by-Step Fixes

Fix 1: Find the CPU-Hungry Process

# Real-time process viewer (sorted by CPU)
top -o %CPU

# One-shot snapshot of top CPU consumers
# Note: ps reports %CPU averaged over each process's lifetime, not the current rate
ps aux --sort=-%cpu | head -10

# Show only processes using >50% CPU (keeping the header row)
ps aux | awk 'NR == 1 || $3 > 50'

# Sample CPU usage for a specific PID: 10 reports, 1 second apart
pidstat -p <PID> 1 10

# Show threads of a process sorted by CPU
top -H -p <PID>

Expected output:

USER      PID  %CPU %MEM     VSZ    RSS TTY     STAT START   TIME COMMAND
appuser  1234 98.3  2.5 1258292 262144 ?       R    10:00  45:23 node app.js
appuser  5678  2.1  0.9  460800  91136 ?       S    10:00   1:12 python worker.py
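
Because `ps` averages %CPU over a process's whole lifetime, a process that only recently started spinning can look harmless in the snapshot above. The following is a minimal sketch of computing usage over a short interval instead, from the utime and stime counters in /proc/<PID>/stat. The helper names `stat_ticks` and `cpu_pct` are hypothetical, and the sample line is made up for illustration.

```shell
#!/bin/bash
# Sketch: CPU usage of one process over an interval, from /proc/<PID>/stat.
# stat_ticks and cpu_pct are illustrative helper names, not standard tools.

# Print utime + stime (in clock ticks) from a /proc/<PID>/stat line on stdin.
# The command name in field 2 may contain spaces, so everything up to the
# last ") " is stripped first; utime and stime are then fields 12 and 13.
stat_ticks() {
    sed 's/^.*) //' | awk '{ print $12 + $13 }'
}

# cpu_pct <ticks_before> <ticks_after> <interval_seconds> <ticks_per_second>
# Prints the percentage of one core used during the interval.
cpu_pct() {
    awk -v a="$1" -v b="$2" -v dt="$3" -v hz="$4" \
        'BEGIN { printf "%.1f\n", (b - a) / hz / dt * 100 }'
}

# Made-up sample line with utime=1000 and stime=150
sample='1234 (node app) R 1 1234 1234 0 -1 4194304 500 0 0 0 1000 150 0 0 20 0 11 0 12345'
echo "$sample" | stat_ticks

# 200 ticks consumed in 2 seconds at 100 ticks per second
cpu_pct 1150 1350 2 100
```

On a live system you would read `/proc/<PID>/stat` twice with a `sleep` in between and pass `$(getconf CLK_TCK)` as the tick rate. In practice `pidstat` does the same job; the sketch only shows where the number comes from.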

Fix 2: Profile CPU Usage with Perf

# Sample call stacks system-wide at 99 Hz for 10 seconds
# (-a is required here; without it perf profiles only the sleep command)
sudo perf record -a -g -F 99 -- sleep 10

# Dump the recorded samples as text (input for flamegraph tools)
sudo perf script > perf.stacks

# Show the hottest kernel functions live (-U hides user-space symbols)
sudo perf top -U

# Record only a specific PID for 20 seconds
sudo perf record -g -F 99 -p <PID> -- sleep 20

# Show the call graph summary
sudo perf report --stdio -g

# Install the FlameGraph scripts
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

# Generate flamegraph SVG (perf.data was written to the parent directory)
sudo perf script -i ../perf.data | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > cpu-flamegraph.svg

Expected output:

+ 100.00%     0x0
+  99.50%     node
+  68.20%     LazyCompile:~processRequest /app/server.js:45
+  22.10%     LazyCompile:~parseData /app/parser.js:120
+   9.20%     Builtin:~Stringify
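
If you only need a quick ranking rather than the SVG, the folded file produced by `stackcollapse-perf.pl` can be summarized directly. Each line is a semicolon-separated stack followed by a sample count. The sketch below credits each count to the leaf frame (self time); `top_self_functions` is a hypothetical helper name and the input is made-up sample data.

```shell
#!/bin/bash
# top_self_functions [N]: read folded stacks ("frame1;frame2;... count") on
# stdin and print the N leaf frames with the most samples.
# Hypothetical helper name, shown as a sketch.
top_self_functions() {
    awk '{
        count = $NF                       # last field is the sample count
        stack = $0
        sub(/ [0-9]+$/, "", stack)        # drop the trailing count
        n = split(stack, frames, ";")
        self[frames[n]] += count          # credit samples to the leaf frame
    }
    END { for (f in self) printf "%d %s\n", self[f], f }' |
        sort -k1,1nr -k2 | head -n "${1:-3}"
}

# Made-up folded stacks for illustration
folded='node;main;processRequest 68
node;main;processRequest;parseData 19
node;main;processRequest;parseData 3
node;main;processRequest;Stringify 9'

echo "$folded" | top_self_functions 3
```

With real data you would run `top_self_functions 3 < out.perf-folded`. Self time is only one view; a function that is cheap itself but calls expensive children will rank low here and still be wide in the flamegraph.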

Fix 3: Analyze Thread Contention

// ThreadContention.java -- Simulate excessive locking
import java.util.concurrent.*;
import java.util.concurrent.locks.*;

public class ThreadContention {
    private static final ReentrantLock lock = new ReentrantLock(true);
    private static int sharedCounter = 0;

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(32);

        for (int i = 0; i < 50; i++) {
            executor.submit(() -> {
                for (int j = 0; j < 100000; j++) {
                    lock.lock();
                    try {
                        sharedCounter++;  // Heavy contention on this lock
                        Thread.sleep(1);  // Holding lock unnecessarily long
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        lock.unlock();
                    }
                }
                return null;
            });
        }
        executor.shutdown();
    }
}

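# Run the Java application; kill -3 writes the dump to the JVM's stdout, so capture it
javac ThreadContention.java
java ThreadContention > /tmp/threaddump.txt 2>&1 &
sleep 2  # give the JVM time to start before sending signals

# Take thread dump (3 samples, 2 seconds apart)
for i in 1 2 3; do
    kill -3 "$(pgrep -f 'java ThreadContention')"
    sleep 2
done

# Find threads waiting for a lock
# (ReentrantLock waiters show as WAITING (parking); BLOCKED appears only for synchronized monitors)
grep -B2 -A5 "parking to wait for" /tmp/threaddump.txt

# Alternatively, capture with jstack and count threads per state
jstack -l <PID> > threaddump.txt
grep "java.lang.Thread.State" threaddump.txt | sort | uniq -c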

Expected output (abbreviated; threads waiting on a ReentrantLock are parked rather than BLOCKED):

"pool-1-thread-12" #22 prio=5 os_prio=0 cpu=1234.56ms
   java.lang.Thread.State: WAITING (parking)
    - parking to wait for <0x0000000123456789> (a java.util.concurrent.locks.ReentrantLock$FairSync)
    ...
    at ThreadContention.lambda$main$0(ThreadContention.java:15)
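
To connect a hot thread from `top -H` with its stack in the dump, you match the thread ID against the `nid` field. Many JDK versions print `nid` in hexadecimal while `top` shows decimal, so a conversion is needed; if your JDK prints it in decimal, skip that step. The sketch below shows the conversion plus a per-state summary. Both helper names are hypothetical and the dump fragment is made up.

```shell
#!/bin/bash
# tid_to_nid <decimal_tid>: format a thread ID from `top -H` as nid=0x...,
# the form used in many JDK thread dumps. Hypothetical helper name.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}

# thread_state_counts: count threads per state in a thread dump on stdin.
thread_state_counts() {
    awk '/java.lang.Thread.State:/ { states[$2]++ }
         END { for (s in states) print s, states[s] }' | sort
}

# Made-up dump fragment for illustration
dump='"pool-1-thread-1" #11 prio=5
   java.lang.Thread.State: TIMED_WAITING (sleeping)
"pool-1-thread-2" #12 prio=5
   java.lang.Thread.State: WAITING (parking)
"pool-1-thread-3" #13 prio=5
   java.lang.Thread.State: WAITING (parking)'

tid_to_nid 4321
echo "$dump" | thread_state_counts
```

A typical use is `grep -A10 "$(tid_to_nid <TID>)" threaddump.txt`. Comparing state counts across the three samples shows whether the same threads stay parked, which points to contention rather than a brief pause.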

Fix 4: Diagnose Kernel Module CPU Usage

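# Check whether CPU time is spent in user or kernel space
top -p <PID>
# The process row shows total %CPU (e.g. 99.3); the header's
# us (user) vs sy (system) values show the system-wide split

# Profile kernel functions live (-U hides user-space symbols)
sudo perf top -U

# Record kernel-only samples for 5 seconds, then list the hottest modules and symbols
sudo perf record -a -e cycles:k -- sleep 5
sudo perf report --stdio --sort dso,sym | head -30

# Check network interrupt distribution (interface names vary: eth0, ens3, enp1s0, ...)
grep -E "CPU|eth|ens|enp" /proc/interrupts

# Confirm a suspect module is loaded and inspect its version
lsmod | grep <module>
modinfo <module>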

Expected output:

Samples: 1M of event 'cycles:k', 4000 Hz
Event count (approx.): 450000000000
Overhead  Shared Object     Symbol
  45.2%  [kernel]          [k] _raw_spin_lock
  12.3%  [kernel]          [k] do_syscall_64
   8.7%  [e1000]           [k] e1000_clean_rx_irq  # High driver CPU
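
The counters in /proc/interrupts are totals since boot, so a single reading says little about a storm happening now. A minimal sketch of comparing two snapshots follows; `irq_delta` is a hypothetical helper name, and the snapshot contents are made-up sample data.

```shell
#!/bin/bash
# irq_delta <before_file> <after_file>: print "<delta> <irq> <last column>"
# for each row of two /proc/interrupts snapshots, largest increase first.
# Hypothetical helper name, shown as a sketch.
irq_delta() {
    awk '
        FNR == 1 { ncpu = NF; next }             # header row: CPU0 CPU1 ...
        {
            total = 0                            # sum the per-CPU counters
            for (i = 2; i <= ncpu + 1 && i <= NF; i++) total += $i
            if (FILENAME == ARGV[1]) before[$1] = total
            else printf "%d %s %s\n", total - before[$1], $1, $NF
        }' "$1" "$2" | sort -k1,1nr
}

# Made-up snapshots for illustration
before_file=$(mktemp)
after_file=$(mktemp)
cat > "$before_file" <<'EOF'
           CPU0       CPU1
 24:       1000       2000   PCI-MSI  eth0
 25:        500        500   PCI-MSI  nvme0q1
EOF
cat > "$after_file" <<'EOF'
           CPU0       CPU1
 24:      51000       2000   PCI-MSI  eth0
 25:        600        500   PCI-MSI  nvme0q1
EOF

irq_delta "$before_file" "$after_file"
rm -f "$before_file" "$after_file"
```

On a live system you would save `/proc/interrupts` to a file, wait a few seconds, save it again, and compare. In the sample data all of the increase for IRQ 24 lands on CPU0, which is the kind of imbalance that spreading IRQs is meant to fix.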

Fix 5: Reduce I/O Wait with Process Prioritization

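# Check I/O wait percentage
iostat -x 1 5

# Find processes with high I/O
sudo iotop -o

# Renice a CPU-hungry process to lower priority
# Niceness range: -20 (highest priority) to 19 (lowest)
renice -n 10 -p <PID>

# Set I/O scheduling class to idle (class 3: served only when no other process needs the disk)
# Note: I/O classes are honored only by I/O schedulers that support them, such as BFQ
sudo ionice -c 3 -p <PID>

# Limit CPU usage with cpulimit (Debian/Ubuntu package shown)
sudo apt install cpulimit
cpulimit -p <PID> -l 50  # Limit to 50% of one core

# Use cgroups to set hard CPU limits (cgroup v1 paths shown)
sudo cgcreate -g cpu:/limited-app
echo 50000 | sudo tee /sys/fs/cgroup/cpu/limited-app/cpu.cfs_quota_us  # 50% of one core with the default 100000 us period
echo <PID> | sudo tee /sys/fs/cgroup/cpu/limited-app/tasks
# On cgroup v2 the equivalent setting is cpu.max, written as "<quota> <period>", e.g. "50000 100000"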

Expected output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.5     0.0    3.2    35.5     0.0     48.8   # High iowait
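
The cgroup quota is expressed in microseconds of CPU time per scheduling period, so "50% of one core" becomes 50000 with a 100000 us period. A small sketch of that arithmetic for other limits is shown here; `cfs_quota_us` and `cpu_max_line` are hypothetical helper names, and 100000 us is assumed as the period.

```shell
#!/bin/bash
# cfs_quota_us <cores> [period_us]: quota value for a limit expressed in cores
# (for example 0.5 or 2). Rounds to the nearest microsecond.
# Hypothetical helper name; the period defaults to 100000 us.
cfs_quota_us() {
    awk -v cores="$1" -v period="${2:-100000}" \
        'BEGIN { printf "%d\n", cores * period + 0.5 }'
}

# cpu_max_line <cores> [period_us]: the "<quota> <period>" string that
# cgroup v2 expects in cpu.max.
cpu_max_line() {
    local period="${2:-100000}"
    echo "$(cfs_quota_us "$1" "$period") $period"
}

cfs_quota_us 0.5     # half a core
cpu_max_line 1.5     # one and a half cores, cgroup v2 format
```

A quota larger than the period is valid and means the group may use more than one core's worth of time per period.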

High CPU Troubleshooting Flowchart

flowchart TD
    A[High CPU Alert] --> B{Which metric is high?}
    B -->|User CPU| C[Find top process with top/ps]
    C --> D[Profile with perf record]
    D --> E[Generate flamegraph]
    E --> F[Identify hot function in code]
    F --> G[Optimize algorithm or add caching]
    B -->|System CPU| H[Check kernel functions with perf top -U]
    H --> I[Look for driver or filesystem module]
    I --> J[Update driver, disable module, or tune kernel]
    B -->|I/O Wait| K[Find I/O-heavy process with iotop]
    K --> L[Move process to faster storage or use ionice]
    L --> M[Check disk health and RAID status]
    B -->|SoftIRQ| N[Check /proc/interrupts for flood]
    N --> O[Check network driver and NIC health]
    O --> P[Tune interrupt coalescing or spread IRQs]
    G --> Q[CPU Normalized]
    J --> Q
    M --> Q
    P --> Q
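
The first decision in the flowchart can be sketched as a small function that takes the us, sy, wa, and si percentages from the `top` header and names the branch to follow. `classify_cpu` and `next_step` are hypothetical helper names; picking the largest value is a simplification, since real incidents can involve more than one category.

```shell
#!/bin/bash
# classify_cpu <us> <sy> <wa> <si>: print which category dominates.
# Hypothetical helper name; ties go to the earlier category.
classify_cpu() {
    awk -v us="$1" -v sy="$2" -v wa="$3" -v si="$4" 'BEGIN {
        max = us + 0; branch = "user"
        if (sy + 0 > max) { max = sy + 0; branch = "system" }
        if (wa + 0 > max) { max = wa + 0; branch = "iowait" }
        if (si + 0 > max) { max = si + 0; branch = "softirq" }
        print branch
    }'
}

# next_step <branch>: first action on that branch of the flowchart.
next_step() {
    case "$1" in
        user)    echo "find the top process with top/ps" ;;
        system)  echo "check kernel functions with perf top -U" ;;
        iowait)  echo "find the I/O-heavy process with iotop" ;;
        softirq) echo "check /proc/interrupts for a flood" ;;
        *)       echo "unknown branch" ;;
    esac
}

branch=$(classify_cpu 12.5 3.2 35.5 0.0)
echo "$branch: $(next_step "$branch")"
```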

Prevention Tips

Set CPU limits on all containers and VMs with cgroups or Docker --cpus flags
Profile applications under load during development with perf and address hotspots before production
Monitor CPU usage trends with Prometheus and set alerts on sustained >80% utilization
Use nice and ionice for background batch jobs so they do not starve interactive processes
Spread hardware interrupts across CPU cores with irqbalance or by setting /proc/irq/<IRQ>/smp_affinity manually
Implement backpressure mechanisms in queue consumers to prevent runaway processing
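
The "sustained" part of the alerting tip matters: a single spike above 80% is normal, while several consecutive samples above it are worth a page. In Prometheus this is what an alert rule's duration expresses. The sketch below shows the same idea over a plain list of samples; `sustained_above` is a hypothetical helper name and the sample values are made up.

```shell
#!/bin/bash
# sustained_above <threshold> <n>: read one utilization sample per line on
# stdin; exit 0 if at least n consecutive samples exceed the threshold,
# exit 1 otherwise. Hypothetical helper name, shown as a sketch.
sustained_above() {
    awk -v limit="$1" -v need="$2" '
        {
            run = ($1 + 0 > limit + 0) ? run + 1 : 0   # length of current streak
            if (run >= need + 0) { found = 1; exit }
        }
        END { exit found ? 0 : 1 }'
}

# Made-up samples: three consecutive readings above 80
if printf '%s\n' 82 91 88 75 | sustained_above 80 3; then
    echo "ALERT: sustained high CPU"
else
    echo "OK"
fi
```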

Practice Questions

What is the difference between top and perf top for debugging CPU usage? Answer: top shows CPU usage at the process level -- which process is consuming CPU cycles. perf top shows CPU usage at the function level -- which specific function or kernel symbol is burning CPU. Use top to find the culprit process, then perf to drill into what that process is doing.
How do you determine whether high CPU is caused by user code, kernel code, or interrupts? Answer: Run top and check the header line -- us (user), sy (system), si (softirq), hi (hardirq), wa (iowait). User code shows as us, kernel module or system calls show as sy, network/disk interrupts show as si/hi, and disk-bound processes show as wa.
What is the relationship between high CPU usage and I/O wait, and how do you distinguish them? Answer: High user/system CPU means the CPU is actively executing code (compute-bound). High iowait means the CPU is idle while at least one task is blocked on outstanding block I/O, typically disk; time spent waiting on ordinary network sockets is counted as idle, not iowait. A compute-heavy process can also cause other processes to experience iowait if it generates more I/O than the filesystem or storage can keep up with.
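
The categories in the `top` header come from the counters on the first line of /proc/stat. A minimal sketch of deriving the percentages from two samples of that line follows; `cpu_breakdown` is a hypothetical helper name and the two sample lines are made up.

```shell
#!/bin/bash
# cpu_breakdown: read two "cpu ..." lines from /proc/stat on stdin (an earlier
# and a later sample) and print the share of time spent in each category.
# Columns after "cpu": user nice system idle iowait irq softirq steal.
# Hypothetical helper name, shown as a sketch.
cpu_breakdown() {
    awk '/^cpu / {
             n++
             for (i = 2; i <= 9; i++) v[n, i] = $i
         }
         END {
             for (i = 2; i <= 9; i++) { d[i] = v[2, i] - v[1, i]; total += d[i] }
             if (n < 2 || total <= 0) { print "need two samples"; exit 1 }
             printf "us=%.1f ni=%.1f sy=%.1f id=%.1f wa=%.1f hi=%.1f si=%.1f st=%.1f\n",
                 100 * d[2] / total, 100 * d[3] / total, 100 * d[4] / total,
                 100 * d[5] / total, 100 * d[6] / total, 100 * d[7] / total,
                 100 * d[8] / total, 100 * d[9] / total
         }'
}

# Made-up samples for illustration
samples='cpu 1000 0 500 8000 400 0 100 0 0 0
cpu 1125 0 532 8488 755 0 100 0 0 0'

echo "$samples" | cpu_breakdown
```

On a live system the input would be `{ grep '^cpu ' /proc/stat; sleep 1; grep '^cpu ' /proc/stat; } | cpu_breakdown`.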

Challenge: Write a bash script that takes a PID, samples CPU usage with perf for 30 seconds, generates a flamegraph, and identifies the top 3 functions consuming CPU. Answer:

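#!/bin/bash
# Assumes the FlameGraph scripts are installed in /opt/FlameGraph
set -eu
PID=${1:?"Usage: $0 <PID> [duration_seconds]"}
DURATION=${2:-30}

# perf options must come before "--"; everything after it is the command perf runs
sudo perf record -g -F 99 -p "$PID" -o /tmp/perf.data -- sleep "$DURATION"
sudo perf script -i /tmp/perf.data | /opt/FlameGraph/stackcollapse-perf.pl > /tmp/out.folded
/opt/FlameGraph/flamegraph.pl /tmp/out.folded > "/tmp/cpu-${PID}.svg"

# Top 3 functions by self time (call graphs, comment lines, and blank lines omitted)
sudo perf report -i /tmp/perf.data --stdio --no-children --sort sym -g none \
    | grep -v -e '^#' -e '^$' | head -3
echo "Flamegraph: /tmp/cpu-${PID}.svg"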

Quick Reference

Issue	Diagnostic	Resolution
Runaway process	`ps aux --sort=-%cpu`	Kill or fix infinite loop
Thread contention	`kill -3 <PID>` (thread dump)	Reduce lock scope, use lock-free structures
Kernel module high CPU	`perf top -U`	Update driver, disable module
I/O wait	`iostat -x 1`	Move load, ionice, upgrade storage
Interrupt storm	`cat /proc/interrupts`	Spread IRQs, update NIC driver

FAQ

What is a "runaway process" and how do you safely stop it?

A runaway process is a program that enters an infinite loop or busy-wait, consuming 100% of a CPU core indefinitely. First try kill -TERM <PID> for a graceful shutdown. If it ignores SIGTERM, use kill -KILL <PID> (cannot be caught). Use renice -n 19 -p <PID> to reduce impact first if you need to investigate before killing.
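
The escalation described above can be sketched as a function that sends SIGTERM, waits for a grace period, and only then falls back to SIGKILL. `is_running` and `stop_gracefully` are hypothetical helper names, and the 10-second default is an arbitrary assumption. The sketch assumes you have permission to signal the process.

```shell
#!/bin/bash
# is_running <pid>: true if the process exists and is not a zombie.
# Hypothetical helper name. The state letter is the first field after
# "pid (comm) " in /proc/<pid>/stat.
is_running() {
    kill -0 "$1" 2>/dev/null || return 1
    local state
    state=$(sed 's/^.*) //' "/proc/$1/stat" 2>/dev/null | cut -c1)
    [ "$state" != "Z" ]
}

# stop_gracefully <pid> [timeout_seconds]: SIGTERM first, SIGKILL on timeout.
stop_gracefully() {
    local pid=$1 timeout=${2:-10} waited=0
    is_running "$pid" || { echo "not running"; return 0; }
    kill -TERM "$pid" 2>/dev/null
    # Poll once per second until the process exits or the grace period ends
    while is_running "$pid" && [ "$waited" -lt "$timeout" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if is_running "$pid"; then
        kill -KILL "$pid" 2>/dev/null
        echo "killed with SIGKILL"
    else
        echo "terminated with SIGTERM"
    fi
}

# Usage: stop_gracefully <PID> 15
```

Giving the process time to handle SIGTERM lets it flush buffers and release locks, which SIGKILL never allows.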

What causes high `si` (softirq) CPU usage and how do you reduce it?

High softirq CPU is typically caused by a flood of network packets or disk interrupts that the kernel cannot process fast enough. Solutions: (1) enable interrupt coalescing with ethtool -C eth0 rx-usecs 100, (2) spread interrupts across cores using irqbalance or manual /proc/irq/ affinity, (3) upgrade the NIC driver, (4) reduce network packet rates with rate limiting or better application batching.
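
For manual affinity, the value written to /proc/irq/<IRQ>/smp_affinity is a hexadecimal bitmask with one bit per CPU. A minimal sketch of building that mask from a list of CPU numbers is shown below; `cpu_mask` is a hypothetical helper name, and the sketch only covers CPU numbers that fit in one shell integer (large systems use comma-separated 32-bit groups, which it does not handle).

```shell
#!/bin/bash
# cpu_mask <cpu> [<cpu> ...]: print the hex bitmask with one bit set per
# listed CPU number. Hypothetical helper name, shown as a sketch.
cpu_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpu_mask 0 2        # CPUs 0 and 2
cpu_mask 4 5 6 7    # CPUs 4 through 7
```

You would then write the result with something like `cpu_mask 2 3 | sudo tee /proc/irq/<IRQ>/smp_affinity`. If irqbalance is running it may later overwrite manual settings.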

How do you distinguish between a CPU performance issue and an application-level bottleneck?

If CPU is at 100% and requests per second are below expected throughput, it is a CPU performance issue (inefficient code, missing indexes, no caching). If CPU is below 100% but throughput is below expected, the bottleneck is elsewhere -- I/O, network, external API latency, or lock contention. Use the USE method (Utilization, Saturation, Errors) for every resource to find the actual bottleneck.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
