
High CPU Usage Troubleshooting -- Process Analysis, Thread Dumps & Performance Tuning

DodaTech Updated 2026-06-23 9 min read

High CPU usage slows applications, increases cloud costs, and can trigger unexpected autoscaling events. This guide shows how to identify the specific process, thread, and code path consuming CPU on Linux and how to apply targeted fixes.

What You'll Learn

Why It Matters

A runaway process consuming 100% of a single core can go unnoticed until it affects other services. At cloud rates, one extra vCPU running 24/7 can cost $700 or more per year, depending on provider and instance type. Knowing how to pinpoint exactly which function is burning CPU is essential for any systems engineer.

Real-World Use

These techniques isolate the root cause when your web server's CPU is at 100% while request traffic is normal, when a background job consumes all cores during peak hours, or when a system process uses 50% CPU for no apparent reason.

Common High CPU Issues Table

| Issue | Symptom | Cause | Diagnostic Tool |
| --- | --- | --- | --- |
| Runaway process | Single process at 100% CPU | Infinite loop, busy wait | top, htop, ps |
| Thread contention | High CPU across many threads | Excessive locking, context switching | pidstat, perf top |
| Kernel module leak | High system CPU time | Device driver or filesystem bug | dmesg, perf top -U |
| I/O wait | High iowait CPU | Disk bottleneck causing wait | iostat, iotop |
| Interrupt storm | High si (softirq) CPU | Network or hardware interrupt flood | /proc/interrupts, mpstat |
| Memory pressure | High kswapd CPU usage | Insufficient RAM, constant swapping | vmstat 1, sar -S |

Step-by-Step Fixes

Fix 1: Find the CPU-Hungry Process

# Real-time process viewer (sorted by CPU)
top -o %CPU

# One-shot snapshot of top CPU consumers
ps aux --sort=-%cpu | head -10

# Show only processes using >50% CPU
ps aux | awk '$3 > 50 {print $0}'

# Sample CPU usage for a specific PID (10 samples, 1 second apart)
pidstat -p <PID> 1 10

# Show threads of a process sorted by CPU
top -H -p <PID>

Expected output:

USER       PID %CPU %MEM     VSZ    RSS TTY      STAT START   TIME COMMAND
appuser   1234 98.3  2.5 1258292 262144 ?        R    10:00  45:23 node app.js
appuser   5678  2.1  0.5  460800  91136 ?        S    10:00   1:12 python worker.py
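
The awk filter shown above can be wrapped in a small function with a configurable threshold. This is a minimal sketch: the name cpu_hogs is hypothetical, and it assumes the `ps aux` column layout in which %CPU is the third field.

```shell
# cpu_hogs THRESHOLD: read `ps aux` output on stdin and print the header line
# plus every process whose %CPU (third column) exceeds THRESHOLD.
# The function name is illustrative, not part of any standard tool.
cpu_hogs() {
    awk -v limit="${1:-50}" 'NR == 1 || $3 + 0 > limit + 0'
}

# Typical use:
#   ps aux | cpu_hogs 50
```

Keeping the header row makes the filtered output easier to read when it is pasted into an incident ticket.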

Fix 2: Profile CPU Usage with Perf

# Record system-wide CPU samples for 10 seconds
# (-a covers all CPUs; without it perf profiles only the sleep command)
sudo perf record -a -g -F 99 -- sleep 10

# Generate a flamegraph-compatible report
sudo perf script > perf.stacks

# Show the hottest functions in the kernel
sudo perf top -U

# Record only a specific PID
sudo perf record -g -F 99 -p <PID> -- sleep 20

# Show the call graph summary
sudo perf report --stdio -g

# Install and use flamegraph scripts
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

# Generate flamegraph SVG (perf.data was recorded in the parent directory)
sudo perf script -i ../perf.data | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > cpu-flamegraph.svg

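Expected output (illustrative; JavaScript frames such as LazyCompile resolve only when Node.js is started with --perf-basic-prof):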

+ 100.00%     0x0
+  99.50%     node
+  68.20%     LazyCompile:~processRequest /app/server.js:45
+  22.10%     LazyCompile:~parseData /app/parser.js:120
+   9.20%     Builtin:~Stringify
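
The folded-stack file produced by stackcollapse-perf.pl is plain text, one stack per line followed by a sample count, so the hottest functions can also be listed without opening the SVG. The sketch below is one way to do that; the function name top_leaf_frames is hypothetical, and it counts only the leaf frame of each stack (the function that was executing when the sample was taken).

```shell
# top_leaf_frames N: read folded stacks ("frame1;frame2;leaf COUNT") on stdin
# and print the N leaf frames with the most samples, highest first.
# The function name is illustrative.
top_leaf_frames() {
    awk '{
        count = $NF                       # sample count is the last field
        stack = $0
        sub(/ [0-9]+$/, "", stack)        # drop the trailing count
        n = split(stack, frames, ";")
        samples[frames[n]] += count       # attribute samples to the leaf frame
    }
    END { for (f in samples) printf "%d %s\n", samples[f], f }' |
        sort -k1,1nr | head -n "${1:-3}"
}

# Typical use:
#   top_leaf_frames 3 < out.perf-folded
```

Leaf counts show self time only. A function that is expensive because of what it calls appears as a wide plateau in the flamegraph rather than at the top of this list.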

Fix 3: Analyze Thread Contention

// ThreadContention.java -- Simulate excessive locking
import java.util.concurrent.*;
import java.util.concurrent.locks.*;

public class ThreadContention {
    private static final ReentrantLock lock = new ReentrantLock(true);
    private static int sharedCounter = 0;

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(32);

        for (int i = 0; i < 50; i++) {
            executor.submit(() -> {
                for (int j = 0; j < 100000; j++) {
                    lock.lock();
                    try {
                        sharedCounter++;  // Heavy contention on this lock
                        Thread.sleep(1);  // Holding lock unnecessarily long
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        lock.unlock();
                    }
                }
                return null;
            });
        }
        executor.shutdown();
    }
}
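
# Compile and run. kill -3 (SIGQUIT) makes the JVM print thread dumps to its
# own stdout, so redirect that stream to a file
javac ThreadContention.java
java ThreadContention > /tmp/threaddump.txt 2>&1 &

# Take thread dumps (3 samples, 2 seconds apart)
for i in 1 2 3; do
    kill -3 "$(pgrep -f ThreadContention)"
    sleep 2
done

# Threads queued on a ReentrantLock are parked (WAITING), not BLOCKED;
# BLOCKED appears only for synchronized monitors, so search for both
grep -B3 -A5 -E "BLOCKED|parking to wait for .*ReentrantLock" /tmp/threaddump.txt

# Or capture a dump with jstack and count the waiting threads
jstack -l <PID> > threaddump.txt
grep -c -E "BLOCKED|parking to wait for .*ReentrantLock" threaddump.txt

Expected output (abridged; frame names vary by JDK version):

"pool-1-thread-12" #22 prio=5 os_prio=0 cpu=1234.56ms
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x0000000123456789> (a java.util.concurrent.locks.ReentrantLock$FairSync)
    ...
    at ThreadContention.lambda$main$0(ThreadContention.java:15)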
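
Two small helpers make thread dumps easier to read. This is a sketch, and both function names are hypothetical. The first counts threads per state, which shows at a glance whether most threads are running or waiting. The second converts a decimal thread ID from `top -H` into the hexadecimal nid=0x... form that JDK 8 and 11 dumps print for each thread. Check the format your JDK emits first, since some newer releases print the native ID in decimal.

```shell
# thread_state_counts [FILE]: count threads per java.lang.Thread.State in a
# thread dump (reads stdin when no file is given), most common state first.
thread_state_counts() {
    awk -F'java.lang.Thread.State: ' 'NF > 1 { n[$2]++ }
        END { for (s in n) printf "%d %s\n", n[s], s }' "$@" | sort -k1,1nr
}

# tid_to_nid TID: convert a decimal thread ID from `top -H` to the hex
# "nid=0x..." token used in dumps that print native IDs in hexadecimal.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}

# Typical use:
#   thread_state_counts threaddump.txt
#   grep "$(tid_to_nid 4660)" threaddump.txt
```

Matching the hottest thread in `top -H` to its stack in the dump is usually the fastest way to tell a thread that burns CPU from one that is merely waiting.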

Fix 4: Diagnose Kernel Module CPU Usage

# Check if high CPU is in kernel or user space
top -p <PID>
# The per-process %CPU (e.g. 99.3) does not say where the time goes;
# compare us (user) with sy (system) in the top header

# Profile kernel functions only (-U hides user-space symbols)
sudo perf top -U

# Attribute kernel-mode cycles to kernel modules
sudo perf record -a -e cycles:k -- sleep 10
sudo perf report --stdio --sort dso | head -20

# Check network interrupt distribution
grep -E "CPU|eth" /proc/interrupts

# Check whether a module is loaded and its state (Live, Loading, Unloading)
lsmod | grep <module>
grep <module> /proc/modules

Expected output:

Samples: 1M of event 'cycles:k', 4000 Hz
Event count (approx.): 450000000000
Overhead  Shared Object     Symbol
  45.2%  [kernel]          [k] _raw_spin_lock
  12.3%  [kernel]          [k] do_syscall_64
   8.7%  [e1000]           [k] e1000_clean_rx_irq  # High driver CPU
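
The us versus sy split that top prints in its header comes from the counters in /proc/stat, so it can also be computed for any interval from a script. The sketch below assumes the field order documented in proc(5) for the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal); the function name cpu_breakdown is hypothetical.

```shell
# cpu_breakdown BEFORE AFTER: given two aggregate "cpu ..." lines sampled from
# /proc/stat, print the percentage of time spent in each state between them.
# "us" here is user + nice; guest time is ignored. The name is illustrative.
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) before[i] = $i; next }
        {
            total = 0
            for (i = 2; i <= 9; i++) { delta[i] = $i - before[i]; total += delta[i] }
            if (total == 0) { print "no CPU time elapsed between samples"; exit 1 }
            printf "us=%.1f sy=%.1f id=%.1f wa=%.1f hi=%.1f si=%.1f\n",
                100 * (delta[2] + delta[3]) / total, 100 * delta[4] / total,
                100 * delta[5] / total, 100 * delta[6] / total,
                100 * delta[7] / total, 100 * delta[8] / total
        }'
}

# Typical use: sample twice, five seconds apart.
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   cpu_breakdown "$a" "$b"
```

A high sy or si share over the interval points toward kernel-side profiling, while a high us share points back to the application.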

Fix 5: Reduce I/O Wait with Process Prioritization

# Check I/O wait percentage
iostat -x 1 5

# Find processes with high I/O
sudo iotop -o

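# Renice a CPU-hungry process to lower priority
# Niceness range: -20 (highest priority) to 19 (lowest)
renice -n 10 -p <PID>

# Set I/O scheduling class to idle (class 3): the process gets disk time
# only when no other process needs it
sudo ionice -c 3 -p <PID>

# Limit CPU usage with cpulimit (install command shown for Debian/Ubuntu)
sudo apt install cpulimit
cpulimit -p <PID> -l 50  # Limit to 50% of one core

# Use cgroups to set hard CPU limits (cgroup v1 layout; cgcreate is part of the libcgroup tools)
sudo cgcreate -g cpu:/limited-app
echo 50000 | sudo tee /sys/fs/cgroup/cpu/limited-app/cpu.cfs_quota_us  # 50% of one core at the default 100000 us period
echo <PID> | sudo tee /sys/fs/cgroup/cpu/limited-app/tasks

# On cgroup v2 (unified hierarchy) the equivalent files are cpu.max and
# cgroup.procs, provided the cpu controller is enabled for the parent group
sudo mkdir /sys/fs/cgroup/limited-app
echo "50000 100000" | sudo tee /sys/fs/cgroup/limited-app/cpu.max
echo <PID> | sudo tee /sys/fs/cgroup/limited-app/cgroup.procs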

Expected output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.5     0.0    3.2    35.5     0.0     48.8   # High iowait
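
The quota written to cpu.cfs_quota_us is a share of the scheduling period, so 50000 means 50% of one core only when the period is the default 100000 microseconds. The sketch below shows the arithmetic; both function names are hypothetical.

```shell
# cfs_quota_us PERCENT [PERIOD_US]: CFS quota in microseconds that caps a
# cgroup at PERCENT of one core (values above 100 span several cores).
# The default period of 100000 us matches the kernel default.
cfs_quota_us() {
    echo $(( $1 * ${2:-100000} / 100 ))
}

# cpu_max_line PERCENT [PERIOD_US]: the same limit in the "QUOTA PERIOD"
# format used by the cgroup v2 cpu.max file.
cpu_max_line() {
    printf '%s %s\n' "$(cfs_quota_us "$1" "${2:-100000}")" "${2:-100000}"
}

# Typical use:
#   cfs_quota_us 50        # prints 50000
#   cpu_max_line 150       # prints 150000 100000
```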

High CPU Troubleshooting Flowchart

flowchart TD
    A[High CPU Alert] --> B{Which metric is high?}
    B -->|User CPU| C[Find top process with top/ps]
    C --> D[Profile with perf record]
    D --> E[Generate flamegraph]
    E --> F[Identify hot function in code]
    F --> G[Optimize algorithm or add caching]
    B -->|System CPU| H[Check kernel functions with perf top -U]
    H --> I[Look for driver or filesystem module]
    I --> J[Update driver, disable module, or tune kernel]
    B -->|I/O Wait| K[Find I/O-heavy process with iotop]
    K --> L[Move data to faster storage or lower I/O priority with ionice]
    L --> M[Check disk health and RAID status]
    B -->|SoftIRQ| N[Check /proc/interrupts for flood]
    N --> O[Check network driver and NIC health]
    O --> P[Tune interrupt coalescing or spread IRQs]
    G --> Q[CPU Normalized]
    J --> Q
    M --> Q
    P --> Q
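
The first branch of the flowchart can be expressed as a small function that takes the four percentages from the top header and names the next step. This is a rough sketch: the function name next_step is hypothetical, and picking the largest value is a simplification that ignores what is normal for a given host.

```shell
# next_step US SY WA SI: pick the largest of the four percentages and print
# the matching first diagnostic step from the flowchart. Name is illustrative.
next_step() {
    awk -v us="$1" -v sy="$2" -v wa="$3" -v si="$4" 'BEGIN {
        top = "us"; max = us + 0
        if (sy + 0 > max) { top = "sy"; max = sy + 0 }
        if (wa + 0 > max) { top = "wa"; max = wa + 0 }
        if (si + 0 > max) { top = "si"; max = si + 0 }
        if (top == "us") print "user CPU: find the top process with top/ps, then perf record"
        if (top == "sy") print "system CPU: inspect kernel functions with perf top -U"
        if (top == "wa") print "I/O wait: find the I/O-heavy process with iotop"
        if (top == "si") print "softirq: check /proc/interrupts for a flood"
    }'
}

# Typical use, with values read from the top header:
#   next_step 12.5 3.2 35.5 0.0
```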

Prevention Tips

  • Set CPU limits on all containers and VMs with cgroups or Docker --cpus flags
  • Profile applications under load during development with perf and address hotspots before production
  • Monitor CPU usage trends with Prometheus and set alerts on sustained >80% utilization
  • Use nice and ionice for background batch jobs so they do not starve interactive processes
  • Spread hardware interrupts across CPU cores with irqbalance or by writing a CPU mask to /proc/irq/<N>/smp_affinity
  • Implement backpressure mechanisms in queue consumers to prevent runaway processing
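
For manual IRQ affinity, the kernel expects a hexadecimal CPU bitmask in which CPU 0 is the lowest bit. The sketch below builds that mask from a list of CPU indices. The function name cpu_mask is hypothetical, and the sketch assumes CPU indices below 32, since larger systems use comma-separated 32-bit groups.

```shell
# cpu_mask CPU...: print a hex bitmask with one bit set per listed CPU index,
# in the format accepted by /proc/irq/<N>/smp_affinity (CPU 0 is bit 0).
# Assumes indices below 32. The body runs in a subshell to keep variables local.
cpu_mask() (
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
)

# Typical use: pin an IRQ to CPUs 2 and 3 (mask c).
#   cpu_mask 2 3 | sudo tee /proc/irq/<N>/smp_affinity
```

If irqbalance is running it may rewrite a manual mask, so stop it or exclude the IRQ before pinning by hand.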

Practice Questions

  1. What is the difference between top and perf top for debugging CPU usage? Answer: top shows CPU usage at the process level -- which process is consuming CPU cycles. perf top shows CPU usage at the function level -- which specific function or kernel symbol is burning CPU. Use top to find the culprit process, then perf to drill into what that process is doing.

  2. How do you determine whether high CPU is caused by user code, kernel code, or interrupts? Answer: Run top and check the header line -- us (user), sy (system), si (softirq), hi (hardirq), wa (iowait). User code shows as us, kernel module or system calls show as sy, network/disk interrupts show as si/hi, and disk-bound processes show as wa.

  3. What is the relationship between high CPU usage and I/O wait, and how do you distinguish them? Answer: High user/system CPU means the CPU is actively executing code (compute-bound). High iowait means the CPU is idle while tasks are blocked waiting for block I/O to complete, whether on local disk or on network-backed storage such as NFS. An I/O-heavy process can also cause other processes to experience iowait if the filesystem or storage cannot keep up.

  4. Challenge: Write a bash script that takes a PID, samples CPU usage with perf for 30 seconds, generates a flamegraph, and identifies the top 3 functions consuming CPU. Answer:

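    #!/bin/bash
    PID=${1:?"Usage: $0 <PID> [duration_seconds]"}
    DURATION=${2:-30}
    FLAMEGRAPH_DIR=${FLAMEGRAPH_DIR:-/opt/FlameGraph}  # location of the FlameGraph checkout

    # perf options must come before "--"; the command after it bounds the sampling time
    sudo perf record -g -F 99 -p "$PID" -o /tmp/perf.data -- sleep "$DURATION"
    sudo perf script -i /tmp/perf.data | "$FLAMEGRAPH_DIR/stackcollapse-perf.pl" > /tmp/out.folded
    "$FLAMEGRAPH_DIR/flamegraph.pl" /tmp/out.folded > "/tmp/cpu-${PID}.svg"

    # Top 3 functions by self time (call chains suppressed)
    sudo perf report -i /tmp/perf.data --stdio --no-children -g none --sort symbol \
        | grep -vE '^(#|$)' | head -3
    echo "Flamegraph: /tmp/cpu-${PID}.svg"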
    

Quick Reference

| Issue | Diagnostic | Resolution |
| --- | --- | --- |
| Runaway process | ps aux --sort=-%cpu | Kill or fix infinite loop |
| Thread contention | kill -3 <PID> (thread dump) | Reduce lock scope, use lock-free structures |
| Kernel module high CPU | perf top -U | Update driver, disable module |
| I/O wait | iostat -x 1 | Move load, ionice, upgrade storage |
| Interrupt storm | cat /proc/interrupts | Spread IRQs, update NIC driver |

FAQ

What is a "runaway process" and how do you safely stop it?

A runaway process is a program that enters an infinite loop or busy-wait, consuming 100% of a CPU core indefinitely. First try kill -TERM <PID> for a graceful shutdown. If it ignores SIGTERM, use kill -KILL <PID> (cannot be caught). Use renice -n 19 -p <PID> to reduce impact first if you need to investigate before killing.
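
The TERM-then-KILL sequence described above can be sketched as a function with a grace period. The name stop_gracefully and the 10-second default are assumptions made for illustration.

```shell
# stop_gracefully PID [TIMEOUT]: send SIGTERM, wait up to TIMEOUT seconds for
# the process to exit, then fall back to SIGKILL. The name is illustrative.
# The body runs in a subshell so its variables stay local.
stop_gracefully() (
    pid=$1
    timeout=${2:-10}
    kill -TERM "$pid" 2>/dev/null || exit 0   # already gone, or not ours to signal
    waited=0
    while kill -0 "$pid" 2>/dev/null; do      # kill -0 only tests for existence
        if [ "$waited" -ge "$timeout" ]; then
            echo "PID $pid still alive after ${timeout}s, sending SIGKILL" >&2
            kill -KILL "$pid" 2>/dev/null
            exit 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
)

# Typical use:
#   stop_gracefully 1234 15
```

The grace period gives the process time to flush buffers and release locks, which SIGKILL never allows.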

What causes high si (softirq) CPU usage and how do you reduce it?

High softirq CPU is typically caused by a flood of network packets or disk interrupts that the kernel cannot process fast enough. Solutions: (1) enable interrupt coalescing with ethtool -C eth0 rx-usecs 100, (2) spread interrupts across cores using irqbalance or manual /proc/irq/ affinity, (3) upgrade the NIC driver, (4) reduce network packet rates with rate limiting or better application batching.
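
To see which interrupt source is busiest, the per-CPU counters in /proc/interrupts can be summed per IRQ. This is a minimal sketch with a hypothetical function name (irq_totals). It assumes the usual layout of a header row with one column per CPU, followed by one row per IRQ.

```shell
# irq_totals [FILE]: sum each IRQ's per-CPU counters from /proc/interrupts
# (or a saved copy; "-" reads stdin) and list the busiest sources first.
# Output columns: total, IRQ label, description. The name is illustrative.
irq_totals() {
    awk 'NR == 1 { ncpu = NF; next }          # header row: one column per CPU
    {
        total = 0
        for (i = 2; i <= ncpu + 1 && i <= NF; i++)
            if ($i ~ /^[0-9]+$/) total += $i
        desc = ""
        for (i = ncpu + 2; i <= NF; i++) desc = desc (desc == "" ? "" : " ") $i
        printf "%d %s %s\n", total, $1, desc
    }' "${1:-/proc/interrupts}" | sort -k1,1nr
}

# Typical use:
#   irq_totals | head -5
```

The counters are cumulative since boot, so a storm shows up as a large difference between two snapshots taken a few seconds apart rather than as a large absolute value.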

How do you distinguish between a CPU performance issue and an application-level bottleneck?

If CPU is at 100% and requests per second are below expected throughput, it is a CPU performance issue (inefficient code, missing indexes, no caching). If CPU is below 100% but throughput is below expected, the bottleneck is elsewhere -- I/O, network, external API latency, or lock contention. Use the USE method (Utilization, Saturation, Errors) for every resource to find the actual bottleneck.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
