
Linux Troubleshooting Guide -- System Boot, Kernel Panic & Service Failure

DodaTech Updated 2026-06-23 7 min read

Linux system failures like kernel panics, boot hangs, and crashed services can take down production servers in seconds -- this guide shows you how to diagnose and recover from each using built-in recovery tools and systematic debugging.

What You'll Learn

Why It Matters

A server that fails to boot or crashes with a kernel panic means downtime. Knowing how to use Linux recovery tools like single-user mode, journalctl, and rescue shells lets you bring systems back online without reinstalling.

Real-World Use

When your production web server hangs on boot after a kernel update, a service keeps crashing with "Failed to start", or filesystem corruption prevents SSH access, these recovery procedures get you back online.

Common Linux System Issues Table

| Issue | Symptom | Cause | Recovery Method |
|-------|---------|-------|-----------------|
| Boot hang after kernel update | System stops at purple screen or GRUB menu | Faulty kernel module or boot parameter | Boot into previous kernel from GRUB |
| Kernel panic | "Kernel panic - not syncing" on console | Hardware fault, corrupt kernel, or bad module | Boot with nomodeset (for graphics-driver faults) or a rescue kernel |
| Service fails to start | "Failed to start service: Unit not found" | Missing dependency, bad config, or permission problem | Check journalctl logs and systemctl status |
| Filesystem corruption | "Input/output error" or "Structure needs cleaning" | Unexpected power loss or disk failure | Run fsck from a live USB |
| Out of memory (OOM) | Process killed by kernel OOM killer | Application memory leak or insufficient RAM | Check dmesg for OOM entries and adjust limits |
| GRUB rescue shell | "grub rescue>" prompt on boot | Corrupt or missing bootloader config | Reinstall GRUB from a live environment |

Step-by-Step Fixes

Fix 1: Boot Failure After Kernel Update

# At GRUB menu, select "Advanced options for Ubuntu"
# Choose the previous kernel version to boot into

# Once booted, remove the bad kernel (replace XX with its build number)
sudo apt remove linux-image-5.15.0-XX-generic

# Regenerate GRUB config
sudo update-grub

# Prevent the kernel from being reinstalled accidentally
sudo apt-mark hold linux-image-5.15.0-XX-generic

Expected output:

Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.15.0-XX-generic
Found initrd image: /boot/initrd.img-5.15.0-XX-generic
done
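
Before removing a kernel, it helps to confirm which installed version you would fall back to. The following is a minimal sketch, assuming kernel images follow the /boot/vmlinuz-<version> naming and that GNU sort -V is available; previous_kernel is a hypothetical helper name, not a standard tool.

```shell
#!/bin/sh
# previous_kernel CURRENT
# Reads kernel versions on stdin (one per line) and prints the newest
# version that sorts below CURRENT. Prints nothing if there is none.
# Hypothetical helper for illustration; relies on GNU `sort -V`.
previous_kernel() {
    current=$1
    # Add CURRENT to the list so the awk scan always finds a stopping point.
    { cat; printf '%s\n' "$current"; } |
        sort -V -u |
        awk -v cur="$current" '
            $0 == cur { exit }          # stop at the current version
            { prev = $0 }               # remember the last lower version
            END { if (prev != "") print prev }
        '
}
```

A possible invocation on a running system: ls /boot/vmlinuz-* | sed 's|.*/vmlinuz-||' | previous_kernel "$(uname -r)". If it prints nothing, no older kernel is installed, and removing the current one would leave no fallback.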

Fix 2: Kernel Panic Recovery

# Boot with the "nomodeset" kernel parameter
# At GRUB, press 'e' on the boot entry
# Find the line starting with "linux" and add: nomodeset

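# Once booted, check the kernel log for errors
# (dmesg covers only the current boot, so an earlier panic will not appear here)
sudo dmesg | grep -i panic

# Check the previous boot's kernel log for hardware errors (requires a persistent journal)
sudo journalctl -k -b -1 --no-pager | grep -i "error\|fail\|panic"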

# Test memory for faults
sudo memtester 1024 5

Example panic message (on the console or in the previous boot's log):

[    0.123456] Kernel panic - not syncing: Fatal exception
[    0.123457] CPU: 1 PID: 123 Comm: swapper/0 Tainted: G
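
When a log contains several panics, the reason text after "not syncing:" is usually the most useful part. A small sketch that extracts it from any kernel log stream; panic_reasons is a hypothetical helper name, and the message prefix is taken from the example above.

```shell
#!/bin/sh
# panic_reasons
# Reads kernel log lines on stdin and prints the text that follows
# "Kernel panic - not syncing:" for each panic line, de-duplicated.
# Hypothetical helper for illustration.
panic_reasons() {
    sed -n 's/^.*Kernel panic - not syncing: *//p' | sort -u
}
```

For example, sudo journalctl -k -b -1 --no-pager | panic_reasons would list the distinct panic reasons recorded for the previous boot, provided the journal is persistent and the message was written before the system halted.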

Fix 3: Service Crash Debugging

# Check the status of a failed service
sudo systemctl status nginx

# View the last 50 log lines for the service
sudo journalctl -u nginx -n 50 --no-pager

# Follow logs in real time
sudo journalctl -u nginx -f

# Test the service configuration
sudo nginx -t

# Restart after fixing the config
sudo systemctl restart nginx

Expected output:

● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded
     Active: failed (Result: exit-code) since ...
   Main PID: 1234 (code=exited, status=1/FAILURE)
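
For scripting, the same fields are easier to read from systemctl show, which prints KEY=VALUE lines. A minimal sketch, assuming the unit is a service (so the Result and ExecMainStatus properties are present); summarize_unit is a hypothetical helper name.

```shell
#!/bin/sh
# summarize_unit
# Reads KEY=VALUE lines (as printed by `systemctl show`) on stdin and
# prints a one-line summary of state, result, and main-process exit code.
# Hypothetical helper for illustration.
summarize_unit() {
    awk -F= '
        $1 == "ActiveState"    { state  = $2 }
        $1 == "Result"         { result = $2 }
        $1 == "ExecMainStatus" { code   = $2 }
        END { printf "state=%s result=%s exit=%s\n", state, result, code }
    '
}
```

A possible invocation: systemctl show nginx -p ActiveState -p Result -p ExecMainStatus | summarize_unit. The one-line form is convenient for alerts or for comparing many units in a loop.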

Fix 4: Filesystem Check

# Unmount the filesystem before checking it (boot a live USB if it is the root filesystem)
sudo umount /dev/sda1

# Run filesystem check
sudo fsck -f -y /dev/sda1

# Check disk SMART status
sudo smartctl -H /dev/sda

# Mount and verify
sudo mount /dev/sda1 /mnt
df -h /mnt

Expected output:

fsck from util-linux 2.34
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
/dev/sda1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda1: 1234/65536 files (0.1% non-contiguous)
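
The exit status of fsck is a bit mask, so a script should decode it rather than treat every non-zero value as failure. The sketch below follows the exit codes documented in the fsck(8) man page; explain_fsck_status is a hypothetical helper name.

```shell
#!/bin/sh
# explain_fsck_status STATUS
# Decodes the bit flags in an fsck exit status, one meaning per line.
# Hypothetical helper for illustration; flag values are from fsck(8).
explain_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    [ $((status & 1)) -ne 0 ]   && echo "errors corrected"
    [ $((status & 2)) -ne 0 ]   && echo "reboot recommended"
    [ $((status & 4)) -ne 0 ]   && echo "errors left uncorrected"
    [ $((status & 8)) -ne 0 ]   && echo "operational error"
    [ $((status & 16)) -ne 0 ]  && echo "usage or syntax error"
    [ $((status & 32)) -ne 0 ]  && echo "cancelled by user request"
    [ $((status & 128)) -ne 0 ] && echo "shared-library error"
    return 0
}
```

A possible use after the check: sudo fsck -f -y /dev/sda1; explain_fsck_status $?. A status of 1 means the repair succeeded, while any status with bit 4 set means damage remains and the disk should not be returned to service yet.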

Fix 5: OOM Killer Investigation

# Check if OOM killer has been active
dmesg | grep -i "oom-killer\|Out of memory"

# See total and available memory
free -h

# Find top memory-consuming processes
ps aux --sort=-%mem | head -10

# Set memory limit for a systemd service
sudo systemctl set-property my-service.service MemoryMax=1G

# Verify the limit
systemctl show my-service.service | grep MemoryMax

Expected output:

[12345.678901] oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[12345.678902] [ pid ]   uid  tgid total_vm      rss
[12345.678903] [ 1234]     0  1234   123456    78901  my-app
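
To see which processes the kernel actually killed, you can filter the log for the "Killed process" lines. This is a sketch under the assumption that the kernel logs the victim as "Killed process PID (NAME)"; the exact wording varies between kernel versions, and oom_victims is a hypothetical helper name.

```shell
#!/bin/sh
# oom_victims
# Reads kernel log lines on stdin and prints "PID NAME" for each
# "Killed process PID (NAME)" entry.
# Hypothetical helper; the log wording is an assumption and may differ
# on your kernel.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```

A possible invocation: sudo dmesg | oom_victims. Repeated names in the output point at the service that most needs a memory limit or a leak fix.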

Linux Troubleshooting Flowchart

flowchart TD
    A[Linux System Failure] --> B{Type of Failure?}
    B -->|Boot Hang| C[Enter GRUB advanced options]
    C --> D[Boot previous kernel version]
    D --> E[Remove or hold bad kernel]
    B -->|Kernel Panic| F[Boot with nomodeset]
    F --> G[Check dmesg for panic cause]
    G --> H[Run memtester and disk checks]
    B -->|Service Crash| I[Run systemctl status]
    I --> J[Check journalctl logs]
    J --> K[Validate config and restart]
    B -->|Filesystem Error| L[Unmount and run fsck]
    L --> M[Check SMART status]
    M --> N[Replace disk if failing]
    B -->|OOM Kills| O["Run dmesg | grep oom"]
    O --> P[Identify and fix memory leak]
    P --> Q[Set systemd memory limits]
    E --> R[System Stable]
    H --> R
    K --> R
    N --> R
    Q --> R

Prevention Tips

  • Always keep one previous kernel installed using sudo apt-mark hold on a known-good version
  • Set up early boot console logging with netconsole to capture kernel panics remotely
  • Monitor system health with Prometheus and alert on OOM events, high load, and disk errors
  • Schedule regular fsck checks using tune2fs -c 30 /dev/sda1 for every 30 mounts
  • Use systemd resource controls such as MemoryMax and CPUQuota to limit blast radius, and Restart=always to bring crashed services back automatically
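
The resource limits from the last tip can be kept in a drop-in file so they survive package upgrades. The following is a minimal sketch: write_limits_dropin is a hypothetical helper, the file name limits.conf is arbitrary, and the values (1G, 50%, 5 seconds) are illustrative assumptions rather than recommendations.

```shell
#!/bin/sh
# write_limits_dropin DIR
# Writes a systemd drop-in with resource limits into DIR.
# Hypothetical helper; all values below are example assumptions.
write_limits_dropin() {
    dir=$1
    mkdir -p "$dir" || return 1
    cat > "$dir/limits.conf" <<'EOF'
[Service]
MemoryMax=1G
CPUQuota=50%
Restart=always
RestartSec=5
EOF
}
```

Run as root, a call such as write_limits_dropin /etc/systemd/system/my-service.service.d followed by systemctl daemon-reload and systemctl restart my-service.service would apply the limits. Taking the directory as a parameter lets you generate the file somewhere harmless first and review it.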

Practice Questions

  1. How do you boot into a previous kernel version from GRUB? Answer: At the GRUB menu, select "Advanced options for Ubuntu" and choose an earlier kernel entry. If GRUB does not show, hold Shift (BIOS) or press Esc (UEFI) during boot.

  2. What commands show why a systemd service failed to start? Answer: sudo systemctl status <service> shows the exit code and status. sudo journalctl -u <service> -n 50 shows the recent log entries. sudo journalctl -u <service> -f follows logs in real time.

  3. How do you find and stop an out-of-memory condition on a live server? Answer: Run dmesg | grep -i oom to see which process was killed. Then ps aux --sort=-%mem | head -10 to find top memory consumers. Kill the offending process or restart the service with memory limits.

  4. Challenge: Write a bash script that checks disk SMART health for all drives, runs fsck on any ext4 partitions with errors reported in dmesg, and sends a summary. Answer:

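    #!/bin/bash
    # Prints a summary to stdout; pipe it to mail or a webhook to send it.
    shopt -s nullglob   # skip the loop if no /dev/sd? devices exist
    for disk in /dev/sd[a-z]; do
        health=$(sudo smartctl -H "$disk" | grep "SMART overall-health" | awk '{print $NF}')
        echo "$disk: ${health:-UNKNOWN}"
        if [ "$health" != "PASSED" ]; then
            echo "WARNING: $disk may be failing"
        fi
    done
    # ext4 partitions with kernel-reported errors, e.g. "EXT4-fs error (device sda1)"
    for part in $(sudo dmesg | grep -oP 'EXT4-fs error \(device \K[^)]+' | sort -u); do
        echo "ext4 errors reported on /dev/$part"
        # Read-only check; an actual repair (fsck -f -y) needs the partition unmounted
        sudo fsck -n "/dev/$part"
    done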
    

Quick Reference

| Issue | Diagnostic Command | Recovery Action |
|-------|--------------------|-----------------|
| Boot failure | GRUB advanced options | Boot previous kernel, sudo apt-mark hold |
| Kernel panic | dmesg \| grep -i panic | Boot with nomodeset, check hardware |
| Service crash | journalctl -u <service> | Fix config, systemctl restart |
| Filesystem error | dmesg \| grep "I/O error" | fsck -f -y /dev/sdX from live USB |
| OOM killer | dmesg \| grep oom | Kill process or set MemoryMax |

FAQ

What is a kernel panic and how is it different from a regular crash?

A kernel panic is the operating system's last resort when it encounters a fatal error it cannot recover from -- it halts all processing and displays a diagnostic message on the console. A regular application crash only affects the single process, while a kernel panic takes down the entire system and requires a reboot.

Can you recover data from a server with a corrupted filesystem?

Yes. Boot from a live USB, run fsck -f -y /dev/sdX on the affected partition, then mount it read-only first to assess damage. For critical data, use ddrescue to create a disk image before attempting repairs on the original drive.

How do you prevent a bad kernel update from breaking your server?

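Always test kernel updates in a staging environment first. On production servers, sudo apt-mark hold linux-image-$(uname -r) pins the running kernel package so it is not upgraded or removed; holding the kernel metapackage as well (linux-image-generic on Ubuntu) stops routine upgrades from pulling in a newer kernel version. Keep at least one known-good previous kernel installed, and set GRUB_DISABLE_SUBMENU=y in /etc/default/grub so every installed kernel appears in the top-level menu. With GRUB_DEFAULT=saved and GRUB_SAVEDEFAULT=true, GRUB boots the entry you last selected. Run sudo update-grub after changing these settings.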

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
