Linux Troubleshooting Guide -- System Boot, Kernel Panic & Service Failure
Linux system failures like kernel panics, boot hangs, and crashed services can take down production servers in seconds -- this guide shows you how to diagnose and recover from each using built-in recovery tools and systematic debugging.
What You'll Learn
Why It Matters
A server that fails to boot or crashes with a kernel panic means downtime. Knowing how to use Linux recovery tools like single-user mode, journalctl, and rescue shells lets you bring systems back online without reinstalling.
Real-World Use
When your production web server hangs on boot after a kernel update, a service keeps crashing with "Failed to start", or filesystem corruption prevents SSH access, these recovery procedures get you back online.
Common Linux System Issues Table
| Issue | Symptom | Cause | Recovery Method |
|---|---|---|---|
| Boot hang after kernel update | System stops at purple screen or GRUB menu | Faulty kernel module or boot parameter | Boot into previous kernel from GRUB |
| Kernel panic | "Kernel panic - not syncing" on console | Hardware fault, corrupt kernel, or bad module | Boot with nomodeset or rescue kernel |
| Service fails to start | "Failed to start service: Unit not found" | Missing dependency, bad config, or permission | Check journalctl logs and systemctl status |
| Filesystem corruption | "Input/output error" or "Structure needs cleaning" | Unexpected power loss or disk failure | Run fsck from a live USB |
| Out of memory (OOM) | Process killed by kernel OOM killer | Application memory leak or insufficient RAM | Check dmesg for OOM entries and adjust limits |
| GRUB rescue shell | "grub rescue>" prompt on boot | Corrupt or missing bootloader config | Reinstall GRUB from a live environment |
Step-by-Step Fixes
Fix 1: Boot Failure After Kernel Update
# At GRUB menu, select "Advanced options for Ubuntu"
# Choose the previous kernel version to boot into
# Once booted, remove the bad kernel (Ubuntu image packages carry a flavor suffix such as -generic)
sudo apt remove linux-image-5.15.0-XX-generic
# Regenerate GRUB config
sudo update-grub
# Prevent the kernel from being reinstalled accidentally
sudo apt-mark hold linux-image-5.15.0-XX-generic
Expected output:
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.15.0-XX-generic
Found initrd image: /boot/initrd.img-5.15.0-XX-generic
done
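Before removing a kernel, it helps to confirm which other versions are still installed to fall back on. The following is a minimal sketch, assuming kernels are stored as /boot/vmlinuz-<version> as on Ubuntu; list_fallback_kernels is a hypothetical helper name, and the version strings in the demo are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print installed kernel versions other than the running
# one, newest first. Reads vmlinuz paths on stdin, one per line.
list_fallback_kernels() {
    local running="$1"
    sed -n 's|.*/vmlinuz-||p' \
        | grep -vxF -- "$running" \
        | sort -rV
}

# Demo on sample data (illustrative version numbers, not real output).
printf '%s\n' \
    /boot/vmlinuz-5.15.0-91-generic \
    /boot/vmlinuz-5.15.0-97-generic \
    /boot/vmlinuz-5.15.0-100-generic \
    | list_fallback_kernels 5.15.0-100-generic

# On a real system:
#   ls /boot/vmlinuz-* | list_fallback_kernels "$(uname -r)"
```

With the sample data, the demo prints 5.15.0-97-generic followed by 5.15.0-91-generic. If the list comes back empty on a real server, there is no previous kernel to boot into, so install one before removing anything.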
Fix 2: Kernel Panic Recovery
# Boot with the "nomodeset" kernel parameter
# At GRUB, press 'e' on the boot entry
# Find the line starting with "linux" and add: nomodeset
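# Once booted, check the current boot's kernel log (dmesg does not retain messages from the boot that panicked)
dmesg | grep -i panic
# Search the previous boot's kernel log for the panic and related errors (needs a persistent journal)
sudo journalctl -k -b -1 --no-pager | grep -i "error\|fail\|panic"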
# Test memory for faults
sudo memtester 1024 5
Expected output:
[ 0.123456] Kernel panic - not syncing: Fatal exception
[ 0.123457] CPU: 1 PID: 123 Comm: swapper/0 Tainted: G
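A long kernel log is easier to triage once you know which kind of failure dominates. The sketch below tallies a few common signatures in log text read from stdin. summarize_kernel_log is a hypothetical helper, and the patterns are an assumed starting set rather than a complete list of kernel error messages.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count panic, oops/BUG, and hardware-error lines in
# kernel log text read from stdin. The patterns are assumptions; extend them
# to match what your own logs contain.
summarize_kernel_log() {
    awk '
        { line = tolower($0) }
        line ~ /kernel panic/                   { panic++ }
        line ~ /oops|bug:/                      { oops++ }
        line ~ /machine check|mce:|i\/o error/  { hw++ }
        END { printf "panic=%d oops=%d hardware=%d\n", panic, oops, hw }
    '
}

# Demo on the sample lines from the expected output above.
printf '%s\n' \
    '[ 0.123456] Kernel panic - not syncing: Fatal exception' \
    '[ 0.123457] CPU: 1 PID: 123 Comm: swapper/0 Tainted: G' \
    | summarize_kernel_log

# On a real system, feed it the previous boot's kernel log:
#   sudo journalctl -k -b -1 --no-pager | summarize_kernel_log
```

For the two sample lines the demo prints panic=1 oops=0 hardware=0. A non-zero hardware count suggests running the memory and disk checks before suspecting a kernel module.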
Fix 3: Service Crash Debugging
# Check the status of a failed service
sudo systemctl status nginx
# View the last 50 log lines for the service
sudo journalctl -u nginx -n 50 --no-pager
# Follow logs in real time
sudo journalctl -u nginx -f
# Test the service configuration
sudo nginx -t
# Restart after fixing the config
sudo systemctl restart nginx
Expected output:
● nginx.service - A high performance web server and a reverse proxy server
Loaded: loaded
Active: failed (Result: exit-code) since ...
Main PID: 1234 (code=exited, status=1/FAILURE)
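When you check many services, the human-readable status output is awkward to script against. systemctl show prints the same facts as key=value pairs, which can be condensed into one line. This is a minimal sketch: summarize_unit_state is a hypothetical helper, and it assumes the ActiveState, Result, and ExecMainStatus properties are present in the input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: condense `systemctl show` key=value output (stdin)
# into a one-line verdict.
summarize_unit_state() {
    local key value state="unknown" result="unknown" status="unknown"
    while IFS='=' read -r key value; do
        case "$key" in
            ActiveState)    state="$value" ;;
            Result)         result="$value" ;;
            ExecMainStatus) status="$value" ;;
        esac
    done
    if [ "$state" = "failed" ]; then
        echo "FAILED: result=$result exit_status=$status"
    else
        echo "OK: state=$state"
    fi
}

# Demo on sample text matching the failed unit shown above.
printf '%s\n' 'ActiveState=failed' 'Result=exit-code' 'ExecMainStatus=1' \
    | summarize_unit_state

# On a real system:
#   systemctl show nginx -p ActiveState -p Result -p ExecMainStatus | summarize_unit_state
```

The demo prints FAILED: result=exit-code exit_status=1, which matches the status=1/FAILURE line in the status output.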
Fix 4: Filesystem Check
# Check filesystem errors (unmount first)
sudo umount /dev/sda1
# Run filesystem check
sudo fsck -f -y /dev/sda1
# Check disk SMART status
sudo smartctl -H /dev/sda
# Mount and verify
sudo mount /dev/sda1 /mnt
df -h /mnt
Expected output:
fsck from util-linux 2.34
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
/dev/sda1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda1: 1234/65536 files (0.1% non-contiguous)
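The exit status of fsck is a bitmask, so a single number can carry several conditions at once. The sketch below decodes it using the values listed in the fsck(8) manual page. describe_fsck_status is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode the fsck exit status bitmask
# (bit values as documented in the fsck(8) manual page).
describe_fsck_status() {
    local status="$1" out="" i
    local -a bits=(1 2 4 8 16 32 128)
    local -a text=(
        "errors corrected"
        "system should be rebooted"
        "errors left uncorrected"
        "operational error"
        "usage or syntax error"
        "canceled by user request"
        "shared-library error"
    )
    if (( status == 0 )); then
        echo "no errors"
        return
    fi
    for i in "${!bits[@]}"; do
        if (( status & bits[i] )); then
            out="${out:+$out; }${text[i]}"
        fi
    done
    echo "${out:-unrecognized status $status}"
}

# Demo: status 1 is what you would expect after a successful repair.
describe_fsck_status 1

# On a real system:
#   sudo fsck -f -y /dev/sda1; describe_fsck_status "$?"
```

The demo prints errors corrected. A status with bit 4 set means errors remain, so rerun fsck or image the disk before mounting it read-write.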
Fix 5: OOM Killer Investigation
# Check if OOM killer has been active
dmesg | grep -i "oom-killer\|Out of memory"
# See total and available memory
free -h
# Find top memory-consuming processes
ps aux --sort=-%mem | head -10
# Set memory limit for a systemd service
sudo systemctl set-property my-service.service MemoryMax=1G
# Verify the limit
systemctl show my-service.service | grep MemoryMax
Expected output:
[12345.678901] oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[12345.678902] [ pid ] uid tgid total_vm rss
[12345.678903] [ 1234] 0 1234 123456 78901 my-app
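To see at a glance which processes the OOM killer actually terminated, you can extract them from the kernel log. This sketch assumes the "Killed process <pid> (<name>)" wording; the exact message varies between kernel versions, so check it against your own dmesg output. list_oom_victims is a hypothetical helper, and the sample line is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "pid name" for each process the OOM killer
# terminated, based on "Killed process <pid> (<name>)" lines on stdin.
list_oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Demo on a sample line (pid, name, and sizes are illustrative).
echo '[12345.678904] Out of memory: Killed process 1234 (my-app) total-vm:493824kB, anon-rss:315604kB' \
    | list_oom_victims

# On a real system:
#   sudo dmesg | list_oom_victims
```

The demo prints 1234 my-app. If the same name shows up repeatedly, that service is the first candidate for a MemoryMax limit.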
Linux Troubleshooting Flowchart
flowchart TD
A[Linux System Failure] --> B{Type of Failure?}
B -->|Boot Hang| C[Enter GRUB advanced options]
C --> D[Boot previous kernel version]
D --> E[Remove or hold bad kernel]
B -->|Kernel Panic| F[Boot with nomodeset]
F --> G[Check dmesg for panic cause]
G --> H[Run memtester and disk checks]
B -->|Service Crash| I[Run systemctl status]
I --> J[Check journalctl logs]
J --> K[Validate config and restart]
B -->|Filesystem Error| L[Unmount and run fsck]
L --> M[Check SMART status]
M --> N[Replace disk if failing]
B -->|OOM Kills| O["Run dmesg | grep oom"]
O --> P[Identify and fix memory leak]
P --> Q[Set systemd memory limits]
E --> R[System Stable]
H --> R
K --> R
N --> R
Q --> R
Prevention Tips
- Always keep one previous kernel installed using `sudo apt-mark hold` on a known-good version
- Set up early boot console logging with `netconsole` to capture kernel panics remotely
- Monitor system health with Prometheus and alert on OOM events, high load, and disk errors
- Schedule regular `fsck` checks using `tune2fs -c 30 /dev/sda1` for every 30 mounts
- Use `systemd` service hardening with `MemoryMax`, `CPUQuota`, and `Restart=always` to limit blast radius
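The service hardening tip can be applied with a systemd drop-in file. The following is a rough sketch, not a recommended configuration: write_hardening_dropin is a hypothetical helper, my-service is a placeholder unit name, and the 1G and 50% limits are example values you would tune per service.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a drop-in that caps memory and CPU and restarts
# the service whenever it exits. The limits are placeholder values.
write_hardening_dropin() {
    local dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/limits.conf" <<'EOF'
[Service]
MemoryMax=1G
CPUQuota=50%
Restart=always
EOF
}

# Demo: write into a scratch directory and show the result.
demo_dir="$(mktemp -d)"
write_hardening_dropin "$demo_dir/my-service.service.d"
cat "$demo_dir/my-service.service.d/limits.conf"
rm -rf "$demo_dir"

# On a real system, run as root with
#   write_hardening_dropin /etc/systemd/system/my-service.service.d
# then: systemctl daemon-reload && systemctl restart my-service
```

A drop-in keeps the limits separate from the packaged unit file, so they survive package upgrades that replace the original unit.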
Practice Questions
How do you boot into a previous kernel version from GRUB?
Answer: At the GRUB menu, select "Advanced options for Ubuntu" and choose an earlier kernel entry. If GRUB does not show, hold Shift (BIOS) or press Esc (UEFI) during boot.
What commands show why a systemd service failed to start?
Answer: `sudo systemctl status <service>` shows the exit code and status. `sudo journalctl -u <service> -n 50` shows the recent log entries. `sudo journalctl -u <service> -f` follows logs in real time.
How do you find and stop an out-of-memory condition on a live server?
Answer: Run `dmesg | grep -i oom` to see which process was killed. Then run `ps aux --sort=-%mem | head -10` to find the top memory consumers. Kill the offending process or restart the service with memory limits.
Challenge: Write a bash script that checks disk SMART health for all drives, runs fsck on any ext4 partitions with errors reported in dmesg, and sends a summary.
Answer: The script below covers the SMART check and the dmesg scan; the fsck and summary steps are left as an extension.
#!/bin/bash
# Report SMART overall health for every /dev/sdX drive
for disk in /dev/sd[a-z]; do
    health=$(sudo smartctl -H "$disk" | grep "SMART overall-health" | awk '{print $NF}')
    echo "$disk: $health"
    if [ "$health" != "PASSED" ]; then
        echo "WARNING: $disk may be failing"
    fi
done
# Show the first 20 kernel log errors that mention an sdX device
dmesg | grep -i "error" | grep -i "sd[a-z]" | head -20
Quick Reference
| Issue | Diagnostic Command | Recovery Action |
|-------|--------------------|-----------------|
| Boot failure | GRUB advanced options | Boot previous kernel, `sudo apt-mark hold` |
| Kernel panic | `dmesg \| grep -i panic` | Boot with `nomodeset`, check hardware |
| Service crash | `journalctl -u <service>` | Fix config, `systemctl restart` |
| Filesystem error | `dmesg \| grep "I/O error"` | `fsck -f -y /dev/sdX` from live USB |
| OOM killer | `dmesg \| grep oom` | Kill process or set `MemoryMax` |
FAQ
What is a kernel panic and how is it different from a regular crash?
A kernel panic is the operating system's last resort when it encounters a fatal error it cannot recover from -- it halts all processing and displays a diagnostic message on the console. A regular application crash only affects the single process, while a kernel panic takes down the entire system and requires a reboot.
Can you recover data from a server with a corrupted filesystem?
Yes. Boot from a live USB, run fsck -f -y /dev/sdX on the affected partition, then mount it read-only first to assess damage. For critical data, use ddrescue to create a disk image before attempting repairs on the original drive.
How do you prevent a bad kernel update from breaking your server?
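Always test kernel updates in a staging environment first. On production servers, pin the known-good kernel with sudo apt-mark hold linux-image-$(uname -r) so it is not removed or replaced. To make an earlier kernel easy to select at boot, set GRUB_DISABLE_SUBMENU=y, GRUB_DEFAULT=saved, and GRUB_SAVEDEFAULT=true in /etc/default/grub, then run sudo update-grub. These GRUB settings flatten the boot menu and remember the last entry you chose; they do not by themselves keep a kernel installed.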
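GRUB only remembers the last booted entry when GRUB_SAVEDEFAULT=true is combined with GRUB_DEFAULT=saved, so it is worth checking what the defaults file really contains. This is a minimal sketch: grub_setting is a hypothetical helper, and it assumes plain KEY=value lines as used in /etc/default/grub.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one KEY=value setting from a GRUB
# defaults file, or "(unset)" if it is missing or commented out.
# If a key is assigned more than once, the last assignment wins.
grub_setting() {
    local key="$1" file="$2" value
    value="$(sed -n "s/^${key}=//p" "$file" | tail -n 1 | tr -d '"')"
    echo "${value:-(unset)}"
}

# Demo on a scratch file standing in for /etc/default/grub.
sample="$(mktemp)"
printf '%s\n' 'GRUB_DEFAULT=saved' 'GRUB_SAVEDEFAULT=true' '#GRUB_DISABLE_SUBMENU=y' > "$sample"
for key in GRUB_DEFAULT GRUB_SAVEDEFAULT GRUB_DISABLE_SUBMENU; do
    echo "$key=$(grub_setting "$key" "$sample")"
done
rm -f "$sample"

# On a real system:
#   grub_setting GRUB_DEFAULT /etc/default/grub
```

In the demo, GRUB_DISABLE_SUBMENU is reported as (unset) because the line is commented out. Remember to run sudo update-grub after editing the real file so the change reaches the generated boot menu.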
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.