
How to Troubleshoot Any Linux Problem

This is the hub for the troubleshooting playbooks. Instead of a single bug, it teaches a method you can apply to anything: a service that won't start, a full disk, a server that's slow, or a host you can't reach. Master the workflow once and the specific playbooks become quick reference.

Tested on

AlmaLinux 9 / RHEL 9. Commands use systemd, journalctl, and dnf. On Debian/Ubuntu the same tools exist; substitute apt/apt-get for dnf and check /var/log/apt/history.log instead of dnf history.
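
The substitution above can be sketched as a small check. This is only an illustration: the helper name pkg_history_hint is hypothetical, and it reports where to look rather than reading the history itself.

```shell
# pkg_history_hint
# Hypothetical helper (not a standard tool): prints where package change
# history lives on this host, based on which package manager is installed.
pkg_history_hint() {
    if command -v dnf >/dev/null 2>&1; then
        echo "dnf history"
    elif command -v apt-get >/dev/null 2>&1; then
        echo "/var/log/apt/history.log"
    else
        echo "unknown"
    fi
}

pkg_history_hint
```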

The mindset

Most outages are solved by discipline, not cleverness. Three habits matter more than any single command:

  • What changed recently? Systems that ran fine for months rarely break on their own. A deploy, a package update, a config edit, a cert expiry, or a full disk usually precedes the failure. Find the change and you're halfway to the fix.
  • Reproduce it. A bug you can trigger on demand is a bug you can fix. Note the exact steps, the exact time, and whether it's constant or intermittent.
  • Read the ACTUAL error message. Do not paraphrase it, guess at it, or skim past it. Copy the literal text. "It's broken" wastes an hour; "No space left on device" or "Permission denied" points straight at the cause.

Change one thing at a time

When you start fixing, change a single variable, then re-test. Changing five things at once means you won't know which one worked — or which one made it worse.

A repeatable workflow

  1. Capture the exact symptom and error. What command/page failed, what was the literal output, what time, who reported it.
  2. Check service status and logs. Run systemctl status <unit> and journalctl -u <unit> -e. The error is almost always written down somewhere.
  3. Narrow the layer. Decide whether the problem is in the application, the OS, the network, or hardware. Each layer has different tools (below). Don't tune the database if DNS is broken.
  4. Check resources. Disk, inodes, memory, CPU, and load. A surprising share of "weird" failures are just a full disk or OOM kills.
  5. Check recent changes. dnf history, file modification times, deploy logs, who logged in.
  6. Isolate / bisect. Disable half the variables; does it still fail? Roll back the suspected change; does it recover? Binary-search your way to the culprit.
  7. Fix, verify, document. Apply the smallest fix, confirm the symptom is gone with the same test from step 1, then write down what happened and how you fixed it so the next person (often future-you) doesn't start from zero.
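
Steps 1 and 7 both depend on having the literal failure on record, so it can help to make the capture mechanical. The following is a minimal sketch, not a standard tool: the function name capture_symptom and the log path are assumptions chosen for illustration. It appends the UTC time, the command line, the combined output, and the exit status to a file.

```shell
# capture_symptom LOGFILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, appends a timestamp, the command line,
# its combined stdout/stderr, and its exit status to LOGFILE, then returns
# the same exit status as COMMAND.
capture_symptom() {
    local log=$1
    shift
    local status=0
    {
        printf '=== %s ===\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        printf 'command: %s\n' "$*"
        "$@" 2>&1 || status=$?
        printf 'exit status: %s\n' "$status"
    } >>"$log"
    return "$status"
}

# Example (path and unit are placeholders):
# capture_symptom /tmp/incident.log systemctl status <unit>
```

Re-running the same capture_symptom call after the fix gives a before/after record for the write-up in step 7.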

Verify before you walk away

"It should be fixed now" is not the same as "I reproduced the original failure and it no longer happens." Always re-run the failing test.

The USE method

For performance and resource problems, Brendan Gregg's USE method gives you a checklist for every resource (CPU, memory, disk, network):

  • Utilization — how busy is the resource? (e.g. CPU at 95%, disk at 100% busy)
  • Saturation — how much extra work is queued/waiting? (e.g. run queue length, swap activity, I/O wait)
  • Errors — is the resource reporting errors? (e.g. disk read errors in dmesg, dropped network packets)

Walk each resource through U, S, E and you'll rarely miss a bottleneck.
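
As a rough illustration of the saturation check for CPU, the sketch below compares the 1-minute load average with the CPU count. The function name cpu_saturation and the rule of thumb (load above CPU count suggests queued work) are assumptions for illustration. On Linux the load average also counts tasks in uninterruptible I/O sleep, so treat the result as a hint, not a measurement.

```shell
# cpu_saturation LOAD1 NCPU
# Hypothetical helper: prints "saturated" when the 1-minute load average
# is greater than the number of CPUs, otherwise "ok".
cpu_saturation() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        if (load + 0 > ncpu + 0) print "saturated"; else print "ok"
    }'
}

# Live usage on Linux: first field of /proc/loadavg, CPU count from nproc.
# cpu_saturation "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```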

Essential first-look commands

Run these in roughly this order when you land on a misbehaving box. They take seconds and rule out the most common causes.

Services and logs

# Any units that failed to start?
systemctl --failed

# Errors in this boot, highest priority first
journalctl -p err -b

# The most recent log activity (great right after a crash)
journalctl -xe

# Logs for one specific unit
journalctl -u <unit> -e

Resources

# Disk space — the single most common silent killer
df -h

# Inodes — a disk can be "full" with space left if inodes are exhausted
df -i

# Memory and swap
free -h

# Load average and how long the box has been up (did it reboot?)
uptime

# Kernel ring buffer with human timestamps — OOM kills, disk/HW errors, segfaults
sudo dmesg -T | tail -n 40

df -i is easy to forget

If df -h shows free space but writes still fail with No space left on device, check df -i next: inode exhaustion is the most likely cause. See Disk Full: No space left on device.
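
To make the space and inode checks harder to skip, both can go through the same filter. This is a sketch under stated assumptions: the helper name over_threshold and the 90% limit are hypothetical, and it relies on the POSIX-format output of df -P, where the fifth column is the use percentage. Mount points containing spaces are not handled.

```shell
# over_threshold PCT
# Hypothetical helper: reads `df -P` or `df -Pi` output on stdin and prints
# the mount point and use% of every filesystem at or above PCT percent.
# Rows whose use% is not a number (for example "-") are skipped.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%$/, "", pct)
        if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6, $5
    }'
}

# Live usage:
# df -P  | over_threshold 90   # space
# df -Pi | over_threshold 90   # inodes
```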

Network and access

# What's listening, and which process owns each socket?
sudo ss -tulpn

# Successful logins (last) and failed login attempts (lastb)
last
sudo lastb
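
When the question is simply "is anything listening on port N?", the ss output can be filtered instead of read by eye. A minimal sketch, assuming a hypothetical helper name listening_on; it checks the fourth and fifth columns because ss omits the Netid column when only one socket type is requested.

```shell
# listening_on PORT
# Hypothetical helper: reads `ss -tuln` (or `ss -tln`) output on stdin and
# exits 0 if any socket's local address ends in :PORT, otherwise exits 1.
listening_on() {
    awk -v port="$1" '
        BEGIN { re = ":" port "$" }
        NR > 1 && ($4 ~ re || $5 ~ re) { found = 1 }
        END { exit !found }
    '
}

# Live usage:
# ss -tuln | listening_on 22 && echo "port 22 is open locally"
```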

Recent changes

# What did dnf install/update/remove, and when?
sudo dnf history

# Details of a specific transaction
sudo dnf history info <ID>

# Which config files changed in the last day?
sudo find /etc -type f -mtime -1 -ls

File mtimes are a free audit log

find /etc -mtime -1 (or -mmin -120 for the last two hours) often reveals exactly which config someone edited right before things broke — even when there's no formal change record.
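
The same idea works for any directory and any window. The sketch below wraps it in a hypothetical helper, changed_within, which is an illustration only; it uses the -mmin test from the tip above and hides permission errors, so run it with enough privileges to read the tree.

```shell
# changed_within DIR MINUTES
# Hypothetical helper: lists regular files under DIR modified less than
# MINUTES minutes ago, one path per line, sorted.
changed_within() {
    find "$1" -type f -mmin "-$2" 2>/dev/null | sort
}

# Live usage:
# changed_within /etc 120
#
# With GNU find, a window around a known incident time can be bounded
# using -newermt (the timestamps here are placeholders):
# sudo find /etc -type f -newermt "2024-05-01 14:00" ! -newermt "2024-05-01 16:00" -ls
```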

Picking the layer

| Symptom points to… | Start with | Playbook |
| --- | --- | --- |
| A daemon that won't start or keeps restarting | systemctl status, journalctl -u | systemd Service Management |
| Writes failing, services crashing, can't log in | df -h, df -i | Disk Full: No space left on device |
| Slow or unresponsive box | uptime, free -h, top, dmesg -T | (use the USE method above) |
| Need to read what actually happened | journalctl, log files | Logs & journald |
| Mount, volume, or filesystem trouble | df, lsblk, mount | Storage & Filesystems |

When in doubt, return here, run the first-look commands, and let the output tell you which playbook to open next.
