Troubleshooting

Server Is Slow or Load Is High

The box feels sluggish, SSH lags, or your monitoring fired a "high load" alert. The job is to answer one question fast: is the system CPU-bound, memory-bound, or I/O-bound? Each points to a different culprit and a different fix.

Tested on

AlmaLinux 9 / RHEL 9. The same tools exist on Debian/Ubuntu. Install the optional ones with sudo dnf install sysstat htop on RHEL-family systems (htop comes from the EPEL repository there) or sudo apt install sysstat htop on Debian/Ubuntu.

Symptom

uptime shows load average well above your CPU core count.
Interactive commands stall; applications time out.
A monitoring alert flags high load, low free memory, or high iowait.

Understanding load average

$ uptime
 10:42:11 up 37 days,  3:18,  2 users,  load average: 8.21, 6.05, 3.10

The three numbers are the 1-, 5-, and 15-minute averages of runnable + uninterruptible (D-state, usually I/O) processes. Read them relative to CPU core count (nproc):

Load ≈ cores → fully utilized but keeping up.
Load > cores → tasks are queueing; the trend across the three numbers tells you if it is rising or recovering.
High load with low CPU usage → the queue is full of processes blocked on I/O, not CPU.

Load is not CPU%

A load of 8 on a 16-core box can be perfectly healthy. A load of 8 on a 2-core box is a problem. Always check nproc first.
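The comparison above boils down to dividing load by core count. A minimal sketch of that arithmetic follows; load_per_core is a hypothetical helper name, not a standard tool, and the 1.00 threshold is a rule of thumb rather than a hard limit.

```shell
# Hypothetical helper: divide a load average by the core count.
# Around 1.00 means busy but keeping up; well above 1.00 means tasks are queueing.
load_per_core() {
  # $1 = load average (e.g. 8.21), $2 = number of cores (e.g. from nproc)
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Example on a live Linux system:
#   load_per_core "$(cut -d " " -f 1 /proc/loadavg)" "$(nproc)"
```

With the two cases from the note, a load of 8 gives 0.50 on 16 cores and 4.00 on 2 cores.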

Likely causes

Class	Signature
CPU-bound	High `%us`/`%sy` in top, one or more processes pinned near 100%
Memory-bound	`free -h` near zero available, swap filling, OOM kills in dmesg
I/O-bound	High `%wa` (iowait), processes in `D` state, busy disk in `iostat`

Diagnose

Work top-down: load → which resource → which process.

uptime           # quick load snapshot
w                # load plus who is logged in and doing what
nproc            # core count to interpret the load against

Live view — watch %CPU, the load line, and especially %wa (iowait):

top              # press '1' to see per-core, 'M' sort by memory, 'P' by CPU
htop             # friendlier, color-coded (install separately)

Memory and swap:

free -h          # look at 'available' (not 'free') and the Swap row
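For scripts or alert checks, the same available figure can be read from /proc/meminfo as a percentage. This is a sketch only: mem_available_pct is a hypothetical helper, and it assumes a kernel that exposes the MemAvailable field (any RHEL 9 or current Debian/Ubuntu kernel does).

```shell
# Hypothetical helper: percentage of RAM still "available".
# Reads /proc/meminfo-formatted text on stdin and prints an integer percentage.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END { if (total > 0) printf "%d\n", avail * 100 / total }
  '
}

# Example on a live system:
#   mem_available_pct < /proc/meminfo
```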

System-wide activity, one sample per second:

vmstat 1         # r=run queue, b=blocked, si/so=swap in/out, wa=iowait
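Eyeballing a scrolling vmstat is error-prone, so it can help to average a few samples. The sketch below assumes the classic vmstat column order (r b swpd free buff cache si so bi bo in cs us sy id wa st); vmstat_summary is a hypothetical helper and would need adjusting if your vmstat prints different columns.

```shell
# Hypothetical helper: average the r, b, si, so and wa columns of vmstat output.
# Assumes columns: r b swpd free buff cache si so bi bo in cs us sy id wa st
vmstat_summary() {
  awk '
    $1 ~ /^[0-9]+$/ {        # data rows only; header rows start with text
      if (!seen++) next       # the first row is the average since boot, skip it
      n++; r += $1; b += $2; si += $7; so += $8; wa += $16
    }
    END {
      if (n) printf "r=%.1f b=%.1f si=%.1f so=%.1f wa=%.1f\n", r/n, b/n, si/n, so/n, wa/n
    }
  '
}

# Example: six samples, of which the last five are averaged
#   vmstat 1 6 | vmstat_summary
```

A high r with low wa suggests CPU pressure, nonzero si/so suggests memory pressure, and a high b with high wa suggests I/O pressure.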

Per-device I/O (from the sysstat package):

iostat -xz 1     # %util near 100 and high r_await/w_await = a saturated disk
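To pick out saturated devices without reading every column, the output can be filtered on %util. This sketch locates the %util column from the header row instead of hard-coding a position, because the column set differs between sysstat versions; busy_disks and the default threshold of 90 are illustrative choices, not part of sysstat.

```shell
# Hypothetical helper: print devices whose %util is at or above a threshold.
# Reads iostat -dx style output on stdin; $1 = threshold (default 90).
busy_disks() {
  awk -v limit="${1:-90}" '
    # Header row: remember which column holds %util
    $1 == "Device" || $1 == "Device:" {
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    # Data rows: print device name and %util when over the limit
    col && NF >= col && $col + 0 >= limit { print $1, $col }
  '
}

# Example: three reports, devices only (-d), idle devices hidden (-z)
#   iostat -dxz 1 3 | busy_disks 90
```

The first report from iostat covers the time since boot, so a device that only appears there is not necessarily busy now.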

Top consumers by CPU and by memory:

ps aux --sort=-%cpu | head        # biggest CPU users
ps aux --sort=-%mem | head        # biggest memory users
pidstat 1                         # per-process CPU over time (sysstat)
pidstat -d 1                      # per-process disk I/O
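The I/O-bound signature also mentions processes in D state, which the commands above do not list directly. One way to see them, sketched here with a hypothetical d_state_only filter, is to ask ps for the state column and keep the rows that start with D.

```shell
# Hypothetical helper: keep only rows whose state column starts with D
# (uninterruptible sleep). Expects "STAT PID COMMAND" rows on stdin.
d_state_only() {
  awk '$1 ~ /^D/'
}

# Example on a live system (separate -o options suppress each header):
#   ps -e -o stat= -o pid= -o comm= | d_state_only
```

A process that shows up here briefly is normal; the same PIDs appearing in run after run are the ones worth investigating.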

The OOM killer

When the system runs out of memory (and swap), the kernel's Out-Of-Memory killer picks a process and kills it to stay alive. Symptoms: a service vanishes for no apparent reason, or load spikes during heavy swapping. Confirm it:

sudo dmesg -T | grep -i -E 'killed process|out of memory|oom'
sudo journalctl -k | grep -i -E 'oom|killed process'

A match like Out of memory: Killed process 4821 (mysqld) tells you the kernel reclaimed memory by killing that PID — your real problem is memory pressure, not the service "randomly crashing". See Logs & journald for retaining these kernel messages.
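When there are many matches, reducing each to PID and process name makes repeat victims easy to spot. This is a sketch that assumes the message contains the phrase Killed process followed by the PID and the name in parentheses, as in the example above; oom_victims is a hypothetical helper name.

```shell
# Hypothetical helper: reduce kernel OOM lines to "PID name" pairs.
# Lines without a "Killed process <pid> (<name>)" fragment are dropped.
oom_victims() {
  sed -n -E 's/.*Killed process ([0-9]+) \(([^)]+)\).*/\1 \2/p'
}

# Example on a live system:
#   sudo journalctl -k | oom_victims
```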

Fix

Match the fix to the bound you identified. The three cases follow in order: CPU-bound, memory-bound, I/O-bound.

CPU-bound: find the process, then renice it (lower its priority) or kill/fix the runaway. See Process Management.

sudo renice +10 -p <PID>     # deprioritize a non-critical hog
sudo kill <PID>              # stop it (SIGTERM); -9 only as last resort

If it is a legitimate workload consistently maxing cores, add CPUs or scale out.

Memory-bound: restart a leaking service, then address the leak. If the host is simply short on RAM, add swap as a stopgap:

sudo systemctl restart <leaky-service>
sudo fallocate -l 4G /swapfile && sudo chmod 600 /swapfile
sudo mkswap /swapfile && sudo swapon /swapfile
free -h

Make swap persistent by adding /swapfile none swap sw 0 0 to /etc/fstab. Swap buys time — it does not fix a leak. See Storage & Filesystems.
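Editing /etc/fstab by hand invites duplicate entries when the procedure is repeated. A small guard, sketched below, appends the line only when that exact line is not already present. append_once is a hypothetical helper; it matches the whole line literally, so an existing entry with different spacing would not be detected.

```shell
# Hypothetical helper: append a line to a file only if that exact line is absent.
append_once() {
  # $1 = file, $2 = line to add
  grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example (run as root; back up /etc/fstab first):
#   append_once /etc/fstab '/swapfile none swap sw 0 0'
```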

I/O-bound: find the disk culprit with iostat -xz 1 and pidstat -d 1, then throttle or reschedule it (e.g. move a backup or dd job off peak). Heavy D-state processes blocking everything else usually point at one greedy job or a failing disk, so check dmesg for I/O errors.

After mitigating, re-run uptime and iostat to confirm load is trending down.
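Whether load is trending down can be read from the three load averages: a 1-minute value below the 15-minute value means recent load is lower than the longer-term average. The sketch below encodes only that comparison; load_trend is a hypothetical helper and ignores the 5-minute figure for simplicity.

```shell
# Hypothetical helper: compare the 1-minute load with the 15-minute load.
load_trend() {
  # $1 $2 $3 = 1-, 5-, 15-minute load averages (the 5-minute value is unused)
  awk -v one="$1" -v fifteen="$3" 'BEGIN {
    if (one > fifteen)      print "rising"
    else if (one < fifteen) print "falling"
    else                    print "steady"
  }'
}

# Example on a live Linux system:
#   load_trend $(cut -d " " -f 1-3 /proc/loadavg)
```

For the uptime sample earlier on this page (8.21, 6.05, 3.10) this reports rising.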

Prevent

Monitor and alert on load relative to core count, available memory, swap usage, and disk %util, not just raw load. Treat any OOM kill as a paging event.

Set systemd resource limits so one service can't starve the host:

[Service]
# Hard cap; the cgroup OOM killer targets only this service
MemoryMax=2G
# Soft throttle before the hard cap
MemoryHigh=1500M
# At most 2 full cores
CPUQuota=200%

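Add these settings with sudo systemctl edit <svc>, which writes a drop-in and reloads systemd for you, then run sudo systemctl restart <svc>. If you edit unit files by hand instead, apply with sudo systemctl daemon-reload && sudo systemctl restart <svc>. See systemd Service Management.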
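Where an interactive editor is not an option (provisioning scripts, for example), the same limits can be written as a drop-in file, since systemd reads *.conf files from a <unit>.d directory next to the unit. The sketch below is one way to do that; write_limits_dropin, the limits.conf file name, and myapp.service are all hypothetical, and the values simply repeat the ones shown above. Comments are left out of the generated file because systemd only accepts them on their own lines.

```shell
# Hypothetical helper: write a resource-limit drop-in for a unit.
write_limits_dropin() {
  # $1 = unit name (e.g. myapp.service)
  # $2 = systemd config root (default /etc/systemd/system)
  dir="${2:-/etc/systemd/system}/$1.d"
  mkdir -p "$dir" &&
  cat > "$dir/limits.conf" <<'EOF'
[Service]
MemoryMax=2G
MemoryHigh=1500M
CPUQuota=200%
EOF
}

# Example (run as root):
#   write_limits_dropin myapp.service
#   systemctl daemon-reload && systemctl restart myapp.service
```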

Capacity-plan from trends: if load and memory creep up week over week, size up before it becomes an incident.
Keep sysstat enabled (sudo systemctl enable --now sysstat) so sar retains historical CPU/IO/memory data for post-mortems.