Linux Utilities for Diagnostics

I spend a fair amount of time troubleshooting issues on Linux and other Unix and Unix-like systems. While there are dozens of utilities I use for diagnosing and resolving issues, I consistently employ a small set of tools to do quick, high-level checks of system health. These checks fall into three categories: disk utilization; memory, CPU, and I/O utilization; and networking and connectivity. Triaging the health of the system in each of these categories allows me to quickly home in on where a problem may exist.

These utilities are available on most Linux systems. Most are also available, or have analogues, on other Unix and Unix-like systems.

Disk Utilization

Generally, disk utilization is the first thing I check, because a lack of free disk space spells certain doom for most user and kernel processes. I have seen more strange behavior from a lack of free disk space than from anything else.

  • df reports filesystem disk space usage. This quickly allows me to see how much free space remains on each filesystem.
  • df -h displays, in human-readable format, the free space available on all mounted filesystems.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda        47G   26G   19G  58% /
devtmpfs        4.0G   12K  4.0G   1% /dev
none            802M  184K  802M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            4.0G     0  4.0G   0% /run/shm
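
When there are many filesystems, the same check can be scripted. The following is a minimal sketch, not a definitive implementation: the function name full_filesystems and the default threshold of 90 percent are illustrative assumptions, and it expects the column layout shown above, with the usage percentage in the fifth column and the mount point in the sixth.

```shell
#!/bin/sh
# Print every filesystem whose usage is at or above a threshold (percent).
# Reads df output on stdin so it can also be tried against saved output.
# The function name and the default of 90 are illustrative choices.
full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR == 1 { next }                 # skip the header row
        {
            use = $5                     # fifth column is Use% (Capacity with -P)
            sub(/%$/, "", use)           # strip the trailing percent sign
            if (use + 0 >= limit + 0)
                printf "%s is %s%% full (%s)\n", $6, use, $1
        }
    '
}
```

On a live system this could be run as df -P | full_filesystems 80. The -P option asks df for the POSIX output format, which keeps each filesystem on a single line.
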
  • du estimates file space usage. This allows me to pinpoint which files are taking up large amounts of disk space so I can investigate further.
  • du -sh * summarizes, in human-readable format, the space utilized by all files/folders in the current directory.

$ du -sh *
18M     bundle
8.6M    cached-copy
444M    log
4.0K    pids
4.0K    system
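
In a directory with many entries, it helps to sort the du output so the largest items come first. Here is a small sketch under stated assumptions: the function name largest_entries and its default count of five are hypothetical, and sizes are reported in kilobytes (-k) so that a plain numeric sort orders them correctly.

```shell
#!/bin/sh
# List the largest files/folders directly under a directory, biggest first.
# The function name and the default count of 5 are illustrative choices.
largest_entries() {
    dir="${1:-.}"
    count="${2:-5}"
    # -s summarizes each argument, -k reports sizes in kilobytes
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, largest_entries /var/log 10 would print the ten largest entries under /var/log. Hidden files are not matched by the * glob, so they are left out of the listing.
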

Memory, CPU Utilization, and I/O

Running out of available memory is also a major cause of performance problems and strange behavior on systems. CPU utilization and I/O rates can quickly provide clues as to whether performance problems are due to bottlenecks internal to a given system, or from external sources.

  • free reports the amount of free and used memory on the system. This provides immediate feedback on whether a system lacks free memory.
  • free -m displays, in megabytes, the amount of used and free physical and swap memory, and the amount of memory used for buffers/caching.

$ free -m
             total       used       free     shared    buffers     cached
Mem:          8014       6339       1674          0        136       3887
-/+ buffers/cache:       2314       5699
Swap:          511        153        358
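
The "-/+ buffers/cache" row is the one I look at first, because memory used for buffers and caching can be reclaimed for applications. As a rough sketch of how that row relates to the "Mem:" row, the hypothetical function below recomputes it. It assumes the older column layout shown above (total, used, free, shared, buffers, cached); newer versions of free print a different layout, so the field positions would need adjusting there.

```shell
#!/bin/sh
# Recompute the "-/+ buffers/cache" figures from the "Mem:" row.
# Assumes the older free layout: total used free shared buffers cached.
# The function name app_memory is an illustrative choice.
app_memory() {
    awk '
        $1 == "Mem:" {
            used = $3; free = $4; buffers = $6; cached = $7
            # memory held by applications, excluding reclaimable buffers/cache
            printf "used by applications: %d MB\n", used - buffers - cached
            # memory applications could obtain, counting reclaimable buffers/cache
            printf "available to applications: %d MB\n", free + buffers + cached
        }
    '
}
```

It would be used as free -m | app_memory. The results can differ by a megabyte or two from the row that free itself prints, presumably because free converts each column to megabytes separately.
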
  • vmstat reports on memory, swap, I/O, system activity, and CPU activity. The first line of output gives averages of the metrics since boot; when run with an interval, each later line reports on the most recent interval. Analyzing the metrics can provide insight into what the system is doing at a given time (e.g. frequently swapping, waiting on I/O, etc.).
  • vmstat -S M 1 will print out the metrics once every second until halted, reporting memory in megabytes instead of the default kilobytes.

$ vmstat -S M 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0     80     59     43   1791    0    0     0     0 1131 1057 15  2 83  0
 0  0     80     57     43   1791    0    0     8    96 1031  936 19  2 79  0
 0  0     80     60     43   1791    0    0    40    64 1666 1444  9  2 89  0
 0  0     80     60     43   1791    0    0     8     0  667  553  0  0 100  0
 1  0     80     57     43   1791   16    0    16   104  808  748 12  2 86  0
 0  0     80     59     43   1791    0    0    12  3028 1813 1723 44  5 50  0
 0  0     80     59     43   1791    0    0     0    56 1119 1066 17  1 81  0
 1  0     80     50     43   1791    0    0    68     0 1219 1024 25  4 71  0
 0  0     80     60     43   1791    0    0    52    68 1725 1435 12  1 86  0
 0  0     80     60     43   1791    0    0     8     0 2236 1699 35  5 60  0
 0  0     80     60     43   1791    0    0     0    68  163  209  0  0 99  0
 1  0     80     60     43   1791    0    0     0   140 1456 1379 22  3 74  0
 1  0     80     61     43   1791    0    0     0    56 1481 1242 24  4 72  0
 0  0     80     60     43   1791    0    0   356     0 1359  930 11  3 86  0
 0  0     80     60     43   1792    0    0   428     0 1619  992  2  1 97  0
 0  0     80     60     43   1792    0    0     8  2196  313  396  0  0 100  0
 0  0     80     60     43   1792    0    0     0     0  144  181  0  0 100  0
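
Scanning a long vmstat run by eye gets tedious, so the interesting samples can be picked out with a filter. This is a sketch only: the function name vmstat_flags and the 20 percent I/O-wait threshold are arbitrary assumptions, not recommended values. Column positions are looked up by name from the header row, so additional columns printed by other vmstat versions should not shift the fields.

```shell
#!/bin/sh
# Flag vmstat samples that show swapping or heavy I/O wait.
# Reads vmstat output on stdin. Note that on a live run the first
# sample holds averages since boot rather than current activity.
vmstat_flags() {
    wa_limit="${1:-20}"
    awk -v wa_limit="$wa_limit" '
        $1 == "r" && $2 == "b" {          # header row with the column names
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        !("wa" in col) || $1 !~ /^[0-9]+$/ { next }   # skip banner lines
        {
            sample++
            if ($(col["si"]) > 0 || $(col["so"]) > 0)
                printf "sample %d: swapping (si=%s so=%s)\n", sample, $(col["si"]), $(col["so"])
            if ($(col["wa"]) + 0 >= wa_limit + 0)
                printf "sample %d: high I/O wait (wa=%s%%)\n", sample, $(col["wa"])
        }
    '
}
```

A bounded run such as vmstat 1 30 | vmstat_flags collects thirty samples and reports only those where pages were swapped in or out, or where the CPU spent a large share of its time waiting on I/O.
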

Networking and Connectivity

Network connectivity and routing issues are usually apparent. However, determining the exact nature or cause of the issue can be more difficult.

  • ping sends an ICMP echo request to a host. This provides immediate confirmation of whether or not a remote host is accessible.
  • ping 8.8.8.8 will ping one of Google’s public DNS servers; a reply indicates with a high degree of certainty that Internet connectivity is available.

$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_req=1 ttl=54 time=0.681 ms
64 bytes from 8.8.8.8: icmp_req=2 ttl=54 time=0.679 ms
64 bytes from 8.8.8.8: icmp_req=3 ttl=54 time=0.703 ms
64 bytes from 8.8.8.8: icmp_req=4 ttl=54 time=0.703 ms
64 bytes from 8.8.8.8: icmp_req=5 ttl=54 time=0.677 ms
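
In a script, the exit status of ping is more convenient than its output. The sketch below assumes a ping that supports -c to limit the number of requests and that exits with status 0 when at least one reply is received; the function name host_reachable is hypothetical.

```shell
#!/bin/sh
# Report whether a host answers ICMP echo requests.
# -c 3 sends three requests so the command ends on its own.
# The function name host_reachable is an illustrative choice.
host_reachable() {
    target="$1"
    if ping -c 3 "$target" > /dev/null 2>&1; then
        echo "$target is reachable"
    else
        echo "$target is NOT reachable"
        return 1
    fi
}
```

For example, host_reachable 8.8.8.8 || echo "check the uplink" runs the second command only when no reply comes back. Some hosts and firewalls drop ICMP, so a failure here is a hint rather than proof that the host is down.
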
  • mtr combines ping with traceroute and prints the route that packets take to a remote host, along with response times and loss percentages for each hop.
  • mtr -c 5 -r 8.8.8.8 will send five packets to Google’s DNS servers and report back the intermediate routers, with details about response times and packet loss along the way.

$ mtr -c 5 -r  8.8.8.8
HOST: localhost                 Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- router2-dal.linode.com     0.0%     5    0.9   0.7   0.6   0.9   0.2
  2.|-- ae2.car02.dllstx2.network  0.0%     5    0.3   6.3   0.3  30.5  13.5
  3.|-- po102.dsr01.dllstx2.netwo  0.0%     5    1.1   0.6   0.5   1.1   0.3
  4.|-- po21.dsr01.dllstx3.networ  0.0%     5    1.3   2.5   0.6   8.0   3.1
  5.|-- ae17.bbr02.eq01.dal03.net  0.0%     5    0.5   0.6   0.5   0.8   0.1
  6.|-- ae7.bbr01.eq01.dal03.netw  0.0%     5    0.5   0.6   0.5   0.7   0.1
  7.|-- 25.10.6132.ip4.static.sl-  0.0%     5    0.6   0.9   0.6   2.2   0.7
  8.|-- 216.239.50.89              0.0%     5    0.5   0.6   0.5   0.8   0.1
  9.|-- 64.233.174.69              0.0%     5    1.0   0.8   0.8   1.0   0.1
 10.|-- google-public-dns-a.googl  0.0%     5    0.8   0.8   0.7   0.8   0.0
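
On longer routes, the hops that drop packets can be filtered out of the report. The following is a sketch that assumes the report layout shown above (hop marker, host, Loss%, and so on); the function name lossy_hops is hypothetical.

```shell
#!/bin/sh
# List hops from an mtr report (mtr -r) that dropped packets.
# Assumes the layout shown above: hop marker, host, Loss%, Snt, ...
# The function name lossy_hops is an illustrative choice.
lossy_hops() {
    awk '
        $1 ~ /^[0-9]+\.\|--$/ {            # hop rows start with "N.|--"
            loss = $3
            sub(/%$/, "", loss)            # strip the trailing percent sign
            if (loss + 0 > 0)
                printf "hop %d (%s): %s%% loss\n", $1 + 0, $2, loss
        }
    '
}
```

It would be used as mtr -c 5 -r 8.8.8.8 | lossy_hops. Loss that appears at one intermediate hop but not at the hops after it often reflects that router limiting its ICMP replies rather than a real problem on the path.
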
  • netstat displays information about network connections, routing tables, and interfaces. While it is a very sophisticated tool which has many different possible applications, it provides an easy way to display a few important bits of data:
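  • netstat -nlp displays the sockets that processes are currently listening on, using numeric addresses and ports. Running it with sudo shows the PID and program name for sockets owned by other users as well.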

$ sudo netstat -nlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN      2858/mysqld
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      2665/sshd
tcp        0      0 0.0.0.0:25              0.0.0.0:*               LISTEN      3133/master
tcp6       0      0 :::8080                 :::*                    LISTEN      3160/apache2
tcp6       0      0 :::22                   :::*                    LISTEN      2665/sshd
tcp6       0      0 :::25                   :::*                    LISTEN      3133/master
tcp6       0      0 :::443                  :::*                    LISTEN      3160/apache2
udp        0      0 0.0.0.0:68              0.0.0.0:*                           2633/dhclient3
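
A common follow-up question is "what is listening on port N?". The sketch below filters the listing by port. It assumes the netstat layout shown above, with the local address in the fourth column and the PID/program name last; the function name listener_on_port is hypothetical.

```shell
#!/bin/sh
# Show which program, if any, is listening on a given port.
# Reads netstat -nlp output on stdin; exits non-zero if nothing matches.
# The function name listener_on_port is an illustrative choice.
listener_on_port() {
    port="$1"
    awk -v port="$port" '
        $1 ~ /^(tcp|udp)/ {
            n = split($4, parts, ":")      # local address; the port is the last piece
            if (parts[n] == port) {
                found = 1
                print $1, $4, $NF          # protocol, local address, PID/program
            }
        }
        END { exit found ? 0 : 1 }
    '
}
```

For example, sudo netstat -nlp | listener_on_port 8080 prints the matching sockets, and the exit status can be tested in a script. Program names that contain spaces would only show their last word, since the filter prints the final whitespace-separated field.
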
  • netstat -rn displays the current routing table.

$ netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         173.255.206.1   0.0.0.0         UG        0 0          0 eth0
173.255.206.0   0.0.0.0         255.255.255.0   U         0 0          0 eth0
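
The line I usually care about in the routing table is the default route: destination 0.0.0.0 with the G (gateway) flag. As a small sketch, assuming the column layout shown above, the hypothetical function below extracts it.

```shell
#!/bin/sh
# Print the default gateway and its interface from netstat -rn output.
# Exits non-zero when no default route is present.
# The function name default_gateway is an illustrative choice.
default_gateway() {
    awk '
        $1 == "0.0.0.0" && $4 ~ /G/ {      # default destination, gateway flag set
            print $2, "via", $NF           # gateway address and interface
            found = 1
        }
        END { exit found ? 0 : 1 }
    '
}
```

It would be used as netstat -rn | default_gateway. A missing default route is a frequent reason that local hosts respond while everything beyond the local network does not.
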

Conclusion

The examples above show some of the most common ways these utilities can be used to perform diagnostics on systems based on disk utilization, memory and CPU utilization, and network activity and connectivity. Some of these utilities (particularly netstat) are quite powerful, and could be used to display or diagnose much more than shown in the examples above. Past troubleshooting experience, and the specific histories of given systems, guide the particular ways that I deploy these tools to assist in the investigation and resolution of system issues.

Conversation
  • Chris G. Sellers says:

    Justin, these are staple tools and make for a great list. Thanks for enumerating them. May I also offer a few next-level tools that could be useful for developers or devops folks?

    nethogs: a nifty top-like network tool that tallies network usage per interface and attributes it to individual processes. This tool could be useful to see if you have runaway network connections or performance issues, or to measure how much bandwidth your app is using and what level of computer/server/instance/etc. you may need to allocate to it. It is available in Debian/Ubuntu via apt and for other Linux distributions as well (available on SourceForge).

    netcat: netcat (or nc) is a tool that allows the creation of a TCP (or UDP) connection to a given address/IP on a specific port. This can be useful to verify network connectivity, firewall rules, or that routes are working. It can also be used to test things like SMTP, HTTP, or other protocol connections, similar to the way Telnet may be used. Netcat is also capable of performing a ‘cat’ of information across a network stream, such as transferring data from one system to another over a network link. It can act as a ‘client’ and/or a ‘server’, so it can be useful to simulate a back or front end for an app or service that you are creating. I know of a situation where data was transferred from a raw disk on one system to a replacement system via netcat and tar, instead of using ssh, scp, ftp, or another transfer method. It may prove faster due to less overhead.
    (ex. nc -v mydevsite.local.name 8080) Netcat is usually included with most Unix-like operating systems, and may be called nc or netcat.

    I hope these are helpful to those who visit.
