# Debugging GNU/Linux
Here are notes on system level commands I
find useful for diagnosing problems with a
process.
See more sections
on GNU/Linux, including NFS and routing
problems.
``gdb'' is well-documented, so I won't
cover it here. Try ``ddd'' as a friendly
wrapper for ``gdb''.
** Books **
Check out "Advanced Linux Programming"
http://www.advancedlinuxprogramming.com/ from
http://www.codesourcery.com/ .
Find some great suggestions at
http://people.redhat.com/alikins/system_tuning.html
** Aliases for ``ps'' **
Many prefer interactive tools like ``top''
for watching running programs, but the
command-line tool ``ps'' is actually more
flexible and less intrusive, if you construct
some adequate aliases. Here are two of my
favorites. The first sorts by CPU usage, and
the second by memory.
=>
function psc {
ps --cols=1000 --sort='-%cpu,uid,pgid,ppid,pid' -e \
-o user,pid,ppid,pgid,stime,stat,wchan,time,pcpu,pmem,vsz,rss,sz,args |
sed 's/^/ /' | less
}
<=
=>
function psm {
ps --cols=1000 --sort='-vsz,uid,pgid,ppid,pid' -e \
-o user,pid,ppid,pgid,stime,stat,wchan,time,pcpu,pmem,vsz,rss,sz,args |
sed 's/^/ /' | less
}
<=
Programs swapped to disk are shown in
brackets without arguments.
The STAT column shows the process status:
=>
D uninterruptible sleep (usually IO)
R runnable (on run queue)
S sleeping
T traced or stopped
Z a defunct ("zombie") process
W has no resident pages
< high-priority process
N low-priority task
L has pages locked into memory (for real-time and custom IO)
<=
The WCHAN column shows the kernel function in
which a sleeping process is waiting -- the
address is mapped to a symbol name
according to the file
``/boot/System.map-`uname -r`''.
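To pick out processes stuck in one state, you
can filter ``ps'' output on the STAT column.
The following is a minimal sketch, not a
standard tool: the name ``psstate'' is made up
for illustration, and it assumes STAT is the
second column of its input.

```shell
# Hypothetical helper: keep the header line plus every process whose
# STAT field (assumed to be column 2) starts with the given letter.
psstate() {
    awk -v state="$1" 'NR == 1 || substr($2, 1, 1) == state'
}

# Example: show processes in uninterruptible sleep and where they wait.
# ps -e -o pid,stat,wchan:32,args | psstate D
```

A process that stays in state D for long
is usually waiting on a disk or an NFS server.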
** CPU activity **
To see a history of CPU usage install the
``sysstat'' package, which contains ``sar''.
The ``sar -A'' output distinguishes user CPU
from system CPU, which is consumed for
operations like I/O, swapping, and error
handling. You should investigate any
unusually high system CPU usage. ``strace''
will confirm whether a specific program is
making too many system requests.
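One way to check for excessive system calls
is the summary mode of ``strace''. This is
only a sketch: the function name ``syscount''
and the 30-second default are arbitrary
choices, and it assumes you have permission
to trace the process.

```shell
# Hypothetical helper: attach to a running process for a fixed time,
# then print strace's table of system call counts and times.
syscount() {
    if [ $# -lt 1 ]; then
        echo 'Usage: syscount PID [seconds]' >&2
        return 1
    fi
    # -c: summarize calls instead of printing each one.
    # -f: include the threads of the process and any children it forks.
    # timeout sends SIGINT, after which strace should detach and
    # print its summary.
    timeout -s INT "${2:-30}" strace -c -f -p "$1"
}
```

A call that dominates the count or the time
column is a good place to start looking.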
To see what CPUs you have, type ``cat
/proc/cpuinfo''.
You may want to install and display
``xosview'' for a quick glance at system
activity. If you use the gnome desktop,
try the ``System Monitor'' applet that
comes preinstalled, or upgrade to ``gkrellm''.
See the alias ``psc'' in the previous section.
** The /proc filesystem **
To see the status of a running process look
at ``/proc/PID'' where PID is the process ID
of the thread of interest. ``/proc/self''
will point to the current process. See
memory and CPU usage, file descriptors,
environment, working directory, mapped shared
objects, the command line, and a symbolic
link to the executable.
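A few of these entries can be gathered into
one short report. The function below is a
sketch, and ``procsummary'' is a made-up name.
It assumes a Linux ``/proc'' and that you own
the process, since some links are unreadable
for other users' processes.

```shell
# Hypothetical helper: print a few facts about one process from /proc.
# With no argument it reports on the current shell.
procsummary() {
    dir=/proc/${1:-$$}
    # Selected fields from the status file (kernel threads lack Vm* lines).
    grep -E '^(Name|State|VmSize|VmRSS|Threads):' "$dir/status"
    # cmdline separates arguments with NUL bytes; show them with spaces.
    printf 'Cmdline:\t%s\n' "$(tr '\0' ' ' < "$dir/cmdline")"
    printf 'Cwd:\t%s\n' "$(readlink "$dir/cwd")"
    printf 'Exe:\t%s\n' "$(readlink "$dir/exe")"
    printf 'Open fds:\t%s\n' "$(ls "$dir/fd" | wc -l)"
}
```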
The command ``procinfo -a'' is handy for
certain information from the ``/proc''
filesystem.
See
http://www.redhat.com/docs/manuals/linux/RHL-8.0-Manual/ref-guide/ch-proc.html
for more info.
** Getting Process IDs **
Each thread is managed by the kernel as a
separate process with shared memory. You
often need the process IDs of all threads.
If you are running only one instance of a
particular executable, then you can get the
PIDs with ``pidof program''.
Otherwise you may want to identify some
string that appears as a command-line
argument only for your particular process.
Write a script like the following to return a
list of the PIDs.
=>
if [ $# -ne 1 ] ; then echo 'Usage: greppids "unique string"' ; exit 1 ; fi
UNIQUE_STRING="$1"
MYUSERNAME=`whoami`
ps --cols=1000 -e -o pid,user,args | grep " $MYUSERNAME " | sed 's/$/ /' |
grep "$UNIQUE_STRING" | grep -v grep | awk '{print $1}' | sort -n | paste -s -d" " -
<=
Use ``kill -0'' to see if a process ID or
process group ID is active.
=>
$ if kill -0 $PID ; then echo "Yes, process $PID is running" ; fi
$ if kill -0 -$PGID ; then echo "Yes, process group ID $PGID is running" ; fi
<=
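The same test can be used to wait for a
process to go away. The function below is a
sketch with a made-up name and an arbitrary
default timeout. It is only reliable for your
own processes, because ``kill -0'' also fails
when you lack permission to signal a process.

```shell
# Hypothetical helper: poll once a second until a process disappears.
# Returns 0 once the process is gone, 1 if it is still there after
# the given number of polls.
wait_gone() {
    pid=$1
    tries=${2:-10}          # arbitrary default: about ten seconds
    while kill -0 "$pid" 2>/dev/null; do
        tries=$((tries - 1))
        [ "$tries" -le 0 ] && return 1
        sleep 1
    done
    return 0
}

# Example use:
# kill $PID ; wait_gone $PID 30 || kill -9 $PID
```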
** Resource limits **
See your resource limits with ``ulimit -a''
and kernel parameters with ``sysctl -a'', and
change them with other options. Within bash,
type ``help ulimit''.
Set user limits permanently inside
``/etc/security/limits.conf''.
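As an illustration, the lines below show the
soft and hard limits on open files and read
the limits of a running process. This is a
sketch: ``proclimits'' is a made-up name, the
``/proc/PID/limits'' file is assumed to exist
(it does on reasonably recent kernels), and
the user name in the comment is hypothetical.

```shell
# Soft and hard limits on open files for the current shell.
ulimit -S -n
ulimit -H -n

# Hypothetical helper: limits in force for an already running process.
proclimits() {
    cat "/proc/${1:-$$}/limits"
}

# Lines in /etc/security/limits.conf take the form
#   <domain> <type> <item> <value>
# for example, for a hypothetical user "alice":
#   alice  soft  nofile  4096
#   alice  hard  nofile  8192
```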
** Shared library dependencies **
``ldd -v filename'' will print the shared
libraries required by a program or another
shared library, and will show where these
libraries are found on your file system for
the current ``LD_LIBRARY_PATH''. Your system
will also search directories listed in
``/etc/ld.so.conf'', which is reread when you
run ``ldconfig''. Print the undefined external
symbols of a file with
``nm -C -u -g filename''.
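When a program fails to start, the libraries
that could not be resolved are usually the
interesting part of the ``ldd'' output. A
small filter, sketched here under a made-up
name, prints just those.

```shell
# Hypothetical helper: print the names of libraries that ldd could
# not find. Reads ldd output on standard input.
ldd_missing() {
    awk '/=> not found/ { print $1 }'
}

# Example use (the path is a placeholder):
# ldd /path/to/program | ldd_missing
```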
** System activity **
The best overview of recent system activity
(usually 12 hours) is from ``sar -A''.
Use ``strace'' to monitor all communication
between a thread and the kernel. Use
``strace'' on any process that is running
unusually slowly or is causing high system
usage. Hung processes can often be diagnosed
this way. Attach an ``strace'' to each
thread of an executable ``foo.exe'' with a
script like the following:
=>
P=`pidof foo.exe`
for n in $P ; do
echo strace -p$n
( strace -v -p$n 2>&1 | sed -e "s/^/$n\| /" )&
done
sleep 30
killall strace
<=
``addr2line'' is a handy utility for
converting program addresses into file names
and line numbers.
** System failures **
Look for possible system failures with ``tac
/var/log/messages | less'' or ``dmesg''.
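These logs can be long, so it may help to
filter for messages that often accompany
serious trouble. The pattern list below is
only a starting point I am assuming here, not
an exhaustive set, and ``trouble'' is a
made-up name.

```shell
# Hypothetical filter: keep log lines that mention common failures.
trouble() {
    grep -i -E 'out of memory|oom-killer|segfault|i/o error|call trace|panic'
}

# Example use:
# dmesg | trouble
# tac /var/log/messages | trouble | less
```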
** Files and I/O activity **
Use ``fuser'' or ``lsof'' to find out what
processes are using a file.
Use ``lsof'' to find out what files are being
used by a process, as
=>
$ lsof -p PID
$ lsof -c program
<=
See which process is using a tcp port
(say 8080) with
=>
$ fuser -n tcp 8080
or
$ fuser 8080/tcp
<=
To investigate I/O activity, type ``iostat''
and ``iostat -x'' to see which devices are
being used. ``iostat'' is also from the
``sysstat'' package.
Check all the inode times and IDs with
``stat filename''. Check your user's IDs
with ``id -a [username]''.
Check local disk speed with ``hdparm -Tt
/dev/hda'' where the device is the one you
see mounted with ``df''. You may want to
change some default options (``hdparm
/dev/hda'') for improved performance. For
an IDE disk, look at changes like ``hdparm -c3
-m16 /dev/hda''. Test riskier changes in
single-user mode in case you hang.
** Memory **
Try the following commands to see whether you
are swapping memory or not. Look at the man
pages first.
=>
$ cat /proc/meminfo
$ vmstat 5
$ free -s 5
$ procinfo -n5 -f
$ top
<=
Watch the ``si'' and ``so'' columns of
``vmstat 5'' as you run various processes.
The presence of swapped memory is not
necessarily bad. In fact, idle process
memory (like KDE daemons) SHOULD be swapped
out to disk, even when large amounts of
memory are free. You just don't want to see
them swap very often. Usually, the swapping
does not occur until another process requires
the space. Then the idle process may stay
swapped out indefinitely.
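To notice only the moments when swapping
actually happens, you can filter the ``vmstat''
output. This sketch assumes the default
procps column layout, where ``si'' and ``so''
are the seventh and eighth columns;
``swapping'' is a made-up name.

```shell
# Hypothetical filter: pass the two vmstat header lines, then only the
# samples where pages were swapped in (si) or out (so).
swapping() {
    awk 'NR <= 2 { print; next } ($7 + $8) > 0'
}

# Example use:
# vmstat 5 | swapping
```

Keep in mind that the first sample ``vmstat''
prints is an average since boot, not a
current reading.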
** Memory leaks **
To find memory leaks and errors in C code
(using malloc and free), try ``valgrind''
from http://developer.kde.org/~sewardj/
Try ``mcheck.h'' with ``mtrace'' that comes
with glibc.
Try linking ``ccmalloc'' from
http://cs.ecs.baylor.edu/~donahoo/tools/ccmalloc/
or Electric Fence from
http://perens.com/FreeSoftware/
** Shared memory, message queues, semaphores **
``ipcs'' will show the status of shared
memory, message queues, and semaphores.
** Network activity **
Check for network collisions and dropped
packets with ``netstat -i'', ``netstat -s'',
and ``ifconfig''. If the counts are high,
your network may be saturated.
``usernet'' will help you watch for surges in
network activity.
Look for permanently lost packets on the disk
server with
=>
$ head -2 /proc/net/snmp | cut -d' ' -f17
ReasmFails
2
<=
If you can see this number increasing during
network activity, then you are losing
packets.
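The field position in ``/proc/net/snmp'' can
differ between kernels, so looking the counter
up by name is more robust than counting
columns. The function below is a sketch with a
made-up name. It assumes the usual layout of
that file: a line of names followed by a line
of values, both starting with the same tag.

```shell
# Hypothetical helper: print one named counter from /proc/net/snmp.
# Usage: snmp_counter TAG NAME [FILE], for example snmp_counter Ip ReasmFails
snmp_counter() {
    awk -v tag="$1:" -v name="$2" '
        $1 != tag { next }
        !col {                      # first matching line holds the names
            for (i = 2; i <= NF; i++) if ($i == name) col = i
            if (!col) exit          # no such counter
            next
        }
        { print $col; exit }        # second matching line holds the values
    ' "${3:-/proc/net/snmp}"
}

# Example: sample the counter twice, a minute apart.
# snmp_counter Ip ReasmFails ; sleep 60 ; snmp_counter Ip ReasmFails
```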
You can reduce the number of lost packets by
increasing the buffer size for fragmented
packets to double the default:
=>
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_low_thresh
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh
<=
** Sockets **
See what programs are using a given socket
with ``netstat -pta'' or ``lsof -i''. See how
many sockets are in each state with
=>
$ netstat -tan | grep "^tcp" | cut -c 68- | sort | uniq -c | sort -n
<=
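The character offset given to ``cut'' depends
on the column widths your ``netstat'' prints.
A variant that picks the state by field may be
less fragile. This sketch assumes the state is
the sixth field of each ``tcp'' line, and
``tcp_states'' is a made-up name.

```shell
# Hypothetical filter: count TCP sockets per state from netstat output
# read on standard input, smallest count first.
tcp_states() {
    awk '/^tcp/ { n[$6]++ } END { for (s in n) print n[s], s }' | sort -n
}

# Example use:
# netstat -tan | tcp_states
```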
See how many sockets are active with ``cat
/proc/net/sockstat''.
See what process ids are using a specific TCP
socket with
=>
$ fuser -n tcp 5006 | sed -e 's/.*: *//'
or
$ lsof -i tcp:5006
<=
** NFS activity **
Look for high NFS failure rates with
``nfsstat -o rpc''. If more than 3% of calls
are retransmitted, then there are problems
with the network or NFS server. If packets
are getting lost on the network then it may
help to lower ``rsize'' and ``wsize'' mount
parameters (read and write block sizes) in
``/etc/fstab''. If the server is responding
too slowly, then either replace the server or
increase the ``timeo'' mount parameter. See
my separate
section on NFS.
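To compare against the 3% figure, the
retransmission rate can be computed from the
client counters. This is a sketch under an
assumption: that the ``rpc'' line of
``/proc/net/rpc/nfs'' lists total calls
followed by retransmissions. The file exists
only on an NFS client, and ``nfs_retrans_pct''
is a made-up name.

```shell
# Hypothetical helper: percentage of client RPC calls retransmitted.
# Reads /proc/net/rpc/nfs unless another file is given.
nfs_retrans_pct() {
    awk '$1 == "rpc" {
        if ($2 > 0) printf "%.2f\n", 100 * $3 / $2
        else print "0.00"
    }' "${1:-/proc/net/rpc/nfs}"
}
```

The counters accumulate from boot, so compare
two readings taken some time apart to judge
current conditions.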
** Handy commands **
Type ``kill -l'' to get a list of signals and
their numbers.
Bill Harlan, 2002-2005