# Debugging Gnu/Linux

Here are notes on system-level commands I find useful for diagnosing problems with a process. See more sections on Gnu/Linux, including NFS and routing problems.

``gdb'' is well-documented, so I won't cover it here. Try ``ddd'' as a friendly wrapper for ``gdb''.

** Books **

Check out "Advanced Linux Programming" http://www.advancedlinuxprogramming.com/ from http://www.codesourcery.com/ .

Find some great suggestions at http://people.redhat.com/alikins/system_tuning.html

** Aliases for ``ps'' **

Many prefer interactive tools like ``top'' for watching running programs, but the command-line tool ``ps'' is actually more flexible and less intrusive, if you construct some adequate aliases. Here are my favorites. The first, ``psc'', sorts by CPU usage, and the second, ``psm'', by virtual memory size.

=>
function psc {
  ps --cols=1000 --sort='-%cpu,uid,pgid,ppid,pid' -e \
     -o user,pid,ppid,pgid,stime,stat,wchan,time,pcpu,pmem,vsz,rss,sz,args |
     sed 's/^/ /' | less
}
<=

=>
function psm {
  ps --cols=1000 --sort='-vsz,uid,pgid,ppid,pid' -e \
     -o user,pid,ppid,pgid,stime,stat,wchan,time,pcpu,pmem,vsz,rss,sz,args |
     sed 's/^/ /' | less
}
<=

Programs swapped to disk are shown in brackets without arguments.

The STAT column shows the process status:

=>
D   uninterruptible sleep (usually IO)
R   runnable (on run queue)
S   sleeping
T   traced or stopped
Z   a defunct ("zombie") process
W   has no resident pages
<   high-priority process
N   low-priority task
L   has pages locked into memory (for real-time and custom IO)
<=

The WCHAN column shows what a sleeping process is waiting on: the name of the kernel function in which it sleeps, mapped to ASCII according to the file ``/boot/System.map-`uname -r`''.

** CPU activity **

To see a history of CPU usage, install the ``sysstat'' package, which contains ``sar''. The ``sar -A'' output distinguishes user CPU from system CPU, which is consumed for operations like I/O, swapping, and error handling. You should investigate any unusually high system CPU usage.
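
If ``sar'' is not installed, a rough version of the same user/system split can be computed from two snapshots of ``/proc/stat''. The following is only a sketch under my own assumptions: the function name ``cpu_split'' and the snapshot files are made up for illustration, nice time is counted as user time, and I/O wait is counted as idle.

```shell
# cpu_split BEFORE AFTER
# BEFORE and AFTER are files holding two snapshots of /proc/stat.
# Prints the share of elapsed CPU time spent in user, system, and idle.
# (Hypothetical helper, not part of sysstat or procps.)
cpu_split() {
    awk '
        $1 == "cpu" {
            # Aggregate line: cpu user nice system idle iowait irq softirq steal ...
            user = $2 + $3          # nice time counted as user time
            sys  = $4
            idle = $5 + $6          # iowait counted as idle
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            if (seen) {
                dt = total - t0
                if (dt > 0)
                    printf "user %.1f%% system %.1f%% idle %.1f%%\n",
                        100 * (user - u0) / dt,
                        100 * (sys - s0) / dt,
                        100 * (idle - i0) / dt
            }
            u0 = user; s0 = sys; i0 = idle; t0 = total; seen = 1
        }
    ' "$1" "$2"
}

# Usage on a Linux system with /proc mounted:
#   cat /proc/stat > /tmp/s1 ; sleep 5 ; cat /proc/stat > /tmp/s2
#   cpu_split /tmp/s1 /tmp/s2
```

The percentages are averages over the interval between the two snapshots, so a longer ``sleep'' smooths out short bursts.
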
``strace'' will confirm whether a specific program is making too many system calls.

To see what CPUs you have, type ``cat /proc/cpuinfo''.

You may want to install and display ``xosview'' for a quick glance at system activity. If you use the GNOME desktop, try the ``System Monitor'' applet that comes preinstalled, or upgrade to ``gkrellm''.

See the alias ``psc'' in the previous section.

** The /proc filesystem **

To see the status of a running process, look at ``/proc/PID'' where PID is the process ID of the thread of interest. ``/proc/self'' will point to the current process. There you can see memory and CPU usage, file descriptors, environment, working directory, mapped shared objects, the command line, and a symbolic link to the executable.

The command ``procinfo -a'' is handy for certain information from the ``/proc'' filesystem.

See http://www.redhat.com/docs/manuals/linux/RHL-8.0-Manual/ref-guide/ch-proc.html for more info.

** Getting Process IDs **

Each thread is managed by the kernel as a separate process with shared memory. You often need the process IDs of all threads.

If you are running only one instance of a particular executable, then you can get the PIDs with ``pidof program''.

Otherwise you may want to identify some string that appears as a command-line argument only for your particular process. Write a script like the following to return a list of the PIDs.

=>
# greppids: list the PIDs of my processes whose command line contains a string
if [ $# -ne 1 ] ; then
  echo 'Usage: greppids "unique string"'
  exit 1
fi
UNIQUE_STRING="$1"
MYUSERNAME=`whoami`
ps --cols=1000 -e -o pid,user,args |
  grep " $MYUSERNAME " |
  sed 's/$/ /' |
  grep "$UNIQUE_STRING" |
  grep -v grep |
  awk '{print $1}' |
  sort -n |
  paste -s -d" " -
<=

Use ``kill -0'' to see if a process ID or process group ID is active.

=>
$ if kill -0 $PID ; then echo "Yes, process $PID is running" ; fi
$ if kill -0 -$PGID ; then echo "Yes, process group ID $PGID is running" ; fi
<=

** Resource limits **

See your resource limits with ``ulimit -a'' (per-process limits for the current shell) and ``sysctl -a'' (kernel-wide settings), and change them with the other options of each command.
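
As a small illustration of those other options, the sketch below prints the soft and hard value of one limit and then lowers the soft limit inside a subshell. The helper name ``show_limit'' and the value 64 are arbitrary choices of mine; ``-S'' and ``-H'' are the soft and hard selectors of the shell's ``ulimit'' builtin.

```shell
# show_limit FLAG
# Print the soft and hard values of one resource,
# e.g. FLAG=n (open files) or FLAG=u (user processes).
# (Hypothetical helper name, chosen for this example.)
show_limit() {
    printf 'soft=%s hard=%s\n' "$(ulimit -S -"$1")" "$(ulimit -H -"$1")"
}

# Limits of the current shell.
show_limit n

# Lower the soft limit on open files in a subshell only;
# the parent shell keeps its own limit.
( ulimit -S -n 64 && show_limit n )
```

A soft limit can be raised again up to the hard limit; raising the hard limit itself requires root.
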
Within bash, type ``help ulimit''. Set user limits permanently inside ``/etc/security/limits.conf''.

** Shared library dependencies **

``ldd -v filename'' will print shared libraries required by a program or another shared library, and will show where these libraries are found on your file system for the current ``LD_LIBRARY_PATH''. Your system will also search directories listed in ``/etc/ld.so.conf'', which is reread after typing ``ldconfig''.

Print the undefined external symbols that a file needs with ``nm -C -u -g filename''.

** System activity **

The best overview of recent system activity (usually 12 hours) is from ``sar -A''.

Use ``strace'' to monitor all communication between a thread and the kernel. Use ``strace'' on any process that is running unusually slowly or is causing high system usage. Hung processes can often be diagnosed this way.

Attach an ``strace'' to each thread of an executable ``foo.exe'' with a script like the following:

=>
P=`pidof foo.exe`
for n in $P ; do
  echo strace -p$n
  # prefix each line of output with the PID being traced
  ( strace -v -p$n 2>&1 | sed -e "s/^/$n\| /" )&
done
sleep 30
killall strace
<=

``addr2line'' is a handy utility for converting program addresses into file names and line numbers.

** System failures **

Look for possible system failures with ``tac /var/log/messages | less'' or ``dmesg''.

** Files and I/O activity **

Use ``fuser'' or ``lsof'' to find out what processes are using a file.

Use ``lsof'' to find out what files are being used by a process, as

=>
$ lsof -p PID
$ lsof -c program
<=

See which process is using a TCP port (say 8080) with

=>
$ fuser -n tcp 8080
or
$ fuser 8080/tcp
<=

To investigate I/O activity, type ``iostat'' and ``iostat -x'' to see which devices are being used. ``iostat'' is also from the ``sysstat'' package.

Check all the inode times and IDs with ``stat filename''.

Check a user's IDs with ``id -a [username]''.

Check local disk speed with ``hdparm -Tt /dev/hda'' where the device is the one you see mounted with ``df''.
You may want to change some default options (``hdparm /dev/hda'' shows the current ones) for improved performance. For IDE disks, look at changes like ``hdparm -c3 -m16 /dev/hda''. Test riskier changes in single-user mode in case the system hangs.

** Memory **

Try the following commands to see whether you are swapping memory or not. Look at the man pages first.

=>
$ cat /proc/meminfo
$ vmstat 5
$ free -s 5
$ procinfo -n5 -f
$ top
<=

Watch the ``si'' and ``so'' columns of ``vmstat 5'' (memory swapped in from disk and out to disk) as you run various processes.

The presence of swapped memory is not necessarily bad. In fact, idle process memory (like KDE daemons) SHOULD be swapped out to disk, even when large amounts of memory are free. You just don't want to see them swap very often. Usually, the swapping does not occur until another process requires the space. Then the idle process may stay swapped out indefinitely.

** Memory leaks **

To find memory leaks and errors in C code (using malloc and free), try ``valgrind'' from http://developer.kde.org/~sewardj/

Try ``mcheck.h'' with ``mtrace'' that comes with glibc.

Try linking ``ccmalloc'' from http://cs.ecs.baylor.edu/~donahoo/tools/ccmalloc/ or Electric Fence from http://perens.com/FreeSoftware/

** Shared memory, message queues, semaphores **

``ipcs'' will show the status of shared memory, message queues, and semaphores.

** Network activity **

Check for network collisions and dropped packets with ``netstat -i'', ``netstat -s'', and ``ifconfig''. Your network may be saturated. ``usernet'' will help you watch for surges in network activity.

Look for permanently lost packets on the disk server with

=>
$ head -2 /proc/net/snmp | cut -d' ' -f17
ReasmFails
2
<=

If you can see this number increasing during network activity, then you are losing packets.
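
To check whether the counter is actually growing, you can sample it twice and print the difference. This is a sketch with names of my own choosing (``reasm_fails'' and ``reasm_delta''); it finds the ``ReasmFails'' column by its header name instead of assuming it is field 17.

```shell
# reasm_fails [FILE]
# Print the Ip ReasmFails counter from FILE (default /proc/net/snmp).
# (Hypothetical helper; the column is looked up in the "Ip:" header line.)
reasm_fails() {
    awk '
        $1 == "Ip:" && !col {
            # First "Ip:" line is the header: remember the column number.
            for (i = 2; i <= NF; i++) if ($i == "ReasmFails") col = i
            next
        }
        # Second "Ip:" line holds the values.
        $1 == "Ip:" { print $col; exit }
    ' "${1:-/proc/net/snmp}"
}

# reasm_delta SECONDS
# Sample the counter twice, SECONDS apart, and print how much it grew.
reasm_delta() {
    before=$(reasm_fails)
    sleep "$1"
    after=$(reasm_fails)
    echo $((after - before))
}

# Usage on a Linux system, while the network is busy:
#   reasm_delta 30
```

A result that stays at 0 over several busy intervals suggests that fragment reassembly is not where the packets are going missing.
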
You can reduce the number of lost packets by increasing the buffer size for fragmented packets to double the default:

=>
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_low_thresh
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh
<=

** Sockets **

See what programs are using a given socket with ``netstat -pta'' or ``lsof -i''.

See how many sockets are in each state with

=>
$ netstat -tan | grep "^tcp" | cut -c 68- | sort | uniq -c | sort -n
<=

See how many sockets are active with ``cat /proc/net/sockstat''.

See what process IDs are using a specific TCP socket with

=>
$ fuser -n tcp 5006 | sed -e 's/.*: *//'
or
$ lsof -i tcp:5006
<=

** NFS activity **

Look for high NFS failure rates with ``nfsstat -o rpc''. If more than 3% of calls are retransmitted, then there are problems with the network or NFS server.

If packets are getting lost on the network, then it may help to lower the ``rsize'' and ``wsize'' mount parameters (read and write block sizes) in ``/etc/fstab''.

If the server is responding too slowly, then either replace the server or increase the ``timeo'' mount parameter.

See my separate section on NFS.

** Handy commands **

Type ``kill -l'' to get a list of signals and their numbers.

Bill Harlan, 2002-2005