# NFS for Clusters

Here are notes I've found useful for configuring reliable shared disk on a linux cluster. See more sections on Gnu/Linux. I've extracted most of this information from "Linux NFS and Automounter Administration" by Erez Zadok, published by Sybex.

** Diagnosing problems **

Increasingly I see a single large RAID disk server being clobbered by 16 or 32 linux clients at a time. Here are some parameters to check.

* Check your version *

First make sure you have an up-to-date copy of NFS installed with
=>
$ rpm -q nfs-utils
or
$ rpm -q -f /usr/sbin/rpc.nfsd
<=
Check dependencies (like ``portmap'') with ``rpm -q -R nfs-utils'' and check their versions as well. See what files are installed with ``rpm -q -l nfs-utils''.

See that your services are running with ``rpcinfo -p [hostname]''. On a client machine look for ``portmapper'', ``nlockmgr'' and possibly ``amd'' or ``autofs''. A server will also run ``mountd'' and ``nfs''.

* Saturated network? *

First exercise your disk with your own code or with a simple write operation like
=>
$ time dd if=/dev/zero of=testfile bs=4k count=8182
8182+0 records in
8182+0 records out

real    0m8.829s
user    0m0.000s
sys     0m0.160s
<=
Writing files should be enough to test network saturation. When profiling reads instead of writes, call ``umount'' and ``mount'' to flush caches, or the read will seem instantaneous.
=>
$ cd /
$ umount /mnt/test
$ mount /mnt/test
$ cd /mnt/test
$ dd if=testfile of=/dev/null bs=4k count=8192
<=
Check for failures on a client machine with
=>
$ nfsstat -c
or
$ nfsstat -o rpc
<=
If more than 3% of calls are retransmitted, then there are problems with the network or NFS server. Look for NFS failures on a shared disk server with
=>
$ nfsstat -s
or
$ nfsstat -o rpc
<=
It is not unreasonable to expect 0 badcalls. You should have very few "badcalls" out of the total number of "calls."

* Lost packets *

NFS must resend packets that are lost by a busy host. Look for permanently lost packets on the disk server with
=>
$ head -2 /proc/net/snmp | cut -d' ' -f17
ReasmFails
2
<=
If you see this number increasing during NFS activity, then you are losing packets. You can reduce the number of lost packets on the server by increasing the buffer size for fragmented packets.
=>
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_low_thresh
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh
<=
This is about double the default.

* Server threads *

See if your server is receiving too many overlapping requests with
=>
$ grep th /proc/net/rpc/nfsd
th 8 594 3733.140 83.850 96.660 0.000 73.510 30.560 16.330 2.380 0.000 2.150
<=
The first number is the number of threads available for servicing requests, and the second number is the number of times that all threads have been needed. The remaining 10 numbers are a histogram showing how many seconds a certain fraction of the threads have been busy, starting with less than 10% of the threads and ending with more than 90% of the threads. If the last few numbers have accumulated a significant amount of time, then your server probably needs more threads. Increase the number of threads used by the server to 16 by changing ``RPCNFSDCOUNT=16'' in ``/etc/rc.d/init.d/nfs''.

* Invisible or stale files *

If separate clients share information through NFS disks, then you have special problems. You may delete a file on one client node and cause a different client to get a stale file handle. Different clients may have cached inconsistent versions of the same file. A single client may even create a file or directory and be unable to see it immediately. If these problems sound familiar, then you may want to adjust NFS caching parameters and code multiple attempts in your applications.
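For example, an application that expects a file produced on another node can retry for a short while before giving up. Here is a minimal shell sketch of that idea; the path ``/mnt/test/ready.flag'', the retry count, and the sleep interval are made up for illustration.
=>
# Hypothetical sketch: wait for a file written by another client to
# become visible, retrying a few times before giving up.
file=/mnt/test/ready.flag
for attempt in 1 2 3 4 5; do
    if [ -f "$file" ]; then
        echo "found $file"
        break
    fi
    sleep 2    # give the client's directory attribute cache time to expire
done
<=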
** Mount properties **

* Changing server properties *

The server side of NFS allows little real configuration for performance or reliability. NFS servers are stateless, so you don't have to worry much about cached state, except for delayed asynchronous writes. Default asynchronous writes are not very risky unless you expect your disk servers to crash often. ``sync'' guarantees that all writes are completed when the client thinks they are. All client machines should still see consistent states with ``async'' because they all access the same server. Client caching is a much greater risk. I recommend the default ``async'' on the server side. If you change server export properties in ``/etc/exports'', re-export with ``exportfs -rav''.

* Changing client mount properties *

You can see what parameters you are using with ``cat /proc/mounts''. Edit ``/etc/fstab'' to change properties. (Hard mounts are simpler for a data processing platform, so I have little to say about auto-mounts. Do not run both ``amd'' and ``autofs''. Check with ``chkconfig --list''. You may find it useful to add ``dismount_interval=1800'' in the global section of ``/etc/amd.conf'' for a long 30 minute wait to keep automounted directories around.) When you change mount attributes, remount with ``mount -a''.

Here are client properties that you may want to change from their default values.

* rw *

Usually you want the flag ``rw'' to allow read-write access; it is off by default.

* intr *

Allow users to interrupt hung processes with this flag (off by default). This might sound risky, but in fact this property is consistent with the original NFS design and is well supported. Unnecessary hangs are more destabilizing.

* lock *

If your code needs file locking, then by all means turn this on. But if you are certain that locking is not required (as in my current project), then turn it off. It could create unnecessary opportunities for timeouts.

* hard *

Avoid the complexity of amd if you can for simple clusters. Use ``hard''.

* nfsvers=3 *

This appears as ``v2'' or ``v3'' in ``/proc/mounts''. The NFS version supposedly defaults to version 2, but version 3 is faster and supports big files. I get ``v3'' by default much of the time.

* tcp or udp? *

Almost everyone runs NFS over ``udp'' for performance. But udp is an unreliable protocol and can perform worse than ``tcp'' on a saturated host or network. If NFS errors occur too often, then you may want to try ``tcp'' instead of ``udp''.

* wsize and rsize *

If packets are getting lost on the network, then it may help to lower the ``rsize'' and ``wsize'' mount parameters (read and write block sizes) in ``/etc/fstab''. For reliability, I recommend ``rsize=1024,wsize=1024'' instead of the defaults of 4096.

* timeo and retrans *

If the server is responding too slowly, then either replace the server or increase the ``timeo'' or ``retrans'' parameters. For more reliability when the machine stays overloaded, set ``retrans=10'' to retry sending RPC commands 10 times instead of the default 3 times. The default timeout between retries is ``timeo=7'' (seven tenths of a second). Increase to ``timeo=20'' (two full seconds) to avoid hammering an already overloaded server.
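Putting these options together, a client entry in ``/etc/fstab'' might look like the sketch below. The server name ``fileserver'', the export path, and the mount point are placeholders, and ``nolock'' is appropriate only if, as above, you are certain your applications need no file locking.
=>
# Hypothetical /etc/fstab entry combining the options discussed above.
# "fileserver:/export/data" and "/mnt/data" are placeholders.
fileserver:/export/data  /mnt/data  nfs  rw,hard,intr,nolock,nfsvers=3,rsize=1024,wsize=1024,timeo=20,retrans=10  0 0
<=
The cache-related options described next can be appended to the same comma-separated list.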
* acregmin, acregmax, acdirmin, acdirmax, noac, cto *

acregmax and acdirmax are the maximum number of seconds to cache attributes for files and directories respectively. Both default to 60 seconds. Setting a value to 0 disables that cache, and ``noac'' disables all attribute caching. ``cto'' (on by default) guarantees that files will be rechecked after closing and reopening. Minimum numbers of seconds are set with acregmin and acdirmin. acdirmin defaults to 30 seconds and acregmin to 3 seconds. I recommend setting ``acdirmin=0,acdirmax=0'' to disable caching of directory information, and reducing ``acregmax=10'', because we have had so many problems with directories and files not appearing to exist shortly after they are created.

* noatime or atime *

Performance should improve by adding the ``noatime'' flag. Every time a client reads from a file, the server must update the file's inode with the time it was most recently accessed. Most applications don't care about the most recent access time, so you can set ``noatime'' with impunity. Nevertheless, this flag is rarely set on a general purpose machine, and if you are more concerned about reliability, then use the default ``atime''.

** Synchronize your clocks **

It is surprising how often cluster nodes are allowed to run with totally inconsistent clocks. Caching should not be affected, but file timestamps will be a mess. If you are on a network with a time server, add the hostnames of timeservers to ``/etc/ntp/step-tickers'' or ``/etc/ntp.conf'', and enable the service with ``chkconfig ntpd on''.

Bill Harlan, 2002