NFS for clusters

Here are notes I've found useful for configuring reliable shared disk on a Linux cluster.


I've extracted most of this information from "Linux NFS and Automounter Administration" by Erez Zadok, published by Sybex.


Diagnosing problems

Increasingly I see a single large RAID disk server being clobbered by 16 or 32 Linux clients at a time. Here are some parameters to check.

§    Check your version

First make sure you have an up-to-date copy of NFS installed with
  $ rpm -q nfs-utils 
or 
  $ rpm -q -f /usr/sbin/rpc.nfsd

Check dependencies (like portmap) with rpm -q -R nfs-utils and check their versions as well. See what files the package installs with rpm -q -l nfs-utils.

See that your services are running with rpcinfo -p [hostname]. On a client machine look for portmapper, nlockmgr and possibly amd or autofs. A server will also run mountd and nfs.
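
For example, on a disk server you might see output something like the following (the hostname server1 is made up, and the ports for mountd, nlockmgr, and status will vary from machine to machine):
  $ rpcinfo -p server1
     program vers proto   port
      100000    2   tcp    111  portmapper
      100000    2   udp    111  portmapper
      100021    1   udp   1026  nlockmgr
      100021    3   udp   1026  nlockmgr
      100024    1   udp   1024  status
      100005    1   udp    635  mountd
      100005    3   udp    635  mountd
      100003    2   udp   2049  nfs
      100003    3   udp   2049  nfs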

§    Saturated network ?

First exercise your disk with your own code or with a simple write operation like
  $ time dd if=/dev/zero of=testfile bs=4k count=8182
  8182+0 records in
  8182+0 records out
  real    0m8.829s
  user    0m0.000s
  sys     0m0.160s

Writing a file to the NFS-mounted directory should be enough to reveal network saturation. The example above moves about 32 MB in under 9 seconds, or roughly 3.6 MB/s; compare that with what your network and disks should be able to sustain.

When profiling reads instead of writes, call umount and mount to flush caches, or the read will seem instantaneous.
  $ cd /
  $ umount /mnt/test
  $ mount /mnt/test
  $ cd /mnt/test
  $ dd if=testfile of=/dev/null bs=4k count=8192

Check for failures on a client machine with
  $ nfsstat -c
or
  $ nfsstat -o rpc

If more than 3% of calls are retransmitted, then there are problems with the network or NFS server.
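
Here is a quick sketch for computing that percentage, assuming the usual layout in which nfsstat prints a "Client rpc stats:" heading followed by a header line and a line of counts; adjust the field handling if your version formats its output differently.
  $ nfsstat -c | awk '/^Client rpc/ { getline; getline;
        printf "%.1f%% retransmitted (%d of %d calls)\n", 100*$2/($1?$1:1), $2, $1 }'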

Look for NFS failures on a shared disk server with
  $ nfsstat -s
or
  $ nfsstat -o rpc

It is not unreasonable to expect 0 badcalls; in any case, "badcalls" should be a tiny fraction of the total "calls."

§    Lost packets

NFS must resend packets that are lost by a busy host. Look for permanently lost packets on the disk server with
  $ head -2 /proc/net/snmp | cut -d' ' -f17
  ReasmFails
  2

If you can see this number increasing during NFS activity, then you are losing packets.
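
For example, check the counter every few seconds while the cluster is busy:
  $ watch -n 5 "head -2 /proc/net/snmp | cut -d' ' -f17"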

You can reduce the number of lost packets on the server by increasing the buffer size for fragmented packets.
  $ echo 524288 > /proc/sys/net/ipv4/ipfrag_low_thresh
  $ echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh 

This is about double the default.
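
Those echo commands do not survive a reboot. If you want the change to be permanent, the equivalent settings in /etc/sysctl.conf (applied with sysctl -p) should look something like this:
  # /etc/sysctl.conf: larger buffers for reassembling fragmented packets
  net.ipv4.ipfrag_low_thresh = 524288
  net.ipv4.ipfrag_high_thresh = 524288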

§    Server threads

See if your server is receiving too many overlapping requests with
  $ grep th /proc/net/rpc/nfsd
  th 8 594 3733.140 83.850 96.660 0.000 73.510 30.560 16.330 2.380 0.000 2.150

The first number is the number of threads available for servicing requests, and the second number is the number of times that all threads have been needed. The remaining 10 numbers are a histogram showing how many seconds a certain fraction of the threads has been busy, starting with less than 10% of the threads and ending with more than 90% of the threads. If the last few numbers have accumulated a significant amount of time, then your server probably needs more threads.

Increase the number of threads used by the server to 16 by setting RPCNFSDCOUNT=16 in /etc/rc.d/init.d/nfs.
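
On a Red Hat-style system that looks something like the following; the location of the init script may differ on your distribution, and the nfsd processes must be restarted before the new count takes effect.
  $ grep RPCNFSDCOUNT /etc/rc.d/init.d/nfs
  RPCNFSDCOUNT=16
  $ /etc/rc.d/init.d/nfs restart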

§    Invisible or stale files

If separate clients are sharing information through NFS disks, then you have special problems. You may delete a file on one client node and cause a different client to get a stale file handle. Different clients may have cached inconsistent versions of the same file. A single client may even create a file or directory and be unable to see it immediately. If these problems sound familiar, then you may want to adjust the NFS caching parameters discussed below and code multiple attempts into your applications.
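
As a crude sketch of such multiple attempts, a shell script on one client might poll for a file written by another client instead of assuming it appears instantly; the path and the 30-second timeout here are hypothetical.
  # Wait up to 30 seconds for a file created on another node to
  # become visible on this client.
  wait_for_file() {
      i=0
      while [ $i -lt 30 ]; do
          [ -e "$1" ] && return 0
          sleep 1
          i=`expr $i + 1`
      done
      return 1
  }

  wait_for_file /mnt/test/job.done || echo "job.done still not visible"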


Mount properties

§    Changing server properties

The server side of NFS allows no real configuration for performance or reliability. NFS servers are stateless so you don't have to worry much about cached state, except for delayed asynchronous writes.

Default asynchronous writes are not very risky unless you expect your disk servers to crash often. The sync option guarantees that all writes are actually complete when the client thinks they are. All client machines should still see consistent states with async because they all access the same server. Client caching is a much greater risk. I recommend the default async on the server side.

If you change server export properties in /etc/exports, re-export with exportfs -rav.
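
For example, an /etc/exports entry that shares a directory read-write with one cluster subnet might look like this (the path and subnet are made up):
  # /etc/exports
  /export/data    192.168.1.0/255.255.255.0(rw,async)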

§    Changing client mount properties

You can see what parameters you are using with cat /proc/mounts.

Edit /etc/fstab to change properties.
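
For reference, a client NFS entry in /etc/fstab looks something like the following (server1 and /export/data are made-up names; /mnt/test matches the mount point used above). The fourth field holds the mount properties discussed below.
  server1:/export/data   /mnt/test   nfs   rw,hard,intr   0 0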

(Hard mounts are simpler for a data processing platform, so I have little to say about auto-mounts. Do not run both amd and autofs. Check with chkconfig --list. You may find it useful to add dismount_interval=1800 in the global section of /etc/amd.conf for a long 30 minute wait to keep automounted directories around.)
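
That setting goes in the global section of /etc/amd.conf, something like:
  [ global ]
  dismount_interval = 1800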

When you change mount attributes, unmount and remount the file system (as in the umount and mount example above) so the new options take effect.

Here are client properties that you may want to change from their default values.

§    rw

Usually you want the flag rw to allow read-write access, and it is off by default.

§    intr

Allow users to interrupt hung processes with this flag (off by default). This might sound risky, but in fact this property is consistent with the original NFS design and is well supported. Unnecessary hangs are more destabilizing.

§    lock

If your code needs file locking, then by all means turn this on. But if you are certain that locking is not required (as in my current project), then turn it off. It could create unnecessary opportunities for timeouts.

§    hard

For simple clusters, avoid the complexity of amd if you can. Use hard mounts, so that clients wait and keep retrying a slow or rebooting server rather than returning errors to your applications, as soft mounts would.

§    nfsvers=3

This appears as v2 or v3 in /proc/mounts. The NFS version supposedly defaults to version 2, but version 3 is faster and supports big files. I get v3 by default much of the time.

§    tcp or udp?

Almost everyone runs NFS over udp for performance. But udp is an unreliable protocol and can perform worse than tcp on a saturated host or network. If NFS errors occur too often, then you may want to try tcp instead of udp.

§    wsize and rsize

If packets are getting lost on the network, then it may help to lower the rsize and wsize mount parameters (read and write block sizes) in /etc/fstab.

For reliability, prefer smaller rsize and wsize values in /etc/fstab. I recommend rsize=1024,wsize=1024 instead of the defaults of 4096.

§    timeo and retrans

If the server is responding too slowly, then either replace the server or increase the timeo or retrans parameters.

For more reliability when the machine stays overloaded, set retrans=10 to retry sending RPC commands 10 times instead of the default 3 times.

The default timeout between retries is timeo=7 (seven tenths of a second). Increase to timeo=20 (two full seconds) to avoid hammering an already overloaded server.

§    acregmin, acregmax, acdirmin, acdirmax, noac, cto

acregmax and acdirmax are the maximum numbers of seconds to cache attributes for files and directories, respectively. Both default to 60 seconds. Setting them to 0 disables caching, and noac disables attribute caching entirely. cto (on by default) guarantees that a file's attributes are rechecked when it is closed and reopened.

Minimum numbers of seconds are set with acregmin and acdirmin. acdirmin defaults to 30 seconds and acregmin to 3 seconds.

I recommend setting acdirmin=0,acdirmax=0 to disable caching of directory information, and reducing file attribute caching with acregmax=10, because we have had so many problems with directories and files not appearing to exist shortly after they are created.

§    noatime or atime

Performance should improve if you add the noatime flag. Every time a client reads from a file, the server must update the inode's time stamp for the most recent access. Most applications don't care about the most recent access time, so you can set noatime with impunity.

Nevertheless, this flag is rarely set on a general purpose machine, and if you are more concerned about reliability, then use the default atime.
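
Pulling the recommendations above together, a cautious client entry in /etc/fstab for a busy cluster might look something like the following, all on one line (server1 and /export/data are made-up names; include nolock only if you are sure your code never locks files):
  server1:/export/data  /mnt/test  nfs  rw,hard,intr,nolock,nfsvers=3,rsize=1024,wsize=1024,timeo=20,retrans=10,acdirmin=0,acdirmax=0,acregmax=10,noatime  0 0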


Synchronize your clocks

It is surprising how often cluster nodes are allowed to run with totally inconsistent clocks. Caching should not be affected, but file time stamps will be a mess, and any tool that compares them (like make) will misbehave.

If you are on a network with a time server, add the hostnames of your time servers to /etc/ntp/step-tickers or /etc/ntp.conf, and enable the service with chkconfig ntpd on.
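
For example, on a Red Hat-style machine (ntp1.example.com stands in for your own time server):
  $ echo ntp1.example.com >> /etc/ntp/step-tickers
  $ chkconfig ntpd on
  $ service ntpd start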

Bill Harlan, 2002

