Here are notes I've found useful for configuring reliable shared disk on a linux cluster.
I've extracted most of this information from "Linux NFS and Automounter Administration" by Erez Zadok, published by Sybex.
Increasingly I see a single large RAID disk server being clobbered by 16 or 32 Linux clients at a time. Here are some parameters to check.
First make sure you have an up-to-date copy of NFS installed with
$ rpm -q nfs-utils or $ rpm -q -f /usr/sbin/rpc.nfsd
Check dependencies with rpm -q -R nfs-utils, and check their versions as well. See what files are installed with rpm -q -l nfs-utils.
See that your services are running with rpcinfo -p [hostname]. On a client machine look for autofs. A server will also run nfsd and mountd.
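As a quick sanity check you can grep the rpcinfo listing for the services you expect. This is only a sketch: the sample output below is made up, and on a real host you would pipe the output of rpcinfo -p itself.

```shell
# Hypothetical "rpcinfo -p" output; on a real host use: rpcinfo -p hostname
rpcinfo_output='   program vers proto   port
    100000    2   tcp    111  portmapper
    100003    3   udp   2049  nfs
    100005    3   udp    635  mountd'

missing=""
for svc in portmapper nfs mountd; do
    # grep -w matches the service name as a whole word
    printf '%s\n' "$rpcinfo_output" | grep -qw "$svc" || missing="$missing $svc"
done
echo "missing RPC services:${missing:- none}"
```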
First exercise your disk with your own code or with a simple write operation like
$ time dd if=/dev/zero of=testfile bs=4k count=8182
8182+0 records in
8182+0 records out
real 0m8.829s
user 0m0.000s
sys 0m0.160s
Writing files should be enough to test network saturation.
When profiling reads instead of writes, unmount and remount the file system to flush caches, or the read will seem instantaneous.
$ cd /
$ umount /mnt/test
$ mount /mnt/test
$ cd /mnt/test
$ dd if=testfile of=/dev/null bs=4k count=8192
Check for failures on a client machine with
$ nfsstat -c or $ nfsstat -o rpc
If more than 3% of calls are retransmitted, then there are problems with the network or NFS server.
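The 3% rule can be checked mechanically. A sketch, using made-up calls and retrans counters in place of real nfsstat -c output:

```shell
# Hypothetical counters taken from the rpc section of "nfsstat -c";
# substitute the numbers from your own client.
calls=6401
retrans=2

# Percentage of calls that were retransmitted, to two decimals.
pct=$(awk -v c="$calls" -v r="$retrans" 'BEGIN { printf "%.2f", 100 * r / c }')
echo "retransmitted $pct% of calls"

# Warn when more than 3% of calls were retransmitted.
over=$(awk -v c="$calls" -v r="$retrans" \
    'BEGIN { if (100 * r / c > 3) print "yes"; else print "no" }')
[ "$over" = "yes" ] && echo "check the network or the NFS server"
```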
Look for NFS failures on a shared disk server with
$ nfsstat -s or $ nfsstat -o rpc
It is not unreasonable to expect 0 badcalls. You should have very few "badcalls" out of the total number of "calls."
NFS must resend packets that are lost by a busy host. Look for permanently lost packets on the disk server with
$ head -2 /proc/net/snmp | cut -d' ' -f17
ReasmFails
2
If you can see this number increasing during nfs activity, then you are losing packets.
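Fixed column numbers in /proc/net/snmp can differ between kernels, so it is safer to locate ReasmFails by its header name. A sketch over a made-up, shortened pair of "Ip:" lines; the real file has many more columns:

```shell
# Two hypothetical "Ip:" lines in the format of /proc/net/snmp.
# On a server, feed the awk script /proc/net/snmp itself.
snmp_sample='Ip: Forwarding DefaultTTL InReceives ReasmFails
Ip: 1 64 104323 2'

reasmfails=$(printf '%s\n' "$snmp_sample" | awk '
    # First Ip: line is the header; remember each column position by name.
    $1 == "Ip:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
    # Second Ip: line holds the counters; print the one we want.
    $1 == "Ip:" { print $col["ReasmFails"]; exit }')
echo "ReasmFails: $reasmfails"
```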
You can reduce the number of lost packets on the server by increasing the buffer size for fragmented packets.
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_low_thresh
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh
This is about double the default.
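These echo settings do not survive a reboot. On distributions that run sysctl -p at boot you can make them permanent in /etc/sysctl.conf; the values below simply mirror the ones above:

```
# /etc/sysctl.conf -- applied at boot by "sysctl -p"
net.ipv4.ipfrag_low_thresh = 524288
net.ipv4.ipfrag_high_thresh = 524288
```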
See if your server is receiving too many overlapping requests with
$ grep th /proc/net/rpc/nfsd
th 8 594 3733.140 83.850 96.660 0.000 73.510 30.560 16.330 2.380 0.000 2.150
The first number is the number of threads available for servicing requests, and the second number is the number of times that all threads have been needed. The remaining 10 numbers are a histogram showing how many seconds a certain fraction of the threads have been busy, starting with less than 10% of the threads and ending with more than 90% of the threads. If the last few numbers have accumulated a significant amount of time, then your server probably needs more threads.
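You can total the busiest buckets mechanically. A sketch using the sample th line shown above; the last three buckets cover the time when 70% or more of the threads were busy:

```shell
# The sample "th" line from above; on a server read /proc/net/rpc/nfsd.
th_line='th 8 594 3733.140 83.850 96.660 0.000 73.510 30.560 16.330 2.380 0.000 2.150'

# Sum the last three histogram buckets (70% or more of threads busy).
busy=$(printf '%s\n' "$th_line" | awk '{ printf "%.2f", $(NF-2) + $(NF-1) + $NF }')
echo "seconds with at least 70% of nfsd threads busy: $busy"
```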
Increase the number of threads used by the server to 16 by setting RPCNFSDCOUNT=16 in the nfsd startup script.
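On the Red Hat systems I am familiar with, the thread count is a shell variable in the nfsd init script; the path and variable name may differ on your distribution:

```
# /etc/rc.d/init.d/nfs (Red Hat) -- number of nfsd threads to start
RPCNFSDCOUNT=16
```

Restart the service afterwards for the change to take effect.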
If separate clients are sharing information through NFS disks, then you have special problems. You may delete a file on one client node and cause a different client to get a stale file handle. Different clients may have cached inconsistent versions of the same file. A single client may even create a file or directory and be unable to see it immediately. If these problems sound familiar, then you may want to adjust NFS caching parameters and code multiple attempts in your applications.
The server side of NFS allows no real configuration for performance or reliability. NFS servers are stateless so you don't have to worry much about cached state, except for delayed asynchronous writes.
Default asynchronous writes are not very risky unless you expect your disk servers to crash. The sync export option guarantees that all writes are completed when the client thinks they are. All client machines should still see consistent states with async because they all access the same server. Client caching is a much greater risk. I recommend keeping the default async on the server side.
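A hypothetical /etc/exports entry that makes the asynchronous-write choice explicit; the path and hostname pattern are examples only, not a prescription:

```
# /etc/exports -- export /export/data read-write to the cluster nodes
/export/data  node*.cluster.example.com(rw,async)
```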
If you change server export properties in /etc/exports, re-export with exportfs -ra.
You can see what parameters you are using in /proc/mounts, and edit /etc/fstab to change properties.
(Hard mounts are simpler for a data processing platform, so I have little to say about auto-mounts. Do not run both amd and autofs; check with chkconfig --list. You may find it useful to add dismount_interval=1800 in the global section of /etc/amd.conf for a long 30 minute wait to keep automounted directories mounted. When you change mount attributes, unmount and remount the directory.)
Here are client properties that you may want to change from their default values.
Usually you want the flag
rw to allow
read-write access, and it is off by default.
The intr flag allows users to interrupt hung processes (off by default). This might sound risky, but in fact this property is consistent with the original NFS design and is well supported. Unnecessary hangs will be more destabilizing.
If your code needs file locking, then by all means turn the lock option on. But if you are certain that locking is not required (as in my current project), then turn it off with nolock. It could create unnecessary opportunities for timeouts.
Avoid the complexity of amd if you can for simple clusters. Use static mounts in /etc/fstab. These appear as type nfs in /proc/mounts. The NFS version supposedly defaults to version 2, but version 3 is faster and supports big files. I get version 3 by default much of the time.
Almost everyone runs NFS under udp for performance. But udp is an unreliable protocol and can perform worse than tcp on a saturated host or network. If nfs errors occur too often, then you may want to mount with tcp instead of udp.
If packets are getting lost on the network then it may help to lower rsize and wsize mount parameters (read and write block sizes) in /etc/fstab.
For reliability, prefer smaller rsize and
wsize values in
/etc/fstab. I recommend
rsize=1024,wsize=1024 instead of the
defaults of 4096.
If the server is responding too slowly, then either replace the server or increase the timeo or retrans parameters.
For more reliability when the machine stays heavily loaded, use retrans=10 to retry sending RPC commands 10 times instead of the default 3 times.
The default timeout between retries is timeo=7 (seven tenths of a second). Try timeo=20 (two full seconds) to avoid hammering an already overloaded server.
acregmax and acdirmax are the maximum number of seconds to cache attributes for files and directories respectively. Both default to 60 seconds. A value of 0 disables caching, and the noac flag disables all attribute caching.
The cto flag (on by default) guarantees that files will be rechecked after closing and reopening.
Minimum numbers of seconds are set with acregmin and acdirmin. acdirmin defaults to 30 seconds and acregmin to 3 seconds.
I recommend setting acdirmax=0 to disable caching of directory information, and acregmax=10, because we have had so many problems with directories and files not appearing to exist shortly after creation.
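Pulling the client-side suggestions together, a hypothetical /etc/fstab entry might look like the line below. The server name, paths, and exact values are examples to adapt for your own site, not a prescription:

```
# /etc/fstab -- NFS client mount combining the options discussed above
server:/export/data  /mnt/data  nfs  rw,hard,intr,nfsvers=3,rsize=1024,wsize=1024,timeo=20,retrans=10,acregmax=10,acdirmax=0  0 0
```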
Performance should improve by adding the noatime flag. Every time a client reads from a file, the server must update the file's inode time stamp for the most recent access time. Most applications don't care about the most recent access time, so you can set noatime with impunity.
Nevertheless, this flag is rarely set on a general-purpose machine, and if you are more concerned about reliability, then use the default.
It is surprising how often cluster nodes are allowed to run with totally inconsistent clocks. Caching should not be affected, but file properties will be a mess.
If you are on a network with a time server,
add hostnames of timeservers to
/etc/ntp.conf, and enable the service at boot with chkconfig ntpd on.
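A minimal sketch of the client configuration, with placeholder hostnames standing in for your site's actual time servers:

```
# /etc/ntp.conf -- replace with your site's time servers
server ntp1.example.com
server ntp2.example.com
```

Then start the daemon itself (for example with service ntpd start on Red Hat systems) so the nodes begin converging on a common clock.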
Bill Harlan, 2002