Is There a Good Way to Detect a Stale NFS Mount

Is there a good way to detect a stale NFS mount?

You could write a C program and check for ESTALE.

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct stat st;

    /* stat() sets errno to ESTALE when the NFS file handle is stale */
    if (stat("/mnt/some_stale", &st) == -1 && errno == ESTALE) {
        printf("/mnt/some_stale is stale\n");
        return EXIT_SUCCESS;  /* exit 0: the mount is stale */
    }
    return EXIT_FAILURE;      /* exit 1: not stale (or a different error) */
}
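
Compile it and check the exit status; the file name check_stale.c below is just an illustrative choice:

cc -o check_stale check_stale.c
./check_stale && echo "stale mount detected"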

Loop to test all NFS mount points

In your loop it should be:

read -r -t1 < <(stat -t "$i" 2>&-)

As written, it just reads the first array value; $i isn't used. A complete loop is sketched below.
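
A full loop might then look like this minimal sketch, assuming the mount list is taken from /proc/mounts (the variable names are illustrative):

while read -r _ mountpoint fstype _; do
    # only test NFS mounts
    case "$fstype" in
        nfs|nfs4) ;;
        *) continue ;;
    esac
    # read fails on EOF (stat errored out fast) or on the 1s timeout (stat hung)
    if ! read -r -t1 < <(stat -t "$mountpoint" 2>&-); then
        echo "$mountpoint appears stale or unresponsive"
    fi
done < /proc/mounts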

Linux Shell Script: How to Detect Whether an NFS Mount Point (or the Server) Is Dead?

The stat command is a somewhat cleaner way:

statresult=$(stat /my/mountpoint 2>&1 | grep -i "stale")
if [ -n "$statresult" ]; then
    # result not empty: mountpoint is stale; remove it
    umount -f /my/mountpoint
fi

Additionally, you can use rpcinfo to detect whether the remote NFS share is available:

rpcinfo -t remote.system.net nfs > /dev/null 2>&1
if [ $? -eq 0 ]; then
    echo "Remote NFS share available."
fi

Added 2013-07-15T14:31:18-05:00:

I looked into this further as I am also working on a script that needs to recognize stale mountpoints. Inspired by one of the replies to "Is there a good way to detect a stale NFS mount", I think the following may be the most reliable way to check for staleness of a specific mountpoint in bash:

read -t1 < <(stat -t "/my/mountpoint")
# read exits non-zero on EOF (stat failed fast) or on the timeout (stat hung)
if [ $? -ne 0 ]; then
    echo "NFS mount stale. Removing..."
    umount -f -l /my/mountpoint
fi

The "read -t1" construct reliably times out the subshell if the stat command hangs for some reason; note that on a timeout read returns a status above 128, which the non-zero test above also catches.

Added 2013-07-17T12:03:23-05:00:

Although read -t1 < <(stat -t "/my/mountpoint") works, there doesn't seem to be a way to mute its error output when the mountpoint is stale. Adding > /dev/null 2>&1 either within the subshell or at the end of the command line breaks it. A simple test, if [ -d /path/to/mountpoint ]; then ... fi, also works and may be preferable in scripts. After much testing it is what I ended up using.
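
For reference, a minimal version of that directory test (the path is a placeholder; note that against a hard-mounted, unresponsive server even this check can block):

if [ -d /my/mountpoint ]; then
    echo "Mount point looks healthy."
else
    echo "Mount point appears stale."
fi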

Added 2013-07-19T13:51:27-05:00:

A reply to my question "How can I use read timeouts with stat?" provided additional detail about muting the output of stat (or rpcinfo) when the target is not available and the command hangs for a few minutes before it would time out on its own. While [ -d /some/mountpoint ] can be used to detect a stale mountpoint, there is no similar alternative for rpcinfo, and hence use of read -t1 redirection is the best option. The output from the subshell can be muted with 2>&-. Here is an example from CodeMonkey's response:

mountpoint="/my/mountpoint"
# REPLY stays empty when stat produces no output (stale mount or hung server)
read -t1 < <(stat -t "$mountpoint" 2>&-)
if [[ -z "$REPLY" ]]; then
    echo "NFS mount stale. Removing..."
    umount -f -l "$mountpoint"
fi
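
The same read -t1 trick can wrap rpcinfo when the server may be unreachable; a sketch, with an illustrative host name:

read -t1 < <(rpcinfo -t remote.system.net nfs 2>&-)
if [[ -z "$REPLY" ]]; then
    echo "Remote NFS service is not responding."
fi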

Perhaps now this question is fully answered :).

Check if an NFS Directory Is Mounted without Long Hangs on Failure

OK, I managed to solve this using the timeout command; I checked back here to see that BroSlow had updated his answer with a very similar solution. Thank you, BroSlow, for your help.

To solve the problem, the code I used is:

if [[ $(timeout 5s ls /nfs/machine | wc -l) -gt 0 ]]; then
    echo "can see machine"
else
    echo "cannot see machine"
fi

I then reduced this to a single-line command so that it could be run through ssh and put inside a loop (to iterate over the hosts and run the check on each).
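
That single-line form over ssh might look like the following sketch (the host names and path are illustrative):

for host in host1 host2 host3; do
    echo -n "$host: "
    ssh "$host" 'if [[ $(timeout 5s ls /nfs/machine | wc -l) -gt 0 ]]; then echo "can see machine"; else echo "cannot see machine"; fi'
done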

Stale file handle error when a process tries to read a file that another process has already deleted

This is totally expected. The NFS specification is clear about use of file handles after an object (be it file or directory) has been deleted. Section 4 clearly addresses this. For example:

The persistent filehandle will become stale or invalid when the file system object is removed. When the server is presented with a persistent filehandle that refers to a deleted object, it MUST return an error of NFS4ERR_STALE.

This is such a common problem, it even has its own entry in section A.10 of the NFS FAQ, which says one common cause of ESTALE errors is that:

The file handle refers to a deleted file. After a file is deleted on the server, clients don't find out until they try to access the file with a file handle they had cached from a previous LOOKUP. Using rsync or mv to replace a file while it is in use on another client is a common scenario that results in an ESTALE error.

The expected resolution is that your client app must close and reopen the file to see what has happened. Or, as the FAQ says:

... to recover from an ESTALE error, an application must close the file or directory where the error occurred, and reopen it so the NFS client can resolve the pathname again and retrieve the new file handle.

Nagios SNMP process check hangs on stale NFS mount

Fairly simple - NFS is designed to tolerate server reboots. Calls to an NFS file system that is mounted hard will therefore block and wait for the server to respond. This is to ensure that no data is lost and no process is killed - they simply 'stall' - which is exactly the problem you're having.

There's an NFS mount option that avoids this problem - simply specify soft when mounting (either in fstab, or with -o soft when mounting manually).
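
For example (the server name, export path, and timeout values here are placeholders):

# /etc/fstab entry using a soft mount:
server.example.com:/export  /mnt/share  nfs  soft,timeo=30,retrans=3  0  0

# Or when mounting by hand:
mount -t nfs -o soft,timeo=30,retrans=3 server.example.com:/export /mnt/share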

Be warned though - you'll get errors when accessing the NFS mount. Most things will tolerate this scenario, but it's always possible that badly written scripts or programs will fall over.

What does 'stale file handle' in Linux mean?

When the directory is deleted, the inode for that directory (and the inodes for its contents) are recycled. The pointers your shell holds to that directory's inode (and to the inodes of its contents) are no longer valid. When the directory is restored from backup, the old inodes are not (necessarily) reused; the directory and its contents are stored on arbitrary new inodes. The only thing that stays the same is that the parent directory reuses the same name for the restored directory (because you told it to).

Now if you attempt to access the contents of the directory that your original shell is still pointing to, it communicates that request to the file system as a request for the original inode, which has since been recycled (and may even be in use for something entirely different now). So you get a stale file handle message because you asked for some nonexistent data.

When you perform a cd operation, the shell reevaluates the inode location of whatever destination you give it. Now that your shell knows the new inode for the directory (and the new inodes for its contents), future requests for its contents will be valid.
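
A rough way to reproduce the effect with two shells on an NFS mount (paths are illustrative; on a local file system the symptoms differ slightly):

# Shell 1: enter a directory
mkdir /tmp/demo && cd /tmp/demo

# Shell 2: delete and recreate it (the "restore from backup")
rm -r /tmp/demo && mkdir /tmp/demo

# Shell 1: the shell still points at the old, recycled inode
ls            # over NFS this fails with "Stale file handle"
cd /tmp/demo  # re-resolves the pathname to the new inode
ls            # works again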


