open O_CREAT | O_EXCL on NFS in Linux

open O_CREAT | O_EXCL on NFS in Linux?

According to the Linux NFS FAQ, O_CREAT | O_EXCL is reliable with NFSv3 or later, provided the client runs a kernel newer than 2.6.5.

From http://nfs.sourceforge.net/#faq_d10:

  • D10. I'm trying to use flock()/BSD locks to lock files used on multiple clients, but the files become corrupted. How come?

    • A. flock()/BSD locks act only locally on Linux NFS clients prior to 2.6.12. Use fcntl()/POSIX locks to ensure that file locks are visible to other clients.
    • Here are some ways to serialize access to an NFS file.

      • Use the fcntl()/POSIX locking API. This type of locking provides byte-range locking across multiple clients via the NLM protocol, or via NFSv4.
      • Use a separate lockfile, and create hard links to it. See the description in the O_EXCL section of the creat(2) man page.
    • It's worth noting that until early 2.6 kernels, O_EXCL creates were not atomic on Linux NFS clients. Don't use O_EXCL creates and expect atomic behavior among multiple NFS clients unless you are running a kernel newer than 2.6.5.
    • ...
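
As a rough illustration of the first option in the FAQ list above, here is a minimal C sketch of whole-file fcntl()/POSIX locking. The helper names lock_whole_file and unlock_whole_file are made up for this example and do not come from the FAQ. The descriptor is assumed to be open for writing, which a write lock requires.

```c
#include <fcntl.h>
#include <unistd.h>

/* Take an exclusive POSIX (fcntl) lock on the whole file, blocking until
 * it is granted. Returns 0 on success, -1 on error (errno is set). */
int lock_whole_file(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_WRLCK;     /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;  /* l_start is relative to the start of the file */
    fl.l_start = 0;
    fl.l_len = 0;            /* 0 means "through the end of the file" */
    return fcntl(fd, F_SETLKW, &fl);
}

/* Release the lock taken by lock_whole_file(). Returns 0 or -1. */
int unlock_whole_file(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLK, &fl);
}
```

Note that POSIX locks are advisory: they only serialize programs that all take the lock before touching the file.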

On which systems/filesystems is os.open() atomic?

For UN*X-compliant systems (certified POSIX / IEEE 1003.1 by The Open Group), the behaviour is guaranteed, because The Open Group's specification for open(2) mandates it. Quote:

O_EXCL

If O_CREAT and O_EXCL are set, open() shall fail if the file exists. The check for the existence of the file and the creation of the file if it does not exist shall be atomic with respect to other threads executing open() naming the same filename in the same directory with O_EXCL and O_CREAT set. If O_EXCL and O_CREAT are set, and path names a symbolic link, open() shall fail and set errno to [EEXIST], regardless of the contents of the symbolic link. If O_EXCL is set and O_CREAT is not set, the result is undefined.

The "common" UN*X and UN*X-like systems (Linux, MacOSX, *BSD, Solaris, AIX, HP/UX) surely behave like that.

Since the Windows API doesn't have open() as such, the library function there is necessarily reimplemented in terms of the native API but it's possible to maintain the semantics.

I don't know which widely-used systems wouldn't comply; QNX, while not POSIX-certified, has the same statement in its docs for open(). The *BSD manpages do not explicitly mention the "atomicity" but Free/Net/OpenBSD implement it. Even exotics like SymbianOS (which like Windows doesn't have a UN*X-ish open system call) can do the atomic open/create.

For more interesting results, try to find an operating system / C runtime library which has open() but doesn't implement the above semantics for it... and on which Python would run with threads (got you there, MSDOS ...).

Edit: My post focuses on "which operating systems have this characteristic for open?", for which the answer is "pretty much all of them". With respect to filesystems, though, the picture is different, because network filesystems (whether NFS, SMB/CIFS or others) do not always maintain O_EXCL, as this could result in denial of service: if a client does an open(..., O_EXCL, ...) and then simply stops talking to the file server or is shut down, everyone else would be locked out.

Is it required to use O_TRUNC and O_APPEND together?


O_APPEND flag is used to append data to the end of the file.

That's true, but incomplete enough to be potentially misleading. And I suspect that you are in fact confused in that regard.

The kernel records a file offset, sometimes also called the read-write offset or pointer. This is the location in the file at which the next read() or write() will commence.

That's also incomplete. There is a file offset associated with at least each seekable file. That is the position where the next read() will commence. It is where the next write() will commence if the file is not open in append mode, but in append mode every write happens at the end of the file, as if it were repositioned with lseek(fd, 0, SEEK_END) before each one. In that case, then, the current file offset might not be the position where the next write() will commence.

I am confused: if the file is truncated and the kernel does the subsequent writing at the end of the file, why is the append flag needed to explicitly tell it to append at the end of the file?

It is not needed to cause the first write (by any process) after truncation to occur at the end of the file because immediately after the file has been truncated there isn't any other position.

Without the append flag (if the file is truncated), the kernel writes at the end of the file for the subsequent write() function call.

It is not needed for subsequent writes either, as long as the file is not repositioned or externally modified. Otherwise, the location of the next write depends on whether the file is open in append mode or not.

In practice, it is not necessarily the case that every combination of flags is useful, but the combination of O_TRUNC and O_APPEND has an observably different effect than either flag without the other, and the combination is useful in certain situations.
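
A small C sketch may make the difference concrete. It assumes a local, seekable file, and size_after_rewind_write is a hypothetical helper written for this illustration. It truncates the file, writes four bytes, repositions to offset 0, writes two more bytes, and reports the resulting size.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Truncate `path`, write 4 bytes, rewind to offset 0, write 2 more bytes.
 * `extra_flags` is either 0 or O_APPEND.
 * Returns the resulting file size, or -1 on error. */
long size_after_rewind_write(const char *path, int extra_flags)
{
    struct stat st;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | extra_flags, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, "AAAA", 4) != 4 ||
        lseek(fd, 0, SEEK_SET) == (off_t)-1 ||  /* reposition the file offset */
        write(fd, "BB", 2) != 2 ||
        fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return (long)st.st_size;
}
```

With extra_flags set to 0, the second write overwrites the start of the file and the size stays at 4. With O_APPEND, the second write ignores the repositioned offset and lands at the end, so the size becomes 6.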

C system calls open / read / write / close and O_CREAT|O_EXCL

O_EXCL forces the file to be created. If the file already exists, the call fails.

It is used to ensure that the file has to be created, with the given permissions passed in the third parameter. In short, you have these options:

  • O_CREAT: Create the file with the given permissions if the file doesn't already exist. If the file exists, it is opened and permissions are ignored.
  • O_CREAT | O_EXCL: Create the file with the given permissions if the file doesn't already exist. If the file exists, it fails. This is useful in order to create lockfiles and guarantee exclusive access to the file (as long as all programs which use that file follow the same protocol).
  • O_CREAT | O_TRUNC: Create the file with the given permissions if the file doesn't already exist. Otherwise, truncate the file to zero bytes. This has more of the effect we expect when we think "create a new blank file". Still, it keeps the permissions already present in the existing file.
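
To see the three options side by side, here is a short C sketch. The helper try_open is hypothetical and exists only for this comparison: it opens the file write-only with the given creation flags and reports either the file size right after the open or the negated errno.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open `path` write-only with the given creation flags, then close it.
 * Returns the file size seen right after the open on success,
 * or -errno on failure. */
long try_open(const char *path, int creation_flags)
{
    struct stat st;
    int fd = open(path, O_WRONLY | creation_flags, 0644);
    if (fd < 0)
        return -(long)errno;
    if (fstat(fd, &st) != 0) {
        int saved = errno;   /* close() may overwrite errno */
        close(fd);
        return -(long)saved;
    }
    close(fd);
    return (long)st.st_size;
}
```

On an existing 5-byte file, try_open(path, O_CREAT) should report 5, try_open(path, O_CREAT | O_EXCL) should report -EEXIST, and try_open(path, O_CREAT | O_TRUNC) should report 0.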

More information from the manual page:

O_EXCL

When used with O_CREAT, if the file already exists it is an error and the open() will fail. In this context, a symbolic link exists, regardless of where it points to. O_EXCL is broken on NFS file systems; programs which rely on it for performing locking tasks will contain a race condition. The solution for performing atomic file locking using a lockfile is to create a unique file on the same file system (e.g., incorporating hostname and pid), use link(2) to make a link to the lockfile. If link() returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful.
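
The link(2) technique from the manual page excerpt can be sketched in C roughly as follows. This is an illustration under assumptions, not a hardened implementation: acquire_link_lock is a hypothetical name, the unique file is simply the lockfile path with the hostname and pid appended (so it lives on the same file system), and stale locks are not handled.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes gethostname() in strict -std= modes */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Try to take `lockfile` using the link(2) technique.
 * Returns 1 if the lock was acquired, 0 if it is held by someone else,
 * -1 on error. Release the lock with unlink(lockfile). */
int acquire_link_lock(const char *lockfile)
{
    char host[256];
    char unique[4096];
    struct stat st;
    int fd, n, got_lock;

    if (gethostname(host, sizeof host) != 0)
        return -1;
    host[sizeof host - 1] = '\0';  /* may be unterminated if truncated */

    /* Unique name next to the lockfile: <lockfile>.<hostname>.<pid> */
    n = snprintf(unique, sizeof unique, "%s.%s.%ld",
                 lockfile, host, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof unique)
        return -1;

    fd = open(unique, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    if (link(unique, lockfile) == 0)
        got_lock = 1;              /* link() succeeded: lock acquired */
    else if (stat(unique, &st) == 0 && st.st_nlink == 2)
        got_lock = 1;              /* link() reported failure but the link exists */
    else
        got_lock = 0;              /* lockfile already present */

    unlink(unique);                /* the unique name is no longer needed */
    return got_lock;
}
```

A second call while the lock is held should return 0, because link() fails and the link count of the unique file stays at 1.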

Implementing a portable file locking mechanism

The answer to your question is provided at the bottom of the link(2) page of the Linux Programmer's Manual:

   On NFS file systems, the return code may be wrong in case the NFS server performs the link creation and dies before it can say so. Use stat(2) to find out if the link got created.

How to force a refresh of the NFS cache when checking for a newly created file?


This solution belongs to Category B: exit code or return code of the writing function

...only open() and fopen() need to guarantee that they get a consistent handle to a particular file for reading and writing. stat and friends are not required to retrieve fresh attributes. Thus, for the sake of close-to-open cache coherence, only open() and fopen() are considered an "open event" where fresh attributes need to be fetched immediately from the server[1].
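
Based on the statement above that open() counts as an "open event" while stat() does not, one possible approach is to fetch attributes through a fresh open() followed by fstat(). The C sketch below assumes exactly that; stat_via_open is a hypothetical helper name, and whether a cached "file does not exist" result is revalidated still depends on the client's mount options.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Fetch attributes through open() + fstat() instead of a bare stat().
 * Returns 0 and fills *st on success, -1 on error with errno set
 * (ENOENT means the file was not found). */
int stat_via_open(const char *path, struct stat *st)
{
    int rc, saved;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    rc = fstat(fd, st);
    saved = errno;   /* keep fstat()'s errno across close() */
    close(fd);
    errno = saved;
    return rc;
}
```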




The following solutions belong to Category A: NFS client settings

That is, if you do not want cached file or directory entries to be served to the client, disable caching.

Setup a shared cache

If the file in the NFS mount (whose existence is being checked) is created by another application on the same client, possibly using another mount point to the same NFS export, then consider using a single shared NFS cache on the client.

Use the sharecache option to set up the NFS mounts on the client.

This option determines how the client's data-cache and attribute-cache are shared when mounting the same export more than once concurrently. Using the same cache reduces memory requirements on the client and presents identical file contents to applications when the same remote file is accessed via different mount points.


Setting up an NFS mount without caching

Disable attribute caching.

Mount the NFS share on the client with the noac option.

Alternately, disable cached directory attributes from being served.

Use acdirmin=0,acdirmax=0 to set the cache timeouts to 0 (effectively disabling caching).


Setting up an NFS mount to ignore lookup caches

Use lookupcache=positive or lookupcache=none

(available options: all, positive and none)

When attempting to access a directory entry over an NFS mount:

  • If the requested directory entry exists on the server, the result is referred to as positive.
  • If the requested directory entry does not exist on the server, the result is referred to as negative.

If the lookupcache option is not specified, or if all is specified, the client assumes both types of directory cache entries are valid until their parent directory's cached attributes expire.

If pos or positive is specified, the client assumes positive entries are valid until their parent directory's cached attributes expire, but always revalidates negative entries before an application can use them.

If none is specified, the client revalidates both types of directory cache entries before an application can use them. This permits quick detection of files that were created or removed by other clients, but can impact application and server performance.



References:

1. Close-To-Open Cache Consistency in the Linux NFS Client

2. NFS - Detecting remotely created files programmatically?

3. NFS cache : file content not updated on client when modified on server

4. NFS man page. Especially the "Data And Metadata Coherence" section.

How to create a file only if it doesn't exist?

man 2 open:

O_EXCL Ensure that this call creates the file: if this flag is specified in conjunction with O_CREAT, and pathname already exists, then open() will fail. The behavior of O_EXCL is undefined if O_CREAT is not specified.

so, you could call fd = open(name, O_CREAT | O_EXCL, 0644); /* Open() is atomic. (for a reason) */

UPDATE: and you should of course OR one of the O_RDONLY, O_WRONLY, or O_RDWR flags into the flags argument.
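
Putting the call and the update together, a minimal C sketch might look like this. The function name create_if_absent and its return convention are assumptions made for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Create `name` only if it does not exist yet.
 * Returns 1 if this call created the file, 0 if it already existed,
 * -1 on any other error (errno is set). */
int create_if_absent(const char *name)
{
    /* O_WRONLY supplies the required access mode; O_CREAT | O_EXCL makes
     * the existence check and the creation a single atomic step. */
    int fd = open(name, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd >= 0) {
        close(fd);
        return 1;
    }
    return errno == EEXIST ? 0 : -1;
}
```

Checking for EEXIST separately lets the caller tell "someone else created it first" apart from real failures such as a missing directory or lack of permission.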


