On which systems/filesystems is os.open() atomic?

For UN*X-compliant systems (certified to POSIX / IEEE 1003.1 by The Open Group), the behaviour is guaranteed, because The Open Group's specification for open(2) mandates it. Quote:

O_EXCL

If O_CREAT and O_EXCL are set, open() shall fail if the file exists. The check for the existence of the file and the creation of the file if it does not exist shall be atomic with respect to other threads executing open() naming the same filename in the same directory with O_EXCL and O_CREAT set. If O_EXCL and O_CREAT are set, and path names a symbolic link, open() shall fail and set errno to [EEXIST], regardless of the contents of the symbolic link. If O_EXCL is set and O_CREAT is not set, the result is undefined.

The "common" UN*X and UN*X-like systems (Linux, Mac OS X, *BSD, Solaris, AIX, HP-UX) surely behave like that.
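
From Python, these semantics are reached by passing the same flags to os.open(). A minimal sketch, assuming a local POSIX-conforming filesystem; the helper name create_exclusive and the file name used in the demo are made up for illustration:

```python
import os
import tempfile


def create_exclusive(path: str) -> bool:
    """Return True if this call created the file, False if it already existed."""
    try:
        # O_CREAT | O_EXCL asks the OS to do the existence check and the
        # creation as a single step, so two callers cannot both succeed.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        lock = os.path.join(d, "demo.lock")  # hypothetical file name
        print(create_exclusive(lock))  # first caller creates the file
        print(create_exclusive(lock))  # second caller finds it already there
```

Whether the guarantee holds on a network filesystem depends on the server and client, so this should not be relied on there without testing.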

Since the Windows API doesn't have open() as such, the library function there is necessarily reimplemented in terms of the native API but it's possible to maintain the semantics.

I don't know which widely-used systems wouldn't comply; QNX, while not POSIX-certified, has the same statement in its docs for open(). The *BSD manpages do not explicitly mention the "atomicity" but Free/Net/OpenBSD implement it. Even exotics like SymbianOS (which like Windows doesn't have a UN*X-ish open system call) can do the atomic open/create.

For more interesting results, try to find an operating system / C runtime library which has open() but doesn't implement the above semantics for it... and on which Python would run with threads (got you there, MSDOS ...).

Edit: My post particularly focuses on "which operating systems have this characteristic for open?", for which the answer is "pretty much all of them". With respect to filesystems, though, the picture is different, because network filesystems (whether NFS, SMB/CIFS or others) do not always maintain O_EXCL, as this could result in denial of service: if a client does an open(..., O_EXCL, ...) and then simply stops talking to the fileserver or is shut down, everyone else would be locked out.

Is stat() atomic with respect to the file system

Yes, a stat call can be thought of as atomic, in that all the information it returns is guaranteed to be consistent. If you call stat at the same instant some other process is writing to the file, there should be no possibility that, say, the other process's write is reflected in st_mtime but not st_size.

And in any case, there's certainly no possibility that calling stat at the same instant some other process is writing to the file could cause that other process to fail. (That would be a serious and quite unacceptable bug in the operating system: one of an OS's main jobs is to ensure that unrelated processes can't accidentally interact with each other in such ways. This lack-of-interference property isn't usually what we mean by "atomic", though.)

With that said, though, the usual way to monitor a process is via its process ID. And there are probably plenty of prewritten packages out there to help you manage one or more processes that are supposed to run continuously, giving you clean start/stop and monitoring capabilities. (See s6 as an example. I know nothing about this package and am not recommending it; it's just the first one I came across in a web search.)

Another possibility, if you have any kind of IPC mechanism set up between your processes, is to set up a periodic heartbeat that each one publishes, so that a watchdog timer somewhere can detect a process dying.

If you want to keep monitoring your processes by the timeliness of the files they write, though, that sounds like a perfectly fine technique also.
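
As a rough sketch of the file-timeliness approach, assuming each monitored process rewrites or touches its file more often than some threshold (the function name is_stale and the max_age_seconds parameter are hypothetical, not taken from any package):

```python
import os
import time


def is_stale(path: str, max_age_seconds: float) -> bool:
    """Return True if the file is missing or was last modified too long ago."""
    try:
        # A single stat call: every field in the result comes from one snapshot.
        st = os.stat(path)
    except FileNotFoundError:
        return True
    return (time.time() - st.st_mtime) > max_age_seconds
```

A watchdog could call is_stale() periodically for each process's file and raise an alert when it returns True. The threshold needs some slack for clock granularity and scheduling delays.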

How to make file creation an atomic operation?

Write the data to a temporary file and, once it has been written successfully, rename the file to the correct destination, e.g.:

import os

with open(tmpFile, 'w') as f:
    f.write(text)
    # make sure that all data is on disk
    # see http://stackoverflow.com/questions/7433057/is-rename-without-fsync-safe
    f.flush()
    os.fsync(f.fileno())
# os.replace needs Python 3.3+; before that use os.rename,
# which on Windows fails if myFile already exists
os.replace(tmpFile, myFile)

According to the documentation (http://docs.python.org/library/os.html#os.replace):

Rename the file or directory src to dst. If dst is a non-empty directory, OSError will be raised. If dst exists and is a file, it will be replaced silently if the user has permission. The operation may fail if src and dst are on different filesystems. If successful, the renaming will be an atomic operation (this is a POSIX requirement).

Note:

  • The operation may fail, and is not atomic, if the src and dst locations are not on the same filesystem

  • The os.fsync step may be skipped if performance/responsiveness is more important than data integrity in cases like power failure, system crash, etc.
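
Putting the notes together, one possible helper creates the temporary file in the destination's own directory, so that both names live on the same filesystem. This is only a sketch: atomic_write_text is an illustrative name, and the containing directory is not fsynced here, which some setups may also want.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write text to path so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the destination so os.replace stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # may be skipped if durability matters less
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Note that tempfile.mkstemp creates the file readable and writable only by the owner, so the replaced file ends up with those permissions unless they are changed (for example with os.chmod) before the rename.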

Atomicity of `write(2)` to a local filesystem

man 2 write on my system sums it up nicely:

Note that not all file systems are POSIX conforming.

Here is a quote from a recent discussion on the ext4 mailing list:

Currently concurrent reads/writes are atomic only wrt individual pages,
however are not on the system call. This may cause read() to return data
mixed from several different writes, which I do not think it is good
approach. We might argue that application doing this is broken, but
actually this is something we can easily do on filesystem level without
significant performance issues, so we can be consistent. Also POSIX
mentions this as well and XFS filesystem already has this feature.

This is a clear indication that ext4 -- to name just one modern filesystem -- doesn't conform to POSIX.1-2008 in this respect.

Atomic file write operations (cross platform)

AFAIK no.

And the reason is that for such an atomic operation to be possible, there has to be OS support in the form of a transactional file system. And none of the mainstream operating systems offers a transactional file system.

EDIT - I'm wrong for POSIX-compliant systems at least. The POSIX rename syscall performs an atomic replace if a file with the target name already exists ... as pointed out by @janneb. That should be sufficient to do the OP's operation atomically.

However, the fact remains that the Java File.renameTo() method is explicitly not guaranteed to be atomic, so it does not provide a cross-platform solution to the OP's problem.

EDIT 2 - With Java 7 you can use java.nio.file.Files.move(Path source, Path target, CopyOption... options) with StandardCopyOption.ATOMIC_MOVE among the options. If this is not supported (by the OS / file system) you should get an exception (AtomicMoveNotSupportedException).

What filesystem operations are required to be atomic?

I'm not sure fsync(2) is atomic; if a file has 100 megabytes dirty in the buffer cache, it'll take several seconds to write that data out, and the kernel may crash while the transfer to disk is in progress. Perhaps the DMA engine on board can only handle 4-megabyte writes. Perhaps there is no DMA support, and the CPU must schedule every write via 512 byte blocks.

What do you mean by 'atomic'?

mkdir is probably 'atomic': either the directory exists on disk and is linked into a parent directory, or the directory data structure isn't yet linked into a parent directory and is therefore unreachable, i.e. it doesn't exist.
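
Seen from concurrent callers, this "exists or doesn't" property is what lets a directory serve as a simple lock. The following is a sketch of that idiom on a local filesystem, not a statement about crash safety; the function names are invented for illustration:

```python
import os


def try_acquire_dir_lock(lock_dir: str) -> bool:
    """Return True if this call created lock_dir, False if it already existed."""
    try:
        # mkdir either creates the directory or fails because it exists;
        # there is no in-between state visible to other callers.
        os.mkdir(lock_dir)
    except FileExistsError:
        return False
    return True


def release_dir_lock(lock_dir: str) -> None:
    """Remove the lock directory created by try_acquire_dir_lock."""
    os.rmdir(lock_dir)
```

A process that dies while holding the lock leaves the directory behind, so real use needs some policy for stale locks.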

Same might go for mount(2): it would be hard to find a mount(2) half-way complete, and if it fails, the entire mount fails: either the filesystem is mounted, or it isn't.

umount(2) is funny, it can be done lazily, but once it is unmounted, it cannot be used for open(2) or creat(2) calls.

So, I guess it comes down to, what do you mean by 'atomic'? :)

Is rename() atomic?

Yes and no.

rename() is atomic assuming the OS does not crash. It cannot be split by any other filesystem operation.

If the system crashes, you might see the effect of a link() operation instead, i.e. both the old and the new name exist.

Also note that, when operating on a network filesystem, you might get ENOENT even though the operation actually succeeded. A local filesystem can't do that to you.

Is java.io.File.createNewFile() atomic in a network file system?

No, createNewFile doesn't work properly on a network file system.

Even if the system call is atomic, it's only atomic with regard to the local OS, not over the network.
Over time, I got a couple of collisions, roughly once every 2-3 months (approx. once every 600k files).

What happens is that my program runs as six separate instances across two separate servers, so let's call them A1, A2, A3 and B1, B2, B3.

When A1, A2, and A3 try to create the same file, the OS can properly ensure that only one file is created, since it is working with itself.

When A1 and B1 try to create the same file at the same exact moment, there is some form of network cache and/or network delays happening, and they both get a true return from File.createNewFile().

My code then proceeds by renaming the parent folder to stop the other instances of the program from unnecessarily trying to process the folder, and that's where it fails:

  • On A1, the folder renaming operation is successful, but the lock file can't be removed, so A1 just leaves it like that and keeps on processing new incoming folders.
  • On B1, the folder renaming operation (File.renameTo(), can't do much to fix it) gets stuck in an infinite loop because the folder was already renamed (also causing huge I/O traffic according to my sysadmin), and B1 is unable to process any new file until the program is restarted.

Is file append atomic in UNIX?

A write that's under the size of 'PIPE_BUF' is supposed to be atomic. That should be at least 512 bytes, though it could easily be larger (Linux seems to have it set to 4096).

This assumes that all the components involved are fully POSIX-compliant. For instance, this isn't true on NFS.

But assuming you write to a log file you opened in 'O_APPEND' mode and keep your lines (including newline) under 'PIPE_BUF' bytes long, you should be able to have multiple writers to a log file without any corruption issues. Any interrupts will arrive before or after the write, not in the middle. If you want file integrity to survive a reboot you'll also need to call fsync(2) after every write, but that's terrible for performance.

Clarification: read the comments and Oz Solomon's answer. I'm not sure that O_APPEND is supposed to have that PIPE_BUF size atomicity. It's entirely possible that it's just how Linux implemented write(), or it may be due to the underlying filesystem's block sizes.
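
For illustration only, here is a sketch of a writer that follows the advice above: open with O_APPEND and emit each line with a single write call. The helper name append_line is made up, and, per the clarification, whether concurrent writers are really safe depends on the OS and filesystem.

```python
import os


def append_line(path: str, line: str) -> None:
    """Append one newline-terminated line using a single write call."""
    data = (line.rstrip("\n") + "\n").encode("utf-8")
    # O_APPEND makes each write go to the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        written = os.write(fd, data)  # one write call for the whole line
        if written != len(data):
            raise OSError("short write: line may be truncated")
    finally:
        os.close(fd)
```

Keeping each encoded line short (the text above suggests under PIPE_BUF bytes) is part of the assumption; os.fsync(fd) could be added before closing if the line must survive a reboot, at a performance cost.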
