Are Unix Reads and Writes to a Single File Atomically Serialized

Separate write() calls are processed separately, not as a single atomic write transaction, and interleaving is entirely possible when multiple processes/threads are writing to the same file. The order of the actual writes is determined by the schedulers (both the kernel's process scheduler and, for "green" threads, the thread library's scheduler).

Unless you specify otherwise (O_DIRECT open flag or similar, if supported), read() and write() operate on kernel buffers and read() will use a loaded buffer in preference to reading the disk again.

Note that this may be complicated by local file buffering; for example, stdio and iostreams will read file data by blocks into a buffer in the process which is independent of kernel buffers, so a write() from elsewhere to data that are already buffered in stdio won't be seen. Likewise, with output buffering there won't be any actual kernel-level output until the output buffer is flushed, either automatically because it has filled up or manually due to fflush() or C++'s endl (which implicitly flushes the output buffer).

`write` serialization in POSIX

To answer your first question:

"Occur" refers to the whole read, from the point of the call to the point of the value being returned. All of it has to happen after the previous write, and before the next write. The same page says so:

After a write() to a regular file has successfully returned:

  • Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.

  • Any subsequent successful write() to the same byte position in the file shall overwrite that file data.

POSIX makes no guarantee whatsoever on any sort of interleaving, because implementing additional guarantees is quite difficult.

Regarding the second question:

Again, refer to the above quote. If a process called write() and write() returned successfully, any subsequent read by any processes would reflect the written data.

So the answer is "yes, if the first write() failed".

Implementation:

ext4, and almost every other filesystem, uses a page cache. The page cache is an in-memory representation of the file's data (or a relevant part thereof). Any synchronization that needs to be done is done using this representation. In that respect, reading and writing from the file is like reading and writing from shared memory.

The page cache, as the name suggests, is built with pages. In most implementations, a page is a region of 4k of memory, and reads and writes happen on a page basis.

This means that e.g. ext4 will serialize reads & writes on the same 4k region of the file, but a 12k write may not be atomic.

AFAICT, ext4 does not allow multiple concurrent writes on the same page, or concurrent reads & writes on the same page, but this is not guaranteed anywhere.

edit: The filesystem (on-disk) block size might be smaller than a page, in which case some I/O may be done at a block-size granularity, but that is even less reliable in terms of atomicity.

Is a write operation in unix atomic?

To call the Posix semantics "atomic" is perhaps an oversimplification. Posix requires that reads and writes occur in some order:

Writes can be serialized with respect to other reads and writes. If a read() of file data can be proven (by any means) to occur after a write() of the data, it must reflect that write(), even if the calls are made by different processes. A similar requirement applies to multiple write operations to the same file position. This is needed to guarantee the propagation of data from write() calls to subsequent read() calls. (from the Rationale section of the Posix specification for pwrite and write)

The atomicity guarantee mentioned in APUE refers to the use of the O_APPEND flag, which forces writes to be performed at the end of the file:

If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.

With respect to pread and pwrite, APUE says (correctly, of course) that these interfaces allow the application to seek and perform I/O atomically; in other words, that the I/O operation will occur at the specified file position regardless of what any other process does. (Because the position is specified in the call itself, and does not affect the persistent file position.)

The Posix sequencing guarantee is as follows (from the Description of the write() and pwrite() functions):

After a write() to a regular file has successfully returned:

  • Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.

  • Any subsequent successful write() to the same byte position in the file shall overwrite that file data.

As mentioned in the Rationale, this wording does guarantee that two simultaneous write calls (even in different unrelated processes) will not interleave data: if data were interleaved during a write that will eventually succeed, the second guarantee would be impossible to provide. How this is accomplished is up to the implementation.

It must be noted that not all filesystems conform to Posix, and modular OS design, which allows multiple filesystems to coexist in a single installation, makes it impossible for the kernel itself to provide guarantees about write which apply to all available filesystems. Network filesystems are particularly prone to data races (and local mutexes won't help much either), as Posix itself notes (at the end of the paragraph quoted from the Rationale):

This requirement is particularly significant for networked file systems, where some caching schemes violate these semantics.

The first guarantee (about subsequent reads) requires some bookkeeping in the filesystem, because data which has been successfully "written" to a kernel buffer but not yet synched to disk must be made transparently available to processes reading from that file. This also requires some internal locking of kernel metadata.

Since writing to regular files is typically accomplished via kernel buffers and actually synching the data to the physical storage device is definitely not atomic, the locks necessary to provide these guarantees don't have to be very long-lasting. But they must be done inside the filesystem because nothing in the Posix wording limits the guarantees to simultaneous writes within a single threaded process.

Within a multithreaded process, Posix does require read(), write(), pread() and pwrite() to be atomic when they operate on regular files (or symbolic links). See Thread Interactions with Regular File Operations for a complete list of interfaces which must obey this requirement.

Is file append atomic in UNIX?

A write that's under the size of 'PIPE_BUF' is supposed to be atomic. That should be at least 512 bytes, though it could easily be larger (linux seems to have it set to 4096).

This assumes that you're talking about fully POSIX-compliant components. For instance, it isn't true on NFS.

But assuming you write to a log file you opened in 'O_APPEND' mode and keep your lines (including newline) under 'PIPE_BUF' bytes long, you should be able to have multiple writers to a log file without any corruption issues. Any interrupts will arrive before or after the write, not in the middle. If you want file integrity to survive a reboot you'll also need to call fsync(2) after every write, but that's terrible for performance.

Clarification: read the comments and Oz Solomon's answer. I'm not sure that O_APPEND is supposed to have that PIPE_BUF size atomicity. It's entirely possible that it's just how Linux implemented write(), or it may be due to the underlying filesystem's block sizes.

Are POSIX' read() and write() system calls atomic?

I don't believe the text you cite implies anything of the sort. It doesn't even mention read() or write() or POSIX. In fact, read() and write() cannot be relied on to be atomic. The only thing POSIX says is that write() must be atomic if the size of the write is less than PIPE_BUF bytes, and even that only applies to pipes.

I didn't read the context around the part of the paper you cited, but it sounds like the passage you cited is stating constraints which must be placed on an implementation in order for the algorithm to work correctly. In other words, it states that an implementation of this algorithm requires locking.

How you do that locking is up to you (the implementor). If we are dealing with a regular file and multiple independent processes, you might try fcntl(F_SETLKW)-style locking. If your data structure is in memory and you are dealing with multiple threads in the same process, something else might be appropriate.

Atomicity of `write(2)` to a local filesystem

man 2 write on my system sums it up nicely:

Note that not all file systems are POSIX conforming.

Here is a quote from a recent discussion on the ext4 mailing list:

Currently concurrent reads/writes are atomic only wrt individual pages, however are not on the system call. This may cause read() to return data mixed from several different writes, which I do not think it is good approach. We might argue that application doing this is broken, but actually this is something we can easily do on filesystem level without significant performance issues, so we can be consistent. Also POSIX mentions this as well and XFS filesystem already has this feature.

This is a clear indication that ext4 -- to name just one modern filesystem -- doesn't conform to POSIX.1-2008 in this respect.

Does each Unix file description have its own read/write buffers?

This depends a bit on whether you are talking about sockets or actual files.

Strictly speaking, a descriptor never has its own buffers; it's just a handle to a deeper abstraction.

File system objects have their "own" buffers, at least when they are required. That is, if a program writes less than the file system block size, the kernel has no choice but to read a FS block and merge the write with the existing data.

This buffer is attached to the vnode and at a lower level, possibly an inode. It's owned by the file and not the descriptor. It may be kept around for a long time if memory is available.

In the case of a socket, a stream (though not a single descriptor specifically) does actually have buffers that it owns.

Thread safety of appending to single file from multiple processes?

The answer depends on what type of write is going on. If you are using standard I/O with buffering, typically the default for most programs, then the buffer will only be flushed after several lines have been written, and what gets flushed will not necessarily be an integral number of lines. If you are using write(2), or have changed the default stdio buffering to line buffered or unbuffered, then the output will PROBABLY be interleaved correctly as long as the lines are reasonably sized (certainly if lines are less than 512 bytes).

Understanding concurrent file writes from multiple processes

Atomicity of writes less than PIPE_BUF applies only to pipes and FIFOs. For file writes, POSIX says:

This volume of POSIX.1-2008 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control.

...which means that you're on your own - different UNIX-likes will give different guarantees.


