Is Overwriting a Small File Atomic on Ext4

Is overwriting a small file atomic on ext4?

From my experiment it was not atomic.

Basically, my experiment was to have two processes, one writer and one reader. The writer writes to a file in a loop and the reader reads from the same file.

Writer Process:

char buf[][18] = {
    "xxxxxxxxxxxxxxxx",
    "yyyyyyyyyyyyyyyy"
};
int i = 0;
while (1) {
    /* overwrite the first 18 bytes of the file with one of the two patterns */
    pwrite(fd, buf[i], 18, 0);
    i = (i + 1) % 2;
}

Reader Process:

char readbuf[18];
while (1) {
    pread(fd, readbuf, 18, 0);
    /* check that readbuf matches either buf[0] or buf[1] */
}

After running both processes for a while, I could see that readbuf sometimes contained a mix of the two buffers, i.e. either xxxxxxxxxxxxxxxxyy or yyyyyyyyyyyyyyyyxx.

So this definitively shows that the writes are not atomic. In my case, 16-byte writes were always atomic.

The answer was: POSIX doesn't mandate atomicity for writes/reads except for pipes. The 16-byte atomicity that I saw was kernel-specific and may change in the future.

Details of the answer are in the actual post:
write(2)/read(2) atomicity between processes in linux

After writing a file to an ext4 volume, do I need to do more than flush to guarantee the file is fully written?

flush() ensures that all processes see the file in the same state (the data has reached the kernel's page cache), but it does not guarantee that the bytes have been written to disk. A further call to fsync() or fdatasync() is required for that.
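
For illustration, here is a minimal sketch of that distinction, assuming the "flush" in question is stdio's fflush() and using a placeholder file name:

#include <stdio.h>
#include <unistd.h>

/* Sketch: flush userspace buffers, then ask the kernel to push the data
 * to stable storage. "data.txt" is just a placeholder. */
int write_durably(const char *path, const char *text)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    fputs(text, f);

    /* fflush() moves the data from the stdio buffer into the kernel's
     * page cache, so other processes can now see it... */
    if (fflush(f) != 0) {
        fclose(f);
        return -1;
    }

    /* ...but only fsync()/fdatasync() asks the kernel to write it to disk. */
    if (fsync(fileno(f)) != 0) {
        fclose(f);
        return -1;
    }

    return fclose(f);
}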

Atomicity of `write(2)` to a local filesystem

man 2 write on my system sums it up nicely:

Note that not all file systems are POSIX conforming.

Here is a quote from a recent discussion on the ext4 mailing list:

Currently concurrent reads/writes are atomic only wrt individual pages,
however are not on the system call. This may cause read() to return data
mixed from several different writes, which I do not think it is good
approach. We might argue that application doing this is broken, but
actually this is something we can easily do on filesystem level without
significant performance issues, so we can be consistent. Also POSIX
mentions this as well and XFS filesystem already has this feature.

This is a clear indication that ext4 -- to name just one modern filesystem -- doesn't conform to POSIX.1-2008 in this respect.

Are disk sector writes atomic?

The traditional (SCSI, ATA) disk protocol specifications don't guarantee that any/every sector write is atomic in the event of sudden power loss (but see below for discussion of the NVMe spec). However, it seems tacitly agreed that non-ancient "real" disks quietly try their best to offer this behaviour (e.g. Linux kernel developer Christoph Hellwig mentions this off-hand in the 2017 presentation "Failure-Atomic file updates for Linux").

When it comes to synthetic disks (e.g. network attached block devices, certain types of RAID etc.) things are less clear and they may or may not offer sector atomicity guarantees while legally behaving per their given spec. Imagine a RAID 1 array (without a journal) made up of one disk that offers 512-byte sectors and another that offers 4KiB sectors, thus forcing the RAID to expose a sector size of 4KiB. As a thought experiment, you can construct a scenario where each individual disk offers sector atomicity (relative to its own sector size) but where the RAID device does not in the face of power loss. This is because it would depend on whether the 512-byte-sector disk was the one being read by the RAID and how many of the eight 512-byte sectors comprising the 4KiB RAID sector it had written before the power failed.

Sometimes specifications offer atomicity guarantees but only on certain write commands. The SCSI disk spec is an example of this and the optional WRITE ATOMIC(16) command can even give a guarantee beyond a sector but being optional it's rarely implemented (and thus rarely used). The more commonly implemented COMPARE AND WRITE is also atomic (potentially across multiple sectors too) but again it's optional for a SCSI device and comes with different semantics to a plain write...

Curiously, the NVMe spec was written in such a way to guarantee sector atomicity thanks to Linux kernel developer Matthew Wilcox. Devices that are compliant with that spec have to offer a guarantee of sector write atomicity and may choose to offer contiguous multi-sector atomicity up to a specified limit (see the AWUPF field). However, it's unclear how you can discover and use any multi-sector guarantee if you aren't currently in a position to send raw NVMe commands...

Andy Rudoff is an engineer who talks about investigations he has done on the topic of write atomicity. His presentation "Protecting SW From Itself: Powerfail Atomicity for Block Writes" (slides) has a section of video where he talks about how power failure impacts in-flight writes on traditional storage. He describes how he contacted hard drive manufacturers about the statement "a disk's rotational energy is used to ensure that writes are completed in the face of power loss" but the replies were non-committal as to whether that manufacturer actually performed such an action. Further, no manufacturer would say that torn writes never happen, and while he was at Sun, ZFS added checksums to blocks which led to them uncovering cases of torn writes during testing. It's not all bleak though - Andy talks about how sector tearing is rare and if a write is interrupted then you usually get only the old sector, only the new sector, or an error (so at least the corruption is not silent). Andy also has an older slide deck Write Atomicity and NVM Drive Design which collects popular claims and cautions that a lot of software (including various popular filesystems on multiple OSes) is actually unknowingly dependent on sector writes being atomic...

(The following takes a Linux centric view but many of the concepts apply to general-purpose OSes that are not being deployed in tightly controlled hardware environments)

Going back to 2013, BtrFS lead developer Chris Mason talked about how (the now defunct) Fusion-io had created a storage product that implemented atomic write operations (Chris was working for Fusion-io at the time). Fusion-io also created a proprietary filesystem "DirectFS" (written by Chris) to expose this feature. The MariaDB developers implemented a mode that could take advantage of this behaviour by no longer doing double buffering, resulting in "43% more transactions per second and half the wear on the storage device". Chris proposed a patch so that generic filesystems (such as BtrFS) could advertise that they provided atomicity guarantees via a new O_ATOMIC flag, but block layer changes would also be needed. Said block layer changes were proposed by Chris in a later patch series that added the function blk_queue_set_atomic_write(). However, neither of the patch series ever entered the mainline Linux kernel and there is no O_ATOMIC flag in the (as of 2020) mainline 5.7 Linux kernel.

Before we go further, it's worth noting that even if a lower level doesn't offer an atomicity guarantee, a higher level can still provide atomicity (albeit with performance overhead) to its users so long as it knows when a write has reached stable storage. If fsync() can tell you when writes are on stable storage (technically not guaranteed by POSIX, but the case on modern Linux), then because POSIX rename is atomic you can use the create new file/fsync/rename dance to do atomic file updates, thus allowing applications to do double buffering/Write Ahead Logging themselves. Other examples lower down in the stack are Copy On Write filesystems like BtrFS and ZFS. Because of their semantics, these filesystems give userspace programs a guarantee of "all the old data" or "all the new data" after a crash at sizes greater than a sector, even though a disk may not offer atomic writes. You can push this idea all the way down into the disk itself, where NAND based SSDs don't overwrite the area currently used by an existing LBA and instead write the data to a new region and keep a mapping of where the LBA's data now lives.
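
As a concrete illustration of that dance, a minimal sketch (file names are placeholders, most error handling is trimmed, and the final directory fsync() is needed only if the rename itself must be durable):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch of the create-new-file/fsync/rename pattern: readers of "path"
 * see either the old contents or the new contents, never a mix. */
int replace_file_atomically(const char *path, const char *tmp_path,
                            const void *data, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    close(fd);

    /* rename() is atomic: the name refers to either the old file or the new one. */
    if (rename(tmp_path, path) != 0)
        return -1;

    /* For durability of the rename itself, fsync() the containing directory
     * (assumed here to be the current directory). */
    int dirfd = open(".", O_RDONLY | O_DIRECTORY);
    if (dirfd >= 0) {
        fsync(dirfd);
        close(dirfd);
    }
    return 0;
}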

Resuming our abridged timeline, in 2015 HP researchers wrote a paper Failure-Atomic Updates of Application Data in a Linux File System (PDF) (media) about introducing a new feature into the Linux port of AdvFS (AdvFS was originally part of DEC's Tru64):

If a file is opened with a new O_ATOMIC flag, the state of its application data will always reflect the most recent successful msync, fsync, or fdatasync. AdvFS furthermore includes a new syncv operation that combines updates to multiple files into a failure-atomic bundle [...]

In 2017, Christoph Hellwig wrote experimental patches to XFS to provide O_ATOMIC. In the "Failure-Atomic file updates for Linux" talk (slides) he explains how he drew inspiration from the 2015 paper (but without the multi-file support) and the patchset extends the XFS reflink work that already existed. However, despite an initial mailing list post, at the time of writing (mid 2020) this patchset is not in the mainline kernel.

During the database track of the 2019 Linux Plumbers Conference, MySQL developer Dimitri Kravtchuk asked if there were plans to support O_ATOMIC (link goes to start of filmed discussion). Those assembled mentioned the XFS work above, that Intel claims it can do atomicity on Optane but Linux doesn't provide an interface to expose it, and that Google claims to provide 16KiB atomicity on GCE storage [1]. Another key point is that many database developers need something larger than 4KiB atomicity to avoid having to do double writes - PostgreSQL needs 8KiB, MySQL needs 16KiB and apparently the Oracle database needs 64KiB. Further, Dr Richard Hipp (author of the SQLite database) asked if there's a standard interface to request atomicity, because today SQLite makes use of the F2FS filesystem's ability to do atomic updates via custom ioctl()s, but that ioctl is tied to one filesystem. Chris replied that for the time being there's nothing standard and nothing provides the O_ATOMIC interface.

At the 2021 Linux Plumbers Conference Darrick Wong re-raised the topic of atomic writes (link goes to start of filmed discussion). He pointed out there are two different things that people mean when they say they want atomic writes:

  1. Hardware provides some atomicity API and this capability is somehow exposed through the software stack
  2. Make the filesystem do all the work to expose some sort of atomic write API irrespective of hardware

Darrick mentioned that Christoph had ideas for 1. in the past, but Christoph has not come back to the topic, and further there are unanswered questions (how you make userspace aware of the limits, and the fact that if the feature were exposed it would be restricted to direct I/O, which may be problematic for many programs). Instead, Darrick's suggestion for tackling 2. was to propose his FIEXCHANGE_RANGE ioctl, which swaps the contents of two files (the swap is restartable if it fails part way through). This approach doesn't have the limits (e.g. smallish contiguous size, maximum number of scatter-gather vectors, direct I/O only) that a hardware based solution would have and could theoretically be implemented in the VFS, thus being filesystem agnostic...

TLDR; if you are in tight control of your whole stack, from the application all the way down to the physical disks (so you can control and qualify the whole lot), you can arrange to have what you need to make use of disk atomicity. If you're not in that situation, or you're talking about the general case, you should not depend on sector writes being atomic.

When the OS sends the command to write a sector to disk is it atomic?

At the time of writing (mid-2020):

  • When using a mainline 4.14+ Linux kernel
  • If you are dealing with a real disk

a sector write sent by the kernel is likely atomic (assuming a sector is no bigger than 4KiB). In controlled cases (battery backed controller, NVMe disk which claims to support atomic writes, SCSI disk where the vendor has given you assurances etc.) a userspace program may be able to use O_DIRECT, so long as O_DIRECT isn't silently falling back to buffered I/O, the I/O doesn't get split apart or merged at the block layer, or you are sending device-specific commands and bypassing the block layer entirely. However, in the general case neither the kernel nor a userspace program can safely assume sector write atomicity.
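
For reference, here is a sketch of what issuing a single aligned write through O_DIRECT might look like; the device path, the 4KiB size and, above all, the atomicity of the result are assumptions you would have to verify for your particular stack:

#define _GNU_SOURCE   /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096    /* assumed logical block size; verify for your device */

/* Sketch only: whether this single write is actually atomic depends on the
 * disk, the block layer and the kernel version, as discussed above.
 * "/dev/sdX" is a placeholder. */
int write_one_block(const char *dev, const void *src)
{
    void *buf;
    int fd, ret = -1;

    /* O_DIRECT needs the buffer (and the offset/length) aligned. */
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0)
        return -1;
    memcpy(buf, src, BLOCK);

    fd = open(dev, O_WRONLY | O_DIRECT | O_SYNC);
    if (fd >= 0) {
        if (pwrite(fd, buf, BLOCK, 0) == BLOCK)
            ret = 0;
        close(fd);
    }
    free(buf);
    return ret;
}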

Can you ever end up with a situation where the data on disk is part X, part Y, and part garbage?

From a specification perspective if you are talking about a SCSI disk doing a regular SCSI WRITE(16) and a power failure happening in the middle of that write then the answer is yes: a sector could contain part X, part Y AND part garbage. A crash during an inflight write means the data read from the area that was being written to is indeterminate and the disk is free to choose what it returns as data from that region. This means all old data, all new data, some old and new, all zeros, all ones, random data etc. are all "legal" values to return for said sector. From an old draft of the SBC-3 spec:

4.9 Write failures

If one or more commands performing write operations are in the task set and are being processed when power is lost (e.g., resulting in a vendor-specific command timeout by the application client) or a medium error or hardware error occurs (e.g., because a removable medium was incorrectly unmounted), the data in the logical blocks being written by those commands is indeterminate. When accessed by a command performing a read or verify operation (e.g., after power on or after the removable medium is mounted), the device server may return old data, new data, or vendor-specific data in those logical blocks.

Before reading logical blocks which encountered such a failure, an application client should reissue any commands performing write operations that were outstanding.


[1] In 2018 Google announced it had tweaked its cloud SQL stack and that this allowed them to use 16k atomic writes in MySQL (with innodb_doublewrite=0) via O_DIRECT... The underlying customisations Google performed were described as being in the virtualized storage, kernel, virtio and ext4 filesystem layers. Further, a no longer available beta document titled Best practices for 16 KB persistent disk and MySQL (archived copy) described what end users had to do to safely make use of the feature. Changes included: using an appropriate Google provided VM, using specialized storage, changing block device parameters and carefully creating an ext4 filesystem with a specific layout. However, at some point in 2020 this document vanished from GCE's online guides, suggesting such end user tuning is not supported.

write(2)/read(2) atomicity between processes in linux

POSIX doesn't give any minimum guarantee of atomic operations for read and write except for writes on a pipe (where a write of up to PIPE_BUF (≥ 512) bytes is guaranteed to be atomic, but reads have no atomicity guarantee). The operation of read and write is described in terms of byte values; apart from pipes, a write operation offers no extra guarantees compared to a loop around single-byte write operations.
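
To illustrate the pipe case: as long as each record goes out in a single write() of at most PIPE_BUF bytes, concurrent writers cannot interleave their records. A minimal sketch (error handling omitted):

#include <limits.h>   /* PIPE_BUF */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: each record goes out in one write() of <= PIPE_BUF bytes, so even
 * with several writer processes on the same pipe the reader always sees
 * whole records, never a mix of two of them. */
void write_record(int pipe_fd, const char *tag)
{
    char record[128];                         /* 128 <= PIPE_BUF (at least 512) */
    int n = snprintf(record, sizeof record, "message from %s\n", tag);
    write(pipe_fd, record, n);                /* atomic as a whole */
}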

I'm not aware of any extra guarantee that Linux would give, neither 16 nor 512. In practice I'd expect it to depend on the kernel version, on the filesystem, and possibly on other factors such as the underlying block device, the number of CPUs, the CPU architecture, etc.

The O_SYNC, O_RSYNC and O_DSYNC guarantees (synchronized I/O data integrity completion, given for read and write in the optional SIO feature of POSIX) are not what you need. They guarantee that writes are committed to persistent storage before the read or write system call, but do not make any claim regarding a write that is started while the read operation is in progress.

In your scenario, reading and writing files doesn't look like the right toolset.

  • If you need to transfer only small amounts of data, use pipes. Don't worry too much about copying: copying data in memory is very fast on the scale of most processing, or of a context switch. Plus Linux is pretty good at optimizing copies.
  • If you need to transfer large amounts of data, you should probably be using some form of memory mapping: either a shared memory segment if disk backing isn't required, or mmap if it is. This doesn't magically solve the atomicity problem, but is likely to improve the performance of a proper synchronization mechanism. To perform synchronization, there are two basic approaches:

    • The producer writes data to shared memory, then sends a notification to the consumer indicating exactly what data is available. The consumer only processes data upon request. The notification may use the same channel (e.g. mmap + msync) or a different channel (e.g. pipe).
    • The producer writes data to shared memory, then flushes the write (e.g. msync). Then the producer writes a well-known value to one machine word (a sig_atomic_t will typically work, even though its atomicity is formally guaranteed only for signals; in practice a uintptr_t also works). The consumer reads that one machine word and only processes the corresponding data if this word has an acceptable value. A minimal sketch of this approach follows the list.
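
One possible shape of the second approach, as a sketch; the struct layout, file name, sequence-counter protocol and use of GCC's __atomic builtins are illustrative choices rather than the only way to do it:

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* A file-backed shared mapping holding the payload plus one machine word
 * used as a "data is ready" marker (here a sequence counter). */
struct shared {
    char     payload[4096];
    uint64_t seq;          /* bumped by the producer once the payload is complete */
};

struct shared *map_shared(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, sizeof(struct shared)) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, sizeof(struct shared), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

/* Producer (single producer assumed): fill the payload, flush it,
 * then publish by bumping seq. */
void publish(struct shared *s, const char *msg)
{
    strncpy(s->payload, msg, sizeof s->payload - 1);
    msync(s->payload, sizeof s->payload, MS_SYNC);
    __atomic_store_n(&s->seq, s->seq + 1, __ATOMIC_RELEASE);
}

/* Consumer: only look at the payload once seq has advanced. */
int consume(struct shared *s, uint64_t last_seen, char *out, size_t outlen)
{
    uint64_t seq = __atomic_load_n(&s->seq, __ATOMIC_ACQUIRE);
    if (seq == last_seen)
        return 0;                       /* nothing new yet */
    strncpy(out, s->payload, outlen - 1);
    out[outlen - 1] = '\0';
    return 1;
}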

Is file append atomic in UNIX?

A write that's under the size of 'PIPE_BUF' is supposed to be atomic. That should be at least 512 bytes, though it could easily be larger (Linux seems to have it set to 4096).

This assumes that you're talking about all fully POSIX-compliant components. For instance, this isn't true on NFS.

But assuming you write to a log file you opened in 'O_APPEND' mode and keep your lines (including newline) under 'PIPE_BUF' bytes long, you should be able to have multiple writers to a log file without any corruption issues. Any interrupts will arrive before or after the write, not in the middle. If you want file integrity to survive a reboot you'll also need to call fsync(2) after every write, but that's terrible for performance.

Clarification: read the comments and Oz Solomon's answer. I'm not sure that O_APPEND is supposed to have that PIPE_BUF size atomicity. It's entirely possible that it's just how Linux implemented write(), or it may be due to the underlying filesystem's block sizes.
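
For illustration, a sketch of the O_APPEND logging pattern described above; the log file name is a placeholder and each line is kept well under PIPE_BUF:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: several processes can call this concurrently. Each line is written
 * with a single write() on an O_APPEND descriptor, so on a POSIX-compliant
 * local filesystem the lines should not interleave. "app.log" is just a
 * placeholder name, and the line is assumed to fit in buf. */
int log_line(const char *line)
{
    char buf[256];                            /* well under PIPE_BUF (>= 512) */
    int fd, n;

    fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    n = snprintf(buf, sizeof buf, "%s\n", line);
    if (write(fd, buf, n) != n) {             /* the whole line in one write() */
        close(fd);
        return -1;
    }
    /* Optionally fsync(fd) here if the log must survive a crash,
     * at a significant performance cost. */
    return close(fd);
}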

Understanding concurrent file writes from multiple processes

Atomicity of writes less than PIPE_BUF applies only to pipes and FIFOs. For file writes, POSIX says:

This volume of POSIX.1-2008 does not specify behavior of concurrent
writes to a file from multiple processes. Applications should use some
form of concurrency control.

...which means that you're on your own - different UNIX-likes will give different guarantees.
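
One common form of such concurrency control is POSIX advisory record locking with fcntl(); a minimal sketch that takes a whole-file write lock around each write (error handling trimmed):

#include <fcntl.h>
#include <unistd.h>

/* Sketch: serialize writers on the whole file with POSIX advisory record
 * locks. All cooperating processes must use the same protocol; the lock
 * does nothing against processes that ignore it. */
int write_locked(int fd, const void *buf, size_t len, off_t offset)
{
    struct flock lk = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,            /* 0 means "to end of file": whole file */
    };

    if (fcntl(fd, F_SETLKW, &lk) != 0)        /* block until we get the lock */
        return -1;

    ssize_t n = pwrite(fd, buf, len, offset);

    lk.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &lk);
    return n == (ssize_t)len ? 0 : -1;
}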

What is the in-memory preallocation range in the ext4 file system?

How does ext4 maintain the preallocation space?

As per the ext4 documentation, the preallocation space is maintained in two places:

  1. Per inode.

  2. Per locality/CPU group.

    For small files, prealloc blocks are picked from the per-locality-group space (both during and after the allocation); for large files, the per-inode space is used.

How does it calculate the size of the space to be preallocated for the next allocation?

From the ext4 documentation:

Before allocating blocks via buddy cache we normalize the request
blocks. This ensure we ask for more blocks that we needed. The extra
blocks that we get after allocation is added to the respective prealloc
list.

More details of the allocation algorithm (e.g. how normalization of an allocation request is done) are explained quite well in the linked header file.


