`write(2)`/`read(2)` atomicity between processes in Linux
POSIX doesn't give any minimum guarantee of atomic operations for `read` and `write`, except for writes on a pipe, where a write of up to `PIPE_BUF` (≥ 512) bytes is guaranteed to be atomic; reads on a pipe have no atomicity guarantee. The operation of `read` and `write` is described in terms of byte values; apart from pipes, a `write` operation offers no extra guarantees compared to a loop around single-byte `write` operations.
I'm not aware of any extra guarantee that Linux would give, neither 16 nor 512. In practice I'd expect it to depend on the kernel version, on the filesystem, and possibly on other factors such as the underlying block device, the number of CPUs, the CPU architecture, etc.
The `O_SYNC`, `O_RSYNC` and `O_DSYNC` guarantees (synchronized I/O data integrity completion, given for `read` and `write` in the optional SIO feature of POSIX) are not what you need. They guarantee that writes are committed to persistent storage before the `read` or `write` system call returns, but make no claim about a `write` that is started while a `read` operation is in progress.
In your scenario, reading and writing files doesn't look like the right toolset.
- If you need to transfer only small amounts of data, use pipes. Don't worry too much about copying: copying data in memory is very fast on the scale of most processing, or of a context switch. Plus Linux is pretty good at optimizing copies.
- If you need to transfer large amounts of data, you should probably be using some form of memory mapping: either a shared memory segment if disk backing isn't required, or `mmap` if it is. This doesn't magically solve the atomicity problem, but is likely to improve the performance of a proper synchronization mechanism. To perform synchronization, there are two basic approaches:
  - The producer writes data to shared memory, then sends a notification to the consumer indicating exactly what data is available. The consumer only processes data upon request. The notification may use the same channel (e.g. `mmap` + `msync`) or a different channel (e.g. a pipe).
  - The producer writes data to shared memory, then flushes the write (e.g. `msync`). Then the producer writes a well-known value to one machine word (a `sig_atomic_t` will typically work, even though its atomicity is formally guaranteed only for signals; or, in practice, a `uintptr_t`). The consumer reads that one machine word and only processes the corresponding data if this word has an acceptable value.
Atomicity of `write(2)` to a local filesystem
`man 2 write` on my system sums it up nicely:
Note that not all file systems are POSIX conforming.
Here is a quote from a recent discussion on the `ext4` mailing list:
Currently concurrent reads/writes are atomic only wrt individual pages, however are not on the system call. This may cause `read()` to return data mixed from several different writes, which I do not think it is good approach. We might argue that application doing this is broken, but actually this is something we can easily do on filesystem level without significant performance issues, so we can be consistent. Also POSIX mentions this as well and XFS filesystem already has this feature.
This is a clear indication that `ext4` -- to name just one modern filesystem -- doesn't conform to POSIX.1-2008 in this respect.
`write` serialization in POSIX
To answer your first question:
"Occur" refers to the whole read, from the point of the call to the point of the value being returned. All of it has to happen after the previous write, and before the next write. The same page says so:
After a write() to a regular file has successfully returned:
Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.
Any subsequent successful write() to the same byte position in the file shall overwrite that file data.
POSIX makes no guarantee whatsoever on any sort of interleaving, because implementing additional guarantees is quite difficult.
Regarding the second question:
Again, refer to the above quote. If a process called `write()` and `write()` returned successfully, any subsequent read by any process would reflect the written data.
So the answer is "yes, if the first write() failed".
Implementation:
ext4, and almost every other filesystem, uses a page cache. The page cache is an in-memory representation of the file's data (or the relevant part thereof). Any synchronization that needs to be done is done using this representation. In that respect, reading and writing the file is like reading and writing shared memory.
The page cache, as the name suggests, is built with pages. In most implementations, a page is a region of 4k of memory, and reads and writes happen on a page basis.
This means that e.g. ext4 will serialize reads & writes on the same 4k region of the file, but a 12k write may not be atomic.
AFAICT, ext4 does not allow multiple concurrent writes to the same page, or concurrent reads and writes on the same page, but this is nowhere guaranteed.
edit: The filesystem (on-disk) block size might be smaller than a page, in which case some I/O may be done at a block-size granularity, but that is even less reliable in terms of atomicity.
Is overwriting a small file atomic on ext4?
From my experiment it was not atomic.
Basically my experiment was to have two processes, one writer and one reader. The writer writes to a file in a loop and the reader reads from the file.
Writer Process:
```c
char buf[][18] = {
    "xxxxxxxxxxxxxxxx",
    "yyyyyyyyyyyyyyyy"
};
int i = 0;
while (1) {
    pwrite(fd, buf[i], 18, 0);   /* 16 data bytes plus padding, always at offset 0 */
    i = (i + 1) % 2;
}
```
Reader Process:
```c
char readbuf[18];
while (1) {
    pread(fd, readbuf, 18, 0);
    /* check whether readbuf matches either buf[0] or buf[1] */
}
```
After a while of running both processes, I could see that `readbuf` was sometimes `xxxxxxxxxxxxxxxxyy` or `yyyyyyyyyyyyyyyyxx`.
So it definitively shows that the 18-byte writes are not atomic. In my case, 16-byte writes were always atomic.
The answer was: POSIX doesn't mandate atomicity for writes/reads except for pipes. The 16-byte atomicity that I saw was kernel-specific and may change in future.
Details of the answer in the actual post: `write(2)`/`read(2)` atomicity between processes in Linux
Do Linux pipe read/writes ALWAYS cause a context switch?
You are correct that there is at least one switch into the kernel, but this is merely a privilege-mode change (achieved via the syscall mechanism), not a full context switch to another process.