When Does The Write() System Call Write All of The Requested Buffer Versus Just Doing a Partial Write

When does the write() system call write all of the requested buffer versus just doing a partial write?

You need to check errno to see whether your call was interrupted, or otherwise why write() returned early and wrote only some of the requested bytes.

From man 2 write

When using non-blocking I/O on objects such as sockets that are subject to flow control, write() and writev() may write fewer bytes than requested; the return value must be noted, and the remainder of the operation should be retried when possible.

Basically, apart from writes to a non-blocking socket, the only other time this will happen is when the call is interrupted by a signal.

[EINTR] A signal interrupted the write before it could be completed.

See the Errors section in the man page for more information on what can be returned, and when it will be returned. From there you need to figure out if the error is severe enough to log an error and quit, or if you can continue the operation at hand!
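
As a rough sketch of that decision (the descriptor, message, and error policy here are just placeholders, not from the man page), you might triage a single write() like this:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello\n";
    ssize_t n = write(STDOUT_FILENO, msg, sizeof msg - 1);

    if (n < 0) {
        if (errno == EINTR) {
            /* Interrupted before anything was written: safe to simply retry. */
        } else {
            /* A real error (e.g. EIO, ENOSPC): log it and decide whether to quit. */
            perror("write");
            return 1;
        }
    } else if ((size_t)n < sizeof msg - 1) {
        /* Short write: not an error; continue from msg + n with the remaining bytes. */
    }
    return 0;
}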

This is all discussed in the book Advanced UNIX Programming by Marc J. Rochkind. I have written countless programs with the help of this book and would recommend it when programming for a UNIX-like OS.

POSIX partial write() and Signal Interrupts

To answer your individual numbered questions:

  1. errno is only meaningful after one of the standard functions returns a value indicating an error - for write, -1 - and before any other standard function or application code that might clobber it is called. So no, if write returns a short write, errno will not be set to anything meaningful. If it's equal to EINTR, it just happens to be; this is not something meaningful you can interpret.

  2. The way you identify such an event is by the return value being strictly less than the nbytes argument. This doesn't actually tell you the cause of the short write, so it could be something else like running out of space. If you need to know, you need to arrange for the signal handler to inform you. But in almost all cases you don't actually need to know.

Regarding the note: if write is returning the full nbytes after a signal arrives, the signal handler was non-interrupting. This is the default on Linux with any modern libc (glibc, musl, anything but libc5, basically), and it's almost always the right thing. If you actually want interrupting signals, you have to install the signal handler with sigaction with the SA_RESTART flag clear. (Conversely, if you are installing signal handlers and want the normal, reasonable, non-interrupting behavior, for portability you should use sigaction with the SA_RESTART flag set rather than the legacy signal function.)
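
For reference, here is a minimal sketch of installing a handler that way (the signal and handler are arbitrary placeholders); whether SA_RESTART is set is what decides between restarting and interrupting behavior:

#include <signal.h>
#include <string.h>

static void on_sigusr1(int signo)
{
    (void)signo;   /* nothing to do; the point is how the handler is installed */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;      /* restart slow syscalls such as write() */
    sigaction(SIGUSR1, &sa, NULL);

    /* With SA_RESTART set, a blocking write() hit by SIGUSR1 is resumed.
       Clear the flag (sa.sa_flags = 0) if you want EINTR / short writes instead. */
    return 0;
}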

Why does Python split the read function into multiple syscalls?

I did some research on exactly why this happens.

Note: I did my tests with Python 3.5. Python 2 has a different I/O system with the same quirk for a similar reason, but this was easier to understand with the new IO system in Python 3.

As it turns out, this is due to Python's BufferedReader, not anything about the actual system calls.

You can try this code:

fp = open('/dev/urandom', 'rb')
fp = fp.detach()
ans = fp.read(65600)
fp.close()

If you try to strace this code, you will find:

read(3, "]\"\34\277V\21\223$l\361\234\16:\306V\323\266M\215\331\3bdU\265C\213\227\225pWV"..., 65600) = 65600

Our original file object was a BufferedReader:

>>> open("/dev/urandom", "rb")
<_io.BufferedReader name='/dev/urandom'>

If we call detach() on this, then we throw away the BufferedReader portion and just get the FileIO, which is what talks to the kernel. At this layer, it'll read everything at once.

So the behavior that we're looking for is in BufferedReader. We can look in Modules/_io/bufferedio.c in the Python source, specifically the function _io__Buffered_read_impl. In our case, where the file has not yet been read from until this point, we dispatch to _bufferedreader_read_generic.

Now, this is where the quirk we see comes from:

    while (remaining > 0) {
        /* We want to read a whole block at the end into buffer.
           If we had readv() we could do this in one pass. */
        Py_ssize_t r = MINUS_LAST_BLOCK(self, remaining);
        if (r == 0)
            break;
        r = _bufferedreader_raw_read(self, out + written, r);

Essentially, this will read as many full "blocks" as possible directly into the output buffer. The block size is based on the parameter passed to the BufferedReader constructor, which has a default selected by a few parameters:

Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device's "block size" and falling back on `io.DEFAULT_BUFFER_SIZE`. On many systems, the buffer will typically be 4096 or 8192 bytes long.

So this code will read as much as possible without needing to start filling its buffer. This will be 65536 bytes in this case, because it's the largest multiple of 4096 bytes less than or equal to 65600. By doing this, it can read the data directly into the output and avoid filling up and emptying its own buffer, which would be slower.

Once it's done with that, there might be a bit more to read. In our case, 65600 - 65536 == 64, so it needs to read at least 64 more bytes. And yet it reads 4096! What gives? Well, the key here is that the point of a BufferedReader is to minimize the number of kernel reads we actually have to do, since each read has significant overhead in and of itself. So it simply reads another block to fill its buffer (4096 bytes) and gives you the first 64 of those.

Hopefully, that makes sense in terms of explaining why it happens like this.

As a demonstration, we could try this program:

import _io
fp = _io.BufferedReader(_io.FileIO("/dev/urandom", "rb"), 30000)
ans = fp.read(65600)
fp.close()

With this, strace tells us:

read(3, "\357\202{u'\364\6R\fr\20\f~\254\372\3705\2\332JF\n\210\341\2s\365]\270\r\306B"..., 60000) = 60000
read(3, "\266_ \323\346\302}\32\334Yl\ry\215\326\222\363O\303\367\353\340\303\234\0\370Y_\3232\21\36"..., 30000) = 30000

Sure enough, this follows the same pattern: as many blocks as possible, and then one more.

dd, in a quest for high efficiency of copying lots and lots of data, would try to read up to a much larger amount at once, which is why it only uses one read. Try it with a larger set of data, and I suspect you may find multiple calls to read.

TL;DR: the BufferedReader reads as many full blocks as possible (16 * 4096 = 65536 bytes) and then one extra block of 4096 to fill its buffer.

EDIT:

The easy way to change the buffer size, as @fcatho pointed out, is to change the buffering argument on open:

open(name[, mode[, buffering]])

( ... )

The optional buffering argument specifies the file’s desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used.

This works on both Python 2 and Python 3.

Why can't linux write more than 2147479552 bytes?

Why is this here?

I don't think there's necessarily a good reason for this - I think this is basically a historical artifact. Let me explain with some git archeology.

In current Linux, this limit is governed by MAX_RW_COUNT:

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
    [...]
    if (count > MAX_RW_COUNT)
        count = MAX_RW_COUNT;

That constant is defined as the bitwise AND of the maximum integer value and the page mask; in effect, it's the maximum integer value rounded down to a page boundary, i.e. roughly one page less than INT_MAX.

#define MAX_RW_COUNT (INT_MAX & PAGE_MASK)

So that's where 0x7ffff000 comes from - your platform has pages which are 4096 bytes wide, which is 2^12, so it's the max integer value with the bottom 12 bits unset.
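
As a quick userspace sketch of the same arithmetic (the kernel's PAGE_MASK is ~(PAGE_SIZE - 1), so here it's reconstructed from sysconf; the output shown assumes 4096-byte pages):

#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);       /* typically 4096 */
    long max_rw = INT_MAX & ~(page_size - 1);     /* same shape as INT_MAX & PAGE_MASK */

    printf("page size: %ld, cap: %#lx (%ld bytes)\n",
           page_size, (unsigned long)max_rw, max_rw);
    /* With 4096-byte pages this prints 0x7ffff000, i.e. 2147479552. */
    return 0;
}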

The last commit to change this, ignoring commits which just move things around, was e28cc71572da3.

Author: Linus Torvalds <torvalds@g5.osdl.org>
Date: Wed Jan 4 16:20:40 2006 -0800

Relax the rw_verify_area() error checking.

In particular, allow over-large read- or write-requests to be downgraded
to a more reasonable range, rather than considering them outright errors.

We want to protect lower layers from (the sadly all too common) overflow
conditions, but prefer to do so by chopping the requests up, rather than
just refusing them outright.

So, this gives us a reason for the change: to prevent integer overflow, the size of the write is capped at a size near the maximum integer. Most of the surrounding logic seems to have been changed to use longs or size_t's, but the check remains.

Before this change, giving it a buffer larger than INT_MAX would result in an EINVAL error:

if (unlikely(count > INT_MAX))
goto Einval;

As for why this limit was put in place, it existed prior to 2.6.12, the first version that was put into git. I'll let someone with more patience than me figure that one out. :)

Is this POSIX compliant?

Putting on my standards lawyer hat, I think this is actually POSIX compliant. Yes, POSIX does say that writes larger than SSIZE_MAX are implementation-defined behavior, and this is not larger than that limit. However, there are two other sentences in the standard which I think are important:

The write() function shall attempt to write nbyte bytes from the buffer pointed to by buf to the file associated with the open file descriptor, fildes.

[...]

Upon successful completion, write() and pwrite() shall return the number of bytes actually written to the file associated with fildes. This number shall never be greater than nbyte. Otherwise, -1 shall be returned and errno set to indicate the error.

The partial write is explicitly allowed by the standard. For this reason, all code which calls write() needs to wrap calls to write() in a loop which retries short writes.
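
A minimal sketch of such a loop might look like this (the name full_write and the EINTR policy are mine, not from the standard):

#include <errno.h>
#include <unistd.h>

/* Keep calling write() until the whole buffer is out or a real error occurs.
   Returns 0 on success, -1 on error (errno left set by write). */
int full_write(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing anything: retry */
            return -1;             /* genuine error */
        }
        p += n;                    /* short write: advance past what went out */
        len -= (size_t)n;
    }
    return 0;
}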

Should the limit be raised?

Ignoring the historical baggage, and the standard, is there a reason to raise this limit today?

I'd argue the answer is no. The optimal size of the write() buffer is a tradeoff between trying to avoid excessive context switches between kernel and userspace, and ensuring your data fits into cache as much as possible.

The coreutils programs (which provide cat, cp, etc) use a buffer size of 128KiB. The optimal size for your hardware might be slightly larger or smaller. But it's unlikely that 2GB buffers are going to be faster.

Atomic syscall. Input/Output operations

Igor is right: just have one thread do all the log writes. Keep in mind that the kernel has to do locking to synchronize access to the open file descriptor (which keeps track of the file position), so by doing writes from multiple cores you're causing contention inside the kernel. Even worse, you're making system calls from multiple cores, which means the kernel's code / data accesses will dirty your caches on multiple cores.

See this paper for more about the impact of making system calls on the performance of user-space code after the syscall completes (and about data / instruction cache misses inside the kernel for infrequent syscalls). It definitely makes sense to have one thread doing all the system calls, at least all the write system calls, both to keep that part of your process's footprint isolated to one core and to limit the locking contention inside the kernel.

That FlexSC paper is about an idea for batching system calls to reduce user->kernel->user transitions, but they also measure overhead for the normal synchronous system-call method. More important is the discussion of cache-pollution from making system calls.


Alternatively, if you can let multiple threads write to your log file, you could just do that and not use the queue at all.

It's not guaranteed that a large write will finish uninterrupted, but a small to medium sized write should (almost?) always copy its whole buffer on most OSes. Especially if you're writing to a file, not a pipe. IDK how Linux write() behaves when it's preempted, but I expect it usually resumes to finish the write instead of returning without having written all the requested bytes. Partial writes might be more likely when interrupted by a signal.

It is guaranteed that bytes from two write() system calls won't be mixed together; all the bytes from one will be before or after the bytes from the other. You're correct that partial writes are a potential problem, though. I forget if the glibc syscall wrapper will resume the call for you on EINTR. Although in that case, it means no bytes actually got written, or it would have returned success with a byte count.

You should test this, for partial writes and for performance. kernel-space locking might be cheaper than the overhead of your lock-free queue, but making system calls from every thread that generates log messages might be worse for performance. (And when you test this, make sure you do it with some real work happening in your user-space process, not just a loop that only calls write.)

Handling Incomplete write() Calls

write() will return a negative number, with nothing written, under two circumstances:

  • A temporary error (e.g. EINTR, EAGAIN, and EWOULDBLOCK); the first of these can happen with any write, the second two (broadly) only on non-blocking I/O.

  • A permanent error.

Normally you would want to retry the first, so the routine is to repeat the write if EINTR, EAGAIN or EWOULDBLOCK is returned (though I've seen argument against the latter).

For example:

ssize_t
write_with_retry (int fd, const void* buf, size_t size)
{
    ssize_t ret;
    do
    {
        ret = write(fd, buf, size);
    } while ((ret < 0) && (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK));
    return ret;
}

Also note (from the man page) that write can return a number of bytes less than you requested, whether in the case of non-blocking I/O or blocking I/O (as the Linux man page makes clear).

OS-X man-page extract:

When using non-blocking I/O on objects, such as sockets, that are subject to flow control, write() and writev() may write fewer bytes than requested; the return value must be noted, and the remainder of the operation should be retried when possible.

Linux man-page extract (my emphasis):

The number of bytes written may be less than count if, for example, there is insufficient space on the underlying physical medium, or the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)), or the call was interrupted by a signal handler after having written less than count bytes.

You would normally be handling those with select(), but to handle that case manually:

ssize_t
write_with_retry (int fd, const void* buf, size_t size)
{
    const char *p = buf;   /* arithmetic on a void pointer is not standard C */
    ssize_t ret;
    while (size > 0) {
        do
        {
            ret = write(fd, p, size);
        } while ((ret < 0) && (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK));
        if (ret < 0)
            return ret;
        size -= ret;
        p += ret;
    }
    return 0;
}

What are the conditions under which a short read/write can occur?

For your second question: write() can return a short write when the descriptor is non-blocking and only limited buffer space is available.

How can I synchronize -- make atomic -- writes on one file from two processes?

If you want the contents of both buffers to be present, you have to open the file in each process with the O_APPEND flag set. The append flag positions each write at the end of the file before writing. Without it, it's possible that both processes will be pointing at the same or overlapping areas of the file, and whoever writes last will overwrite what the other has written.
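
As a minimal sketch (the file name and record text are hypothetical), each process would open and append like this:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("shared.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char line[] = "process A: one whole record\n";
    /* With O_APPEND the kernel positions this write at the current end of file,
       so concurrent appenders don't overwrite each other's records. */
    if (write(fd, line, sizeof line - 1) < 0)
        perror("write");

    close(fd);
    return 0;
}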

Each call to write will write up to the number of bytes requested. If your process is interrupted by a signal, then you can end up with a partial write -- the actual number of bytes written is returned. Whether you get all of your bytes written or not, you'll have written one contiguous section of the file. You don't get the interleaving effect you mentioned as your second possibility (e.g. A1,B1,A2,B2,...).

If you only get a partial write, how you proceed is up to you. You can either continue writing (offset from the buffer start by the number of bytes previously written), or you can abandon the rest of your write. Only in this way could you potentially get the interleaving effect.

If it's important to have the contents of one write complete before the other process writes, then you should look into locking the file for exclusive write access (which both processes will have to check for) before attempting to write any data.


