Buffered Asynchronous File I/O on Linux


Unless you want to write your own IO thread pool, the glibc implementation is an acceptable solution. It actually works surprisingly well for something that runs entirely in userland.

The kernel implementation does not work with buffered I/O at all in my experience (though I've seen other people say the opposite!). That's fine if you want to read huge amounts of data via DMA, but of course it sucks big time if you plan to take advantage of the buffer cache.

Also note that the kernel AIO calls may actually block. There is a limited-size command buffer, and large reads are broken up into several smaller ones. Once the queue is full, asynchronous commands run synchronously. Surprise. I ran into this problem a year or two ago and could not find an explanation. Asking around gave me the "yeah of course, that's how it works" answer.

From what I've understood, the "official" interest in supporting buffered AIO is not terribly great either, even though several working solutions seem to have been available for years. Some of the arguments I've read were along the lines of "you don't want to use the buffers anyway", "nobody needs that", and "most people don't even use epoll yet". So, well... meh.

Being able to have epoll signalled by a completed async operation was another issue until recently, but in the meantime this works fine via eventfd.

Note that the glibc implementation will actually spawn threads on demand inside __aio_enqueue_request. It is probably no big deal, since spawning threads is not terribly expensive any more, but one should be aware of it. If your mental model of starting an asynchronous operation is "returns immediately", that assumption may not hold, because the call may be spawning some threads first.
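To make that concrete, here is a minimal sketch of the glibc POSIX AIO API in use (the file name is illustrative; link with -lrt on older systems). The aio_read() call below is exactly the point where __aio_enqueue_request may first spawn a helper thread:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int fd = open("data.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    /* "Returns immediately" -- except that glibc may first have to
       spawn a worker thread to service the request. */
    if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }

    /* Do other work here; for the sketch we just poll for completion. */
    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);

    ssize_t n = aio_return(&cb);
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}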

EDIT:

As a side note, under Windows there is a very similar situation to the one in the glibc AIO implementation, where the "returns immediately" assumption about queuing an asynchronous operation is not true.

If all the data you wanted to read is in the buffer cache, Windows will decide to run the request synchronously instead, because it will finish immediately anyway. This is well-documented, and admittedly it sounds great, too. Except that if there are a few megabytes to copy, or if another thread has page faults or does I/O concurrently (thus competing for the lock), "immediately" can be a surprisingly long time -- I've seen "immediate" times of 2-5 milliseconds. That's no problem in most situations, but under the constraint of a 16.66 ms frame time, for example, you probably don't want to risk blocking for 5 ms at random times. The naive assumption of "I can do async I/O from my render thread, no problem, because async doesn't block" is therefore flawed.

Is there really no asynchronous block I/O on Linux?

The real answer, which was indirectly pointed to by Peter Teoh, is based on io_setup() and io_submit().
Specifically, the "aio_" functions indicated by Peter are part of the glibc user-level emulation based on threads, which is not an efficient implementation.
The real answer is in:

io_submit(2)
io_setup(2)
io_cancel(2)
io_destroy(2)
io_getevents(2)

Note that the man page, dated 2012-08, says that this implementation has not yet matured to the point where it can replace the glibc user-space emulation:

http://man7.org/linux/man-pages/man7/aio.7.html

this implementation hasn't yet matured to the point where the POSIX
AIO implementation can be completely reimplemented using the kernel
system calls.

So, according to the latest kernel documentation I can find, Linux does not yet have a mature, kernel-based asynchronous I/O model. And even if I assume that the documented model is actually mature, it still doesn't support partial I/O in the sense of recv() vs. read().
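For reference, here is a minimal sketch of the io_setup()/io_submit()/io_getevents() lifecycle. glibc provides no wrappers for these syscalls, so this goes through syscall(2) directly (linking against libaio is the usual alternative); the file name is illustrative, and error handling is kept to a minimum:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    aio_context_t ctx = 0;
    if (syscall(SYS_io_setup, 128, &ctx) < 0) { perror("io_setup"); return 1; }

    /* Kernel AIO effectively wants O_DIRECT, which in turn requires
       aligned buffers, sizes, and offsets (see further below). */
    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) return 1;

    struct iocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes     = fd;
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_buf        = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes     = 4096;
    cb.aio_offset     = 0;

    struct iocb *cbs[1] = { &cb };
    if (syscall(SYS_io_submit, ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

    /* NULL timeout: blocks until at least one completion arrives. */
    struct io_event ev;
    if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) != 1) { perror("io_getevents"); return 1; }
    printf("read returned %lld bytes\n", (long long)ev.res);

    syscall(SYS_io_destroy, ctx);
    close(fd);
    free(buf);
    return 0;
}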

What's the difference between buffered synchronous I/O and asynchronous I/O?

My understanding of async I/O is that you're notified when it's done via an interrupt of some sort, so you can do more I/O at that point. With buffered I/O, you do it and forget about it; you never hear about that particular I/O again.

At least that's how it is with the huge intelligent disk arrays we deal with.

The idea of async I/O is that you begin the I/O and return to doing other stuff. Then, when the I/O is finished, you're notified and can do more I/O - in other words, you're not waiting around for it to finish.

Specifically for the synchronous read case: you request some input and then wait around while it's read from the device. Buffering there just involves reading more than you need so it's available on the next read without going out to the device to get it.

With async reads, you simply start the read and then go off and do something else while it happens. Whether by polling or via an interrupt, you later discover that the read has finished and the data is available for you to use.

For writes, I'm not sure I can see the advantage of one over the other. Buffered sync writes will return almost immediately unless the buffer is full (that's the only time when an async write may have an advantage).

Asynchronous I/O reading from a file

Several people have hinted at a good solution to this in their comments, but it's probably worth spelling it out in more detail. The full solution has quite a lot of details and is pretty complicated code, so I'm going to use pseudocode to explain what I'd recommend.

What you have here is really a variation on a classic producer/consumer problem: you have a single thing producing data, and many things trying to consume that data. In your case, it must be a "single thing" producing the data, because the length of each line of the source file is unknown: you can't just jump forward 'n' bytes and somehow be at the next IP. There can only be one actor at a time moving the read pointer toward the next unknown position of the \n, so by definition you have a single producer.

There are three general ways to attack this:

  • Solution A involves having each thread pull a little more out of a shared file buffer and kick off an asynchronous (nonblocking) read every time the last read completes. There are a whole host of headaches getting this solution right, as it's very sensitive to timing differences between the filesystem and the work being performed: if the file reads are slow, the workers will all stall waiting for the file. If the workers are slow, the file reader will either stall or fill up memory waiting for them to consume the data. This solution is likely the absolute fastest, but it's also incredibly difficult synchronization code to get right, with about a zillion caveats. Unless you're an expert in threading (or in extremely clever abuse of epoll_wait()), you probably don't want to go this route.

  • Solution B has a "master" thread, responsible for reading the file, and populating some kind of thread-safe queue with the data it reads, with one IP address (one string) per queue entry. Each of the worker threads just consumes queue entries as fast as it can, querying the remote server and then requesting another queue entry. This requires a little care to get right, but is generally a lot safer than Solution A, especially if you use somebody else's queue implementation.

  • Solution C is pretty hacktastic, but you shouldn't dismiss it out-of-hand, depending on what you're doing. This solution just involves using something like the Un*x sed command (see Get a range of lines from a file given the start and end line numbers) to slice your source file into a bunch of "chunky" source files in advance — say, twenty of them. Then you just run twenty copies of a really simple single-thread program in parallel using &, each on a different "slice" of file. Mushed together with a little shell script to automate it, this can be a "good enough" solution for a lot of needs.


Let's take a closer look at Solution B — a master thread with a thread-safe queue. I'm going to cheat and assume you can construct a working queue implementation (if not, there are StackOverflow articles on implementing a thread-safe queue using pthreads: pthread synchronized blocking queue).

In pseudocode, this solution is then something like this:

main()
{
    /* Create a queue. */
    queue = create_queue();

    /* Kick off the master thread to read the file, and give it the queue. */
    master_thread = pthread_create(master, queue);

    /* Kick off a bunch of workers with access to the queue. */
    for (i = 0; i < 20; i++) {
        worker_thread[i] = pthread_create(worker, queue);
    }

    /* Wait for everybody to finish. */
    pthread_join(master_thread);
    for (i = 0; i < 20; i++) {
        pthread_join(worker_thread[i]);
    }
}

void master(queue q)
{
    FILE *fp = fopen("ips.txt", "r");
    char buffer[BIGGER_THAN_ANY_IP];

    /* Inhale the file as fast as we can, and push each line we
       read onto the queue. */
    while (fgets(buffer, sizeof(buffer), fp) != NULL) {
        char *next_ip = strdup(buffer);
        enqueue(q, next_ip);
    }
    fclose(fp);

    /* Add some final messages to the queue to let the workers
       know that we're out of data. There are *much* better ways
       of notifying them that we're "done", but in this case,
       pushing a number of NULLs equal to the number of threads is
       simple and probably good enough. */
    for (i = 0; i < 20; i++) {
        enqueue(q, NULL);
    }
}

void worker(queue q)
{
    char *ip;

    /* Inhale messages off the queue as fast as we can until
       we get a NULL, which means that it's time to stop.
       The call to dequeue() *must* block if there's nothing
       in the queue; the call should only return NULL if the
       queue actually had NULL pushed into it. */
    while ((ip = dequeue(q)) != NULL) {

        /* Insert code to actually do the work here. */
        connect_and_send_and_receive_to(ip);
        free(ip);    /* was strdup()'d by the master */
    }
}

There are plenty of caveats and details in a real implementation (like: how do we implement the queue, ring buffers or a linked list? what if the text isn't all IPs? what if the char buffer isn't big enough? how many threads is enough? how do we deal with file or network errors? will malloc performance become a bottleneck? what if the queue gets too big? can we do better to overlap the network I/O?).

But, caveats and details aside, the pseudocode I presented above is a good enough starting point that you likely can expand it into a working solution.
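For completeness, here is a minimal sketch of the blocking queue the pseudocode assumes, built on a pthread mutex and condition variable. The types and names are illustrative, it's unbounded and linked-list based, and a real implementation would probably want a bounded queue to cap memory use:

#include <pthread.h>
#include <stdlib.h>

struct node { void *data; struct node *next; };

typedef struct {
    struct node *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} queue;

queue *create_queue(void)
{
    queue *q = calloc(1, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    return q;
}

void enqueue(queue *q, void *data)
{
    struct node *n = malloc(sizeof(*n));
    n->data = data;
    n->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = n; else q->head = n;
    q->tail = n;
    pthread_cond_signal(&q->nonempty);   /* wake one waiting consumer */
    pthread_mutex_unlock(&q->lock);
}

void *dequeue(queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)              /* block until data arrives */
        pthread_cond_wait(&q->nonempty, &q->lock);
    struct node *n = q->head;
    q->head = n->next;
    if (q->head == NULL) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    void *data = n->data;                /* NULL sentinel passes through */
    free(n);
    return data;
}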

Linux Disk File AIO

The tutorial gives an overview of asynchronous I/O in general and talks about how there is kernel support for it. Then it goes on to talk about POSIX AIO (the standardized API for accessing asynchronous I/O), implying that using the POSIX AIO API on Linux will give you access to the kernel support for AIO. This is not the case.

On Linux, there are really two separate AIO implementations:

  1. kernel AIO (io_submit() et al.), which is only supported in kernel 2.6 and later (really 2.5, and there may be versions back-ported to 2.4).
  2. POSIX AIO, which is a glibc feature essentially unrelated to the kernel. It implements the POSIX API in terms of user-level threads making blocking disk I/O calls.

So, in short, if you already have a generic implementation of multiple threads doing disk I/O, you might be better off using that rather than glibc's implementation (because you might have slightly more control over it).

If you're committed to actually use the io_submit() family of functions, you may have to do quite a lot of work to circumvent the restrictions on those functions.

Kernel AIO requires your files to be opened with O_DIRECT, which in turn requires all your file offsets and read and write sizes to be aligned to blocks on the disk. This is typically fine if you're just using one large file, and you can make it work much like the page cache in the OS. For reading and writing arbitrary files at arbitrary offsets and lengths, it gets messy.
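To illustrate those alignment rules, a small sketch (the file name is illustrative; 4096 bytes is used as a conservatively safe alignment, though the device's logical block size may be as small as 512):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big.file", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

    /* Buffer address must be block-aligned. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) return 1;

    /* Length (4096) and offset (8192) are both block multiples;
       an unaligned value here would make the read fail with EINVAL. */
    ssize_t n = pread(fd, buf, 4096, 8192);
    if (n < 0) perror("pread");
    else printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}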

If you end up giving kernel AIO a shot, I would highly recommend looking into tying one or more eventfds to your iocbs so that you can wait on completions using epoll/select rather than having to block in io_getevents().
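A hedged sketch of that eventfd technique, using the raw aio_abi.h interface (IOCB_FLAG_RESFD plus the aio_resfd field). The setup mirrors the earlier kernel-AIO example, names are illustrative, and error handling is trimmed for brevity:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    aio_context_t ctx = 0;
    if (syscall(SYS_io_setup, 8, &ctx) < 0) { perror("io_setup"); return 1; }

    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) return 1;

    int efd = eventfd(0, 0);

    struct iocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes     = fd;
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_buf        = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes     = 4096;
    cb.aio_flags      = IOCB_FLAG_RESFD;  /* signal efd on completion */
    cb.aio_resfd      = efd;

    struct iocb *cbs[1] = { &cb };
    if (syscall(SYS_io_submit, ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

    /* Wait for readiness via epoll instead of blocking in io_getevents(). */
    int ep = epoll_create1(0);
    struct epoll_event want = { .events = EPOLLIN, .data.fd = efd };
    epoll_ctl(ep, EPOLL_CTL_ADD, efd, &want);

    struct epoll_event got;
    epoll_wait(ep, &got, 1, -1);

    uint64_t ncomplete;
    read(efd, &ncomplete, sizeof(ncomplete));  /* count of finished iocbs */

    /* The eventfd said completions are pending, so this won't block. */
    struct io_event ev;
    struct timespec zero = { 0, 0 };
    syscall(SYS_io_getevents, ctx, 1, 1, &ev, &zero);
    printf("%llu completion(s), res = %lld\n",
           (unsigned long long)ncomplete, (long long)ev.res);
    return 0;
}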

Linux asynchronous I/O queue

When a set of I/O requests is submitted with io_submit, the system call returns immediately. From the point of view of the thread issuing the requests, the execution of the commands embedded in the requests is asynchronous. The thread will have to query the OS to learn the result, and is free to do what it wants in the meantime.

Now, if two threads each happen to submit a set of requests, they will both be in the same situation. They will both have to ask the OS about the progress of their respective I/O commands. Neither thread will be blocked.

From the AIO framework's point of view, it is entirely possible for the OS to actually execute the requests before returning from the io_submit call, for any or all of the threads invoking it, but the API remains the same: userland threads still treat the API as an asynchronous one, obtaining a token for a future result when they post their requests, and using that token later to get the real result.

In the specific case of Linux AIO, the token is the context created beforehand, and the result-check syscall is io_getevents, which reports an "event" (i.e., a result) for each completed request.

Regarding your example: is it possible that, during the second syscall, all the requests of the first thread get completed? I don't see a reason why this could never happen, but if both threads post 100 requests very close to each other, it seems very unlikely. A more likely scenario is that several of the first thread's requests have completed by the time the second thread makes its own call to io_submit, but in any case that call will not block.
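To make that token/result pattern concrete, here is a small hedged sketch of the non-blocking polling side (the helper name is mine, and it assumes a context set up and requests submitted as in the earlier sketches). With min_nr set to 0 and a zero timeout, io_getevents returns immediately with whatever has completed so far:

#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Harvest up to 'max' completions without blocking; returns how many
   were reaped (0 if nothing has finished yet). */
long poll_completions(aio_context_t ctx, struct io_event *events, long max)
{
    struct timespec zero = { 0, 0 };
    return syscall(SYS_io_getevents, ctx, /* min_nr = */ 0, max, events, &zero);
}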

Asynchronous I/O Linux

You want to avoid AIO on Linux for anything real, at least for now. From aio(7):

The current Linux POSIX AIO implementation is provided in userspace by glibc. This has a number of limitations, most notably that maintaining multiple threads to perform I/O operations is expensive and scales poorly. Work has been in progress for some time on a kernel state-machine-based implementation of asynchronous I/O (see io_submit(2), io_setup(2), io_cancel(2), io_destroy(2), io_getevents(2)), but this implementation hasn't yet matured to the point where the POSIX AIO implementation can be completely reimplemented using the kernel system calls.

Instead, look into non-blocking IO with select(2)/poll(2)/epoll(7).

Are aio_read, aio_write buffered by the kernel? On Linux, do they go through the page cache?

Yes. They go through the kernel page cache. The existence of aio_fsync attests to this.


