The State of Linux Async I/O

Is there really no asynchronous block I/O on Linux?

The real answer, which was indirectly pointed to by Peter Teoh, is based on io_setup() and io_submit().
Specifically, the "aio_" functions indicated by Peter are part of the glibc user-level emulation based on threads, which is not an efficient implementation.
The real answer is in:

io_submit(2)
io_setup(2)
io_cancel(2)
io_destroy(2)
io_getevents(2)

Note that the man page, dated 2012-08, says that this implementation has not yet matured to the point where it can replace the glibc user-space emulation:

http://man7.org/linux/man-pages/man7/aio.7.html

this implementation hasn't yet matured to the point where the POSIX
AIO implementation can be completely reimplemented using the kernel
system calls.

So, according to the latest kernel documentation I can find, Linux does not yet have a mature, kernel-based asynchronous I/O model. And, if I assume that the documented model is actually mature, it still doesn't support partial I/O in the sense of recv() vs read().
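
For concreteness, here is a minimal sketch of how those io_*(2) calls fit together. This is a hedged illustration, not production code: glibc provides no wrappers for these syscalls, so it goes through syscall(2) directly; the file name "data.bin" and the 4 KiB size are placeholders, and the kernel interface generally wants O_DIRECT (and therefore aligned buffers).

```c
/* Hedged sketch: one 4 KiB read submitted through the kernel AIO syscalls.
 * glibc does not wrap io_setup()/io_submit()/io_getevents(), so syscall(2)
 * is used directly. "data.bin" is a placeholder file name. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY | O_DIRECT);      /* kernel AIO wants O_DIRECT */
    if (fd < 0) { perror("open"); return 1; }

    void *buf = NULL;
    if (posix_memalign(&buf, 4096, 4096)) return 1;      /* O_DIRECT needs aligned buffers */

    aio_context_t ctx = 0;
    if (syscall(SYS_io_setup, 128, &ctx) < 0) { perror("io_setup"); return 1; }

    struct iocb cb = { 0 };
    cb.aio_lio_opcode = IOCB_CMD_PREAD;                   /* asynchronous pread() */
    cb.aio_fildes     = fd;
    cb.aio_buf        = (uintptr_t)buf;
    cb.aio_nbytes     = 4096;
    cb.aio_offset     = 0;

    struct iocb *cbs[1] = { &cb };
    if (syscall(SYS_io_submit, ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

    /* ... do other work here while the read is in flight ... */

    struct io_event ev;
    if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) == 1)
        printf("read completed, res=%lld\n", (long long)ev.res);

    syscall(SYS_io_destroy, ctx);
    close(fd);
    return 0;
}
```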

What actually happens in asynchronous IO

I do not understand this statement, because with sync IO (such as write()) it is the kernel that writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.

No. Most modern devices are able to transfer data to/from RAM by themselves (using DMA or bus mastering).

For example, the CPU might tell a disk controller "read 4 sectors into RAM at address 0x12345000" and then the CPU can do anything else it likes while the disk controller does the transfer (and will be interrupted by an IRQ from the disk controller when it has finished transferring the data).

However; for modern systems (where you can have any number of processes all wanting to use the same device at the same time) the device driver has to maintain a list of pending operations. In this case (under load); when the device generates an IRQ to say that it finished an operation the device driver responds by telling the device to start the next "pending operation". That way the device spends almost no time idle waiting to be asked to start the next operation (much better device utilization) and the CPU spends almost all of its time doing something else (between IRQs).

Of course often hardware is more advanced (e.g. having an internal queue of operations itself, so driver can tell it to do multiple things and it can start the next operation as soon as it finished the previous operation); and often drivers are more advanced (e.g. having "IO priorities" to ensure that more important stuff is done first rather than just having a simple FIFO queue of pending operations).
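
To make the "list of pending operations" idea concrete, here is a toy sketch of the pattern only. It is not any real Linux driver API (a real driver would need locking and IRQ-safe primitives); start_on_hardware() and complete() are placeholders for programming the controller and notifying the submitter.

```c
/* Toy sketch of the idea only -- not a real Linux driver API. The driver keeps
 * a FIFO of pending operations; the IRQ handler completes the current one and
 * immediately starts the next, so the device is almost never idle. */
#include <stdbool.h>
#include <stddef.h>

struct op { struct op *next; /* buffer, sector, callback, ... */ };

static struct op *queue_head, *queue_tail;   /* pending operations */
static bool device_busy;

static void start_on_hardware(struct op *o) { (void)o; /* program the controller, return immediately */ }
static void complete(struct op *o)          { (void)o; /* notify whoever submitted the operation */ }

void submit(struct op *o)                    /* called by upper layers */
{
    o->next = NULL;
    if (queue_tail) queue_tail->next = o; else queue_head = o;
    queue_tail = o;
    if (!device_busy) {                      /* device idle: start right away */
        device_busy = true;
        start_on_hardware(queue_head);
    }
}

void irq_handler(void)                       /* device says: operation finished */
{
    struct op *done = queue_head;
    queue_head = done->next;
    if (!queue_head) queue_tail = NULL;
    complete(done);
    if (queue_head)
        start_on_hardware(queue_head);       /* keep the device busy */
    else
        device_busy = false;
}
```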

Let's say I have an application whose only job is to get info and write it into files. Is there any benefit to using async IO instead of sync IO?

Let's say that you get info from deviceA (while the CPU and deviceB are idle); then process that info a little (while deviceA and deviceB are idle); then write the result to deviceB (while deviceA and the CPU are idle). You can see that most hardware is doing nothing most of the time (poor utilization).

With asynchronous IO; while deviceA is fetching the next piece of info the CPU can be processing the current piece of info while deviceB is writing the previous piece of info. Under ideal conditions (no speed mismatches) you can achieve 100% utilization (deviceA, CPU and deviceB are never idle); and even if there are speed mismatches (e.g. deviceB needs to wait for CPU to finish processing the current piece) the time anything spends idle will be minimized (and utilization maximized as much as possible).

The other alternative is to use multiple tasks - e.g. one task that fetches data from deviceA synchronously and notifies another task when the data has been read; a second task that waits until data arrives, processes it, and notifies another task when the data has been processed; then a third task that waits until data has been processed and writes it to deviceB synchronously. For utilization this is effectively identical to using asynchronous IO (in fact it can be considered "emulation of asynchronous IO"). The problem is that you've added a bunch of extra overhead managing and synchronizing multiple tasks (more RAM spent on state and stacks, task switches, lock contention, ...); and made the code more complex and harder to maintain.

Asynchronous I/O Linux

You want to avoid AIO on Linux for anything real, at least for now. From aio(7):

The current Linux POSIX AIO implementation is provided in userspace by glibc. This has a number of limitations, most notably that maintaining multiple threads to perform I/O operations is expensive and scales poorly. Work has been in progress for some time on a kernel state-machine-based implementation of asynchronous I/O (see io_submit(2), io_setup(2), io_cancel(2), io_destroy(2), io_getevents(2)), but this implementation hasn't yet matured to the point where the POSIX AIO implementation can be completely reimplemented using the kernel system calls.

Instead, look into non-blocking IO with select(2)/poll(2)/epoll(7).
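
For illustration, here is a minimal sketch of that readiness-based style: a non-blocking socket watched with epoll(7). It assumes sockfd is an already-connected TCP socket created elsewhere, and trims most error handling.

```c
/* Minimal sketch of a readiness-based event loop with epoll. The fd is made
 * non-blocking, so read() never blocks; epoll tells us when data is ready. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

void event_loop(int sockfd)
{
    fcntl(sockfd, F_SETFL, fcntl(sockfd, F_GETFL, 0) | O_NONBLOCK);

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sockfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);

    char buf[4096];
    for (;;) {
        struct epoll_event ready[16];
        int n = epoll_wait(epfd, ready, 16, -1);        /* block until something is readable */
        for (int i = 0; i < n; i++) {
            ssize_t r;
            /* drain the socket; read() returns EAGAIN instead of blocking */
            while ((r = read(ready[i].data.fd, buf, sizeof buf)) > 0)
                ;   /* ... process r bytes ... */
            if (r == 0) return;                          /* peer closed the connection */
            if (r < 0 && errno != EAGAIN) { perror("read"); return; }
        }
    }
}
```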

How to implement Async I/O efficiently in the kernel

Those "asynchronous" I/O stuffs are another illusion by KERNEL and Driver service. I will take an example of wifi driver. (which is network).

  1. RX

1) If packets are coming in, the wifi H/W will generate an interrupt and DMA the dot11 or dot3 frame to DRAM (which one depends on the wifi H/W; nowadays most modern wifi hardware converts the frames in HW - actually in FW running on the HW).

2) The wifi driver (running in the kernel) has to handle various wifi-related things, but most importantly it forms a socket buffer (skb) and hands the skbs up to the Linux network stack. Typically this happens in NET_RX_SOFTIRQ, or you can create your own thread.

3) Packets enter the Linux stack, from where they can be delivered to user space. This happens in "__netif_receive_skb_core"; if the packet is an "IP" packet, the first rx_handler is "ip_rcv()".

4) IP packets move up to the transport layer handler, which is udp_rcv() / tcp_rcv(). To deliver packets to the transport layer, you go through the socket layer and eventually build a linked list of packets (call it a queue, "Q") on the specific socket.

5) As far as I understand, this "Q" is the queue that supplies packets to user space. You can do "async" or "sync" I/O here.


  2. TX

1) Packets go through the kernel's transport layer and IP layer, and eventually your netdev TX handler gets called (hard_start_xmit or ndo_start_xmit). Basically, if your netdev (e.g. eth0 or wifi0) is an ethernet device, that callback is your ethernet driver's "TX" function; for wifi it is the wifi driver's "TX" function. The callback is typically set up when the driver comes up.

2) At this stage, your packets have already been transformed into "skb"s.

3) In the callback, the driver prepares all the headers and descriptors and sets up the DMA.

4) Once TX is done on the HW, the HW will generate an interrupt and you need to free the packet.

Here, my point is that your network I/O is already working "asynchronously" at the DMA and driver level. Most modern drivers have a separate context for this. For TX it would be a thread, a tasklet, or NET_TX_SOFTIRQ. For RX, if we are using "NAPI", it would be NET_RX_SOFTIRQ, though it can use a thread or tasklet too.

All of this happens independently, driven by interrupts or some other trigger.

"Synchronous I/O" is mostly simulated in the upper, application-facing layers. So, if you rewrite the socket layer in the kernel, you can do whatever you want, since the lower layers are already working asynchronously.

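As a small illustration of that point, here is a sketch of the same per-socket receive queue being consumed two ways. The DMA/driver work described above has already happened by the time either call runs; the only difference is whether recv(2) sleeps when the queue is empty.

```c
/* Sketch: one receive queue, consumed two ways. The driver/DMA side has
 * already filled the socket's queue asynchronously. */
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

void read_sync(int sockfd, char *buf, size_t len)
{
    /* "Synchronous": recv() sleeps until the kernel's per-socket queue has data. */
    ssize_t n = recv(sockfd, buf, len, 0);
    if (n > 0) printf("got %zd bytes\n", n);
}

void read_async_style(int sockfd, char *buf, size_t len)
{
    /* Non-blocking: return immediately if the queue is empty; come back later
       (e.g. when epoll reports EPOLLIN on this socket). */
    ssize_t n = recv(sockfd, buf, len, MSG_DONTWAIT);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("queue empty, try again later\n");
    else if (n > 0)
        printf("got %zd bytes\n", n);
}
```
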
Buffered asynchronous file I/O on Linux

Unless you want to write your own IO thread pool, the glibc implementation is an acceptable solution. It actually works surprisingly well for something that runs entirely in userland.

The kernel implementation does not work with buffered IO at all in my experience (though I've seen other people say the opposite!), which is fine if you want to read huge amounts of data via DMA, but of course it sucks big time if you plan to take advantage of the buffer cache.

Also note that the kernel AIO calls may actually block. There is a limited-size command buffer, and large reads are broken up into several smaller ones. Once the queue is full, asynchronous commands run synchronously. Surprise. I ran into this problem a year or two ago and could not find an explanation. Asking around gave me the "yeah of course, that's how it works" answer.

From what I've understood, the "official" interest in supporting buffered aio is not terribly great either, even though several working solutions seem to have been available for years. Some of the arguments I've read were along the lines of "you don't want to use the buffers anyway" and "nobody needs that" and "most people don't even use epoll yet". So, well... meh.

Getting epoll signalled by a completed async operation was another issue until recently, but in the meantime this works really fine via eventfd.
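
For illustration, the eventfd wiring looks roughly like this. It is only a sketch: it assumes a struct iocb prepared for the kernel io_submit(2) interface (as in the earlier example) and an existing epoll instance epfd, and it omits error handling.

```c
/* Sketch: ask the kernel to signal an eventfd when an AIO request completes,
 * and watch that eventfd with epoll. */
#include <linux/aio_abi.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

void arm_completion_fd(struct iocb *cb, int epfd)
{
    int efd = eventfd(0, EFD_NONBLOCK);

    cb->aio_flags = IOCB_FLAG_RESFD;     /* "signal this eventfd on completion" */
    cb->aio_resfd = efd;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

    /* After epoll reports EPOLLIN on efd: read(efd, ...) to clear the counter,
       then call io_getevents() to collect the completed requests. */
}
```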

Note that the glibc implementation will actually spawn threads on demand inside __aio_enqueue_request. That is probably no big deal, since spawning threads is not terribly expensive any more, but one should be aware of it. If your understanding of starting an asynchronous operation is "returns immediately", that assumption may not hold, because the call may be spawning some threads first.

EDIT:

As a side note, Windows has a very similar situation to the glibc AIO implementation, in which the "returns immediately" assumption about queuing an asynchronous operation does not hold.

If all the data you wanted to read is in the buffer cache, Windows will decide to run the request synchronously instead, because it will finish immediately anyway. This is well-documented, and admittedly sounds great, too. Except that when there are a few megabytes to copy, or when another thread has page faults or does IO concurrently (thus competing for the lock), "immediately" can be a surprisingly long time - I've seen "immediate" times of 2-5 milliseconds. That is no problem in most situations, but, for example, under the constraint of a 16.66 ms frame time you probably don't want to risk blocking for 5 ms at random times. The naive assumption of "I can do async IO from my render thread no problem, because async doesn't block" is therefore flawed.

Why boost::asio is asynchronous when its implementation is based on epoll (synchronous)

"synchronous" normally refers to an operation that does not return control back to the caller until it has completed.

epoll is synchronous in the sense that its operation (returning fds with pending completions/actions) is complete by the time it returns.

Reading from or writing to a socket, however, is still asynchronous in the sense that the read or write is not yet complete when the function call returns. The actual I/O work may be done asynchronously, and epoll will tell you when it's done. The work will be performed regardless of whether and when you call epoll; epoll is just the mechanism for signalling completions back to you, not the function that performs the work.

What exactly is io_uring?

io_uring is a (new as of mid-2019) Linux kernel interface that lets you send and receive data asynchronously and efficiently. It was originally designed to target block devices and files but has since gained the ability to work with things like network sockets.

Unlike something like epoll(), it is built around a completion model rather than a readiness model. This is desirable because other operating systems have used the completion model successfully for some time. io_uring provides something competitive and complete for Linux without the drawbacks the previous Linux AIO interface has.
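
To give a flavour of the completion model, here is a minimal sketch using the liburing helper library (link with -luring). Error handling is omitted, "data.bin" is a placeholder, and io_uring_prep_read() needs a reasonably recent kernel.

```c
/* Minimal sketch of io_uring's completion model via liburing:
 * queue a read, do other work, then collect the completion. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);                /* 8-entry submission queue */

    int fd = open("data.bin", O_RDONLY);
    char *buf = malloc(4096);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);       /* queue a 4 KiB read at offset 0 */
    io_uring_submit(&ring);                          /* hand it to the kernel */

    /* ... do other work; the read completes in the background ... */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                  /* wait for the completion */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```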

The author of io_uring has written a PDF document titled Efficient IO with io_uring which discusses its usage in a technical fashion. A gentler introduction is provided by the Lord of the io_uring guide. You can read ScyllaDB developer Glauber Costa proselytize it in How io_uring and eBPF Will Revolutionize Programming in Linux. Lastly, LWN.net has written about io_uring many times.

(Shameless plug: I've written a more linky answer on the "Is there really no asynchronous block I/O on Linux?" question)

How are asynchronous I/O methods processed

The real benefit of async/await in server applications (like WCF) is asynchronous I/O.

When you call a synchronous I/O method, the calling thread is blocked waiting for the I/O to complete. The thread cannot be used by other requests; it just waits for the result. When more requests arrive, the thread pool creates more threads to handle them, wasting a lot of resources - memory, and context switching when the waiting threads get unblocked.

If you use async IO, the thread is not blocked. After starting the asynchronous IO operation, it is returned to the thread pool and is available for other work. When the async operation finishes, the thread pool assigns a thread to continue processing the request. No resources are wasted.

From MSDN (it is about file I/O, but it applies to other kinds of I/O too):

In synchronous file I/O, a thread starts an I/O operation and immediately enters a wait state until the I/O request has completed. A thread performing asynchronous file I/O sends an I/O request to the kernel by calling an appropriate function. If the request is accepted by the kernel, the calling thread continues processing another job until the kernel signals to the thread that the I/O operation is complete. It then interrupts its current job and processes the data from the I/O operation as necessary.

Now you probably can see why await Task.Run() will not give any benefit if the IO in the task is done synchronously. A thread will get blocked anyway, just not the one that called the Task.Run().

You don't need to implement every method asynchronously to see improvement in performance (although it should become a habit to always perform I/O asynchronously).


