Linux Asynch IO - difference between aio.h and libaio.h

None of these are really intended for sockets.

The POSIX AIO interface creates threads that use normal blocking IO. They work with the buffer cache and should in principle even work with sockets (though I've admittedly not tried).

The Linux kernel AIO interface does not create threads to handle requests. It works exclusively in "no buffering" mode. Beware of non-obvious behaviour such as blocking when submitting requests in some situations, which you can neither foresee nor prevent (nor know about other than your program acting "weird").

What you want is nonblocking sockets (a nonblocking socket is "kind of asynchronous") and epoll to reduce the overhead of readiness notification to a minimum, and -- if you can figure out the almost non-existent documentation -- splice and vmsplice to reduce the IO overhead. Using splice/vmsplice you can directly DMA from disk to a kernel buffer and push to the network stack from there. Or, you can directly move pages from your application's address space to the kernel, and push to the network.

The downside is that the documentation is sparse (to say the least) and especially with TCP, some questions remain unaddressed, e.g. when it is safe to reclaim memory.
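
To make that concrete, here is a minimal sketch of the nonblocking-socket-plus-epoll part of that recipe, in C. Error handling is stripped, and handle_ready() is a hypothetical placeholder for whatever your application does when a socket becomes readable.

    /* nonblocking sockets + epoll: block only in epoll_wait(), never in read()/accept() */
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <fcntl.h>
    #include <unistd.h>

    void handle_ready(int fd);            /* application-specific; placeholder */

    static void set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    void event_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };

        set_nonblocking(listen_fd);
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            struct epoll_event events[64];
            int n = epoll_wait(epfd, events, 64, -1);   /* the only blocking call */

            for (int i = 0; i < n; i++) {
                if (events[i].data.fd == listen_fd) {
                    int client = accept(listen_fd, NULL, NULL);
                    set_nonblocking(client);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
                } else {
                    /* readable now; read() returns -1/EAGAIN once the socket is drained */
                    handle_ready(events[i].data.fd);
                }
            }
        }
    }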

Difference between POSIX AIO and libaio on Linux?

On Linux, the two AIO implementations are fundamentally different.

The POSIX AIO is a user-level implementation that performs normal blocking I/O in multiple threads, hence giving the illusion that the I/Os are asynchronous. The main reasons for doing it this way are that:

  1. it works with any filesystem
  2. it works (essentially) on any operating system (keep in mind that GNU's libc is portable)
  3. it works on files with buffering enabled (i.e. no O_DIRECT flag set)

The main drawback is that your queue depth (i.e. the number of outstanding operations you can have in practice) is limited by the number of threads you choose to have, which also means that a slow operation on one disk may block an operation going to a different disk. It also affects which I/Os (and how many) are seen by the kernel and the disk scheduler.
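
For reference, this is roughly what the POSIX AIO interface looks like from the caller's side. A minimal sketch, with a made-up file name, no error handling, and a busy-wait where real code would use aio_suspend(); link with -lrt on older glibc:

    /* POSIX AIO read: glibc services this with its own thread pool doing
     * ordinary blocking I/O behind the scenes. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);            /* hypothetical file */
        static char buf[4096];

        struct aiocb cb = {
            .aio_fildes = fd,
            .aio_buf    = buf,
            .aio_nbytes = sizeof(buf),
            .aio_offset = 0,
            .aio_sigevent.sigev_notify = SIGEV_NONE,    /* we poll instead of being notified */
        };

        aio_read(&cb);                                  /* returns immediately */

        while (aio_error(&cb) == EINPROGRESS)
            ;                                           /* real code would do work or aio_suspend() here */

        printf("read %zd bytes\n", aio_return(&cb));
        return 0;
    }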

The kernel AIO (i.e. io_submit() et al.) is kernel support for asynchronous I/O operations, where the I/O requests are actually queued up in the kernel and sorted by whatever disk scheduler you have; presumably some of them are forwarded (in a somewhat optimal order, one would hope) to the actual disk as asynchronous operations (using TCQ or NCQ). The main restriction with this approach is that not all filesystems work that well, or at all, with async I/O (and may fall back to blocking semantics), and files have to be opened with O_DIRECT, which comes with a whole lot of other restrictions on the I/O requests. If you fail to open your files with O_DIRECT, it may still "work", in the sense that you get the right data back, but it probably isn't done asynchronously; it falls back to blocking semantics instead.

Also keep in mind that io_submit() can actually block on the disk under certain circumstances.
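
For comparison, here is a minimal libaio sketch of that kernel path: the file is opened with O_DIRECT and the buffer is aligned as that requires. The file name and block size are assumptions; link with -laio, error handling omitted.

    /* kernel AIO via libaio: one O_DIRECT read, submitted and reaped asynchronously */
    #define _GNU_SOURCE                       /* for O_DIRECT */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK 4096                        /* assumed block size */

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY | O_DIRECT);   /* hypothetical file */

        void *buf;
        posix_memalign(&buf, BLOCK, BLOCK);   /* O_DIRECT needs aligned buffers */

        io_context_t ctx = 0;                 /* must be zeroed before io_setup() */
        io_setup(32, &ctx);                   /* allow up to 32 in-flight I/Os */

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, BLOCK, 0);
        io_submit(ctx, 1, cbs);               /* can itself block in the corner cases noted above */

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);   /* wait for the completion */
        printf("res = %ld\n", (long)ev.res);

        io_destroy(ctx);
        free(buf);
        return 0;
    }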

What is the status of POSIX asynchronous I/O (AIO)?

Network I/O is not a priority for AIO because everyone writing POSIX network servers uses an event based, non-blocking approach. The old-style Java "billions of blocking threads" approach sucks horribly.

Disk write I/O is already buffered, and disk read I/O can be prefetched into the buffer cache using functions like posix_fadvise. That leaves direct, unbuffered disk I/O as the only useful purpose for AIO.
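
For example, a read-ahead hint is a single call; fd, offset, and length here are placeholders:

    /* Ask the kernel to prefetch a region into the page cache so a later
     * read() finds it already buffered. */
    #include <fcntl.h>

    void prefetch_region(int fd, off_t offset, off_t length)
    {
        posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
    }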

Direct, unbuffered I/O is only really useful for transactional databases, and those tend to write their own threads or processes to manage their disk I/O.

So, in the end, that leaves POSIX AIO in the position of not serving any useful purpose. Don't use it.

How does the Linux kernel handle Asynchronous I/O (AIO) requests?

Short answer:
Most likely the AIO implementation is "faster" because it submits multiple IOs in parallel, while the synchronous implementation has either zero or one I/O in flight. It has nothing to do with writing to memory or with the kernel I/O path having additional overhead for synchronous I/Os.

You can check this using iostat -x -d 1. Look at the avgqu-sz (average queue size = the average number of in-flight I/Os) and %util (utilization = the percentage of the time the device had at least one I/O issued to it).

Long answer:

  • The concept of "faster" is tricky when talking about I/O. Does "faster" mean higher bandwidth? Or is it lower latency? Or bandwidth at a given request size? Or latency at a given queue depth? Or a combination of latency, bandwidth, request size, queue depth, and the many other parameters of the workload? I assume here that you are talking about throughput/bandwidth; however, it is good to remember that the performance of a storage device is not a single-dimensional metric.

  • SSDs are highly parallel devices. An SSD is composed of many flash chips, each chip having multiple dies that can read/write independently. SSDs take advantage of this and perform many I/Os in parallel, without a noticeable increase in response time. Therefore, in terms of throughput, it matters a lot how many concurrent I/Os the SSD sees.

  • Let's understand what happens when a thread submits a synchronous I/O: a) the thread spends some CPU cycles preparing the I/O request (generate data, compute offset, copy data into buffer, etc.), b) the system call is performed (e.g. pread()), execution passes to kernel space, and the thread blocks, c) the I/O request is processed by the kernel & traverses the various kernel I/O layers, d) the I/O request is submitted to the device and traverses the interconnect (e.g. PCIe), e) the I/O request is processed by the SSD firmware, f) the actual read command is sent to the appropriate flash chip, g) the SSD controller waits for the data, h) the SSD controller gets the data from the flash chip and sends it through the interconnect. At this point the data leaves the SSD and stages e) through a) happen in reverse.

  • As you can see, the synchronous I/O process is playing request ping-pong with the SSD. During many of the stages described above no data is actually read from the flash chips. On top of this, although your SSD can process tens to hundreds of requests in parallel, it sees at most one request at any given moment of time. Therefore, throughput is very, very low because you are actually not really using the SSD.

  • Asynchronous I/O helps in two ways: a) it allows the process to submit multiple I/O requests in parallel (the SSD has enough work to keep busy), and b) it allows pipelining I/Os through the various processing stages (therefore decoupling stage latency from throughput).

  • The reason why you see asynchronous I/O being faster than synchronous I/O is because you compare apples and oranges. The synchronous throughput is at a given request size, low queue depth, and without pipelining. The asynchronous throughput is at a different request size, higher queue depth, and with pipelining. The numbers you saw are not comparable.

  • The majority of I/O intensive applications (i.e. most applications such as databases, webservers, etc.) have many threads that perform synchronous I/O. Although each thread can submit at most one I/O at any given moment in time, the kernel & the SSD device see many I/O requests that can be served in parallel. Multiple sync I/O requests result in the same benefits as multiple async I/O requests.

    The main differences between asynchronous and synchronous I/O come down to how I/O and process scheduling are handled, and to the programming model. Both async & sync I/O can squeeze the same IOPS/throughput from a storage device if done right.

Linux kernel AIO functionality

First of all, good job using libaio instead of POSIX aio.

Are there any restrictions on the usage of O_DIRECT?

I'm not 100% sure this is the real problem, but O_DIRECT has some requirements (quoting mostly from TLPI):

  • The data buffer being transferred must be aligned on a memory boundary that is a multiple of the block size (use posix_memalign)
  • The file or device offset at which data transfer commences must be a multiple of the block size
  • The length of the data to be transferred must be a multiple of the block size

At a glance, I can see you are not taking any precautions to align memory in allocate_2D_matrix.
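
Something along these lines satisfies the constraints above; 4096 is an assumed block size, and real code should query the device's logical block size (e.g. via the BLKSSZGET ioctl) rather than hard-code it:

    /* Allocate a buffer usable with O_DIRECT: aligned start, block-multiple length. */
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK_SIZE 4096                   /* assumption; query the device instead */

    static void *alloc_dio_buffer(size_t nbytes)
    {
        /* round the length up to a multiple of the block size, too */
        size_t len = (nbytes + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;

        void *buf = NULL;
        if (posix_memalign(&buf, BLOCK_SIZE, len) != 0)
            return NULL;
        memset(buf, 0, len);
        return buf;
    }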

If I do not open the file using O_DIRECT, things work fine, but it
beats the purpose of having async writes.

That happens not to be the case: asynchronous I/O works fine without O_DIRECT (for instance, think of the number of system calls it slashes).

Is there really no asynchronous block I/O on Linux?

The real answer, which was indirectly pointed to by Peter Teoh, is based on io_setup() and io_submit().
Specifically, the "aio_" functions indicated by Peter are part of the glibc user-level emulation based on threads, which is not an efficient implementation.
The real answer is in:

io_submit(2)
io_setup(2)
io_cancel(2)
io_destroy(2)
io_getevents(2)

Note that the man page, dated 2012-08, says that this implementation has not yet matured to the point where it can replace the glibc user-space emulation:

http://man7.org/linux/man-pages/man7/aio.7.html

this implementation hasn't yet matured to the point where the POSIX
AIO implementation can be completely reimplemented using the kernel
system calls.

So, according to the latest kernel documentation I can find, Linux does not yet have a mature, kernel-based asynchronous I/O model. And even if I assume that the documented model is actually mature, it still doesn't support partial I/O in the sense of recv() vs. read().
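
If you do want to experiment with those io_* calls, note that glibc provides no wrappers for them; they are normally reached through syscall(2), either via libaio or with small helpers like the hypothetical sys_* functions sketched below.

    /* Raw kernel AIO syscalls: the types come from <linux/aio_abi.h>,
     * the wrappers are just thin local helpers around syscall(2). */
    #include <linux/aio_abi.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    static inline long sys_io_setup(unsigned nr_events, aio_context_t *ctx)
    {
        return syscall(SYS_io_setup, nr_events, ctx);
    }

    static inline long sys_io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp)
    {
        return syscall(SYS_io_submit, ctx, nr, iocbpp);
    }

    static inline long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                                        struct io_event *events,
                                        struct timespec *timeout)
    {
        return syscall(SYS_io_getevents, ctx, min_nr, nr, events, timeout);
    }

    static inline long sys_io_destroy(aio_context_t ctx)
    {
        return syscall(SYS_io_destroy, ctx);
    }

The iocb is then filled with IOCB_CMD_PREAD / IOCB_CMD_PWRITE and a pointer to your (aligned, O_DIRECT-friendly) buffer, as described in io_submit(2); libaio is essentially a thin convenience layer over these calls.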


