Linux Kernel AIO, open() System Call

Linux kernel AIO functionality

First of all, good job using libaio instead of POSIX aio.

Are there any restrictions on the usage of O_DIRECT?

I'm not 100% sure this is the real problem, but O_DIRECT has some requirements (quoting mostly from TLPI):

  • The data buffer being transferred must be aligned on a memory boundary that is a multiple of the block size (use posix_memalign)
  • The file or device offset at which data transfer commences must be a multiple of the block size
  • The length of the data to be transferred must be a multiple of the block size

At a glance, I can see you are not taking any precautions to align memory in allocate_2D_matrix.
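
For illustration, here is a minimal sketch of a block-aligned allocation that would satisfy those requirements (the 4096-byte block size and the allocate_aligned_matrix name are assumptions, not taken from the question; query the device's real logical block size, e.g. with the BLKSSZGET ioctl, rather than hard-coding it):

    #include <stdlib.h>

    /* Hypothetical replacement for allocate_2D_matrix: one contiguous,
     * block-aligned buffer plus an array of row pointers into it.
     * 4096 is an assumed block size, not a value from the question. */
    #define BLOCK_SIZE 4096

    static double **allocate_aligned_matrix(size_t rows, size_t cols)
    {
        void *data;
        size_t bytes = rows * cols * sizeof(double);

        /* Round up so the transfer length can also be a multiple of the
         * block size, as O_DIRECT requires. */
        bytes = (bytes + BLOCK_SIZE - 1) & ~((size_t)BLOCK_SIZE - 1);

        if (posix_memalign(&data, BLOCK_SIZE, bytes) != 0)
            return NULL;

        double **matrix = malloc(rows * sizeof(double *));
        if (!matrix) {
            free(data);
            return NULL;
        }
        for (size_t i = 0; i < rows; i++)
            matrix[i] = (double *)data + i * cols;

        return matrix; /* release with free(matrix[0]); free(matrix); */
    }

Handing matrix[0] (the aligned backing buffer) to the I/O call keeps the first requirement satisfied; the file offset and transfer length still have to be block-size multiples as well.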

If I do not open the file using O_DIRECT, things work fine, but it defeats the purpose of having async writes.

This happens not to be the case: asynchronous I/O works fine without O_DIRECT (for instance, think of the number of system calls that get slashed by batching submissions).
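
As a rough sketch of that batching with libaio (the file name, request size, and request count are made up for the example; link with -laio), eight buffered writes are queued with a single io_submit() call and reaped with a single io_getevents() call:

    #include <libaio.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NR_REQUESTS 8
    #define CHUNK_SIZE  (64 * 1024)

    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb iocbs[NR_REQUESTS], *iocbps[NR_REQUESTS];
        struct io_event events[NR_REQUESTS];

        int fd = open("out.dat", O_WRONLY | O_CREAT, 0644); /* no O_DIRECT */
        if (fd < 0 || io_setup(NR_REQUESTS, &ctx) < 0)
            return 1;

        for (int i = 0; i < NR_REQUESTS; i++) {
            void *buf = malloc(CHUNK_SIZE);  /* no alignment constraints here */
            memset(buf, 'x', CHUNK_SIZE);
            io_prep_pwrite(&iocbs[i], fd, buf, CHUNK_SIZE,
                           (long long)i * CHUNK_SIZE);
            iocbps[i] = &iocbs[i];
        }

        /* One system call queues all eight writes... */
        if (io_submit(ctx, NR_REQUESTS, iocbps) != NR_REQUESTS)
            return 1;

        /* ...and one more collects all eight completions. */
        if (io_getevents(ctx, NR_REQUESTS, NR_REQUESTS, events, NULL) != NR_REQUESTS)
            return 1;

        for (int i = 0; i < NR_REQUESTS; i++)
            free(iocbs[i].u.c.buf);
        io_destroy(ctx);
        close(fd);
        return 0;
    }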

How does the Linux kernel handle Asynchronous I/O (AIO) requests?

Short answer:
Most likely the AIO implementation is "faster" because it submits multiple IOs in parallel, while the synchronous implementation has either zero or one I/O in flight. It has nothing to do with writing to memory or with the kernel I/O path having additional overhead for synchronous I/Os.

You can check this using iostat -x -d 1. Look at the avgqu-sz (average queue size = the average number of in-flight I/Os) and %util (utilization = the percentage of the time the device had at least one I/O issued to it).

Long answer:

  • The concept of "faster" is tricky when talking about I/O. Does "faster" mean higher bandwidth? Or is it lower latency? Or bandwidth at a given request size? Or latency at a given queue depth? Or a combination of latency, bandwidth, request size, queue depth, and the many other parameters of the workload? I assume here that you are talking about throughput/bandwidth; however, it is good to remember that the performance of a storage device is not a single-dimensional metric.

  • SSDs are highly parallel devices. An SSD is composed of many flash chips, each chip having multiple dies that can read/write independently. SSDs take advantage of this and perform many I/Os in parallel, without a noticeable increase in response time. Therefore, in terms of throughput, it matters a lot how many concurrent I/Os the SSD sees.

  • Let's understand what happens when a thread submits a synchronous I/O: a) the thread spends some CPU cycles preparing the I/O request (generate data, compute offset, copy data into buffer, etc.), b) the system call is performed (e.g. pread()), execution passes to kernel space, and the thread blocks, c) the I/O request is processed by the kernel & traverses the various kernel I/O layers, d) the I/O request is submitted to the device and traverses the interconnect (e.g. PCIe), e) the I/O request is processed by the SSD firmware, f) the actual read command is sent to the appropriate flash chip, g) the SSD controller waits for the data, h) the SSD controller gets the data from the flash chip and sends it through the interconnect. At this point the data leaves the SSD and stages e) through a) happen in reverse.

  • As you can see, the synchronous I/O process is playing request ping-pong with the SSD. During many of the stages described above no data is actually read from the flash chips. On top of this, although your SSD can process tens to hundreds of requests in parallel, it sees at most one request at any given moment in time. Therefore, throughput is very, very low because you are not really using the SSD.

  • Asynchronous I/O helps in two ways: a) it allows the process to submit multiple I/O requests in parallel (the SSD has enough work to keep busy), and b) it allows pipelining I/Os through the various processing stages (therefore decoupling stage latency from throughput). A sketch contrasting the two patterns follows this list.

  • The reason why you see asynchronous I/O being faster than synchronous I/O is because you compare apples and oranges. The synchronous throughput is at a given request size, low queue depth, and without pipelining. The asynchronous throughput is at a different request size, higher queue depth, and with pipelining. The numbers you saw are not comparable.

  • The majority of I/O intensive applications (i.e. most applications, such as databases, web servers, etc.) have many threads that perform synchronous I/O. Although each thread can submit at most one I/O at any given moment in time, the kernel & the SSD device see many I/O requests that can be served in parallel. Multiple sync I/O requests result in the same benefits as multiple async I/O requests.

    The main differences between asynchronous and synchronous I/O come down to how I/O and process scheduling interact and to the programming model. Both async & sync I/O can squeeze the same IOPS/throughput from a storage device if done right.
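
To make the comparison concrete, here is a hedged sketch of the two access patterns; the test file name, 4 KiB request size, and queue depth of 32 are assumptions, not values from the question (link with -laio). The synchronous loop never has more than one request in flight, while the libaio loop keeps 32 requests submitted before it waits:

    #define _GNU_SOURCE        /* for O_DIRECT */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define REQ_SIZE 4096
    #define QDEPTH   32

    /* Synchronous pattern: at most one I/O in flight, so the device sits
     * idle during every stage that is not the actual flash access. */
    static void sync_reads(int fd, int count)
    {
        void *buf;
        if (posix_memalign(&buf, REQ_SIZE, REQ_SIZE) != 0)
            return;
        for (int i = 0; i < count; i++)
            pread(fd, buf, REQ_SIZE, (off_t)i * REQ_SIZE);  /* blocks every time */
        free(buf);
    }

    /* Asynchronous pattern: QDEPTH requests in flight at once, so the
     * SSD's internal parallelism is actually exercised. */
    static void async_reads(int fd, int count)
    {
        io_context_t ctx = 0;
        struct iocb iocbs[QDEPTH], *ios[QDEPTH];
        struct io_event events[QDEPTH];

        if (io_setup(QDEPTH, &ctx) < 0)
            return;
        for (int done = 0; done < count; done += QDEPTH) {
            for (int i = 0; i < QDEPTH; i++) {
                void *buf;
                if (posix_memalign(&buf, REQ_SIZE, REQ_SIZE) != 0)
                    return;
                io_prep_pread(&iocbs[i], fd, buf, REQ_SIZE,
                              (long long)(done + i) * REQ_SIZE);
                ios[i] = &iocbs[i];
            }
            io_submit(ctx, QDEPTH, ios);                     /* one syscall, 32 I/Os */
            io_getevents(ctx, QDEPTH, QDEPTH, events, NULL); /* wait for the batch */
            for (int i = 0; i < QDEPTH; i++)
                free(iocbs[i].u.c.buf);
        }
        io_destroy(ctx);
    }

    int main(void)
    {
        /* "testfile" is a placeholder; it must exist and be large enough. */
        int fd = open("testfile", O_RDONLY | O_DIRECT);
        if (fd < 0)
            return 1;
        sync_reads(fd, 1024);
        async_reads(fd, 1024);
        close(fd);
        return 0;
    }

Running iostat -x -d 1 while each loop executes should make the difference visible in exactly the avgqu-sz and %util columns mentioned in the short answer.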

Will the open() system call block on a remote filesystem?

Yes. How long depends on the speed (and state) of the uplink, but your process or thread will block until the remote operation finishes. NFS is a bit notorious for this, and some FUSE file systems handle the blocking for whatever has the file handle, but you will block on open(), read() and write(), often at the mercy of the network and the other system.

Don't use O_NONBLOCK to get around it, or you're potentially reading from or writing to a black hole (which would just block anyway).

Kernel Oops when calling aio_complete from a workqueue

In your async_read_work function, you call kfree(a_work);, set a_work = NULL;, and then dereference a_work here: a_work->a_iocb->ki_complete(a_work->a_iocb, 100, 0);. That is a use-after-free, and once a_work has been set to NULL it is a NULL pointer dereference, which is what triggers the oops.

Possible solutions:

  1. Rearrange the code:

        a_work->a_iocb->ki_complete(a_work->a_iocb, 100, 0);
        kfree(a_work);
        a_work = NULL; /* not really necessary */
  2. Use a local variable:

        struct kiocb *iocb;
        ...
        iocb = a_work->a_iocb;
        kfree(a_work);
        a_work = NULL; /* not really necessary */
        iocb->ki_complete(iocb, 100, 0);

AIO support on Linux

AIO support has been included in the Linux kernel proper. That's why the first hit on Google only offers patches for the 2.4 kernel: in 2.6 and 3.0 it is already built in.

If you check out the Linux kernel source code, the implementation is in fs/aio.c.

There's some documentation in the GNU libc manual, but be advised that AIO is not possible for all types of Linux file descriptors. Most of the general how-to documentation is dated around 2006, which is appropriate since that's when AIO in Linux was making headlines.

Note that the POSIX.1b and Unix98 standards haven't changed, so could you be a bit more specific about how the examples are out of date?


