Linux async (io_submit) write vs. normal (buffered) write

Copying a buffer into the kernel is not necessarily instantaneous.

First the kernel needs to find a free page. If there is none (which is fairly likely under heavy disk-write pressure), it has to evict one. If it decides to evict a dirty page (instead of, say, one of your process's clean pages), it will have to actually write that page out before it can reuse it.

There's a related issue on Linux when saturating writes to a slow drive: the page cache fills up with dirty pages backed by the slow device. Whenever the kernel needs a page, for any reason, it can take a long time to acquire one, and the whole system freezes as a result.

The size of each individual write is less relevant than the write pressure of the system. If you have a million small writes already queued up, this may be the one that has to block.

Whether the allocation lives on the stack or the heap is also less relevant. If you want efficient allocation of blocks to write, you can use a dedicated pool allocator (carved from the heap) and not pay for the general-purpose heap allocator; a minimal sketch follows.
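
As a hedged illustration (not from the original answer), here is a minimal fixed-size block pool in C: one up-front aligned allocation carved into blocks, with a free list giving O(1) get/put. The names (block_pool, pool_init, pool_get, pool_put) and the sizes are placeholders.

    #include <stdlib.h>

    /* Minimal fixed-size block pool: one large allocation, O(1) get/put. */
    #define POOL_BLOCKS 1024
    #define BLOCK_SIZE  4096

    struct block_pool {
        char *slab;        /* one large allocation carved into blocks */
        void *free_list;   /* singly linked list threaded through the free blocks */
    };

    static int pool_init(struct block_pool *p) {
        if (posix_memalign((void **)&p->slab, 4096,
                           (size_t)POOL_BLOCKS * BLOCK_SIZE))
            return -1;
        p->free_list = NULL;
        for (int i = 0; i < POOL_BLOCKS; i++) {
            void *blk = p->slab + (size_t)i * BLOCK_SIZE;
            *(void **)blk = p->free_list;   /* link block into the free list */
            p->free_list = blk;
        }
        return 0;
    }

    static void *pool_get(struct block_pool *p) {
        void *blk = p->free_list;
        if (blk)
            p->free_list = *(void **)blk;
        return blk;                          /* NULL when the pool is exhausted */
    }

    static void pool_put(struct block_pool *p, void *blk) {
        *(void **)blk = p->free_list;
        p->free_list = blk;
    }

Because the slab is 4096-byte aligned and each block is 4096 bytes, every block handed out is also suitably aligned for direct I/O.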

aio_write() gets around this by not copying the buffer into the kernel at all; given the alignment requirements, the data may even be DMA'd straight out of your buffer, which means you're likely to save a copy as well.
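
For reference, a minimal POSIX aio_write() call looks roughly like this (the file name and sizes are my placeholders; link with -lrt on older glibc). As discussed later on this page, glibc implements these aio_* functions with user-space threads rather than kernel AIO.

    #include <aio.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("out.bin", O_WRONLY | O_CREAT, 0644);
        static char buf[4096];
        memset(buf, 'x', sizeof buf);

        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        aio_write(&cb);                        /* returns immediately; I/O proceeds in the background */

        const struct aiocb *const list[1] = { &cb };
        aio_suspend(list, 1, NULL);            /* block until the request completes */
        ssize_t n = aio_return(&cb);           /* bytes written, or -1 on error */

        close(fd);
        return n == (ssize_t)sizeof buf ? 0 : 1;
    }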

Using Linux AIO, able to do I/O but also writing garbage into the file

There are a few things going on here. First up, the alignment requirement that you mentioned is either 512 bytes or 4096 bytes, depending on your underlying device. Try 512 bytes to start. It applies to the following (a small alignment-check helper is sketched after this list):

  1. The offset that you're writing in the file must be a multiple of 512 bytes. It can be 0, 512, 1024, etc. You can write at offset 0 like you're doing here, but you can't write at offset 100.

  2. The length of data that you're writing to the file must be a multiple of 512 bytes. Again, you can write 512 bytes, 1024 bytes, or 2048 bytes, and so on - any multiple of 512. You can't write 100 bytes like you're trying to do here.

  3. The address of the memory that contains the data you're writing must be a multiple of 512. (I typically use 4096, to be safe.) Here, you'll need to be able to do someBuffer % 512 and get 0. (With the code the way it is, it most likely won't be.)
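
A small helper that checks all three constraints before submitting; the 512-byte figure and the function name are assumptions, so substitute your device's actual logical block size:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    /* Returns 1 if (buf, len, offset) satisfy the usual O_DIRECT / io_submit
     * alignment rules for a 512-byte logical block size, 0 otherwise. */
    static int is_dio_aligned(const void *buf, size_t len, off_t offset) {
        const size_t align = 512;               /* assumption: 512-byte sectors */
        return ((uintptr_t)buf % align == 0) &&
               (len % align == 0) &&
               ((size_t)offset % align == 0);
    }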

In my experience, failing to meet any of the above requirements doesn't actually give you an error back! Instead, it'll complete the I/O request using normal, regular old blocking I/O.

Unaligned I/O: If you really, really need to write a smaller amount of data or write at an unaligned offset, then things get tricky even above and beyond the io_submit interface. You'll need to do an aligned read to cover the range of data that you need to write, then modify the data in memory and write the aligned region back to disk.

For example, say you wanted to modify offsets 768 through 1023 on the disk. You'd need to read 512 bytes at offset 512 into a buffer. Then, memcpy() the 256 bytes you want to write into that buffer at offset 256 (768 - 512 = 256). Finally, you issue a write of the 512-byte buffer at offset 512.
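
A sketch of that read-modify-write sequence using pread()/pwrite(); it assumes fd was opened with O_DIRECT, and most error handling is omitted:

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Modify bytes 768..1023 of the file: read the aligned 512-byte sector that
     * contains them, patch it in memory, and write the whole sector back. */
    static int patch_768_to_1023(int fd, const char *newdata /* 256 bytes */) {
        void *sector;
        if (posix_memalign(&sector, 512, 512))        /* aligned bounce buffer */
            return -1;

        if (pread(fd, sector, 512, 512) != 512)       /* covers offsets 512..1023 */
            { free(sector); return -1; }

        memcpy((char *)sector + 256, newdata, 256);   /* 768 - 512 = 256 into the buffer */

        if (pwrite(fd, sector, 512, 512) != 512)      /* write the aligned sector back */
            { free(sector); return -1; }

        free(sector);
        return 0;
    }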

Uninitialized Data: As others have pointed out, you haven't fully initialized the buffer that you're writing. Use memset() to initialize it to zero to avoid writing junk.

Allocating an Aligned Pointer: To meet the pointer requirements for the data buffer, you'll need to use posix_memalign(). For example, to allocate 4096 bytes with a 512 byte alignment restriction: posix_memalign(&ptr, 512, 4096);
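
Putting the last two points together, a hedged allocation helper might look like this (the function name is mine):

    #include <stdlib.h>
    #include <string.h>

    /* Allocate a zeroed, 512-byte-aligned buffer suitable for O_DIRECT I/O. */
    static void *alloc_dio_buffer(size_t len) {
        void *buf = NULL;
        if (posix_memalign(&buf, 512, len) != 0)   /* 4096 alignment is even safer */
            return NULL;
        memset(buf, 0, len);                       /* avoid writing uninitialized junk */
        return buf;
    }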

Lastly, consider whether you need to do this at all. Even in the best of cases, io_submit still "blocks", albeit at the 10 to 100 microsecond level. Normal blocking I/O with pread and pwrite offers a great many benefits to your application. And, if it becomes onerous, you can relegate it to another thread. If you've got a latency-sensitive app, you'll need to do io_submit in another thread anyway!
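
A minimal sketch of "relegate it to another thread" using plain pwrite() and pthreads; the structure and names are my own, and a real application would use a work queue rather than one thread per write:

    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct write_job {
        int         fd;
        const void *buf;
        size_t      len;
        off_t       offset;
        ssize_t     result;
    };

    /* Worker: do one blocking pwrite so the latency-sensitive thread never waits on disk. */
    static void *write_worker(void *arg) {
        struct write_job *job = arg;
        job->result = pwrite(job->fd, job->buf, job->len, job->offset);
        return NULL;
    }

    /* Usage sketch:
     *   struct write_job job = { fd, buf, len, 0, 0 };
     *   pthread_t t;
     *   pthread_create(&t, NULL, write_worker, &job);
     *   ... keep doing latency-sensitive work ...
     *   pthread_join(&t, NULL);   // or hand jobs over via a queue + condition variable
     */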

Is there really no asynchronous block I/O on Linux?

The real answer, which was indirectly pointed to by Peter Teoh, is based on io_setup() and io_submit().
Specifically, the "aio_" functions indicated by Peter are part of the glibc user-level emulation based on threads, which is not an efficient implementation.
The real answer is in:

io_submit(2)
io_setup(2)
io_cancel(2)
io_destroy(2)
io_getevents(2)
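
A minimal end-to-end write built on those system calls, shown here through the libaio wrappers (compile with -laio; the file name, sizes, and omitted error handling are simplifying assumptions):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("out.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        void *buf;
        posix_memalign(&buf, 4096, 4096);       /* aligned buffer required by O_DIRECT */
        memset(buf, 0, 4096);

        io_context_t ctx = 0;
        io_setup(32, &ctx);                     /* io_setup(2): create an AIO context */

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pwrite(&cb, fd, buf, 4096, 0);  /* build one write request at offset 0 */
        io_submit(ctx, 1, cbs);                 /* io_submit(2): queue the request */

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);     /* io_getevents(2): wait for completion */
        /* ev.res is the byte count (or negative errno) for the completed request */

        io_destroy(ctx);                        /* io_destroy(2): tear down the context */
        close(fd);
        free(buf);
        return 0;
    }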

Note that the man page, dated 2012-08, says that this implementation has not yet matured to the point where it can replace the glibc user-space emulation (http://man7.org/linux/man-pages/man7/aio.7.html):

"this implementation hasn't yet matured to the point where the POSIX AIO implementation can be completely reimplemented using the kernel system calls."

So, according to the latest kernel documentation I can find, Linux does not yet have a mature, kernel-based asynchronous I/O model. And, if I assume that the documented model is actually mature, it still doesn't support partial I/O in the sense of recv() vs read().

Linux kernel AIO functionality

First of all, good job using libaio instead of POSIX aio.

Are there any restrictions on the usage of O_DIRECT?

I'm not 100% sure this is the real problem, but O_DIRECT has some requirements (quoting mostly from TLPI):

  • The data buffer being transferred must be aligned on a memory boundary that is a multiple of the block size (use posix_memalign)
  • The file or device offset at which data transfer commences must be a multiple of the block size
  • The length of the data to be transferred must be a multiple of the block size

At a glance, I can see you are not taking any precautions to align memory in allocate_2D_matrix.
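
As a hedged sketch of those precautions: open with O_DIRECT, query the logical block size (BLKSSZGET works for block devices; for regular files, 512 or 4096 are the usual safe guesses), and align the buffer, offset, and length to it. The device path is a placeholder and error handling is omitted.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/sdX", O_RDWR | O_DIRECT);   /* placeholder device */
        int blksz = 512;
        ioctl(fd, BLKSSZGET, &blksz);                   /* logical block size of the device */

        void *buf;
        posix_memalign(&buf, blksz, blksz);             /* buffer aligned to the block size */
        /* ... offsets and lengths passed to pread/pwrite/io_submit must also be
         *     multiples of blksz ... */
        free(buf);
        close(fd);
        return 0;
    }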

If I do not open the file using O_DIRECT, things work fine, but it defeats the purpose of having async writes.

This happens not to be the case. Asynchronous I/O works well without O_DIRECT (think, for instance, of the number of system calls it slashes).

Resources associated with an aio_context

  • You can mix requests freely in a single context, and I would do so. Otherwise you have to poll two separate contexts, doubling the number of syscalls.

  • Requests to a context are passed to the kernel's async I/O VFS layer. Multiple files, multiple contexts, multiple processes or users making requests: it all ends up in the same layer. The VFS layer then sends the requests to the relevant filesystems or block devices, and all the usual collation and such happens naturally.

  • Requests to the same file through one or more contexts at the same time are, I think, undefined in their ordering if they overlap. They could be ordered one way or the other; the later request could be processed first, for example. So you need to add your own synchronization if strict ordering is required, just as with one or more threads doing read/write calls in parallel.

  • Prioritization and scheduling will depend on the lower layers. As far as I know, block devices will reorder requests so they are serviced in order of increasing block number (the elevator code) to minimize seek times on rotating disks.

  • Yes, requests from different contexts and normal read/write calls will get interleaved.

  • I think the requesting process, NUMA placement, and the like are completely ignored.

Note: When dealing with files, make sure the filesystem supports the Linux async I/O hooks, and you might need to use O_DIRECT on open(), with all its consequences.
A simple way I found to test this is to make lots of requests to a file in one io_submit() call and then check whether they all finish simultaneously. If the filesystem falls back to sync I/O, then everything submitted will finish at the same time.
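
A rough sketch of that test (the file name, sizes, and request count are my assumptions): submit several writes in one io_submit() call and timestamp each completion; if the filesystem silently fell back to synchronous I/O, io_submit() itself blocks and the completions all arrive at essentially the same instant.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define NREQ 8
    #define SZ   4096

    int main(void) {
        int fd = open("testfile.bin", O_RDWR | O_CREAT | O_DIRECT, 0644);

        io_context_t ctx = 0;
        io_setup(NREQ, &ctx);

        struct iocb cbs[NREQ], *ptrs[NREQ];
        for (int i = 0; i < NREQ; i++) {
            void *buf;
            posix_memalign(&buf, SZ, SZ);               /* aligned buffer per request */
            memset(buf, i, SZ);
            io_prep_pwrite(&cbs[i], fd, buf, SZ, (long long)i * SZ);
            ptrs[i] = &cbs[i];
        }
        io_submit(ctx, NREQ, ptrs);                     /* everything queued in one call */

        /* Harvest completions one at a time and timestamp each one. */
        for (int done = 0; done < NREQ; done++) {
            struct io_event ev;
            io_getevents(ctx, 1, 1, &ev, NULL);
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            printf("completion %d at %ld.%09ld, res=%ld\n",
                   done, (long)ts.tv_sec, ts.tv_nsec, (long)ev.res);
        }

        io_destroy(ctx);
        close(fd);
        return 0;
    }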

What is the status of POSIX asynchronous I/O (AIO)?

Network I/O is not a priority for AIO because everyone writing POSIX network servers uses an event based, non-blocking approach. The old-style Java "billions of blocking threads" approach sucks horribly.

Disk write I/O is already buffered, and disk read I/O can be prefetched into the buffer cache using functions like posix_fadvise. That leaves direct, unbuffered disk I/O as the only useful purpose for AIO.
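
For instance, a hedged one-liner (the helper name is mine) to ask the kernel to start readahead on a whole file before you read it:

    #include <fcntl.h>

    /* Hint that the whole file will be read soon so the kernel can start readahead;
     * later read()/pread() calls then tend to hit the page cache instead of blocking. */
    static int prefetch_whole_file(int fd) {
        return posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);  /* len 0 = to end of file */
    }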

Direct, unbuffered I/O is only really useful for transactional databases, and those tend to write their own threads or processes to manage their disk I/O.

So, at the end that leaves POSIX AIO in the position of not serving any useful purpose. Don't use it.


