Is There Really No Asynchronous Block I/O on Linux

Is there really no asynchronous block I/O on Linux?

The real answer, which Peter Teoh indirectly pointed to, is based on io_setup() and io_submit().
Specifically, the "aio_" functions Peter mentions are part of the glibc user-level emulation based on threads, which is not an efficient implementation.
The real answer is in:

io_submit(2)
io_setup(2)
io_cancel(2)
io_destroy(2)
io_getevents(2)

Note that the man page, dated 2012-08, says that this implementation has not yet matured to the point where it can replace the glibc user-space emulation:

http://man7.org/linux/man-pages/man7/aio.7.html

this implementation hasn't yet matured to the point where the POSIX
AIO implementation can be completely reimplemented using the kernel
system calls.

So, according to the latest kernel documentation I can find, Linux does not yet have a mature, kernel-based asynchronous I/O model. And even if I assume that the documented model is actually mature, it still doesn't support partial I/O in the sense of recv() vs. read().
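To make this concrete, here is a minimal sketch of the kernel interface listed above. It is an illustration only: glibc provides no wrappers for these calls, so it goes through syscall(2), error handling is trimmed, and the file name and buffer size are placeholders.

    /* Kernel AIO sketch: set up a context, submit one read, reap its completion. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/aio_abi.h>     /* struct iocb, struct io_event, aio_context_t */
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        aio_context_t ctx = 0;
        if (syscall(SYS_io_setup, 128, &ctx) < 0) { perror("io_setup"); return 1; }

        int fd = open("data.bin", O_RDONLY | O_DIRECT);   /* O_DIRECT: bypass the page cache */
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT usually requires aligned buffers; 4096 is a common block size. */
        static char buf[4096] __attribute__((aligned(4096)));

        struct iocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes     = fd;
        cb.aio_lio_opcode = IOCB_CMD_PREAD;
        cb.aio_buf        = (unsigned long)buf;
        cb.aio_nbytes     = sizeof(buf);
        cb.aio_offset     = 0;

        struct iocb *cbs[1] = { &cb };
        if (syscall(SYS_io_submit, ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

        /* ... do other work here; the read proceeds in the background ... */

        struct io_event ev;
        if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) == 1)
            printf("read completed, result = %lld\n", (long long)ev.res);

        syscall(SYS_io_destroy, ctx);
        close(fd);
        return 0;
    }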

Does synchronized I/O always mean blocking I/O?

"Asynchronous" or "non-blocking" I/O are, indeed, effectively synonymous. However, if we're using Linux terminology, "blocking" and "synchronized" I/O are different.

"Blocking" just tells you that the syscall won't return until the kernel has recorded the data... somewhere. There's no guarantee that this record is persistent in the event of an unexpected power loss or hardware failure; it can simply be a writeahead cache, for example -- so your blocking call can return at a point where other processes running at the time can see the write, but where that write would be lost if a power failure took place.

"Synchronized" in the O_SYNC sense tells you that the syscall won't return until the data is actually persisted to hardware.


Thus: All synchronized I/O is blocking, but not all blocking I/O is synchronized.
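As a concrete illustration (a minimal sketch; the file names are just placeholders), both calls below block, but only the second guarantees the data has reached stable storage by the time write() returns:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        const char msg[] = "hello\n";

        /* Blocking but not synchronized: write() returns once the kernel has
         * accepted the data into its caches; it may not be on disk yet. */
        int fd1 = open("plain.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd1, msg, sizeof(msg) - 1);
        close(fd1);

        /* Blocking and synchronized: with O_SYNC, write() does not return
         * until the data (and the required metadata) has reached the device. */
        int fd2 = open("synced.txt", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
        write(fd2, msg, sizeof(msg) - 1);
        close(fd2);

        return 0;
    }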

What is the status of POSIX asynchronous I/O (AIO)?

Network I/O is not a priority for AIO because everyone writing POSIX network servers uses an event-based, non-blocking approach. The old-style Java "billions of blocking threads" approach sucks horribly.

Disk write I/O is already buffered, and disk read I/O can be prefetched into the buffer cache using functions like posix_fadvise. That leaves direct, unbuffered disk I/O as the only useful purpose for AIO.
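For instance (a minimal sketch; the file name and the 16 MiB window are arbitrary), a read can be hinted ahead of time so the kernel starts prefetching the range into the page cache before it is needed:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("big.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Hint: we will read the first 16 MiB soon; the kernel may start
         * prefetching it into the page cache asynchronously. */
        posix_fadvise(fd, 0, 16 * 1024 * 1024, POSIX_FADV_WILLNEED);

        /* ... do other work; a later read() will likely hit the cache ... */

        char buf[4096];
        read(fd, buf, sizeof(buf));
        close(fd);
        return 0;
    }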

Direct, unbuffered I/O is only really useful for transactional databases, and those tend to write their own threads or processes to manage their disk I/O.

So, in the end, that leaves POSIX AIO without any useful purpose. Don't use it.

Will non-blocking I/O be put to sleep during copying data from kernel to user?

I can't quite parse what you've written.

I'll hazard a guess: you might be overlooking the fact that the write(2) and read(2) syscalls (and those of their ilk, such as send(2) and recv(2)) on sockets put into non-blocking mode are free to consume (and return, respectively) less data than requested.

In other words, a write(2) call on a non-blocking socket told to write 1 megabyte of data will consume only as much data as currently fits into the associated kernel buffer and return immediately, signalling how much it consumed. The next immediate call to write(2) will likely return EWOULDBLOCK.

The same goes for the read(2) call: if you pass it a buffer large enough to hold 1 megabyte of data and tell it to read that many bytes, the call will only drain the contents of the kernel buffer and return immediately, signalling how much data it actually copied. The next immediate call to read(2) will likely return EWOULDBLOCK.

So, any attempt to get data from or put data into the socket succeeds almost immediately: either after the data has been shoveled between the kernel's buffer and user space, or right away with the EAGAIN return code.

Sure, there's the possibility of an OS thread being suspended right in the middle of performing such a syscall, but this does not count as "blocking in a syscall."
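A minimal sketch of what handling this looks like in practice (the helper name write_some is made up for illustration, and error handling is pared down): each write(2) may consume only part of the buffer, and EAGAIN/EWOULDBLOCK simply means "come back when the socket is writable again".

    #include <errno.h>
    #include <unistd.h>

    /* Write as much of buf as the non-blocking socket will take right now.
     * Returns the number of bytes consumed, or -1 on a real error. */
    ssize_t write_some(int sock, const char *buf, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            ssize_t n = write(sock, buf + done, len - done);
            if (n > 0) {
                done += n;        /* partial write is perfectly normal */
            } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                break;            /* kernel buffer full: retry when poll/epoll
                                     reports the socket as writable */
            } else if (n < 0 && errno == EINTR) {
                continue;         /* interrupted by a signal, just retry */
            } else {
                return -1;        /* real error */
            }
        }
        return (ssize_t)done;
    }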


Update to the original answer in response to the following comment of the OP:

<…>

This is what I see in the book
"UNIX Network Programming" (Volume 1, 3rd edition), chapter 6.2:

A synchronous I/O operation causes the requesting process
to be blocked until that I/O operation completes. Using these
definitions, the first four I/O models—blocking, nonblocking, I/O
multiplexing, and signal-driven I/O—are all synchronous because the
actual I/O operation (recvfrom) blocks the process.


It uses "blocks" to describe nonblocking I/O operation. That makes me confused.

I still don't understand why the book uses "blocks the process" if the process is actually not blocked.

I can only guess that the book's author intended to highlight that the process is indeed blocked from entering a syscall until returning from it. Reads from and writes to a non-blocking socket do block to transfer the data, if available, between the kernel and user space. We colloquially say this does not block because we mean "it does not block waiting, doing nothing, for an indeterminate amount of time".

The book's author might be contrasting this with so-called asynchronous I/O (called "overlapped" on Windows™), where you basically give the kernel a buffer with/for data and ask it to deal with it completely in parallel with your code, in the sense that the relevant syscall returns right away and the I/O is carried out in the background (with regard to your user-space code).

To my knowledge, Go does not use the kernel's async I/O facilities on any platform it supports. You might look there for developments regarding Linux and its contemporary io_uring subsystem.

Oh, and one more point. The book might (at that point through the narrative at least) be discussing a simplified "classic" scheme where there are no in-process threads, and the sole unit of concurrency is the process (with a single thread of execution). In this scheme, any syscall obviously blocks the whole process. In contrast, Go works only on kernels which support threads, so in a Go program a syscall never blocks the whole process—only the thread it's called on.


Let me take yet another stab at explaining the problem as—I perceive—the OP stated it.

The problem of serving multiple client requests is not new—one of the more visible first statements of it is "The C10k problem".

To quickly recap it, a single-threaded server with blocking operations on the sockets it manages is only realistically able to handle a single client at a time.

To solve it, there exist two straightforward approaches:

  • Fork a copy of the server process to handle each incoming client connection.
  • On an OS which supports threads, spawn a new thread inside the same process to handle each incoming client.

They have their pros and cons, but they both suck with regard to resource usage, and, more importantly, they do not play well with the fact that most clients perform I/O at a relatively low rate and bandwidth compared to the processing resources available on a typical server.

In other words, when serving a typical TCP/IP exchange with a client, the serving thread most of the time sleeps in the write(2) and read(2) calls on the client socket.

This is what most people mean when talking about "blocking operations" on sockets: if a socket is blocking, an operation on it will block until it can actually be carried out, and the originating thread will be put to sleep for an indeterminate amount of time.

Another important thing to note is that when the socket becomes ready, the amount of work done is typically minuscule compared to the amount of time slept between wakeups.
While the thread sleeps, its resources (such as memory) are effectively wasted, as they cannot be used to do any other work.

Enter "polling". It combats the problem of wasted resources by noticing that the points of readiness of networked sockets are relatively rare and far in between, so it makes sense to have lots of such sockets been served by a single thread: it allows to keep the thread almost as busy as theoretically possible, and also allows to scale out when needed: if a single thread is unable to cope with the data flow, add another thread, and so on.

This approach is definitely cool, but it has a downside: the code which reads and writes data must be rewritten in callback style instead of the original plain sequential style. Writing with callbacks is hard: you usually have to implement intricate buffer management and state machines to deal with this.

The Go runtime solves this problem by adding another layer of scheduling for its execution flow units, goroutines: for goroutines, operations on sockets are always blocking, but when a goroutine is about to block on a socket, this is handled transparently by suspending only the goroutine itself until the requested operation can proceed, and by using the thread the goroutine was running on to do other work¹.

This gives the best of both approaches: the programmer may write classic no-brainer sequential, callback-free networking code, yet the threads used to handle networking requests are fully utilized².

As to the original question of blocking, both the goroutine and the thread it runs on are indeed blocked when the data transfer on a socket is happening, but since what happens is data shoveling between a kernel and a user-space buffer, the delay is most of the time small, and is no different to the classic "polling" case.

Note that performing syscalls, including I/O on non-pollable descriptors, in Go (at least up to and including Go 1.14) does block both the calling goroutine and the thread it runs on, but it is handled differently from I/O on pollable descriptors: when a special monitoring thread notices a goroutine has spent more than a certain amount of time in a syscall (20 µs, IIRC), the runtime pulls the so-called "processor" (the runtime construct which runs goroutines on OS threads) out from under the goroutine and tries to make it run another goroutine on another OS thread; if there is a goroutine wanting to run but no free OS thread, the Go runtime creates another one.

Hence "normal" blocking I/O is still blocking in Go in both senses: it blocks both goroutines and OS threads, but the Go scheduler makes sure the program as a whole still able to make progress.

This could arguably be a perfect case for using true asynchronous I/O provided by the kernel, but it's not there yet.


¹ See this classic essay for more info.

² The Go runtime is certainly not the first one to pioneer this idea. For instance, look at the State Threads library (and the more recent libtask) which implement the same approach in plain C; the ST library has superb docs which explain the idea.

Is all asynchronous I/O ultimately implemented using polling?

At the lowest (or at least, lowest worth looking at) hardware level, asynchronous operations truly are asynchronous in modern operating systems.

For example, when you read a file from disk, the operating system translates your call to read into a series of disk operations (seek to location, read blocks X through Y, etc.). On most modern OSes, these commands get written either to special registers or to special locations in main memory, and the disk controller is informed that there are operations pending. The operating system then goes on about its business, and when the disk controller has completed all of the operations assigned to it, it triggers an interrupt, causing the thread that requested the read to pick up where it left off.

Regardless of what type of low-level asynchronous operation you're looking at (disk I/O, network I/O, mouse and keyboard input, etc.), ultimately, there is some stage at which a command is dispatched to hardware, and the "callback" as it were is not executed until the hardware reaches out and informs the OS that it's done, usually in the form of an interrupt.

That's not to say that there aren't some asynchronous operations implemented using polling. One trivial (but naive and costly) way to implement any blocking operation asynchronously is just to spawn a thread that waits for the operation to complete (perhaps polling in a tight loop), and then call the callback when it's finished. Generally speaking, though, common asynchronous operations at the OS level are truly asynchronous.

It's also worth mentioning that just because an API is blocking doesn't mean it's polling: you can put a blocking API on an asynchronous operation, and a non-blocking API on a synchronous operation. With things like select and kqueues, for example, the thread actually just goes to sleep until something interesting happens. That "something interesting" comes in the form of an interrupt (usually), and that's taken as an indication that the operating system should wake up the relevant threads to continue work. It doesn't just sit there in a tight loop waiting for something to happen.

There really is no way to tell whether a system uses polling or "real" callbacks (like interrupts) just from its API, but yes, there are asynchronous APIs that are truly backed by asynchronous operations.

How does io_uring work internally?

Apparently, Linux already had an Asyn[c]-IO (AIO) API. I believe it is not fully asynchronous. So what was the issue with AIO?

If you're extra careful and match all its constraints, the "old" Linux AIO interface will behave asynchronously. However, if you "break" ANY of the (hidden) rules, submission can suddenly (and silently) behave in a synchronous fashion (i.e. submission blocks non-deterministically and for inconveniently long periods of time). Some of the many "rules" are given in answers to asynchronous IO io_submit latency in Ubuntu Linux (the overall issues are also listed in section 1.0 of the io_uring document you linked).

How does io_uring overcome it?

  • It is a radically different interface (see this answer on the "Is there really no asynchronous block I/O on Linux?") which is harder to get wrong.
  • It has a workqueue/thread pool mechanism which it will punt requests to when it is aware that blocking will take place before the result of submission can be returned (and thus it is able to return control back to the caller). This allows it to retain asynchrony in more (hopefully all!) submission cases.
  • It has an optional privileged mode (IORING_SETUP_SQPOLL) where you don't even have to make syscalls to submit/retrieve results. If you're "just" manipulating memory contents it's going to be hard to be blocked on a call you never made!

How does io_uring work internally?

There are two ring buffers: the first ring is used to queue submissions, and when the work has been completed the "results" are announced via the second ring buffer (which contains completions). It's hard to give you more than a very high-level view here, but if you're uncomfortable with things like C structs and C function interfaces you may enjoy this video by Jens presenting io_uring, and you may find the explanations in https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/ and https://mattermost.com/blog/iouring-and-go/ more accessible.
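To give a feel for the two rings, here is a minimal sketch using liburing, the userspace helper library for io_uring (error handling is trimmed, the file name is a placeholder, io_uring_prep_read needs a reasonably recent kernel, and the program must be linked with -luring):

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);            /* 8-entry rings */

        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);  /* submission entry */
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);                      /* hand the SQE to the kernel */

        /* ... do other work; the read proceeds asynchronously ... */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);              /* reap from the completion ring */
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }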

io_uring's advantages over Linux AIO don't stop at better asynchrony though! See the aforementioned link for "Is there really no asynchronous block I/O on Linux?" for a list of other benefits...

What actually happens in asynchronous IO

I do not understand this saying because using sync IO (such as write()) the kernel writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.

No. Most modern devices are able to transfer data to/from RAM by themselves (using DMA or bus mastering).

For example, the CPU might tell a disk controller "read 4 sectors into RAM at address 0x12345000", and then the CPU can do anything else it likes while the disk controller does the transfer (the CPU will be interrupted by an IRQ from the disk controller when the transfer has finished).

However, on modern systems (where any number of processes may all want to use the same device at the same time) the device driver has to maintain a list of pending operations. In this case (under load), when the device generates an IRQ to say that it finished an operation, the device driver responds by telling the device to start the next pending operation. That way the device spends almost no time idle waiting to be asked to start the next operation (much better device utilization), and the CPU spends almost all of its time doing something else (between IRQs).

Of course the hardware is often more advanced (e.g. having an internal queue of operations itself, so the driver can tell it to do multiple things and it can start the next operation as soon as it has finished the previous one); and often drivers are more advanced too (e.g. having "I/O priorities" to ensure that more important work is done first, rather than just having a simple FIFO queue of pending operations).

Let's say I have an application whose only job is to get info and write it into files. Is there any benefit to using async IO instead of sync IO?

Let's say that you get info from deviceA (while the CPU and deviceB are idle); then you process that info a little (while deviceA and deviceB are idle); then you write the result to deviceB (while deviceA and the CPU are idle). You can see that most of the hardware is doing nothing most of the time (poor utilization).

With asynchronous IO; while deviceA is fetching the next piece of info the CPU can be processing the current piece of info while deviceB is writing the previous piece of info. Under ideal conditions (no speed mismatches) you can achieve 100% utilization (deviceA, CPU and deviceB are never idle); and even if there are speed mismatches (e.g. deviceB needs to wait for CPU to finish processing the current piece) the time anything spends idle will be minimized (and utilization maximized as much as possible).

The other alternative is to use multiple tasks, e.g. one task that fetches data from deviceA synchronously and notifies another task when the data has been read; a second task that waits until the data arrives, processes it, and notifies another task when it has been processed; then a third task that waits until the data has been processed and writes it to deviceB synchronously. In terms of utilization this is effectively identical to using asynchronous IO (in fact it can be considered an "emulation of asynchronous IO"). The problem is that you've added a bunch of extra overhead managing and synchronizing multiple tasks (more RAM spent on state and stacks, task switches, lock contention, ...), and you've made the code more complex and harder to maintain.

How is non-blocking I/O for regular files implemented in .NET on Linux?

It's worth pointing out that there are multiple contexts at play here.

The Linux operating system

From Non-Blocking descriptors:

By default, read on any descriptor blocks if there’s no data
available. The same applies to write or send. This applies to
operations on most descriptors except disk files, since writes to disk
never happen directly but via the kernel buffer cache as a proxy. The
only time when writes to disk happen synchronously is when the O_SYNC
flag was specified when opening the disk file.

Any descriptor (pipes, FIFOs, sockets, terminals, pseudo-terminals,
and some other types of devices) can be put in the nonblocking mode.
When a descriptor is set in nonblocking mode, an I/O system call on
that descriptor will return immediately, even if that request can’t be
immediately completed (and will therefore result in the process being
blocked otherwise). The return value can be either of the following:

  • an error: when the operation cannot be completed at all
  • a partial count: when the input or output operation can be partially completed
  • the entire result: when the I/O operation could be fully completed

As explained above, non-blocking descriptors will prevent pipes (or sockets, or ...) from blocking indefinitely. They weren't meant to be used with disk files, however, because whether you want to read an entire file or just a part of it, the data is already there. It's not going to arrive at some point in the future, so you can start processing it right away.
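A quick way to see this for yourself (a minimal sketch; the file path is just a placeholder): setting O_NONBLOCK on a regular file is accepted but changes nothing, and read() still performs the whole transfer before returning rather than failing with EAGAIN:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_NONBLOCK has no effect on a regular file: the read() below
         * still blocks until the requested data has been fetched. */
        int fd = open("bigfile.dat", O_RDONLY | O_NONBLOCK);
        if (fd < 0) { perror("open"); return 1; }

        char buf[65536];
        ssize_t n = read(fd, buf, sizeof(buf));   /* never EAGAIN for a regular file */
        printf("read %zd bytes\n", n);

        close(fd);
        return 0;
    }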

Quoting your linked post:

Regular files are always readable and they are also always writeable.
This is clearly stated in the relevant POSIX specifications. I cannot
stress this enough. Putting a regular file in non-blocking has
ABSOLUTELY no effects other than changing one bit in the file flags.

Reading from a regular file might take a long time. For instance, if
it is located on a busy disk, the I/O scheduler might take so much
time that the user will notice that the application is frozen.

Nevertheless, non-blocking mode will not fix it. It will simply not
work. Checking a file for readability or writeability always succeeds
immediately. If the system needs time to perform the I/O operation, it
will put the task in non-interruptible sleep from the read or write
system call. In other words, if you can assume that a file descriptor
refers to a regular file, do not waste your time (or worse, other
people's time) in implementing non-blocking I/O.

The only safe way to read data from or write data to a regular file
while not blocking a task... consists of not performing the operation,
not in that particular task anyway. Concretely, you need to create a separate thread (or process), or use asynchronous I/O (functions whose
name starts with aio_). Whether you like it or not, and even if you
think multiple threads suck, there are no other options.

The .NET runtime

It implements the async/await pattern to unblock the main event loop while I/O is being performed. As mentioned above:

Concretely, you need to create a separate thread (or process), or use
asynchronous I/O (functions whose name starts with aio_). Whether you
like it or not, and even if you think multiple threads suck, there are
no other options.

The .NET threadpool will spawn additional threads as needed (which may be reported as processes on Linux; ref why is .NET spawning multiple processes on Linux). So, ideally, when one of the .NET File.ReadAsync(...) or File.WriteAsync(...) overloads is called, the current thread (from the threadpool) initiates the I/O operation and then gives up control, freeing it to do other work. But before it does, a continuation is attached to the I/O operation, so that when the I/O device signals the operation has finished, the threadpool scheduler knows the next free thread can pick up the continuation.

To be sure, this is all about responsiveness. All code that requires the I/O to complete will still have to wait; it just won't "block" the application.

Back to OS

The thread giving up control, which eventually leads to it being freed up, can be achieved on Windows:

https://docs.microsoft.com/en-us/troubleshoot/windows/win32/asynchronous-disk-io-synchronous

Asynchronous I/O hasn't been a part of Linux for very long; the flow we have here is described at:

https://devblogs.microsoft.com/dotnet/file-io-improvements-in-dotnet-6/#unix

Unix-like systems don’t expose async file IO APIs (except of the new
io_uring which we talk about later). Anytime user asks FileStream to
perform async file IO operation, a synchronous IO operation is being
scheduled to Thread Pool. Once it’s dequeued, the blocking operation
is performed on a dedicated thread.

A similar flow is suggested by Python's asyncio implementation:

asyncio does not support asynchronous operations on the filesystem.
Even if files are opened with O_NONBLOCK, read and write will block.

...

The Linux kernel provides asynchronous operations on the filesystem
(aio), but it requires a library and it doesn't scale with many
concurrent operations. See aio.

...

For now, the workaround is to use aiofiles that uses threads to handle
files.

Closing thoughts

The concept behind Linux's non-blocking descriptors (and their polling mechanism) is not what makes async I/O tick on Windows.

As mentioned by @Damien_The_Unbeliever, there's a relatively new io_uring Linux kernel interface that allows a continuation flow similar to the one on Windows. However, the following links confirm this is not yet implemented in .NET 6:

  • https://devblogs.microsoft.com/dotnet/file-io-improvements-in-dotnet-6/#whats-next
  • https://github.com/dotnet/runtime/issues/12650

How does the Linux kernel handle Asynchronous I/O (AIO) requests?

Short answer:
Most likely the AIO implementation is "faster" because it submits multiple IOs in parallel, while the synchronous implementation has either zero or one I/O in flight. It has nothing to do with writing to memory or with the kernel I/O path having additional overhead for synchronous I/Os.

You can check this using iostat -x -d 1. Look at the avgqu-sz (average queue size = the average number of in-flight I/Os) and %util (utilization = the percentage of the time the device had at least one I/O issued to it).

Long answer:

  • The concept of "faster" is tricky when talking about I/O. Does "faster" mean higher bandwidth? Or is it lower latency? Or bandwidth at a given request size? Or latency at a given queue depth? Or a combination of latency, bandwidth, request size, queue depth, and the many other parameters or the workload? I assume here that you are taking about throughput/bandwidth, however, it is good to remember that the performance of a storage device is not a single dimension metric.

  • SSDs are highly parallel devices. An SSD is composed of many flash chips, each chip having multiple dies that can read/write independently. SSDs take advantage of this and perform many I/Os in parallel, without a noticeable increase in response time. Therefore, in terms of throughput, it matters a lot how many concurrent I/Os the SSD sees.

  • Let's understand what happens when a thread submits a synchronous I/O: a) the thread spends some CPU cycles preparing the I/O request (generate data, compute offset, copy data into buffer, etc.), b) the system call is performed (e.g. pread()), execution passes to kernel space, and the thread blocks, c) the I/O request is processed by the kernel and traverses the various kernel I/O layers, d) the I/O request is submitted to the device and traverses the interconnect (e.g. PCIe), e) the I/O request is processed by the SSD firmware, f) the actual read command is sent to the appropriate flash chip, g) the SSD controller waits for the data, h) the SSD controller gets the data from the flash chip and sends it through the interconnect. At this point the data leaves the SSD, and stages e) through a) happen in reverse.

  • As you can see, the synchronous I/O process is playing request ping-pong with the SSD. During many of the stages described above no data is actually read from the flash chips. On top of this, although your SSD can process tens to hundreds of requests in parallel, it sees at most one request at any given moment of time. Therefore, throughput is very, very low because you are actually not really using the SSD.

  • Asynchronous I/O helps in two ways: a) it allows the process to submit multiple I/O requests in parallel (so the SSD has enough work to keep busy), and b) it allows pipelining I/Os through the various processing stages (therefore decoupling stage latency from throughput). A sketch of parallel submission follows after this list.

  • The reason why you see asynchronous I/O being faster than synchronous I/O is because you compare apples and oranges. The synchronous throughput is at a given request size, low queue depth, and without pipelining. The asynchronous throughput is at a different request size, higher queue depth, and with pipelining. The numbers you saw are not comparable.

  • The majority of I/O-intensive applications (i.e. most applications, such as databases, webservers, etc.) have many threads that perform synchronous I/O. Although each thread can submit at most one I/O at any given moment in time, the kernel and the SSD device see many I/O requests that can be served in parallel. Multiple sync I/O requests result in the same benefits as multiple async I/O requests.

    The main differences between asynchronous and synchronous I/O come down to how I/O and process scheduling interact and to the programming model. Both async and sync I/O can squeeze the same IOPS/throughput from a storage device if done right.
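To make the "multiple I/Os in flight" point concrete, here is a minimal sketch (raw kernel AIO syscalls as in the earlier example; error handling trimmed, and the queue depth, block size and file name are arbitrary placeholders) that submits a whole batch of reads at once, so the device always has work queued:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/aio_abi.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define QD  8          /* queue depth: number of I/Os kept in flight */
    #define BS  4096       /* block size (O_DIRECT needs aligned buffers) */

    int main(void)
    {
        aio_context_t ctx = 0;
        syscall(SYS_io_setup, QD, &ctx);

        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        static char bufs[QD][BS] __attribute__((aligned(BS)));
        struct iocb cbs[QD], *ptrs[QD];

        /* Prepare QD reads at consecutive offsets and submit them in one call:
         * the device now sees QD requests it can service in parallel. */
        for (int i = 0; i < QD; i++) {
            memset(&cbs[i], 0, sizeof(cbs[i]));
            cbs[i].aio_fildes     = fd;
            cbs[i].aio_lio_opcode = IOCB_CMD_PREAD;
            cbs[i].aio_buf        = (unsigned long)bufs[i];
            cbs[i].aio_nbytes     = BS;
            cbs[i].aio_offset     = (long long)i * BS;
            ptrs[i] = &cbs[i];
        }
        syscall(SYS_io_submit, ctx, QD, ptrs);

        /* Reap completions; a real program would resubmit new reads here to
         * keep the queue depth constant (pipelining). */
        struct io_event events[QD];
        long done = syscall(SYS_io_getevents, ctx, QD, QD, events, NULL);
        printf("%ld reads completed\n", done);

        syscall(SYS_io_destroy, ctx);
        close(fd);
        return 0;
    }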


