Boost Asio Single Threaded Performance

You might want to read my question from a few years ago; I asked it when first investigating the scalability of Boost.Asio while developing the system software for the Blue Gene/Q supercomputer.

Scaling to 100k or more connections should not be a problem, though you will need to be aware of the obvious resource limitations such as the maximum number of open file descriptors. If you haven't read the seminal C10K paper, I suggest reading it.

After you have implemented your application using a single thread and a single io_service, I suggest investigating a pool of threads invoking io_service::run(), and only then investigating pinning an io_service to a specific thread and/or CPU. There are multiple examples included in the Asio documentation for all three of these designs, and several questions on SO with more information. Be aware that as you introduce multiple threads invoking io_service::run(), you may need to use strands to ensure the handlers have exclusive access to shared data structures.

Boost: Single Threaded IO Service

Be aware that you have to pay for synchronization as soon as you use any of Asio's non-blocking calls.

Even though you might use a single thread for scheduling work and processing the resulting callbacks, Asio might still have to spawn additional threads internally for executing asynchronous calls. Those will access the io_service concurrently.

Think of an async_read on a socket: as soon as the received data becomes available, the socket has to notify the io_service. This happens concurrently with your main thread, so additional synchronization is required.

For blocking I/O this problem goes away in theory, but since asynchronous I/O is sort of the whole point of the library, I would not expect to find too many optimizations for this case in the implementation.

As was pointed out in the comments already, the contention on the io_service will be very low with only one main thread, so unless profiling indicates a clear performance bottleneck there, you should not worry about it too much.

Is it safe to disable threads on boost::asio in a multi-threaded program?

It'll depend. As far as I know it ought to be fine. See below for caveats/areas of attention.

Also, you might want to take a step back and think about the objectives. If you're trying to optimize areas containing async IO, there may be quick wins that don't require such drastic measures. That said, there are certainly situations where I imagine BOOST_ASIO_DISABLE_THREADS will help squeeze out just that little extra bit of performance.



Impact

What BOOST_ASIO_DISABLE_THREADS does is

  • replace selected mutexes/events with null implementations
  • disable some internal thread support (boost::asio::detail::thread throws on construction)
  • remove atomics (atomic_count becomes non-atomic)
  • make globals behave as simple statics (applies to system_context/system_executor)
  • disable TLS support

System executor

It's worth noting that system_executor is the default fallback when querying for associated handler executors. The library implementation specifies that async initiations will override that default with the executor of any IO object involved (e.g. the one bound to your socket or timer).

However, you have to scrutinize your own use, and that of third-party code, to make sure you don't accidentally rely on the fallback.

Update: it turns out system_executor internally spawns a thread_group, which uses detail::thread - correctly erroring out when used.

IO Services

Asio is extensible. Some services may elect to run internal threads as an implementation detail.

docs:

The implementation of this library for a particular platform may make use of one or more internal threads to emulate asynchronicity. As far as possible, these threads must be invisible to the library user. [...]

I'd trust the library implementation to use detail::thread - causing a runtime error if that were to be the case.

However, again, when using third-party code/user services you'll have to make sure that they don't break your assumptions.

Also, specific operations will not work without the thread support, like:

Live On Coliru

#define BOOST_ASIO_DISABLE_THREADS
#include <boost/asio.hpp>
#include <iostream>

int main() {
    boost::asio::io_context ioc;
    boost::asio::ip::tcp::resolver r{ioc};

    std::cout << r.resolve("127.0.0.1", "80")->endpoint() << std::endl; // fine

    // throws "thread: not supported":
    r.async_resolve("127.0.0.1", "80", [](auto...) {});
}

Prints

127.0.0.1:80
terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
what(): thread: Operation not supported [system:95]
bash: line 7: 25771 Aborted (core dumped) ./a.out

boost strand vs single thread

A strand will typically perform better than a single thread. This is because a strand gives the scheduler and the program logic more flexibility. However, the differences are typically not significant (except in the special case I discuss below).

For example, consider the case where something happens that requires service. With a strand, there can be more than one thread that could perform the service, and whichever of those threads gets scheduled first will do the job. With a dedicated thread, that particular thread must get scheduled for the job to start.

Suppose, for example, a timer fires that creates some new work to be done by the strand. If the timer thread then calls into the strand's dispatch routine, the timer thread can do the work with no context switch. If you had a dedicated thread rather than a strand, then the timer thread could not do the work, and a context switch would be needed before the work created by the timer routine could even begin.

Note that if you just have one thread that executes the strand, you don't get these benefits. (But, IMO, that's a dumb way to do things if you care about performance at this fine a level.)

For some applications, carefully breaking your program into strands can significantly reduce the amount of lock operations required. Objects that are only accessed in a single strand need not be locked. But you can still get a lot of the advantages of multi-threading. (One big disadvantage though -- if any of your code ever blocks, it will stall the entire strand. So you either have to not mind if a strand stalls or make sure none of your code for a critical strand ever blocks.)

Suppose, for example, that work is handed from one stage to the next. You can have three strands, A, B, and C, and a single thread can do some work for strand A, some for strand B, and some for strand C with no context switches (and with the data hot in the cache). Using a dedicated thread for each task would require two context switches to do the same job, and each task would likely not find the data in cache. If you constantly "hand things off" from strand to strand, strands can significantly outperform dedicated threads.

As to your second question, a lock is not needed unless data is being accessed in one thread while it might be modified in another. If all accesses to an object go through a single strand, locks are not needed, because a strand can only execute in one thread at a time. Typically, a strand will access some data that only it touches and some that is shared with other threads or strands.

Scalability of Boost.Asio

We are using 1.39 on several Linux flavors for timers, networking (both TCP and UDP), serial I/O (20+ lines, two of which run at 500 kbps), and inotify events, and while we don't have many socket connections, we do have a few hundred async timers at any time. They are in production and they work well for us. If I were you, I'd put together a quick prototype and performance-test it.

Boost 1.43 claims a number of Linux-specific performance improvements in ASIO, but I have yet to benchmark them for our product.


