C++ Socket Server - Unable to saturate CPU

boost::asio is not as thread-friendly as you would hope - there is a big lock around the epoll code in boost/asio/detail/epoll_reactor.hpp, which means that only one thread at a time can call into the kernel's epoll syscall. For very small requests this makes all the difference: you will see only roughly single-threaded performance.

Note that this is a limitation of how boost::asio uses the Linux kernel facilities, not necessarily the Linux kernel itself. The epoll syscall does support multiple threads when using edge-triggered events, but getting it right (without excessive locking) can be quite tricky.
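For illustration, here is a minimal sketch of that multithreaded edge-triggered pattern (my own illustration, not the nginetd code): several threads block in epoll_wait() on one shared epoll fd, and EPOLLET | EPOLLONESHOT guarantees each ready socket is handed to exactly one thread, which re-arms it when done. Accept handling and the actual I/O are elided.

#include <sys/epoll.h>
#include <thread>
#include <vector>

void worker(int epfd) {
    epoll_event ev;
    while (epoll_wait(epfd, &ev, 1, -1) == 1) {
        int fd = ev.data.fd;
        // ... read() until EAGAIN, process, write the reply ...

        // EPOLLONESHOT disarmed the fd when it was delivered to us, so no
        // other thread can race on it; re-arm it for the next request.
        epoll_event rearm{};
        rearm.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
        rearm.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &rearm);
    }
}

int main() {
    int epfd = epoll_create1(0);
    // ... each accepted socket is added once with
    //     epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev), where
    //     ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT ...
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back(worker, epfd);
    for (auto& t : pool)
        t.join();
}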

BTW, I have been doing some work in this area (combining a fully-multithreaded edge-triggered epoll event loop with user-scheduled threads/fibers) and made some code available under the nginetd project.

Fastest socket method for a lot of data between a lot of files

On Windows you may try TransmitFile, which can boost performance by avoiding data copies between kernel space and user space.
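Roughly like this (a hedged sketch; send_file is my own helper name, and Winsock initialization plus error handling are trimmed):

#include <winsock2.h>
#include <mswsock.h>
#pragma comment(lib, "ws2_32.lib")
#pragma comment(lib, "mswsock.lib")

// Send an entire file on a connected socket without copying the data
// through user space; the kernel feeds the file straight to the stack.
bool send_file(SOCKET sock, const wchar_t* path) {
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
    if (file == INVALID_HANDLE_VALUE)
        return false;
    // Zero for the byte counts means "send the whole file, default chunking".
    BOOL ok = TransmitFile(sock, file, 0, 0, nullptr, nullptr, 0);
    CloseHandle(file);
    return ok == TRUE;
}

On Linux the analogous zero-copy call is sendfile(2).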

Socket server with epoll and threads

I think you're trying to over-engineer this problem. The epoll architecture in Linux is intended for situations where you have thousands of concurrent connections. In those cases, the overhead imposed by the way the poll and select system calls are defined becomes the main bottleneck in a server. The decision to use poll or select vs. epoll is based on the number of connections, not the amount of data.

For what you're doing, it seems as though the humans in your editing session would go insane after you hit a few dozen concurrent editors. Using epoll will probably just drive you crazy: it plays a few tricks with the API to squeeze out extra performance, and you have to be very careful processing the information you get back from its calls.

This sort of application sounds like it would be network-I/O-bound rather than CPU-bound. I would try writing it as a single-threaded server using poll first. When you receive new text, buffer it for your clients if necessary, and send it out when the socket becomes writable. Use non-blocking I/O; the only call you want to block is the poll call.
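In code, the skeleton might look something like this (illustrative only; buffering and error handling are elided):

#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <vector>

void event_loop(int listen_fd) {
    std::vector<pollfd> fds{{listen_fd, POLLIN, 0}};
    for (;;) {
        poll(fds.data(), fds.size(), -1);  // the only blocking call
        std::vector<pollfd> fresh;         // new clients, appended afterwards
        for (auto& p : fds) {
            if (p.fd == listen_fd && (p.revents & POLLIN)) {
                // new connection: accept it and make it non-blocking
                int c = accept(listen_fd, nullptr, nullptr);
                fcntl(c, F_SETFL, fcntl(c, F_GETFL) | O_NONBLOCK);
                fresh.push_back({c, POLLIN, 0});
            } else if (p.revents & (POLLIN | POLLOUT)) {
                // recv() until EAGAIN, buffer outgoing text, and keep
                // POLLOUT in p.events while a write is still pending
            }
        }
        fds.insert(fds.end(), fresh.begin(), fresh.end());
    }
}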

If you are doing a significant amount of processing on the data after receiving it, but before sending it back out to clients, then you could benefit from multi-threading. Write the single-threaded version first, then if you are CPU-bound (check using top) and most of the CPU time is spent in the functions where you are doing data processing (check using gprof), add multithreading to do the data processing.

If you want, you can use pipes or Unix-domain sockets inside the program for communication between the different threads; that way everything in the main thread stays event-driven and handled through poll. Alternatively, with this model, you could even use multiple processes with fork instead of multiple threads.
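The pipe variant can be wired up like this (a minimal sketch with one worker thread and a single wake-up byte):

#include <poll.h>
#include <unistd.h>
#include <thread>

int main() {
    int pipefd[2];
    pipe(pipefd);  // [0] = read end (main loop), [1] = write end (worker)

    std::thread worker([w = pipefd[1]] {
        // ... heavy data processing happens here ...
        char done = 1;
        write(w, &done, 1);  // wake the main loop when the result is ready
    });

    // In the real server this pollfd simply joins the socket pollfds,
    // so worker results arrive through the same poll() call as I/O events.
    pollfd pfd{pipefd[0], POLLIN, 0};
    poll(&pfd, 1, -1);
    char buf;
    read(pipefd[0], &buf, 1);  // drain the notification byte
    worker.join();
}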

Increasing CPU usage on server side

According to the topic, your goal seems to be to load the CPU on your server, which is not directly related to Python. Beyond that, it is hard to judge what your program is doing "wrong" without any code.

If your server runs Linux, check out the stress utility. It can do that for you.

If your server runs Windows, this topic might be helpful for stressing the CPU.

If you need a Pythonic solution, you can refer to this StackOverflow post.

Is it safe to disable threads on boost::asio in a multi-threaded program?

It'll depend. As far as I know it ought to be fine. See below for caveats/areas of attention.

Also, you might want to take a step back and think about the objectives. If you're trying to optimize areas containing async IO, there may be quick wins that don't require such drastic measures. That said, there are certainly situations where I imagine BOOST_ASIO_DISABLE_THREADS will help squeeze out just that little extra bit of performance.



Impact

What BOOST_ASIO_DISABLE_THREADS does is:

  • replace selected mutexes/events with null implementations
  • disable some internal thread support (boost::asio::detail::thread throws on construction)
  • remove atomics (atomic_count becomes non-atomic)
  • make globals behave as simple statics (applies to system_context/system_executor)
  • disable TLS support

System executor

It's worth noting that system_executor is the default fallback when querying for associated handler executors. The library implementation specifies that async initiations will override that default with the executor of any IO object involved (e.g. the one bound to your socket or timer).

However, you have to scrutinize your own use and that of third-party code to make sure you don't accidentally rely on fallback.

Update: it turns out system_executor internally spawns a thread_group, which uses detail::thread - correctly erroring out when used.
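Given that, when compiling with BOOST_ASIO_DISABLE_THREADS it seems prudent to always name an executor explicitly rather than rely on the fallback. A small sketch of the difference (my own illustration):

#define BOOST_ASIO_DISABLE_THREADS
#include <boost/asio.hpp>

int main() {
    boost::asio::io_context ioc;

    // boost::asio::post(handler) alone would query the handler's associated
    // executor and fall back to system_executor - which, per the update
    // above, needs the thread support we just disabled. Passing the
    // io_context explicitly keeps everything on this thread:
    boost::asio::post(ioc, [] { /* work */ });

    ioc.run();
}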

IO Services

Asio is extensible. Some services may elect to run internal threads as an implementation detail.

docs:

The implementation of this library for a particular platform may make use of one or more internal threads to emulate asynchronicity. As far as possible, these threads must be invisible to the library user. [...]

I'd trust the library implementation to use detail::thread - causing a runtime error if that were to be the case.

However, again, when using third-party code/user services you'll have to make sure that they don't break your assumptions.

Also, specific operations will not work without the thread support, like:

Live On Coliru

#define BOOST_ASIO_DISABLE_THREADS
#include <boost/asio.hpp>
#include <iostream>

int main() {
    boost::asio::io_context ioc;
    boost::asio::ip::tcp::resolver r{ioc};

    std::cout << r.resolve("127.0.0.1", "80")->endpoint() << std::endl; // fine

    // throws "thread: not supported":
    r.async_resolve("127.0.0.1", "80", [](auto...) {});
}

Prints

127.0.0.1:80
terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
what(): thread: Operation not supported [system:95]
bash: line 7: 25771 Aborted (core dumped) ./a.out

How to speed up a http c server on larger payload?

As a side note, your code does nothing related to HTTP, so it's not really an "HTTP server", but that's not what the question is about.

Your question is about why the performance is slow, and the answer is simple: the call to write() is blocking. It doesn't merely "hand in the memory address of the response to the network" - it waits until the kernel can buffer the data, and once the socket's send buffer fills up, that means waiting for the network to drain it. So the way this code is written, you're really just processing one request at a time. No wonder your requests-per-second drop as the payload increases.

What you need is "asynchronous" (also known as "non-blocking") processing - your reads and writes should return immediately instead of waiting until the data is delivered. This way you can service multiple sockets in parallel, even without multiple threads. The downside is that juggling them all becomes pretty complicated, but if you do it right, you'll saturate your CPU and/or network to its fullest.
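The two key ingredients, in a hedged sketch (POSIX; the helper names are mine): put the socket in non-blocking mode, and treat a short or refused write as "try again when the socket is writable".

#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

void make_nonblocking(int fd) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
}

// Returns how many bytes the kernel accepted; 0 means "send buffer full,
// retry when poll()/epoll reports the socket writable"; -1 is a real error.
// The caller keeps the unsent remainder buffered until then.
ssize_t try_write(int fd, const char* buf, size_t len) {
    ssize_t n = write(fd, buf, len);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return n;
}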

The details are pretty lengthy and I won't repeat them here. Googling for "linux non blocking socket example" brings up many good results, and the famous Beej's Guide to Network Programming covers all the important points in a very nice way. Read that!

Can Boost ASIO be used to build low-latency applications?

I evaluated Boost Asio for use in high frequency trading a few years ago. To the best of my knowledge the basics are still the same today. Here are some reasons why I decided not to use it:

  1. Asio relies on bind() style callbacks. There is some overhead here.
  2. It is not obvious how to arrange certain low-level operations to occur at the right moment or in the right way.
  3. There is rather a lot of complex code in an area which is important to optimize. It is harder to optimize complex, general code for specific use cases. Thinking that you will not need to look under the covers would be a mistake.
  4. There is little to no need for portability in HFT applications. In particular, having "automatic" selection of a multiplexing mechanism is contrary to the mission, because each mechanism must be tested and optimized separately--this creates more work rather than reducing it.
  5. If a third-party library is to be used, others such as libev, libevent, and libuv are more battle-hardened and avoid some of these downsides.

Related: C++ Socket Server - Unable to saturate CPU


