C++ Which thread pool is cppreference.com talking about?
cppreference and the C++ standard are in fact at odds about this. cppreference says this (emphasis and strikethrough mine):
The template function async runs the function f asynchronously (~~potentially~~ optionally in a separate thread which may be part of a thread pool).
Whereas the C++ standard says this:
If launch::async is set in policy, [std::async] calls [the function f] as if in a new thread of execution ...
And these are clearly two different things.
Only Windows' implementation of std::async uses a thread pool, AFAIK, while gcc and clang start a new thread for every invocation of std::async (when launch::async is set in policy), and thus follow the standard.
More analysis here: https://stackoverflow.com/a/50898570/5743288
Does async(launch::async) in C++11 make thread pools obsolete for avoiding expensive thread creation?
Question 1:
I changed this from the original because the original was wrong. I was under the impression that Linux thread creation was very cheap, but after testing I found that the overhead of a function call in a new thread versus a plain function call is enormous: creating a thread to handle a function call is something like 10,000 or more times slower than a plain call. So, if you're issuing a lot of small function calls, a thread pool might be a good idea.
It's quite apparent that the standard C++ library that ships with g++ doesn't have thread pools. But I can definitely see a case for them. Even with the overhead of having to shove the call through some kind of inter-thread queue, it would likely be cheaper than starting up a new thread. And the standard allows this.
IMHO, the Linux kernel people should work on making thread creation cheaper than it currently is. But the standard C++ library should also consider using a thread pool to implement launch::async | launch::deferred.
And the OP is correct: using ::std::thread to launch a thread of course forces the creation of a new thread instead of using one from a pool. So ::std::async(::std::launch::async, ...) is preferred.
Question 2:
Yes, basically this 'implicitly' launches a thread. But really, it's still quite obvious what's happening, so I don't think 'implicitly' is a particularly good word for it.
I'm also not convinced that forcing you to wait for a return before destruction is necessarily an error. I don't know that you should be using the async call to create 'daemon' threads that aren't expected to return. And if they are expected to return, it's not OK to ignore exceptions.
Question 3:
Personally, I like thread launches to be explicit. I place a lot of value on islands where you can guarantee serial access. Otherwise you end up with mutable state that you always have to wrap a mutex around somewhere and remember to use it.
I liked the work queue model a whole lot better than the 'future' model because there are 'islands of serial' lying around so you can more effectively handle mutable state.
But really, it depends on exactly what you're doing.
Performance Test
So, I tested the performance of various methods of calling things and came up with these numbers on an 8-core (AMD Ryzen 7 2700X) system running Fedora 29, compiled with clang version 7.0.1 and libc++ (not libstdc++):
Do nothing calls per second: 35365257
Empty calls per second: 35210682
New thread calls per second: 62356
Async launch calls per second: 68869
Worker thread calls per second: 970415
And natively, on my MacBook Pro 15" (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz) with Apple LLVM version 10.0.0 (clang-1000.10.44.4) under OSX 10.13.6, I get this:
Do nothing calls per second: 22078079
Empty calls per second: 21847547
New thread calls per second: 43326
Async launch calls per second: 58684
Worker thread calls per second: 2053775
For the worker thread, I started up a thread, then used a lockless queue to send requests to another thread, and then waited for an "It's done" reply to be sent back.
The "Do nothing" is just to test the overhead of the test harness.
It's clear that the overhead of launching a thread is enormous. And even the worker thread with the inter-thread queue slows things down by a factor of 20 or so on Fedora 25 in a VM, and by about 8 on native OS X.
I created an OSDN chamber holding the code I used for the performance test. It can be found here: https://osdn.net/users/omnifarious/pf/launch_thread_performance/
Does std::async use a thread pool?
According to the standard, std::async cannot use a thread pool because of the requirements on thread-local storage. In practice, however, MSVC does use a thread pool: its implementation is built on top of PPL, and it simply ignores the thread-local-storage requirements. Other implementations launch a new thread for every call to std::async, as the language requires.
As ever Bartosz has an excellent blog post on this subject.
Why would concurrency using std::async be faster than using std::thread?
My original interpretation was incorrect. Refer to @OznOg's answer below.
Modified Answer:
I created a simple benchmark that uses std::async and std::thread to do some tiny tasks:
#include <thread>
#include <chrono>
#include <cstddef>
#include <vector>
#include <future>
#include <optional>
#include <iostream>

__thread volatile int you_shall_not_optimize_this;

void work() {
    // This is the simplest way I can think of to prevent the compiler and
    // operating system from doing naughty things
    you_shall_not_optimize_this = 42;
}

[[gnu::noinline]]
std::chrono::nanoseconds benchmark_threads(size_t count) {
    std::vector<std::optional<std::thread>> threads;
    threads.resize(count);

    auto before = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < count; ++i)
        threads[i] = std::thread { work };

    for (size_t i = 0; i < count; ++i)
        threads[i]->join();

    threads.clear();

    auto after = std::chrono::high_resolution_clock::now();
    return after - before;
}

[[gnu::noinline]]
std::chrono::nanoseconds benchmark_async(size_t count, std::launch policy) {
    std::vector<std::optional<std::future<void>>> results;
    results.resize(count);

    auto before = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < count; ++i)
        results[i] = std::async(policy, work);

    for (size_t i = 0; i < count; ++i)
        results[i]->wait();

    results.clear();

    auto after = std::chrono::high_resolution_clock::now();
    return after - before;
}

std::ostream& operator<<(std::ostream& stream, std::launch value)
{
    if (value == std::launch::async)
        return stream << "std::launch::async";
    else if (value == std::launch::deferred)
        return stream << "std::launch::deferred";
    else
        return stream << "std::launch::unknown";
}

// #define CONFIG_THREADS true
// #define CONFIG_ITERATIONS 10000
// #define CONFIG_POLICY std::launch::async

int main() {
    std::cout << "Running benchmark:\n"
              << "  threads?     " << std::boolalpha << CONFIG_THREADS << '\n'
              << "  iterations   " << CONFIG_ITERATIONS << '\n'
              << "  async policy " << CONFIG_POLICY << std::endl;

    std::chrono::nanoseconds duration;
    if (CONFIG_THREADS) {
        duration = benchmark_threads(CONFIG_ITERATIONS);
    } else {
        duration = benchmark_async(CONFIG_ITERATIONS, CONFIG_POLICY);
    }

    std::cout << "Completed in " << duration.count() << "ns ("
              << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count()
              << "ms)\n";
}
I've run the benchmark as follows:
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::deferred
Completed in 4783327ns (4ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::async
Completed in 301756775ns (301ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::deferred
Completed in 291284997ns (291ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::async
Completed in 293539858ns (293ms)
I re-ran all the benchmarks with strace attached and accumulated the system calls made:
# std::async with std::launch::async
1 access
2 arch_prctl
36 brk
10000 clone
6 close
1 execve
1 exit_group
10002 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::async with std::launch::deferred
1 access
2 arch_prctl
11 brk
6 close
1 execve
1 exit_group
10002 futex
28 mmap
9 mprotect
2 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
1 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::thread with std::launch::async
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::thread with std::launch::deferred
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
We observe that std::async is significantly faster with std::launch::deferred, but that everything else doesn't seem to matter as much.

My conclusions are:

- The current libstdc++ implementation does not take advantage of the fact that std::async doesn't need a new thread for each task.
- The current libstdc++ implementation does some sort of locking in std::async that std::thread doesn't do.
- std::async with std::launch::deferred saves setup and destroy costs and is much faster for this case.
My machine is configured as follows:
$ uname -a
Linux linux-2 5.12.1-arch1-1 #1 SMP PREEMPT Sun, 02 May 2021 12:43:58 +0000 x86_64 GNU/Linux
$ g++ --version
g++ (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lscpu # truncated
Architecture: x86_64
Byte Order: Little Endian
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Original Answer:
std::thread is a wrapper for thread objects provided by the operating system; those are extremely expensive to create and destroy.

std::async is similar, but there isn't a 1-to-1 mapping between tasks and operating-system threads. This could be implemented with thread pools, where threads are reused for multiple tasks.
So std::async is better if you have many small tasks, and std::thread is better if you have a few tasks that run for long periods of time.

Also, if you have things that truly need to happen in parallel, then std::async might not fit very well. (std::thread also can't make such guarantees, but that's the closest you can get.)
Maybe to clarify: in your case std::async saves the overhead of creating and destroying threads.
(Depending on the operating system, you could also lose performance simply by having a lot of threads running. An operating system might have a scheduling strategy where it tries to guarantee that every thread gets executed every so often; the scheduler could then decide to give individual threads smaller slices of processing time, creating more overhead for switching between threads.)
Does C++ async use a thread pool when building for macOS with Xcode?
An application built with the standard toolchain for the platform (Xcode/Clang) does not use a thread pool. The base of the stack of a task launched with std::async contains std::thread and pthread calls. On exit, each job calls pthread_exit() to kill the thread running it.
Xcode 8.3.3 also uses an OS thread per job launched with std::async when building for iOS (tested on an original iPad Pro 9.7" target).