Which Std::Async Implementations Use Thread Pools

C++ Which thread pool is cppreference.com talking about?

cppreference and the C++ standard are in fact at odds about this. cppreference says this (emphasis and strikethrough mine):

The template function async runs the function f asynchronously (~~potentially~~ optionally in a separate thread which may be part of a thread pool).

Whereas the C++ standard says this:

If launch::async is set in policy, [std::async] calls [the function f] as if in a new thread of execution ...

And these are clearly two different things.

Only Windows' implementation of std::async uses a thread pool AFAIK, while gcc and clang start a new thread for every invocation of std::async (when launch::async is set in policy), and thus follow the standard.

More analysis here: https://stackoverflow.com/a/50898570/5743288

Does async(launch::async) in C++11 make thread pools obsolete for avoiding expensive thread creation?

Question 1:

I changed this from the original because the original was wrong. I was under the impression that Linux thread creation was very cheap and after testing I determined that the overhead of function call in a new thread vs. a normal one is enormous. The overhead for creating a thread to handle a function call is something like 10000 or more times slower than a plain function call. So, if you're issuing a lot of small function calls, a thread pool might be a good idea.

It's quite apparent that the standard C++ library that ships with g++ doesn't have thread pools. But I can definitely see a case for them. Even with the overhead of having to shove the call through some kind of inter-thread queue, it would likely be cheaper than starting up a new thread. And the standard allows this.

IMHO, the Linux kernel people should work on making thread creation cheaper than it currently is. But, the standard C++ library should also consider using pool to implement launch::async | launch::deferred.

And the OP is correct, using ::std::thread to launch a thread of course forces the creation of a new thread instead of using one from a pool. So ::std::async(::std::launch::async, ...) is preferred.

Question 2:

Yes, basically this 'implicitly' launches a thread. But really, it's still quite obvious what's happening. So I don't really think the word implicitly is a particularly good word.

I'm also not convinced that forcing you to wait for a return before destruction is necessarily an error. I don't know that you should be using the async call to create 'daemon' threads that aren't expected to return. And if they are expected to return, it's not OK to be ignoring exceptions.

Question 3:

Personally, I like thread launches to be explicit. I place a lot of value on islands where you can guarantee serial access. Otherwise you end up with mutable state that you always have to be wrapping a mutex around somewhere and remembering to use it.

I liked the work queue model a whole lot better than the 'future' model because there are 'islands of serial' lying around so you can more effectively handle mutable state.

But really, it depends on exactly what you're doing.

Performance Test

So, I tested the performance of various methods of calling things and came up with these numbers on an 8 core (AMD Ryzen 7 2700X) system running Fedora 29 compiled with clang version 7.0.1 and libc++ (not libstdc++):

   Do nothing calls per second:   35365257                                      
        Empty calls per second:   35210682                                      
   New thread calls per second:      62356                                      
 Async launch calls per second:      68869                                      
Worker thread calls per second:     970415

And native, on my MacBook Pro 15" (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz) with Apple LLVM version 10.0.0 (clang-1000.10.44.4) under OSX 10.13.6, I get this:

   Do nothing calls per second:   22078079
        Empty calls per second:   21847547
   New thread calls per second:      43326
 Async launch calls per second:      58684
Worker thread calls per second:    2053775

For the worker thread, I started up a thread, then used a lockless queue to send requests to another thread and then wait for a "It's done" reply to be sent back.

The "Do nothing" is just to test the overhead of the test harness.

It's clear that the overhead of launching a thread is enormous. And even the worker thread with the inter-thread queue slows things down by a factor of 20 or so on Fedora 25 in a VM, and by about 8 on native OS X.

I created an OSDN chamber holding the code I used for the performance test. It can be found here: https://osdn.net/users/omnifarious/pf/launch_thread_performance/

async uses a threadpool?

According to the standard std::async cannot use a thread pool because of the requirements on thread local storage. However in practice MSVC does use a thread pool as it's implementation is built on top of PPL and they simply ignore the requirements on thread local storage. Other implementation will launch a new thread for every call to std::async as the language requires.

As ever Bartosz has an excellent blog post on this subject.

Why would concurrency using std::async be faster than using std::thread?

My original interpretation was incorrect. Refer to @OznOg's answer below.

Modified Answer:

I created a simple benchmark that uses std::async and std::thread to do some tiny tasks:

#include <thread>
#include <chrono>
#include <vector>
#include <future>
#include <iostream>

__thread volatile int you_shall_not_optimize_this;

void work() {
    // This is the simplest way I can think of to prevent the compiler and
    // operating system from doing naughty things
    you_shall_not_optimize_this = 42;
}

[[gnu::noinline]]
std::chrono::nanoseconds benchmark_threads(size_t count) {
    std::vector<std::optional<std::thread>> threads;
    threads.resize(count);

    auto before = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < count; ++i)
        threads[i] = std::thread { work };

    for (size_t i = 0; i < count; ++i)
        threads[i]->join();

    threads.clear();

    auto after = std::chrono::high_resolution_clock::now();

    return after - before;
}

[[gnu::noinline]]
std::chrono::nanoseconds benchmark_async(size_t count, std::launch policy) {
    std::vector<std::optional<std::future<void>>> results;
    results.resize(count);

    auto before = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < count; ++i)
        results[i] = std::async(policy, work);

    for (size_t i = 0; i < count; ++i)
        results[i]->wait();

    results.clear();

    auto after = std::chrono::high_resolution_clock::now();

    return after - before;
}

std::ostream& operator<<(std::ostream& stream, std::launch value)
{
    if (value == std::launch::async)
        return stream << "std::launch::async";
    else if (value == std::launch::deferred)
        return stream << "std::launch::deferred";
    else
        return stream << "std::launch::unknown";
}

// #define CONFIG_THREADS true
// #define CONFIG_ITERATIONS 10000
// #define CONFIG_POLICY std::launch::async

int main() {
    std::cout << "Running benchmark:\n"
              << "  threads?     " << std::boolalpha << CONFIG_THREADS << '\n'
              << "  iterations   " << CONFIG_ITERATIONS << '\n'
              << "  async policy " << CONFIG_POLICY << std::endl;

    std::chrono::nanoseconds duration;
    if (CONFIG_THREADS) {
        duration = benchmark_threads(CONFIG_ITERATIONS);
    } else {
        duration = benchmark_async(CONFIG_ITERATIONS, CONFIG_POLICY);
    }

    std::cout << "Completed in " << duration.count() << "ns (" << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() << "ms)\n";
}

I've run the benchmark as follows:

$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
  threads?     false
  iterations   10000
  async policy std::launch::deferred
Completed in 4783327ns (4ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
  threads?     false
  iterations   10000
  async policy std::launch::async
Completed in 301756775ns (301ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
  threads?     true
  iterations   10000
  async policy std::launch::deferred
Completed in 291284997ns (291ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
  threads?     true
  iterations   10000
  async policy std::launch::async
Completed in 293539858ns (293ms)

I re-ran all the benchmarks with strace attached and accumulated the system calls made:

# std::async with std::launch::async
      1 access
      2 arch_prctl
     36 brk
  10000 clone
      6 close
      1 execve
      1 exit_group
  10002 futex
  10028 mmap
  10009 mprotect
   9998 munmap
      7 newfstatat
      6 openat
      7 pread64
      1 prlimit64
      5 read
      2 rt_sigaction
  20001 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      5 write

# std::async with std::launch::deferred
      1 access
      2 arch_prctl
     11 brk
      6 close
      1 execve
      1 exit_group
  10002 futex
     28 mmap
      9 mprotect
      2 munmap
      7 newfstatat
      6 openat
      7 pread64
      1 prlimit64
      5 read
      2 rt_sigaction
      1 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      5 write

# std::thread with std::launch::async
      1 access
      2 arch_prctl
     27 brk
  10000 clone
      6 close
      1 execve
      1 exit_group
      2 futex
  10028 mmap
  10009 mprotect
   9998 munmap
      7 newfstatat
      6 openat
      7 pread64
      1 prlimit64
      5 read
      2 rt_sigaction
  20001 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      5 write

# std::thread with std::launch::deferred
      1 access
      2 arch_prctl
     27 brk
  10000 clone
      6 close
      1 execve
      1 exit_group
      2 futex
  10028 mmap
  10009 mprotect
   9998 munmap
      7 newfstatat
      6 openat
      7 pread64
      1 prlimit64
      5 read
      2 rt_sigaction
  20001 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      5 write

We observe that std::async is significantly faster with std::launch::deferred but that everything else doesn't seem to matter as much.

My conclusions are:

The current libstdc++ implementation does not take advantage of the fact that std::async doesn't need a new thread for each task.
The current libstdc++ implementation does some sort of locking in std::async that std::thread doesn't do.
std::async with std::launch::deferred saves setup and destroy costs and is much faster for this case.

My machine is configured as follows:

$ uname -a
Linux linux-2 5.12.1-arch1-1 #1 SMP PREEMPT Sun, 02 May 2021 12:43:58 +0000 x86_64 GNU/Linux
$ g++ --version
g++ (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lscpu # truncated
Architecture:                    x86_64
Byte Order:                      Little Endian
CPU(s):                          8
Model name:                      Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

Original Answer:

std::thread is a wrapper for thread objects which are provided by the operating system, they are extremely expensive to create and destroy.

std::async is similar, but there isn't a 1-to-1 mapping between tasks and operating system threads. This could be implemented with thread pools, where threads are reused for multiple tasks.

So std::async is better if you have many small tasks, and std::thread is better if you have a few tasks that are running for long periods of time.

Also if you have things that truly need to happen in parallel, then std::async might not fit very well. (std::thread also can't make such guarantees, but that's the closest you can get.)

Maybe to clarify, in your case std::async saves the overhead from creating and destroying threads.

(Depending on the operating system, you could also lose performance simply by having a lot of threads running. An operating system might have a scheduling strategy where it tries to guarantee that every thread gets executed every so often, thus the scheduler could decide go give the individual threads smaller slices of processing time, thus creating more overhead for switching between threads.)

Does C++ async use a thread pool when building for macOS with Xcode?

An application built with the standard toolchain for the platform (Xcode/Clang) does not use a thread pool. The base of the stack of a task launched with std::async contains std::thread and pthread calls.

Stack trace showing async job running on a dedicated OS thread

On exit, each job calls pthread_exit() to kill the thread running it.

Sample Image

Xcode 8.3.3 also uses an OS thread per job launched with std::async when building for iOS (Tested on original iPad Pro 9.7" target).