Why Should I Use std::async?

Why should I use std::async?

If you need the result of an asynchronous operation, then you have to block, no matter what library you use. The idea is that you get to choose when to block, and, hopefully, when you do, you block for a negligible time because all the work has already been done.

Note also that std::async can be launched with the policies std::launch::async or std::launch::deferred. If you don't specify one, the implementation is allowed to choose, and it could well choose deferred evaluation, in which case all the work is done when you attempt to get the result from the future, resulting in a longer block. So if you want to make sure the work is done asynchronously, use std::launch::async.

When to use std::async vs std::thread?

It's not really an either-or choice: you can use futures (together with promises) with manually created std::thread objects. Using std::async is a convenient way to fire off a thread for some asynchronous computation and marshal the result back via a future, but std::async is rather limited in the current standard. It will become more useful if the suggested extensions to incorporate some of the ideas from Microsoft's PPL are accepted.

Currently, std::async is probably best suited to handling either very long-running computations or long-running IO for fairly simple programs. It doesn't guarantee low overhead, though (and in fact the way it is specified makes it difficult to implement with a thread pool behind the scenes), so it's not well suited for finer-grained workloads. For those you either need to roll your own thread pool using std::thread or use something like Microsoft's PPL or Intel's TBB.

You can also use std::thread for 'traditional' POSIX thread style code written in a more modern and portable way.

Bartosz Milewski discusses some of the limitations of the way std::async is currently specified in his article "Async Tasks in C++11: Not Quite There Yet".

Scope of blocking when using std::async in a function other than main

This question is answered in:

main thread waits for std::async to complete

Can I use std::async without waiting for the future limitation?

However, if you store the std::future object, its lifetime extends to the end of main, and you get the behavior you want.

void printData()
{
    for (size_t i = 0; i < 5; i++)
    {
        std::cout << "Test Function" << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}

std::future<void> runningAsync()
{
    return std::async(std::launch::async, printData);
}

int main()
{
    auto a = runningAsync();

    std::cout << "Main Function" << std::endl;
}

That matters because std::future's destructor may block and wait for the thread to finish.

Why would concurrency using std::async be faster than using std::thread?

My original interpretation was incorrect. Refer to @OznOg's answer below.

Modified Answer:

I created a simple benchmark that uses std::async and std::thread to do some tiny tasks:

#include <chrono>
#include <future>
#include <iostream>
#include <optional>
#include <thread>
#include <vector>

__thread volatile int you_shall_not_optimize_this;

void work() {
    // This is the simplest way I can think of to prevent the compiler and
    // operating system from doing naughty things
    you_shall_not_optimize_this = 42;
}

[[gnu::noinline]]
std::chrono::nanoseconds benchmark_threads(size_t count) {
    std::vector<std::optional<std::thread>> threads;
    threads.resize(count);

    auto before = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < count; ++i)
        threads[i] = std::thread { work };

    for (size_t i = 0; i < count; ++i)
        threads[i]->join();

    threads.clear();

    auto after = std::chrono::high_resolution_clock::now();

    return after - before;
}

[[gnu::noinline]]
std::chrono::nanoseconds benchmark_async(size_t count, std::launch policy) {
    std::vector<std::optional<std::future<void>>> results;
    results.resize(count);

    auto before = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < count; ++i)
        results[i] = std::async(policy, work);

    for (size_t i = 0; i < count; ++i)
        results[i]->wait();

    results.clear();

    auto after = std::chrono::high_resolution_clock::now();

    return after - before;
}

std::ostream& operator<<(std::ostream& stream, std::launch value)
{
    if (value == std::launch::async)
        return stream << "std::launch::async";
    else if (value == std::launch::deferred)
        return stream << "std::launch::deferred";
    else
        return stream << "std::launch::unknown";
}

// #define CONFIG_THREADS true
// #define CONFIG_ITERATIONS 10000
// #define CONFIG_POLICY std::launch::async

int main() {
    std::cout << "Running benchmark:\n"
              << "  threads?     " << std::boolalpha << CONFIG_THREADS << '\n'
              << "  iterations   " << CONFIG_ITERATIONS << '\n'
              << "  async policy " << CONFIG_POLICY << std::endl;

    std::chrono::nanoseconds duration;
    if (CONFIG_THREADS) {
        duration = benchmark_threads(CONFIG_ITERATIONS);
    } else {
        duration = benchmark_async(CONFIG_ITERATIONS, CONFIG_POLICY);
    }

    std::cout << "Completed in " << duration.count() << "ns ("
              << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count()
              << "ms)\n";
}

I've run the benchmark as follows:

$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::deferred
Completed in 4783327ns (4ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::async
Completed in 301756775ns (301ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::deferred
Completed in 291284997ns (291ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::async
Completed in 293539858ns (293ms)

I re-ran all the benchmarks with strace attached and accumulated the system calls made:

# std::async with std::launch::async
1 access
2 arch_prctl
36 brk
10000 clone
6 close
1 execve
1 exit_group
10002 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write

# std::async with std::launch::deferred
1 access
2 arch_prctl
11 brk
6 close
1 execve
1 exit_group
10002 futex
28 mmap
9 mprotect
2 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
1 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write

# std::thread with std::launch::async
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write

# std::thread with std::launch::deferred
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write

We observe that std::async is significantly faster with std::launch::deferred, but that everything else doesn't seem to matter as much.

My conclusions are:

  • The current libstdc++ implementation does not take advantage of the fact that std::async doesn't need a new thread for each task.

  • The current libstdc++ implementation does some sort of locking in std::async that std::thread doesn't do.

  • std::async with std::launch::deferred saves setup and destroy costs and is much faster for this case.

My machine is configured as follows:

$ uname -a
Linux linux-2 5.12.1-arch1-1 #1 SMP PREEMPT Sun, 02 May 2021 12:43:58 +0000 x86_64 GNU/Linux
$ g++ --version
g++ (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lscpu # truncated
Architecture: x86_64
Byte Order: Little Endian
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

Original Answer:

std::thread is a wrapper around thread objects provided by the operating system; these are extremely expensive to create and destroy.

std::async is similar, but there doesn't have to be a 1-to-1 mapping between tasks and operating-system threads. It could be implemented with a thread pool, where threads are reused for multiple tasks.

So std::async is better if you have many small tasks, and std::thread is better if you have a few tasks that are running for long periods of time.

Also if you have things that truly need to happen in parallel, then std::async might not fit very well. (std::thread also can't make such guarantees, but that's the closest you can get.)


Maybe to clarify: in your case, std::async saves the overhead of creating and destroying threads.

(Depending on the operating system, you could also lose performance simply by having a lot of threads running. An operating system might have a scheduling strategy where it tries to guarantee that every thread gets executed every so often; the scheduler could then decide to give the individual threads smaller slices of processing time, creating more overhead for switching between threads.)

How to use std::async efficiently to perform operations on pointer array

The fundamental problem is that your code wants to launch more than array size / 100 threads. That means more than 100 threads. 100 threads won't do anything good; they'll thrash. See std::thread::hardware_concurrency, and in general don't use raw async or thread in production applications; write task pools and splice together futures and the like.

That many threads is both extremely inefficient and could exhaust system resources.

The second problem is that you failed to calculate the average of two values correctly.

The average of begIdx and endIdx is not endIdx/2 but rather:

int midIdx = begIdx + (endIdx-begIdx) / 2;


You'll notice I discovered the problem with your program by adding intermediate output. In particular, I had it print out the ranges it was working on, and I noticed it was repeating ranges. This is known as "printf debugging", and it is quite powerful, especially when step-based debugging isn't practical (with this many threads, stepping through the code would be mind-numbing).

How to use std::future and std::async with threading in a for loop with a shared resource as a param?

First, define what information each thread shall provide.

In this case - it is probably something like this:

struct Result
{
    error_codes::NPPCoreErrorCode error;
    float power_generated;
};

So your future type is std::future<Result>.

Then start your work in many threads:

std::vector<std::future<Result>> results;
for (const auto& core_it : reactor_cores_)
{
    auto action = [&]{
        Result res;
        res.error = core_it.GenerateEnergy(energy_required_per_core, res.power_generated);
        return res;
    };
    // start thread
    results.emplace_back(std::async(std::launch::async, action));
}

Then wait for each thread to finish:

  for (auto& f : results) f.wait();

Then, I guess, you want to sum up:

for (auto& f : results) {
    Result res = f.get();
    if (res.error == error_codes::success)
        power_generated += res.power_generated;
    else {
        power_plant_error_code = res.error;
        // depending on your error strategy, you might break here
        break;
    }
}


How to run lines of code asynchronously in C++

I haven't quite understood what exactly you want to do, but I think you should read more about std::async.

#include <iostream>
#include <future>

void asyncFunction()
{
    std::cout << "I am inside async function\n";
}

int main()
{
    std::future<void> fn = std::async(std::launch::async, asyncFunction);
    // here some other main thread operations
    return 0;
}

A function that is run asynchronously can also return a value, which can be accessed through the future with the blocking std::future::get method.
