In Visual Studio, `thread_local` Variables' Destructor Not Called When Used with std::async, Is This a Bug

In Visual Studio, `thread_local` variables' destructor not called when used with std::async, is this a bug?

Introductory Note: I have now learned a lot more about this and have therefore re-written my answer. Thanks to @super, @M.M and (latterly) @DavidHaim and @NoSenseEtAl for putting me on the right track.

tl;dr Microsoft's implementation of std::async is non-conformant, but they have their reasons and what they have done can actually be useful, once you understand it properly.

For those who don't want that, it is not too difficult to code up a drop-in replacement for std::async which works the same way on all platforms. I have posted one here.

Edit: Wow, how open MS are being these days, I like it, see: https://github.com/MicrosoftDocs/cpp-docs/issues/308

Let's begin at the beginning. cppreference has this to say (emphasis and strikethrough mine):

The template function async runs the function f asynchronously (~~potentially~~ optionally in a separate thread which may be part of a thread pool).

However, the C++ standard says this:

If launch::async is set in policy, [std::async] calls [the function f] as if in a new thread of execution ...

So which is correct? The two statements have very different semantics as the OP has discovered. Well of course the standard is correct, as both clang and gcc show, so why does the Windows implementation differ? And like so many things, it comes down to history.

The (oldish) link that M.M dredged up has this to say, amongst other things:

... Microsoft has its implementation of [std::async] in the form of PPL (Parallel Pattern Library) ... [and] I can understand the eagerness of those companies to bend the rules and make these libraries accessible through std::async, especially if they can dramatically improve performance...

... Microsoft wanted to change the semantics of std::async when called with launch_policy::async. I think this was pretty much ruled out in the ensuing discussion ... (rationale follows, if you want to know more then read the link, it's well worth it).

And PPL is based on Windows' built-in support for ThreadPools, so @super was right.

So what does the Windows thread pool do and what is it good for? Well, it's intended to manage frequently-scheduled, short-running tasks in an efficient way, so point 1 is: don't abuse it. But my simple tests show that if this is your use case, then it can offer significant efficiencies. It does, essentially, two things:

It recycles threads, rather than having to always start a new one for each asynchronous task you launch.
It limits the total number of background threads it uses, after which a call to std::async will block until a thread becomes free. On my machine, this number is 768.

So knowing all that, we can now explain the OP's observations:

A new thread is created for each of the three tasks started by main() (because none of them terminates immediately).
Each of these three threads creates a new thread-local variable Foo some_thread_var.
These three tasks all run to completion but the threads they are running on remain in existence (sleeping).
The program then sleeps for a short while and then exits, leaving the 3 thread-local variables un-destructed.
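The scenario above can be approximated with a small experiment. This is only a sketch: `Foo` and `some_thread_var` follow the names used in the question, while `foo_constructed`, `foo_destroyed`, `use_thread_var` and `run_three_tasks` are hypothetical helpers introduced here. It counts constructions and destructions of the thread-local object across three `std::async` tasks, then waits briefly, as the original program did.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical counters, used only to observe what happens.
std::atomic<int> foo_constructed{0};
std::atomic<int> foo_destroyed{0};

struct Foo {
    Foo() { ++foo_constructed; }
    ~Foo() { ++foo_destroyed; }
    void touch() {}
};

thread_local Foo some_thread_var;

// Touching the variable forces it to exist on the thread running the task.
void use_thread_var() { some_thread_var.touch(); }

// Launches three tasks, waits for them, then gives the threads a short
// grace period to exit. Returns {constructions, destructions}.
std::pair<int, int> run_three_tasks() {
    std::vector<std::future<void>> futures;
    for (int i = 0; i < 3; ++i)
        futures.push_back(std::async(std::launch::async, use_thread_var));
    for (auto& f : futures) f.get();

    // Wait up to about two seconds for the destructors to run.
    for (int i = 0; i < 200 && foo_destroyed.load() < 3; ++i)
        std::this_thread::sleep_for(std::chrono::milliseconds(10));

    return {foo_constructed.load(), foo_destroyed.load()};
}
```

On an implementation that starts a fresh thread per task, I would expect both counts to match once the tasks have finished. On a pool-backed implementation such as the one described here, the destruction count may well lag behind for as long as the pool keeps its threads alive.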

I ran a number of tests and in addition to this I found a few key things:

When a thread is recycled, the thread-local variables are re-used. Specifically, they are not destroyed and then re-created (you have been warned!).
If all the asynchronous tasks complete and you wait long enough, the thread pool terminates all the associated threads and the thread-local variables are then destroyed. (No doubt the actual rules are more complex than that, but that's what I observed.)
As new asynchronous tasks are submitted, the thread pool limits the rate at which new threads are created, in the hope that one will become free before it needs to perform all that work (creating new threads is expensive). A call to std::async might therefore take a while to return (up to 300ms in my tests). In the meantime, it's just hanging around, hoping that its ship will come in. This behaviour is documented but I call it out here in case it takes you by surprise.
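The first observation (thread-locals surviving from one task to the next on a recycled thread) can be illustrated without a real thread pool. The sketch below is an assumption-laden stand-in: a single `std::thread` plays the role of one recycled pool thread, and `Scratch`, `scratch`, `job` and `run_on_one_worker` are hypothetical names, not part of any Windows API.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical per-thread state that each job touches.
struct Scratch {
    int uses = 0;
};

thread_local Scratch scratch;

// Each job reports how many times this thread's Scratch has been used.
int job() { return ++scratch.uses; }

// Runs `jobs` jobs back to back on ONE worker thread, the way a pool
// would when it recycles a thread, and records what each job saw.
std::vector<int> run_on_one_worker(int jobs) {
    std::vector<int> seen;
    std::thread worker([&seen, jobs] {
        for (int i = 0; i < jobs; ++i)
            seen.push_back(job());
    });
    worker.join();
    return seen;
}
```

Later jobs see the state left behind by earlier ones, because the variable belongs to the thread rather than to the task. That is harmless for a cache, but a nasty surprise if each task assumes it starts with a freshly constructed object.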

Conclusions:

Microsoft's implementation of std::async is non-conformant but it is clearly designed with a specific purpose, and that purpose is to make good use of the Win32 ThreadPool API. You can beat them up for blatantly flouting the standard but it's been this way for a long time and they probably have (important!) customers who rely on it. I will ask them to call this out in their documentation. Not doing that is criminal.
It is not safe to use thread_local variables in std::async tasks on Windows. Just don't do it, it will end in tears.
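One way to sidestep the problem, sketched here under the assumption that the state only needs to live for the duration of a task, is to make it an ordinary local object inside the task instead of a `thread_local`. All names below (`TaskContext`, `live_contexts`, `sum_with_local_state`, `run_task`) are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <future>

// Hypothetical counter of TaskContext objects currently alive.
std::atomic<int> live_contexts{0};

struct TaskContext {
    TaskContext() { ++live_contexts; }
    ~TaskContext() { --live_contexts; }
    int add(int x) { total += x; return total; }
    int total = 0;
};

// The context is a plain local: it is constructed when the task starts
// and destroyed when the task returns, whichever thread runs it.
int sum_with_local_state(int n) {
    TaskContext ctx;
    int result = 0;
    for (int i = 1; i <= n; ++i)
        result = ctx.add(i);
    return result;
}

// Runs the task through std::async and waits for its result.
int run_task(int n) {
    auto fut = std::async(std::launch::async, sum_with_local_state, n);
    return fut.get();
}
```

Because the object's lifetime is tied to the function call rather than to the thread, it should not matter whether the implementation starts a new thread or recycles one from a pool.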

possible std::async implementation bug Windows

New day, better answer (much better). Read on.

I spent some time investigating the behaviour of std::async on Windows and you're right. It's a different animal, see here.

So, if your code relies on std::async always starting a new thread of execution and returning immediately then you can't use it. Not on Windows, anyway. On my machine, the limit seems to be 768 background threads, which would fit in, more or less, with what you have observed.

Anyway, I wanted to learn a bit more about modern C++ so I had a crack at rolling my own replacement for std::async that can be used on Windows with the semantics desired by the OP. I therefore humbly present the following:

AsyncTask: drop-in replacement for std::async

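#include <future>
#include <thread>
#include <type_traits>
#include <utility>

// Runs f (args...) on a brand-new detached thread and returns a future for the result.
template <class Func, class... Args>
    std::future <std::result_of_t <std::decay_t <Func> (std::decay_t <Args>...)>>
        AsyncTask (Func&& f, Args&&... args)
{
    using decay_func = std::decay_t <Func>;
    using return_type = std::result_of_t <decay_func (std::decay_t <Args>...)>;

    // The task receives decayed copies of the callable and its arguments.
    std::packaged_task <return_type (decay_func f, std::decay_t <Args>... args)>
        task ([] (decay_func f, std::decay_t <Args>... args)
    {
        return f (args...);
    });

    auto task_future = task.get_future ();
    // Forward f as well as args so that rvalue callables are moved rather than copied.
    std::thread t (std::move (task), std::forward <Func> (f), std::forward <Args> (args)...);
    t.detach ();
    return task_future;
}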

Test program

#include <iostream>
#include <string>

int add_two_integers (int a, int b)
{
    return a + b;
}

std::string append_to_string (const std::string& s)
{
    return s + " addendum";
}

int main ()
{
    auto /* i.e. std::future <int> */ f1 = AsyncTask (add_two_integers , 1, 2);
    auto /* i.e. int */  i = f1.get ();
    std::cout << "add_two_integers : " << i << std::endl;

    auto /* i.e. std::future <std::string> */ f2 = AsyncTask (append_to_string , "Hello world");
    auto /* i.e. std::string */ s = f2.get ();
    std::cout << "append_to_string : " << s << std::endl;
    return 0;
}

Output

add_two_integers : 3
append_to_string : Hello world addendum

Live demo here (gcc) and here (clang).

I learnt a lot from writing this and it was a lot of fun. I'm fairly new to this stuff, so all comments welcome. I'll be happy to update this post if I've got anything wrong.
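A note for newer toolchains: std::result_of_t is deprecated in C++17 and removed in C++20. A variant of the same idea built on std::invoke_result_t and std::invoke might look like the sketch below. The name `AsyncTask17` is hypothetical, and this is one possible formulation rather than a tested replacement for the code above.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <thread>
#include <type_traits>
#include <utility>

// Hypothetical C++17 variant: runs f(args...) on a new detached thread.
template <class Func, class... Args>
std::future<std::invoke_result_t<std::decay_t<Func>, std::decay_t<Args>...>>
AsyncTask17(Func&& f, Args&&... args)
{
    using return_type =
        std::invoke_result_t<std::decay_t<Func>, std::decay_t<Args>...>;

    // The task owns decayed copies of the callable and its arguments.
    std::packaged_task<return_type(std::decay_t<Func>, std::decay_t<Args>...)> task(
        [](std::decay_t<Func> fn, std::decay_t<Args>... fn_args) {
            return std::invoke(std::move(fn), std::move(fn_args)...);
        });

    auto result = task.get_future();
    std::thread(std::move(task),
                std::forward<Func>(f),
                std::forward<Args>(args)...).detach();
    return result;
}
```

Usage is the same as before, for example `AsyncTask17(add_two_integers, 1, 2).get()`. As with the original, the thread is detached, so the returned future's destructor does not block the way the one returned by std::async does.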

Is std::packaged_task<T> (Function Template) an std::async (FT) with an invoked function?

std::async(policy, callable, args...)

launches a new thread (if the resources are available) if the policy is std::launch::async.

If no policy is given (the default is std::launch::async | std::launch::deferred), the implementation decides whether or not to launch one.

If the policy is std::launch::deferred, it won't launch one.

std::packaged_task, by contrast, wraps your callable so that it can be invoked asynchronously, for example on a new thread:

auto t1 = std::thread(std::move(taskObj), args...);
....
t1.join();

But if you use it as you do in your example, it won't launch a new thread. It doesn't launch a new thread by itself, but it can be used to do that.
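To make the difference concrete, here is a small sketch that records which thread actually runs the callable in each case. The names `current_id`, `WhereItRan` and `probe` are hypothetical helpers introduced for this illustration.

```cpp
#include <cassert>
#include <future>
#include <thread>
#include <utility>

// The callable under test: reports the thread it is running on.
std::thread::id current_id() { return std::this_thread::get_id(); }

// Each flag is true if the callable ran on the thread that called probe().
struct WhereItRan {
    bool deferred_on_caller;
    bool async_on_caller;
    bool packaged_direct_on_caller;
    bool packaged_thread_on_caller;
};

WhereItRan probe() {
    const auto caller = std::this_thread::get_id();

    // Deferred: nothing runs until get(), and then it runs on the caller.
    auto deferred = std::async(std::launch::deferred, current_id);
    // Async: runs as if on a new thread of execution.
    auto asynced = std::async(std::launch::async, current_id);

    // A packaged_task invoked directly is just a function call.
    std::packaged_task<std::thread::id()> direct(current_id);
    auto direct_future = direct.get_future();
    direct();

    // The same kind of task handed to std::thread runs on that thread.
    std::packaged_task<std::thread::id()> threaded(current_id);
    auto threaded_future = threaded.get_future();
    std::thread t(std::move(threaded));
    t.join();

    return {deferred.get() == caller,
            asynced.get() == caller,
            direct_future.get() == caller,
            threaded_future.get() == caller};
}
```

The deferred task and the directly invoked packaged_task should both report the caller's thread, while the other two should not.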

C++ Which thread pool is cppreference.com talking about?

cppreference and the C++ standard are in fact at odds about this. cppreference says this (emphasis and strikethrough mine):

The template function async runs the function f asynchronously (~~potentially~~ optionally in a separate thread which may be part of a thread pool).

Whereas the C++ standard says this:

If launch::async is set in policy, [std::async] calls [the function f] as if in a new thread of execution ...

And these are clearly two different things.

Only Windows' implementation of std::async uses a thread pool AFAIK, while gcc and clang start a new thread for every invocation of std::async (when launch::async is set in policy), and thus follow the standard.
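A quick way to see which behaviour a given implementation has is to check whether a thread-local variable starts afresh for each task. The following is a sketch only; `calls_on_this_thread`, `bump` and `bump_in_async_tasks` are hypothetical names.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Counts how many tasks have run on the current thread.
thread_local int calls_on_this_thread = 0;

int bump() { return ++calls_on_this_thread; }

// Runs n tasks one after another with launch::async and records the
// counter value each task observed.
std::vector<int> bump_in_async_tasks(int n) {
    std::vector<int> seen;
    for (int i = 0; i < n; ++i)
        seen.push_back(std::async(std::launch::async, bump).get());
    return seen;
}
```

If every task runs "as if in a new thread of execution", each one should observe the value 1. On a pool-backed implementation that recycles threads, values greater than 1 may show up instead.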

More analysis here: https://stackoverflow.com/a/50898570/5743288

C++ multi threading: how do I reuse threads for many jobs?

Thanks for all the comments. I ended up incorporating Ted Lyngmo's example which improved performance from 80ms to 7ms per frame using all my cores.

The task struct:

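#include <cstddef>  // std::size_t, std::max_align_t
#include <new>      // defines __cpp_lib_hardware_interference_size when supported

#ifdef __cpp_lib_hardware_interference_size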
  using std::hardware_constructive_interference_size;
  using std::hardware_destructive_interference_size;
#else
  constexpr std::size_t hardware_constructive_interference_size = 2 * sizeof(std::max_align_t);
  constexpr std::size_t hardware_destructive_interference_size = 2 * sizeof(std::max_align_t);
#endif

struct alignas(hardware_destructive_interference_size) raytrace_task
{
    // ctor omitted

    void operator()()
    {
        // raytrace one screen-chunk here
    }
};

and the code triggering the raytracing each frame:

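#include <algorithm>  // std::for_each
#include <execution>
#include <thread>     // std::thread::hardware_concurrency
#include <vector>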

// ...

void trace()
{
    const auto thread_count = std::thread::hardware_concurrency();

    // generate render-chunks into multiple raytrace_tasks:
    std::vector<raytrace_task> tasks;
    for (auto i = 0u; i < thread_count; i++)
    {
        tasks.push_back(raytrace_task(world_, i, thread_count, camera_, screen_));
    }

    // run the raytrace_tasks:
    std::for_each(std::execution::par, tasks.begin(), tasks.end(), [](auto& task) { task(); });
}

Note: I also had to set Visual Studio to compile in C++17 (project properties > C/C++ > Language)
