Are Function Static Variables Thread-Safe in GCC?

Are function static variables thread-safe in GCC?


  1. No, it means that the initialization of local statics is thread-safe.

  2. You definitely want to leave this feature enabled. Thread-safe initialization of local statics is very important. If you need generally thread-safe access to local statics then you will need to add the appropriate guards yourself.
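As a sketch of what "adding the appropriate guards yourself" can look like (the names shared_data, data_mtx and append are illustrative, not from the original):

```cpp
#include <mutex>
#include <vector>

// Initialization of the local static below is thread-safe ("magic statics"),
// but concurrent *access* to the vector still needs an explicit guard.
std::vector<int>& shared_data() {
    static std::vector<int> v;   // constructed exactly once, safely
    return v;
}

std::mutex data_mtx;             // guards use of the vector, not its construction

void append(int x) {
    std::lock_guard<std::mutex> lock(data_mtx);
    shared_data().push_back(x);  // concurrent callers serialize here
}
```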

Is local static variable initialization thread-safe in C++11?

The relevant section is 6.7 [stmt.dcl]:

such a variable is initialized the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization. [...] If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.

Then there's a footnote:

The implementation must not introduce any deadlock around execution of the initializer.

So yes, you're safe.

(This says nothing of course about the subsequent access to the variable through the reference.)

Cost of thread-safe local static variable initialization in C++11?

A look at the generated assembler code helps.

Source

#include <vector>

std::vector<int> &get() {
    static std::vector<int> v;
    return v;
}

int main() {
    return get().size();
}

Assembler

std::vector<int, std::allocator<int> >::~vector():
        movq    (%rdi), %rdi
        testq   %rdi, %rdi
        je      .L1
        jmp     operator delete(void*)
.L1:
        rep ret
get():
        movzbl  guard variable for get()::v(%rip), %eax
        testb   %al, %al
        je      .L15
        movl    get()::v, %eax
        ret
.L15:
        subq    $8, %rsp
        movl    guard variable for get()::v, %edi
        call    __cxa_guard_acquire
        testl   %eax, %eax
        je      .L6
        movl    guard variable for get()::v, %edi
        movq    $0, get()::v(%rip)
        movq    $0, get()::v+8(%rip)
        movq    $0, get()::v+16(%rip)
        call    __cxa_guard_release
        movl    $__dso_handle, %edx
        movl    get()::v, %esi
        movl    std::vector<int, std::allocator<int> >::~vector(), %edi
        call    __cxa_atexit
.L6:
        movl    get()::v, %eax
        addq    $8, %rsp
        ret
main:
        subq    $8, %rsp
        call    get()
        movq    8(%rax), %rdx
        subq    (%rax), %rdx
        addq    $8, %rsp
        movq    %rdx, %rax
        sarq    $2, %rax
        ret

Compared to

Source

#include <vector>

static std::vector<int> v;

std::vector<int> &get() {
    return v;
}

int main() {
    return get().size();
}

Assembler

std::vector<int, std::allocator<int> >::~vector():
        movq    (%rdi), %rdi
        testq   %rdi, %rdi
        je      .L1
        jmp     operator delete(void*)
.L1:
        rep ret
get():
        movl    v, %eax
        ret
main:
        movq    v+8(%rip), %rax
        subq    v(%rip), %rax
        sarq    $2, %rax
        ret
        movl    $__dso_handle, %edx
        movl    v, %esi
        movl    std::vector<int, std::allocator<int> >::~vector(), %edi
        movq    $0, v(%rip)
        movq    $0, v+8(%rip)
        movq    $0, v+16(%rip)
        jmp     __cxa_atexit

I'm not that great with assembler, but I can see that in the first version the initialization of v is guarded (the __cxa_guard_* calls) and get is not inlined, whereas in the second version get is essentially gone.

You can play around with various compilers and optimization flags, but it seems no compiler is able to inline or optimize out the locks, even though the program is obviously single-threaded.

You can add static to get, which makes GCC inline get while preserving the lock.

To know how much these locks and additional instructions cost for your compiler, flags, platform and surrounding code you would need to make a proper benchmark.

I would expect the locks to have some overhead and be significantly slower than the inlined code, which becomes insignificant when you actually do work with the vector, but you can never be sure without measuring.
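A minimal micro-benchmark along these lines could compare the two accessors. This is a hypothetical sketch (the names get_local, get_global and run_benchmark are illustrative), and the numbers will vary with compiler, flags, and CPU:

```cpp
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

static std::vector<int> g_v;

std::vector<int>& get_local()  { static std::vector<int> v; return v; }
std::vector<int>& get_global() { return g_v; }

// Time one million invocations of f, in nanoseconds.
template <typename F>
long long time_ns(F f) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; ++i) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// Returns {local_static_ns, global_ns}.
std::pair<long long, long long> run_benchmark() {
    volatile std::size_t sink = 0;   // keeps the calls from being optimized away
    long long local_ns  = time_ns([&] { sink = sink + get_local().size(); });
    long long global_ns = time_ns([&] { sink = sink + get_global().size(); });
    return {local_ns, global_ns};
}
```

Printing both numbers at various optimization levels shows how much (or how little) the guard check on the fast path costs in your configuration.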

Is C++ static member variable initialization thread-safe?

It's more a question of function-scoped static variables vs. every other kind of static variable, rather than scoped vs. globals.

All non-function-scope static variables are constructed before main(), while there is only one active thread. Function-scope static variables are constructed the first time their containing function is called. The standard is silent on the question of how function-level statics are constructed when the function is called on multiple threads. However, every implementation I've worked with uses a lock around the constructor (with a twice-checked flag) to guarantee thread-safety.
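That guarantee can be observed directly: in the sketch below (the names Tracked and instance are illustrative), the constructor of a function-local static runs exactly once even when several threads race to make the first call.

```cpp
#include <atomic>

std::atomic<int> ctor_calls{0};

struct Tracked {
    Tracked() { ctor_calls.fetch_add(1); }   // counts constructions
};

Tracked& instance() {
    // With thread-safe statics, this constructor runs exactly once,
    // even if several threads call instance() concurrently.
    static Tracked t;
    return t;
}
```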

How can I know if C++ compiler make thread-safe static object code?

It looks like there is a predefined macro for this: __cpp_threadsafe_static_init.

SD-6: SG10 Feature Test Recommendations:

C++11 features

Significant features of C++11

Doc. No.:        N2660
Title:           Dynamic Initialization and Destruction with Concurrency
Primary Section: 3.6
Macro name:      __cpp_threadsafe_static_init
Value:           200806
Header:          predefined

Clang - http://clang.llvm.org/cxx_status.html#ts

GCC - https://gcc.gnu.org/projects/cxx-status.html

MSVC - Feature request under investigation https://developercommunity.visualstudio.com/content/problem/96337/feature-request-cpp-threadsafe-static-init.html

Useful on cppreference.com:

  • Feature Test Recommendations
  • C++ compiler support
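A minimal check of the macro might look like the following sketch. GCC and Clang predefine it when compiling as C++11 or later; whether your toolchain defines it is exactly what this reports:

```cpp
#include <cstdio>

// Reports whether the compiler advertises N2660 (thread-safe static
// initialization) via the SD-6 feature-test macro.
bool has_threadsafe_static_init() {
#ifdef __cpp_threadsafe_static_init
    std::printf("__cpp_threadsafe_static_init = %ld\n",
                (long)__cpp_threadsafe_static_init);
    return true;
#else
    std::printf("__cpp_threadsafe_static_init is not defined\n");
    return false;
#endif
}
```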

Are function-local static mutexes thread-safe?


C++11

In C++11 and later versions: yes, this pattern is safe. In particular, initialization of function-local static variables is thread-safe, so your code above works safely across threads.

The way this works in practice is that the compiler inserts any necessary boilerplate into the function itself to check whether the variable is initialized prior to access. In the case of std::mutex as implemented in gcc, clang and icc, however, the initialized state is all-zeros, so no explicit initialization is needed (the variable will live in the all-zeros .bss section so the initialization is "free"), as we see from the assembly1:

inc(int& i):
        mov     eax, OFFSET FLAT:_ZL28__gthrw___pthread_key_createPjPFvPvE
        test    rax, rax
        je      .L2
        push    rbx
        mov     rbx, rdi
        mov     edi, OFFSET FLAT:_ZZ3incRiE3mtx
        call    _ZL26__gthrw_pthread_mutex_lockP15pthread_mutex_t
        test    eax, eax
        jne     .L10
        add     DWORD PTR [rbx], 1
        mov     edi, OFFSET FLAT:_ZZ3incRiE3mtx
        pop     rbx
        jmp     _ZL28__gthrw_pthread_mutex_unlockP15pthread_mutex_t
.L2:
        add     DWORD PTR [rdi], 1
        ret
.L10:
        mov     edi, eax
        call    _ZSt20__throw_system_errori

Note that starting at the line mov edi, OFFSET FLAT:_ZZ3incRiE3mtx it simply loads the address of the inc::mtx function-local static and calls pthread_mutex_lock on it, without any initialization. The code before that dealing with pthread_key_create is apparently just checking if the pthreads library is present at all.

There's no guarantee, however, that all implementations will implement std::mutex as all-zeros, so you might in some cases incur ongoing overhead on each call to check if the mutex has been initialized. Declaring the mutex outside the function would avoid that.

Here's an example contrasting the two approaches with a stand-in mutex2 class with a non-inlinable constructor (so the compiler can't determine that the initial state is all-zeros):

#include <mutex>

class mutex2 {
public:
    mutex2();
    void lock();
    void unlock();
};

void inc_local(int &i)
{
    // Thread safe?
    static mutex2 mtx;
    std::unique_lock<mutex2> lock(mtx);
    i++;
}

mutex2 g_mtx;

void inc_global(int &i)
{
    std::unique_lock<mutex2> lock(g_mtx);
    i++;
}

The function-local version compiles (on gcc) to:

inc_local(int& i):
        push    rbx
        movzx   eax, BYTE PTR _ZGVZ9inc_localRiE3mtx[rip]
        mov     rbx, rdi
        test    al, al
        jne     .L3
        mov     edi, OFFSET FLAT:_ZGVZ9inc_localRiE3mtx
        call    __cxa_guard_acquire
        test    eax, eax
        jne     .L12
.L3:
        mov     edi, OFFSET FLAT:_ZZ9inc_localRiE3mtx
        call    _ZN6mutex24lockEv
        add     DWORD PTR [rbx], 1
        mov     edi, OFFSET FLAT:_ZZ9inc_localRiE3mtx
        pop     rbx
        jmp     _ZN6mutex26unlockEv
.L12:
        mov     edi, OFFSET FLAT:_ZZ9inc_localRiE3mtx
        call    _ZN6mutex2C1Ev
        mov     edi, OFFSET FLAT:_ZGVZ9inc_localRiE3mtx
        call    __cxa_guard_release
        jmp     .L3
        mov     rbx, rax
        mov     edi, OFFSET FLAT:_ZGVZ9inc_localRiE3mtx
        call    __cxa_guard_abort
        mov     rdi, rbx
        call    _Unwind_Resume

Note the large amount of boilerplate dealing with the __cxa_guard_* functions. First, a rip-relative flag byte, _ZGVZ9inc_localRiE3mtx2 is checked and if non-zero, the variable has already been initialized and we are done and fall into the fast-path. No atomic operations are needed because on x86, loads already have the needed acquire semantics.

If this check fails, we go to the slow path, which is essentially a form of double-checked locking: the initial check is not sufficient to determine that the variable needs initialization because two or more threads may be racing here. The __cxa_guard_acquire call does the locking and the second check, and may either fall through to the fast path as well (if another thread concurrently initialized the object), or may jump down to the actual initialization code at .L12.

Finally, note that the last 5 instructions in the assembly aren't directly reachable from the function at all: they are preceded by an unconditional jmp .L3 and nothing jumps to them. They are there to be jumped to by an exception handler should the call to the constructor mutex2() throw an exception at some point.

Overall, we can say that the runtime cost of the first-access initialization is low to moderate, because the fast path only checks a single byte flag without any expensive instructions (and the remainder of the function itself usually implies at least two atomic operations for mutex.lock() and mutex.unlock() anyway). It does, however, come at a significant code-size increase.

Compare to the global version, which is identical except that initialization happens during global initialization rather than before first access:

inc_global(int& i):
        push    rbx
        mov     rbx, rdi
        mov     edi, OFFSET FLAT:g_mtx
        call    _ZN6mutex24lockEv
        add     DWORD PTR [rbx], 1
        mov     edi, OFFSET FLAT:g_mtx
        pop     rbx
        jmp     _ZN6mutex26unlockEv

The function is less than a third of the size without any initialization boilerplate at all.

Prior to C++11

Prior to C++11, however, this is generally not safe, unless your compiler makes some special guarantees about the way in which static locals are initialized.

Some time ago, while looking at a similar issue, I examined the assembly generated by Visual Studio for this case. The pseudocode for the generated assembly code for your print method looked something like this:

void print(const std::string & s)
{
    if (!init_check_print_mtx) {
        init_check_print_mtx = true;
        mtx.mutex(); // call mutex() ctor for mtx
    }

    // ... rest of method
}

The init_check_print_mtx is a compiler-generated global variable specific to this method which tracks whether the local static has been initialized. Note that inside the "one time" initialization block guarded by this variable, the flag is set to true before the mutex is initialized.

I thought this was silly, since it ensures that other threads racing into this method will skip the initializer and use an uninitialized mtx - versus the alternative of possibly initializing mtx more than once - but in fact doing it this way avoids the infinite recursion that would occur if std::mutex() were to call back into print, and this behavior is in fact mandated by the standard.

Nemo above mentions that this has been fixed (more precisely, re-specified) in C++11 to require a wait for all racing threads, which would make this safe, but you'll need to check your own compiler for compliance. I didn't check if in fact the new spec includes this guarantee, but I wouldn't be at all surprised given that local statics were pretty much useless in multi-threaded environments without this (except perhaps for primitive values which didn't have any check-and-set behavior because they just referred directly to an already initialized location in the .data segment).


1 Note that I changed the print() function to a slightly simpler inc() function that just increments an integer in the locked region. This has the same locking structure and implications as the original, but avoids a bunch of code dealing with the << operators and std::cout.

2 Using c++filt this de-mangles to guard variable for inc_local(int&)::mtx.

How to implement thread safe local static variable in C++03?

I discussed this in a follow-up to the blog post referenced in the question. If for some reason you can't use boost::call_once and your block-scoped static is a pointer, POD, or otherwise has a thread-safe constructor, you can write the same initialization guard code that GCC would emit:

// Define a static local variable once, safely, for MSVC
//
// This macro is necessary because MSVC pre-2013 doesn't
// properly implement C++11 static local initialization.
// It is equivalent to writing something like
//
// static type var = stmt;
//
// in a compliant compiler (e.g. GCC since who knows when)

// States for lock checking
enum { uninitialized = 0, initializing, initialized };

// Preprocessor hackery for anonymous variables
#define PASTE_IMPL(x, y) x ## y
#define PASTE(x, y) PASTE_IMPL(x, y)
#define ANON_VAR(var) PASTE(var, __LINE__)

// Note: the Interlocked* APIs operate on a volatile LONG, and
// zero-initialization of the static state variable supplies the
// "uninitialized" starting value for free.
#define STATIC_DEFINE_ONCE(type, var, stmt) \
    static type var; \
    static volatile LONG ANON_VAR(state); \
    bool ANON_VAR(cont) = true; \
    while (ANON_VAR(cont)) { \
        switch (InterlockedCompareExchange(&ANON_VAR(state), \
                                           initializing, uninitialized)) { \
        case uninitialized: \
            var = stmt; \
            InterlockedExchange(&ANON_VAR(state), initialized); \
            ANON_VAR(cont) = false; \
            break; \
        case initializing: \
            continue; \
        case initialized: \
            ANON_VAR(cont) = false; \
            break; \
        } \
    } do { } while (0)

You can use this like

void concurrently_accessed() {
    STATIC_DEFINE_ONCE(int, local_var, thread_unsafe_initializer());
    // ...
}

This approach takes advantage of zero-initialization of static block-scoped variables, which is required by the language standard. The above macros will let you safely use "magic" statics until actual compiler and runtime support arrives in Visual Studio 2015.

How are local static variables thread-unsafe in C?


but isn't internal linkage supposed to stop threads from stepping in each other's static variables?

No, linkage has nothing to do with thread safety. It merely restricts which parts of the program can refer to a variable by name, which is a different and unrelated matter.

Let's assume you have a function like this:

int do_stuff (void)
{
    static int x = 0;
    ...
    return x++;
}

and then this function is called by multiple threads, thread 1 and thread 2. The thread callback functions cannot access x directly, because it has local scope. However, they can call do_stuff() and they can do so simultaneously. And then you will get scenarios like this:

  • Thread 1 executes do_stuff up to the point where it is about to return 0 to the caller.
  • Thread 1 is about to write the value 1 to x, but before it does..:
  • Context switch: thread 2 steps in and executes do_stuff.
  • Thread 2 reads x; it is still 0, so it returns 0 to the caller and then increases x by 1.
  • x is now 1.
  • Thread 1 is scheduled again. It was about to store 1 to x, so that's what it does.
  • x is still 1, although if the program had behaved correctly, it should have been 2.

This gets even worse when the access to x is done in multiple instructions, so that one thread reads "half of x" and then gets interrupted.

This is a "race condition" and the solution here is to protect x with a mutex or similar protection mechanism. Doing so will make the function thread-safe. Alternatively, do_stuff can be rewritten to not use any static storage variables or similar resources - it would then be re-entrant.
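A minimal fix along those lines, sketched here in C++ with std::mutex (the original question is about C, where you would use a pthread_mutex_t or C11 mtx_t instead):

```cpp
#include <mutex>

// Thread-safe version of do_stuff: the mutex serializes the
// read-modify-write on the static counter, so no increment is lost.
int do_stuff(void)
{
    static std::mutex m;   // initialization of the static itself is safe in C++11
    static int x = 0;
    std::lock_guard<std::mutex> lock(m);
    return x++;
}
```

With the lock in place, each caller observes and increments x atomically, so 2000 concurrent calls return each of the values 0 through 1999 exactly once.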


