Where Is the Lock for a std::atomic?

Where is the lock for a std::atomic?

The easiest way to answer such questions is generally to just look at the resulting assembly and take it from there.

Compiling the following (I made your struct larger to dodge crafty compiler shenanigans):

#include <atomic>

struct foo {
    double a;
    double b;
    double c;
    double d;
    double e;
};

std::atomic<foo> var;

void bar()
{
    var.store(foo{1.0, 2.0, 1.0, 2.0, 1.0});
}

with clang 5.0.0 at -O3 yields the following (see it on godbolt):

bar():                                  # @bar()
    sub rsp, 40
    movaps xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = [1.000000e+00,2.000000e+00]
    movaps xmmword ptr [rsp], xmm0
    movaps xmmword ptr [rsp + 16], xmm0
    movabs rax, 4607182418800017408
    mov qword ptr [rsp + 32], rax
    mov rdx, rsp
    mov edi, 40
    mov esi, var
    mov ecx, 5
    call __atomic_store

Great, the compiler delegates to an intrinsic (__atomic_store), which doesn't tell us what's really going on here. However, since the compiler is open source, we can easily find the implementation of the intrinsic (I found it in https://github.com/llvm-mirror/compiler-rt/blob/master/lib/builtins/atomic.c):

void __atomic_store_c(int size, void *dest, void *src, int model) {
#define LOCK_FREE_ACTION(type) \
  __c11_atomic_store((_Atomic(type)*)dest, *(type*)src, model); \
  return;
  LOCK_FREE_CASES();
#undef LOCK_FREE_ACTION
  Lock *l = lock_for_pointer(dest);
  lock(l);
  memcpy(dest, src, size);
  unlock(l);
}

It seems like the magic happens in lock_for_pointer(), so let's have a look at it:

static __inline Lock *lock_for_pointer(void *ptr) {
  intptr_t hash = (intptr_t)ptr;
  // Disregard the lowest 4 bits. We want all values that may be part of the
  // same memory operation to hash to the same value and therefore use the same
  // lock.
  hash >>= 4;
  // Use the next bits as the basis for the hash
  intptr_t low = hash & SPINLOCK_MASK;
  // Now use the high(er) set of bits to perturb the hash, so that we don't
  // get collisions from atomic fields in a single object
  hash >>= 16;
  hash ^= low;
  // Return a pointer to the word to use
  return locks + (hash & SPINLOCK_MASK);
}

And here's our explanation: the address of the atomic is used to generate a hash key that selects a pre-allocated lock.

What does std::atomic::is_always_lock_free = true really mean?

"Lock" here is in the sense of "mutex", not specifically in reference to the x86 instruction prefix named lock.

A trivial and generic way to implement std::atomic<T> for arbitrary types T would be as a class containing a T member together with a std::mutex, which is locked and unlocked around every operation on the object (load, store, exchange, fetch_add, etc.). Those operations can then be done in any old way, and need not use atomic machine instructions, because the lock protects them. This implementation would not be lock-free.

A downside of such an implementation, besides being slow in general, is that if two threads try to operate on the object at the same time, one of them will have to wait for the lock, which may actually block and cause it to be scheduled out for a while. Or, if a thread gets scheduled out while holding the lock, every other thread that wants to operate on the object will have to wait for the first thread to get scheduled back in and complete its work first.

So it is desirable if the machine supports truly atomic operations on T: a single instruction or sequence that other threads cannot interfere with, and which will not block other threads if interrupted (or perhaps cannot be interrupted at all). If for some type T the library has been able to specialize std::atomic<T> with such an implementation, then that is what we mean by saying it is lock free. (It is just confusing on x86 because the atomic instructions used for such implementations are named lock. On other architectures they might be called something else, e.g. ARM64's ldxr/stxr exclusive load/store instructions.)

The C++ standard allows for types to be "sometimes lock free": maybe it is not known at compile time whether std::atomic<T> will be lock-free, because it depends on special machine features that will be detected at runtime. It's even possible that some objects of type std::atomic<T> are lock-free and others are not. That's why atomic_is_lock_free is a function and not a constant. It checks whether this particular object is lock-free on this particular day.

However, it might be the case for some implementations that certain types can be guaranteed, at compile time, to always be lock free. That's what is_always_lock_free is used to indicate, and note that it's a constexpr bool instead of a function.

Genuinely test std::atomic is lock-free or not

Other than performance, the standard doesn't guarantee any way you can tell; that's more or less the point.

If you are willing to introduce some platform-specific UB, you could do something like cast an atomic<int64_t>* to a volatile int64_t* and see if you observe "tearing" when another thread reads the object. (When to use volatile with multi threading? - normally never, but real hardware has coherent caches between the cores that run threads, so plain asm loads/stores are basically like relaxed atomics.)

If this test succeeds (i.e. the plain C++ type was naturally atomic with just volatile), that tells you any sane compiler will make it lock-free very cheaply. But if it fails, it doesn't tell you very much: a lock-free atomic for that type may be only slightly more expensive than the plain version for loads/stores, or the compiler may not make it lock-free at all. For example, on 32-bit x86, lock-free int64_t is efficient with only small overhead (using SSE2 or x87), but volatile int64_t* will produce tearing because most compilers compile it to two separate 4-byte integer loads or stores.

On any specific platform / target architecture, you can single-step your code in a debugger and see what asm instructions run. (Including stepping into libatomic function calls like __atomic_store_16). This is the only 100% reliable way. (Plus consulting ISA documentation to check atomicity guarantees for different instructions, e.g. whether ARM load/store pair is guaranteed, under what conditions.)

(Fun fact: gcc7 with statically linked libatomic may always use locking for 16-byte objects on x86-64, because it doesn't have an opportunity to do runtime CPU detection at dynamic link time and use lock cmpxchg16b on CPUs that support it, with the same mechanism glibc uses to pick optimal memcpy / strchr implementations for the current system.)


You could portably look for a performance difference (e.g. scalability with multiple readers), but x86-64 lock cmpxchg16b doesn't scale (see footnote 1). Multiple readers contend with each other, unlike 8 byte and narrower atomic objects where pure asm loads are atomic and can be used. lock cmpxchg16b acquires exclusive access to a cache line before executing; abusing the side-effect of atomically loading the old value on failure to implement .load() is much worse than an 8-byte atomic load which compiles to just a regular load instruction.

That's part of the reason that gcc7 decided to stop returning true for is_lock_free() on 16-byte objects, as described in the GCC mailing list message about the change you're asking about.

Also note that clang on 32-bit x86 uses lock cmpxchg8b to implement std::atomic<int64_t>, just like for 16-byte objects in 64-bit mode. So you would see a lack of parallel read scaling with it, too. (https://bugs.llvm.org/show_bug.cgi?id=33109)


std::atomic<> implementations that use locking usually still don't make the object larger by including a lock byte or word in each object. It would change the ABI, but lock-free vs. locking is already an ABI difference. The standard allows this, I think, but weird hardware might need extra bytes in the object even when lock-free. Anyway, sizeof(atomic<T>) == sizeof(T) doesn't tell you anything either way. If it's larger, it's most likely that your implementation added a mutex, but you can't be sure without checking the asm (if the size wasn't a power of 2, it could simply have been widened for alignment).

(In C11, there's much less scope for including a lock in the object: it has to work even with minimal initialization (e.g. statically to 0), and no destructor. Compilers / ABIs generally want their C stdatomic.h atomics to be compatible with their C++ std::atomic atomics.)

The normal mechanism is to use the address of the atomic object as a key for a global hash table of locks. Two objects aliasing / colliding and sharing the same lock is extra contention, but not a correctness problem. These locks are only taken/released from library functions, not while holding other such locks, so it can't create a deadlock.

You could detect this by using shared memory between two different processes (so each process would have its own hash table of locks); see Is C++11 atomic<T> usable with mmap? The steps:

  • Check that std::atomic<T> is the same size as T (so the lock isn't in the object itself).

  • Map a shared memory segment from two separate processes that don't otherwise share any of their address space. It doesn't matter if you map it to a different base address in each process.

  • Store patterns like all-ones and all-zeros from one process while reading from the other (and look for tearing). Same as what I suggested with volatile above.

  • Also test atomic increment: have each thread do 1G increments and check that the result is 2G every time. Even if pure load and pure store are naturally atomic (the tearing test), read-modify-write operations like fetch_add / operator++ need special support: Can num++ be atomic for 'int num'? (A sketch of the tearing part of this test follows the list.)

From the C++11 standard, the intent is that this should still be atomic for lock-free objects. It might also work for non-lock-free objects (if they embed the lock in the object), which is why you have to rule that out by checking sizeof().

To facilitate inter-process communication via shared memory, it is our intent that lock-free operations also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation shall not depend on any per-process state.

If you see tearing between two processes, the object wasn't lock-free (at least not the way C++11 intended, and not the way you'd expect on normal shared-memory CPUs.)

I'm not sure why address-free matters if the processes don't have to share any address-space other than 1 page containing the atomic object (see footnote 2). (Of course, C++11 doesn't require that the implementation uses pages at all. Or maybe an implementation could put the hash table of locks at the top or bottom of each page? In which case using a hash function that depended on address bits above the page offset would be totally silly.)

Anyway, this depends on a lot of assumptions about how computers work that are true on all normal CPUs, but which C++ doesn't make. If the implementation you care about is on a mainstream CPU like x86 or ARM under a normal OS, then this testing method should be fairly accurate and might be an alternative to just reading the asm. It's not something that's very practical to do automatically at compile time, but it would be possible to automate a test like this and put it into a build script, unlike reading the asm.


Footnote 1: 16-byte atomics on x86

(Update: Intel recently documented that the AVX feature bit implies 16-byte atomicity for aligned loads/stores, such as with movaps. At least on Intel CPUs specifically; AMD CPUs with AVX in practice seem to be like that too, but AMD hasn't yet documented it officially. The rest of this answer was written before that, but GCC's libatomic does use vmovdqa [mem], xmm / mfence for atomic 16-byte stores on CPUs where that's guaranteed atomic.)

No x86 vendor documentation guarantees 16-byte atomic load/store with SSE instructions. In practice many modern CPUs do have atomic movaps load/store, but there are no guarantees of this in Intel/AMD manuals the way there are for 8-byte x87/MMX/SSE loads/stores on Pentium and later. And there is no way to detect which CPUs do/don't have atomic 128-bit ops (other than lock cmpxchg16b), so compiler writers can't safely use them.

See SSE instructions: which CPUs can do atomic 16B memory operations? for a nasty corner case: testing on K10 shows that aligned xmm load/store shows no tearing between threads on the same socket, but threads on different sockets experience rare tearing because HyperTransport apparently only gives the minimum x86 atomicity guarantee of 8 byte objects. (IDK if lock cmpxchg16b is more expensive on a system like that.)

Without published guarantees from vendors, we can never be sure about weird microarchitectural corner cases, either. Lack of tearing in a simple test with one thread writing patterns and the other reading is pretty good evidence, but it's always possible that something could be different in some special case the CPU designers decided to handle a different way than normal.


A pointer + counter struct where read-only access only needs the pointer can be cheap, but current compilers need union hacks to get them to do an 8-byte atomic load of just the first half of the object (see How can I implement ABA counter with c++11 CAS?). For an ABA counter, you'd normally update it with a CAS anyway, so the lack of a 16-byte atomic pure store is not a problem.

An ILP32 ABI (32-bit pointers) in 64-bit mode (like Linux's x32 ABI, or AArch64's ILP32 ABI) means pointer+integer can fit in only 8 bytes, but integer registers are still 8 bytes wide. This makes it much more efficient to use a pointer+counter atomic object than in full 64-bit mode where a pointer is 8 bytes.


Footnote 2: address-free

I think the term "address-free" is a separate claim from not depending on any per-process state. As I understand it, it means that correctness doesn't depend on both threads using the same address for the same memory location. But if correctness also depends on them sharing the same global hash table (IDK why storing the address of an object in the object itself would ever help), that would only matter if it was possible to have multiple addresses for the same object within the same process. That is possible on something like x86's real-mode segmentation model, where a 20-bit linear address space is addressed with 32-bit segment:offset. (Actual C implementations for 16-bit x86 exposed segmentation to the programmer; hiding it behind C's rules would be possible but not high performance.)

It's also possible with virtual memory: two mappings of the same physical page to different virtual addresses within the same process is possible but weird. That might or might not use the same lock, depending on whether the hash function uses any address bits above the page offset.
(The low bits of an address, that represent the offset within a page, are the same for every mapping. i.e. virtual to physical translation for those bits is a no-op, which is why VIPT caches are usually designed to take advantage of that to get speed without aliasing.)

So a non-lock-free object might be address-free within a single process, even if it uses a separate global hash table instead of adding a mutex to the atomic object. But this would be a very unusual situation; it's extremely rare to use virtual memory tricks to create two addresses for the same variable within the same process that shares all of its address-space between threads. Much more common would be atomic objects in shared memory between processes. (I may be misunderstanding the meaning of "address-free"; possibly it means "address-space free", i.e. lack of dependency on other addresses being shared.)

Does the C++11 standard guarantee that std::atomic is implemented as a lock-free operation?

The C++ standard makes no guarantee that std::atomic<T> operations are lock-free. However, you can use std::atomic<T>::is_lock_free() to find out whether the operations of std::atomic<T> are lock-free. From 29.6.5 [atomics.types.operations.req] paragraph 7:

Returns: True if the object’s operations are lock-free, false otherwise.

If it is not lock-free, it will still do the required synchronization, but it will use some lock to do so.

shared_lock implemented with std::atomic

For correctness I think this looks reasonable; I don't see a problem. I might have missed something, but a seq_cst CAS is more than sufficient for acquiring a lock. It looks like you're avoiding integer wrap-around by treating the max value as special.

The mutex=0 unlock only needs to be release, not seq_cst. (Same for the -=1 shared unlock, but that won't make it more efficient on x86, only on weakly-ordered ISAs). Also, compare_exchange_weak would be totally fine; you're retrying in a loop anyway so spurious failure is not different from a failed compare.

If you're on x86, you'd normally want _mm_pause() inside your spin loop, and possibly some kind of backoff to reduce contention if multiple threads are all trying to acquire the lock at once.

And usually you want to spin read-only until you see the lock available, not keep hammering with atomic RMWs. (See Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?).

Also, short is a strange choice; if any size is going to perform worse than int, it's often short. But it's probably fine, and OK I guess if it helps the lock pack into the same cache line as the data you're modifying. (Although that cache line will be the victim of false-sharing contention from other threads hammering on it trying to take the lock.)

How does the std::atomic wait operation work?

Yes, that is exactly it. notify_one/all simply give the waiting thread a chance to check the value for a change. If it remains the same, e.g. because a different thread has set the value back to its original value, the thread will remain blocked.

Note: A valid implementation for this is to use a global array of mutexes and condition_variables. Atomic variables are then mapped to these objects via a hash of their pointer. That's also why you can get spurious wakeups: several atomics may share the same condition_variable.

Something like this:


#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>

std::mutex atomic_mutexes[64];
std::condition_variable atomic_conds[64];

template<class T>
std::size_t index_for_atomic(std::atomic<T>* ptr) noexcept
{ return reinterpret_cast<std::uintptr_t>(ptr) / sizeof(T) % 64; }

// Illustration of what std::atomic<T>::wait could do internally
// (not legal as a user-written out-of-class definition).
template<class T>
void atomic<T>::wait(T value, std::memory_order order)
{
    if (this->load(order) != value)
        return;
    std::size_t index = index_for_atomic(this);
    std::unique_lock<std::mutex> lock(atomic_mutexes[index]);
    while (this->load(std::memory_order_relaxed) == value)
        atomic_conds[index].wait(lock);
}

template<class T>
void std::atomic_notify_one(std::atomic<T>* ptr)
{
    const std::size_t index = index_for_atomic(ptr);
    /*
     * Normally we don't need to hold the mutex to notify,
     * but in this case we updated the value without holding
     * the lock. Therefore, without the mutex there would be
     * a race condition in wait() between the while-loop condition
     * and the loop body.
     */
    std::lock_guard<std::mutex> lock(atomic_mutexes[index]);
    /*
     * Needs to be notify_all because we could have multiple waiters
     * on multiple atomics due to aliasing.
     */
    atomic_conds[index].notify_all();
}

A real implementation would probably use OS primitives, for example WaitOnAddress on Windows or (at least for int-sized types) futex on Linux.

Are lock-free atomics address-free in practice?

Yes, lock-free atomics are address-free on all C++ implementations on all normal CPUs, and can safely be used in shared memory between processes. Non-lock-free atomics (see footnote 1) won't be safe between processes, though. Each process will have its own hash table of locks (Where is the lock for a std::atomic?).

The C++ standard intends lock-free atomics to work in shared memory between processes, but it can only go as far as "should" without defining terms and so on.

C++ draft 29.5 Lock-free property

[ Note: Operations that are lock-free should also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation should not depend on any per-process state. This restriction enables communication by memory that is mapped into a process more than once and by memory that is shared between two processes. — end note ]

This is a quality-of-implementation recommendation that is very easy to implement on current hardware, and in fact you'd have to try hard to design a deathstation9000 C++ implementation that violates it on x86 / ARM / PowerPC / other mainstream CPU while actually being lock-free.


The mechanism hardware exposes for atomic read-modify-write operations is based on MESI cache coherency, which only cares about physical addresses. x86 lock cmpxchg / lock add / etc. makes a core hang on to a cache line in Modified state so no other core can read/write it in the middle of the atomic operation. (Can num++ be atomic for 'int num'?).

Most non-x86 architectures use LL/SC (load-linked/store-conditional), which lets you write a retry loop that only does the store if it will be atomic. LL/SC can emulate CAS with O(1) overhead, without introducing any dependence on addresses.

On such machines, C++ lock-free atomics compile to use LL/SC instructions directly. See my answer on the num++ question for x86 examples, and Atomically clearing lowest non-zero bit of an unsigned integer for some examples of AArch64 code-gen for compare_exchange_weak vs. fetch_add using LL/SC instructions.

Atomic pure-load or pure-store are easier and happen for free with aligned data. On x86, see Why is integer assignment on a naturally aligned variable atomic on x86? Other ISAs have similar rules.


Related: I included some comments about address-free in my answer on Genuinely test std::atomic is lock-free or not. I'm not sure whether they're helpful or correct. :/


Footnote 1:

All mainstream CPUs have lock-free atomics for objects up to the width of a pointer. Some have wider atomics, like x86 has lock cmpxchg16b, but not all implementations choose to use it for double-width lock-free atomic objects. Check C++17 std::atomic::is_always_lock_free, or ATOMIC_xxx_LOCK_FREE if defined, for compile-time detection.

(Some microcontrollers can't hold a pointer in a single register (or copy it around with a single operation), but there aren't usually multi-core implementations of such ISAs.)



Why on earth would an implementation use non-address-free atomics that are lock-free?

I don't know of any plausible reason on hardware that works anything like normal modern CPUs. You could maybe imagine some architecture where you do atomic operations by submitting the address to some kind of special hardware outside the normal cache-coherency machinery, but I don't know of any real system like that.

I think the C++ standard wants to avoid constraining non-mainstream implementations as much as possible, e.g. C++ on top of some kind of interpreter, rather than compiled to machine code for a "normal" CPU architecture.

IDK if you could usefully implement C++ on a loosely-coupled system like a cluster with ethernet links instead of coherent shared memory, or on non-coherent shared memory (which has to be flushed explicitly for other threads to see your stores).

I think it's mostly that the C++ committee can't say much about how atomics must be implemented without assuming that implementations will run programs under an OS where multiple processes can set up shared memory.

They might be imagining some future ISA where address-free atomics aren't possible, but I think more likely they don't want to talk about shared-memory between multiple C++ programs. The standard only requires that an implementation run one program.

Apparently std::atomic_flag actually is guaranteed to be address-free (Why only std::atomic_flag is guaranteed to be lock-free?), so IDK why they don't make the same requirement for any atomic<T> that the implementation chooses to implement as lock-free.

Can the implementation of std::atomic with respect to locks change a program's behavior from correct to incorrect?

In order to have a deadlock in a program, you need to hold more than one lock simultaneously. Accessing or modifying a std::atomic<T> variable may acquire a lock according to the C++11 standard, but it releases the lock as soon as the function call finishes, and it doesn't call any user-defined function while holding the lock, so you can't have a situation where two (or more) of these mutexes are locked simultaneously; hence no deadlock is possible with std::atomic's internal lockable objects.


