Why Does a std::atomic Store with Sequential Consistency Use XCHG?

Why does a std::atomic store with sequential consistency use XCHG?

mov-store + mfence and xchg are both valid ways to implement a sequential-consistency store on x86. The implicit lock prefix on an xchg with memory makes it a full memory barrier, like all atomic RMW operations on x86.

(x86's memory-ordering rules essentially make that full-barrier effect the only option for any atomic RMW: it's both a load and a store at the same time, stuck together in the global order. Atomicity requires that the load and store not be separated, so just queuing the store into the store buffer isn't enough; the store buffer has to be drained. And load-load ordering of the load side requires that the load not reorder.)

Plain mov is not sufficient; it only has release semantics, not sequential-release. (Unlike AArch64's stlr instruction, which does do a sequential-release store that can't reorder with later ldar sequential-acquire loads. This choice is obviously motivated by C++11 having seq_cst as the default memory ordering. But AArch64's normal store is much weaker: relaxed, not release.)

See Jeff Preshing's article on acquire / release semantics, and note that regular release stores (like mov, or any non-locked x86 memory-destination instruction other than xchg) allow reordering with later operations, including acquire loads (like mov, or any x86 instruction with a memory source operand). e.g. if the release-store is releasing a lock, it's ok for later stuff to appear to happen inside the critical section, as the sketch below illustrates.
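
For example (a minimal sketch; the variable names here are made up): a release store followed by a load of a different location can effectively swap places at run time on x86, which is exactly what a seq-cst store has to prevent.

#include <atomic>
std::atomic<int> locked{1}, other{0};

void unlock_then_peek() {
    locked.store(0, std::memory_order_release);     // plain mov store on x86
    int x = other.load(std::memory_order_acquire);  // plain mov load
    // The load can take its value before the store above is globally
    // visible (StoreLoad reordering), so it appears to happen inside
    // the critical section. That's allowed for release/acquire.
    (void)x;
}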


There are performance differences between mfence and xchg on different CPUs, and perhaps between hot-cache vs. cold-cache and contended vs. uncontended cases. And/or for throughput of many operations back-to-back in the same thread vs. one on its own, and for how much surrounding code can overlap execution with the atomic operation.

See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for actual benchmarks of mfence vs. lock addl $0, -8(%rsp) vs. lock addl $0, (%rsp) as a full barrier (when you don't already have a store to do).

On Intel Skylake hardware, mfence blocks out-of-order execution of independent ALU instructions, but xchg doesn't. (See my test asm + results in the bottom of this SO answer). Intel's manuals don't require it to be that strong; only lfence is documented to do that. But as an implementation detail, it's very expensive for out-of-order execution of surrounding code on Skylake.

I haven't tested other CPUs, and this may be a result of a microcode fix for erratum SKL079: MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. The existence of the erratum basically proves that SKL used to be able to execute instructions after MFENCE. I wouldn't be surprised if they fixed it by making MFENCE stronger in microcode, kind of a blunt instrument approach that significantly increases the impact on surrounding code.

I've only tested the single-threaded case where the cache line is hot in L1d cache. (Not when it's cold in memory, or when it's in Modified state on another core.) xchg has to load the previous value, creating a "false" dependency on the old value that was in memory. But mfence forces the CPU to wait until previous stores commit to L1d, which also requires the cache line to arrive (and be in M state). So they're probably about equal in that respect, but Intel's mfence forces everything to wait, not just loads.

AMD's optimization manual recommends xchg for atomic seq-cst stores. I thought Intel recommended mov + mfence, which older gcc uses, but Intel's compiler also uses xchg here.

When I tested, I got better throughput on Skylake for xchg than for mov+mfence in a single-threaded loop on the same location repeatedly. See Agner Fog's microarch guide and instruction tables for some details, but he doesn't spend much time on locked operations.

See gcc/clang/ICC/MSVC output on the Godbolt compiler explorer for a C++11 seq-cst my_atomic = 4; gcc uses mov + mfence when SSE2 is available. (use -m32 -mno-sse2 to get gcc to use xchg too). The other 3 compilers all prefer xchg with default tuning, or for znver1 (Ryzen) or skylake.

The Linux kernel uses xchg for __smp_store_mb().

Update: recent GCC (like GCC10) changed to using xchg for seq-cst stores like other compilers do, even when SSE2 for mfence is available.
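
For reference, here's what the seq-cst store looks like in source, with typical x86-64 asm in comments (a sketch; exact output depends on compiler version and tuning options):

#include <atomic>
std::atomic<int> my_atomic;

void store_seq_cst() {
    my_atomic = 4;   // seq_cst is the default memory order
    // older gcc:                    mov DWORD PTR my_atomic[rip], 4
    //                               mfence
    // clang/ICC/MSVC, recent gcc:   mov eax, 4
    //                               xchg DWORD PTR my_atomic[rip], eax
}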


Another interesting question is how to compile atomic_thread_fence(mo_seq_cst);. The obvious option is mfence, but lock or dword [rsp], 0 is another valid option (and used by gcc -m32 when MFENCE isn't available). The bottom of the stack is usually already hot in cache in M state. The downside is introducing latency if a local was stored there. (If it's just a return address, return-address prediction is usually very good so delaying ret's ability to read it is not much of a problem.) So lock or dword [rsp-4], 0 could be worth considering in some cases. (gcc did consider it, but reverted it because it makes valgrind unhappy. This was before it was known that it might be better than mfence even when mfence was available.)

All compilers currently use mfence for a stand-alone barrier when it's available. Those are rare in C++11 code, but more research is needed on what's actually most efficient for real multi-threaded code that has real work going on inside the threads that are communicating locklessly.

But multiple sources recommend using lock add to the stack as a barrier instead of mfence, so the Linux kernel recently switched to using it for the smp_mb() implementation on x86, even when SSE2 is available. (A sketch of that kind of barrier in inline asm follows the kernel version notes below.)

See https://groups.google.com/d/msg/fa.linux.kernel/hNOoIZc6I9E/pVO3hB5ABAAJ for some discussion, including a mention of some errata for HSW/BDW about movntdqa loads from WC memory passing earlier locked instructions. (Opposite of Skylake, where it was mfence instead of locked instructions that were a problem. But unlike SKL, there's no fix in microcode. This may be why Linux still uses mfence for its mb() for drivers, in case anything ever uses NT loads to copy back from video RAM or something but can't let the reads happen until after an earlier store is visible.)

  • In Linux 4.14, smp_mb() uses mb(). That uses mfence if available, otherwise lock addl $0, 0(%esp).

    __smp_store_mb (store + memory barrier) uses xchg (and that doesn't change in later kernels).

  • In Linux 4.15, smp_mb() uses lock; addl $0,-4(%esp) or %rsp, instead of using mb(). (The kernel doesn't use a red zone even in 64-bit, so the -4 may help avoid extra latency for local vars.)

    mb() is used by drivers to order access to MMIO regions, but smp_mb() turns into a no-op when compiled for a uniprocessor system. Changing mb() is riskier because it's harder to test (affects drivers), and CPUs have errata related to lock vs. mfence. But anyway, mb() uses mfence if available, else lock addl $0, -4(%esp). The only change is the -4.

  • In Linux 4.16, no change except removing the #if defined(CONFIG_X86_PPRO_FENCE) which defined stuff for a more weakly-ordered memory model than the x86-TSO model that modern hardware implements.
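
For illustration, the kind of barrier the kernel switched to can be written in GNU C++ inline asm roughly like this (a sketch assuming x86-64; this is not the kernel's actual macro):

// Full barrier via a locked no-op RMW on the stack, similar in spirit to
// Linux's smp_mb() on x86-64.  Adding 0 doesn't change the data, so it's
// safe even if -4(%rsp) holds a live local (though it does create the
// false dependency / latency concern discussed above).
static inline void full_barrier_lock_add() {
    asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc");
}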



x86 & x86_64. Where a store has an implicit acquire fence

You mean release, I hope. my_atomic.store(1, std::memory_order_acquire); won't compile, because write-only atomic operations can't be acquire operations. See also Jeff Preshing's article on acquire/release semantics.

Or asm volatile("" ::: "memory");

No, that's a compiler barrier only; it prevents all compile-time reordering across it, but doesn't prevent runtime StoreLoad reordering, i.e. the store being buffered until later, and not appearing in the global order until after a later load. (StoreLoad is the only kind of runtime reordering x86 allows.)

Anyway, another way to express what you want here is:

my_atomic.store(1, std::memory_order_release);        // mov
// with no operations in between, there's nothing for the release-store to be delayed past
std::atomic_thread_fence(std::memory_order_seq_cst); // mfence

Using a release fence would not be strong enough (it and the release-store could both be delayed past a later load, which is the same thing as saying that release fences don't keep later loads from happening early). A release-acquire fence would do the trick, though, keeping later loads from happening early and not itself being able to reorder with the release store.

Related: Jeff Preshing's article on fences being different from release operations.

But note that seq-cst is special according to C++11 rules: only seq-cst operations are guaranteed to have a single global / total order which all threads agree on seeing. So emulating them with weaker order + fences might not be exactly equivalent in general on the C++ abstract machine, even if it is on x86. (On x86, all stores have a single total order which all cores agree on. See also Globally Invisible load instructions: loads can take their data from the store buffer, so we can't really say that there's a total order for loads + stores.)

Why does GCC use mov/mfence instead of xchg to implement C11's atomic_store?

Quite simply, the mov and mfence method is faster as it does not trigger the redundant memory read that xchg does, which takes time. The x86 CPU guarantees strict ordering of writes between threads anyway, so it is enough.

Note that some very old CPUs have a bug in the mov instruction which makes xchg necessary, but that was a very long time ago and working around it is not worth the overhead for most users.

Credit to @amdn for the information on the bug in old Pentium CPUs causing xchg to be needed in the past.

What is the point of atomic.Load and atomic.Store

I don't know Go at all, but it looks like the x86-64 implementations of .load() and .store() are sequentially-consistent. Presumably on purpose / for a reason!

//go:noinline on the load means the compiler can't reorder around a blackbox non-inline function, I assume. On x86 that's all you need for the load side of sequential-consistency, or acq-rel. A plain x86 mov load is an acquire load.

The compiler-generated code gets to take advantage of x86's strongly-ordered memory model, which is sequential consistency + a store buffer (with store forwarding), i.e. acq/rel. To recover sequential consistency, you only need to drain the store buffer after a release-store.

.store() is written in asm, loading its stack args and using xchg as a seq-cst store.


XCHG with memory has an implicit lock prefix which is a full barrier; it's an efficient alternative to mov+mfence to implement what C++ would call a memory_order_seq_cst store.

It flushes the store buffer before later loads and stores are allowed to touch L1d cache. See Why does a std::atomic store with sequential consistency use XCHG? above.

See

  • https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
  • C/C++11 mappings to processors
    describes the sequences of instructions that implement relaxed load/store, acq/rel load/store, seq-cst load/store, and various barriers, on various ISAs. So you can recognize things like xchg with memory.
  • Does lock xchg have the same behavior as mfence? (TL:DR: yes except for maybe some corner cases with NT loads from WC memory, e.g. from video RAM). You may see a dummy lock add $0, (SP) used as an alternative to mfence in some code.

    IIRC, AMD's optimization manual even recommends this. It's good on Intel as well, especially on Skylake where mfence was strengthened by microcode update to fully block out-of-order exec even of ALU instructions (like lfence) as well as memory reordering. (To fix an erratum with NT loads.)

  • https://preshing.com/20120913/acquire-and-release-semantics/

Why does this `std::atomic_thread_fence` work

So my major question is how can _Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst); create a full barrier MFENCE

This compiles to an xchg instruction with a memory destination. This is a full memory barrier (draining the store buffer) exactly¹ like mfence.

With compiler barriers before and after that, compile-time reordering around it is also prevented. Therefore all reordering in either direction is prevented (of operations on atomic and non-atomic C++ objects), making it more than strong enough to do everything that ISO C++ atomic_thread_fence(mo_seq_cst) promises.


For orders weaker than seq_cst, only a compiler barrier is needed. x86's hardware memory-ordering model is program-order + a store buffer with store forwarding. That's strong enough for acq_rel without the compiler emitting any special asm instructions, just blocking compile-time reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/
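
To see that concretely, something like the following typically compiles to zero barrier instructions on x86; the fence only restricts compile-time reordering (a sketch with made-up variable names):

#include <atomic>
std::atomic<int> x, y;

void acq_rel_is_free_on_x86() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acq_rel); // no instruction emitted on x86
    int r = y.load(std::memory_order_relaxed);
    (void)r;
    // Only atomic_thread_fence(seq_cst) needs mfence or a locked operation here.
}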


Footnote 1: exactly enough for the purposes of std::atomic. Weakly ordered MOVNTDQA loads from WC memory may not be as strictly ordered by locked instructions as by MFENCE.

  • Which is a better write barrier on x86: lock+addl or xchgl?
  • Does lock xchg have the same behavior as mfence? - equivalent for std::atomic purposes, but some minor differences that might matter for a device driver using WC memory regions. And perf differences. Notably on Skylake where mfence blocks OoO exec like lfence
  • Why is LOCK a full barrier on x86?

Atomic read-modify-write (RMW) operations on x86 are only possible with a lock prefix, or with xchg with memory, which behaves that way even without a lock prefix in the machine code. A lock-prefixed instruction (or xchg with mem) is always a full memory barrier.

Using an instruction like lock add dword [esp], 0 as a substitute for mfence is a well-known technique. (And performs better on some CPUs.) This MSVC code is the same idea, but instead of a no-op on whatever the stack pointer is pointing-to, it does an xchg on a dummy variable. It doesn't actually matter where it is, but a cache line that's only ever accessed by the current core and is already hot in cache is the best choice for performance.

Using a static shared variable that all cores will contend for access to is the worst possible choice; this code is terrible! Interacting with the same cache line as other cores is not necessary to control the order of this core's operations on its own L1d cache. This is completely bonkers. MSVC still apparently uses this horrible code in its implementation of std::atomic_thread_fence(), even for x86-64 where mfence is guaranteed available. (Godbolt with MSVC 19.14)

If you're doing a seq_cst store, your options are mov+mfence (gcc does this) or doing the store and the barrier with a single xchg (clang and MSVC do this, so the codegen is fine, no shared dummy var).


Much of the early part of this question (stating "facts") seems wrong and contains some misinterpretations or things that are so misguided they're not even wrong.

std::memory_order_seq_cst makes no guarantee to prevent STORE-LOAD reorder.

C++ guarantees order using a totally different model, where acquire loads that see a value from a release store "synchronize with" it, and later operations in the C++ source are guaranteed to see all the stores from code before the release store.

It also guarantees that there's a total order of all seq_cst operations even across different objects. (Weaker orders allow a thread to reload its own stores before they become globally visible, i.e. store forwarding. That's why only seq_cst has to drain the store buffer. They also allow IRIW reordering. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)

Concepts like StoreLoad reordering are based on a model where:

  • All inter-core communication is via committing stores to cache-coherent shared memory
  • Reordering happens inside one core between its own accesses to cache. e.g. by the store buffer delaying store visibility until after later loads like x86 allows. (Except a core can see its own stores early via store forwarding.)

In terms of this model, seq_cst does require draining the store buffer at some point between a seq_cst store and a later seq_cst load. The efficient way to implement this is to put a full barrier after seq_cst stores. (Instead of before every seq_cst load. Cheap loads are more important than cheap stores.)
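
The classic store-buffer litmus test makes that concrete (a sketch; thread creation omitted). With seq_cst, at least one thread is guaranteed to see the other's store; the full barrier after each store is what rules out the StoreLoad reordering that would let both loads read 0.

#include <atomic>
std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1() {
    X.store(1, std::memory_order_seq_cst);   // xchg (or mov + mfence)
    r1 = Y.load(std::memory_order_seq_cst);  // plain mov load on x86
}

void thread2() {
    Y.store(1, std::memory_order_seq_cst);
    r2 = X.load(std::memory_order_seq_cst);
}
// r1 == 0 && r2 == 0 is impossible with seq_cst; with plain release stores
// and acquire loads, the store buffer can make both loads read 0.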

On an ISA like AArch64, there are load-acquire and store-release instructions which actually have sequential-release semantics, unlike x86 loads/stores which are "only" regular release. (So AArch64 seq_cst doesn't need a separate barrier; a microarchitecture could delay draining the store buffer unless / until a load-acquire executes while there's still a store-release not committed to L1d cache yet.) Other ISAs generally need a full barrier instruction to drain the store buffer after a seq_cst store.

Of course even AArch64 needs a full barrier instruction for a seq_cst fence, unlike a seq_cst load or store operation.



std::atomic_thread_fence(memory_order_seq_cst) always generates a full-barrier

In practice yes.

So I can always replace asm volatile("mfence" ::: "memory") with std::atomic_thread_fence(memory_order_seq_cst)

In practice yes, but in theory an implementation could maybe allow some reordering of non-atomic operations around std::atomic_thread_fence and still be standards-compliant. Always is a very strong word.

ISO C++ only guarantees anything when there are std::atomic load or store operations involved. GNU C++ would let you roll your own atomic operations out of asm("" ::: "memory") compiler barriers (acq_rel) and asm("mfence" ::: "memory") full barriers. Converting that to ISO C++ signal_fence and thread_fence would leave a "portable" ISO C++ program that has data-race UB and thus no guarantee of anything.

(Although note that rolling your own atomics should use at least volatile, not just barriers, to make sure the compiler doesn't invent multiple loads, even if you avoid the obvious problem of having loads hoisted out of a loop. Who's afraid of a big bad optimizing compiler?).
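
For example, a hand-rolled GNU C++ load along these lines (a sketch only; std::atomic is still the right tool) relies on volatile to keep the compiler from inventing or merging loads, with the empty asm acting purely as a compile-time barrier:

// "Acquire-ish" hand-rolled load: volatile guarantees exactly one load is
// emitted; x86 loads are already acquire at run time, so no fence
// instruction is needed, only the compiler barrier.
inline int handrolled_load_acquire(const volatile int *p) {
    int v = *p;
    asm volatile("" ::: "memory");
    return v;
}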


Always remember that what an implementation does has to be at least as strong as what ISO C++ guarantees. That often ends up being stronger.

Does STLR(B) provide sequential consistency on ARM64?

Yup, stlr is store-release on its own, and ldar can't pass an earlier stlr (i.e. no StoreLoad reordering) - that interaction between them satisfies that part of the seq_cst requirements which acq / rel doesn't have. (ARMv8.3 ldapr is like ldar without that interaction, being only a plain acquire load, allowing more efficient acq_rel.)

So on ARMv8.3, the difference between seq_cst and acq / rel is in the load side. ARMv8 before 8.3 can't do acq / rel while still allowing StoreLoad reordering, so it's unfortunately slow if you acquire-load something else after a release-store. ARMv8.3 ldapr fixes that, making acq / rel as efficient as on x86.

On x86, everything is an acquire load or release store (so acq_rel is free), and the least-bad way to achieve sequential consistency is by doing a full barrier on seq_cst stores. (You want atomic loads to be cheap, and it's going to be common for code to use the default seq_cst memory order.)

(C/C++11 mappings to processors discusses that tradeoff of wanting cheap loads, if you have to pick either load or store to attach the full barrier to.)
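
A hedged summary of that mapping for AArch64, with typical compiler output in comments (exact instructions depend on compiler version and target features):

#include <atomic>
std::atomic<int> a;

void aarch64_mapping() {
    a.store(1, std::memory_order_seq_cst);       // stlr  (sequential-release store)
    int r1 = a.load(std::memory_order_seq_cst);  // ldar  (can't pass an earlier stlr)
    int r2 = a.load(std::memory_order_acquire);  // ldar, or ldapr on CPUs with RCpc (ARMv8.3)
    (void)r1; (void)r2;
}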


Separately, the IRIW litmus test (all threads agreeing on the order of independent stores) is guaranteed by the ARMv8 memory model even for relaxed stores. It's guaranteed to be "multicopy-atomic", which means that when a store becomes visible to any other core, it becomes visible to all other cores at the same time. This is sufficient for all cores to agree on a total order for all stores, to the limits of anything they can observe via two acquire loads.
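
Sketched as C++ (thread creation omitted), the IRIW litmus test looks like this; on ARMv8 hardware the two readers can never disagree about which store happened first, even though ISO C++ only forbids that outcome when everything is seq_cst:

#include <atomic>
std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void writer1() { x.store(1, std::memory_order_relaxed); }
void writer2() { y.store(1, std::memory_order_relaxed); }

void reader1() {
    r1 = x.load(std::memory_order_acquire);
    r2 = y.load(std::memory_order_acquire);
}
void reader2() {
    r3 = y.load(std::memory_order_acquire);
    r4 = x.load(std::memory_order_acquire);
}
// IRIW reordering is the outcome r1==1, r2==0, r3==1, r4==0: reader1 thinks
// x was stored first, reader2 thinks y was.  ARMv8's multicopy atomicity
// rules this out even for these relaxed stores.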

In practical terms, that means stores only become visible by committing to L1d cache, which is coherent. Not, for example, by store-forwarding between logical cores sharing a physical core, the mechanism for IRIW reordering on the few POWER CPUs that can produce the effect in real life. ARMv8 originally allowed that on paper, but no ARM CPUs ever did. They strengthened the memory model to simply guarantee that no future CPU would be weird like that. See Simplifying ARM Concurrency: Multicopy-atomic Axiomatic and Operational Models for ARMv8 for details.

Note that this guarantee of all threads being able to agree on an order applies to all stores on ARM64, including relaxed. (There are very few HW mechanisms that can create it, in a machine with coherent shared memory, so it's only on rare ISAs that seq_cst has to actually do anything specific to prevent it.)

x86's TSO (Total Store Order) memory model has the required property right in the name. And yes, it's much stronger, basically program-order plus a store-buffer with store-forwarding. (So this allows StoreLoad reordering, and for a core to see its own stores before they're globally visible, but nothing else. Ignoring NT stores, and NT loads from WC memory such as video RAM...)

Atomic operations, std::atomic and ordering of writes

I put your example on the Godbolt compiler explorer, and added some functions to read, increment, or combine (a+=b) two atomic variables. I also used a.store(1, memory_order_release); instead of a = 1; to avoid getting more ordering than needed, so it's just a simple store on x86.
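
The question's code isn't reproduced here, but it was presumably something along these lines (a reconstruction, with the release-store substitution just described):

#include <atomic>
std::atomic<int> a{0}, b{0};

void producer() {
    b = 2;                                   // seq_cst store by default
    a.store(1, std::memory_order_release);   // instead of a = 1
}
// Any reader that sees a == 1 is guaranteed to also see b == 2.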

See below for (hopefully correct) explanations. update: I had "release" semantics confused with just a StoreStore barrier. I think I fixed all the mistakes, but may have left some.


The easy question first:

Is the write to ‘a’ guaranteed to be an atomic?

Yes, any thread reading a will get either the old or the new value, not some half-written value. This happens for free on x86 and most other architectures with any aligned type that fits in a register. (e.g. not int64_t on 32bit.) Thus, on many systems, this happens to be true for b as well, the way most compilers would generate code.

There are some types of stores that may not be atomic on an x86, including unaligned stores that cross a cache line boundary. But std::atomic of course guarantees whatever alignment is necessary.

Read-modify-write operations are where this gets interesting. 1000 evaluations of a += 3 done across multiple threads at once will always add exactly 3000 to a. You'd potentially lose some of the increments if a weren't atomic.
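
A runnable sketch of that (the thread count and iteration split are arbitrary, chosen to give 1000 evaluations total):

#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> a{0};

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([] {
            for (int i = 0; i < 250; ++i)
                a += 3;            // atomic RMW: lock add on x86
        });
    for (auto &th : threads) th.join();
    // a == 3000 every time; with a plain int, some increments could be lost.
}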

Fun fact: signed atomic types guarantee two's complement wraparound, unlike normal signed types. C and C++ still cling to the idea of leaving signed integer overflow undefined in other cases. Some CPUs don't have arithmetic right shift, so leaving right-shift of negative numbers undefined makes some sense, but otherwise it just feels like a ridiculous hoop to jump through now that all CPUs use 2's complement and 8bit bytes. </rant>


Is any other thread reading ‘a’ as 1 guaranteed to read ‘b’ as 2.

Yes, because of the guarantees provided by std::atomic.

Now we're getting into the memory model of the language, and the hardware it runs on.

C11 and C++11 have a very weak memory ordering model, which means the compiler is allowed to reorder memory operations unless you tell it not to. (source: Jeff Preshing's Weak vs. Strong Memory Models). Even if x86 is your target machine, you have to stop the compiler from re-ordering stores at compile time. (e.g. normally you'd want the compiler to hoist a = 1 out of a loop that also writes to b.)

Using C++11 atomic types gives you full sequential-consistency ordering of operations on them with respect to the rest of the program, by default. This means they're a lot more than just atomic. See below for relaxing the ordering to just what's needed, which avoids expensive fence operations.


Why does the MFENCE happen after the write to ‘a’ not before.

StoreStore fences are a no-op with x86's strong memory model, so the compiler just has to put the store to b before the store to a to implement the source code ordering.

Full sequential consistency also requires that the store be globally ordered / globally visible before any later loads in program order.

x86 can reorder a store with a later load (StoreLoad reordering). In practice, what happens is that out-of-order execution sees an independent load in the instruction stream and executes it ahead of a store that was still waiting for its data to be ready. Anyway, sequential consistency forbids this, so gcc uses MFENCE, which is a full barrier, including StoreLoad, the only kind x86 doesn't give you for free. (LFENCE/SFENCE are only useful for weakly-ordered operations like movnt.)

Another way to put this is the way the C++ docs phrase it: sequential consistency guarantees that all threads see all changes in the same order. The MFENCE after every atomic store ensures that this thread's later loads can't happen until our store is visible to other threads; otherwise, our loads could see our own stores before other threads' loads saw them. A StoreLoad barrier (MFENCE) delays our loads until after the stores that need to happen first.

The ARM32 asm for b=2; a=1; is:

# get pointers and constants into registers
str r1, [r3] # store b=2
dmb sy # Data Memory Barrier: full memory barrier to order the stores.
# I think just a StoreStore barrier here (dmb st) would be sufficient, but gcc doesn't do that. Maybe later versions have that optimization, or maybe I'm wrong.
str r2, [r3, #4] # store a=1 (a is 4 bytes after b)
dmb sy # full memory barrier to order this store wrt. all following loads and stores.

I don't know ARM asm, but what I've figured out so far is that normally it's op dest, src1 [,src2], but loads and stores always have the register operand first and the memory operand 2nd. This is really weird if you're used to x86, where a memory operand can be the source or dest for most non-vector instructions. Loading immediate constants also takes a lot of instructions, because the fixed instruction length only leaves room for 16b of payload for movw (move word) / movt (move top).


Release / Acquire

The release and acquire naming for one-way memory barriers comes from locks:

  • One thread modifies a shared data structure, then releases a lock. The unlock has to be globally visible after all the loads/stores to data it's protecting. (StoreStore + LoadStore)
  • Another thread acquires the lock (with a load, or the load side of an atomic RMW), and must do all its loads/stores to the shared data structure after the acquire load. (LoadLoad + LoadStore)

Note that std::atomic uses these names even for standalone fences, which are slightly different from load-acquire or store-release operations. (See atomic_thread_fence, below.)

Release/Acquire semantics are stronger than what producer-consumer requires. That just requires one-way StoreStore (producer) and one-way LoadLoad (consumer), without LoadStore ordering.

A shared hash table protected by a readers/writers lock (for example) requires an acquire-load / release-store atomic read-modify-write operation to acquire the lock. x86 lock xadd is a full barrier (including StoreLoad), but ARM64 has load-acquire/store-release versions of load-linked/store-conditional for doing atomic read-modify-writes. As I understand it, this avoids the need for a StoreLoad barrier even for locking.


Using weaker but still sufficient ordering

Writes to std::atomic types are ordered with respect to every other memory access in source code (both loads and stores), by default. You can control what ordering is imposed with std::memory_order.

In your case, you only need your producer to make sure stores become globally visible in the correct order, i.e. a StoreStore barrier before the store to a. store(memory_order_release) includes this and more. std::atomic_thread_fence(memory_order_release) is just a 1-way StoreStore barrier for all stores. x86 does StoreStore for free, so all the compiler has to do is put the stores in source order.

Release instead of seq_cst will be a big performance win, esp. on architectures like x86 where release is cheap/free. This is even more true if the no-contention case is common.

Reading atomic variables also imposes, by default, full sequential consistency of the load with respect to all other loads and stores. On x86, this is free: LoadLoad and LoadStore barriers are no-ops and implicit in every memory operation. You can make your code more efficient on weakly-ordered ISAs by using a.load(std::memory_order_acquire). The sketch below puts the release-store and acquire-load sides together.
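
Putting the two sides together, the weaker-but-sufficient version of the example looks something like this (a sketch using the same variables as above):

#include <atomic>
std::atomic<int> a{0}, b{0};

void producer() {
    b.store(2, std::memory_order_relaxed);   // ordinary store, no barrier needed on x86
    a.store(1, std::memory_order_release);   // keeps the store to b visible first
}

bool consumer() {
    if (a.load(std::memory_order_acquire) == 1)          // pairs with the release store
        return b.load(std::memory_order_relaxed) == 2;   // guaranteed true
    return false;   // a not published yet; try again later
}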


