Why GCC Does Not Use Load (Without Fence) and Store+SFENCE for Sequential Consistency

Does the Intel Memory Model make SFENCE and LFENCE redundant?

Right, LFENCE and SFENCE are not useful in normal code because x86's acquire / release semantics for regular loads and stores make them redundant unless you're using other special instructions or memory types.

The only fence that matters for normal lockless code is the full barrier (including StoreLoad) from a locked instruction, or a slow MFENCE. Prefer xchg over mov+mfence for sequential-consistency stores because it's faster. (Related: Are loads and stores the only instructions that gets reordered?)

Does `xchg` encompass `mfence` assuming no non-temporal instructions? (yes, even with NT instructions, as long as there's no WC memory.)


Jeff Preshing's Memory Reordering Caught in the Act article is an easier-to-read description of the same case Bartosz's post talks about, where you need a StoreLoad barrier like MFENCE. Only MFENCE will do; you can't construct MFENCE out of SFENCE + LFENCE. (Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?)
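As a minimal sketch of the litmus test those articles demonstrate (illustrative names, not code from either post):

#include <atomic>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1() {
    X.store(1, std::memory_order_relaxed);   // plain mov on x86
    // Without a StoreLoad barrier (mfence or a locked instruction) here,
    // the store can still be sitting in the store buffer when the load below executes.
    r1 = Y.load(std::memory_order_relaxed);
}

void thread2() {
    Y.store(1, std::memory_order_relaxed);
    r2 = X.load(std::memory_order_relaxed);
}

// Run concurrently, both threads can finish with r1 == 0 && r2 == 0, even on x86.
// Making the operations seq_cst (or adding mfence between store and load) forbids that result.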

If you had questions after reading the link you posted, read Jeff Preshing's other blog posts. They gave me a good understanding of the subject. :) Although I think I found the tidbit about SFENCE/LFENCE normally being a no-op in Doug Lea's page. Jeff's posts didn't consider NT loads/stores.


Related: When should I use _mm_sfence, _mm_lfence and _mm_mfence (my answer and @BeeOnRope's answer there are both good. I wrote my answer a lot longer ago than that one, so parts of it show my inexperience from years back. My answer there considers the C++ intrinsics and C++ compile-time memory ordering, which is not at all the same thing as x86 asm runtime memory ordering. But you still don't want _mm_lfence().)


SFENCE is only relevant when using movnt (Non-Temporal) streaming stores, or working with memory regions with a type set to something other than the normal Write-Back. Or with clflushopt, which is kind of like a weakly-ordered store. NT stores bypass the cache as well as being weakly ordered. x86's normal memory model is strongly ordered, other than NT stores, WC (write-combining) memory, and ERMSB string ops (see below).
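For example, a sketch of the NT-store-then-publish pattern (assuming SSE, a 16-byte-aligned destination, and n being a multiple of 4; the names are made up):

#include <immintrin.h>
#include <atomic>
#include <cstddef>

// Copy with cache-bypassing NT stores, then publish. Without the sfence, the
// weakly-ordered movntps stores could become globally visible after the flag.
void stream_and_publish(float* dst, const float* src, std::size_t n, std::atomic<bool>& ready)
{
    for (std::size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));   // movntps: weakly ordered, bypasses cache
    _mm_sfence();                                        // order the NT stores before the flag store
    ready.store(true, std::memory_order_release);
}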

LFENCE is only useful for memory ordering with weakly-ordered loads, which are very rare. (Or possibly for LoadStore ordering with regular loads before NT stores?)

NT loads (movntdqa) from WB memory are still strongly ordered, even on a hypothetical future CPU that doesn't ignore the NT hint; the only way to do weakly-ordered loads on x86 is when reading from weakly-ordered memory (WC), and then I think only with movntdqa. This doesn't happen by accident in "normal" programs, so you only have to worry about this if you mmap video RAM or something.

(The primary use-case for lfence is not memory ordering at all, it's for serializing instruction execution, e.g. for Spectre mitigation, or with RDTSC. See Is LFENCE serializing on AMD processors? and the "linked questions" sidebar for that question.)


Memory ordering in C++, and how it maps to x86 asm

I got curious about this a couple weeks ago, and posted a fairly detailed answer to a recent question:
Atomic operations, std::atomic<> and ordering of writes. I included lots of links to stuff about the memory model of C++ vs. hardware memory models.

If you're writing in C++, using std::atomic<> is an excellent way to tell the compiler what ordering requirements you have, so it doesn't reorder your memory operations at compile time. You can and should use weaker release or acquire semantics where appropriate, instead of the default sequential consistency, so the compiler doesn't have to emit any barrier instructions at all on x86. It just has to keep the ops in source order.
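A minimal sketch of that idea (illustrative names):

#include <atomic>

int payload;                        // plain, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;
    ready.store(true, std::memory_order_release);       // on x86: just a mov store, no barrier instruction
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {}   // on x86: just a mov load
    return payload;                                      // guaranteed to see 42
}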


On a weakly ordered architecture like ARM or PPC, or x86 with movnt, you need a StoreStore barrier instruction between writing a buffer and setting a flag to indicate the data is ready. Also, the reader needs a LoadLoad barrier instruction between checking the flag and reading the buffer.

Not counting movnt, x86 already has LoadLoad barriers between every load, and StoreStore barriers between every store. (LoadStore ordering is also guaranteed). MFENCE is all 4 kinds of barriers, including StoreLoad, which is the only barrier x86 doesn't do by default. MFENCE makes sure loads don't use old prefetched values from before other threads saw your stores and potentially did stores of their own. (As well as being a barrier for NT store ordering and load ordering.)

Fun fact: x86 lock-prefixed instructions are also full memory barriers. They can be used as a substitute for MFENCE in old 32bit code that might run on CPUs not supporting it. lock add [esp], 0 is otherwise a no-op, and does the read/modify/write cycle on memory that's very likely hot in L1 cache and already in the M state of the MESI coherency protocol.

SFENCE is a StoreStore barrier. It's useful after NT stores to create release semantics for a following store.

LFENCE is a LoadLoad and also a LoadStore barrier. It's almost always irrelevant as a memory barrier because the only weakly-ordered loads on x86 are NT loads from WC memory. (loadNT / LFENCE / storeNT prevents the store from becoming globally visible before the load. I think this could happen in practice if the load address was the result of a long dependency chain, or the result of another load that missed in cache.)


ERMSB string operations

Fun fact #2 (thanks @EOF): The stores from ERMSB (Enhanced rep movsb/rep stosb on IvyBridge and later) are weakly-ordered (but not cache-bypassing). ERMSB builds on regular Fast-String Ops (wide stores from the microcoded implementation of rep stos/movsb that's been around since PPro).

Intel documents the fact that ERMSB stores "may appear to execute out of order" in section 7.3.9.3 of their Software Developer's Manual, vol. 1. They also say

"Order-dependent code should write to a discrete semaphore variable
after any string operations to allow correctly ordered data to be seen
by all processors"

They don't mention any barrier instructions being necessary between the rep movsb and the store to a data_ready flag.

The way I read it, there's an implicit SFENCE after rep stosb / rep movsb (at least a fence for the string data, probably not other in-flight weakly ordered NT stores). Anyway, the wording implies that a write to the flag / semaphore becomes globally visible after all the string-move writes, so no SFENCE / LFENCE is needed in code that fills a buffer with a fast-string op and then writes a flag, or in code that reads it.

(LoadLoad ordering always happens, so you always see data in the order that other CPUs made it globally visible. i.e. using weakly-ordered stores to write a buffer doesn't change the fact that loads in other threads are still strongly ordered.)

Summary: use a normal store to write a flag indicating that a buffer is ready. Don't have readers just check the last byte of the block written with memset/memcpy.

I also think ERMSB stores prevent any later stores from passing them, so you still only need SFENCE if you're using movNT. i.e. the rep stosb as a whole has release semantics wrt. earlier instructions.
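A sketch of that buffer-then-flag pattern (assuming the library memcpy uses rep movsb internally; the names are made up):

#include <atomic>
#include <cstring>

char buf[4096];
std::atomic<bool> data_ready{false};

void fill_and_publish(const char* src) {
    std::memcpy(buf, src, sizeof(buf));                  // may be rep movsb (ERMSB): weakly-ordered stores
    data_ready.store(true, std::memory_order_release);   // plain mov; the string-op stores can't pass it
}

// Readers should spin on data_ready with an acquire load and then read buf,
// not poll the last byte of buf itself.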

There's an MSR bit that can be cleared to disable ERMSB, for the benefit of new servers that need to run old binaries that write a "data ready" flag as part of a rep stosb or rep movsb or something. (In that case I guess you get the old fast-string microcode that may use an efficient cache protocol, but does make all the stores appear to other cores in order).

Why does a std::atomic store with sequential consistency use XCHG?

mov-store + mfence and xchg are both valid ways to implement a sequential-consistency store on x86. The implicit lock prefix on an xchg with memory makes it a full memory barrier, like all atomic RMW operations on x86.

(x86's memory-ordering rules essentially make that full-barrier effect the only option for any atomic RMW: it's both a load and a store at the same time, stuck together in the global order. Atomicity requires that the load and store aren't separated by just queuing the store into the store buffer so it has to be drained, and load-load ordering of the load side requires that it not reorder.)

Plain mov is not sufficient; it only has release semantics, not sequential-release. (Unlike AArch64's stlr instruction, which does do a sequential-release store that can't reorder with later ldar sequential-acquire loads. This choice is obviously motivated by C++11 having seq_cst as the default memory ordering. But AArch64's normal store is much weaker: relaxed, not release.)

See Jeff Preshing's article on acquire / release semantics, and note that regular release stores (like mov or any non-locked x86 memory-destination instruction other than xchg) allow reordering with later operations, including acquire loads (like mov or any x86 memory-source operand). e.g. If the release-store is releasing a lock, it's ok for later stuff to appear to happen inside the critical section.
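For example, a bare-bones spinlock (a sketch, not production code) shows both halves of that:

#include <atomic>

std::atomic<bool> locked{false};

void lock() {
    while (locked.exchange(true, std::memory_order_acquire)) {}   // xchg: atomic RMW, a full barrier on x86
}

void unlock() {
    locked.store(false, std::memory_order_release);               // plain mov on x86
}

// Accesses after unlock() may become globally visible before the unlocking store,
// i.e. they can appear to move into the critical section, but never out of it.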


There are performance differences between mfence and xchg on different CPUs, and maybe in the hot vs. cold cache and contended vs. uncontended cases. And/or for throughput of many operations back to back in the same thread vs. for one on its own, and for allowing surrounding code to overlap execution with the atomic operation.

See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for actual benchmarks of mfence vs. lock addl $0, -8(%rsp) vs. (%rsp) as a full barrier (when you don't already have a store to do).

On Intel Skylake hardware, mfence blocks out-of-order execution of independent ALU instructions, but xchg doesn't. (See my test asm + results in the bottom of this SO answer). Intel's manuals don't require it to be that strong; only lfence is documented to do that. But as an implementation detail, it's very expensive for out-of-order execution of surrounding code on Skylake.

I haven't tested other CPUs, and this may be a result of a microcode fix for erratum SKL079 (MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions). The existence of the erratum basically proves that SKL used to be able to execute instructions after MFENCE. I wouldn't be surprised if they fixed it by making MFENCE stronger in microcode, kind of a blunt instrument approach that significantly increases the impact on surrounding code.

I've only tested the single-threaded case where the cache line is hot in L1d cache. (Not when it's cold in memory, or when it's in Modified state on another core.) xchg has to load the previous value, creating a "false" dependency on the old value that was in memory. But mfence forces the CPU to wait until previous stores commit to L1d, which also requires the cache line to arrive (and be in M state). So they're probably about equal in that respect, but Intel's mfence forces everything to wait, not just loads.

AMD's optimization manual recommends xchg for atomic seq-cst stores. I thought Intel recommended mov + mfence, which older gcc uses, but Intel's compiler also uses xchg here.

When I tested, I got better throughput on Skylake for xchg than for mov+mfence in a single-threaded loop on the same location repeatedly. See Agner Fog's microarch guide and instruction tables for some details, but he doesn't spend much time on locked operations.

See gcc/clang/ICC/MSVC output on the Godbolt compiler explorer for a C++11 seq-cst my_atomic = 4; gcc uses mov + mfence when SSE2 is available. (use -m32 -mno-sse2 to get gcc to use xchg too). The other 3 compilers all prefer xchg with default tuning, or for znver1 (Ryzen) or skylake.

The Linux kernel uses xchg for __smp_store_mb().

Update: recent GCC (like GCC10) changed to using xchg for seq-cst stores like other compilers do, even when SSE2 for mfence is available.
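Roughly, the two mappings look like this (the asm in the comments summarizes the compiler output described above; exact output varies by compiler version and tuning):

#include <atomic>

std::atomic<int> my_atomic;

void seq_cst_store() {
    my_atomic = 4;   // seq_cst by default
    // older gcc:                        mov DWORD PTR my_atomic[rip], 4
    //                                   mfence
    // clang / ICC / MSVC / gcc 10+:     mov eax, 4
    //                                   xchg eax, DWORD PTR my_atomic[rip]
}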


Another interesting question is how to compile atomic_thread_fence(mo_seq_cst);. The obvious option is mfence, but lock or dword [rsp], 0 is another valid option (and used by gcc -m32 when MFENCE isn't available). The bottom of the stack is usually already hot in cache in M state. The downside is introducing latency if a local was stored there. (If it's just a return address, return-address prediction is usually very good so delaying ret's ability to read it is not much of a problem.) So lock or dword [rsp-4], 0 could be worth considering in some cases. (gcc did consider it, but reverted it because it makes valgrind unhappy. This was before it was known that it might be better than mfence even when mfence was available.)

All compilers currently use mfence for a stand-alone barrier when it's available. Those are rare in C++11 code, but more research is needed on what's actually most efficient for real multi-threaded code that has real work going on inside the threads that are communicating locklessly.

But multiple sources recommend using lock add to the stack as a barrier instead of mfence, so the Linux kernel recently switched to using it for the smp_mb() implementation on x86, even when SSE2 is available. (A minimal sketch of that technique follows the kernel-version list below.)

See https://groups.google.com/d/msg/fa.linux.kernel/hNOoIZc6I9E/pVO3hB5ABAAJ for some discussion, including a mention of some errata for HSW/BDW about movntdqa loads from WC memory passing earlier locked instructions. (Opposite of Skylake, where it was mfence instead of locked instructions that were a problem. But unlike SKL, there's no fix in microcode. This may be why Linux still uses mfence for its mb() for drivers, in case anything ever uses NT loads to copy back from video RAM or something but can't let the reads happen until after an earlier store is visible.)

  • In Linux 4.14, smp_mb() uses mb(). That uses mfence if available, otherwise lock addl $0, 0(%esp).

    __smp_store_mb (store + memory barrier) uses xchg (and that doesn't change in later kernels).

  • In Linux 4.15, smp_mb() uses lock; addl $0,-4(%esp) (or -4(%rsp) on 64-bit) instead of using mb(). (The kernel doesn't use a red-zone even in 64-bit, so the -4 may help avoid extra latency for local vars).

    mb() is used by drivers to order access to MMIO regions, but smp_mb() turns into a no-op when compiled for a uniprocessor system. Changing mb() is riskier because it's harder to test (affects drivers), and CPUs have errata related to lock vs. mfence. But anyway, mb() uses mfence if available, else lock addl $0, -4(%esp). The only change is the -4.

  • In Linux 4.16, no change except removing the #if defined(CONFIG_X86_PPRO_FENCE) which defined stuff for a more weakly-ordered memory model than the x86-TSO model that modern hardware implements.
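For reference, a minimal x86-64 GNU C sketch of that lock add-to-the-stack barrier (hypothetical helper name; adding zero leaves the memory unchanged, and the locked RMW is what provides the full barrier):

// Full memory barrier via a locked no-op RMW on the stack, the same idea as the kernel's smp_mb().
// In user space, -4(%rsp) can fall in the red zone where a local may live; adding 0 doesn't change
// the value, it only risks the extra latency discussed above.
static inline void full_barrier_lock_add(void) {
    asm volatile("lock addl $0, -4(%%rsp)" ::: "memory", "cc");
}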



x86 & x86_64. Where a store has an implicit acquire fence

You mean release, I hope. my_atomic.store(1, std::memory_order_acquire); won't compile, because write-only atomic operations can't be acquire operations. See also Jeff Preshing's article on acquire/release semantics.

Or asm volatile("" ::: "memory");

No, that's a compiler barrier only; it prevents all compile-time reordering across it, but doesn't prevent runtime StoreLoad reordering, i.e. the store being buffered until later, and not appearing in the global order until after a later load. (StoreLoad is the only kind of runtime reordering x86 allows.)

Anyway, another way to express what you want here is:

my_atomic.store(1, std::memory_order_release);        // mov
// with no operations in between, there's nothing for the release-store to be delayed past
std::atomic_thread_fence(std::memory_order_seq_cst); // mfence

Using a release fence would not be strong enough (it and the release-store could both be delayed past a later load, which is the same thing as saying that release fences don't keep later loads from happening early). An acq_rel fence wouldn't be enough either; only a seq_cst fence includes the StoreLoad barrier that keeps later loads from happening early and keeps the fence from reordering with the release store.

Related: Jeff Preshing's article on fences being different from release operations.

But note that seq-cst is special according to C++11 rules: only seq-cst operations are guaranteed to have a single global / total order which all threads agree on seeing. So emulating them with weaker order + fences might not be exactly equivalent in general on the C++ abstract machine, even if it is on x86. (On x86, all stores have a single total order which all cores agree on. See also Globally Invisible load instructions: loads can take their data from the store buffer, so we can't really say that there's a total order for loads + stores.)

Memory ordering behavior of std::atomic::load

Sigh, this was too long for a comment:

Isn't the meaning of atomic "to appear to occur instantaneously to the rest of the system"?

I'd say yes and no to that one, depending on how you think of it. For writes with SEQ_CST, yes. But as far as how atomic loads are handled, check out 29.3 of the C++11 standard. Specifically, 29.3.3 is really good reading, and 29.3.4 might be specifically what you're looking for:

For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order.

Basically, SEQ_CST forces a global order just like the standard says, but reads can return an old value without violating the 'atomic' constraint.

To accomplish 'getting the absolute latest value' you'll need to perform an operation that forces the hardware coherency protocol to lock (the lock prefix on x86_64). This is what the atomic compare-and-exchange operations do, if you look at the assembly output.

Does STLR(B) provide sequential consistency on ARM64?

Yup, stlr is store-release on its own, and ldar can't pass an earlier stlr (i.e. no StoreLoad reordering) - that interaction between them satisfies that part of the seq_cst requirements which acq / rel doesn't have. (ARMv8.3 ldapr is like ldar without that interaction, being only a plain acquire load, allowing more efficient acq_rel.)

So on ARMv8.3, the difference between seq_cst and acq / rel is in the load side. ARMv8 before 8.3 can't do acq / rel while still allowing StoreLoad reordering, so it's unfortunately slow if you acquire-load something else after a release-store. ARMv8.3 ldapr fixes that, making acq / rel as efficient as on x86.

On x86, everything is an acquire load or release store (so acq_rel is free), and the least-bad way to achieve sequential consistency is by doing a full barrier on seq_cst stores. (You want atomic loads to be cheap, and it's going to be common for code to use the default seq_cst memory order.)

(C/C++11 mappings to processors discusses that tradeoff of wanting cheap loads, if you have to pick either load or store to attach the full barrier to.)
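A quick sketch of how those mappings look from C++ (the mnemonics in the comments reflect the mapping described above, not any particular compiler's exact output):

#include <atomic>

std::atomic<int> x;

void store_seq_cst(int v) { x.store(v, std::memory_order_seq_cst); }    // AArch64: stlr
void store_release(int v) { x.store(v, std::memory_order_release); }    // AArch64: also stlr
int  load_seq_cst()       { return x.load(std::memory_order_seq_cst); } // AArch64: ldar
int  load_acquire()       { return x.load(std::memory_order_acquire); } // ARMv8.3: ldapr; earlier: ldar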


Separately, the IRIW litmus test (all threads agreeing on the order of independent stores) is guaranteed by the ARMv8 memory model even for relaxed stores. It's guaranteed to be "multicopy-atomic", which means that when a store becomes visible to any other core, it becomes visible to all other cores at the same time. This is sufficient for all cores to agree on a total order for all stores, to the limits of anything they can observe via two acquire loads.

In practical terms, that means stores only become visible by committing to L1d cache, which is coherent. Not, for example, by store-forwarding between logical cores sharing a physical core, the mechanism for IRIW reordering on the few POWER CPUs that can produce the effect in real life. ARMv8 originally allowed that on paper, but no ARM CPUs ever did. They strengthened the memory model to simply guarantee that no future CPU would be weird like that. See Simplifying ARM Concurrency: Multicopy-Atomic Axiomatic and Operational Models for ARMv8 for details.

Note that this guarantee of all threads being able to agree on an order applies to all stores on ARM64, including relaxed. (There are very few HW mechanisms that can create it, in a machine with coherent shared memory, so it's only on rare ISAs that seq_cst has to actually do anything specific to prevent it.)

x86's TSO (Total Store Order) memory model has the required property right in the name. And yes, it's much stronger, basically program-order plus a store-buffer with store-forwarding. (So this allows StoreLoad reordering, and for a core to see its own stores before they're globally visible, but nothing else. Ignoring NT stores, and NT loads from WC memory such as video RAM...)

Why does this `std::atomic_thread_fence` work

So my major question is how can _Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst); create a full barrier MFENCE

This compiles to an xchg instruction with a memory destination. This is a full memory barrier (draining the store buffer), exactly¹ like mfence.

With compiler barriers before and after that, compile-time reordering around it is also prevented. Therefore all reordering in either direction is prevented (of operations on atomic and non-atomic C++ objects), making it more than strong enough to do everything that ISO C++ atomic_thread_fence(mo_seq_cst) promises.


For orders weaker than seq_cst, only a compiler barrier is needed. x86's hardware memory-ordering model is program-order + a store buffer with store forwarding. That's strong enough for acq_rel without the compiler emitting any special asm instructions, just blocking compile-time reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/
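Concretely, a sketch of what that means for fence codegen on x86 (the comments describe the typical output discussed in this answer):

#include <atomic>

void acq_rel_fence() {
    std::atomic_thread_fence(std::memory_order_acq_rel);   // x86: no instruction at all, only blocks compile-time reordering
}

void seq_cst_fence() {
    std::atomic_thread_fence(std::memory_order_seq_cst);   // x86: mfence, or a locked RMW like lock add / xchg
}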


Footnote 1: exactly enough for the purposes of std::atomic. Weakly ordered MOVNTDQA loads from WC memory may not be as strictly ordered by locked instructions as by MFENCE.

  • Which is a better write barrier on x86: lock+addl or xchgl?
  • Does lock xchg have the same behavior as mfence? - equivalent for std::atomic purposes, but some minor differences that might matter for a device driver using WC memory regions. And perf differences. Notably on Skylake where mfence blocks OoO exec like lfence
  • Why is LOCK a full barrier on x86?

Atomic read-modify-write (RMW) operations on x86 are only possible with a lock prefix, or with xchg with memory, which behaves like that even without a lock prefix in the machine code. A lock-prefixed instruction (or xchg with mem) is always a full memory barrier.

Using an instruction like lock add dword [esp], 0 as a substitute for mfence is a well-known technique. (And performs better on some CPUs.) This MSVC code is the same idea, but instead of a no-op on whatever the stack pointer is pointing-to, it does an xchg on a dummy variable. It doesn't actually matter where it is, but a cache line that's only ever accessed by the current core and is already hot in cache is the best choice for performance.

Using a static shared variable that all cores will contend for access to is the worst possible choice; this code is terrible! Interacting with the same cache line as other cores is not necessary to control the order of this core's operations on its own L1d cache. This is completely bonkers. MSVC still apparently uses this horrible code in its implementation of std::atomic_thread_fence(), even for x86-64 where mfence is guaranteed available. (Godbolt with MSVC 19.14)

If you're doing a seq_cst store, your options are mov+mfence (gcc does this) or doing the store and the barrier with a single xchg (clang and MSVC do this, so the codegen is fine, no shared dummy var).


Much of the early part of this question (stating "facts") seems wrong and contains some misinterpretations or things that are so misguided they're not even wrong.

std::memory_order_seq_cst makes no guarantee to prevent STORE-LOAD reorder.

C++ guarantees order using a totally different model, where acquire loads that see a value from a release store "synchronize with" it, and later operations in the C++ source are guaranteed to see all the stores from code before the release store.

It also guarantees that there's a total order of all seq_cst operations even across different objects. (Weaker orders allow a thread to reload its own stores before they become globally visible, i.e. store forwarding. That's why only seq_cst has to drain the store buffer. They also allow IRIW reordering. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)

Concepts like StoreLoad reordering are based on a model where:

  • All inter-core communication is via committing stores to cache-coherent shared memory
  • Reordering happens inside one core between its own accesses to cache. e.g. by the store buffer delaying store visibility until after later loads like x86 allows. (Except a core can see its own stores early via store forwarding.)

In terms of this model, seq_cst does require draining the store buffer at some point between a seq_cst store and a later seq_cst load. The efficient way to implement this is to put a full barrier after seq_cst stores. (Instead of before every seq_cst load. Cheap loads are more important than cheap stores.)

On an ISA like AArch64, there are load-acquire and store-release instructions which actually have sequential-release semantics, unlike x86 loads/stores which are "only" regular release. (So AArch64 seq_cst doesn't need a separate barrier; a microarchitecture could delay draining the store buffer unless / until a load-acquire executes while there's still a store-release not committed to L1d cache yet.) Other ISAs generally need a full barrier instruction to drain the store buffer after a seq_cst store.

Of course even AArch64 needs a full barrier instruction for a seq_cst fence, unlike a seq_cst load or store operation.



std::atomic_thread_fence(memory_order_seq_cst) always generates a full-barrier

In practice yes.

So I can always replace asm volatile("mfence" ::: "memory") with std::atomic_thread_fence(memory_order_seq_cst)

In practice yes, but in theory an implementation could maybe allow some reordering of non-atomic operations around std::atomic_thread_fence and still be standards-compliant. Always is a very strong word.

ISO C++ only guarantees anything when there are std::atomic load or store operations involved. GNU C++ would let you roll your own atomic operations out of asm("" ::: "memory") compiler barriers (acq_rel) and asm("mfence" ::: "memory") full barriers. Converting that to ISO C++ signal_fence and thread_fence would leave a "portable" ISO C++ program that has data-race UB and thus no guarantee of anything.

(Although note that rolling your own atomics should use at least volatile, not just barriers, to make sure the compiler doesn't invent multiple loads, even if you avoid the obvious problem of having loads hoisted out of a loop. Who's afraid of a big bad optimizing compiler?).
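Purely for illustration, a hand-rolled GNU C sketch along those lines (names are made up; as the paragraph above says, use std::atomic in real code):

// volatile stops the compiler from inventing, eliding, or hoisting accesses;
// the empty asm statement is a compile-time-only barrier; mfence adds the runtime StoreLoad barrier.
static volatile int my_flag;

static inline void compiler_barrier() { asm volatile("" ::: "memory"); }
static inline void full_barrier()     { asm volatile("mfence" ::: "memory"); }

void publish() {
    compiler_barrier();   // keep earlier plain stores ahead of the flag store at compile time
    my_flag = 1;
    full_barrier();       // drain the store buffer before any later loads
}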


Always remember that what an implementation does has to be at least as strong as what ISO C++ guarantees. That often ends up being stronger.


