C++ How is release-and-acquire achieved on x86 only using MOV?

But this is for a single core. The multi-core section does not seem to mention how loads are enforced:

The first bullet point in that section is key: Individual processors use the same ordering principles as in a single-processor system. The implicit part of that statement is ... when loading/storing from cache-coherent shared memory. I.e. multi-processor systems don't introduce new kinds of reordering; they just mean the possible observers now include code on other cores instead of just DMA / IO devices.

The model for reordering of access to shared memory is the single-core model, i.e. program order + a store buffer = basically acq_rel. Actually slightly stronger than acq_rel, which is fine.

The only reordering that happens is local, within each CPU core. Once a store becomes globally visible, it becomes visible to all other cores at the same time, and didn't become visible to any core before that. (Except to the core doing the store, via store forwarding.) That's why only local barriers are sufficient to recover sequential consistency on top of an SC + store-buffer model. (For x86, mo_seq_cst just needs mfence after seq_cst stores, to drain the store buffer before any later loads can execute. mfence and locked instructions (which are also full barriers) don't have to bother other cores; they just make this one wait.)
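To make that concrete, here's a sketch of the usual GCC/Clang mapping on x86-64 (the asm in the comments is typical compiler output, not something the standard guarantees; check your own compiler):

#include <atomic>

std::atomic<int> x{0};

void release_store(int v) {
    x.store(v, std::memory_order_release);     // typically: mov %edi, x(%rip)  -- plain store
}

void seq_cst_store(int v) {
    x.store(v, std::memory_order_seq_cst);     // typically: xchg (or mov + mfence) -- drains the store buffer
}

int acquire_load() {
    return x.load(std::memory_order_acquire);  // typically: mov x(%rip), %eax  -- plain load
}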

One key point to understand is that there is a coherent shared view of memory (through coherent caches) that all processors share. The very top of chapter 8 of Intel's SDM defines some of this background:

These multiprocessing mechanisms have the following characteristics:

  • To maintain system memory coherency — When two or more processors are attempting simultaneously to
    access the same address in system memory, some communication mechanism or memory access protocol
    must be available to promote data coherency and, in some instances, to allow one processor to temporarily lock
    a memory location.
  • To maintain cache consistency — When one processor accesses data cached on another processor, it must not
    receive incorrect data. If it modifies data, all other processors that access that data must receive the modified
    data.
  • To allow predictable ordering of writes to memory — In some circumstances, it is important that memory writes
    be observed externally in precisely the same order as programmed.
  • [...]

The caching mechanism and cache consistency of Intel 64 and IA-32 processors are discussed in Chapter 11.

(CPUs use some variant of MESI; in practice Intel uses MESIF and AMD uses MOESI.)

The same chapter also includes some litmus tests that help illustrate / define the memory model. The parts you quoted aren't really a strictly formal definition of the memory model, but section 8.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations shows that loads aren't reordered with loads, and another section shows that LoadStore reordering is forbidden. acq_rel basically blocks all reordering except StoreLoad, and that's exactly what x86 allows. (https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120930/weak-vs-strong-memory-models/)

Related:

  • how are barriers/fences and acquire, release semantics implemented microarchitecturally?
  • x86 mfence and C++ memory barrier - asking why no barriers are needed for acq_rel, but coming at it from a different angle (wondering about how data ever becomes visible to other cores).
  • How do memory_order_seq_cst and memory_order_acq_rel differ? (seq_cst requires flushing the store buffer).
  • C11 Atomic Acquire/Release and x86_64 lack of load/store coherence?
  • Globally Invisible load instructions - program-order + store buffer isn't exactly the same as acq_rel, especially once you consider a load that only partially overlaps a recent store.
  • x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors - a formal memory model for x86.


Other ISAs

In general, most weakly-ordered HW memory models also only allow local reordering, so barriers are still only local within a CPU core, just making (some part of) that core wait until some condition is met. (e.g. x86 mfence blocks later loads and stores from executing until the store buffer drains. Other ISAs also benefit from lighter-weight barriers for ordering that x86 enforces between every memory operation anyway, e.g. blocking LoadLoad and LoadStore reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/)
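As a sketch of that difference (the per-ISA comments are my assumptions about common compiler mappings; actual output varies by compiler and CPU):

#include <atomic>

std::atomic<int> flag{0};

int acquire_load() {
    // x86-64:  plain mov     (every load already has acquire semantics)
    // AArch64: ldar / ldapr  (a one-way barrier attached to the load itself)
    // PowerPC: ld + lwsync   (or a ld; cmp; bc; isync sequence)
    return flag.load(std::memory_order_acquire);
}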

A few ISAs (only PowerPC these days) allow stores to become visible to some other cores before becoming visible to all, allowing IRIW reordering. Note that mo_acq_rel in C++ allows IRIW reordering; only seq_cst forbids it. Most HW memory models are slightly stronger than ISO C++ and make it impossible, so all cores agree on the global order of stores.

Acquire-release on x86

Correct, acquire/release semantics cannot prevent StoreLoad reordering, i.e. taking a store followed by a load and interchanging their order. And such reordering is allowed for ordinary load and store instructions on x86.

If you want to avoid such reordering in C11, you need to use memory_order_seq_cst on both the store and the load. In x86 assembly, you need a barrier in between the two instructions. mfence serves this purpose, but so does any locked read-modify-write instruction, including xchg which does so even without the lock prefix. So if you look at the generated assembly for memory_order_seq_cst operations, you'll see some such barrier in between. (For certain reasons, something like lock add [rsp], 0, or xchg between some register and memory whose contents are unimportant, can actually be more performant than mfence, so some compilers will do that even though it looks weird.)
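For example, in a Dekker-style idiom it's the barrier between each thread's store and its following load that prevents both threads from reading 0 (a sketch; thread creation and the surrounding algorithm are omitted):

#include <atomic>

std::atomic<int> x{0}, y{0};

int thread1() {
    x.store(1, std::memory_order_seq_cst);     // e.g. xchg, or mov + mfence
    return y.load(std::memory_order_seq_cst);  // plain mov; the store carried the barrier
}

int thread2() {
    y.store(1, std::memory_order_seq_cst);
    return x.load(std::memory_order_seq_cst);
}

// With seq_cst, at most one thread can return 0. With release/acquire or
// relaxed, both can return 0: that's StoreLoad reordering in action.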

Does the MOV x86 instruction implement a C++11 memory_order_release atomic store?

That does appear to be the mapping, at least in code compiled with the Intel compiler, where I see:

0000000000401100 <_Z5storeRSt6atomicIiE>:
  401100:  48 89 fa                mov    %rdi,%rdx
  401103:  b8 32 00 00 00          mov    $0x32,%eax
  401108:  89 02                   mov    %eax,(%rdx)
  40110a:  c3                      retq
  40110b:  0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000401110 <_Z4loadRSt6atomicIiE>:
  401110:  48 89 f8                mov    %rdi,%rax
  401113:  8b 00                   mov    (%rax),%eax
  401115:  c3                      retq
  401116:  0f 1f 00                nopl   (%rax)
  401119:  0f 1f 80 00 00 00 00    nopl   0x0(%rax)

for the code:

#include <atomic>
#include <stdio.h>

void store( std::atomic<int> & b ) ;

int load( std::atomic<int> & b ) ;

int main()
{
    std::atomic<int> b ;

    store( b ) ;

    printf( "%d\n", load( b ) ) ;

    return 0 ;
}

void store( std::atomic<int> & b )
{
    b.store( 50, std::memory_order_release ) ;
}

int load( std::atomic<int> & b )
{
    int v = b.load( std::memory_order_acquire ) ;

    return v ;
}

The current Intel architecture documentation, Volume 3 (System Programming Guide), does a nice job of explaining this. See:

8.2.2 Memory Ordering in P6 and More Recent Processor Families

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads.
  • Writes to memory are not reordered with other writes, with the following exceptions: ...

The full memory model is explained there. I'd assume that Intel and the C++ standard folks have worked together in detail to nail down, for each of the possible memory-order operations, the best mapping that conforms to the memory model described in Volume 3; plain stores and loads have been determined to be sufficient in those cases.

Note that just because no special instructions are required for this ordered store on x86-64 doesn't mean that will be universally true. For PowerPC I'd expect to see something like an lwsync instruction along with the store, and on HP-UX (IA-64) the compiler should be using a st4.rel instruction.

C11 Atomic Acquire/Release and x86_64 lack of load/store coherence?

x86's memory model is basically sequential-consistency plus a store buffer (with store forwarding). So every store is a release-store¹. This is why only seq-cst stores need any special instructions. (C/C++11 atomics mappings to asm). Also, https://stackoverflow.com/tags/x86/info has some links to x86 docs, including a formal description of the x86-TSO memory model (basically unreadable for most humans; it requires wading through a lot of definitions).

Since you're already reading Jeff Preshing's excellent series of articles, I'll point you at another one that goes into more detail:
https://preshing.com/20120930/weak-vs-strong-memory-models/

The only reordering that's allowed on x86 is StoreLoad, not LoadStore, if we're talking in those terms. (Store forwarding can do extra fun stuff if a load only partially overlaps a store; Globally Invisible load instructions, although you'll never get that in compiler-generated code for stdatomic.)

@EOF commented with the right quote from Intel's manual:

Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, 8.2.3.3 Stores Are Not Reordered With Earlier Loads.


Footnote 1: ignoring weakly-ordered NT stores; this is why you normally sfence after doing NT stores. C11 / C++11 implementations assume you aren't using NT stores. If you are, use _mm_sfence before a release operation to make sure it respects your NT stores. (In general don't use _mm_mfence / _mm_sfence in other cases; usually you only need to block compile-time reordering. Or of course just use stdatomic.)
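A sketch of that footnote's pattern (the buffer and flag here are made-up names for illustration):

#include <atomic>
#include <immintrin.h>

alignas(16) int buf[4];
std::atomic<bool> ready{false};

void produce() {
    _mm_stream_si128((__m128i *)buf, _mm_set1_epi32(42));  // weakly-ordered NT store
    _mm_sfence();  // order the NT store before the release store below
    ready.store(true, std::memory_order_release);          // publish; now safe for acquire readers
}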

How are barriers/fences and acquire, release semantics implemented microarchitecturally?

Much of this has been covered in other Q&As (especially the later C++ How is release-and-acquire achieved on x86 only using MOV?), but I'll give a summary here. Still, good question, it's useful to collect this all in one place.


On x86, every asm load is an acquire-load. To implement that efficiently, modern x86 HW speculatively loads earlier than allowed and then checks that speculation. (Potentially resulting in a memory-order mis-speculation pipeline nuke.) To track this, Intel calls the combination of load and store buffers the "Memory Order Buffer".

Weakly-ordered ISAs don't have to speculate, they can just load in any order.


x86 store ordering is maintained by only letting stores commit from the store buffer to L1d in program order.

On Intel CPUs at least, a store-buffer entry is allocated for a store when it issues (from the front-end into the ROB + RS). All uops need to have a ROB entry allocated for them, but some uops also need to have other resources allocated, like load or store buffer entries, RAT entries for registers they read/write, and so on.

So I think the store buffer itself is ordered. When a store-address or store-data uop executes, it merely writes an address or data into its already-allocated store-buffer entry. Since commit (freeing SB entries) and allocate are both in program order, I assume it's physically a circular buffer with a head and tail, like the ROB. (And unlike the RS).


Avoiding LoadStore is basically free: a load can't retire until it's executed (taken data from the cache). A store can't commit until after it retires. In-order retirement automatically means that all previous loads are done before a store is "graduated" and ready for commit.

A weakly-ordered uarch that can in practice do load-store reordering might scoreboard loads as well as tracking them in the ROB: let them retire once they're known to be non-faulting, even if the data hasn't arrived yet.

This seems more likely on an in-order core, but IDK. So you could have a load that's retired but the register destination will still stall if anything tries to read it before the data actually arrives. We know that in-order cores do in practice work this way, not requiring loads to complete before later instructions can execute. (That's why software-pipelining using lots of registers is so valuable on such cores, e.g. to implement a memcpy. Reading a load result right away on an in-order core destroys memory parallelism.)

How is load->store reordering possible with in-order commit? goes into this more deeply, for in-order vs. out-of-order.



Barrier instructions

The only barrier instruction that does anything for regular stores is mfence which in practice stalls memory ops (or the whole pipeline) until the store buffer is drained. Are loads and stores the only instructions that gets reordered? covers the Skylake-with-updated-microcode behaviour of acting like lfence as well.

lfence mostly exists for the microarchitectural effect of blocking later instructions from even issuing until all previous instructions have left the out-of-order back-end (retired). The use-cases for lfence for memory ordering are nearly non-existent.

Related:

  • C++ How is release-and-acquire achieved on x86 only using MOV?
  • How is the transitivity/cumulativity property of memory barriers implemented micro-architecturally?
  • How many memory barriers instructions does an x86 CPU have?
  • How can I experience "LFENCE or SFENCE can not pass earlier read/write"
  • Does lock xchg have the same behavior as mfence?
  • Does the Intel Memory Model make SFENCE and LFENCE redundant?
  • Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths - goes into a lot of detail about how LFENCE stops execution of later instructions, and what that means for performance.
  • When should I use _mm_sfence _mm_lfence and _mm_mfence - high-level languages have weaker memory models than x86, so you sometimes only need a barrier that compiles to no asm instructions. Using _mm_sfence() when you haven't used any NT stores just makes your code slower for no reason, compared to atomic_thread_fence(mo_release).

Do x86 SSE instructions have an automatic release-acquire order?

Here is an excerpt from Intel's Software Developers Manual, volume 3, section 8.2.2 (the edition 325384-052US of September 2014):

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads.
  • Writes to memory are not reordered with other writes, with the following exceptions:
    • writes executed with the CLFLUSH instruction;
    • streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
    • string operations (see Section 8.2.4.1).
  • Reads may be reordered with older writes to different locations but not with older writes to the same location.
  • Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
  • Reads cannot pass earlier LFENCE and MFENCE instructions.
  • Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
  • LFENCE instructions cannot pass earlier reads.
  • SFENCE instructions cannot pass earlier writes.
  • MFENCE instructions cannot pass earlier reads or writes.

The first three bullets describe the release-acquire ordering, and the exceptions are explicitly listed there. As you can see, only the cacheability-control instructions (MOVNT*) are in the exception list, while the rest of SSE/SSE2 and other vector instructions obey the general memory-ordering rules and do not require use of [LSM]FENCE.
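In other words, data written with ordinary SSE stores can be published with a plain release store, no fence needed; a sketch (the names are mine):

#include <atomic>
#include <immintrin.h>

alignas(16) float data[4];
std::atomic<bool> ready{false};

void publish() {
    _mm_store_ps(data, _mm_set1_ps(1.0f));         // ordinary (temporal) SSE store: movaps
    ready.store(true, std::memory_order_release);  // no [LSM]FENCE required
}

void consume() {
    if (ready.load(std::memory_order_acquire)) {
        __m128 v = _mm_load_ps(data);  // guaranteed to see 1.0f in all lanes
        (void)v;
    }
    // A movntps (_mm_stream_ps) store, by contrast, would need _mm_sfence
    // before the release store.
}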

Acquire/Release semantics

The complete quote is:

If we let both threads run and find that r1 == 1, that serves as
confirmation that the value of A assigned in Thread 1 was passed
successfully to Thread 2. As such, we are guaranteed that r2 == 42.

The acquire-release semantics only guarantee that:

  • A = 42 doesn't happen after Ready = 1 in thread 1
  • r2 = A doesn't happen before r1 = Ready in thread 2

So the value of r1 has to be checked in thread 2 to be sure that A has been written by thread 1. The scenario in the question can indeed happen, but r1 will be 0 in that case.
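In C++ terms, the quoted scenario looks like this (a sketch; checking r1 before reading A also keeps the non-atomic read race-free):

#include <atomic>

int A = 0;                  // plain data, published via Ready
std::atomic<int> Ready{0};

void thread1() {
    A = 42;
    Ready.store(1, std::memory_order_release);  // A = 42 can't be reordered after this
}

int thread2() {
    int r1 = Ready.load(std::memory_order_acquire);  // later reads can't be hoisted above this
    int r2 = (r1 == 1) ? A : -1;  // guaranteed to see 42 only when r1 == 1
    return r2;
}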

x86 acquire semantics fetch and increment?

What is untrue in the 'lock xadd' implementation of fetch_add?

If you mean that you do not want the stronger semantics of the full barrier provided by the locked RMW op, then on x86 you indeed do not have any other choice. For loads, a plain MOV load instruction does provide the 'true' acquire semantics in this sense: stores executed in program order before the MOV might be observed by other CPUs after it, due to store buffering.
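For instance, weakening the order argument doesn't change the generated code on x86 (the asm comments are typical GCC/Clang output, not a guarantee):

#include <atomic>

std::atomic<int> counter{0};

int inc_acquire() {
    // x86-64: lock xadd -- a full barrier, even though only acquire was requested
    return counter.fetch_add(1, std::memory_order_acquire);
}

int inc_seq_cst() {
    // x86-64: also lock xadd -- the exact same instruction; x86 has no weaker atomic RMW
    return counter.fetch_add(1, std::memory_order_seq_cst);
}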

How do memory_order_seq_cst and memory_order_acq_rel differ?

http://en.cppreference.com/w/cpp/atomic/memory_order has a good example at the bottom that only works with memory_order_seq_cst. Essentially memory_order_acq_rel provides read and write orderings relative to the atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, the sequentially consistent operations are visible in the same order across all threads.

The example boils down to this:

bool x= false;
bool y= false;
int z= 0;

a() { x= true; }
b() { y= true; }
c() { while (!x); if (y) z++; }
d() { while (!y); if (x) z++; }

// kick off a, b, c, d, join all threads
assert(z!=0);

Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented.
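For completeness, here's a runnable harness around the sketch above (my scaffolding; with the default seq_cst ordering the assert can never fire, while acquire/release loads and stores would in principle allow z == 0):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x{false}, y{false};
std::atomic<int> z{0};

int main() {
    std::thread ta([] { x = true; });  // seq_cst store
    std::thread tb([] { y = true; });
    std::thread tc([] { while (!x) {} if (y) ++z; });  // seq_cst loads
    std::thread td([] { while (!y) {} if (x) ++z; });
    ta.join(); tb.join(); tc.join(); td.join();
    assert(z != 0);  // all threads agree on a single order of the two stores
    return 0;
}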


