C++11 Memory_Order_Acquire and Memory_Order_Release Semantics

C++11 memory_order_acquire and memory_order_release semantics?

The spinlock mutex implementation looks okay to me, but I think cppreference.com got the definitions of acquire and release completely wrong.

Here is the clearest explanation of acquire/release consistency models that I am aware of: Gharachorloo; Lenoski; Laudon; Gibbons; Gupta; Hennessy: Memory consistency and event ordering in scalable shared-memory multiprocessors, Int'l Symp Comp Arch, ISCA(17):15-26, 1990, doi 10.1145/325096.325102. (The doi is behind the ACM paywall. The actual link is to a copy not behind a paywall.)

Look at Condition 3.1 in Section 3.3 and the accompanying Figure 3:

  • before an ordinary load or store access is allowed
    to perform with respect to any other processor,
    all previous acquire accesses must be performed, and
  • before a release access is allowed to perform with
    respect to any other processor, all previous ordinary
    load and store accesses must be performed, and
  • special accesses are [sequentially] consistent with respect
    to one another.

The point is this: acquires and releases are sequentially consistent¹ (all threads globally agree on the order in which acquires and releases happened). All threads globally agree that the stuff that happens between an acquire and a release on a specific thread happened between the acquire and release. But normal loads and stores after a release are allowed to be moved (either by hardware or the compiler) above the release, and normal loads and stores before an acquire are allowed to be moved (either by hardware or the compiler) to after the acquire.

(Footnote 1: This is true for most implementations, but an overstatement for ISO C++ in general. Reader threads are allowed to disagree about the order of 2 stores done by 2 other threads. See Acquire/release semantics with 4 threads, and this answer for details of how C++ compiled for POWER CPUs demonstrates the difference in practice with release and acquire, but not seq_cst. But most CPUs only move data between cores via coherent cache, which means a global order does exist.)
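To illustrate the one-way movement described above, here is a minimal sketch (the variables flag, shared, and scratch are hypothetical, purely for illustration):

    #include <atomic>

    std::atomic<bool> flag{false};
    int shared = 0;
    int scratch = 0;

    void thread_body() {
        scratch = 1;        // ordinary store before the acquire: hardware or
                            // the compiler may move it to AFTER the acquire
        while (!flag.load(std::memory_order_acquire)) {} // acquire

        shared += 1;        // stays between the acquire and the release:
                            // all threads agree it happened in this window

        flag.store(false, std::memory_order_release);    // release
        scratch = 2;        // ordinary store after the release: may be moved
                            // to BEFORE the release
    }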


In the C++ standard (I used the link to the Jan 2012 draft) the relevant section is 1.10 (pages 11 through 14).

The definition of happens-before is intended to be modeled after Lamport: Time, Clocks, and the Ordering of Events in a Distributed System, CACM, 21(7):558-565, Jul 1978. C++ acquires correspond to Lamport's receives, C++ releases correspond to Lamport's sends. Lamport placed a total order on the sequence of events within a single thread, where C++ has to allow a partial order (see Section 1.9, Paragraphs 13-15, page 10 for the C++ definition of sequenced-before). Still, the sequenced-before ordering is pretty much what you would expect. Statements are sequenced in the order they are given in the program. Section 1.9, paragraph 14: "Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated."

The whole point of Section 1.10 is to say that a data-race-free program produces the same well-defined values as if it were run on a machine with a sequentially consistent memory and no compiler reordering. If there is a data race then the program has no defined semantics at all. If there is no data race then the compiler (or machine) is permitted to reorder operations that don't contribute to the illusion of sequential consistency.

Section 1.10, Paragraph 21 (page 14) says: A program is not data-race-free if there is a pair of accesses A and B from different threads to object X, at least one of those accesses has a side effect, and neither A happens-before B, nor B happens-before A. Otherwise the program is data-race-free.
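A minimal illustration of that definition (a hypothetical example, not from the standard):

    #include <thread>

    int x = 0; // ordinary, non-atomic object

    int main() {
        std::thread a([] { x = 1; });              // side effect on x
        std::thread b([] { int r = x; (void)r; }); // value computation on x
        a.join();
        b.join();
        // Neither access happens-before the other and one of them is a
        // write, so this program has a data race and undefined behavior.
    }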

Paragraphs 6-20 give a very careful definition of the happens-before relation. The key definition is Paragraph 12:

"An evaluation A happens before an evaluation B if:

  • A is sequenced before B, or
  • A inter-thread happens before B."

So if an acquire is sequenced before (in the same thread) pretty much any other statement, then the acquire must appear to happen before that statement. (Including if that statement performs a write.)

Likewise: if pretty much any statement is sequenced before (in the same thread) a release, then that statement must appear to happen before the release. (Including if that statement just does a value computation (read).)

The reason the compiler is allowed to move other computations from after a release to before it (or from before an acquire to after it) is that those operations specifically do not have an inter-thread happens-before relationship (because they are outside the critical section). If they race, the semantics are undefined, and if they don't race (because they aren't shared) then you can't tell exactly when they happened relative to the synchronization.

Which is a very long way of saying: cppreference.com's definitions of acquire and release are dead wrong. Your example program has no data race, and PANIC cannot occur.

Understanding `memory_order_acquire` and `memory_order_release` in C++11

Acquire and Release are Memory Barriers.
If your program reads data after an acquire barrier, you are assured of reading data consistent in order with any preceding release by any other thread on the same atomic variable. Atomic variables are guaranteed to have a total order of reads and writes across all threads (when using memory_order_acquire and memory_order_release; weaker orderings are also provided). These barriers in effect propagate that order to any threads using that atomic variable.
You can use atomics to indicate something has 'finished' or is 'ready', but if the consumer reads beyond that atomic variable, the consumer can't rely on 'seeing' the right 'versions' of other memory, and atomics would have limited value.

The statements about 'moving before' or 'moving after' are constraints on the optimizer: it must not re-order operations across the barrier. Optimizers are very good at re-ordering instructions and even omitting redundant reads/writes, but if they re-organised the code across the memory barriers they might unwittingly violate that order.

Your code relies on the std::string object (a) having been constructed in producer() before ptr is assigned, and (b) the constructed version of that string (i.e. the version of the memory it occupies) being the one that consumer() reads.
Put simply, consumer() is going to eagerly read the string as soon as it sees ptr assigned, so it had damn well better see a valid and fully constructed object or bad times will ensue.
In that code, 'the act' of assigning ptr is how producer() 'tells' consumer() the string is 'ready'. The memory barrier exists to make sure that's what the consumer sees (see the sketch below).
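Your exact code isn't reproduced here, but it presumably follows the classic release/acquire publishing pattern; a minimal sketch along those lines (the names ptr, data, producer, and consumer are assumptions):

    #include <atomic>
    #include <cassert>
    #include <string>
    #include <thread>

    std::atomic<std::string*> ptr{nullptr};
    int data = 0;

    void producer() {
        std::string* p = new std::string("Hello"); // fully construct first...
        data = 42;
        ptr.store(p, std::memory_order_release);   // ...then publish
    }

    void consumer() {
        std::string* p2;
        while (!(p2 = ptr.load(std::memory_order_acquire))) {} // wait for publish
        assert(*p2 == "Hello"); // guaranteed: the release/acquire pair orders the
        assert(data == 42);     // construction and the write to data before the load
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }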

Conversely, if ptr were declared as an ordinary std::string * then the compiler could decide to optimize p away, assign the allocated address directly to ptr, and only then construct the object and assign the int data. That is likely a disaster for the consumer thread, which is using that assignment as the indicator that the objects the producer is preparing are ready.
To be accurate, if ptr were a plain pointer the consumer might never see the value assigned, or on some architectures might read a partially assigned value where only some of the bytes have been written, so that it points to a garbage memory location. However, those aspects are about it being atomic, not about the wider memory barriers.

When can memory_order_acquire or memory_order_release be safely removed from compare_exchange?

memory_order_acquire only makes sense for operations that read a value, and memory_order_release only makes sense for operations that write a value. Since a read-modify-write operation both reads and writes, it is possible to combine these memory orders, but it is not always necessary.
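For example, a read-modify-write can combine both directions in a single operation (a hypothetical counter, purely for illustration):

    #include <atomic>

    std::atomic<int> counter{0};

    void bump() {
        // acq_rel combines both: the read half acts as an acquire,
        // the write half acts as a release.
        int previous = counter.fetch_add(1, std::memory_order_acq_rel);
        (void)previous;
    }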

The m_event.m_state.compare_exchange_weak uses memory_order_release to write the new value, because it tries to replace a value that has previously been read using memory_order_acquire:

    // load initial value using memory_order_acquire
    void* oldValue = m_event.m_state.load(std::memory_order_acquire);
    do {
        ...
    } while (!m_event.m_state.compare_exchange_weak(oldValue, this,
                 std::memory_order_release,
                 // in case of failure, load new value using memory_order_acquire
                 std::memory_order_acquire));

IMHO, in this case it is not even necessary to use memory_order_acquire at all, since oldValue is never dereferenced but only stored as the next pointer, i.e., it would be perfectly fine to replace both memory_order_acquire operations with memory_order_relaxed.
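That suggested relaxed variant would look like this (a sketch of the proposed change, not the author's code):

    // relaxed is enough here because oldValue is never dereferenced
    void* oldValue = m_event.m_state.load(std::memory_order_relaxed);
    do {
        ...
    } while (!m_event.m_state.compare_exchange_weak(oldValue, this,
                 std::memory_order_release,
                 std::memory_order_relaxed)); // failure order relaxed as well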

In async_manual_reset_event::set() the situation is different:

    void* oldValue = m_state.exchange(this, std::memory_order_acq_rel);
    if (oldValue != this)
    {
        auto* waiters = static_cast<awaiter*>(oldValue);
        while (waiters != nullptr)
        {
            // we are dereferencing the pointer read from m_state!
            auto* next = waiters->m_next;
            waiters->m_awaitingCoroutine.resume();
            waiters = next;
        }
    }

Since we are dereferencing the pointer we read from m_state, we have to ensure that these reads happen after the writes to those waiter objects. This is ensured via the synchronizes-with relation on m_state. The waiter is added via the previously discussed compare_exchange using memory_order_release. The acquire part of the exchange synchronizes with the release-compare_exchange (and in fact with all prior release-compare_exchange operations that are part of the release sequence), thus providing the necessary happens-before relation.

To be honest, I am not sure why this exchange would need the release part. I think the author might have wanted to be on "the safe side", since several other operations are also stronger than necessary (I already mentioned that await_suspend does not need memory_order_acquire, and the same goes for is_set and reset).

For your lock implementation it is very simple: when you want to acquire the lock (try_lock_shared/try_lock), use memory_order_acquire for the compare-exchange operation only. Releasing the lock has to use memory_order_release.

The argument is also quite simple: you have to ensure that when you have acquired the lock, any changes previously made to the data protected by the lock are visible to the current owner, that is, you have to ensure that these changes happened before the operations you are about to perform after acquiring the lock. This is achieved by establishing a synchronizes-with relation between the try_lock (acquire-CAS) and the previous unlock (release-store).
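A minimal sketch of that rule (a hypothetical spinlock class, not the asker's code):

    #include <atomic>

    class spinlock {
        std::atomic<bool> locked{false};
    public:
        bool try_lock() {
            bool expected = false;
            // acquire-CAS: synchronizes-with the release-store in unlock()
            return locked.compare_exchange_strong(expected, true,
                                                  std::memory_order_acquire,
                                                  std::memory_order_relaxed);
        }
        void unlock() {
            // release-store: everything written in the critical section
            // happens-before the next successful acquire-CAS
            locked.store(false, std::memory_order_release);
        }
    };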

When trying to argue about the correctness of an implementation based on the semantics of the C++ memory model I usually do this as follows:

  1. identify the necessary happens-before relations (like for your lock)
  2. make sure that these happens-before relations are established correctly on all code paths

And I always annotate the atomic operations to document how these relations are established (i.e., which other operations are involved). For example:

    // (1) - this acquire-load synchronizes-with the release-CAS (11)
    auto n = head.load(std::memory_order_acquire);

    // (8) - this acquire-load synchronizes-with the release-CAS (11)
    h.acquire(head, std::memory_order_acquire);

    // (11) - this release-CAS synchronizes-with the acquire-loads (1, 8)
    if (head.compare_exchange_weak(expected, next, std::memory_order_release,
                                   std::memory_order_relaxed))

(see https://github.com/mpoeter/xenium/blob/master/xenium/michael_scott_queue.hpp for the full code)

For more details about the C++ memory model I can recommend this paper, which I have co-authored: Memory Models for C/C++ Programmers

Does the semantics of `std::memory_order_acquire` require processor instructions on x86/x86_64?

No. The semantics of std::memory_order_acquire doesn't require processor instructions on x86/x86_64.

On x86_64, load()/store() operations don't require barrier instructions (lock/fence), except for atomic.store(val, std::memory_order_seq_cst), which requires (implicitly locked) XCHG or, alternatively, MOV (into memory) followed by MFENCE.

Processor memory-barrier instructions for x86 (except CAS), and also for ARM and PowerPC: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

Disassembler GCC 4.8.1 x86_64 - GDB - load():

    20      temp = a.load(std::memory_order_relaxed);
    21      temp = a.load(std::memory_order_acquire);
    22      temp = a.load(std::memory_order_seq_cst);
    0x46140b <+0x007b>  mov 0x38(%rsp),%ebx
    0x46140f <+0x007f>  mov 0x34(%rsp),%esi
    0x461413 <+0x0083>  mov 0x30(%rsp),%edx

Disassembler GCC 4.8.1 x86_64 - GDB - store():

    a.store(temp, std::memory_order_relaxed);
    a.store(temp, std::memory_order_release);
    a.store(temp, std::memory_order_seq_cst);
    0x4613dc <+0x004c>  mov %eax,0x20(%rsp)
    0x4613e0 <+0x0050>  mov 0x38(%rsp),%eax
    0x4613e4 <+0x0054>  mov %eax,0x20(%rsp)
    0x4613e8 <+0x0058>  mov 0x38(%rsp),%eax
    0x4613ec <+0x005c>  mov %eax,0x20(%rsp)
    0x4613f0 <+0x0060>  mfence
    0x4613f3 <+0x0063>  mov %ebx,0x20(%rsp)

Disassembler MSVS 2012 x86_64 - load() - it is the same as this bug report: http://connect.microsoft.com/VisualStudio/feedback/details/770885:

    temp = a.load(std::memory_order_relaxed);
    000000013FE51A1F  prefetchw [a]
    000000013FE51A24  mov eax,dword ptr [a]
    000000013FE51A28  nop dword ptr [rax+rax]
    000000013FE51A30  mov ecx,eax
    000000013FE51A32  lock cmpxchg dword ptr [a],ecx
    000000013FE51A38  jne main+40h (013FE51A30h)
    000000013FE51A3A  mov dword ptr [temp],eax

    temp = a.load(std::memory_order_acquire);
    000000013FE51A3E  prefetchw [a]
    000000013FE51A43  mov eax,dword ptr [a]
    000000013FE51A47  nop word ptr [rax+rax]
    000000013FE51A50  mov ecx,eax
    000000013FE51A52  lock cmpxchg dword ptr [a],ecx
    000000013FE51A58  jne main+60h (013FE51A50h)
    000000013FE51A5A  mov dword ptr [temp],eax

    temp = a.load(std::memory_order_seq_cst);
    000000013FE51A5E  prefetchw [a]
    000000013FE51A63  mov eax,dword ptr [a]
    000000013FE51A67  nop word ptr [rax+rax]
    000000013FE51A70  mov ecx,eax
    000000013FE51A72  lock cmpxchg dword ptr [a],ecx
    000000013FE51A78  jne main+80h (013FE51A70h)
    000000013FE51A7A  mov dword ptr [temp],eax

Disassembler MSVS 2012 x86_64 - store():

    a.store(temp, std::memory_order_relaxed);
    000000013F8C1A58  mov eax,dword ptr [temp]
    000000013F8C1A5C  mov dword ptr [a],eax

    a.store(temp, std::memory_order_release);
    000000013F8C1A60  mov eax,dword ptr [temp]
    000000013F8C1A64  mov dword ptr [a],eax

    a.store(temp, std::memory_order_seq_cst);
    000000013F8C1A68  mov eax,dword ptr [temp]
    000000013F8C1A6C  xchg eax,dword ptr [a]

Acquire/Release VS Sequential Consistency in C++11?

Yes, the assert can fire.

The principal property that is not guaranteed by acquire/release is a single total order of modifications. It only guarantees that the (in this case non-existent) previous actions of a and b are observed by c and d if they see true from the loads. A sketch of the four-thread test in question follows.
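The question's code isn't reproduced above, but this is the canonical four-thread test being discussed (the well-known example from, e.g., C++ Concurrency in Action; the names are assumptions):

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<bool> x{false}, y{false};
    std::atomic<int> z{0};

    void write_x() { x.store(true, std::memory_order_release); }
    void write_y() { y.store(true, std::memory_order_release); }

    void read_x_then_y() {
        while (!x.load(std::memory_order_acquire)) {}
        if (y.load(std::memory_order_acquire)) ++z;
    }

    void read_y_then_x() {
        while (!y.load(std::memory_order_acquire)) {}
        if (x.load(std::memory_order_acquire)) ++z;
    }

    int main() {
        std::thread a(write_x), b(write_y), c(read_x_then_y), d(read_y_then_x);
        a.join(); b.join(); c.join(); d.join();
        assert(z.load() != 0); // CAN fire with acquire/release;
                               // cannot fire with seq_cst throughout
    }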

A (slightly contrived) example of this is on a multi-CPU (physical socket) system that isn't fully cache-coherent. Die 1 has core A running thread a and core C running thread c. Die 2 has core B running thread b and core D running thread d. The interconnect between the two sockets has a long latency when compared to a memory operation that hits on-die cache.

a and b run at the same wall clock time. C is on-die with A, so it can see the store to x immediately, but the interconnect delays its observation of the store to y, so it sees the old value. Similarly, D is on-die with B, so it sees the store to y but misses the store to x.

Whereas if you have sequential consistency, some co-ordination is required to enforce a total order, such as "C and D are blocked while the interconnect syncs the caches".

C11 Atomic Acquire/Release and x86_64 lack of load/store coherence?

x86's memory model is basically sequential consistency plus a store buffer (with store forwarding). So every store is a release-store¹. This is why only seq-cst stores need any special instructions. (C/C++11 atomics mappings to asm). Also, https://stackoverflow.com/tags/x86/info has some links to x86 docs, including a formal description of the x86-TSO memory model (basically unreadable for most humans; it requires wading through a lot of definitions).

Since you're already reading Jeff Preshing's excellent series of articles, I'll point you at another one that goes into more detail:
https://preshing.com/20120930/weak-vs-strong-memory-models/

The only reordering that's allowed on x86 is StoreLoad, not LoadStore, if we're talking in those terms. (Store forwarding can do extra fun stuff if a load only partially overlaps a store; Globally Invisible load instructions, although you'll never get that in compiler-generated code for stdatomic.)
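A sketch of the store-buffer (Dekker-style) litmus test that exhibits exactly that StoreLoad reordering (the variable and function names are assumptions):

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread1() {
        x.store(1, std::memory_order_release);
        r1 = y.load(std::memory_order_acquire); // the load may complete before
                                                // the store drains from the
                                                // store buffer (StoreLoad)
    }

    void thread2() {
        y.store(1, std::memory_order_release);
        r2 = x.load(std::memory_order_acquire);
    }

    // After running both threads, r1 == 0 && r2 == 0 is possible on x86.
    // Making all four operations memory_order_seq_cst rules it out.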

@EOF commented with the right quote from Intel's manual:

Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, 8.2.3.3 Stores Are Not Reordered With Earlier Loads.


Footnote 1: ignoring weakly-ordered NT stores; this is why you normally sfence after doing NT stores. C11 / C++11 implementations assume you aren't using NT stores. If you are, use _mm_sfence before a release operation to make sure it respects your NT stores. (In general don't use _mm_mfence / _mm_sfence in other cases; usually you only need to block compile-time reordering. Or of course just use stdatomic.)
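Along the lines of that footnote, publishing data written with NT stores might look like this (a sketch; buf and ready are hypothetical):

    #include <atomic>
    #include <emmintrin.h> // _mm_stream_si32, _mm_sfence

    int buf[1];
    std::atomic<bool> ready{false};

    void nt_publish() {
        _mm_stream_si32(buf, 42); // weakly-ordered NT store
        _mm_sfence();             // fence the NT store before the release,
                                  // so the release actually covers it
        ready.store(true, std::memory_order_release);
    }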


