Acquire/Release Semantics with 4 Threads

You are thinking in terms of sequential consistency, the strongest (and default) memory order. If this memory order is used, all accesses to atomic variables constitute a total order, and the assertion indeed cannot be triggered.

However, in this program, a weaker memory order is used (release stores and acquire loads). This means, by definition, that you cannot assume a total order of operations. In particular, you cannot assume that changes become visible to other threads in the same order. (Only a total order on each individual variable is guaranteed for any atomic memory order, including memory_order_relaxed.)

The stores to x and y occur on different threads, with no synchronization between them. The loads of x and y occur on different threads, with no synchronization between them. This means it is entirely allowed that thread c sees x && !y and thread d sees y && !x. (I'm just abbreviating the acquire-loads here; don't take this syntax to mean sequentially consistent loads.)
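
For reference, a minimal sketch of the kind of four-thread program under discussion (the names x and y and the thread functions are assumptions; the pattern is the classic IRIW litmus test):

#include <atomic>
#include <thread>

std::atomic<bool> x{false}, y{false};
int cx = 0, dy = 0;  // what the two reader threads observed

void a() { x.store(true, std::memory_order_release); }
void b() { y.store(true, std::memory_order_release); }

void c()
{
    while (!x.load(std::memory_order_acquire)) {}  // wait until x is set
    cx = y.load(std::memory_order_acquire);        // may still read false
}

void d()
{
    while (!y.load(std::memory_order_acquire)) {}  // wait until y is set
    dy = x.load(std::memory_order_acquire);        // may still read false
}

int main()
{
    std::thread ta(a), tb(b), tc(c), td(d);
    ta.join(); tb.join(); tc.join(); td.join();
    // With acquire/release, cx == 0 && dy == 0 is allowed: c saw x before y,
    // while d saw y before x. With seq_cst on all accesses it is impossible.
}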

Bottom line: Once you use a weaker memory order than sequentially consistent, you can kiss goodbye any notion of a global state of all atomics that is consistent between all threads. Which is exactly why so many people recommend sticking with sequential consistency unless you need the performance (BTW, remember to measure whether it's even faster!) and are certain of what you are doing. Also, get a second opinion.

Now, whether you will get burned by this is a different question. The standard simply allows a scenario where the assertion fails, based on the abstract machine that is used to describe the standard requirements. However, your compiler and/or CPU may not exploit this allowance for one reason or another. So for a given compiler and CPU, you may never see the assertion triggered in practice. Keep in mind that a compiler or CPU may always use a stricter memory order than the one you asked for, because this can never introduce violations of the minimum requirements from the standard. It may only cost you some performance – but that is not covered by the standard anyway.

UPDATE in response to comment: The standard defines no hard upper limit on how long it takes for one thread to see changes to an atomic by another thread. There is a recommendation to implementers that values should become visible eventually.

There are sequencing guarantees, but the ones pertinent to your example do not prevent the assertion from firing. The basic acquire-release guarantee is that if:

  • Thread e performs a release-store to an atomic variable x
  • Thread f performs an acquire-load from the same atomic variable
  • Then if the value read by f is the one that was stored by e, the store in e synchronizes-with the load in f. This means that any (atomic and non-atomic) store in e that was, in this thread, sequenced before the given store to x, is visible to any operation in f that is, in this thread, sequenced after the given load. [Note that there are no guarantees given regarding threads other than these two!]

So, there is no guarantee that f will read the value stored by e, as opposed to e.g. some older value of x. If it doesn't read the updated value, then the load also does not synchronize with the store, and there are no sequencing guarantees for any of the dependent operations mentioned above.
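
A compact sketch of that rule, using the thread names from the bullets above (the payload variable is a hypothetical addition):

#include <atomic>
#include <cassert>

std::atomic<int> x{0};
int payload = 0;  // non-atomic data published via x

void thread_e()
{
    payload = 7;                            // sequenced before the release store
    x.store(1, std::memory_order_release);
}

void thread_f()
{
    if (x.load(std::memory_order_acquire) == 1) {
        // The load took its value from e's release store, so the store
        // synchronizes with the load: payload == 7 is guaranteed here.
        assert(payload == 7);
    }
    // If the load read the older value 0, there is no synchronizes-with
    // edge, and nothing can be assumed about payload.
}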

I liken atomics with a memory order weaker than sequentially consistent to the Theory of Relativity, where there is no global notion of simultaneity.

PS: That said, an atomic load cannot just read an arbitrary older value. For example, if one thread performs periodic increments (e.g. with release order) of an atomic<unsigned> variable, initialized to 0, and another thread periodically loads from this variable (e.g. with acquire order), then, except for eventual wrapping, the values seen by the latter thread must be non-decreasing. But this follows from the given sequencing rules: once the latter thread reads a 5, anything that happened before the increment from 4 to 5 is in the relative past of anything that follows the read of 5. In fact, a decrease other than wrapping is not even allowed for memory_order_relaxed, but this memory order does not make any promises about the relative sequencing (if any) of accesses to other variables.
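
A minimal sketch of that counter scenario, with assumed names:

#include <atomic>
#include <cassert>

std::atomic<unsigned> counter{0};

void incrementer()  // runs periodically in one thread
{
    counter.fetch_add(1, std::memory_order_release);
}

void observer()     // runs periodically in another thread
{
    unsigned last = 0;
    for (int i = 0; i < 1000; ++i) {
        unsigned now = counter.load(std::memory_order_acquire);
        assert(now >= last);  // reads never go backwards (absent wrap-around)
        last = now;
    }
}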

Acquire/Release semantics

The complete quote is:

If we let both threads run and find that r1 == 1, that serves as
confirmation that the value of A assigned in Thread 1 was passed
successfully to Thread 2. As such, we are guaranteed that r2 == 42.

The acquire-release semantics only guarantee that:

  • A = 42 doesn't happen after Ready = 1 in thread 1
  • r2 = A doesn't happen before r1 = Ready in thread 2

So the value of r1 has to be checked in thread 2 to be sure that A has been written by thread 1. The scenario in the question can indeed happen, but r1 will be 0 in that case.
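
In code, the quoted scenario looks roughly like this (a sketch; the declarations are assumptions based on the quote):

#include <atomic>
#include <cassert>

int A = 0;                 // plain, non-atomic data
std::atomic<int> Ready{0};

void thread1()
{
    A = 42;                                     // cannot move below the store
    Ready.store(1, std::memory_order_release);
}

void thread2()
{
    int r1 = Ready.load(std::memory_order_acquire);  // later reads cannot move above this
    if (r1 == 1) {
        int r2 = A;    // guaranteed to be 42, but only because r1 was checked
        assert(r2 == 42);
    }
    // If r1 == 0, thread 1 may still be writing A; reading A here would be a data race.
}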

On acquire/release semantics not being commutative to other threads

and since y can be 10 if and only if x is 10,

And that's the part that is incorrect.

An acquire/release pair works between a releasing store operation and an acquiring load operation which reads the value that was release-stored.

In thread 1, we have a releasing store to x. In thread 2, we have an acquiring load from x. If thread 2 read the value stored by thread 1, then the two threads have an acquire/release relationship. What that means is that any values written to other objects in thread 1 before the releasing store are visible to thread 2 after its acquiring load.

In thread 2, we have a releasing store to y. In thread 3, we have an acquiring load from y. If thread 3 read the value stored by thread 2, then the two threads have an acquire/release relationship. What that means is that any values written to other objects in thread 2 before the releasing store are visible to thread 3 after its acquiring load.

But notice what I said: "any values written to other objects in thread 2".

x was not written by thread 2; it was written by thread 1. And thread 3 has no relationship to thread 1.

Pure acquire/release is not transitive. If you need transitive access to memory, then you need sequential consistency. Indeed, that's what seq_cst is for: ensuring consistency across all acquire/releases that transitively rely on each other.

Note that this is primarily about caches. A pure acquire/release operation may only flush certain caches, particularly if the compiler can clearly see exactly what memory a thread is making available in a release.

How does mixing relaxed and acquire/release accesses on the same atomic variable affect synchronises-with?

Because you use relaxed ordering on a separate load and store in T2, the release sequence is broken and the second assert can trigger (although not on a TSO platform such as x86).

You can fix this by either using acq/rel ordering in thread T2 (as you suggested) or by modifying T2 to use an atomic read-modify-write operation (RMW), like this:

[Thread T2]
bool done;
do {
    int expected = 1;  // reset each iteration: the CAS overwrites it on failure
    done = atm.compare_exchange_weak(expected, 2, std::memory_order_relaxed);
} while (!done);

The modification order of atm is 0-1-2 and T3 will pick up on either 1 or 2 and no assert can fail.

Another valid implementation of T2 is:

[Thread T2]
if (atm.load(std::memory_order_relaxed) == 1)
{
    atm.exchange(2, std::memory_order_relaxed);
}

Here the RMW itself is unconditional, so it must be accompanied by an if-statement and a (relaxed) load to ensure that the modification order of atm is 0-1 or 0-1-2.

Without the if-statement, the modification order could be 0-2 which can cause the assert to fail. (This works because we know there is only one other write in the whole rest of the program. Separate if() / exchange is of course not in general equivalent to compare_exchange_strong.)

In the C++ standard, the following quotes are related:

[intro.races]

A release sequence headed by a release operation A on an atomic object M is a maximal contiguous subsequence
of side effects in the modification order of M, where the first operation is A, and every subsequent
operation is an atomic read-modify-write operation.

[atomics.order]

An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic
operation B that performs an acquire operation on M and takes its value from any side effect in the release
sequence headed by A.

This question is about why an RMW works in a release sequence.
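
Putting the pieces together, here is a hedged reconstruction of the whole scenario (the data payload and the thread bodies are assumptions based on the description above):

#include <atomic>
#include <cassert>

std::atomic<int> atm{0};
int data = 0;  // assumed non-atomic payload

void t1()
{
    data = 3;
    atm.store(1, std::memory_order_release);  // heads the release sequence
}

void t2_broken()
{
    // A plain relaxed store is NOT a read-modify-write, so it is not part of
    // the release sequence headed by t1's store, even though the modification
    // order of atm is 0-1-2.
    if (atm.load(std::memory_order_relaxed) == 1)
        atm.store(2, std::memory_order_relaxed);
}

void t2_fixed()
{
    // An RMW (even a relaxed one) continues the release sequence.
    int expected = 1;
    while (!atm.compare_exchange_weak(expected, 2, std::memory_order_relaxed))
        expected = 1;  // reset: the CAS overwrites 'expected' on failure
}

void t3()
{
    if (atm.load(std::memory_order_acquire) == 2) {
        // With t2_broken, this assert may fail: no synchronizes-with edge to t1.
        // With t2_fixed, the load takes its value from the release sequence
        // headed by t1's store, so it synchronizes with t1 and data == 3.
        assert(data == 3);
    }
}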

Acquire-release memory order between multiple threads

Yes. The release-acquire relationship holds separately for all pairs of threads (that access the same atomic location!) and guarantees that all writes (that appear in program order) before the release are visible to all reads (that appear in program order) after the corresponding acquire.

Can multiple readers synchronize with the same writers with acquire/release ordering?

This is a textbook example of a happens-before relationship between the store to i in the writer and the load from i in both reader threads.

That multiple readers are involved does not matter. The store to b synchronizes with all readers that observe the updated value (which will eventually happen thanks to the loop).

I think the quote you're looking for is:

An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic
operation B that performs an acquire operation on M and takes its value from any side effect in the release
sequence headed by A.

It does not say that this is limited to a single load operation.
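
A sketch of the pattern, reusing the names b and i from the discussion (the loop bodies are assumptions):

#include <atomic>
#include <cassert>

int i = 0;
std::atomic<bool> b{false};

void writer()
{
    i = 42;
    b.store(true, std::memory_order_release);
}

void reader()  // any number of these threads may run concurrently
{
    while (!b.load(std::memory_order_acquire)) {}  // spin until published
    // Every reader whose acquire load observed `true` synchronizes with the
    // single release store, so each one is guaranteed to see i == 42.
    assert(i == 42);
}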

C++ How is release-and-acquire achieved on x86 only using MOV?

but this is for a single core. The multi-core section does not seem to mention how loads are enforced:

The first bullet point in that section is key: Individual processors use the same ordering principles as in a single-processor system. The implicit part of that statement is ... when loading/storing from cache-coherent shared memory. i.e. multi-processor systems don't introduce new ways for reordering, they just mean the possible observers now include code on other cores instead of just DMA / IO devices.

The model for reordering of access to shared memory is the single-core model, i.e. program order + a store buffer = basically acq_rel. Actually slightly stronger than acq_rel, which is fine.

The only reordering that happens is local, within each CPU core. Once a store becomes globally visible, it becomes visible to all other cores at the same time, and didn't become visible to any core before that. (Except to the core doing the store, via store forwarding.) That's why only local barriers are sufficient to recover sequential consistency on top of an SC + store-buffer model. (On x86, seq_cst just needs mfence after SC stores, to drain the store buffer before any later loads can execute. mfence and locked instructions, which are also full barriers, don't have to bother other cores; they just make this one wait.)

One key point to understand is that there is a coherent shared view of memory (through coherent caches) that all processors share. The very top of chapter 8 of Intel's SDM defines some of this background:

These multiprocessing mechanisms have the following characteristics:

  • To maintain system memory coherency — When two or more processors are attempting simultaneously to
    access the same address in system memory, some communication mechanism or memory access protocol
    must be available to promote data coherency and, in some instances, to allow one processor to temporarily lock
    a memory location.
  • To maintain cache consistency — When one processor accesses data cached on another processor, it must not
    receive incorrect data. If it modifies data, all other processors that access that data must receive the modified
    data.
  • To allow predictable ordering of writes to memory — In some circumstances, it is important that memory writes
    be observed externally in precisely the same order as programmed.
  • [...]

The caching mechanism and cache consistency of Intel 64 and IA-32 processors are discussed in Chapter 11.

(CPUs use some variant of MESI; Intel in practice uses MESIF, AMD in practice uses MOESI.)

The same chapter also includes some litmus tests that help illustrate / define the memory model. The parts you quoted aren't really a strictly formal definition of the memory model. But the section 8.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations shows that loads aren't reordered with loads. Another section also shows that LoadStore reordering is forbidden. Acq_rel is basically blocking all reordering except StoreLoad, and that's what x86 does. (https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120930/weak-vs-strong-memory-models/)
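
To make this concrete, here is roughly how acquire/release atomics compile on x86-64 with mainstream compilers such as GCC and Clang (the listed instructions are typical codegen, not something the standard mandates):

#include <atomic>

std::atomic<int> g{0};

int load_acquire()
{
    // Typically a plain MOV:   mov eax, DWORD PTR g[rip]
    // x86 loads already behave as acquire loads in the hardware model.
    return g.load(std::memory_order_acquire);
}

void store_release(int v)
{
    // Also a plain MOV:        mov DWORD PTR g[rip], edi
    // x86 stores already behave as release stores.
    g.store(v, std::memory_order_release);
}

void store_seq_cst(int v)
{
    // seq_cst must additionally block StoreLoad reordering, so compilers
    // typically emit XCHG (an implicit full barrier) or MOV plus MFENCE.
    g.store(v, std::memory_order_seq_cst);
}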

Related:

  • how are barriers/fences and acquire, release semantics implemented microarchitecturally?
  • x86 mfence and C++ memory barrier - asking why no barriers are needed for acq_rel, but coming at it from a different angle (wondering about how data ever becomes visible to other cores).
  • How do memory_order_seq_cst and memory_order_acq_rel differ? (seq_cst requires flushing the store buffer).
  • C11 Atomic Acquire/Release and x86_64 lack of load/store coherence?
  • Globally Invisible load instructions program-order + store buffer isn't exactly the same as acq_rel, especially once you consider a load that only partially overlaps a recent store.
  • x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors - a formal memory model for x86.


Other ISAs

In general, most weaker memory HW models also only allow local reordering so barriers are still only local within a CPU core, just making (some part of) that core wait until some condition. (e.g. x86 mfence blocks later loads and stores from executing until the store buffer drains. Other ISAs also benefit from light-weight barriers for efficiency for stuff that x86 enforces between every memory operation, e.g. blocking LoadLoad and LoadStore reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/)

A few ISAs (only PowerPC these days) allow stores to become visible to some other cores before becoming visible to all, allowing IRIW reordering. Note that mo_acq_rel in C++ allows IRIW reordering; only seq_cst forbids it. Most HW memory models are slightly stronger than ISO C++ and make it impossible, so all cores agree on the global order of stores.

Are acquire/release semantics really enough for implementing critical sections?

The "moves" interpretation of acquire/release is a useful guide, but can break down because it describes how a thread sees other thread's actions. Each thread must see its own actions as happening in order. For example, Thread 2 could see this sequence:

  1. Thread 1 acquires A
  2. Thread 2 acquires B

but subsequently Thread 2 will see itself as releasing B before acquiring A. Conversely, during the same run, Thread 1 could see it as:

  1. Thread 2 acquires B
  2. Thread 1 acquires A

but subsequently Thread 1 will see itself as releasing A before acquiring B.
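
A sketch of how this bites when building a critical section from acquire/release alone (the flag names are assumptions; this is the core handshake of Dekker-style mutual exclusion):

#include <atomic>

std::atomic<bool> want1{false}, want2{false};

void thread1()
{
    want1.store(true, std::memory_order_release);   // announce intent ("acquire A")
    if (!want2.load(std::memory_order_acquire)) {
        // critical section
    }
}

void thread2()  // symmetric
{
    want2.store(true, std::memory_order_release);   // announce intent ("acquire B")
    if (!want1.load(std::memory_order_acquire)) {
        // critical section
    }
}

// Acquire/release does not order a release store against a LATER acquire load
// (StoreLoad reordering is allowed), so both loads may read false and both
// threads may enter the critical section at once. Making all four accesses
// memory_order_seq_cst rules this out.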

Release/Acquire semantics wrt std::mutex

Now can it be said that std::mutex::lock will have acquire semantics and that std::mutex::unlock essentially has release semantics?

Yes, this is correct.

From what I understand synchronize with is not explicitly defined in the standard

Well, in theory Paragraph 1.10/8 is probably meant to give the definition of synchronizes with:

Certain library calls synchronize with other library calls performed by another thread. For example, an
atomic store-release synchronizes with a load-acquire that takes its value from the store (29.3). [Note: ...]

On the other hand, this does not sound like a very formal definition. However, a better, though implicit, one is indirectly given in Paragraph 1.10/10:

An evaluation A is dependency-ordered before an evaluation B if

— A performs a release operation on an atomic object M, and, in another thread, B performs a consume
operation on M and reads a value written by any side effect in the release sequence headed by A, or

— for some evaluation X, A is dependency-ordered before X and X carries a dependency to B.

[ Note: The relation “is dependency-ordered before” is analogous to “synchronizes with”, but uses release/consume in place of release/acquire. —end note ]

Since the "is analogous to" relationship is most often symmetric, I would say that the above definition of "is-dependency-ordered before" indirectly provides a definition of "synchronizes with" as well - although you might correctly object that notes are non-normative; still, this seems to be the intended definition.

My intuition of the synchronizes with relationship is that it occurs between a write (atomic) operation performed by one thread that stores a certain value and the first (atomic) operation that reads that value. That operation might as well be in the same thread.

If the two operations are on different threads, then the synchronizes-with relation establishes a cross-thread ordering on operations.

In the Standard I can find this under section

30.4.1.2 Mutex types [thread.mutex.requirements.mutex]

11 Synchronization: Prior unlock() operations on the same object shall synchronize with (1.10) this operation.

To me, this seems compatible with the interpretation given above. An operation with release semantics (unlock, store) will synchronize with an operation of acquire semantics (lock, load).
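
In code, that interpretation looks like this (a sketch; the shared variable is an assumption):

#include <mutex>

std::mutex m;
int shared_data = 0;

void producer()
{
    std::lock_guard<std::mutex> guard(m);  // lock(): acquire semantics
    shared_data = 42;
}                                          // unlock(): release semantics

void consumer()
{
    std::lock_guard<std::mutex> guard(m);  // synchronizes with a prior unlock()
    // If producer() released the mutex before we acquired it, every write it
    // made while holding the lock (here, shared_data == 42) is visible now.
    int v = shared_data;
    (void)v;
}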

however, from my understanding of acquire/release semantics, this has more to do with memory reordering. synchronize with could also be called release/acquire semantics?

Release and acquire semantics describe the nature of some operations; the synchronizes-with relationship is (indeed) a relationship which is established between operations that have acquire or release semantics, in a well-defined way.

So in a sense, synchronizes-with is a consequence of the semantics of those operations, and we use those semantics to achieve the correct ordering of instructions and constrain the possible reordering that the CPU or the compiler will perform.


