What Is the Significance of 'Strongly Happens Before' Compared to '(Simply) Happens Before'

What is the significance of 'strongly happens before' compared to '(simply) happens before'?

Here's my current understanding, which could be incomplete or incorrect. A verification would be appreciated.


C++20 renamed strongly happens before to simply happens before, and introduced a new, more relaxed definition for strongly happens before, which imposes less ordering.

Simply happens before is used to reason about the presence of data races in your code. (Actually, that would be the plain 'happens before', but the two are equivalent in the absence of consume operations, the use of which is discouraged by the standard, since most (all?) major compilers treat them as acquires.)

The weaker strongly happens before is used to reason about the global order of seq-cst operations.


This change was introduced in the proposal P0668R5: Revising the C++ memory model, which is based on the paper Repairing Sequential Consistency in C/C++11 by Lahav et al. (which I didn't fully read).

The proposal explains why the change was made. Long story short, the way most compilers implement atomics on Power and ARM architectures turned out to be non-conformant in rare edge cases, and fixing the compilers had a performance cost, so they fixed the standard instead.

The change only affects you if you mix seq-cst operations with acquire-release operations on the same atomic variable (i.e. if an acquire operation reads a value from a seq-cst store, or a seq-cst operation reads a value from a release store).

If you don't mix operations in this manner, then you're not affected (i.e. can treat simply happens before and strongly happens before as equivalent).

The gist of the change is that the synchronization between a seq-cst operation and the corresponding acquire/release operation no longer affects the position of this specific seq-cst operation in the global seq-cst order, but the synchronization itself is still there.

This makes the seq-cst order for such seq-cst operations largely moot; see below.


The proposal presents the following example, and I'll try to explain my understanding of it:

atomic_int x = 0, y = 0;
int a = 0, b = 0, c = 0;
// Thread 1
x.store(1, seq_cst);
y.store(1, release);
// Thread 2
b = y.fetch_add(1, seq_cst); // b = 1 (the value of y before increment)
c = y.load(relaxed); // c = 3
// Thread 3
y.store(3, seq_cst);
a = x.load(seq_cst); // a = 0

The comments indicate one of the ways that this code can execute, which the standard used to forbid (before this change), but which actually can happen on the affected architectures.
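
For reference, here's a self-contained, compilable rendering of that snippet (the thread scaffolding is mine; seq_cst etc. above are shorthand for the std::memory_order_* constants):

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int a = 0, b = 0, c = 0;

void thread1() {
    x.store(1, std::memory_order_seq_cst);
    y.store(1, std::memory_order_release);
}

void thread2() {
    b = y.fetch_add(1, std::memory_order_seq_cst);
    c = y.load(std::memory_order_relaxed);
}

void thread3() {
    y.store(3, std::memory_order_seq_cst);
    a = x.load(std::memory_order_seq_cst);
}

int main() {
    std::thread t1(thread1), t2(thread2), t3(thread3);
    t1.join(); t2.join(); t3.join();
    // On Power/ARM, one possible outcome is a == 0, b == 1, c == 3,
    // which C++17 forbade but C++20 allows.
}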

The execution proceeds as follows:

.---- T3 y.store(3, seq_cst);                 ---.         (2)
|       |                                        | strongly
|       | sequenced before                      | happens
|       V                                        | before
|     T3 a = x.load(seq_cst); // a = 0  ---. <---'         (3)
|                                          | coherence-
|                                          | ordered
|                                          | before
|     T1 x.store(1, seq_cst);  <-----------'  ---.  ---.   (4)
|       |                                        |st   |
|       | sequenced before                      |h    |
|       V                                        |b    |
| .-- T1 y.store(1, release);  <-----------------'     |
| |     |                                              | strongly
| |     | synchronizes with                            | happens
| |     V                                              | before
| '-> T2 b = y.fetch_add(1, seq_cst); // b = 1  ---.   |   (1)
|       |                                          |st |
|       | sequenced before                        |h   |
|       V                                          |b   |
'---> T2 c = y.load(relaxed); // c = 3   <---------' <-'

Where:

  • Parenthesized numbers on the right show the global seq-cst order.

  • Arrows on the left show how the values propagate between some loads and stores.

  • Arrows in the middle show:

    • 'Sequenced before', the good old single-threaded evaluation order.
    • 'Synchronizes with', the release-acquire synchronization (seq-cst loads count as acquire operations, and seq-cst stores count as release operations).

    Those two together comprise 'simply happens before'.

  • Arrows on the right are based on the arrows in the middle; they show:

    • The newly redefined 'strongly happens before' relation.

    • 'Coherence-ordered before', a new relation introduced in this proposal, which is only used to define the global seq-cst order, and apparently imposes no synchronization (unlike release-acquire operations).

      It seems that it includes everything other than 'simply happens before' that affects the global seq-cst order. In this case, it's just common sense that if a load doesn't see the value written by the store, then the load goes before the store.

    The global seq-cst order is consistent with both.

Notice that in this picture, nothing strongly happens before b = y.fetch_add(1, seq_cst);, so nothing is required to precede it in the global seq-cst order. It can therefore be moved all the way up to the beginning of the seq-cst order, which is what ends up happening in this scenario, even though it reads values produced by operations that come later in that order.

How is the modification order of a variable defined in C++?

The modification order for an object is the order a thread would see if it was spinning in a tight loop running while(1) { y.load(relaxed); }, and happened to see every change. (Not that that's a useful way to actually observe it, but every object has its own modification order that all threads can always agree on, like on real CPUs thanks to MESI exclusive ownership being required to commit a store to L1d cache. Or see it early via private store-forwarding between SMT threads on POWER CPUs.)

Some random facts:

  • The modification order for a single object is compatible with program order within one thread, even with relaxed operations.
  • After a thread sees a value with a load, it can only see that value or later ones in the modification order, even if the loads and stores are relaxed.
  • The observed value of a variable can't change more times than there are stores by other threads. If you had a bunch of stores from a bunch of threads, one of them will be last, and that's the value that all readers will see (eventually). And while the dust is settling, any given reader won't see the value changing back and forth, other than actually seeing later stores in the mod order. (See the "coherency" rules in [intro.races] in the standard, illustrated in the sketch below.)
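
Here's a minimal sketch of those coherence rules in action (the names are mine). A single writer stores increasing values, so y's modification order is 0, 1, 2, 3; read-read coherence then guarantees a reader may skip values but can never observe y going backwards:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> y{0};

void writer() {
    for (int i = 1; i <= 3; i++)
        y.store(i, std::memory_order_relaxed);
}

void reader() {
    int prev = 0;
    for (int i = 0; i < 100; i++) {
        int cur = y.load(std::memory_order_relaxed);
        // Later loads may only see the same or later values in the
        // modification order, even with relaxed everywhere.
        assert(cur >= prev);
        prev = cur;
    }
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}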

I think this evaluation is showing the actual effective order, so the mod order for y is just read off top to bottom: (2), (x), (1), where (x) is the unnumbered release store y.store(1, release). (Because it's using enough seq_cst operations that all threads can agree on the order, and showing that some other things end up ordering the release store (x) after the seq_cst store (2).)

This eval order is saying that the (2) store did become visible before the (x) store, so the (x) store replaced it.

And the dust has settled on that before the y.fetch_add (1), otherwise it would have synced-with (2) instead of (x).

Do std::memory_order_release and std::memory_order_acquire only guarantee the relative order within the current code block?

Yes, the effect of release covers everything thread 1 did before, not limited to the code block.

It can even cover things done by different threads, if they had synchronized with thread 1 before.
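
Here's a minimal sketch of that transitive coverage (the names are mine): thread C acquires from thread B, which had earlier acquired from thread A, so C is guaranteed to see A's plain write:

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                        // ordinary, non-atomic
std::atomic<bool> ab{false}, bc{false};

void thread_a() {
    data = 42;                       // plain write, done before the release
    ab.store(true, std::memory_order_release);
}

void thread_b() {
    while (!ab.load(std::memory_order_acquire)) {} // synchronizes with A
    bc.store(true, std::memory_order_release);
}

void thread_c() {
    while (!bc.load(std::memory_order_acquire)) {} // synchronizes with B
    assert(data == 42);              // A's write happens-before this read
}

int main() {
    std::thread a(thread_a), b(thread_b), c(thread_c);
    a.join(); b.join(); c.join();
}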

What's the exact definition of "prior to"?

The standard calls this happens-before. It has a rather obscure definition, but fortunately, you don't need to care about it.

In the absence of consume operations (the use of which is discouraged by the standard, since apparently no modern compiler supports them, silently replacing them with stronger acquire operations), this is equivalent to the much less obscure simply-happens-before.

A simply-happens-before B if:

  • A is sequenced-before B (i.e. they're both in the same thread, and one runs before the other), or
  • A synchronizes-with B (most often A is a release operation, and B is an acquire operation in a different thread that reads the value from A; but synchronization also happens in some other scenarios).
  • This relation is transitive, meaning that any number of chained "sequenced-before" and "synchronizes-with" also imposes "simply-happens-before".

There's also strongly-happens-before, which is the same as simply-happens-before in the absence of mixing release/acquire with seq-cst on the same variable (see the first question above for more).

What do each memory_order mean?

The GCC Wiki gives a very thorough and easy to understand explanation with code examples.

(excerpt edited, and emphasis added)

IMPORTANT:

Upon re-reading the below quote copied from the GCC Wiki in the process of adding my own wording to the answer, I noticed that the quote is actually wrong. They got acquire and consume exactly the wrong way around. A release-consume operation only provides an ordering guarantee on dependent data whereas a release-acquire operation provides that guarantee regardless of data being dependent on the atomic value or not.

The first model is "sequentially consistent". This is the default mode used when none is specified, and it is the most restrictive. It can also be explicitly specified via memory_order_seq_cst. It provides the same restrictions and limitation to moving loads around that sequential programmers are inherently familiar with, except it applies across threads.

[...]

From a practical point of view, this amounts to all atomic operations acting as optimization barriers. It's OK to re-order things between atomic operations, but not across the operation. Thread local stuff is also unaffected since there is no visibility to other threads. [...] This mode also provides consistency across all threads.

The opposite approach is memory_order_relaxed. This model allows for much less synchronization by removing the happens-before restrictions. These types of atomic operations can also have various optimizations performed on them, such as dead store removal and commoning. [...] Without any happens-before edges, no thread can count on a specific ordering from another thread.

The relaxed mode is most commonly used when the programmer simply wants a variable to be atomic in nature rather than using it to synchronize threads for other shared memory data.

The third mode (memory_order_acquire / memory_order_release) is a hybrid between the other two. The acquire/release mode is similar to the sequentially consistent mode, except it only applies a happens-before relationship to dependent variables. This allows for a relaxing of the synchronization required between independent reads of independent writes.

memory_order_consume is a further subtle refinement in the release/acquire memory model that relaxes the requirements slightly by removing the happens before ordering on non-dependent shared variables as well.

[...]

The real difference boils down to how much state the hardware has to flush in order to synchronize. Since a consume operation may therefore execute faster, someone who knows what they are doing can use it for performance critical applications.

Here follows my own attempt at a more mundane explanation:

A different approach is to look at the problem from the point of view of reordering reads and writes, both atomic and ordinary:

All atomic operations are guaranteed to be atomic within themselves (the combination of two atomic operations is not atomic as a whole!), and all modifications of any single atomic object are visible in one total order, the modification order of that object. That does not mean atomic operations on different objects can never be reordered relative to other memory operations; whether they can depends on the memory order chosen. Compilers (and CPUs) routinely do such reordering as an optimization.

It also means the compiler must use whatever instructions are necessary to guarantee that an atomic operation executing at any time will see the results of each and every other atomic operation on the same object, possibly on another processor core (but not necessarily other operations), that were executed before.

Now, a relaxed operation is just that: the bare minimum. It does nothing in addition and provides no other guarantees. It is the cheapest possible operation. For non-read-modify-write operations on strongly ordered processor architectures (e.g. x86/amd64) this boils down to a plain normal, ordinary move.
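
A minimal sketch of the classic relaxed use case, a statistics counter (the names are mine): every increment is atomic, so the final total is exact, yet no inter-thread ordering is established:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<long> hits{0};

void worker() {
    for (int i = 0; i < 100000; i++)
        hits.fetch_add(1, std::memory_order_relaxed); // atomic, but imposes no ordering
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join(); t2.join();
    std::printf("%ld\n", hits.load(std::memory_order_relaxed)); // always 200000
}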

The sequentially consistent operation is the exact opposite: it enforces strict ordering not only for atomic operations, but also for other memory operations that happen before or after. Neither one can cross the barrier imposed by the atomic operation. Practically, this means lost optimization opportunities, and fence instructions may have to be inserted. This is the most expensive model.
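
The classic illustration is the store-buffering litmus test; here's a sketch (the names are mine) where sequential consistency is exactly what rules out the outcome r1 == 0 && r2 == 0:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1() {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
}

void t2() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

int main() {
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // There is a single total order over all four operations, so at
    // least one load must come after the other thread's store.
    assert(r1 == 1 || r2 == 1);
    // With acquire/release or relaxed instead, r1 == 0 && r2 == 0 is
    // a permitted outcome (and readily observable on real hardware).
}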

A release operation prevents ordinary loads and stores from being reordered after the atomic operation, whereas an acquire operation prevents ordinary loads and stores from being reordered before the atomic operation. Everything else can still be moved around.

The combination of preventing stores being moved after, and loads being moved before the respective atomic operation makes sure that whatever the acquiring thread gets to see is consistent, with only a small amount of optimization opportunity lost.

One may think of that as something like a non-existent lock that is being released (by the writer) and acquired (by the reader). Except... there is no lock.

In practice, release/acquire usually means the compiler need not use any particularly expensive special instructions, but it cannot freely reorder loads and stores to its liking, which may miss out on some (small) optimization opportunities.
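
Here's the canonical release/acquire pattern, message passing, as a sketch (the names are mine): the release store keeps the payload write above it, and the acquire load keeps the payload read below it:

#include <atomic>
#include <cassert>
#include <string>
#include <thread>

std::string payload;                 // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = "hello";               // must stay before the release store
    ready.store(true, std::memory_order_release);
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // must stay before the payload read
    assert(payload == "hello");      // no data race: the write happens-before this
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
}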

Finally, consume is the same operation as acquire, only with the exception that the ordering guarantees only apply to dependent data. Dependent data would e.g. be data that is pointed-to by an atomically modified pointer.

Arguably, that may provide for a couple of optimization opportunities that are not present with acquire operations (since less data is subject to restrictions); however, this happens at the expense of more complex and more error-prone code, and the non-trivial task of getting dependency chains correct.

It is currently discouraged to use consume ordering while the specification is being revised.
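
For illustration only, here's what consume-based publication would look like, as a hedged sketch (the names are mine); keep in mind that every major compiler currently promotes the consume load to an acquire anyway:

#include <atomic>
#include <cassert>
#include <thread>

struct Config { int timeout; };

std::atomic<Config*> current{nullptr};

void publisher() {
    Config* c = new Config{30};      // fill in the object first...
    current.store(c, std::memory_order_release); // ...then publish the pointer
}

void subscriber() {
    Config* c;
    while (!(c = current.load(std::memory_order_consume))) {}
    // c->timeout carries a dependency on the loaded pointer, so this
    // read is ordered even under a literal consume; unrelated shared
    // variables would not be.
    assert(c->timeout == 30);
}

int main() {
    std::thread p(publisher), s(subscriber);
    p.join(); s.join();
    delete current.load(std::memory_order_relaxed);
}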

When to use AtomicReference in Java?

AtomicReference should be used in a setting where you need to do simple atomic (i.e. thread-safe, non-trivial) operations on a reference, for which monitor-based synchronization is not appropriate. Suppose you want to update a cached value, but only if its state has not been changed by another thread while you were processing it:

AtomicReference<Object> cache = new AtomicReference<Object>();

Object cachedValue = new Object();
cache.set(cachedValue);

//... time passes ...
Object cachedValueToUpdate = cache.get();
//... do some work to transform cachedValueToUpdate into a new version
Object newValue = someFunctionOfOld(cachedValueToUpdate);
// Atomically install newValue, but only if no other thread has
// replaced the value we read in the meantime.
boolean success = cache.compareAndSet(cachedValueToUpdate, newValue);

Because of the atomic reference semantics, you can do this even if the cache object is shared amongst threads, without using synchronized. In general, you're better off using synchronizers or the java.util.concurrent framework rather than bare Atomic* unless you know what you're doing.

Two excellent dead-tree references which will introduce you to this topic:

  • Herlihy's excellent Art of Multiprocessor Programming
  • Java Concurrency in Practice

Note that reference assignment (i.e. =) is itself atomic, without explicitly using an Atomic* (though I don't know if this has always been true). Updating primitive 64-bit types like long or double may not be atomic, but updating a reference is always atomic, even if it's 64 bits.

See the Java Language Specification 3ed, Section 17.7.


