How to Understand Happens-Before Consistent

How to understand happens-before consistent

Each thread can run on a different core with its own private registers, which Java can use to hold values of variables unless you force access to coherent shared memory. This means that one thread can write a value that is held in a register, and that value is not visible to another thread for some time, e.g. for the duration of a loop or a whole function (delays of milliseconds are not uncommon).

A more extreme example: the reading thread's code is optimised under the assumption that, since the thread itself never changes the value, it doesn't need to re-read it from memory. In that case the optimised code never sees the change performed by another thread.

In both cases, the use of volatile ensures that reads and writes occur in a consistent order and both threads see the same value. This is sometimes described as always reading from main memory, though that doesn't have to be literally the case, because the caches can talk to each other directly. (So the performance hit is much smaller than you might expect.)
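As a concrete illustration, here is a minimal sketch (class and field names are my own) of the classic volatile stop-flag pattern; without volatile, the read in the loop could be hoisted out and the spinning thread might never observe the change:

```java
public class StopFlag {
    // volatile forces every read and write of this field to go through
    // coherent shared memory, so the update becomes visible promptly
    static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (running) { /* spin; re-reads `running` each iteration */ }
            System.out.println("stopped");
        });
        worker.start();
        Thread.sleep(10);
        running = false; // without volatile, the worker might loop forever
        worker.join();   // terminates, because the write is guaranteed visible
    }
}
```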

On normal CPUs, caches are "coherent" (they can't hold stale or conflicting values) and transparent, not managed manually. Making data visible between threads just means executing an actual load or store instruction in assembly to access memory (through the data caches), and optionally waiting for the store buffer to drain to give ordering with respect to other, later operations.

Understanding happens-before and synchronization

When the JLS says that some event X in thread A establishes a happens-before relationship with event Y in thread B, it does not mean that X will happen before Y.

It means that IF X happens before Y, then both threads will agree that X happened before Y. That is to say, both threads will see the program's memory in a state that is consistent with X happening before Y.


It's all about memory. Threads communicate through shared memory, but when there are multiple CPUs in a system, all trying to access the same memory system, then the memory system becomes a bottleneck. Therefore, the CPUs in a typical multi-CPU computer are allowed to delay, re-order, and cache memory operations in order to speed things up.

That works great when threads are not interacting with one another, but it causes problems when they actually do want to interact: If thread A stores a value into an ordinary variable, Java makes no guarantee about when (or even if) thread B will see the value change.

In order to overcome that problem when it's important, Java gives you certain means of synchronizing threads. That is, getting the threads to agree on the state of the program's memory. The volatile keyword and the synchronized keyword are two means of establishing synchronization between threads.
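For instance, here is a minimal sketch (names are my own) of using volatile to establish such an agreement: the volatile write of `ready` happens-before the volatile read that sees it, which also makes the earlier plain write to `data` visible:

```java
public class Publish {
    static int data = 0;                   // plain, non-volatile field
    static volatile boolean ready = false;

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            data = 42;     // plain write, sequenced before the volatile write
            ready = true;  // volatile write: publishes everything before it
        });
        Thread reader = new Thread(() -> {
            while (!ready) { /* spin on the volatile read */ }
            // the volatile read saw ready == true, so data == 42 is guaranteed
            System.out.println(data);
        });
        reader.start();
        writer.start();
        writer.join();
        reader.join();
    }
}
```

Without volatile on `ready`, the reader could loop forever or print 0; with it, the program always prints 42.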


I think the reason they called it "happens before" is to emphasize the transitive nature of the relationship: If you can prove that A happens before B, and you can prove that B happens before C, then according to the rules specified in the JLS, you have proved that A happens before C.

What is the significance of 'strongly happens before' compared to '(simply) happens before'?

Here's my current understanding, which could be incomplete or incorrect. A verification would be appreciated.


C++20 renamed strongly happens before to simply happens before, and introduced a new, more relaxed definition for strongly happens before, which imposes less ordering.

Simply happens before is used to reason about the presence of data races in your code. (Actually that would be the plain 'happens before', but the two are equivalent in the absence of consume operations, the use of which is discouraged by the standard, since most (all?) major compilers treat them as acquires.)

The weaker strongly happens before is used to reason about the global order of seq-cst operations.


This change was introduced in proposal P0668R5: Revising the C++ memory model, which is based on the paper Repairing Sequential Consistency in C/C++11 by Lahav et al (which I didn't fully read).

The proposal explains why the change was made. Long story short, the way most compilers implement atomics on Power and ARM architectures turned out to be non-conformant in rare edge cases, and fixing the compilers had a performance cost, so they fixed the standard instead.

The change only affects you if you mix seq-cst operations with acquire-release operations on the same atomic variable (i.e. if an acquire operation reads a value from a seq-cst store, or a seq-cst operation reads a value from a release store).

If you don't mix operations in this manner, then you're not affected (i.e. can treat simply happens before and strongly happens before as equivalent).

The gist of the change is that the synchronization between a seq-cst operation and the corresponding acquire/release operation no longer affects the position of this specific seq-cst operation in the global seq-cst order, but the synchronization itself is still there.

This makes the seq-cst order for such seq-cst operations largely moot; see below.


The proposal presents the following example, and I'll try to explain my understanding of it:

atomic_int x = 0, y = 0;
int a = 0, b = 0, c = 0;
// Thread 1
x.store(1, seq_cst);
y.store(1, release);
// Thread 2
b = y.fetch_add(1, seq_cst); // b = 1 (the value of y before increment)
c = y.load(relaxed); // c = 3
// Thread 3
y.store(3, seq_cst);
a = x.load(seq_cst); // a = 0

The comments indicate one of the ways that this code can execute, which the standard used to forbid (before this change), but which actually can happen on the affected architectures.

The execution proceeds as follows:

.---- T3 y.store(3, seq_cst);                      --.              (2)
|     |                                              | strongly
|     | sequenced before                             | happens
|     V                                              | before
|     T3 a = x.load(seq_cst); // a = 0     --.     <-'              (3)
|                                            : coherence-
|                                            : ordered
|                                            : before
|     T1 x.store(1, seq_cst);              <-'     --.  --.         (4)
|     |                                              |st  |
|     | sequenced before                             |h   |
|     V                                              |b   |
|  .- T1 y.store(1, release);                      <-'    |
|  |  :                                                   | strongly
|  |  : synchronizes with                                 | happens
|  |  V                                                   | before
|  '> T2 b = y.fetch_add(1, seq_cst); // b = 1     --.    |         (1)
|     |                                              |st  |
|     | sequenced before                             |h   |
|     V                                              |b   |
'---> T2 c = y.load(relaxed); // c = 3             <-'  <-'

Where:

  • Parenthesized numbers on the right show the global seq-cst order.

  • Arrows on the left show how the values propagate between some loads and stores.

  • Arrows in the middle show:

    • 'Sequenced before', the good old single-threaded evaluation order.
    • 'Synchronizes with', the release-acquire synchronization (seq-cst loads count as acquire operations, and seq-cst stores count as release operations).

    Those two together comprise 'simply happens before'.

  • Arrows on the right are based on the arrows in the middle; they show:

    • The newly redefined 'strongly happens before' relation.

    • 'Coherence-ordered before', a new relation introduced in this proposal, which is only used to define the global seq-cst order, and apparently imposes no synchronization (unlike release-acquire operations).

      It seems that it includes everything other than 'simply happens before' that affects the global seq-cst order. In this case, it's just common sense that if a load doesn't see the value written by the store, then the load goes before the store.

    The global seq-cst order is consistent with both.

Notice that in this picture, nothing strongly happens before b = y.fetch_add(1, seq_cst);, so there isn't anything that must precede it in the global seq-cst order. It's therefore possible to move it all the way up to the beginning of the seq-cst order, which is what ends up happening in this scenario, even though it reads values produced by operations that come later in that order.

How to understand JSR-133 "Happens-Before is too Weak" Figures 6/7

Figure 6 is an example of the JMM's out-of-thin-air restriction. Because no field is marked as volatile, the JVM is normally allowed to apply optimizations to the example code. This might result in surprising outcomes.

Assume that thread 1 observed x = 1 to be true. Do not worry for now about how this could be the case. The thread would then set y = 1. The latter observation would then justify the initial observation of x = 1, which is, however, cyclic reasoning. If this were allowed, x = y = 1 would be a legal outcome, and the JVM could simply rewrite the code in both threads to write each value unconditionally. This form of reasoning is forbidden by the JMM, such that x = y = 0 is the only legal result.

A similar argument is made in figure 7. Again, assume the observation x = 42 by thread 1. In this case, the thread would write y = 42. Based on this observation, one could now argue that the previous x = 42 is a legal observation. The initial assumption was, however, made out of thin air, which is forbidden, such that x = y = 0 is the only legal outcome.

The definition of a happens-before relationship only applies to explicitly synchronized programs and would not forbid the hypothetical cyclic reasoning. Therefore, the out of thin air restriction is needed. This is of course rather academic but with the JMM being an academic model, it is still necessary to define this limitation.

Because of this, there is no race in either figure over which values x or y can take: x = y = 0 is the only possible observation. It follows that both examples are correctly synchronized despite the absence of volatile.
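Based on the description above, the shape of the Figure 6 example can be sketched roughly as follows (my reconstruction from the text, not the paper's exact listing). Under the out-of-thin-air rule, neither conditional write can ever be justified, so the program always prints 0 0:

```java
public class Figure6Sketch {
    static int x = 0, y = 0; // plain fields: no volatile, no synchronization

    public static void main(String[] args) throws InterruptedException {
        // Thread 1 writes y only if it already observed x == 1
        Thread t1 = new Thread(() -> { if (x == 1) y = 1; });
        // Thread 2 writes x only if it already observed y == 1
        Thread t2 = new Thread(() -> { if (y == 1) x = 1; });
        t1.start(); t2.start();
        t1.join();  t2.join();
        // The cyclic justification of x == y == 1 is forbidden (out of thin
        // air), so x == y == 0 is the only legal result under the JMM.
        System.out.println(x + " " + y);
    }
}
```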

Java Memory Model: a JLS statement about sequential consistency seems incorrect

Your error is in bullet point #1: The reads of v1 and v2 are not synchronized-with.

There are happens-before relationships created only by the interactions with vv, so for example in this case, if you added vv to the beginning of your print statement, you would be guaranteed not to see vv=20,v2=4. Since you busy-wait on vv becoming nonzero but then don't interact with it again, the only guarantee is that you will see all of the effects that happened before it became nonzero (the assignments of 1 and 2). You may also see future effects, because you don't have any further happens-befores.

Even if you declare all of the variables as volatile, it is still possible for you to output v1=1,v2=4 because the multithreaded accesses of the variables do not have a defined order, and the global sequence can go like this:

  1. T1: write v1=1
  2. T1: write v2=2
  3. T1: write vv=10 (Thread 2 cannot exit the while loop before here and is guaranteed to see all of these effects.)
  4. T2: read vv=10
  5. T2: read v1=1
  6. T1: write v1=3
  7. T1: write v2=4
  8. T2: read v2=4

After each of these steps, the memory model guarantees that all threads will see the same values of the volatile variables, but you have a data race, and that is because the accesses are not atomic (grouped). In order to assure that you see them in a group, you need to use some other means, such as executing in a synchronized block or putting all of the values into a record class and using volatile or AtomicReference to swap out the entire record.
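As a sketch of that last suggestion (type and field names are my own): grouping the values in an immutable record and swapping the whole record through an AtomicReference makes each read an atomic snapshot of a consistent pair:

```java
import java.util.concurrent.atomic.AtomicReference;

public class SnapshotDemo {
    // An immutable record groups the values that must be read together.
    record Pair(int v1, int v2) { }

    static final AtomicReference<Pair> state =
            new AtomicReference<>(new Pair(0, 0));

    public static void main(String[] args) {
        state.set(new Pair(1, 2)); // both values published as one atomic swap
        state.set(new Pair(3, 4)); // a later update replaces the whole pair
        Pair p = state.get();      // readers always see a consistent pair,
        System.out.println("v1=" + p.v1() + ",v2=" + p.v2()); // never v1=1,v2=4
    }
}
```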

Formally, the data race as defined by the JLS consists of the operations T1(write v1=3) and T2(read v1) (and a second data race on v2). These are conflicting accesses (because the T1 access is a write), but while both of these events happen after T2(read vv), they are not ordered in relation to each other.

Java Specification: reads see writes that occur later in the execution order

The example

Table 17.4.8-A

Thread 1               Thread 2
r1 = x;                r2 = y;
if (r1 != 0) y = 1;    if (r2 != 0) x = 1;

is possible on an architecture that has speculative execution. So the CPU encounters the conditional branch when the value requested from the main memory has not arrived yet and decides to execute the conditional code and roll it back in case the condition is not fulfilled.

When the value arrives at the CPU core, it’s the value written by the other thread, so the condition is fulfilled and the changes are kept.

You can not reproduce this in Java code, as the specification continues with the explanation that such a scenario is not allowed for Java code. It’s an example of what would be allowed, if we only had happens-before consistency. But Java additionally forbids “out of thin air values”. So, JVM implementations for architectures supporting such speculative execution must ensure that it is limited to variables not seen by other threads (or disable it completely).


