Why Do I Need a Memory Barrier?

What is a memory fence?

For performance gains, modern CPUs often execute instructions out of order to make maximum use of the available silicon, including memory reads and writes. Because the hardware enforces instruction integrity, you never notice this within a single thread of execution. However, for multiple threads, or for environments with volatile memory (memory-mapped I/O, for example), this can lead to unpredictable behavior.

A memory fence/barrier is a class of instructions that ensure memory reads and writes occur in the order you expect. For example, a 'full fence' means all reads and writes before the fence are committed before those after the fence.

Note that memory fences are a hardware concept. In higher-level languages we are used to dealing with mutexes and semaphores; these may well be implemented using memory fences at the low level, in which case explicit use of memory barriers is not necessary. Using memory barriers requires careful study of the hardware architecture, and they are more commonly found in device drivers than in application code.

CPU reordering is different from compiler optimisations, although the artefacts can be similar. You need to take separate measures to stop the compiler reordering your instructions if that could cause undesirable behaviour (e.g. use of the volatile keyword in C).
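In C++, for instance, the two concerns map onto two different tools; a minimal sketch (assuming the standard <atomic> facilities, not taken from any of the answers below):

#include <atomic>

int a = 0, b = 0;

void demo() {
    a = 1;
    // Compiler-only: stops the compiler reordering accesses across this
    // point, but emits no CPU barrier instruction.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    b = 2;
    // Hardware fence: additionally emits a CPU barrier instruction where
    // the target architecture needs one (e.g. mfence on x86, dmb on ARM).
    std::atomic_thread_fence(std::memory_order_seq_cst);
}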

Why do I need a memory barrier?
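For context: the code under discussion (from Albahari's Threading in C#, not reproduced in the question) writes _answer and then _complete, with four Thread.MemoryBarrier() calls numbered 1 through 4. A rough C++ analogue of its shape, substituting seq_cst fences for Thread.MemoryBarrier() (a sketch only; in strict C++ the plain int/bool accesses across threads would still be a data race):

#include <atomic>
#include <cstdio>

int _answer;
bool _complete;

void A() {
    _answer = 123;
    std::atomic_thread_fence(std::memory_order_seq_cst); // Barrier 1
    _complete = true;
    std::atomic_thread_fence(std::memory_order_seq_cst); // Barrier 2
}

void B() {
    std::atomic_thread_fence(std::memory_order_seq_cst); // Barrier 3
    if (_complete) {
        std::atomic_thread_fence(std::memory_order_seq_cst); // Barrier 4
        std::printf("%d\n", _answer);
    }
}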

Barrier #2 guarantees that the write to _complete gets committed immediately. Otherwise, it could remain in a queued state, meaning that the read of _complete in B would not see the change made by A, even though B effectively used a volatile read.

Of course, this example does not quite do justice to the problem, because A does nothing more after writing to _complete, which means the write will be committed promptly anyway since the thread terminates shortly afterwards.

The answer to your question of whether the if could still evaluate to false is yes, for exactly the reasons you stated. But notice what the author says on this point.

Barriers 1 and 4 prevent this example from writing “0”. Barriers 2 and 3 provide a freshness guarantee: they ensure that if B ran after A, reading _complete would evaluate to true.

The emphasis on "if B ran after A" is mine. It certainly could be the case that the two threads interleave. But the author ignored this scenario, presumably to keep his point about how Thread.MemoryBarrier works simple.

By the way, I had a hard time contriving an example on my machine where barriers #1 and #2 would have altered the behavior of the program. This is because the memory model regarding writes was strong in my environment. Perhaps if I had a multiprocessor machine, was using Mono, or had some other setup, I could have demonstrated it. Of course, it was easy to demonstrate that removing barriers #3 and #4 had an impact.

What is the difference between a memory barrier and a compiler-only fence?

As a concrete example, consider the following code:

int x = 0, y = 0;

void foo() {
    x = 10;
    y = 20;
}

As it stands, without any barriers or fences, the compiler may reorder the two stores and emit assembly (pseudo)code like

STORE [y], 20
STORE [x], 10

If you insert a compiler-only fence between x = 10; and y = 20;, the compiler is inhibited from doing this, and must instead emit

STORE [x], 10
STORE [y], 20

However, suppose we have another observer looking at the values of x and y in memory, such as a memory-mapped hardware device, or another thread that is going to do

void observe() {
    int y_seen = y;   // read y first, then x; the argument below
    int x_seen = x;   // relies on this load order
    std::cout << x_seen << ", " << y_seen << std::endl;
}

(Assume for simplicity that the two loads in observe() do not get reordered in any way, so y really is read before x, and that loads and stores to int happen to be atomic on this system.) Depending on when its loads take place with respect to the stores in foo(), it could print out 0, 0 or 10, 0 or 10, 20. It might appear that 0, 20 (x seen as 0 even though y was already 20) would be impossible, but that is actually not so in general.

Even though the instructions in foo stored x and y in that order, on some architectures without strict store ordering, that does not guarantee that those stores will become visible to observe() in the same order. It could be that due to out-of-order execution, the core executing foo() actually executed the store to y before the store to x. (Say, if the cache line containing y was already in L1 cache, but the cache line for x was not; the CPU might as well go ahead and do the store to y rather than stalling for possibly hundreds of cycles while the cache line for x is loaded.) Or, the stores could be held in a store buffer and possibly flushed out to L1 cache in the opposite order. Either way, it is possible that observe() prints out 0, 20.

To ensure the desired ordering, the CPU has to be told to preserve it, usually by executing an explicit memory barrier instruction between the two stores. This will cause the CPU to wait until the store to x has been made visible (by loading the cache line, draining the store buffer, etc.) before making the store to y visible. So if you ask the compiler to put in a memory barrier, it will emit assembly like

STORE [x], 10
BARRIER
STORE [y], 20

In this case, you can be assured that observe() will print either 0, 0 or 10, 0 or 10, 20, but never 0, 20.

(Please note that many simplifying assumptions have been made here. If trying to write this in actual C++, you'd need to use std::atomic types and some similar barrier in observe() to ensure its loads were not reordered.)
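Following on from that note, here is a sketch of what the actual C++ might look like with std::atomic and explicit fences (relaxed operations plus release/acquire fences; seq_cst everywhere would also work, at greater cost):

#include <atomic>
#include <iostream>

std::atomic<int> x{0}, y{0};

void foo() {
    x.store(10, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release); // orders the stores
    y.store(20, std::memory_order_relaxed);
}

void observe() {
    int y_seen = y.load(std::memory_order_relaxed);      // load y first
    std::atomic_thread_fence(std::memory_order_acquire); // orders the loads
    int x_seen = x.load(std::memory_order_relaxed);
    // If y_seen == 20, the fence pair guarantees x_seen == 10.
    std::cout << x_seen << ", " << y_seen << std::endl;
}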

Do I need a memory barrier?

I interpret you to be asking whether a compiler can reorder the assignments to buffer_full with the calls to fill_buffer() and use_buffer(). Such an optimization (and any optimization) is permitted only if it does not alter the externally observable behavior of the program.

In this case, because buffer_full has external linkage, it is unlikely that the compiler can be confident about whether the optimization is permitted. It might be able to do so if the definitions of the fill_buffer() and use_buffer() functions, and of every function they themselves call, and so on, are in the same translation unit as the writer_thread() and reader_thread() functions, but that depends somewhat on their implementations. If a conforming compiler is not confident that the optimization is allowed, then it must not perform it.

Inasmuch as your naming implies that the two functions will run in different threads, however, without synchronization actions such as memory barriers you cannot be confident about the relative order in which one thread will perceive modifications to shared, non-_Atomic, non-volatile data performed by a different thread.

Moreover, if one thread writes a non-atomic variable and another thread accesses that same variable (read or write), then there is a data race unless a synchronization action or an atomic operation intervenes between the two in every possible overall order of operations. volatile variables don't really help here (see Why is volatile not considered useful in multithreaded C or C++ programming?). If you make buffer_full atomic, however, or if you implement your functions using atomic read and write operations on it, then that will serve to avoid data races involving not only that variable, but buffer as well (for your present code structure).
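A minimal sketch of that atomic approach in C++ (the buffer size and the fill_buffer()/use_buffer() bodies here are placeholders; the release store and acquire load on buffer_full are what order the plain accesses to buffer):

#include <atomic>

char buffer[1024];
std::atomic<bool> buffer_full{false};

void fill_buffer() { buffer[0] = 'x'; }   // stand-in for real work
void use_buffer()  { (void)buffer[0]; }   // stand-in for real work

void writer_thread() {
    fill_buffer();                                       // plain writes
    buffer_full.store(true, std::memory_order_release);  // publishes them
}

void reader_thread() {
    while (!buffer_full.load(std::memory_order_acquire)) {
        // spin until the writer publishes
    }
    use_buffer();  // guaranteed to see the writer's buffer contents
}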

Where do I set memory barriers so that conditional loops observe multithread value changes?

So where would I set memory barriers / fences so that this optimization can't be performed?

A memory barrier/fence restricts memory ordering: it prevents the compiler (and the CPU) from reordering two or more memory accesses within the same thread. However, memory ordering is not the issue here, as you are only talking about a single memory access per thread.

Instead, you are talking about two memory operations on the same variable from two different threads. Therefore, this is a matter of thread synchronization, not memory ordering.

According to §6.9.2.2 ¶21 of the ISO C++20 standard, if two threads access the same variable without a synchronization operation (e.g. a mutex) in between, and if these memory accesses are not

  1. both atomic, or
  2. both read-only,

then this results in undefined behavior.

Therefore, in the case of your while loop

while (!stop) {
    // do something
}

assuming that stop is a non-atomic type, the compiler is allowed to assume that stop will never be modified by another thread while the loop is running (unless you happen to have a synchronization operation inside the loop), because such a modification would be a data race, and a data race is undefined behavior (which means the compiler is allowed to do anything).

For this reason, a good optimizing compiler will effectively change the code to the following:

if (!stop) {
    while (true) {
        // do something
    }
}

However, if you change stop to an atomic type, such as std::atomic_flag or std::atomic<bool>, then it is no longer undefined behavior for several threads to access the variable at the same time. In that case, the compiler may not assume that stop is never modified by another thread, so it is not allowed to perform the optimization described above.
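A minimal sketch with std::atomic<bool> (the names and the sleep are illustrative only):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> stop{false};

void worker() {
    while (!stop.load(std::memory_order_relaxed)) {
        // do something; the load must be redone on every iteration
        // and cannot be hoisted out of the loop
    }
}

int main() {
    std::thread t(worker);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    stop.store(true, std::memory_order_relaxed); // worker will observe this
    t.join();
}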

Are memory barriers needed because of CPU out-of-order execution, or because of cache consistency problems?

You need barriers to order this core's/thread's accesses to globally visible coherent cache whenever the ISA's memory-ordering rules are weaker than the semantics your algorithm needs.

Cache is always coherent, but that's a separate thing from consistency (ordering between multiple operations).

You can have memory reordering even on an in-order CPU. In more detail, How is load->store reordering possible with in-order commit? shows how memory reordering arises on a pipeline that starts executing instructions in program order but has a cache that allows hit-under-miss and/or a store buffer that allows out-of-order commit.

Related:

  • Does an x86 CPU reorder instructions? talks about the difference between memory reordering and out-of-order execution. (And how x86's strongly ordered memory model is implemented on top of aggressive out-of-order execution by having hardware track ordering, with the store buffer decoupling store execution from store visibility to other threads/cores.)
  • x86 memory ordering: Loads Reordered with Earlier Stores vs. Intra-Processor Forwarding
  • Globally Invisible load instructions

See also https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ and https://preshing.com/20120930/weak-vs-strong-memory-models for more of the basics. x86 has a "strong" memory-ordering model: program order plus a store buffer with store forwarding. C++ acquire and release operations are "free"; only atomic RMWs and seq_cst stores need barriers.

ARM has a "weak" memory-ordering model: only C++ memory_order_consume (data-dependency ordering) is "free"; acquire and release require special instructions (such as ldar/stlr) or barriers.
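As a rough illustration of that cost difference, consider the following sketch; the comments describe typical compiler output, though exact code generation varies:

#include <atomic>

std::atomic<int> g{0};

void cheap_and_expensive() {
    g.store(1, std::memory_order_release); // x86: plain mov; ARM64: stlr
    g.store(2, std::memory_order_seq_cst); // x86: xchg (or mov + mfence);
                                           // ARM64: stlr
}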

Are memory barriers necessary for atomic reference counting shared immutable data?

On x86, an atomic decrement of the reference count will turn into a lock-prefixed assembly instruction, such as LOCK XADD.

Being a single instruction, it is non-interruptible. As an added "feature", the lock prefix results in a full memory barrier:

"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.

A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the Java JITs on x86/x64, because mfence is slower on many CPUs even where it is guaranteed to be available, as it is in 64-bit mode. (Does lock xchg have the same behavior as mfence?)

So you have a full fence on x86 as an added bonus, whether you like it or not. :-)
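In C++ terms, the decrement being discussed might look like the following sketch (fetch_sub is what typically compiles to lock xadd on x86; the type and field names here are hypothetical):

#include <atomic>

struct Shared {
    std::atomic<int> refcount{1};
    // ... immutable payload ...
};

void release(Shared* p) {
    // fetch_sub returns the previous value; whoever removes the last
    // reference is responsible for freeing the object
    if (p->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete p;
}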

On PPC, it is different. An LL/SC pair (lwarx and stwcx.) with a subtraction inside can be used to load the memory operand into a register, subtract one, and then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC sequence can be interrupted (meaning it will fail and retry).
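The C++ analogue of that retry loop is compare_exchange_weak, which on PPC typically compiles to exactly such a lwarx/stwcx. pair; a sketch:

#include <atomic>

void decrement(std::atomic<int>& counter) {
    int old = counter.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously, mirroring a failed
    // stwcx.; on failure it refreshes 'old' with the current value
    while (!counter.compare_exchange_weak(old, old - 1,
                                          std::memory_order_relaxed)) {
        // retry until the store-conditional succeeds
    }
}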

Unlike the x86 case, an LL/SC pair does not imply an automatic full fence. This does not compromise the atomicity of the counter in any way; it just means that on x86 you happen to get a fence as well, "for free".

On PPC, a partial or full fence can be inserted by emitting an lwsync or sync instruction, respectively.

All in all, explicit memory barriers are not necessary for the atomic counter to work properly.


