What is a memory fence?
For performance gains, modern CPUs often execute instructions out of order to make maximum use of the available silicon, including memory reads and writes. Because the hardware enforces instruction integrity, you never notice this within a single thread of execution. However, for multiple threads, or for environments with volatile memory (memory-mapped I/O, for example), this can lead to unpredictable behavior.
A memory fence/barrier is a class of instructions that ensure memory reads and writes occur in the order you expect. For example, a 'full fence' means that all reads and writes before the fence are committed before those after the fence.
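The full-fence semantics described above can be sketched in C++ with std::atomic_thread_fence. This is an illustrative example, not from the original text; the names (producer, consume, run_fence_demo) are invented for the sketch:

```cpp
#include <atomic>
#include <thread>

// The seq_cst fences play the role of the "full fence" described above:
// the store to payload is committed before the flag is published, and the
// flag is read before payload is read.
std::atomic<bool> ready{false};
int payload = 0;  // plain int: its ordering comes entirely from the fences

void producer() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence
    ready.store(true, std::memory_order_relaxed);         // publish after the fence
}

int consume() {
    while (!ready.load(std::memory_order_relaxed)) {}     // spin until published
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence before the read
    return payload;                                       // guaranteed to see 42
}

int run_fence_demo() {
    std::thread t(producer);
    int v = consume();
    t.join();
    return v;
}
```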
Note that memory fences are a hardware concept. In higher-level languages we are used to dealing with mutexes and semaphores; these may well be implemented using memory fences at the low level, in which case explicit use of memory barriers is not necessary. Use of memory barriers requires careful study of the hardware architecture and is more commonly found in device drivers than in application code.
The CPU reordering is different from compiler optimisations, although the artefacts can be similar. You need to take separate measures to stop the compiler reordering your instructions if that may cause undesirable behaviour (e.g. use of the volatile keyword in C).
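As a sketch of that distinction, GCC and Clang accept an empty asm statement with a "memory" clobber as a compiler-only barrier; the COMPILER_BARRIER macro below is a hypothetical name invented for this example. It emits no machine instruction, yet stops the compiler from moving memory accesses across it:

```cpp
// Hypothetical COMPILER_BARRIER macro, a GCC/Clang-style compiler-only fence:
// it emits no machine instruction, but the "memory" clobber forbids the
// compiler from reordering memory accesses across it. (The CPU may still
// reorder them at run time on weakly ordered hardware.)
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

int x_val = 0, y_val = 0;

void ordered_stores() {
    x_val = 10;
    COMPILER_BARRIER();  // compiler must keep the store to x_val first
    y_val = 20;
}

bool run_barrier_demo() {
    ordered_stores();
    return x_val == 10 && y_val == 20;
}
```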
Why do I need a memory barrier?
Barrier #2 guarantees that the write to _complete gets committed immediately. Otherwise it could remain in a queued state, meaning that the read of _complete in B would not see the change caused by A even though B effectively used a volatile read. Of course, this example does not quite do justice to the problem, because A does nothing more after writing to _complete, which means that the write will be committed immediately anyway since the thread terminates early.
The answer to your question of whether the if could still evaluate to false is yes, for exactly the reasons you stated. But notice what the author says regarding this point.

Barriers 1 and 4 prevent this example from writing "0". Barriers 2 and 3 provide a freshness guarantee: they ensure that if B ran after A, reading _complete would evaluate to true.

The emphasis on "if B ran after A" is mine. It certainly could be the case that the two threads interleave. But the author was ignoring this scenario, presumably to keep his point about how Thread.MemoryBarrier works simple.
By the way, I had a hard time contriving an example on my machine where barriers #1 and #2 would have altered the behavior of the program. This is because the memory model regarding writes was strong in my environment. Perhaps if I had a multiprocessor machine, were using Mono, or had some other setup, I could have demonstrated it. Of course, it was easy to demonstrate that removing barriers #3 and #4 had an impact.
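For readers without the original example at hand, the four-barrier pattern under discussion can be sketched in C++, with std::atomic_thread_fence standing in for Thread.MemoryBarrier. This is a reconstruction, not the author's original C# code; strictly speaking, real C++ would require _complete to be atomic to run A and B concurrently, and a plain bool is used here only to mirror the discussion:

```cpp
#include <atomic>

// Variable names mirror the discussion; plain (non-atomic) variables are
// kept to match the C# example, though concurrent use of them in C++
// would formally be a data race.
int answer_ = 0;
bool complete_ = false;

void A() {
    answer_ = 123;
    std::atomic_thread_fence(std::memory_order_seq_cst);      // barrier 1
    complete_ = true;
    std::atomic_thread_fence(std::memory_order_seq_cst);      // barrier 2: publish promptly
}

int B() {
    std::atomic_thread_fence(std::memory_order_seq_cst);      // barrier 3: fresh read of complete_
    if (complete_) {
        std::atomic_thread_fence(std::memory_order_seq_cst);  // barrier 4: fresh read of answer_
        return answer_;  // barriers 1 and 4 prevent this from being a stale 0
    }
    return -1;  // B ran before A's write became visible
}
```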
What is the difference between a memory barrier and a compiler-only fence
As a concrete example, consider the following code:
int x = 0, y = 0;

void foo() {
    x = 10;
    y = 20;
}
As it stands, without any barriers or fences, the compiler may reorder the two stores and emit assembly (pseudo)code like
STORE [y], 20
STORE [x], 10
If you insert a compiler-only fence between x = 10; and y = 20;, the compiler is inhibited from doing this, and must instead emit
STORE [x], 10
STORE [y], 20
However, suppose we have another observer looking at the values of x and y in memory, such as a memory-mapped hardware device, or another thread that is going to do
void observe() {
    std::cout << x << ", ";
    std::cout << y << std::endl;
}
(Assume for simplicity that the loads from x and y in observe() do not get reordered in any way, and that loads and stores to int happen to be atomic on this system.) Depending on when its loads take place with respect to the stores in foo(), we can see that it could print out 0, 0 or 10, 0 or 10, 20. It might appear that 0, 20 would be impossible, but that is actually not so in general.
Even though the instructions in foo stored x and y in that order, on some architectures without strict store ordering, that does not guarantee that those stores will become visible to observe() in the same order. It could be that due to out-of-order execution, the core executing foo() actually executed the store to y before the store to x. (Say, if the cache line containing y was already in L1 cache, but the cache line for x was not; the CPU might as well go ahead and do the store to y rather than stalling for possibly hundreds of cycles while the cache line for x is loaded.) Or, the stores could be held in a store buffer and possibly flushed out to L1 cache in the opposite order. Either way, it is possible that observe() prints out 0, 20.
To ensure the desired ordering, the CPU has to be told to do so, often by executing an explicit memory barrier instruction between the two stores. This will cause the CPU to wait until the store to x has been made visible (by loading the cache line, draining the store buffer, etc.) before making the store to y visible. So if you ask the compiler to put in a memory barrier, it will emit assembly like
STORE [x], 10
BARRIER
STORE [y], 20
In this case, you can be assured that observe() will print either 0, 0 or 10, 0 or 10, 20, but never 0, 20.

(Please note that many simplifying assumptions have been made here. If trying to write this in actual C++, you'd need to use std::atomic types and a similar barrier in observe() to ensure its loads were not reordered.)
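Following that note, here is one way the example might look in actual C++ with std::atomic types; a release store on y paired with an acquire load in observe() is what rules out the 0, 20 outcome. This is a hedged sketch, and run_observe_demo is invented for demonstration:

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> x{0}, y{0};

void foo() {
    x.store(10, std::memory_order_relaxed);
    y.store(20, std::memory_order_release);  // release: x's store is visible before y's
}

std::pair<int, int> observe() {
    int b = y.load(std::memory_order_acquire);  // acquire: pairs with the release store
    int a = x.load(std::memory_order_relaxed);
    return {a, b};  // if b == 20, then a must be 10: "0, 20" cannot happen
}

int run_observe_demo() {
    std::thread t(foo);
    t.join();  // after join, both stores happen-before the loads below
    std::pair<int, int> r = observe();
    return r.first * 100 + r.second;
}
```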
Do I need a memory barrier?
I interpret you to be asking whether a compiler can reorder the assignments to buffer_full with the calls to fill_buffer() and read_buffer(). Such an optimization (indeed, any optimization) is permitted only if it does not alter the externally observable behavior of the program.
In this case, because buffer_full has external linkage, it is unlikely that the compiler can be confident about whether the optimization is permitted. It might be able to do so if the definitions of the fill_buffer() and use_buffer() functions, and of every function they themselves call, etc., are in the same translation unit with the writer_thread() and reader_thread() functions, but that depends somewhat on their implementations. If a conforming compiler is not confident that the optimization is allowed, then it must not perform it.
Inasmuch as your naming implies that the two functions will run in different threads, however, then without synchronization actions such as a memory barrier, you cannot be confident about the relative order in which one thread will perceive modifications to shared, non-_Atomic, non-volatile data performed by a different thread.
Moreover, if one thread writes a non-atomic variable and another thread accesses that same variable (read or write), then there is a data race unless a synchronization action or an atomic operation intervenes between the two in every possible overall order of operations. volatile variables don't really help here (see Why is volatile not considered useful in multithreaded C or C++ programming?). If you make buffer_full atomic, however, or if you implement your functions using atomic read and write operations on it, then that will serve to avoid data races involving not only that variable, but buffer as well (for your present code structure).
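A minimal sketch of what making buffer_full atomic might look like, assuming the question's structure (the function bodies here are invented placeholders). The release store publishes the buffer, and the acquire load pairs with it:

```cpp
#include <atomic>
#include <thread>

// Names follow the question; the bodies are invented placeholders.
int buffer = 0;                        // plain, non-atomic data
std::atomic<bool> buffer_full{false};  // the atomic flag protecting it

void writer_thread() {
    buffer = 123;                                        // fill_buffer()
    buffer_full.store(true, std::memory_order_release);  // publish the buffer
}

int reader_thread() {
    while (!buffer_full.load(std::memory_order_acquire)) {}  // wait for the flag
    return buffer;  // the acquire load orders this read after the fill
}

int run_buffer_demo() {
    std::thread w(writer_thread);
    int v = reader_thread();
    w.join();
    return v;
}
```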
Where do I set memory barriers so that conditional loops observe multithread value changes?
So where would I set memory barriers / fences so that this optimization can't be performed?
A memory barrier/fence restricts memory ordering: it prevents the compiler and the CPU from reordering two or more memory accesses in the same thread. However, memory ordering is not the issue here, as you are only talking about a single memory access per thread.
Instead, you are talking about two memory operations on the same variable from two different threads. Therefore, this is a matter of thread synchronization, not memory ordering.
According to §6.9.2.2 ¶21 of the ISO C++20 standard, if two threads access the same variable without a synchronization operation (e.g. mutex) in between, and if these memory accesses are not
- both atomic, or
- both read-only,
then this results in undefined behavior.
Therefore, in the case of your while loop
while (!stop) {
    // do something
}
assuming that stop is a non-atomic data type, the compiler is allowed to assume that the variable stop will never be modified by another thread while the loop is running (unless you happen to have a synchronization operation inside the loop), because doing so would result in undefined behavior (which means that the compiler would be allowed to do anything).
For this reason, a good optimizing compiler will effectively change the code to the following:
if (!stop) {
    while (true) {
        // do something
    }
}
However, if you change the variable stop to an atomic data type, such as std::atomic_flag or std::atomic<bool>, then it is not undefined behavior for several threads to access this variable at the same time. In that case, the compiler will not be allowed to assume that stop will not be modified by another thread, so it will not be allowed to perform the optimization mentioned above.
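A small sketch of the atomic version, assuming stop is the only shared state; with std::atomic<bool>, the load must be repeated on every iteration, so the loop is guaranteed to observe the other thread's store:

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> stop{false};

long spin_until_stopped() {
    long iterations = 0;
    while (!stop.load(std::memory_order_relaxed)) {  // re-read on every pass
        ++iterations;  // "do something"
    }
    return iterations;
}

bool run_stop_demo() {
    std::thread worker([] { spin_until_stopped(); });
    stop.store(true, std::memory_order_relaxed);  // becomes visible in finite time
    worker.join();  // returns only because the loop actually exits
    return true;
}
```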
Are memory barriers needed because of CPU out-of-order execution or because of a cache consistency problem?
You need barriers to order this core / thread's accesses to globally-visible coherent cache when the ISA's memory ordering rules are weaker than the semantics you need for your algorithm.
Cache is always coherent, but that's a separate thing from consistency (ordering between multiple operations).
You can have memory reordering even on an in-order CPU. In more detail, How is load->store reordering possible with in-order commit? shows how you can get memory reordering on a pipeline that starts executing instructions in program order, but with a cache that allows hit-under-miss and/or a store buffer allowing out-of-order commit.
Related:
- Does an x86 CPU reorder instructions? talks about the difference between memory reordering vs. out of order exec. (And how x86's strongly ordered memory model is implemented on top of aggressive out-of-order execution by having hardware track ordering, with the store buffer decoupling store execution from store visibility to other threads/cores.)
- x86 memory ordering: Loads Reordered with Earlier Stores vs. Intra-Processor Forwarding
- Globally Invisible load instructions
See also https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ and https://preshing.com/20120930/weak-vs-strong-memory-models for some more basics. x86 has a "strong" memory ordering model: program order plus a store buffer with store-forwarding. C++ acquire and release are "free"; only atomic RMWs and seq_cst stores need barriers.
ARM has a "weak" memory ordering model: only C++ memory_order_consume (data dependency ordering) is "free"; acquire and release require special instructions (like ldar / stlr) or barriers.
Are memory barriers necessary for atomic reference counting shared immutable data?
On x86, it will turn into a lock-prefixed assembly instruction, like LOCK XADD.
Being a single instruction, it is non-interruptible. As an added "feature", the lock prefix results in a full memory barrier:
"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.
A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the Java JIT on x86/x64, because mfence is slower on many CPUs even when it's guaranteed to be available, as in 64-bit mode. (Does lock xchg have the same behavior as mfence?)
So you have a full fence on x86 as an added bonus, whether you like it or not. :-)
On PPC, it is different. An LL/SC pair (lwarx & stwcx) with a subtraction inside can be used to load the memory operand into a register, subtract one, and then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC pair can be interrupted (meaning it will fail and retry).
It also does not imply an automatic full fence. This does not, however, compromise the atomicity of the counter in any way. It just means that in the x86 case, you happen to get a fence as well, "for free".
On PPC, one can insert a (partial or) full fence by emitting a (lw)sync instruction.
All in all, explicit memory barriers are not necessary for the atomic counter to work properly.
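The counter itself can be sketched in C++ with std::atomic; the usual idiom (as in common shared-pointer implementations) uses a relaxed increment, a release decrement, and an acquire fence before destruction. This is an illustrative sketch, not code from the answer:

```cpp
#include <atomic>

struct RefCounted {
    std::atomic<int> refs{1};  // starts owned by its creator
    int payload = 0;
};

void acquire_ref(RefCounted* p) {
    p->refs.fetch_add(1, std::memory_order_relaxed);  // counting needs atomicity only
}

bool release_ref(RefCounted* p) {
    if (p->refs.fetch_sub(1, std::memory_order_release) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);  // order prior uses before delete
        delete p;       // last owner frees the object
        return true;    // this caller freed it
    }
    return false;
}

bool run_refcount_demo() {
    RefCounted* p = new RefCounted;     // refs == 1
    acquire_ref(p);                     // refs == 2
    bool freed_early = release_ref(p);  // refs == 1, not freed
    bool freed_last = release_ref(p);   // refs == 0, freed here
    return !freed_early && freed_last;
}
```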