How Does Memory Reordering Help Processors and Compilers

How does memory reordering help processors and compilers?

TL;DR: It gives the compiler and hardware more room to take advantage of the as-if rule by not requiring them to preserve all behaviour of the original source, only the result of the single thread itself.

Taking the externally-observable (from other threads) ordering of loads/stores out of the picture as something that optimizations must preserve gives the compiler a lot of room to merge things into fewer operations. For the hardware, delaying stores is the big one, but for compilers all kinds of reordering can help.

(Letting cache-hit loads finish before earlier cache-miss loads is also useful for hardware, and same for stores, without having to do load-order speculation with possible roll-back. Jeff Preshing's Memory Barriers Are Like Source Control Operations has useful analogies and descriptions of plausible ways hardware can do it. But modern x86 CPUs throw enough transistors and power at these ordering requirements to manage them without huge performance downsides in most code.)


(See part-way down for a section on why it helps the compiler)

Why it helps hardware

Hardware reordering earlier stores with later loads (StoreLoad reordering) inside the CPU is essential for out-of-order execution. (See below).

Other kinds of reordering (e.g. StoreStore reordering, which is the subject of your question) aren't essential, and high-performance CPUs can be built with only StoreLoad reordering, not the other three kinds. (The prime example being x86, where every store is a release-store and every load is an acquire-load. See the x86 tag wiki for more details.)

Some people, like Linus Torvalds, argue that reordering stores with other stores doesn't help the hardware much, because hardware already has to track store-ordering to support out-of-order execution of a single thread. (A single thread always runs as if all of its own stores/loads happen in program order.) See other posts in that thread on realworldtech if you're curious. And/or if you find Linus's mix of insults and sensible technical arguments entertaining :P


For Java, the issue is that architectures exist where the hardware doesn't provide these ordering guarantees. Weak memory ordering is a common feature of RISC ISAs like ARM, PowerPC, and MIPS (but not SPARC-TSO). The reasons behind that design decision are the same ones being argued over in the realworldtech thread I linked: make the hardware simpler, and let software request ordering when needed.

So Java's architect(s) didn't have much of a choice: Implementing a JVM for an architecture with a weaker memory model than the Java standard would require a store-barrier instruction after every single store, and a load-barrier before every load. (Except when the JVM's JIT-compiler can prove that no other thread can have a reference to that variable.) Running barrier instructions all the time is slow.

A strong memory model for Java would make efficient JVMs on ARM (and other ISAs) impossible. Proving that barriers aren't needed is near-impossible, requiring AI levels of global program-understanding. (This goes WAY beyond what normal optimizers do).
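To make that cost concrete, here is a hedged C++ sketch of what "a barrier on every store" would mean in practice (the variable names are mine, and the instruction mappings in the comments are the commonly cited ones, e.g. from the cl.cam.ac.uk mappings page linked further down; exact codegen varies by compiler):

#include <atomic>

int plain;
std::atomic<int> field;   // imagine every Java field had to behave like this

void stores() {
    plain = 1;                                   // AArch64: str  (no ordering guarantee)
    field.store(1, std::memory_order_relaxed);   // AArch64: str  (still just a plain store)
    field.store(1, std::memory_order_seq_cst);   // AArch64: stlr; ARMv7: dmb ish + str + dmb ish
}

If the JVM had to treat every field store like the last line, every ordinary store would pay for a release/seq-cst operation whether or not any other thread ever looks at it.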



Why it helps compilers

(see also Jeff Preshing's excellent blog post on C++ compile-time reordering. This basically applies to Java when you include JIT-compiling to native code as part of the process.)

Another reason for keeping the Java and C/C++ memory models weak is to allow more optimizations. Since other threads are allowed (by the weak memory model) to observe our stores and loads in any order, aggressive transformations are allowed even when the code involves stores to memory.

e.g. in a case like Davide's example:

c.a = 1;
c.b = 1;
c.a++;
c.b++;

// same observable effects as the much simpler
c.a = 2;
c.b = 2;

There's no requirement that other threads be able to observe the intermediate states. So a compiler can just compile that to c.a = 2; c.b = 2;, either at Java-compile time or when the bytecode is JIT-compiled to machine code.

It's common for a method that increments something to be called multiple times from another method. Without this rule, turning it into c.a += 4 could only happen if the compiler could prove that no other thread could observe the difference.
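As a hedged sketch of that situation (the Counter/inc names are invented for illustration), this is the kind of code where the rule pays off:

struct Counter { int a = 0; };

// Hypothetical helper, purely for illustration.
inline void inc(Counter& c) { c.a++; }

void bump4(Counter& c) {
    inc(c);
    inc(c);
    inc(c);
    inc(c);
    // After inlining, a compiler may legally emit this whole body as  c.a += 4;
    // because no other thread is entitled to observe the intermediate values
    // of a non-atomic field.
}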

C++ programmers sometimes make the mistake of thinking that since they're compiling for x86, they don't need std::atomic<int> to get some ordering guarantees for a shared variable. This is wrong, because optimizations happen based on the as-if rule for the language memory model, not the target hardware.
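A minimal sketch of that trap, with invented variable names: on x86 the hardware ordering is strong, but the compiler is still allowed to optimize the non-atomic version into an infinite loop.

#include <atomic>

int plain_flag = 0;                 // data race if another thread writes it
std::atomic<int> ready_flag{0};     // well-defined shared flag

void broken_wait() {
    // The compiler may hoist the load out of the loop under the as-if rule,
    // effectively turning this into:  if (plain_flag == 0) for (;;) {}
    while (plain_flag == 0) { }
}

void correct_wait() {
    // Every load really happens, and the acquire ordering also orders
    // whatever reads of other data follow the loop.
    while (ready_flag.load(std::memory_order_acquire) == 0) { }
}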



More technical hardware explanations:

Why StoreLoad reordering helps performance:

Once a store is committed into cache, it becomes globally visible to threads running on other cores (via the cache-coherency protocol). At that point, it's too late to roll it back (another core might already have gotten a copy of the value). So it can't happen until it's known for certain that the store won't fault, and neither will any instruction before it (i.e. at or after retirement from the out-of-order back-end). And that there wasn't a branch-mispredict at some point earlier, etc. etc. i.e. we need to rule out all cases of mis-speculation before we can commit a store instruction to L1d cache. (This is why we have a store buffer)

Without StoreLoad reordering, every load would have to wait for all preceding stores to retire (i.e. be totally finished executing, known non-speculative) and actually commit the data to cache, before they could read a value from cache for use by later instructions that depend on the value loaded. (The moment when a load copies a value from cache into a register is the critical time when it "happens" as part of the coherency order of loads and stores on that memory location.)
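The visible consequence of this design is the classic StoreLoad litmus test. Here is a hedged C++ sketch (the thread function names are mine) of the outcome a store buffer makes possible:

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_relaxed);    // may sit in this core's store buffer...
    r1 = y.load(std::memory_order_relaxed);   // ...while this load already reads from cache
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

// If thread1 and thread2 run concurrently, r1 == 0 && r2 == 0 is a permitted
// outcome, even on x86: each load can complete before the other core's store
// has committed to cache. Only a full barrier between the store and the load
// in each thread (e.g. seq_cst operations, or mfence on x86) forbids that result.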

Normally CPUs commit stores from the store-buffer to L1d cache after the corresponding store instruction has retired from the out-of-order back-end (ReOrder Buffer = ROB). Some amount of "graduated" stores can be in the store buffer, so this decouples execution from cache-miss stores. But you could give up that benefit and make commit to L1d happen as part of retirement. (Execution of a store would still work by writing the address+data into the store buffer, so it can be speculative, happening when the data is ready. The store buffer keeps this speculation private to the core.)

Without StoreLoad reordering, loads couldn't execute until all earlier stores had committed to cache. This would be a huge roadblock for memory parallelism. i.e. every load would be like an x86 lfence, draining the out-of-order back-end, and like mfence waiting for the store buffer to empty since we're proposing that commit would happen at retirement, not after. Including waiting for any earlier cache miss loads or stores, and waiting for the CPU to chew through all dependency chains, so it would tend to serialize things instead of letting the CPU overlap independent iterations of a loop, or other long dep chains.

Modern x86 CPUs do speculative early loads ahead of other (cache-miss) loads, potentially taking a memory-order mis-speculation pipeline nuke if they detect that their copy of the cache line didn't stay valid from when the load actually happened to when it's architecturally allowed. In that case they discard the ROB contents to roll back to a consistent retirement state and start executing again. (This normally only happens when another core is modifying a cache line, but can also happen if it incorrectly predicted that a load wouldn't reload a store.) (Of course real x86 can freely reorder loads ahead of stores.)

If StoreLoad reordering wasn't allowed, a CPU could still do loads speculatively, but would probably still have to commit stores earlier than normal CPUs. Load speculation could track stores in the store buffer post-retirement.

So really the limitation would be that loads can't retire until earlier stores commit. Those stores can still stay in the store buffer after they retire (become non-speculative). Which doesn't sound that bad on modern CPUs with huge ROBs and large store buffers, but it would be a disaster for in-order CPUs, or for more modest out-of-order execution capabilities in CPUs that existed when memory models were being designed.

Even with huge out-of-order exec capabilities, it introduces quite a lot more speculation, or a larger time-window for speculation, where the CPU might need to nuke the pipeline (discard the ROB). With a large ROB / out-of-order state, that's a lot of work potentially lost. In parallel code that accesses shared memory, this could be bad. (Including false sharing, where two variables happen to be in the same cache line). Penalties for this are already quite substantial and happen even with just loads on current x86 CPUs. (And aren't great even on other ISAs where load ordering isn't required, but the cache-line ping-pong is still a problem).

And cache-miss stores couldn't be hidden as effectively. If your code isn't doing many stores, one that misses in cache can sit in the store buffer for the hundreds of cycles of latency it might take to get the line from DRAM. (Or to get exclusive ownership via a read-for-ownership if the line was dirty in another core, and other cores are contending to read or write it.) If there aren't a lot of stores in the code being executed, it could potentially get hundreds of instructions ahead of the store before the store commits, including multiple loads that come and go. But if all those later loads couldn't retire from the ROB until stores commit, it would stop new potentially-independent instructions from entering the out-of-order back-end and being scheduled to execution units during this delay.

(Most code does quite a bit of storing, though, and a store buffer would quickly fill up. Except on weakly-ordered ISAs that allow StoreStore reordering, so one cache-miss store doesn't bottleneck later stores that hit in cache.)


(I rewrote the above section after realizing that x86 CPUs do speculatively load early, and could apply that to a hypothetical StoreLoad rule as well as x86's actual LoadLoad ordering rule. (Program order + a store buffer with store-forwarding). Some of this section may now be redundant with that.)

How actual CPUs work (when StoreLoad reordering is allowed):

I included some links as part of a brief intro to computer architecture in the early part of my answer on Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs. That might be helpful, or more confusing, if you're finding this hard to follow.

CPUs avoid WAR and WAW pipeline hazards for stores by buffering them in a store queue until store instructions are ready to retire. Loads from the same core have to check the store queue (to preserve the appearance of in-order execution for a single thread, otherwise you'd need memory-barrier instructions before loading anything that might have been stored recently!). The store queue is invisible to other threads; stores only become globally visible when they commit from the store queue to cache, at or after the store instruction retires, but loads become globally visible as soon as they execute. (And they can use values prefetched into the cache well ahead of that.)

See also this answer I wrote explaining store buffers and how they decouple execution from cache-miss store commit, and allow speculative execution of stores. Also wikipedia's article on the classic RISC pipeline has some stuff for simpler CPUs. A store-buffer inherently creates StoreLoad reordering (and also store-forwarding so a core can see its own stores before they become globally visible, assuming the core can do store forwarding instead of stalling.)

So out-of-order execution is possible for stores, but they're only reordered inside the store queue. Since instructions have to retire in order to support precise exceptions, there doesn't appear to be much benefit at all to having the hardware enforce StoreStore ordering.

Since loads become globally visible when they execute, enforcing LoadLoad ordering may require delaying loads after a load that misses in cache. Of course, in reality the CPU would speculatively execute the following loads, and detect a memory-order mis-speculation if it occurs. This is nearly essential for good performance: A large part of the benefit of out-of-order execution is to keep doing useful work, hiding the latency of cache misses.


One of Linus' arguments is that weakly-ordered CPUs require multi-threaded code to use a lot of memory barrier instructions, so they'll need to be cheap for multi-threaded code to not suck. That's only possible if you have hardware tracking the dependency ordering of loads and stores.

But if you have that hardware tracking of dependencies, you can just have the hardware enforce ordering all the time, so software doesn't have to run as many barrier instructions. If you have hardware support to make barriers cheap, why not just make them implicit on every load / store, like x86 does?

His other major argument is that memory ordering is HARD, and a major source of bugs. Getting it right once in hardware is better than every software project having to get it right. (This argument only works because it's possible in hardware without huge performance overhead.)

GCC's reordering of read/write instructions

The reordering that GCC may do is unrelated to the reordering an (x86) CPU may do.

Let's start off with compiler reordering. The C language rules are such that GCC is forbidden from reordering volatile load and store memory accesses with respect to each other, or deleting them, when a sequence point occurs between them (Thanks to bobc for this clarification). That is to say, in the assembly output, those memory accesses will appear, and will be sequenced precisely in the order you specified. Non-volatile accesses, on the other hand, can be reordered with respect to all other accesses, volatile or not, provided that (by the as-if rule) the end result of the calculation is the same.

For instance, a non-volatile load in the C code could be done as many times as the code says, but in a different order (e.g. If the compiler feels it's more convenient to do it earlier or later when more registers are available). It could be done fewer times than the code says (e.g. If a copy of the value happened to still be available in a register in the middle of a large expression). Or it could even be deleted (e.g. if the compiler can prove the uselessness of the load, or if it moved a variable entirely into a register).

To prevent compiler reorderings at other times, you must use a compiler-specific barrier. GCC uses __asm__ __volatile__("":::"memory"); for this purpose.
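A small hedged sketch of that barrier in use (the variable names are mine): the empty asm statement emits no instructions and does not constrain the CPU at all, but the "memory" clobber forces the compiler to complete the stores before it and not to sink, hoist, or merge accesses across it.

int a, b;

void publish() {
    a = 1;
    b = 2;
    __asm__ __volatile__("" ::: "memory");   // compiler barrier: memory must be up to date here
    a = 3;                                   // cannot be merged with the a = 1 above,
                                             // and cannot be moved before the barrier
}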

This is different from CPU reordering, a.k.a. the memory-ordering model. Ancient CPUs executed instructions precisely in the order they appeared in the program; this is called program ordering, or the strong memory-ordering model. Modern CPUs, however, sometimes resort to "cheats" to run faster, by weakening the memory model a little.

The way x86 CPUs weaken the memory model is documented in Intel's Software Developer Manuals, Volume 3, Chapter 8, Section 8.2.2 "Memory Ordering in P6 and More Recent Processor Families". This is, in part, what it reads:

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads.
  • Writes to memory are not reordered with other writes, with [some] exceptions.
  • Reads may be reordered with older writes to different locations but not with older writes to the same location.
  • Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
  • Reads cannot pass earlier LFENCE and MFENCE instructions.
  • Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
  • LFENCE instructions cannot pass earlier reads.
  • SFENCE instructions cannot pass earlier writes.
  • MFENCE instructions cannot pass earlier reads or writes.

It also gives very good examples of what can and cannot be reordered, in Section 8.2.3 "Examples Illustrating the Memory-Ordering Principles".

As you can see, one uses FENCE instructions to prevent an x86 CPU from reordering memory accesses inappropriately.

Lastly, you may be interested in this link, which goes into further detail and comes with the assembly examples you crave.

compiler reordering vs memory reordering

The first barrier (the compiler-level barrier) only applies during compilation. Once the compiler is done, it has no further impact, since it adds nothing to the generated code. It can be useful for avoiding some memory-ordering issues introduced at compile time (the compiler doesn't know how other threads may manipulate these memory locations, although hardly any compiler with normal settings would dare reorder accesses to variables where that could matter).

However, this is far from enough since on modern out-of-order CPUs the hardware itself may reorder operations under the hood. To avoid that, you have ways to tell the HW to watch out, given the exact level and form of restriction you want to achieve (with sequential consistency being the most restrictive and "safe" ordering model, but usually also the most expensive in terms of performance).

To achieve these restrictions, you can try manually maintaining barriers and similar constructs that the ISA provides (through intrinsics, inline assembly, serializing operations, or any other trick). This is usually complicated even if you know what you're doing, and may even be microarchitecture-specific (some CPUs grant some restrictions "for free", making explicit fencing useless). So C++11 added the atomic semantics to make this task easier, and now the compiler adds the necessary code for you, depending on the ordering model you specify.

In your example, the mfence is an example of doing things manually, but you also need to know where to apply it. Used correctly, mfence can be strong enough to provide sequential consistency, but it is also very expensive, since it includes a store fence (mfence = sfence + lfence), which requires draining all pending stores from the internal buffers, a slow operation since the buffering exists precisely to allow stores a lazy commit.
On the other hand, if you want acquire/release semantics, you can choose to implement them with the proper partial fences at the correct places for your architecture, or let the compiler do that for you. If you choose the latter and run on an x86 machine, for example, you'll discover that most of the time nothing needs to be added, since stores already have implicit release semantics and loads have implicit acquire semantics there, but the same may not apply on other architectures.
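A hedged sketch of the "let the compiler do it" route (payload/flag are invented names; the per-ISA mappings in the comments are the typical ones from the page linked below): the source-level ordering is the same everywhere, and the compiler inserts exactly as much fencing as the target needs.

#include <atomic>

int payload;
std::atomic<bool> flag{false};

void producer() {
    payload = 42;
    flag.store(true, std::memory_order_release);        // x86: plain mov; AArch64: stlr
}

int consumer() {
    while (!flag.load(std::memory_order_acquire)) { }   // x86: plain mov; AArch64: ldar
    return payload;                                     // guaranteed to observe 42
}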

Here's a nice summary of the implementation of various ordering semantics per architecture -
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

Is memory reordering visible to other threads on a uniprocessor?

As a C++ question the answer must be that the program contains a data race, so the behavior is undefined. In reality that means that it could print something other than 42.

That is independent of the underlying hardware. As has been pointed out, the loop can be optimized away and the compiler can reorder the assignments in thread 1, so that result can occur even on uniprocessor machines.

[I'll assume that by "uniprocessor" machine you mean a processor with a single core and a single hardware thread.]

You now say that you want to assume compiler reordering or loop elimination does not happen. With this, we have left the realm of C++ and are really asking about the corresponding machine instructions. If you want to eliminate compiler reordering, we can probably also rule out any form of SIMD instructions and consider only instructions operating on a single memory location at a time.

So essentially thread 1 has two store instructions, in the order store-to-x then store-to-f, while thread 2 has test-f-and-loop-if-not-zero (this may be multiple instructions, but it involves a load-from-f) and then a load-from-x.

On any hardware architecture I am aware of or can reasonably imagine, thread 2 will print 42.

One reason is that, if instructions processed by a single processor were not sequentially consistent among themselves, you could hardly assert anything about the effects of a program.

The only event that could interfere here is an interrupt (as is used to trigger a preemptive context switch). A hypothetical machine that saves the entire state of its current execution pipeline upon an interrupt and restores it upon return from the interrupt could produce a different result, but such a machine is impractical and afaik does not exist. These operations would create quite a bit of additional complexity and/or require additional redundant buffers or registers, all for no good reason - except to break your program. Real processors either flush or roll back the current pipeline upon interrupt, which is enough to guarantee sequential consistency for all instructions on a single hardware thread.

And there is no memory model issue to worry about. The weaker memory models originate from the private buffers and caches that sit between each hardware processor and the main memory or nth-level cache they actually share. A single processor has no similarly partitioned resources, and no good reason to have them for multiple (purely software) threads. Again, there is no reason to complicate the architecture and waste resources to make the processor and/or memory subsystem aware of something like separate thread contexts, if there aren't separate processing resources (processors/hardware threads) to keep those resources busy.

Compiler and CPU reordering

The compiler cannot reorder memory accesses across calls to functions in external libraries, because it has to assume such a call might read or write any memory it can reach. If your compiler implements these functions as intrinsics, it will be smart enough not to reorder across them either.

As far as CPU reordering is concerned, the MSDN documentation says "This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order."

Can memory store be reordered really, in an OoOE processor?

The consistency model (or memory model) for the architecture determines what memory operations can be reordered. The idea is always to achieve the best performance from the code, while preserving the semantics expected by the programmer. That is the point from Wikipedia: the memory operations appear in order to the programmer, even though they may have been reordered. Reordering is generally safe when the code is single-threaded, as the processor can easily detect potential violations.

On x86, the common model is that writes are not reordered with other writes. Yet the processor uses out-of-order execution (OoOE), so instructions are being reordered constantly. Generally, the processor has several additional hardware structures to support OoOE, like a reorder buffer and a load-store queue. The reorder buffer ensures that all instructions appear to execute in order, such that interrupts and exceptions break the program at a specific point. The load-store queue functions similarly, in that it can restore the order of memory operations according to the memory model. The load-store queue also disambiguates addresses, so that the processor can identify when operations are made to the same or different addresses.

Back to OoOE: the processor has tens to hundreds of instructions in flight in any given cycle. Loads and stores are computing their addresses, etc. The processor may prefetch the cache lines for the accesses (which may include cache-coherence traffic), but it cannot actually access the line, either to read or to write, until it is safe (according to the memory model) to do so.

Inserting store barriers, memory fences, etc. tells both the compiler and the processor about further restrictions on reordering the memory operations. The compiler is part of implementing the memory model, as some languages like Java have a specific memory model, while others like C just require that memory accesses appear as if they were executed in order.

In conclusion, yes, data and ready can be reordered in an OoOE processor. But whether they actually are depends on the memory model. So if you need a specific order, provide the appropriate indication using barriers, etc., so that the compiler, the processor, etc. will not choose a different order for higher performance.
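For the data/ready pair from the question, here is a hedged C++ sketch of one way to "provide the appropriate indication" (the writer/reader names are mine): standalone fences around otherwise-relaxed accesses restrict both the compiler and the processor. Release/acquire operations on ready itself, as in the earlier examples, are the more common spelling of the same thing.

#include <atomic>

int data = 0;
std::atomic<int> ready{0};

void writer() {
    data = 100;
    std::atomic_thread_fence(std::memory_order_release);   // keeps the data store above the flag store
    ready.store(1, std::memory_order_relaxed);
}

int reader() {
    while (ready.load(std::memory_order_relaxed) == 0) { }
    std::atomic_thread_fence(std::memory_order_acquire);   // keeps the data load below the flag load
    return data;                                            // sees 100, never a stale value
}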


