C++11: the Difference Between Memory_Order_Relaxed and Memory_Order_Consume

What does memory_order_consume really do?

First of all, memory_order_consume is temporarily discouraged by the ISO C++ committee until they come up with something compilers can actually implement. For a few years now, compilers have treated consume as a synonym for acquire. See the section at the bottom of this answer.

Hardware still provides the data dependency, so it's interesting to talk about that, despite there currently being no safely portable ISO C++ way to take advantage of it. (Only hacks with mo_relaxed or hand-rolled atomics, and careful coding based on an understanding of compiler optimizations and asm, kind of like you're trying to do with relaxed. But you don't need volatile.)

Oh, maybe it uses data dependencies to prevent reordering of instructions while ensuring the visibility of Payload?

Not exactly "reordering of instructions", but memory reordering. As you say, sanity and causality are enough in this case if the hardware provides dependency ordering. C++ is portable to machines that don't (e.g. DEC Alpha).

The normal way to get visibility for Payload is via a release-store in the writer and an acquire-load in the reader which sees the value from that release-store. https://preshing.com/20120913/acquire-and-release-semantics/. (So of course repeatedly storing the same value to a "ready_flag" or pointer doesn't let the reader figure out whether it's seeing a new or old store.)

Release / acquire creates a happens-before synchronization relationship between the threads, which guarantees visibility of everything the writer did before the release-store. (consume doesn't, that's why only the dependent loads are ordered.)

(consume is an optimization on this: avoiding a memory barrier in the reader by letting the compiler take advantage of hardware guarantees as long as you follow some dependency rules.)


You have some misconceptions about what CPU cache is, and about what volatile does, which I commented about under the question. A release-store makes sure earlier non-atomic assignments are visible in memory.

(Also, cache is coherent; it provides all CPUs with a shared view of memory that they can agree on. Registers are thread-private and not coherent, that's what people mean when they say a value is "cached". Registers are not CPU cache, but software can use them to hold a copy of something from memory. When to use volatile with multi threading? - never, but it does have some effects in real CPUs because they have coherent cache. It's a bad way to roll your own mo_relaxed. See also https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/)

In practice on real CPUs, memory reordering happens locally within each core; cache itself is coherent and never gets "out of sync". (Other copies are invalidated before a store can become globally visible.) So release just has to make sure the local CPU's stores become globally visible (commit to L1d cache) in the right order. ISO C++ doesn't specify any of that level of detail, and an implementation that worked very differently is hypothetically possible.

Making the writer's store volatile is irrelevant in practice because a non-atomic assignment followed by a release-store already has to make everything visible to other threads that might do an acquire-load and sync with that release store. It's irrelevant on paper in pure ISO C++ because it doesn't avoid data-race UB.

(Of course, it's theoretically possible for whole-program optimization to see that there are no acquire or consume loads that would ever load this store, and optimize away the release property. But compilers currently don't optimize atomics in general even locally, and never try to do that kind of whole-program analysis. So code-gen for writer functions will assume that there might be a reader that syncs with any given store of release or seq_cst ordering.)



What does memory_order_consume really do?

One thing mo_consume does is to make sure the compiler uses a barrier instruction on implementations where the underlying hardware doesn't provide dependency ordering naturally / for free. In practice that means only on DEC Alpha. Dependent loads reordering in CPU / Memory order consume usage in C11

Your question is a near duplicate of C++11: the difference between memory_order_relaxed and memory_order_consume - see the answers there for the body of your question about misguided attempts to do stuff with volatile and relaxed. (I'm mostly answering because of the title question.)

It also ensures that the compiler uses a barrier at some point before execution passes into code that doesn't know about the data dependency this value carries (i.e. no [[carries_dependency]] tag on the function arg in the declaration). Such code might replace x-x with a constant 0 and optimize it away, losing the data dependency. But code that knows about the dependency would have to use something like a sub r1, r1, r1 instruction to get a zero with a data dependency.

That can't happen for your use-case (where relaxed will work in practice on ISAs other than Alpha), but the on-paper design of mo_consume allowed all kinds of stuff that would require different code-gen from what compilers would normally do. This is part of what made it so hard to implement efficiently that compilers just promote it to mo_acquire.

The other part of the problem is that it requires code to be littered with kill_dependency and/or [[carries_dependency]] all over the place, or you'll end up with a barrier at function boundaries anyway. These problems led the ISO C++ committee to temporarily discourage consume.

  • C++11: the difference between memory_order_relaxed and memory_order_consume
  • P0371R1: Temporarily discourage memory_order_consume, and other C++ WG21 documents linked from it, about why consume is discouraged.
  • Memory order consume usage in C11 - more about the hardware mechanism / guarantee that consume is intended to expose to software. Out-of-order exec can only reorder independent work anyway, and can't start a load before the load address is known, so on most CPUs enforcing dependency ordering happens for free: only a few models of DEC Alpha could violate causality and effectively load data from before it had the pointer that gave it the address.

And BTW:

The example code is safe with release + consume regardless of volatile. It's safe on most compilers and most ISAs in practice with release store + relaxed load, although of course ISO C++ has nothing to say about the correctness of that code. But with the current state of compilers, that's a hack that some code uses (like the Linux kernel's RCU).

If you need that level of read-side scaling, you'll have to work outside of what ISO C++ guarantees. That means your code will have to make assumptions about how compilers work (and that you're running on a "normal" ISA that isn't DEC Alpha), which means you need to support some set of compilers (and maybe ISAs, although there aren't many multi-core ISAs around). The Linux kernel only cares about a few compilers (mostly recent GCC, also clang I think), and the ISAs that they have kernel code for.

What do each memory_order mean?

The GCC Wiki gives a very thorough and easy to understand explanation with code examples.

(excerpt edited, and emphasis added)

IMPORTANT:

Upon re-reading the below quote copied from the GCC Wiki in the process of adding my own wording to the answer, I noticed that the quote is actually wrong. They got acquire and consume exactly the wrong way around. A release-consume operation only provides an ordering guarantee on dependent data, whereas a release-acquire operation provides that guarantee regardless of whether the data is dependent on the atomic value or not.

The first model is "sequentially consistent". This is the default mode used when none is specified, and it is the most restrictive. It can also be explicitly specified via memory_order_seq_cst. It provides the same restrictions and limitation to moving loads around that sequential programmers are inherently familiar with, except it applies across threads.

[...]

From a practical point of view, this amounts to all atomic operations acting as optimization barriers. It's OK to re-order things between atomic operations, but not across the operation. Thread local stuff is also unaffected since there is no visibility to other threads. [...] This mode also provides consistency across all threads.

The opposite approach is memory_order_relaxed. This model allows for much less synchronization by removing the happens-before restrictions. These types of atomic operations can also have various optimizations performed on them, such as dead store removal and commoning. [...] Without any happens-before edges, no thread can count on a specific ordering from another thread.

The relaxed mode is most commonly used when the programmer simply wants a variable to be atomic in nature rather than using it to synchronize threads for other shared memory data.

The third mode (memory_order_acquire / memory_order_release) is a hybrid between the other two. The acquire/release mode is similar to the sequentially consistent mode, except it only applies a happens-before relationship to dependent variables. This allows for a relaxing of the synchronization required between independent reads of independent writes.

memory_order_consume is a further subtle refinement in the release/acquire memory model that relaxes the requirements slightly by removing the happens before ordering on non-dependent shared variables as well.

[...]

The real difference boils down to how much state the hardware has to flush in order to synchronize. Since a consume operation may therefore execute faster, someone who knows what they are doing can use it for performance critical applications.

Here follows my own attempt at a more mundane explanation:

A different way to look at it is from the point of view of reordering reads and writes, both atomic and ordinary:

All atomic operations are guaranteed to be atomic within themselves (the combination of two atomic operations is not atomic as a whole!) and to be visible in the total order in which they appear on the timeline of the execution stream. That means no atomic operation can, under any circumstances, be reordered, but other memory operations might very well be. Compilers (and CPUs) routinely do such reordering as an optimization.

It also means the compiler must use whatever instructions are necessary to guarantee that an atomic operation executing at any time will see the results of each and every other atomic operation, possibly on another processor core (but not necessarily other operations), that were executed before.

Now, a relaxed operation is just that, the bare minimum. It does nothing in addition and provides no other guarantees. It is the cheapest possible operation. For non-read-modify-write operations on strongly ordered processor architectures (e.g. x86/amd64) this boils down to a plain, ordinary move.

The sequentially consistent operation is the exact opposite, it enforces strict ordering not only for atomic operations, but also for other memory operations that happen before or after. Neither one can cross the barrier imposed by the atomic operation. Practically, this means lost optimization opportunities, and possibly fence instructions may have to be inserted. This is the most expensive model.

A release operation prevents ordinary loads and stores from being reordered after the atomic operation, whereas an acquire operation prevents ordinary loads and stores from being reordered before the atomic operation. Everything else can still be moved around.

The combination of preventing stores being moved after, and loads being moved before the respective atomic operation makes sure that whatever the acquiring thread gets to see is consistent, with only a small amount of optimization opportunity lost.

One may think of that as something like a non-existent lock that is being released (by the writer) and acquired (by the reader). Except... there is no lock.

In practice, release/acquire usually means the compiler need not use any particularly expensive special instructions, but it cannot freely reorder loads and stores to its liking, which may miss out on some (small) optimization opportunities.

Finally, consume is the same operation as acquire, only with the exception that the ordering guarantees only apply to dependent data. Dependent data would e.g. be data that is pointed-to by an atomically modified pointer.

Arguably, that may provide for a couple of optimization opportunities that are not present with acquire operations (since less data is subject to restrictions), however this happens at the expense of more complex and more error-prone code, and the non-trivial task of getting dependency chains correct.

It is currently discouraged to use consume ordering while the specification is being revised.

Memory order consume usage in C11

consume is cheaper than acquire. All CPUs (except DEC Alpha AXP, with its famously weak memory model¹) do it for free, unlike acquire. (acquire is not free except on x86 and SPARC-TSO, where the hardware already has acq/rel memory ordering without extra barriers or special instructions.)

On ARM/AArch64/PowerPC/MIPS/etc weakly-ordered ISAs, consume and relaxed are the only orderings that don't require any extra barriers, just ordinary cheap load instructions. i.e. all asm load instructions are (at least) consume loads, except on Alpha. acquire requires LoadStore and LoadLoad ordering, which is a cheaper barrier instruction than a full-barrier for seq_cst, but still more expensive than nothing.

mo_consume is like acquire only for loads with a data dependency on the consume load. e.g. float *array = atomic_ld(&shared, mo_consume);, then access to any array[i] is safe if the producer stored the buffer and then used a mo_release store to write the pointer to the shared variable. But independent loads/stores don't have to wait for the consume load to complete, and can happen before it even if they appear later in program order. So consume only orders the bare minimum, not affecting other loads or stores.


(It's basically free to implement support for consume semantics in hardware for most CPU designs, because OoO exec can't break true dependencies, and a load has a data dependency on the pointer, so loading a pointer and then dereferencing it inherently orders those 2 loads just by the nature of causality. Unless CPUs do value-prediction or something crazy.
Value prediction is like branch prediction, but guess what value is going to be loaded instead of which way a branch is going to go.

Alpha had to do some crazy stuff to make CPUs that could actually load data from before the pointer value was truly loaded, when the stores were done in order with sufficient barriers.

Unlike stores, where the store buffer can introduce reordering between store execution and commit to L1d cache, loads become "visible" by taking data from L1d cache when they execute, not when they retire and eventually commit. So ordering 2 loads wrt. each other really does just mean executing those 2 loads in order. With a data dependency of one on the other, causality requires that on CPUs without value prediction, and on most architectures the ISA rules do specifically require it. So you don't have to use a barrier between loading + using a pointer in asm, e.g. for traversing a linked list.)

See also Dependent loads reordering in CPU


But current compilers just give up and strengthen consume to acquire

... instead of trying to map C dependencies to asm data dependencies (without accidentally breaking cases that have only a control dependency, which branch prediction + speculative execution could bypass). Apparently it's a hard problem for compilers to keep track of and make safe.

It's non-trivial to map C to asm, because if the dependency is only in the form of a conditional branch, the asm rules don't apply. So it's hard to define C rules for mo_consume propagating dependencies only in ways that line up with what does "carry a dependency" in terms of asm ISA rules.

So yes, you're correct that consume can be safely replaced with acquire, but you're totally missing the point.


ISAs with weak memory-ordering rules do have rules about which instructions carry a dependency. So even an instruction like ARM eor r0, r0, r0, which unconditionally zeroes r0, is architecturally required to still carry a data dependency on the old value, unlike x86 where the xor eax,eax idiom is specially recognized as dependency-breaking².

See also http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/

I also mentioned mo_consume in an answer on Atomic operations, std::atomic<> and ordering of writes.


Footnote 1: The few Alpha models that actually could in theory "violate causality" didn't do value-prediction, there was a different mechanism with their banked cache. I think I've seen a more detailed explanation of how it was possible, but Linus's comments about how rare it actually was are interesting.

Linus Torvalds (Linux lead developer), in a RealWorldTech forum thread

I wonder, did you see non-causality on Alpha by yourself or just in the manual?


I never saw it myself, and I don't think any of the models I ever had
access to actually did it. Which actually made the (slow) RMB
instruction extra annoying, because it was just pure downside.

Even on CPU's that actually could re-order the loads, it was
apparently basically impossible to hit in practice. Which is actually
pretty nasty. It result in "oops, I forgot a barrier, but everything
worked fine for a decade, with three odd reports of 'that can't
happen' bugs from the field" kinds of things. Figuring out what's
going on is just painful as hell.

Which models actually had it? And how exactly they got here?


I think it was the 21264, and I have this dim memory of it being due
to a partitioned cache: even if the originating CPU did two writes in
order (with a wmb in between), the reading CPU might end up having the
first write delayed (because the cache partition that it went into was
busy with other updates), and would read the second write first. If
that second write was the address to the first one, it could then
follow that pointer, and without a read barrier to synchronize the
cache partitions, it could see the old stale value.

But note the "dim memory". I may have confused it with something else.
I haven't actually used an alpha in closer to two decades by now. You
can get very similar effects from value prediction, but I don't think
any alpha microarchitecture ever did that.

Anyway, there definitely were versions of the alpha that could do
this, and it wasn't just purely theoretical.

(RMB = Read Memory Barrier asm instruction, and/or the name of Linux kernel function rmb() that wraps whatever inline asm is necessary to make that happen. e.g. on x86, just a barrier to compile-time reordering, asm("":::"memory"). I think modern Linux manages to avoid an acquire barrier when only a data dependency is needed, unlike C11/C++11, but I forget. Linux is only portable to a few compilers, and those compilers do take care to support what Linux depends on, so they have an easier time than the ISO C11 standard in cooking up something that works in practice on real ISAs.)

See also https://lkml.org/lkml/2012/2/1/521 re: Linux's smp_read_barrier_depends() which is necessary in Linux only because of Alpha. (But a reply from Hans Boehm points out that "compilers can, and sometimes do, remove dependencies", which is why C11 memory_order_consume support needs to be so elaborate to avoid risk of breakage. Thus smp_read_barrier_depends is potentially brittle.)


Footnote 2: x86 orders all loads whether they carry a data dependency on the pointer or not, so it doesn't need to preserve "false" dependencies, and with a variable-length instruction set it actually saves code size to xor eax,eax (2 bytes) instead of mov eax,0 (5 bytes).

So xor reg,reg became the standard idiom since early 8086 days, and now it's recognized and actually handled like mov, with no dependency on the old value or RAX. (And in fact more efficiently than mov reg,0 beyond just code-size: What is the best way to set a register to zero in x86 assembly: xor, mov or and?)

But this is impossible for ARM or most other weakly-ordered ISAs; like I said, they're literally not allowed to do this.

ldr  r3, [something]    ; load r3 = mem
eor  r0, r3, r3         ; r0 = r3^r3 = 0
ldr  r4, [r1, r0]       ; load r4 = mem[r1+r0]. Ordered after the other load

is required to inject a dependency on r0 and order the load of r4 after the load of r3, even though the load address r1+r0 is always just r1 because r3^r3 = 0. But only that load, not all other later loads; it's not an acquire barrier or an acquire load.

Difference between memory_order_consume and memory_order_acquire

The C11 Standard's ruling is as follows.

5.1.2.4 Multi-threaded executions and data races


  15. An evaluation A is dependency-ordered before 16) an evaluation B if:

    — A performs a release operation on an atomic object M, and, in another thread, B performs a consume operation on M and reads a value written by any side effect in the release sequence headed by A, or

    — for some evaluation X, A is dependency-ordered before X and X carries a dependency to B.

  16. An evaluation A inter-thread happens before an evaluation B if A synchronizes with B, A is dependency-ordered before B, or, for some evaluation X:

    — A synchronizes with X and X is sequenced before B,

    — A is sequenced before X and X inter-thread happens before B, or

    — A inter-thread happens before X and X inter-thread happens before B.

  17. NOTE 7 The ‘‘inter-thread happens before’’ relation describes arbitrary concatenations of ‘‘sequenced before’’, ‘‘synchronizes with’’, and ‘‘dependency-ordered before’’ relationships, with two exceptions. The first exception is that a concatenation is not permitted to end with ‘‘dependency-ordered before’’ followed by ‘‘sequenced before’’. The reason for this limitation is that a consume operation participating in a ‘‘dependency-ordered before’’ relationship provides ordering only with respect to operations to which this consume operation actually carries a dependency. The reason that this limitation applies only to the end of such a concatenation is that any subsequent release operation will provide the required ordering for a prior consume operation. The second exception is that a concatenation is not permitted to consist entirely of ‘‘sequenced before’’. The reasons for this limitation are (1) to permit ‘‘inter-thread happens before’’ to be transitively closed and (2) the ‘‘happens before’’ relation, defined below, provides for relationships consisting entirely of ‘‘sequenced before’’.

  18. An evaluation A happens before an evaluation B if A is sequenced before B or A inter-thread happens before B.

  19. A visible side effect A on an object M with respect to a value computation B of M satisfies the conditions:

    — A happens before B, and

    — there is no other side effect X to M such that A happens before X and X happens before B.

    The value of a non-atomic scalar object M, as determined by evaluation B, shall be the value stored by the visible side effect A.

(emphasis added)


In the commentary below, I'll abbreviate as follows:

  • Dependency-ordered before: DOB
  • Inter-thread happens before: ITHB
  • Happens before: HB
  • Sequenced before: SeqB

Let us review how this applies. We have 4 relevant memory operations, which we will name Evaluations A, B, C and D:

Thread 1:

y.store (20);             //    Release; Evaluation A
x.store (10);             //    Release; Evaluation B

Thread 2:

if (x.load() == 10) {        //    Consume; Evaluation C
    assert (y.load() == 20); //    Consume; Evaluation D
    y.store (10);
}

To prove the assert never trips, we in effect seek to prove that A is always a visible side-effect at D. In accordance with 5.1.2.4 (15), we have:

A SeqB B DOB C SeqB D

which is a concatenation ending in DOB followed by SeqB. This is explicitly ruled by (17) to not be an ITHB concatenation, despite what (16) says.

We know that since A and D are not in the same thread of execution, A is not SeqB D; Hence neither of the two conditions in (18) for HB is satisfied, and A does not HB D.

It follows that A is not guaranteed to be a visible side effect at D, since one of the conditions of (19) is not met. The assert may fail.


How this could play out, then, is described here, in the C++ standard's memory model discussion and here, Section 4.2 Control Dependencies:

  1. (Some time ahead) Thread 2's branch predictor guesses that the if will be taken.
  2. Thread 2 approaches the predicted-taken branch and begins speculative fetching.
  3. Thread 2 out-of-order and speculatively loads 0xGUNK from y (Evaluation D). (Maybe it was not yet evicted from cache?).
  4. Thread 1 stores 20 into y (Evaluation A)
  5. Thread 1 stores 10 into x (Evaluation B)
  6. Thread 2 loads 10 from x (Evaluation C)
  7. Thread 2 confirms the if is taken.
  8. Thread 2's speculative load of y == 0xGUNK is committed.
  9. Thread 2 fails assert.

The reason why it is permitted for Evaluation D to be reordered before C is because a consume does not forbid it. This is unlike an acquire-load, which prevents any load/store after it in program order from being reordered before it. Again, 5.1.2.4(15) states, a consume operation participating in a ‘‘dependency-ordered before’’ relationship provides ordering only with respect to operations to which this consume operation actually carries a dependency, and there most definitely is not a dependency between the two loads.


CppMem verification

CppMem is a tool that helps explore shared data access scenarios under the C11 and C++11 memory models.

For the following code that approximates the scenario in the question:

int main() {
  atomic_int x, y;
  y.store(30, mo_seq_cst);
  {{{ { y.store(20, mo_release);
        x.store(10, mo_release); }
  ||| { r3 = x.load(mo_consume).readsvalue(10);
        r4 = y.load(mo_consume); }
  }}};
  return 0;
}

The tool reports two consistent, race-free scenarios, namely:

Consume, Success Scenario

In which y=20 is successfully read, and

Consume, Failure Scenario

In which the "stale" initialization value y=30 is read.

By contrast, when mo_acquire is used for the loads, CppMem reports only one consistent, race-free scenario, namely the correct one:

Acquire, Success Scenario

in which y=20 is read.

Are memory orderings: consume, acq_rel and seq_cst ever needed on Intel x86?

If you care about portable performance, you should ideally write your C++ source with the minimum necessary ordering for each operation. The only thing that really costs "extra" on x86 is mo_seq_cst for a pure store, so make a point of avoiding that even for x86.

(relaxed ops can also allow more compile-time optimization of the surrounding non-atomic operations, e.g. CSE and dead store elimination, because relaxed ops avoid a compiler barrier. If you don't need any order wrt. surrounding code, tell the compiler that fact so it can optimize.)

Keep in mind that you can't fully test weaker orders if you only have x86 hardware, especially atomic RMWs with only acquire or release, so in practice it's safer to leave your RMWs as seq_cst if you're doing anything that's already complicated and hard to reason about correctness.

x86 asm naturally has acquire loads, release stores, and seq_cst RMW operations. Compile-time reordering is possible with weaker orders in the source, but after the compiler makes its choices, those are "nailed down" into x86 asm. (A seq_cst store requires an mfence after the mov, or using xchg. seq_cst loads don't actually have any extra cost, but it's more accurate to describe them as acquire, because earlier stores can reorder past them, and loads all being acquire means they can't reorder with each other.)


There are very few use-cases where seq_cst is required (draining the store buffer before later loads can happen).


