C++, Std::Atomic, What Is Std::Memory_Order and How to Use Them

c++, std::atomic, what is std::memory_order and how to use them?

Can anyone explain what is std::memory_order in plain English,

The best "Plain English" explanation I've found for the various memory orderings is Bartoz Milewski's article on relaxed atomics: http://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/

And the follow-up post: http://bartoszmilewski.com/2008/12/23/the-inscrutable-c-memory-model/

But note that whilst these articles are a good introduction, they pre-date the C++11 standard and won't tell you everything you need to know to use them safely.

and how to use them with std::atomic<>?

My best advice to you here is: don't. Relaxed atomics are (probably) the trickiest and most dangerous thing in C++11. Stick to std::atomic<T> with the default memory ordering (sequential consistency) until you're really, really sure that you have a performance problem that can be solved by using the relaxed memory orderings.

In the second article linked above, Bartoz Milewski reaches the following conclusion:

I had no idea what I was getting myself into when attempting to reason
about C++ weak atomics. The theory behind them is so complex that it’s
borderline unusable. It took three people (Anthony, Hans, and me) and
a modification to the Standard to complete the proof of a relatively
simple algorithm. Imagine doing the same for a lock-free queue based
on weak atomics!

std::memory_order for std::atomicT::wait

I posted this issue to GitHub for this proposal, and got a response that std::atomic::wait is meant to be implemented on a futex with logic over it, specifically to mask spurious wakes.

So, cppreference.com is wrong about this:

These functions are allowed to unblock spuriously, i.e. return due to reasons other than value change or notification.

And while loop is superfluous in my example, I should use just:

std::atomic<bool> x;

x.wait(false, std::memory_order_acquire);

What do each memory_order mean?

The GCC Wiki gives a very thorough and easy to understand explanation with code examples.

(excerpt edited, and emphasis added)

IMPORTANT:

Upon re-reading the below quote copied from the GCC Wiki in the process of adding my own wording to the answer, I noticed that the quote is actually wrong. They got acquire and consume exactly the wrong way around. A release-consume operation only provides an ordering guarantee on dependent data whereas a release-acquire operation provides that guarantee regardless of data being dependent on the atomic value or not.

The first model is "sequentially consistent". This is the default mode used when none is specified, and it is the most restrictive. It can also be explicitly specified via memory_order_seq_cst. It provides the same restrictions and limitation to moving loads around that sequential programmers are inherently familiar with, except it applies across threads.

[...]

From a practical point of view, this amounts to all atomic operations acting as optimization barriers. It's OK to re-order things between atomic operations, but not across the operation. Thread local stuff is also unaffected since there is no visibility to other threads. [...] This mode also provides consistency across all threads.

The opposite approach is memory_order_relaxed. This model allows for much less synchronization by removing the happens-before restrictions. These types of atomic operations can also have various optimizations performed on them, such as dead store removal and commoning. [...] Without any happens-before edges, no thread can count on a specific ordering from another thread.

The relaxed mode is most commonly used when the programmer simply wants a variable to be atomic in nature rather than using it to synchronize threads for other shared memory data.

The third mode (memory_order_acquire / memory_order_release) is a hybrid between the other two. The acquire/release mode is similar to the sequentially consistent mode, except it only applies a happens-before relationship to dependent variables. This allows for a relaxing of the synchronization required between independent reads of independent writes.

memory_order_consume is a further subtle refinement in the release/acquire memory model that relaxes the requirements slightly by removing the happens before ordering on non-dependent shared variables as well.

[...]

The real difference boils down to how much state the hardware has to flush in order to synchronize. Since a consume operation may therefore execute faster, someone who knows what they are doing can use it for performance critical applications.

Here follows my own attempt at a more mundane explanation:

A different approach to look at it is to look at the problem from the point of view of reordering reads and writes, both atomic and ordinary:

All atomic operations are guaranteed to be atomic within themselves (the combination of two atomic operations is not atomic as a whole!) and to be visible in the total order in which they appear on the timeline of the execution stream. That means no atomic operation can, under any circumstances, be reordered, but other memory operations might very well be. Compilers (and CPUs) routinely do such reordering as an optimization.

It also means the compiler must use whatever instructions are necessary to guarantee that an atomic operation executing at any time will see the results of each and every other atomic operation, possibly on another processor core (but not necessarily other operations), that were executed before.

Now, a relaxed is just that, the bare minimum. It does nothing in addition and provides no other guarantees. It is the cheapest possible operation. For non-read-modify-write operations on strongly ordered processor architectures (e.g. x86/amd64) this boils down to a plain normal, ordinary move.

The sequentially consistent operation is the exact opposite, it enforces strict ordering not only for atomic operations, but also for other memory operations that happen before or after. Neither one can cross the barrier imposed by the atomic operation. Practically, this means lost optimization opportunities, and possibly fence instructions may have to be inserted. This is the most expensive model.

A release operation prevents ordinary loads and stores from being reordered after the atomic operation, whereas an acquire operation prevents ordinary loads and stores from being reordered before the atomic operation. Everything else can still be moved around.

The combination of preventing stores being moved after, and loads being moved before the respective atomic operation makes sure that whatever the acquiring thread gets to see is consistent, with only a small amount of optimization opportunity lost.

One may think of that as something like a non-existent lock that is being released (by the writer) and acquired (by the reader). Except... there is no lock.

In practice, release/acquire usually means the compiler needs not use any particularly expensive special instructions, but it cannot freely reorder loads and stores to its liking, which may miss out some (small) optimization opportuntities.

Finally, consume is the same operation as acquire, only with the exception that the ordering guarantees only apply to dependent data. Dependent data would e.g. be data that is pointed-to by an atomically modified pointer.

Arguably, that may provide for a couple of optimization opportunities that are not present with acquire operations (since fewer data is subject to restrictions), however this happens at the expense of more complex and more error-prone code, and the non-trivial task of getting dependency chains correct.

It is currently discouraged to use consume ordering while the specification is being revised.

C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads

Following up on some things in the comments:

As has been discussed, there is a well-defined modification order M for done on any given run of the program. Every thread does one store to done, which means one entry in M. And by the nature of atomic read-modify-writes, the value returned by each thread's test_and_set is the value that immediately precedes its own store in the order M. That's promised in C++20 atomics.order p10, which is the critical clause for understanding atomic RMW in the C++ memory model.

Now there are a finite number of threads, each corresponding to one entry in M, which is a total order. Necessarily there is one such entry that precedes all the others. Call it m1. The test_and_set whose store is entry m1 in M must return the preceding value in M. That can only be the value 0 which initialized done. So the thread corresponding to m1 will see test_and_set return 0. Every other thread will see it return 1, because each of their modifications m2, ..., mN follows (in M) another modification, which must have been a test_and_set storing the value 1.

We may not be bothering to observe all of the total order M, but this program does determine which of its entries is first on this particular run. It's the unique one whose test_and_set returns 0. A thread that sees its test_and_set return 1 won't know whether it came 2nd or 8th or 96th in that order, but it does know that it wasn't first, and that's all that matters here.

Another way to think about it: suppose it were possible for two threads (tA, tB) both to load the value 0. Well, each one makes an entry in the modification order; call them mA and mB. M is a total order so one has to go before the other. And bearing in mind the all-important [atomics.order p10], you will quickly find there is no legal way for you to fill out the rest of M.

All of this is promised by the standard without any reference to memory ordering, so it works even with std::memory_order_relaxed. The only effect of relaxed memory ordering is that we can't say much about how our load/store will become visible with respect to operations on other variables. That's irrelevant to the program at hand; it doesn't even have any other variables.

In the actual implementation, this means that an atomic RMW really has to exclusively own the variable for the duration of the operation. We must ensure that no other thread does a store to that variable, nor the load half of a read-modify-write, during that period. In a MESI-like coherent cache, this is done by temporarily locking the cache line in the E state; if the system makes it possible for us to lose that lock (like an LL/SC architecture), abort and start again.

As to your comment about "a thread reading false from its own cache/buffer": the implementation mustn't allow that in an atomic RMW, not even with relaxed ordering. When you do an atomic RMW, you must read it while you hold the lock, and use that value in the RMW operation. You can't use some old value that happens to be in a buffer somewhere. Likewise, you have to complete the write while you still hold the lock; you can't stash it in a buffer and let it complete later.

C++11 atomic: is std::memory_order code portable?

The C++ standard does have the concept of so-called "freestanding" implementations, which can support a subset of the standard library. However, it also defines bare-minimum functionality that even a freestanding implementation must support. Within that list is the entirety of the <atomic> header.

So yes, implementations must support these.

However, that doesn't mean that a particular flag will do exactly and only what that flag describes. The flags represent the minimum memory barrier, the specific things that are guaranteed to be visible. The implementation could issue a full memory barrier even for flags that don't require it, if the hardware implementation doesn't have lower-tier memory barriers.

So you should write code against what the standard says, and let the compiler sort out the details. If it proves to be inefficient on a platform, you can inspect the assembly to see if you might be able to improve matters.

But to answer your main question, yes, atomic-based code is portable (modulo compiler bugs).

How to understand RELAXED ORDERING in std::memory_order (C++)

You and a friend both agree that x=false and y=false. One day, you send him a letter telling him that x=true. The next day you send him a letter telling him that y=true. You definitely send him the letters in the right order.

Sometime later, your friend receives a letter from you saying that y=true. Now what does your friend know about x? He probably already received the letter that told him x=true. But maybe the postal system temporarily lost it and he'll receive it tomorrow. So for him, x=false and x=true are both valid possibilities when he receives the y=true letter.

So, back to the silicon world. Memory between threads has no guarantee at all that writes from other threads turn up in any particular order, and so the 'delayed x' is totally a possibility. All adding an atomic and using relaxed does is stop two threads racing on a single variable from becoming undefined behaviour. It makes no guarantees at all to ordering. Thats what the stronger orderings are for.

Or, in a slightly more crude way, behold my MSPaint skills:

Sample Image

In this case, the purple arrow which is the flow of 'x' from the first thread to the second thread comes too late, whereas the green arrow (y crossing over) happens fast.

Do I understand the semantics of std::memory_order correctly?

cppreference is only a summary of the C++ standard, and sometimes its text is less precise. The actual standard draft makes it clear: The final C++20 working draft N4681 states in atomics.order, par. 4 (p. 1525):

There is a single total order S on all memory_order::seq_cst operations, including fences, that satisfies the following constraints [...]

This clearly says all seq_cst operations, not just all operations on a particular object.

And notes 6 and 7 further down emphasize that the order does not apply to weaker memory orders:

6 [Note: We do not require that S be consistent with “happens before” (6.9.2.1). This allows more efficient
implementation of memory_order::acquire and memory_order::release on some machine architectures.
It can produce surprising results when these are mixed with memory_order::seq_cst accesses. — end note]
7 [Note: memory_order::seq_cst ensures sequential consistency only for a program that is free of data races
and uses exclusively memory_order::seq_cst atomic operations. Any use of weaker ordering will invalidate
this guarantee unless extreme care is used. In many cases, memory_order::seq_cst atomic operation

Understanding `memory_order_acquire` and `memory_order_release` in C++11

Acquire and Release are Memory Barriers.
If your program reads data after an acquire barrier you are assured you will be reading data consistent in order with any preceding release by any other thread in respect of the same atomic variable. Atomic variables are guaranteed to have an absolute order (when using memory_order_acquire and memory_order_release though weaker operations are provided for) to their reads and writes across all threads. These barriers in effect propagate that order to any threads using that atomic variable.
You can use atomics to indicate something has 'finished' or is 'ready' but if the consumer reads beyond that atomic variable the consumer can't be rely on 'seeing' the right 'versions' of other memory and atomics would have limited value.

The statements about 'moving before' or 'moving after' are instructions to the optimizer that it shouldn't re-order operations to take place out of order. Optimizers are very good at re-ordering instructions and even omitting redundant reads/writes but if they re-organise the code across the memory barriers they may unwittingly violate that order.

Your code relies on the std::string object (a) having been constructed in producer() before ptr is assigned and (b) the constructed version of that string (i.e. the version of the memory it occupies) being the one that consumer() reads.
Put simply consumer() is going to eagerly read the string as soon as it sees ptr assigned so it damn well better see a valid and fully constructed object or bad times will ensue.
In that code 'the act' of assigning ptr is how producer() 'tells' consumer the string is 'ready'. The memory barrier exists to make sure that's what the consumer sees.

Conversely if ptr was declared as an ordinary std::string * then the compiler could decide to optimize p away and assign the allocated address directly to ptr and only then construct the object and assign the int data. That is likely a disaster for the consumer thread which is using that assignment as the indicator that the objects producer is preparing are ready.
To be accurate if ptr were a pointer the consumer may never see the value assigned or on some architectures read a partially assigned value where only some of the bytes have been assigned and it points to a garbage memory location. However those aspects are about it being atomic not the wider memory barriers.

Atomic wait memory_order

If I need additional logic based on the atomic variable I need to call other functions

Do you?

Consider if T is a bool. It only has one of two states: true or false. If you wait until it is not true, then it must be false and no further atomic operations are needed.

And that's just the case where T is a type that can only (generally) assume two values. There are also cases where T is a type that technically can assume more values, but you're only using two of them. Or the cases where T could have multiple values, but your waiting code doesn't care which one it is, merely that it isn't the one you asked to wait on. "Is not zero" is a common example, as you may not care which kind of "not zero" it is.

C++, Std::Atomic, What Is Std::Memory_Order and How to Use Them