What Do Each Memory_Order Mean

What do each memory_order mean?

The GCC Wiki gives a very thorough and easy to understand explanation with code examples.

(excerpt edited, and emphasis added)

IMPORTANT:

Upon re-reading the below quote copied from the GCC Wiki in the process of adding my own wording to the answer, I noticed that the quote is actually wrong. They got acquire and consume exactly the wrong way around. A release-consume operation only provides an ordering guarantee on dependent data whereas a release-acquire operation provides that guarantee regardless of data being dependent on the atomic value or not.

The first model is "sequentially consistent". This is the default mode used when none is specified, and it is the most restrictive. It can also be explicitly specified via memory_order_seq_cst. It provides the same restrictions and limitation to moving loads around that sequential programmers are inherently familiar with, except it applies across threads.

[...]

From a practical point of view, this amounts to all atomic operations acting as optimization barriers. It's OK to re-order things between atomic operations, but not across the operation. Thread local stuff is also unaffected since there is no visibility to other threads. [...] This mode also provides consistency across all threads.

The opposite approach is memory_order_relaxed. This model allows for much less synchronization by removing the happens-before restrictions. These types of atomic operations can also have various optimizations performed on them, such as dead store removal and commoning. [...] Without any happens-before edges, no thread can count on a specific ordering from another thread.

The relaxed mode is most commonly used when the programmer simply wants a variable to be atomic in nature rather than using it to synchronize threads for other shared memory data.

The third mode (memory_order_acquire / memory_order_release) is a hybrid between the other two. The acquire/release mode is similar to the sequentially consistent mode, except it only applies a happens-before relationship to dependent variables. This allows for a relaxing of the synchronization required between independent reads of independent writes.

memory_order_consume is a further subtle refinement in the release/acquire memory model that relaxes the requirements slightly by removing the happens before ordering on non-dependent shared variables as well.

[...]

The real difference boils down to how much state the hardware has to flush in order to synchronize. Since a consume operation may therefore execute faster, someone who knows what they are doing can use it for performance critical applications.

Here follows my own attempt at a more mundane explanation:

A different approach to look at it is to look at the problem from the point of view of reordering reads and writes, both atomic and ordinary:

All atomic operations are guaranteed to be atomic within themselves (the combination of two atomic operations is not atomic as a whole!) and to be visible in the total order in which they appear on the timeline of the execution stream. That means no atomic operation can, under any circumstances, be reordered, but other memory operations might very well be. Compilers (and CPUs) routinely do such reordering as an optimization.

It also means the compiler must use whatever instructions are necessary to guarantee that an atomic operation executing at any time will see the results of each and every other atomic operation, possibly on another processor core (but not necessarily other operations), that were executed before.

Now, a relaxed is just that, the bare minimum. It does nothing in addition and provides no other guarantees. It is the cheapest possible operation. For non-read-modify-write operations on strongly ordered processor architectures (e.g. x86/amd64) this boils down to a plain normal, ordinary move.

The sequentially consistent operation is the exact opposite, it enforces strict ordering not only for atomic operations, but also for other memory operations that happen before or after. Neither one can cross the barrier imposed by the atomic operation. Practically, this means lost optimization opportunities, and possibly fence instructions may have to be inserted. This is the most expensive model.

A release operation prevents ordinary loads and stores from being reordered after the atomic operation, whereas an acquire operation prevents ordinary loads and stores from being reordered before the atomic operation. Everything else can still be moved around.

The combination of preventing stores being moved after, and loads being moved before the respective atomic operation makes sure that whatever the acquiring thread gets to see is consistent, with only a small amount of optimization opportunity lost.

One may think of that as something like a non-existent lock that is being released (by the writer) and acquired (by the reader). Except... there is no lock.

In practice, release/acquire usually means the compiler needs not use any particularly expensive special instructions, but it cannot freely reorder loads and stores to its liking, which may miss out some (small) optimization opportuntities.

Finally, consume is the same operation as acquire, only with the exception that the ordering guarantees only apply to dependent data. Dependent data would e.g. be data that is pointed-to by an atomically modified pointer.

Arguably, that may provide for a couple of optimization opportunities that are not present with acquire operations (since fewer data is subject to restrictions), however this happens at the expense of more complex and more error-prone code, and the non-trivial task of getting dependency chains correct.

It is currently discouraged to use consume ordering while the specification is being revised.

c++, std::atomic, what is std::memory_order and how to use them?

Can anyone explain what is std::memory_order in plain English,

The best "Plain English" explanation I've found for the various memory orderings is Bartoz Milewski's article on relaxed atomics: http://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/

And the follow-up post: http://bartoszmilewski.com/2008/12/23/the-inscrutable-c-memory-model/

But note that whilst these articles are a good introduction, they pre-date the C++11 standard and won't tell you everything you need to know to use them safely.

and how to use them with std::atomic<>?

My best advice to you here is: don't. Relaxed atomics are (probably) the trickiest and most dangerous thing in C++11. Stick to std::atomic<T> with the default memory ordering (sequential consistency) until you're really, really sure that you have a performance problem that can be solved by using the relaxed memory orderings.

In the second article linked above, Bartoz Milewski reaches the following conclusion:

I had no idea what I was getting myself into when attempting to reason
about C++ weak atomics. The theory behind them is so complex that it’s
borderline unusable. It took three people (Anthony, Hans, and me) and
a modification to the Standard to complete the proof of a relatively
simple algorithm. Imagine doing the same for a lock-free queue based
on weak atomics!

How do memory_order_seq_cst and memory_order_acq_rel differ?

http://en.cppreference.com/w/cpp/atomic/memory_order has a good example at the bottom that only works with memory_order_seq_cst. Essentially memory_order_acq_rel provides read and write orderings relative to the atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, the sequentially consistent operations are visible in the same order across all threads.

The example boils down to this:

bool x= false;
bool y= false;
int z= 0;

a() { x= true; }
b() { y= true; }
c() { while (!x); if (y) z++; }
d() { while (!y); if (x) z++; }

// kick off a, b, c, d, join all threads
assert(z!=0);

Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented.

Do I understand the semantics of std::memory_order correctly?

cppreference is only a summary of the C++ standard, and sometimes its text is less precise. The actual standard draft makes it clear: The final C++20 working draft N4681 states in atomics.order, par. 4 (p. 1525):

There is a single total order S on all memory_order::seq_cst operations, including fences, that satisfies the following constraints [...]

This clearly says all seq_cst operations, not just all operations on a particular object.

And notes 6 and 7 further down emphasize that the order does not apply to weaker memory orders:

6 [Note: We do not require that S be consistent with “happens before” (6.9.2.1). This allows more efficient
implementation of memory_order::acquire and memory_order::release on some machine architectures.
It can produce surprising results when these are mixed with memory_order::seq_cst accesses. — end note]
7 [Note: memory_order::seq_cst ensures sequential consistency only for a program that is free of data races
and uses exclusively memory_order::seq_cst atomic operations. Any use of weaker ordering will invalidate
this guarantee unless extreme care is used. In many cases, memory_order::seq_cst atomic operation

Which types of memory_order should be used for non-blocking behaviour with an atomic_flag?

The memory ordering you're using here is correct.

The acquire memory order when you test and set your flag (to take your hand-written lock) has the effect, informally speaking, of preventing any memory accesses of the following code from becoming visible before the flag is tested. That's what you want, because you want to ensure that those accesses are effectively not done if the flag was already set. Likewise, the release order on the clear at the end prevents any of the preceding accesses from becoming visible after the clear, which is also what you need so that they only happen while the lock is held.

However, it's probably simpler to just use a std::mutex. If you don't want to wait to take the lock, but instead do something else if you can't, that's what try_lock is for.

class SharedData {
    // ...
private:
    std::mutex my_lock;
}
// ...
if (my_lock.try_lock()) {
    // lock was taken, proceed with critical section
    my_lock.unlock();
} else {
    // lock not taken, do non-critical work
}

This may have a bit more overhead, but avoids the need to think about atomicity and memory ordering. It also gives you the option to easily do a blocking wait if that later becomes useful. If you've designed your program around an atomic_flag and later find a situation where you must wait to take the lock, you may find yourself stuck with either spinning while continually retrying the lock (which is wasteful of CPU cycles), or something like std::this_thread::yield(), which may wait for longer than necessary after the lock is available.

It's true this pattern is somewhat unusual. If there is always non-critical work to be done that doesn't need the lock, commonly you'd design your program to have a separate thread that just does the non-critical work continuously, and then the "critical" thread can just block as it waits for the lock.

C++11 memory_order_acquire and memory_order_release semantics?

The spinlock mutex implementation looks okay to me. I think they got the definitions of acquire and release completely wrong.

Here is the clearest explanation of acquire/release consistency models that I am aware of: Gharachorloo; Lenoski; Laudon; Gibbons; Gupta; Hennessy: Memory consistency and event ordering in scalable shared-memory multiprocessors, Int'l Symp Comp Arch, ISCA(17):15-26, 1990, doi 10.1145/325096.325102. (The doi is behind the ACM paywall. The actual link is to a copy not behind a paywall.)

Look at Condition 3.1 in Section 3.3 and the accompanying Figure 3:

before an ordinary load or store access is allowed
to perform with respect to any other processor,
all previous acquire accesses must be performed, and
before a release access is allowed to perform with
respect to any other processor, all previous ordinary
load and store accesses must be performed, and
special accesses are [sequentially] consistent with respect
to one another.

The point is this: acquires and releases are sequentially consistent¹ (all threads globally agree on the order in which acquires and releases happened.) All threads globally agree that the stuff that happens between an acquire and a release on a specific thread happened between the acquire and release. But normal loads and stores after a release are allowed to be moved (either by hardware or the compiler) above the release, and normal loads and stores before an acquire are allowed to be moved (either by hardware or the compiler) to after the acquire.

(Footnote 1: This is true for most implementations, but an overstatement for ISO C++ in general. Reader threads are allowed to disagree about the order of 2 stores done by 2 other threads. See Acquire/release semantics with 4 threads, and this answer for details of how C++ compiled for POWER CPUs demonstrates the difference in practice with release and acquire, but not seq_cst. But most CPUs do only get data between cores via coherent cache that means a global order does exist.)

In the C++ standard (I used the link to the Jan 2012 draft) the relevant section is 1.10 (pages 11 through 14).

The definition of happens-before is intended to be modeled after Lamport; Time, Clocks, and the Ordering of Events in a Distributed System, CACM, 21(7):558-565, Jul 1978. C++ acquires correspond to Lamport's receives, C++ releases correspond to Lamport's sends. Lamport placed a total order on the sequence of events within a single thread, where C++ has to allow a partial order (see Section 1.9, Paragraphs 13-15, page 10 for the C++ definition of sequenced-before.) Still, the sequenced-before ordering is pretty much what you would expect. Statements are sequenced in the order they are given in the program. Section 1.9, paragraph 14: "Every value computation and side eﬀect associated with a full-expression is sequenced before every value
computation and side eﬀect associated with the next full-expression to be evaluated."

The whole point of Section 1.10 is to say that a program that is data-race-free produces the same well defined value as if the program were run on a machine with a sequentially consistent memory and no compiler reordering. If there is a data race then the program has no defined semantics at all. If there is no data race then the compiler (or machine) is permitted to reorder operations that don't contribute to the illusion of sequential consistency.

Section 1.10, Paragraph 21 (page 14) says: A program is not data-race-free if there is a pair of accesses A and B from different threads to object X, at least one of those accesses has a side effect, and neither A happens-before B, nor B happens-before A. Otherwise the program is data-race-free.

Paragraphs 6-20 give a very careful definition of the happens-before relation. The key definition is Paragraph 12:

"An evaluation A happens before an evaluation B if:

A is sequenced before B, or
A inter-thread happens before B."

So if an acquire is sequenced before (in the same thread) pretty much any other statement, then the acquire must appear to happen before that statement. (Including if that statement performs a write.)

Likewise: if pretty much any statement is sequenced before (in the same thread) a release, then that statement must appear to happen before the release. (Including if that statement just does a value computation (read).)

The reason that the compiler is allowed to move other computations from after a release to before a release (or from before an acquire to after an acquire) is because of the fact that those operations specifically do not have an inter-thread happens before relationship (because they are outside the critical section). If they race the semantics are undefined, and if they don't race (because they aren't shared) then you can't tell exactly when they happened with regard to the synchronization.

Which is a very long way of saying: cppreference.com's definitions of acquire and release are dead wrong. Your example program has no data race condition, and PANIC can not occur.

How to understand RELAXED ORDERING in std::memory_order (C++)

You and a friend both agree that x=false and y=false. One day, you send him a letter telling him that x=true. The next day you send him a letter telling him that y=true. You definitely send him the letters in the right order.

Sometime later, your friend receives a letter from you saying that y=true. Now what does your friend know about x? He probably already received the letter that told him x=true. But maybe the postal system temporarily lost it and he'll receive it tomorrow. So for him, x=false and x=true are both valid possibilities when he receives the y=true letter.

So, back to the silicon world. Memory between threads has no guarantee at all that writes from other threads turn up in any particular order, and so the 'delayed x' is totally a possibility. All adding an atomic and using relaxed does is stop two threads racing on a single variable from becoming undefined behaviour. It makes no guarantees at all to ordering. Thats what the stronger orderings are for.

Or, in a slightly more crude way, behold my MSPaint skills:

Sample Image

In this case, the purple arrow which is the flow of 'x' from the first thread to the second thread comes too late, whereas the green arrow (y crossing over) happens fast.

Which fences exactly provided by std::memory_order in C++?

Anyone exactly knows about that?

Sure, there are a variety of sources, for example C++ Reference:

memory_order_relaxed — Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed.

memory_order_consume — A load operation with this memory order performs a consume operation on the affected memory location: no reads or writes in the current thread dependent on the value currently loaded can be reordered before this load. Writes to data-dependent variables in other threads that release the same atomic variable are visible in the current thread. On most platforms, this affects compiler optimizations only.

memory_order_acquire — A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread.

memory_order_release — A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic.

memory_order_acq_rel — A read-modify-write operation with this memory order is both an acquire operation and a release operation. No memory reads or writes in the current thread can be reordered before or after this store. All writes in other threads that release the same atomic variable are visible before the modification and the modification is visible in other threads that acquire the same atomic variable.

memory_order_seq_cst — A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order.

Please also have a look at atomic<> Weapons presentation by Herb Sutter, which explains a lot.

Also, I need know in which place compiler puts memory fence

This is architecture-dependent. On some architectures it is a no op, on some it is a instruction prefix, on some it will be a special instruction before/after the load/store.

There is a paper called "Memory Barriers: a Hardware View for Software Hackers", which analyses barriers on many architectures if you are interested.

For example: atomic.load(std::memory_order_acquire); means apply memory_order_acquire fence and load data atomically?

This is architecture-dependent too, but for acquire barrier I would say quite the opposite: we load a variable and then we make sure no further reads/writes go before the load, i.e. put a fence.

But on some platforms it could be a single processor instruction. For example on ARMs there is a load acquire (LDA) and store-release (STL) instructions.

Understanding memory_order_relaxed

An interesting observation is that, with your code, there is no actual concurrency; i.e. fun1 and fun2 run sequentially, the reason being that, under specific conditions (including calling std::async with the std::launch::async launch policy), the std::future object returned by std::async has its destructor block until the launched function call returns. Since you disregard the return object, its destructor is called before the end of the statement. Had you reversed the two statements in main() (i.e. launch fun2 before fun1), your program would have been caught in an infinite loop since fun1 would never run.

This std::future wait-upon-destruction behavior is somewhat controversial (even within the standards committee) and since I assume you didn't mean that, I will take the liberty to rewrite the 2 statements in main for (both examples) to:

auto tmp1 = std::async(std::launch::async, fun1);
auto tmp2 = std::async(std::launch::async, fun2);

This will defer the actual std::future return object destruction till the end of main so that fun1 and fun2 get to run asynchronously.

is it technically possible for fun2 to be in an infinite loop where it sees the value of ptr as nullptr even if the thread that sets ptr has finished running?

No, this is not possible with std::atomic (on a real platform, as was mentioned in the comments section). With a non-std::atomic variable, the compiler could (theoretically) have chosen to keep the value in register only, but a std::atomic is stored and cache coherency will propagate the value to other threads. Using std::memory_order_relaxed is fine here as long as you don't dereference the pointer.

Is it possible in the code above for fun2 to see the value of atomic i as 1 or is it assured that it will see the value 2?

It is guaranteed to see value 2 in variable x.

fun1 stores two different values to the same variable, but since there is a clear dependency, these are not reordered.

In fun1, the ptr.store with std::memory_order_release prevents the i.store(2) with std::memory_order_relaxed from moving down below its release barrier. In fun2, the ptr.load with std::memory_order_acquire prevents the i.load with std::memory_order_relaxed from moving up across its acquire barrier. This guarantees that x in fun2 will have value 2.

Note that by using std::memory_order_relaxed on all atomics, it would be possible to see x with value 0, 1 or 2, depending on the relative ordering of access to atomic variable i with regards to ptr.store and ptr.load.

What Do Each Memory_Order Mean