C++11 Introduced a Standardized Memory Model. What Does It Mean? and How Is It Going to Affect C++ Programming

C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?

First, you have to learn to think like a Language Lawyer.

The C++ specification does not make reference to any particular compiler, operating system, or CPU. It makes reference to an abstract machine that is a generalization of actual systems. In the Language Lawyer world, the job of the programmer is to write code for the abstract machine; the job of the compiler is to actualize that code on a concrete machine. By coding rigidly to the spec, you can be certain that your code will compile and run without modification on any system with a compliant C++ compiler, whether today or 50 years from now.

The abstract machine in the C++98/C++03 specification is fundamentally single-threaded. So it is not possible to write multi-threaded C++ code that is "fully portable" with respect to the spec. The spec does not even say anything about the atomicity of memory loads and stores or the order in which loads and stores might happen, never mind things like mutexes.

Of course, you can write multi-threaded code in practice for particular concrete systems – like pthreads or Windows. But there is no standard way to write multi-threaded code for C++98/C++03.

The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.

Consider the following example, where a pair of global variables are accessed concurrently by two threads:

           Global
           int x, y;

Thread 1            Thread 2
x = 17;             cout << y << " ";
y = 37;             cout << x << endl;

What might Thread 2 output?

Under C++98/C++03, this is not even Undefined Behavior; the question itself is meaningless because the standard does not contemplate anything called a "thread".

Under C++11, the result is Undefined Behavior, because the unsynchronized accesses to the plain ints x and y form a data race, and loads and stores need not be atomic in general. Which may not seem like much of an improvement... And by itself, it's not.

But with C++11, you can write this:

           Global
           atomic<int> x, y;

Thread 1                 Thread 2
x.store(17);             cout << y.load() << " ";
y.store(37);             cout << x.load() << endl;

Now things get much more interesting. First of all, the behavior here is defined. Thread 2 could now print 0 0 (if it runs before Thread 1), 37 17 (if it runs after Thread 1), or 0 17 (if it runs after Thread 1 assigns to x but before it assigns to y).

What it cannot print is 37 0, because the default mode for atomic loads/stores in C++11 is to enforce sequential consistency. This just means all loads and stores must be "as if" they happened in the order you wrote them within each thread, while operations among threads can be interleaved however the system likes. So the default behavior of atomics provides both atomicity and ordering for loads and stores.
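As a rough sketch of the example above in compilable form: the helper name run_once and the use of std::thread are my own assumptions for illustration, not part of the original snippet. It runs both threads once and returns what Thread 2 saw instead of printing it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Globals as in the example above; both start at 0.
std::atomic<int> x{0}, y{0};

// Hypothetical helper: runs the two threads once and returns what
// Thread 2 observed, as (value of y, value of x).
std::pair<int, int> run_once() {
    x.store(0);
    y.store(0);
    int seen_y = -1, seen_x = -1;

    std::thread t1([] {
        x.store(17);  // default order is std::memory_order_seq_cst
        y.store(37);
    });
    std::thread t2([&] {
        seen_y = y.load();  // read y first, as in the example
        seen_x = x.load();
    });
    t1.join();
    t2.join();

    // Sequential consistency rules out seeing y == 37 but x == 0.
    assert(!(seen_y == 37 && seen_x == 0));
    return {seen_y, seen_x};
}
```

Calling run_once() in a loop should only ever yield (0, 0), (0, 17), or (37, 17); which of the three you get on a given run depends on scheduling.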

Now, on a modern CPU, ensuring sequential consistency can be expensive. In particular, the compiler is likely to emit full-blown memory barriers between every access here. But if your algorithm can tolerate out-of-order loads and stores (that is, if it requires atomicity but not ordering, and so can tolerate 37 0 as output from this program), then you can write this:

           Global
           atomic<int> x, y;

Thread 1                            Thread 2
x.store(17,memory_order_relaxed);   cout << y.load(memory_order_relaxed) << " ";
y.store(37,memory_order_relaxed);   cout << x.load(memory_order_relaxed) << endl;

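Because the compiler no longer has to emit barriers around these accesses, this is likely to be faster than the previous example, and the more aggressively the CPU reorders memory operations, the larger the gain is likely to be.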
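A typical case where atomicity is needed but ordering is not is a shared event counter. The following is a minimal sketch under that assumption; count_events and its parameters are hypothetical names I made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: several threads bump one shared counter.
// Relaxed ordering is enough because only the final total matters,
// not the order in which individual increments become visible.
int count_events(int num_threads, int per_thread) {
    assert(num_threads >= 0 && per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;

    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // Atomic read-modify-write: no increment is ever lost.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();  // join() makes every increment visible to this thread
    }
    return counter.load(std::memory_order_relaxed);
}
```

With a plain int instead of std::atomic<int> this would be a data race; with relaxed atomics the total is exact even though nothing is said about how the increments interleave.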

Finally, if you just need to keep particular loads and stores in order, you can write:

           Global
           atomic<int> x, y;

Thread 1                            Thread 2
x.store(17,memory_order_release);   cout << y.load(memory_order_acquire) << " ";
y.store(37,memory_order_release);   cout << x.load(memory_order_acquire) << endl;

This takes us back to the ordered loads and stores – so 37 0 is no longer a possible output – but it does so with minimal overhead. (In this trivial example, the result is the same as full-blown sequential consistency; in a larger program, it would not be.)
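In a larger program, release/acquire is most often used to publish ordinary (non-atomic) data through an atomic flag. Here is a small sketch of that idea; the names payload, ready, and publish_and_consume are assumptions of mine and do not come from the example above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // ordinary, non-atomic data
std::atomic<bool> ready{false};  // flag that publishes the payload

// Hypothetical helper: one thread writes the payload and sets the flag,
// the other waits for the flag and then reads the payload.
int publish_and_consume() {
    payload = 0;
    ready.store(false);
    int received = -1;

    std::thread producer([] {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // publish it
    });
    std::thread consumer([&received] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();  // spin until the flag is set
        }
        // The acquire load that saw `true` synchronizes with the release
        // store, so the write to payload happens-before this read.
        received = payload;
    });
    producer.join();
    consumer.join();

    assert(received == 42);
    return received;
}
```

If both operations on ready were memory_order_relaxed instead, the read of payload would race with the write and the program would have undefined behavior.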

Of course, if the only outputs you want to see are 0 0 or 37 17, you can just wrap a mutex around the original code. But if you have read this far, I bet you already know how that works, and this answer is already longer than I intended :-).

So, bottom line. Mutexes are great, and C++11 standardizes them. But sometimes for performance reasons you want lower-level primitives (e.g., the classic double-checked locking pattern). The new standard provides high-level gadgets like mutexes and condition variables, and it also provides low-level gadgets like atomic types and the various flavors of memory barrier. So now you can write sophisticated, high-performance concurrent routines entirely within the language specified by the standard, and you can be certain your code will compile and run unchanged on both today's systems and tomorrow's.
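For the double-checked locking pattern mentioned above, one way it can be written with C++11 atomics looks roughly like this. Treat it as a sketch: Widget, instance, init_mutex, and get_instance are hypothetical names, and the object is deliberately never deleted to keep the example short.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct Widget {  // hypothetical type, stands in for something costly to build
    int value = 42;
};

std::atomic<Widget*> instance{nullptr};
std::mutex init_mutex;

Widget* get_instance() {
    // First check, without the lock. Acquire pairs with the release store
    // below, so a non-null pointer implies a fully constructed Widget.
    Widget* p = instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(init_mutex);
        // Second check, under the lock. The mutex already orders this load
        // against the store made by whichever thread initialized first.
        p = instance.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new Widget;  // intentionally leaked in this sketch
            instance.store(p, std::memory_order_release);  // publish
        }
    }
    assert(p != nullptr);
    return p;
}
```

For plain lazy initialization, a function-local static or std::call_once is usually simpler, since C++11 guarantees both are thread-safe.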

Although to be frank, unless you are an expert and working on some serious low-level code, you should probably stick to mutexes and condition variables. That's what I intend to do.

For more on this stuff, see this blog post.

C11/C++11 memory model acquire, release, relaxed specifics

Thread 3's acquire syncs with Thread 2's release, which comes after Thread 2's acquire which syncs with Thread 1's release. Therefore, Thread 3 is guaranteed to see the value that Thread 1 set to x, correct?

Yes, this is correct. The acquire/release operations establish synchronizes-with relations, i.e., store_release(a) synchronizes-with load_acquire(a) and store_release(b) synchronizes-with load_acquire(b). And load_acquire(a) is sequenced-before store_release(b). Synchronizes-with and sequenced-before are both part of the happens-before definition, and the happens-before relation is transitive. Therefore, store_relaxed(x, 1) happens-before load_relaxed(x).
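The chain described above can be sketched as follows. This is my reconstruction from the operations named in this answer (x, a, b), not the asker's exact code, and run_chain is a hypothetical helper name.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, a{0}, b{0};

// Hypothetical helper: returns the value of x that Thread 3 observes.
int run_chain() {
    x.store(0);
    a.store(0);
    b.store(0);
    int seen = -1;

    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        a.store(1, std::memory_order_release);  // release #1
    });
    std::thread t2([] {
        while (a.load(std::memory_order_acquire) != 1) {  // acquire #1
            std::this_thread::yield();
        }
        b.store(1, std::memory_order_release);  // release #2
    });
    std::thread t3([&seen] {
        while (b.load(std::memory_order_acquire) != 1) {  // acquire #2
            std::this_thread::yield();
        }
        // The store to x happens-before this load via the transitive chain.
        seen = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    t3.join();

    assert(seen == 1);
    return seen;
}
```

Note that Thread 3 never touches a directly; the guarantee comes purely from chaining the two synchronizes-with edges through Thread 2.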

Am I right in believing that according to the standard, there must always be a pair of non-relaxed atomic operations, one in each thread, in order for any kind of memory ordering at all to be guaranteed?

This question is a bit too broad, but overall I would tend to say "yes". In general you have to ensure that there is a proper happens-before relation when operating on some (non-atomic) shared data. If one thread writes to some shared data and some other thread should read that data, you have to ensure that the write happens-before the read. There are different ways to achieve this; atomics with the correct memory orderings are just one way (although one could argue that almost all other methods, like std::mutex, also boil down to atomic operations).

Fences also have to be combined with other fences or atomic operations. Your example would work if super_duper_memory_fence() were a std::atomic_thread_fence(std::memory_order_release) and you put another std::atomic_thread_fence(std::memory_order_acquire) before your call to use_data.
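To make the fence pairing concrete, here is a minimal sketch. The names shared_data, flag, and use_data_with_fences are hypothetical, since the question's code is not reproduced here; only std::atomic_thread_fence and the memory orders come from the text above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int shared_data = 0;            // ordinary, non-atomic data
std::atomic<bool> flag{false};  // only ever accessed with relaxed order

// Hypothetical helper: returns the value the reader thread obtains.
int use_data_with_fences() {
    shared_data = 0;
    flag.store(false);
    int result = -1;

    std::thread writer([] {
        shared_data = 123;
        // Release fence: orders the write above before the relaxed store.
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(true, std::memory_order_relaxed);
    });
    std::thread reader([&result] {
        while (!flag.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Acquire fence: pairs with the release fence in the writer, so the
        // write to shared_data happens-before the read below.
        std::atomic_thread_fence(std::memory_order_acquire);
        result = shared_data;
    });
    writer.join();
    reader.join();

    assert(result == 123);
    return result;
}
```

Dropping either fence removes the synchronization: a single fence on one side, combined with only relaxed operations on the other, does not establish a happens-before relation.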

For more details I can recommend this paper which I have co-authored: Memory Models for C/C++ Programmers

Does the C11 memory model really conflict with common optimizations?

Apparently, no one is both interested enough and confident enough to write an answer, so I guess I'll go ahead.

isn't that argument fatally flawed?

To the extent that the proof quoted from the paper is intended to demonstrate that a conforming C implementation is not permitted to perform the source-to-source transformation described in the question, or an equivalent, yes, the proof is flawed. The refutation presented in the question is sound.

There was some discussion in comments about how the refutation could be viewed as boiling down to anything being permissible in the event of undefined behavior. That is a valid perspective, and in no way does it undercut the argument. However, I think it's unnecessarily minimalistic.

Again, the key problem with the paper's proof is here:

the load of a can only return 0 (the initial value of a) because the
store a=1 does not happen before it (because it is in a different
thread that has not been synchronised with) and non-atomic loads must
return the latest write that happens before them.

The proof's error is that the language specification's requirement that a read of a must return the result of a write to a that "happened before" it is conditioned on the program being free of data races. This is an essential foundation for the whole model, not some kind of escape hatch. The program manifestly is not free of data races if in fact the read of a is performed, so the requirement is moot in that case. The read of a by thread 2 absolutely can observe the write by thread 1, and there is good reason to suppose that it might sometimes do so in practice.

To look at it another way, the proof chooses to focus on the write not happening before the read, but ignores the fact that the read also does not happen before the write.

Taking the relaxed atomic accesses into account does not change anything. It is plausible that in a real execution of the paper's three-threaded program, the implementation (for example) speculatively executes the relaxed load of x in thread 2 on the assumption that it will return 1, then reads from a the value written by thread 1, and as a result, executes the store to y. Because the atomic accesses are performed with relaxed semantics, the execution of thread 3 can read the value of y as 1 (or speculate that it will do so) and consequently perform the write to x. All speculations involved can then be confirmed correct, with the final result that a = x = y = 1. It is intentional that this seemingly paradoxical result is allowed by the "relaxed" memory order.

isn't it indeed valid for a C11 implementation to treat the original
three-threaded program as if it were the two-threaded program
consisting of threads 2' and 3?

At minimum, the paper's argument does not show otherwise, even if we -- with no basis in the specification -- construe the scope of the UB arising from the data race to be limited to whether the value read from a is its initial one or the one written by thread 1.

Implementations are given broad license to behave as they choose, so long as they produce observable behavior that is consistent with the behavior required of the abstract machine. The creation and execution of multiple threads of execution is not itself part of the observable behavior of a program, as that is defined by the specification. Therefore, yes, a program that performed the proposed transformation and then behaved accordingly, or one that otherwise behaved as if there were a happens-before edge between the write to a and the read from a, would not be acting inconsistently with the specification.

How does C++20's memory model differ from that of C++11?

As @PeterM suggests, it's a (subjectively) minor change due to issues discovered after the fact with the formalization of the C++11 memory model.

The old model was defined so that different regimes of memory access could be implemented on common architectures using more or less costly sets of hardware instructions. Specifically, memory_order_acquire and memory_order_release were supposed to be implementable on ARM and Power CPU architectures using some kind of lightweight fence instructions. Unfortunately, it turns out that they can't (!), at least not while keeping every guarantee of the C++11 model: as I understand the papers below, the problem shows up when acquire/release accesses are mixed with memory_order_seq_cst accesses. This is also true for NVIDIA GPUs, although those weren't really targeted a decade back.

With this being the case, there were two options:

1. Implement to fit the standard: possible, but then performance would be pretty bad and that wasn't the idea.
2. Fix the standard to better accommodate these architectures (while not messing up the model completely).

Option 2 was apparently chosen.

For more details, read:

Lahav, Vafeiadis, Kang, Hur, Dreyer, Repairing Sequential Consistency in C/C++11.
Hans Boehm's C++-standard-committee paper P0668R5: Revising the C++ memory model.

C11/C++11 Memory Model

The memory model was developed for C++11, and adopted by C11. Lawrence Crowl did a lot of work to ensure that the interface for atomic operations was as close as possible. There were quite a few people involved, but you are right that Hans Boehm was one of them.
GCC currently (4.7) implements a reasonable approximation of the memory model. Certainly close enough that most programs won't be able to tell the difference. I'm fairly sure that full conformance is on their plan, but I don't know the timetable, as I'm not involved.

Strange results about C++11 memory model (Relaxed ordering)

This can depend on the type of processor you are running on.

x86 does not have a memory model as relaxed as other processors. In particular, stores will never be reordered with regard to other stores.

http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/ has more info on x86's memory model.
