What Formally Guarantees That Non-Atomic Variables Can't See Out-Of-Thin-Air Values and Create a Data Race Like Atomic Relaxed Theoretically Can


The text of your question seems to be missing the point of the example and out-of-thin-air values. Your example does not contain data-race UB. (It might if x or y were set to 42 before those threads ran, in which case all bets are off and the other answers citing data-race UB apply.)

There is no protection against real data races, only against out-of-thin-air values.

I think you're really asking how to reconcile that mo_relaxed example with sane and well-defined behaviour for non-atomic variables. That's what this answer covers.



The note is pointing out a hole in the atomic mo_relaxed formalism, not warning you of a real possible effect on some implementations.

This gap does not (I think) apply to non-atomic objects, only to mo_relaxed.

They say "However, implementations should not allow such behavior. – end note]". Apparently the standards committee couldn't find a way to formalize that requirement, so for now it's just a note, but it's not intended to be optional.

It's clear that even though this isn't strictly normative, the C++ standard intends to disallow out-of-thin-air values for relaxed atomics (and in general, I assume). Later standards discussion, e.g. 2018's p0668r5: Revising the C++ memory model (which doesn't "fix" this, it's an unrelated change) includes juicy side-notes like:

We still do not have an acceptable way to make our informal (since C++14) prohibition of out-of-thin-air results precise. The primary practical effect of that is that formal verification of C++ programs using relaxed atomics remains unfeasible. The above paper suggests a solution similar to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html . We continue to ignore the problem here ...

So yes, the normative parts of the standard are apparently weaker for relaxed_atomic than they are for non-atomic. This seems to be an unfortunate side effect of how they define the rules.

AFAIK no implementations can produce out-of-thin-air values in real life.


Later versions of the standard phrase the informal recommendation more clearly, e.g. in the current draft: https://timsong-cpp.github.io/cppwp/atomics.order#8


  8. Implementations should ensure that no “out-of-thin-air” values are computed that circularly depend on their own computation.

    ...

  10. [ Note: The recommendation [of 8.] similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:

    // Thread 1:
    r1 = x.load(memory_order::relaxed);
    if (r1 == 42) y.store(42, memory_order::relaxed);
    // Thread 2:
    r2 = y.load(memory_order::relaxed);
    if (r2 == 42) x.store(42, memory_order::relaxed);

    — end note ]


(The rest of this answer was written before I was sure that the standard intended to disallow this for mo_relaxed, too.)

I'm pretty sure the C++ abstract machine does not allow r1 == r2 == 42.

Every possible ordering of operations in the C++ abstract machine leads to r1=r2=0 without UB, even without synchronization. Therefore the program has no UB, and any non-zero result would violate the "as-if" rule.

Formally, ISO C++ allows an implementation to implement functions / programs in any way that gives the same result as the C++ abstract machine would. For multi-threaded code, an implementation can pick one possible abstract-machine ordering and decide that's the ordering that always happens. (e.g. when reordering relaxed atomic stores when compiling to asm for a strongly-ordered ISA. The standard as written even allows coalescing atomic stores but compilers choose not to). But the result of the program always has to be something the abstract machine could have produced. (Only the Atomics chapter introduces the possibility of one thread observing the actions of another thread without mutexes. Otherwise that's not possible without data-race UB).
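For instance, here's a minimal sketch of the store-coalescing point (the function name f and the relaxed ordering are just for illustration):

#include <atomic>

std::atomic<int> x{0};

void f() {
    // As-if: no thread is guaranteed to observe the intermediate value 1,
    // so the standard as written would permit collapsing both stores into
    // just x.store(2). Mainstream compilers currently choose not to
    // coalesce atomic stores like this.
    x.store(1, std::memory_order_relaxed);
    x.store(2, std::memory_order_relaxed);
}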

I think the other answers didn't look carefully enough at this. (And neither did I when it was first posted). Code that doesn't execute doesn't cause UB (including data-race UB), and compilers aren't allowed to invent writes to objects. (Except in code paths that already unconditionally write them, like y = (x==42) ? 42 : y; which would obviously create data-race UB.)
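To make that concrete, here's a sketch of the kind of invented-write transformation that's forbidden (cond_store_bad is a hypothetical compiler output, not something any real compiler emits):

// Source as written: y is stored only when x == 42.
void cond_store(int& y, int x) {
    if (x == 42) y = 42;
}

// Hypothetical (illegal) transformation: write unconditionally, then undo.
void cond_store_bad(int& y, int x) {
    int tmp = y;           // also invents a read
    y = 42;                // invented write, temporarily visible to other threads
    if (x != 42) y = tmp;  // "restoring" is too late: another thread may have
}                          // already seen 42, or had its own write stepped on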

For any non-atomic object, if we don't actually write it, then other threads might be reading it too, regardless of code inside not-executed if blocks. The standard allows this, and doesn't allow a variable to suddenly read as a different value when the abstract machine hasn't written it. (And for objects we don't even read, like neighbouring array elements, another thread might even be writing them.)

Therefore we can't do anything that would let another thread temporarily see a different value for the object, or step on its write. Inventing writes to non-atomic objects is basically always a compiler bug; this is well known and universally agreed upon because it can break code that doesn't contain UB (and has done so in practice in a few cases of compiler bugs that created it; e.g. IA-64 GCC I think had such a bug at one point that broke the Linux kernel). IIRC, Herb Sutter mentioned such bugs in part 1 or 2 of his talk "atomic<> Weapons: The C++ Memory Model and Modern Hardware", saying that it was already usually considered a compiler bug before C++11, but C++11 codified that and made it easier to be sure.

Or another recent example with ICC for x86:
Crash with icc: can the compiler invent writes where none existed in the abstract machine?


In the C++ abstract machine, there's no way for execution to reach either y = r1; or x = r2;, regardless of sequencing or simultaneity of the loads for the branch conditions. x and y both read as 0 and neither thread ever writes them.

No synchronization is required to avoid UB because no order of abstract-machine operations leads to a data-race. The ISO C++ standard doesn't have anything to say about speculative execution or what happens when mis-speculation reaches code. That's because speculation is a feature of real implementations, not of the abstract machine. It's up to implementations (HW vendors and compiler writers) to ensure the "as-if" rule is respected.


It's legal in C++ to write code like if (global_id == mine) shared_var = 123; and have all threads execute it, as long as at most one thread actually runs the shared_var = 123; statement. (And as long as synchronization exists to avoid a data race on non-atomic int global_id). If things like this broke down, it would be chaos. For example, you could apparently draw wrong conclusions like reordering atomic operations in C++
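A minimal sketch of that pattern (names like global_id and shared_var are from the example above; the thread count is arbitrary):

#include <thread>
#include <vector>

int global_id = 3;     // non-atomic; written only before the threads start
int shared_var = 0;    // non-atomic; written by at most one thread

void worker(int mine) {
    if (global_id == mine)   // every thread reads global_id: fine, no writer
        shared_var = 123;    // only the thread whose id matches ever writes
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < 8; ++i)
        pool.emplace_back(worker, i);
    for (auto& t : pool)
        t.join();
    // No data race: thread creation happens-after the write to global_id,
    // and shared_var has a single writer with no concurrent readers.
}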

Observing that a non-write didn't happen isn't data-race UB.

It's also not UB to run if(i<SIZE) return arr[i]; because the array access only happens if i is in bounds.

I think the "out of the blue" value-invention note only applies to relaxed-atomics, apparently as a special caveat for them in the Atomics chapter. (And even then, AFAIK it can't actually happen on any real C++ implementations, certainly not mainstream ones. At this point implementations don't have to take any special measures to make sure it can't happen for non-atomic variables.)

I'm not aware of any similar language outside the atomics chapter of the standard that allows an implementation to allow values to appear out of the blue like this.

I don't see any sane way to argue that the C++ abstract machine causes UB at any point when executing this, yet seeing r1 == r2 == 42 would imply that an unsynchronized read+write had happened, which is data-race UB. If that can happen, can an implementation invent UB because of speculative execution (or some other reason)? The answer has to be "no" for the C++ standard to be usable at all.

For relaxed atomics, inventing the 42 out of nowhere wouldn't imply that UB had happened; perhaps that's why the standard says it's allowed by the rules? As far as I know, nothing outside the Atomics chapter of the standard allows it.



A hypothetical asm / hardware mechanism that could cause this

(Nobody wants this, hopefully everyone agrees that it would be a bad idea to build hardware like this. It seems unlikely that coupling speculation across logical cores would ever be worth the downside of having to roll back all cores when one detects a mispredict or other mis-speculation.)

For 42 to be possible, thread 1 has to see thread 2's speculative store, and the store from thread 1 has to be seen by thread 2's load. (Confirming the branch speculation as good, allowing this path of execution to become the path that was actually taken.)

i.e. speculation across threads: Possible on current HW if they ran on the same core with only a lightweight context switch, e.g. coroutines or green threads.

But on current HW, memory reordering between threads is impossible in that case. Out-of-order execution of code on the same core gives the illusion of everything happening in program order. To get memory reordering between threads, they need to be running on different cores.

So we'd need a design that coupled together speculation between two logical cores. Nobody does that because it means more state needs to be rolled back if a mispredict is detected. But it is hypothetically possible: for example, an OoO SMT core that allows store-forwarding between its logical cores even before the stores have retired from the out-of-order back-end (i.e. become non-speculative).

PowerPC allows store-forwarding between logical cores for retired stores, meaning that threads can disagree about the global order of stores. But waiting until stores "graduate" (i.e. retire) and become non-speculative means it doesn't tie together speculation on separate logical cores. So when one is recovering from a branch miss, the others can keep the back-end busy. If they all had to roll back on a mispredict on any logical core, that would defeat a significant part of the benefit of SMT.

I thought for a while I'd found an ordering that led to this on a single core of a real weakly-ordered CPU (with user-space context switching between the threads), but the final-step store can't forward to the first-step load because the store comes after that load in program order, and OoO exec preserves the illusion of program order within a thread.

  • T2: r2 = y; stalls (e.g. cache miss)

  • T2: branch prediction predicts that r2 == 42 will be true, so x = 42 should run.

  • T2: x = 42 runs. (Still speculative; r2 = y hasn't obtained a value yet, so the r2 == 42 compare/branch is still waiting to confirm that speculation.)

  • a context switch to Thread 1 happens without rolling back the CPU to retirement state or otherwise waiting for speculation to be confirmed as good or detected as mis-speculation.

    This part won't happen on real C++ implementations unless they use an M:N thread model, not the more common 1:1 C++ thread to OS thread model. Real CPUs don't rename the privilege level: they don't take interrupts or otherwise enter the kernel with speculative instructions in flight that might need to roll back and redo entering kernel mode from a different architectural state.

  • T1: r1 = x; takes its value from the speculative x = 42 store

  • T1: r1 == 42 is found to be true. (Branch speculation happens here, too, not actually waiting for store-forwarding to complete. But along this path of execution, where the x = 42 did happen, this branch condition will execute and confirm the prediction).

  • T1: y = 42 runs.

  • this was all on the same CPU core so this y=42 store is after the r2=y load in program-order; it can't give that load a 42 to let the r2==42 speculation be confirmed. So this possible ordering doesn't demonstrate this in action after all. This is why threads have to be running on separate cores with inter-thread speculation for effects like this to be possible.

Note that x = 42 doesn't have a data dependency on r2 so value-prediction isn't required to make this happen. And the y=r1 is inside an if(r1 == 42) anyway so the compiler can optimize to y=42 if it wants, breaking the data dependency in the other thread and making things symmetric.

Note that the argument about green threads or other context switching on a single core isn't actually relevant: we need separate cores for the memory reordering.


I commented earlier that I thought this might involve value-prediction. The ISO C++ standard's memory model is certainly weak enough to allow the kinds of crazy "reordering" that value-prediction can create, but it's not necessary for this reordering. y=r1 can be optimized to y=42, and the original code includes x=42 anyway, so there's no data dependency of that store on the r2=y load. Speculative stores of 42 are easily possible without value prediction. (The problem is getting the other thread to see them!)

Speculating because of branch prediction instead of value prediction has the same effect here. And in both cases the loads need to eventually see 42 to confirm the speculation as correct.

Value-prediction doesn't even help make this reordering more plausible. We still need inter-thread speculation and memory reordering for the two speculative stores to confirm each other and bootstrap themselves into existence.


ISO C++ chooses to allow this for relaxed atomics, but AFAICT it disallows it for non-atomic variables. I'm not sure I see exactly what in the standard allows the relaxed-atomic case in ISO C++, beyond the note saying it's not explicitly disallowed. If there was any other code that did anything with x or y then maybe, but I think my argument applies to the relaxed-atomic case as well. No path through the source in the C++ abstract machine can produce it.

As I said, it's not possible in practice AFAIK on any real hardware (in asm), or in C++ on any real C++ implementation. It's more of an interesting thought-experiment into crazy consequences of very weak ordering rules, like C++'s relaxed-atomic. (Those ordering rules don't disallow it, but I think the as-if rule and the rest of the standard does, unless there's some provision that allows relaxed atomics to read a value that was never actually written by any thread.)

If there is such a rule, it would only be for relaxed atomics, not for non-atomic variables. Data-race UB is pretty much all the standard needs to say about non-atomic vars and memory ordering, and this example doesn't have any.

Data race guarded by if (false)... what does the standard say?

The key term is "expression evaluation". Take the very simple example:

int a = 0;
for (int i = 0; i != 10; ++i)
    ++a;

There's one expression ++a, but 10 evaluations. These are all ordered: the 5th evaluation happens-before the 6th evaluation. And the evaluations of ++a are interleaved with the evaluations of i!=10.

So, in

int a = 0;
for (int i = 0; i != 0; ++i)
    ++a;

there are 0 evaluations. And by a trivial rewrite, that gets us

int a = 0;
if (false)
    ++a;

Now, if there are 10 evaluations of ++a, we need to worry for all 10 evaluations if they race with another thread (in more complex cases, the answer might vary - say if you start a thread when a==5). But if there are no evaluations at all of ++a, then there's clearly no racing evaluation.
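As a sketch of why zero evaluations mean zero conflicts, even with another thread writing the variable (a minimal, contrived example):

#include <thread>

int a = 0;

int main() {
    std::thread t([] { a = 1; });  // this thread writes a
    if (false)
        ++a;                       // zero evaluations: nothing here can
                                   // conflict with t's write
    t.join();
    // Well-defined: the only evaluation touching a is t's write.
}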

Preventing Out of Thin Air values with a memory barrier in C++

Related: my answer on What formally guarantees that non-atomic variables can't see out-of-thin-air values and create a data race like atomic relaxed theoretically can? explains in more detail that the formal rules of the C++ relaxed-atomic memory model don't exclude "out of thin air" values; they're only discouraged in a non-normative note. This is a problem only for formal verification of programs using mo_relaxed, not for real implementations. Even non-atomic variables are safe from this, if you avoid undefined behaviour (which the code in this question didn't).


You have data-race Undefined Behaviour on x and y because they're non-atomic variables, so the C++11 standard has absolutely nothing to say about what's allowed to happen.

It would be relevant to look at this for older language standards without a formal memory model where people did threading anyway using volatile or plain int and compiler + asm barriers, where behaviour could depend on compilers working the way you expect in a case like this. But fortunately the bad old days of "happens to work on current implementations" threading are behind us.


Barriers are not helpful here with nothing to create synchronization; as @davmac explains, nothing requires the barriers to "line up" in the global order of operations. Think of a barrier as an operation that makes the current thread wait for some or all of its previous operations to become globally visible; barriers don't directly interact with other threads.


Out-of-thin-air values are one thing that can happen as a result of that undefined behaviour; the compiler is allowed to do software value-prediction on non-atomic variables, and to invent writes to objects that will definitely be written anyway. If there was a release-store, or a relaxed store + a barrier, the compiler might not be allowed to invent writes before it, because that could let another thread legitimately observe a value the abstract machine never stored.

In general from a C++11 language-lawyer perspective, there's nothing you can do to make your program safe (other than a mutex or hand-rolled locking with atomics to prevent one thread from reading x while the other is writing it.)
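For instance, a hand-rolled spinlock along those lines would look something like this (a sketch; a real program should normally just use std::mutex):

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;
int x = 0;   // the formerly-racy non-atomic variable

void locked_write(int v) {
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // spin until we acquire the lock
    }
    x = v;                                        // exclusive access: no data race
    lock_flag.clear(std::memory_order_release);   // release the lock
}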



Relaxed atomics are sufficient to prevent the compiler from inventing writes without any other cost.

Except maybe defeating auto-vectorization and stuff, if you were counting on other uses of this variable being aggressively optimized.

atomic_int x = 0, y = 0;

// Thread 1                  |  // Thread 2
r1 = x.load(mo_relaxed);     |  r2 = y.load(mo_relaxed);
y.store(r1, mo_relaxed);     |  x.store(r2, mo_relaxed);

Value-prediction could speculatively get a future value for r2 into the pipeline before thread 2 sees that value from y, but it can't actually become visible to other threads until the software or hardware knows for sure that the prediction was correct. (That would be inventing a write).

e.g. thread 2 is allowed to compile as

r2 = y.load(mo_relaxed);
if (r2 == 42) {              // control dependency, not a data dependency
    x.store(42, mo_relaxed);
} else {
    x.store(r2, mo_relaxed);
}

But as I said, x = 42; can't become visible to other threads until it's non-speculative (hardware or software speculation), so value prediction can't invent values that other threads can see. The C++11 standard only guarantees visibility for stores that the abstract machine actually performs.

I don't know / can't think of any mechanism by which a store of 42 could actually be visible to other threads before the y.load saw an actual 42. (i.e. LoadStore reordering of a load with a later dependent store). I don't think the C++ standard formally guarantees that, though. Maybe really aggressive inter-thread optimization if the compiler can prove that r2 will always be 42 in some cases, and remove even the control dependency?

An acquire-load or release-store would definitely be sufficient to block causality violations. This isn't quite mo_consume, because r2 is used as a value, not a pointer.
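For example, a version of thread 1 using acquire/release instead of relaxed (a sketch; x and y as in the question):

#include <atomic>

std::atomic<int> x{0}, y{0};

void thread1() {
    int r1 = x.load(std::memory_order_acquire);  // later ops can't reorder before this
    y.store(r1, std::memory_order_release);      // earlier ops can't reorder after this
    // The store can't become visible until the load has actually obtained
    // a value, so a speculative 42 can't bootstrap itself into existence.
}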

Are atomic objects protected against race conditions?

Yes, you are correct that non-atomic operations may still have race conditions. If you have non-atomic operations that depend on the state of the atomic object, you need to use another synchronization technique to maintain consistency in the face of interference from other threads.

Atomic operations on the atomic object will be consistent, but not race-free: each individual operation is indivisible, but a sequence of them is not. Non-atomic operations using the atomic object are not race-free.
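A sketch of what that means in practice (hypothetical names; each individual atomic op is fine, the combination is racy):

#include <atomic>

std::atomic<int> idx{0};
int slots[2] = {0, 0};   // non-atomic data

void claim_and_write(int value) {
    int i = idx.load();        // atomic: consistent, no data race here
    slots[i] = value;          // two threads can load the same i and then
                               // both write slots[i]: a data race
    idx.store((i + 1) % 2);    // atomic, but updates can still be lost
}                              // between the load and the store

A mutex around the whole sequence, or a single read-modify-write such as fetch_add to claim a slot, would be needed to make this race-free.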

Concurrent reads on non-atomic variable

Concurrent reads on any variable, whether atomic or not, do not constitute a data race, because of the definition of conflicting evaluations, found in [intro.multithread]:

Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location.

Recently, this has moved to [intro.races] with a very subtle change in wording:

Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.

The change from accesses to reads took place between draft n4296 and n4431. The splitting of the multithreading section took place between n4582 and n4604.
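A minimal illustration (contrived, but well-defined):

#include <thread>

const int table[] = {1, 2, 3};   // initialized before any thread exists

int main() {
    int r1 = 0, r2 = 0;
    std::thread t1([&] { r1 = table[0]; });  // concurrent reads only:
    std::thread t2([&] { r2 = table[0]; });  // no evaluation modifies table,
    t1.join();                               // so no two evaluations conflict
    t2.join();
}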

What does each memory_order mean?

The GCC Wiki gives a very thorough and easy to understand explanation with code examples.

(excerpt edited, and emphasis added)

IMPORTANT:

Upon re-reading the below quote, copied from the GCC Wiki while adding my own wording to the answer, I noticed that the quote is actually wrong. They got acquire and consume exactly the wrong way around: a release-consume operation only provides an ordering guarantee on dependent data, whereas a release-acquire operation provides that guarantee regardless of whether data is dependent on the atomic value or not.

The first model is "sequentially consistent". This is the default mode used when none is specified, and it is the most restrictive. It can also be explicitly specified via memory_order_seq_cst. It provides the same restrictions and limitations on moving loads around that sequential programmers are inherently familiar with, except they apply across threads.

[...]

From a practical point of view, this amounts to all atomic operations acting as optimization barriers. It's OK to re-order things between atomic operations, but not across the operation. Thread local stuff is also unaffected since there is no visibility to other threads. [...] This mode also provides consistency across all threads.

The opposite approach is memory_order_relaxed. This model allows for much less synchronization by removing the happens-before restrictions. These types of atomic operations can also have various optimizations performed on them, such as dead store removal and commoning. [...] Without any happens-before edges, no thread can count on a specific ordering from another thread.

The relaxed mode is most commonly used when the programmer simply wants a variable to be atomic in nature rather than using it to synchronize threads for other shared memory data.

The third mode (memory_order_acquire / memory_order_release) is a hybrid between the other two. The acquire/release mode is similar to the sequentially consistent mode, except it only applies a happens-before relationship to dependent variables. This allows for a relaxing of the synchronization required between independent reads of independent writes.

memory_order_consume is a further subtle refinement in the release/acquire memory model that relaxes the requirements slightly by removing the happens-before ordering on non-dependent shared variables as well.

[...]

The real difference boils down to how much state the hardware has to flush in order to synchronize. Since a consume operation may therefore execute faster, someone who knows what they are doing can use it for performance critical applications.

Here follows my own attempt at a more mundane explanation:

A different approach is to look at the problem from the point of view of reordering reads and writes, both atomic and ordinary:

All atomic operations are guaranteed to be atomic within themselves (the combination of two atomic operations is not atomic as a whole!) and to be visible in the total order in which they appear on the timeline of the execution stream. That means no atomic operation can, under any circumstances, be reordered, but other memory operations might very well be. (Strictly speaking, this full no-reordering guarantee holds for the default sequentially consistent ordering; the weaker orderings below relax it.) Compilers (and CPUs) routinely do such reordering as an optimization.

It also means the compiler must use whatever instructions are necessary to guarantee that an atomic operation executing at any time will see the results of each and every other atomic operation that executed before it, possibly on another processor core (but not necessarily the results of other, non-atomic operations).

Now, a relaxed operation is just that, the bare minimum. It does nothing in addition and provides no other guarantees. It is the cheapest possible operation. For non-read-modify-write operations on strongly ordered processor architectures (e.g. x86/amd64), this boils down to a plain, normal, ordinary move.
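The canonical use for relaxed is a simple event counter, sketched below (names are illustrative):

#include <atomic>

std::atomic<long> hits{0};

void record_hit() {
    // We only need the increment itself to be atomic (no lost updates);
    // we don't order any other shared data around it.
    hits.fetch_add(1, std::memory_order_relaxed);
}

long snapshot() {
    return hits.load(std::memory_order_relaxed);  // some recent value
}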

The sequentially consistent operation is the exact opposite, it enforces strict ordering not only for atomic operations, but also for other memory operations that happen before or after. Neither one can cross the barrier imposed by the atomic operation. Practically, this means lost optimization opportunities, and possibly fence instructions may have to be inserted. This is the most expensive model.

A release operation prevents ordinary loads and stores from being reordered after the atomic operation, whereas an acquire operation prevents ordinary loads and stores from being reordered before the atomic operation. Everything else can still be moved around.

The combination of preventing stores being moved after, and loads being moved before the respective atomic operation makes sure that whatever the acquiring thread gets to see is consistent, with only a small amount of optimization opportunity lost.

One may think of that as something like a non-existent lock that is being released (by the writer) and acquired (by the reader). Except... there is no lock.

In practice, release/acquire usually means the compiler need not use any particularly expensive special instructions, but it cannot freely reorder loads and stores to its liking, which may miss out on some (small) optimization opportunities.
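A sketch of the usual release/acquire publication pattern (the flag and payload names are illustrative):

#include <atomic>

int payload = 0;                   // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // ordinary store...
    ready.store(true, std::memory_order_release);  // ...can't move after this
}

void consumer() {
    if (ready.load(std::memory_order_acquire)) {   // later loads can't move
        int r = payload;                           // before this; r is 42
        (void)r;
    }
}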

Finally, consume is the same operation as acquire, only with the exception that the ordering guarantees only apply to dependent data. Dependent data would e.g. be data that is pointed-to by an atomically modified pointer.

Arguably, that may provide for a couple of optimization opportunities that are not present with acquire operations (since less data is subject to restrictions); however, this happens at the expense of more complex and more error-prone code, and the non-trivial task of getting dependency chains correct.
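A sketch of dependent data with consume (widget is a hypothetical type; note that current compilers generally promote consume to acquire anyway):

#include <atomic>

struct widget { int field; };
std::atomic<widget*> pub{nullptr};

void writer() {
    widget* w = new widget{42};
    pub.store(w, std::memory_order_release);   // publish the finished widget
}

void reader() {
    widget* w = pub.load(std::memory_order_consume);
    if (w) {
        int r = w->field;  // w->field depends on the loaded pointer value,
        (void)r;           // so it's ordered after the store; unrelated
    }                      // loads get no such guarantee
}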

It is currently discouraged to use consume ordering while the specification is being revised.

Concurrent threads and data race

While it is likely that you misunderstand something, your analysis is fine. The referenced paper overplays its thesis.

The UNIX and Linux kernels (amongst many others) are themselves large multi-threaded programs which operate with only (the equivalent of) library-based thread support. These large multi-threaded programs have exhibited shockingly good performance, reliability and scalability, from tiny PDP-based computers to massive supercomputers.

A Java-based OS was produced by Sun Labs to the open ridicule of all who had the opportunity to give it a whirl. It remains in an unmarked grave.

The secondary line of reasoning, that busy-waiting is more effective than locking primitives, had been kicking around for at least a decade before that paper. Everybody loves lockless because it makes for great benchmarks, whereas unbounded, non-deterministic race conditions scare people who want nice safe systems. The thing is, sometimes racy compare-and-swap (CAS) is good, sometimes it is bad. Some clever systems use an optimistic CAS to implement mutexes, leaving the opportunity for somewhat readable code and good benchmarks.

Again, the bold statement of impossibility is hyperbolic, based on the idea that the compiler is capricious and so will make menacing assumptions and overwrite memory at will. Thankfully, the “free and good enough” technologies slew these dragons.

In C11/C++11, possible to mix atomic/non-atomic ops on the same memory?

I think you're overlooking another case, the reverse order. Consider an initialized int whose storage is reused to create an std::atomic_int. All atomic operations happen after its ctor finishes, and therefore on initialized memory. But any concurrent, non-atomic access to the now-overwritten int has to be barred as well.

(I'm assuming here that the storage lifetime is sufficient and plays no role)
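A sketch of that scenario (buf and the placement-news are illustrative; destruction and error handling omitted):

#include <atomic>
#include <new>

alignas(std::atomic<int>) unsigned char buf[sizeof(std::atomic<int>)];

void reuse() {
    int* old_obj = new (buf) int(1);   // a plain int lives here first
    int r = *old_obj;                  // last valid non-atomic access
    (void)r;
    // Creating the atomic reuses the storage and ends the int's lifetime;
    // any access to *old_obj in another thread that isn't ordered before
    // this point must be barred:
    std::atomic<int>* a = new (buf) std::atomic<int>(0);
    a->store(2, std::memory_order_relaxed);
}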

I'm not entirely sure, because I think that the second access to the int would be invalid anyway, as the type of the accessing expression (int) doesn't match the object's type at the time (std::atomic<int>). However, "the object's type at the time" assumes a single linear time progression, which doesn't hold in a multi-threaded environment. C++11 in general has that solved by making such assumptions about "the global state" Undefined Behavior per se, and the rule from the question appears to fit in that framework.

So perhaps rephrasing: if a single memory location contains an atomic object as well as a non-atomic object, and if the destruction of the earliest created (older) object is not sequenced-before the creation of the other (newer) object, then access to the older object conflicts with access to the newer object unless the former is scheduled-before the latter.


