Will Two Atomic Writes to Different Locations in Different Threads Always Be Seen in the Same Order by Other Threads?

Will two atomic writes to different locations in different threads always be seen in the same order by other threads?

The updated¹ code in the question (with the loads of x and y swapped in Thread 4) does actually test that all threads agree on a global store order.

Under the C++11 memory model, the outcome r1==1, r2==0, r3==2, r4==0 is allowed and in fact observable on POWER.

On x86 this outcome is impossible, because there "stores are seen in a consistent order by other processors". Nor is it allowed in a sequentially consistent execution.
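For reference, here is a minimal sketch of the corrected IRIW litmus test, with the readers loading in opposite orders (a reconstruction of the shape of the question's code, not the exact original):

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void writer1() { x.store(1, std::memory_order_release); }
void writer2() { y.store(2, std::memory_order_release); }

void reader3() {                // reads x then y
    r1 = x.load(std::memory_order_acquire);
    r2 = y.load(std::memory_order_acquire);
}

void reader4() {                // reads y then x
    r3 = y.load(std::memory_order_acquire);
    r4 = x.load(std::memory_order_acquire);
}

int main() {
    std::thread t1(writer1), t2(writer2), t3(reader3), t4(reader4);
    t1.join(); t2.join(); t3.join(); t4.join();
    // r1==1, r2==0, r3==2, r4==0 means the two readers disagreed about
    // which store happened first: allowed with acquire/release, observable
    // on POWER, forbidden if every operation is seq_cst.
}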


Footnote 1: The question originally had both readers read x then y. A sequentially consistent execution of that is:

-- Initially --
std::atomic<int> x{0};
std::atomic<int> y{0};

-- Thread 4 --
int r3 = x.load(std::memory_order_acquire);

-- Thread 1 --
x.store(1, std::memory_order_release);

-- Thread 3 --
int r1 = x.load(std::memory_order_acquire);
int r2 = y.load(std::memory_order_acquire);

-- Thread 2 --
y.store(2, std::memory_order_release);

-- Thread 4 --
int r4 = y.load(std::memory_order_acquire);

This results in r1==1, r2==0, r3==0, r4==2. Hence, this is not a weird outcome at all.

To be able to say that each reader saw a different store order, we need them to read in opposite orders (as in the corrected test above), to rule out the last store simply being delayed.

Will two relaxed writes to the same location in different threads always be seen in the same order by other threads?

Yes. An outcome in which two readers disagree about the order of two writes to the same location is not allowed. §1.10 [intro.multithread]/p8 and p18 (quoting N3936/C++14; the same text is found in paragraphs 6 and 16 of N3337/C++11):

8 All modifications to a particular atomic object M occur in some
particular total order, called the modification order of M.

18 If a value computation A of an atomic object M happens before a
value computation B of M, and A takes its value from a side effect X
on M, then the value computed by B shall either be the value stored by
X or the value stored by a side effect Y on M, where Y follows X in
the modification order of M. [ Note: This requirement is known as
read-read coherence. —end note ]

In the code in question there are two side effects on x, and by p8 they occur in some particular total order. In Thread 3, the value computation for r1 happens before the one for r2, so given r1 == 1 and r2 == 2 we know that the store performed by Thread 1 precedes the store performed by Thread 2 in the modification order of x. That being the case, Thread 4 cannot observe r3 == 2, r4 == 1 without running afoul of p18, regardless of the memory_order used.
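The code under discussion is not reproduced above; a sketch of its shape (my reconstruction, with illustrative names) is:

#include <atomic>
#include <thread>

std::atomic<int> x{0};
int r1, r2, r3, r4;

void thread1() { x.store(1, std::memory_order_relaxed); }
void thread2() { x.store(2, std::memory_order_relaxed); }

void thread3() {
    r1 = x.load(std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

void thread4() {
    r3 = x.load(std::memory_order_relaxed);
    r4 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread t1(thread1), t2(thread2), t3(thread3), t4(thread4);
    t1.join(); t2.join(); t3.join(); t4.join();
    // Even with memory_order_relaxed, r1==1 && r2==2 && r3==2 && r4==1 is
    // forbidden: both readers must observe the two stores to x in the same
    // modification order (read-read coherence, p18).
}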

There is a note in p21 (p19 in N3337) that is relevant:

[ Note: The four preceding coherence requirements effectively
disallow compiler reordering of atomic operations to a single object,
even if both operations are relaxed loads. This effectively makes the
cache coherence guarantee provided by most hardware available to C++
atomic operations. —end note ]

MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?

I'll answer for what happens on real implementations on real CPUs, because an answer based only on the standard can barely say anything useful about time or "immediacy".

MESI is just an implementation detail that ISO C++ has nothing to say about. The guarantees provided by ISO C++ only involve order, not actual time. ISO C++ is intentionally non-specific, to avoid assuming that it will execute on a "normal" CPU. An implementation on a non-coherent machine that required explicit flushes for store visibility would be theoretically possible (although probably horrible for the performance of release/acquire and seq_cst operations).

C++ is non-specific enough about timing to even allow an implementation on a single-core cooperative multi-tasking system (no pre-emption), with the compiler inserting voluntary yields occasionally. (Infinite loops without any volatile accesses or I/O are UB). C++ on a system where only one thread can actually be executing at once is totally fine and possible, assuming you consider a scheduler timeslice to still be a "reasonable" amount of time. (Or less if you yield or otherwise block.)

Even the formalism ISO C++ uses to give its ordering guarantees is very different from the way hardware ISAs define their memory models. C++'s formal guarantees are purely in terms of happens-before and synchronizes-with, not reordering litmus tests or anything like that. e.g. How to achieve a StoreLoad barrier in C++11? is impossible to answer in terms of pure ISO C++ formalism. The "option C" in that Q&A serves to show just how weak the C++ guarantees are: a store and then a load of two different seq_cst variables is not sufficient to imply happens-before between the threads, according to the C++ formalism, even though there has to be a total order of all seq_cst operations. But it is sufficient in real life on systems with coherent cache and only local (within each CPU core) memory reordering; even on AArch64, the seq_cst load right after the seq_cst store still essentially gives us a StoreLoad barrier.
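For context, the store-then-load case referred to above has roughly this shape (a sketch under my naming, not necessarily the exact code from that Q&A):

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
}

void thread2() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

// The single total order of all seq_cst operations forbids r1==0 && r2==0,
// but neither thread necessarily synchronizes-with the other; no
// happens-before edge exists unless a load actually reads the other
// thread's store. With anything weaker than seq_cst, both loads may
// return 0 (StoreLoad reordering).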

when a thread A stores a value into an std::atomic

It depends what you mean by "doing" a store.

If you mean committing from the store buffer into L1d cache, then yes, that's the moment when a store becomes globally visible, on a normal machine that uses MESI to give all CPU cores a coherent view of memory.

Although note that on some ISAs, some other threads are allowed to see stores before they become globally visible via cache. (i.e. the hardware memory model may not be "multi-copy atomic", and allow IRIW reordering. POWER is the only example I know of that does this in real life. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads? for details on the HW mechanism: Store forwarding for retired aka graduated stores between SMT threads.)


If you mean executing locally so later loads in this thread can see it, then no. std::atomic can use a memory_order weaker than seq_cst.

All mainstream ISAs have memory-ordering rules weak enough to allow for a store buffer to decouple instruction execution from commit to cache. This also allows speculative out-of-order execution by giving stores somewhere private to live after execution, before we're sure that they were on the correct path of execution. (Stores can't commit to L1d until after the store instruction retires from the out-of-order part of the back end, and thus is known to be non-speculative.)

If you want to wait for your store to be visible to other threads before doing any later loads, use atomic_thread_fence(memory_order_seq_cst); on "normal" ISAs with the standard choice of C++-to-asm mappings, this compiles to a full barrier.
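A sketch of that pattern (variable names are illustrative):

#include <atomic>

std::atomic<int> my_flag{0}, other_flag{0};

int store_then_check() {
    my_flag.store(1, std::memory_order_relaxed);
    // Full barrier: later loads can't take their value until the store
    // above is globally visible. With the usual compiler mappings this is
    // mfence (or a locked instruction) on x86 and dmb ish on AArch64.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return other_flag.load(std::memory_order_relaxed);
}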

On most ISAs, a seq_cst store (the default) will also stall all later loads (and stores) in this thread until the store is globally visible. But on AArch64, STLR is a sequential-release store and execution of later loads/stores doesn't have to stall unless / until a LDAR (acquire load) is about to execute while the STLR is still in the store buffer. This implements SC semantics as weakly as possible, assuming AArch64 hardware actually works that way instead of just treating it as a store + full barrier.

Note that only later loads and stores need to be blocked; out-of-order execution of ALU instructions on registers can still continue. And if you were expecting some kind of timing effect due to dependency chains of FP operations, for example, that's not something you can depend on in C++.


Even if you do use seq_cst so nothing happens in this thread before the store is visible to others, that's still not instant. Inter-core latency on real hardware can be on the order of maybe 40ns on mainstream modern Intel x86, for example. (This thread doesn't have to stall that long on a memory barrier instruction; some of that time is the cache miss on the other thread trying to read the line that was invalidated by this core's RFO to get exclusive ownership.) Or of course much cheaper for logical cores that share the L1d cache of a physical core: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?

What does it mean that two stores are seen in a consistent order by other processors?

It means no IRIW reordering (Independent Readers, Independent Writers; at least 4 separate cores, at least 2 writers and 2 readers). Two readers will always agree on the order of any two stores performed by other cores.

Weaker memory models don't guarantee this, for example ISO C++11 only guarantees it for seq_cst operations, not for acq_rel or any weaker orders.

A few hardware memory models allow it on paper, including ARM before ARMv8. But in practice the only hardware known to actually violate it is POWER: see my answer to Will two atomic writes to different locations in different threads always be seen in the same order by other threads? for an explanation of the hardware mechanism that makes it possible (store-forwarding between SMT "hyperthreads" on the same physical core, making a store visible to some cores before it's globally visible).

x86 forbids this, so communication between hyperthreads has to wait for commit to L1d cache, i.e. for the store to become globally visible (thanks to MESI), before any other core can see it. See What will be used for data exchange between threads are executing on one Core with HT?

c++ multithread atomic load/store

The reason is that reading via x.load(std::memory_order_relaxed) guarantees only that you never see x decrease within the same thread (in this example code). (It also guarantees that a thread writing to x will read that same value again in the next iteration.)

In general, different threads can read different values from the same variable at the same time. That is, there need not be a consistent "global state" that all threads agree on. The example output is supposed to demonstrate that: The first thread might still see y = 0 when it already wrote x = 4, while the second thread might still see x = 0 when it already writes y = 2. The standard allows this because real hardware may work that way: Consider the case when the threads are on different CPU cores, each with its own private L1 cache.

However, it is not possible that the second thread sees x = 5 and then later sees x = 2 - the atomic object always guarantees that there is a consistent global modification order (that is, all writes to the variable are observed to happen in the same order by all the threads).

But when using std::memory_order_relaxed there are no guarantees about when a thread finally does "see" those writes*, or how the observations of different threads relate to each other. You need stronger memory ordering to get those guarantees.

*In fact, a valid output would be all threads reading only 0 all the time, except the writer threads reading what they wrote the previous iteration to their "own" variable (and 0 for the others). On hardware that never flushed caches unless prompted, this might actually happen, and it would be fully compliant with the C++ standard!
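The question's code is not shown above; a minimal reconstruction of its shape (illustrative, simplified to two variables) is:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};

void writer_x() {
    for (int i = 1; i <= 5; ++i) {
        x.store(i, std::memory_order_relaxed);
        // Guaranteed: this thread re-reads its own latest store to x.
        // Not guaranteed: seeing any particular value of y yet.
        std::printf("x=%d y=%d\n", x.load(std::memory_order_relaxed),
                    y.load(std::memory_order_relaxed));
    }
}

void writer_y() {
    for (int i = 1; i <= 5; ++i) {
        y.store(i, std::memory_order_relaxed);
        std::printf("x=%d y=%d\n", x.load(std::memory_order_relaxed),
                    y.load(std::memory_order_relaxed));
    }
}

int main() {
    std::thread t1(writer_x), t2(writer_y);
    t1.join();
    t2.join();
}

// A run where writer_x prints y=0 on every line (and vice versa) would be
// fully conformant, even after the other thread has stored 5.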

And I tested the code on my PC; I can't reproduce a result like that.

The "example output" shown is highly artificial. The C++ standard allows for this output to happen. This means you can write efficient and correct multithreaded code even on hardware with no inbuilt guarantees on cache coherency (see above). But common hardware today (x86 in particular) brings a lot of guarantees that actually make certain behavior impossible to observe (including the output in the question).

Also, note that x, y and z are extremely likely to be adjacent (depends on the compiler), meaning they will likely all land on the same cache line. This will lead to massive performance degradation (look up "false sharing"). But since memory can only be transferred between cores at cache line granularity, this (together with the x86 coherency guarantees) makes it essentially impossible that an x86 CPU (which you most likely performed your tests with) reads outdated values of any of the variables. Allocating these values more than 1-2 cache lines apart will likely lead to more interesting/chaotic results.
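A sketch of how to separate the variables onto different cache lines (std::hardware_destructive_interference_size is C++17; the 64-byte fallback is a common line-size assumption, not universal):

#include <atomic>
#include <cstddef>
#include <new>

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kLine = 64;   // typical cache-line size on x86
#endif

// Each atomic gets its own cache line, avoiding false sharing between
// the threads that write them.
alignas(kLine) std::atomic<int> x{0};
alignas(kLine) std::atomic<int> y{0};
alignas(kLine) std::atomic<int> z{0};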

Does the Boost.Atomic reference counting example contain a bug?

All modifications of a single atomic variable happen in a global modification order. It is not possible for two threads to disagree about this order.

The fetch_sub operation is an atomic read-modify-write operation and is required to read the value immediately preceding its own write in the modification order.

So it is not possible for the second thread to read 2 when the first thread's fetch_sub was first in the modification order. The implementation must ensure that such cache incoherence cannot happen, if necessary with the help of locks when the hardware doesn't support the atomic access natively. (That is what the is_lock_free and is_always_lock_free members of std::atomic are there to check.)

This is all independent of the memory orders of the operations; those matter only for the ordering of accesses to memory locations other than the atomic variable itself.
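For reference, the pattern under discussion, sketched with std::atomic (this mirrors the well-known Boost.Atomic reference-counting example; the names are mine):

#include <atomic>

struct RefCounted {
    mutable std::atomic<int> refcount{1};
    // ... payload ...
};

void release(const RefCounted* p) {
    // fetch_sub is an atomic RMW: it reads the value immediately preceding
    // its own write in the counter's modification order, so exactly one
    // thread sees 1.
    if (p->refcount.fetch_sub(1, std::memory_order_release) == 1) {
        // The acquire fence pairs with the release decrements in other
        // threads, so all their accesses to *p happen-before the delete.
        std::atomic_thread_fence(std::memory_order_acquire);
        delete p;
    }
}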


