Memory Model Ordering and Visibility

Memory model ordering and visibility?

If you prefer to think in terms of fences:

  • a.load(memory_order_acquire) is equivalent to a.load(memory_order_relaxed) followed by atomic_thread_fence(memory_order_acquire).

  • a.store(x,memory_order_release) is equivalent to a call to atomic_thread_fence(memory_order_release) followed by a.store(x,memory_order_relaxed).

  • memory_order_consume is a special case of memory_order_acquire, for dependent data only.

  • memory_order_seq_cst is special: it forms a total order across all memory_order_seq_cst operations. Mixed with the others, it behaves as an acquire for a load and a release for a store.

  • memory_order_acq_rel is for read-modify-write (RMW) operations, and is equivalent to an acquire on the read part and a release on the write part of the RMW.
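As a minimal sketch of the first two equivalences (the names a and x come from the bullets above; the enclosing function is hypothetical):

#include <atomic>

std::atomic<int> a;

void equivalences(int x)
{
    // These two loads give the same ordering guarantees:
    int v1 = a.load(std::memory_order_acquire);
    int v2 = a.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);

    // ...and so do these two stores:
    a.store(x, std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_release);
    a.store(x, std::memory_order_relaxed);

    (void)v1; (void)v2; // silence unused-variable warnings
}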

The use of ordering constraints on atomic operations may or may not result in actual fence instructions, depending on the hardware architecture. In some cases the compiler will generate better code if you put the ordering constraint on the atomic operation rather than using a separate fence.

On x86, loads are always acquire, and stores are always release. memory_order_seq_cst requires stronger ordering, with either an MFENCE instruction or a LOCK-prefixed instruction (there is an implementation choice as to whether the store or the load carries the stronger ordering). Consequently, standalone acquire and release fences are no-ops at the instruction level (though they still restrict compiler reordering), but atomic_thread_fence(memory_order_seq_cst) is not: it again requires an MFENCE or a LOCKed instruction.
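As an illustration, here is the kind of code generation one typically sees from mainstream compilers on x86-64 for each case. The exact instruction choices vary by compiler and version, so treat the comments as typical rather than guaranteed:

std::atomic<int> v;

int r = v.load(std::memory_order_acquire);            // plain MOV
v.store(1, std::memory_order_release);                // plain MOV
v.store(1, std::memory_order_seq_cst);                // XCHG, or MOV followed by MFENCE
std::atomic_thread_fence(std::memory_order_acquire);  // no instruction emitted
std::atomic_thread_fence(std::memory_order_seq_cst);  // MFENCE (or a LOCKed instruction)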

An important effect of the ordering constraints is that they also order operations on other variables, including non-atomic ones.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready(false);
int i = 0;

void thread_1()
{
    i = 42;                                        // plain (non-atomic) write
    ready.store(true, std::memory_order_release);  // release: publishes the write to i
}

void thread_2()
{
    while (!ready.load(std::memory_order_acquire)) // acquire: synchronizes with the release store
        std::this_thread::yield();
    assert(i == 42);                               // guaranteed not to fire
}

thread_2 spins until it reads true from ready. Since the store to ready in thread_1 is a release and the load in thread_2 is an acquire, the store synchronizes-with the load. The store to i therefore happens-before the load from i in the assert, and the assert will not fire.

2) The second line in

atomicVar.store(42);
std::atomic_thread_fence(std::memory_order_seq_cst);

is indeed potentially redundant, because the store to atomicVar uses memory_order_seq_cst by default. However, if there are other non-memory_order_seq_cst atomic operations on this thread then the fence may have consequences. For example, it would act as a release fence for a subsequent a.store(x,memory_order_relaxed).
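A sketch of that case (a and x are the names used above; the enclosing function is hypothetical):

std::atomic<int> atomicVar;
std::atomic<int> a;

void writer(int x)
{
    atomicVar.store(42);                                  // memory_order_seq_cst by default
    std::atomic_thread_fence(std::memory_order_seq_cst);  // redundant for atomicVar alone,
    a.store(x, std::memory_order_relaxed);                // but acts as a release fence for this store
}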

3) Fences and atomic operations do not work like mutexes. You can use them to build mutexes (see the sketch below), but they do not behave like them. You never have to use atomic_thread_fence(memory_order_seq_cst): there is no requirement that any atomic operation be memory_order_seq_cst, and ordering on non-atomic variables can be achieved without it, as in the example above.
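To make the "build mutexes" point concrete, here is a minimal spinlock sketch using only acquire/release atomics (the class name spinlock is illustrative, not a standard type):

#include <atomic>

class spinlock
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        // Acquire on success: we see everything published by the previous unlock().
        while (flag.test_and_set(std::memory_order_acquire))
            ; // spin until the flag is clear
    }
    void unlock()
    {
        // Release: publishes this critical section's writes to the next locker.
        flag.clear(std::memory_order_release);
    }
};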

4) No, these are not equivalent. Your snippet without the mutex lock is a data race and therefore undefined behaviour.

5) No, your assert cannot fire. With the default memory ordering of memory_order_seq_cst, the store to and load from the atomic pointer p work like the store and load in my example above, and the stores to the array elements are guaranteed to happen-before the reads.

memory_order_relaxed and visibility

Is it guaranteed that T2 sees the value 7 after the load?

Memory order is irrelevant here; atomic operations are atomic. So long as you have ensured that the write "happens-before" the read (which you stated to be true in the premise of your question), and there are no intervening writes, T2 will read the value that T1 wrote. This is the nature of atomic operations, and memory orders do not change it.

What memory orders control is not whether T2 sees the 7, but whether, having seen it, T2 can also safely access other data that T1 modified before it stored 7 into the atomic. With relaxed memory ordering, T2 has no such guarantee.
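A minimal sketch of that distinction (the names flag and data are hypothetical):

std::atomic<int> flag{0};
int data = 0;               // non-atomic

// T1:
data = 42;
flag.store(7, std::memory_order_relaxed);

// T2:
if (flag.load(std::memory_order_relaxed) == 7)
{
    // Guaranteed: the load saw exactly the value 7; atomics never tear.
    // Not guaranteed: reading `data` here is a data race, because relaxed
    // ordering does not publish T1's earlier write to `data`.
}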


Note: you changed your question from a situation where the load "happens after" the store (the store being explicitly "synchronized" with the load) into a situation that is more nebulous. There is no "absolute time" as far as the C++ object model is concerned. All atomic operations on a particular atomic object happen in an order, but unless something explicitly creates a "happens before/after" relationship between the two operations, the value that gets loaded cannot be known. It will be one of the two possibilities, but which one cannot be known.

What's the relationship/difference between Visibility and Ordering?

Yes, ordering and visibility are related issues.

  • Visibility is about whether (and when) one thread sees the results of memory writes performed by another thread.

  • Ordering is about whether the order in which updates are seen by a second thread matches the (program) order in which the first thread wrote them.

The Java memory model doesn't directly address the ordering of writes. Any constraints on order are (as you hypothesize) a consequence of the visibility rules: these are specified in the JLS. This includes the case where you use volatile variables to effectively inhibit reordering.

It is also worth noting that the reordering of writes (from the perspective of a second thread) can happen whenever the JLS memory model does not require visibility. Correct synchronization will guarantee visibility at the point of synchronization.
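The same point can be made with C++ atomics (a sketch, since the question itself is about Java; the names a and b are hypothetical):

std::atomic<int> a{0}, b{0};

// writer thread (program order: a, then b)
a.store(1, std::memory_order_relaxed);
b.store(1, std::memory_order_relaxed);

// reader thread: may legitimately observe b == 1 while a == 0, i.e. the
// writes appear reordered, because nothing here requires visibility
// between the threads.
int rb = b.load(std::memory_order_relaxed);
int ra = a.load(std::memory_order_relaxed);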

Synchronising with mutex and relaxed memory order atomic

Your example code will work, and yes, you will see the updates. The relaxed ordering will give you the correct behavior. That said, it may not actually be optimal in terms of performance.

Let's look at a more concrete example, with the mutexes made explicit.

#include <atomic>
#include <mutex>

std::mutex m;
std::atomic<bool> updated(false);
foo shared;   // foo, new_data and data stand in for the questioner's types

void thr1() {
    m.lock();
    shared.write(new_data);
    m.unlock();
    updated.store(true, std::memory_order_relaxed);
}

void thr2() {
    if (updated.load(std::memory_order_relaxed)) {
        m.lock();
        data = shared.read();
        m.unlock();
    }
}

Informal explanation

m.lock() is an acquire operation and m.unlock() is a release operation. This means that nothing prevents updated.store(true) from "floating" up into the critical section, past m.unlock() and even past shared.write(). At first glance this seems bad, because the whole point of the updated flag is to signal that shared.write() has finished. But no actual harm occurs in that case: thr1 still holds the mutex m, so if thr2 starts trying to read the shared data, it will simply wait until thr1 releases it.

What would really be bad is if updated.store() were to float all the way up past m.lock(); then thr2 could potentially see updated.load() == true and take the mutex before thr1 does. However, this cannot happen, because of the acquire semantics of m.lock().

There could be some related issues in thr2 (a little more complicated, because they would have to be speculative), but again we are saved by the same fact: the updated.load() can sink downward into the critical section, but not past it entirely (because m.unlock() is a release).

But this is an instance where a stronger memory order on the updated operations, although seemingly more expensive, might actually improve performance. If the value true in updated becomes visible prematurely, then thr2 attempts to lock m while it is already locked by thr1, and so thr2 will have to block while it waits for m to become available. But if you changed to updated.store(true, std::memory_order_release) and updated.load(std::memory_order_acquire), then the value true in updated can only become visible after m is truly unlocked by thr1, and so the m.lock() in thr2 should always succeed immediately (ignoring contention by any other threads that might exist).
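Here is the same code with that change applied (a sketch, reusing the placeholders above):

void thr1() {
    m.lock();
    shared.write(new_data);
    m.unlock();
    // Release: the true value can only become visible after the unlock above.
    updated.store(true, std::memory_order_release);
}

void thr2() {
    // Acquire: if we see true, thr1's unlock has already happened, so the
    // m.lock() below should succeed immediately (absent other contention).
    if (updated.load(std::memory_order_acquire)) {
        m.lock();
        data = shared.read();
        m.unlock();
    }
}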



Proof

Okay, that was an informal explanation, but we know those are always risky when thinking about memory ordering. Let's give a proof from the formal rules of the C++ memory model. I will follow the C++20 standard because I have it handy, but I don't believe there are any significant relevant changes from C++17. See [intro.races] for definitions of the terms used here.

I claim that, if shared.read() executes at all, then shared.write(new_data) happens before it, and so by write-read coherence [intro.races p18] shared.read() will see the new data.

The lock and unlock operations on m are totally ordered [thread.mutex.requirements.mutex p5]. Consider two cases: either thr1's unlock precedes thr2's lock (Case I), or vice versa (Case II).

Case I

If thr1's unlock precedes thr2's lock in m's lock order, then there is no problem; they synchronize with each other [thread.mutex.requirements.mutex p11]. Since shared.write(new_data) is sequenced before thr1's m.unlock(), and thr2's m.lock() is sequenced before shared.read(), by chasing the definitions in [intro.races] we see that shared.write(new_data) indeed happens before shared.read().

Case II

Now suppose the contrary, that thr2's lock precedes thr1's unlock in m's lock order. Since locks and unlocks of the same mutex cannot interleave (that's the whole point of a mutex, to provide mutual exclusion), the lock total order on m must be as follows:

thr2: m.lock()
thr2: m.unlock()
thr1: m.lock()
thr1: m.unlock()

That means that thr2's m.unlock() synchronizes with thr1's m.lock(). Now updated.load() is sequenced before thr2's m.unlock(), and thr1's m.lock() is sequenced before updated.store(true), so it follows that updated.load() happens before updated.store(true). By read-write coherence [intro.races p17], updated.load() must not take its value from updated.store(true), but from some strictly earlier side effect in the modification order of updated; presumably its initialization to false.

We conclude that updated.load() must return false in this case. But if that were so, then thr2 would never have tried to lock the mutex in the first place. This is a contradiction, so Case II must never occur.

meaning of "memory ordering obeys causality"?

Obeying causality means that if thing A causes another thing B, and if you observe that B has happened, then you will also observe that A has happened. Consider the following code:

x = 0
y = 0

thread 1          thread 2

x = 10            r1 = y   // #1
y = 25            r2 = x   // #2

Assuming that all reads and writes are individually correct (i.e. atomic at the instruction level, with no tearing, e.g. by using relaxed atomics in C++), consider what thread 2 can conclude about the world from the value it reads at #1.

  • If r1 is zero, then we know nothing. The execution of thread 1 could be anywhere: it might not have started yet, or it might already have completed with its changes not yet visible.

  • However, if r1 is 25, then by causality we know that r2 must read 10, because the only way r1 could have read 25 is if thread 1 had already executed both statements, and the strong memory ordering of x86 guarantees that the effects of causally preceding stores are visible.

Note that this is not a general feature of all hardware. In popular contemporary memory models (such as those of C++11, C11, Java and Go), we would say that, in the operations above, the store y = 25 has "release ordering", the load r1 = y has "acquire ordering", and the two operations "synchronize".
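Expressed portably with C++11 atomics, the same guarantee looks roughly like this (a sketch; x, y, r1 and r2 are the names from the pseudocode above):

std::atomic<int> x{0}, y{0};

// thread 1:
x.store(10, std::memory_order_relaxed);
y.store(25, std::memory_order_release);      // "release ordering"

// thread 2:
int r1 = y.load(std::memory_order_acquire);  // "acquire ordering"   (#1)
int r2 = x.load(std::memory_order_relaxed);  //                      (#2)
// If r1 == 25, the acquire load synchronizes with the release store,
// so r2 is guaranteed to be 10 on any conforming implementation.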


