For Purposes of Ordering, Is Atomic Read-Modify-Write One Operation or Two?

Memory ordering of a read-modify-write operation with a read-only or write-only memory order

Using std::memory_order_acquire on an atomic RMW makes it an acquire-only operation.

Since it does not 'release' anything, it cannot be part of a synchronizes-with relationship with another atomic acquire operation (on the same atomic variable) that occurs later.

Therefore, the ordering equivalent for the store part is std::memory_order_relaxed.

Similar reasoning applies to using std::memory_order_release on an RMW: it does not acquire anything, so the ordering equivalent for the load part is 'relaxed'.
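A minimal sketch of what that means in practice (the variable and function names here are my own, not from the question):

#include <atomic>

std::atomic<int> x{0};

void acquire_rmw() {
    // The RMW as a whole is an acquire operation: the load part is an acquire
    // load, so later memory operations in this thread can't be reordered
    // before it.
    int old = x.fetch_add(1, std::memory_order_acquire);

    // The store part behaves like a relaxed store for ordering purposes:
    // another thread's x.load(std::memory_order_acquire) that sees old + 1
    // does NOT synchronize-with this thread.
    (void)old;
}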

Independent Read-Modify-Write Ordering

I haven't fully grokked your code yet, but the bolded question has a straightforward answer:

Can I always rely on RMW operations not being reordered - even if they affect different memory locations?

No, you can't. Compile-time reordering of two relaxed RMWs in the same thread is very much allowed. (I think runtime reordering of two RMWs is probably impossible in practice on most CPUs. ISO C++ doesn't distinguish compile-time vs. run-time for this.)

But note that an atomic RMW includes both a load and a store, and both parts have to stay together. So any kind of RMW can't move earlier past an acquire operation, or later past a release operation.

Also, of course the RMW itself being a release and/or acquire operation can stop reordering in one or the other direction.
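For example, here's a hypothetical pair of relaxed RMWs on different atomics (names made up) that a compiler is allowed to swap, plus a variant showing how acquire/release on the RMWs themselves blocks that:

#include <atomic>

std::atomic<int> x{0}, y{0};

void two_relaxed_rmws() {
    // These touch different objects and neither has acquire or release
    // ordering, so the compiler may reorder them with each other.
    x.fetch_add(1, std::memory_order_relaxed);
    y.fetch_add(1, std::memory_order_relaxed);
}

void ordered_rmws() {
    // Making the first RMW an acquire stops the second from moving earlier
    // past it (making the second a release would stop the first from moving
    // later past it).
    x.fetch_add(1, std::memory_order_acquire);
    y.fetch_add(1, std::memory_order_relaxed);
}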


Of course, the C++ memory model isn't formally defined in terms of local reordering of access to cache-coherent shared memory, only in terms of synchronizing with another thread and creating a happens-before / after relationship. But if you ignore IRIW reordering (2 reader threads not agreeing on the order of two writer threads doing independent stores to different variables) it's pretty much 2 different ways to model the same thing.

If atomic_compare_exchange isn't atomic on its own thread, how can it implement a lock?

The C++ standard is a little vague on this point. But what they probably intended, and what happens in practice on many implementations, is that your non-atomic read can be reordered before the write. See For purposes of ordering, is atomic read-modify-write one operation or two?

However, for the usual locking idiom, this is not a problem, and does not lead to a data race. In taking a lock, it's tempting to think of the essential operation as the store of the "locked" value, which tells other threads that the lock is ours and they can't have it. But actually, what's important is the load of the "unlocked" value. This proves that the lock belongs to our thread. The atomicity of the read-modify-write operation guarantees that no other thread will load that value until we reset it, and so we can safely enter the critical section, even if our store of the "locked" value doesn't become visible until some time later.

You could imagine this happening on an LL/SC architecture. Suppose the load-link is done and returns the "unlocked" value. The store-conditional hasn't been done yet, and it could fail, in case some other core accessed the lock variable and we lost our exclusive ownership of it. However, in the meantime, we can perform a speculative load of the non-atomic variable, with the understanding that it won't retire unless the SC succeeds. If the SC fails, the LL/SC pair will have to be redone, and in that case our non-atomic load will also be redone. So by the time we finally do get a success, the non-atomic load will be visible after the LL, but possibly before the SC.

To see how this follows from the C++ memory ordering rules, let's consider a more fleshed-out example:

std::atomic<int> lock{0};
int non_atomic;

void reader() {
    int expected = 0;
    if (lock.compare_exchange_strong(expected, 1,
            std::memory_order_acquire, std::memory_order_relaxed)) {
        int local = non_atomic;
        // ...
        lock.store(0, std::memory_order_release);
    }
}

void writer() {
    int expected = 0;
    if (lock.compare_exchange_strong(expected, 2,
            std::memory_order_acquire, std::memory_order_relaxed)) {
        non_atomic = 42;
        // ...
        lock.store(0, std::memory_order_release);
    }
}

The atomicity promise of a read-modify-write operation comes from C++20 [atomics.order p10]:

Atomic read-modify-write operations shall always read the last value (in the modification order) written
before the write associated with the read-modify-write operation.

The modification order of lock is a total order. The interesting case is when both critical sections are entered, i.e. both compare-exchanges read the value 0. By [atomics.order p10], the 0 loaded by reader must be immediately followed in the modification order by its 1, and the 0 loaded by writer must be immediately followed by its 2. So those are two distinct 0s, and the modification order of lock must therefore be either 0 1 0 2 0 or 0 2 0 1 0.

If it's 0 1 0 2 0, then the second 0 must come from the release store in reader. It can't be from the store in writer, because that one must come after 2 in the modification order (writer's store of 0 is sequenced after its store of 2; this is write-write coherence, [intro.races p15]). So the acquire load in writer reads the second 0 in the order, which was put there by the release store in reader, and so they synchronize. This ensures that the access to non_atomic in reader happens before the access in writer, and there is no data race.

If it's 0 2 0 1 0, the logic is reversed, and the access in writer happens before the access in reader; again no data race.

Atomic operations, std::atomic and ordering of writes

I put your example on the Godbolt compiler explorer, and added some functions to read, increment, or combine (a+=b) two atomic variables. I also used a.store(1, memory_order_release); instead of a = 1; to avoid getting more ordering than needed, so it's just a simple store on x86.

See below for (hopefully correct) explanations. update: I had "release" semantics confused with just a StoreStore barrier. I think I fixed all the mistakes, but may have left some.


The easy question first:

Is the write to ‘a’ guaranteed to be atomic?

Yes, any thread reading a will get either the old or the new value, not some half-written value. This happens for free on x86 and most other architectures with any aligned type that fits in a register. (e.g. not int64_t on 32bit.) Thus, on many systems, this happens to be true for b as well, the way most compilers would generate code.

There are some types of stores that may not be atomic on an x86, including unaligned stores that cross a cache line boundary. But std::atomic of course guarantees whatever alignment is necessary.

Read-modify-write operations are where this gets interesting. 1000 evaluations of a += 3, spread across multiple threads running at once, will always increase a by exactly 3000. You'd potentially lose some of those increments if a wasn't atomic.
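A sketch of that claim, with thread and iteration counts chosen so there are 1000 evaluations in total (none of this is from the original question):

#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> a{0};

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) {
        threads.emplace_back([] {
            for (int i = 0; i < 250; ++i)
                a += 3;          // atomic RMW: no increment can be lost
        });
    }
    for (auto &th : threads)
        th.join();
    assert(a == 3000);           // 4 threads * 250 iterations * 3
}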

Fun fact: signed atomic types guarantee two's complement wraparound, unlike normal signed types. C and C++ still cling to the idea of leaving signed integer overflow undefined in other cases. Some CPUs don't have arithmetic right shift, so leaving right-shift of negative numbers undefined makes some sense, but otherwise it just feels like a ridiculous hoop to jump through now that all CPUs use 2's complement and 8bit bytes. </rant>


Is any other thread reading ‘a’ as 1 guaranteed to read ‘b’ as 2?

Yes, because of the guarantees provided by std::atomic.

Now we're getting into the memory model of the language, and the hardware it runs on.

C11 and C++11 have a very weak memory ordering model, which means the compiler is allowed to reorder memory operations unless you tell it not to. (source: Jeff Preshing's Weak vs. Strong Memory Models). Even if x86 is your target machine, you have to stop the compiler from re-ordering stores at compile time. (e.g. normally you'd want the compiler to hoist a = 1 out of a loop that also writes to b.)

Using C++11 atomic types gives you full sequential-consistency ordering of operations on them with respect to the rest of the program, by default. This means they're a lot more than just atomic. See below for relaxing the ordering to just what's needed, which avoids expensive fence operations.


Why does the MFENCE happen after the write to ‘a’, not before?

StoreStore fences are a no-op with x86's strong memory model, so the compiler just has to put the store to b before the store to a to implement the source code ordering.

Full sequential consistency also requires that the store be globally ordered / globally visible before any later loads in program order.

x86 can reorder an earlier store with a later load (StoreLoad reordering). In practice, what happens is that out-of-order execution sees an independent load in the instruction stream and executes it ahead of a store that was still waiting for its data to be ready. Anyway, sequential consistency forbids this, so gcc uses MFENCE, which is a full barrier, including StoreLoad (the only kind of barrier x86 doesn't give you for free; LFENCE/SFENCE are only useful for weakly-ordered operations like movnt).

Another way to put this is the way the C++ docs describe it: sequential consistency guarantees that all threads see all changes in the same order. The MFENCE after every atomic store makes this thread's later loads wait until the store is globally visible. Otherwise, our loads could see our own stores before other threads' loads saw them. A StoreLoad barrier (MFENCE) delays our loads until after the stores that need to happen first.
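As a rough sketch of what this costs in practice (function names invented; the comments describe typical x86 code generation, which varies by compiler and version):

#include <atomic>

std::atomic<int> a{0};

void store_seq_cst() {
    a.store(1);                             // default seq_cst: typically mov + mfence
                                            // (or a single xchg) on x86
}

void store_release() {
    a.store(1, std::memory_order_release);  // typically just a plain mov on x86;
                                            // only compile-time ordering is needed
}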

The ARM32 asm for b=2; a=1; is:

# get pointers and constants into registers
str r1, [r3]        # store b=2
dmb sy              # Data Memory Barrier: full memory barrier to order the stores.
                    # I think just a StoreStore barrier here (dmb st) would be sufficient,
                    # but gcc doesn't do that. Maybe later versions have that optimization,
                    # or maybe I'm wrong.
str r2, [r3, #4]    # store a=1 (a is 4 bytes after b)
dmb sy              # full memory barrier to order this store wrt. all following loads and stores.

I don't know ARM asm well, but what I've figured out so far is that normally it's op dest, src1 [,src2], but loads and stores always have the register operand first and the memory operand second. This is really weird if you're used to x86, where a memory operand can be the source or dest for most non-vector instructions. Loading immediate constants also takes multiple instructions, because the fixed instruction length only leaves room for 16 bits of payload for movw (move word) / movt (move top).


Release / Acquire

The release and acquire naming for one-way memory barriers comes from locks:

  • One thread modifies a shared data structure, then releases a lock. The unlock has to be globally visible after all the loads/stores to data it's protecting. (StoreStore + LoadStore)
  • Another thread acquires the lock (a read, or an RMW with an acquire load), and must do all loads/stores to the shared data structure after the acquire becomes globally visible. (LoadLoad + LoadStore)

Note that std::atomic uses these names even for standalone fences, which are slightly different from load-acquire or store-release operations. (See atomic_thread_fence, below.)

Release/Acquire semantics are stronger than what producer-consumer requires. That just requires one-way StoreStore (producer) and one-way LoadLoad (consumer), without LoadStore ordering.

A shared hash table protected by a readers/writers lock (for example) requires an acquire-load / release-store atomic read-modify-write operation to acquire the lock. x86 lock xadd is a full barrier (including StoreLoad), but ARM64 has load-acquire/store-release versions of load-linked/store-conditional for doing atomic read-modify-writes. As I understand it, this avoids the need for a StoreLoad barrier even for locking.
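Here's a minimal sketch of that kind of lock in C++ terms (names invented; back-off, fairness, and error handling omitted):

#include <atomic>

std::atomic<bool> locked{false};
int shared_data = 0;                 // protected by `locked`

void with_lock() {
    // Take the lock: the RMW's load side needs acquire ordering so the
    // critical section can't leak out above it.
    while (locked.exchange(true, std::memory_order_acquire)) {
        // spin (a real lock would back off or yield here)
    }

    ++shared_data;                   // critical section

    // Release the lock: release ordering keeps the critical section from
    // leaking out below the unlocking store.
    locked.store(false, std::memory_order_release);
}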


Using weaker but still sufficient ordering

Writes to std::atomic types are ordered with respect to every other memory access in source code (both loads and stores), by default. You can control what ordering is imposed with std::memory_order.

In your case, you only need your producer to make sure stores become globally visible in the correct order, i.e. a StoreStore barrier before the store to a. store(memory_order_release) includes this and more. std::atomic_thread_fence(memory_order_release) is just a 1-way StoreStore barrier for all stores. x86 does StoreStore for free, so all the compiler has to do is put the stores in source order.

Release instead of seq_cst will be a big performance win, esp. on architectures like x86 where release is cheap/free. This is even more true if the no-contention case is common.

Reading an atomic variable also imposes, by default, full sequential consistency of the load with respect to all other loads and stores. On x86 this is free: LoadLoad and LoadStore barriers are no-ops, implicit in every memory operation. You can make your code more efficient on weakly-ordered ISAs by using a.load(std::memory_order_acquire) when that's all you need.
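Applied to the a/b example, a sketch with only as much ordering as the producer/consumer handoff needs (assuming a is the flag and b is the payload, both atomic as in the question):

#include <atomic>

std::atomic<int> a{0}, b{0};

void producer() {
    b.store(2, std::memory_order_relaxed);   // payload
    a.store(1, std::memory_order_release);   // flag: keeps the store to b ahead of it
}

void consumer() {
    if (a.load(std::memory_order_acquire) == 1) {        // synchronizes-with the producer
        int payload = b.load(std::memory_order_relaxed);
        // payload is guaranteed to be 2 here
        (void)payload;
    }
}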

Note that the std::atomic standalone fence functions confusingly reuse the "acquire" and "release" names for StoreStore and LoadLoad fences that order all stores (or all loads) in at least the desired direction. In practice, they will usually emit HW instructions that are 2-way StoreStore or LoadLoad barriers. This doc is the proposal for what became the current standard. You can see how memory_order_release maps to #LoadStore | #StoreStore on SPARC RMO, which I assume was included partly because it has all the barrier types separately. (Hmm, the cppreference page only mentions ordering stores, not the LoadStore component. It's not the C++ standard, though, so maybe the full standard says more.)
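For completeness, a sketch of the standalone-fence version of a flag/payload handoff (names invented); the fences synchronize when the relaxed load reads the value from the relaxed store:

#include <atomic>

std::atomic<int> flag{0};
int payload;

void fence_producer() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release);  // orders all earlier stores
    flag.store(1, std::memory_order_relaxed);             // before this (relaxed) store
}

void fence_consumer() {
    if (flag.load(std::memory_order_relaxed) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);  // orders later loads after
        int x = payload;                                      // the (relaxed) load of flag
        (void)x;                                              // x is guaranteed to be 42
    }
}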


memory_order_consume isn't strong enough for this use-case. Preshing's post The Purpose of memory_order_consume in C++11 (also listed in the references below) talks about your case of using a flag to indicate that other data is ready, and talks about memory_order_consume.

consume would be enough if your flag was a pointer to b, or even a pointer to a struct or array. However, no compiler knows how to do the dependency tracking to make sure it puts things in the proper order in the asm, so current implementations always treat consume as acquire. This is too bad, because every architecture except DEC Alpha (and C++11's software model) provides this ordering for free. According to Linus Torvalds, only a few Alpha hardware implementations actually could have this kind of reordering, so the expensive barrier instructions needed all over the place were pure downside for most Alphas.

The producer still needs to use release semantics (a StoreStore barrier), to make sure the new payload is visible when the pointer is updated.

It's not a bad idea to write code using consume, if you're sure you understand the implications and don't depend on anything that consume doesn't guarantee. In the future, once compilers are smarter, your code will compile without barrier instructions even on ARM/PPC. The actual data movement still has to happen between caches on different CPUs, but on weak memory model machines, you can avoid waiting for any unrelated writes to be visible (e.g. scratch buffers in the producer).
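A sketch of what consume-style publication looks like (types and names invented; with current compilers the consume load is silently strengthened to acquire):

#include <atomic>

struct Payload { int x, y; };

std::atomic<Payload*> ptr{nullptr};

void publish() {
    Payload* p = new Payload{1, 2};          // leaked in this sketch
    ptr.store(p, std::memory_order_release); // the producer still uses release
}

void read_via_consume() {
    Payload* p = ptr.load(std::memory_order_consume);  // compiled as acquire today
    if (p) {
        // Loads *through* p are data-dependent on the consume load, which is
        // the ordering consume is meant to give for free on everything but Alpha.
        int x = p->x;
        (void)x;
    }
}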

Just keep in mind that you can't actually test memory_order_consume code experimentally, because current compilers are giving you stronger ordering than the code requests.

It's really hard to test any of this experimentally anyway, because it's timing-sensitive. Also, unless the compiler re-orders operations (because you failed to tell it not to), producer-consumer threads will never have a problem on x86. You'd need to test on an ARM or PowerPC or something to even try to look for ordering problems happening in practice.


references:

  • https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67458: I reported the gcc bug I found with b=2; a.store(1, MO_release); b=3; producing a=1;b=3 on x86, rather than b=3; a=1;

  • https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461: I also reported the fact that ARM gcc uses two dmb sy in a row for a=1; a=1;, and x86 gcc could maybe do with fewer mfence operations. I'm not sure if an mfence between each store is needed to protect a signal handler from making wrong assumptions, or if it's just a missing optimization.

  • The Purpose of memory_order_consume in C++11 (already linked above) covers exactly this case of using a flag to pass a non-atomic payload between threads.

  • What StoreLoad barriers (x86 mfence) are for: a working sample program that demonstrates the need: http://preshing.com/20120515/memory-reordering-caught-in-the-act/

  • Data-dependency barriers (only Alpha needs explicit barriers of this type, but C++ potentially needs them to prevent the compiler doing speculative loads): http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#360
  • Control-dependency barriers: http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#592

  • Doug Lea says x86 only needs LFENCE for data that was written with "streaming" writes like movntdqa or movnti. (NT = non-temporal). Besides bypassing the cache, x86 NT loads/stores have weakly-ordered semantics.

  • http://preshing.com/20120913/acquire-and-release-semantics/

  • http://preshing.com/20120612/an-introduction-to-lock-free-programming/ (pointers to books and other stuff he recommends).

  • Interesting thread on realworldtech about whether barriers everywhere or strong memory models are better, including the point that data-dependency is nearly free in HW, so it's dumb to skip it and put a large burden on software. (The thing Alpha (and C++) doesn't have, but everything else does). Go back a few posts from that to see Linus Torvalds' amusing insults, before he got around to explaining more detailed / technical reasons for his arguments.

Why it's termed read-modify-write but not read-write?

Why it's termed read-modify-write but not read-write?

Because that is exactly the sequence of events on a typical architecture such as x86.

  1. read: The value is read from a memory location (cache) into a CPU register
  2. modify: The value is incremented inside the CPU register
  3. write: The updated register value is written back to memory (cache).

To make the whole sequence appear atomic to other cores, the cache line is kept locked between the read and the write operation.

For example, incrementing a C++ atomic variable:

g.fetch_add(1);

is compiled by gcc into:

0x00000000004006c0 <+0>:     lock addl $0x1,0x200978(%rip)        # 0x601040 <g>

Despite being a single instruction, addl is not atomic by itself.
The lock prefix is what guarantees that no other core can access the cache line between the read and the write-back of the updated value, and it also makes the instruction a full memory barrier.

The MESI cache coherence protocol ensures that all cores observe the updated value once the lock on the cache line has been released.
This guarantees that all threads observe the latest value in the modification order, which is what the C++ standard requires for RMW operations.

Can't relaxed atomic fetch_add reorder with later loads on x86, like store can?

The standard allows 00, but you'll never get it on x86 (without compile-time reordering). The only way to implement atomic RMW on x86 involves a lock prefix, which is a "full barrier", which is strong enough for seq_cst.
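The question's exact code isn't reproduced here, but a litmus test of that shape looks roughly like this (variable names are my own):

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.fetch_add(1, std::memory_order_relaxed);  // a lock'ed RMW on x86: full barrier
    r1 = y.load(std::memory_order_relaxed);
}

void thread2() {
    y.fetch_add(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

// ISO C++ allows r1 == 0 && r2 == 0, but on x86 the lock'ed RMW already includes
// a StoreLoad barrier, so at least one thread must see the other's increment
// (barring compile-time reordering).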

In C++ terms, atomic RMWs are effectively promoted to seq_cst when compiling for x86. (Only after any compile-time reordering is nailed down - e.g. non-atomic loads / stores can reorder / combine across a relaxed fetch_add, and so can other relaxed operations, and there can be one-way reordering with acquire or release operations. Compilers are less likely to reorder atomic ops with each other, though, since they don't combine them, and combining accesses is one of the main reasons for compile-time reordering.)

In fact, most compilers implement a.store(1, mo_seq_cst) by using xchg (which has an implicit lock prefix), because it's faster than mov + mfence on modern CPUs, and turning 0 into 1 with lock add as the only write to each object is exactly identical. Fun fact: with just store and load, your code will compile to the same asm as https://preshing.com/20120515/memory-reordering-caught-in-the-act/, so the discussion there applies.


ISO C++ allows the whole relaxed RMW to reorder with the relaxed load, but normal compilers won't do that at compile-time for no reason. (A DeathStation 9000 C++ implementation could/would). So you've finally found a case where it would be useful to test on a different ISA. The ways in which an atomic RMW (or even parts of it) can reorder at run-time depend a lot on the ISA.


An LL/SC machine that needs a retry loop to implement fetch_add (for example ARM, or AArch64 before ARMv8.1) may be able to implement a truly relaxed RMW that can reorder at run time, because anything stronger than relaxed would require barriers (or acquire/release versions of the instructions, like AArch64 ldaxr / stlxr vs. ldxr / stxr). So if there's an asm difference between relaxed and acquire and/or release (and sometimes seq_cst is different again), that difference is likely necessary, and it prevents some run-time reordering.

Even a single-instruction atomic operation might be able to be truly relaxed on AArch64; I haven't investigated. Most weakly-ordered ISAs have traditionally used LL/SC atomics, so I might just be conflating those.

In an LL/SC machine, the store side of an LL/SC RMW can even reorder with later loads separately from the load side, unless they're both seq_cst. See For purposes of ordering, is atomic read-modify-write one operation or two?


To actually see 00, both loads would have to happen before the store part of the RMW was visible in the other thread. And yes, the HW reordering mechanism in an LL/SC machine would, I think, be pretty similar to reordering a plain store.

What is the use case for the atomic exchange (read-write) operation?

Your typical spinlock:

std::atomic<bool> lock;  // initialize to false

{ // some critical section, trying to get the lock:

    while (lock.exchange(true)) { } // now we have the lock

    /* do stuff */

    lock = false; // release lock
}

See Herb Sutter's wait-free queue for a real-world application.


