C++ Memory Barriers for Atomics

Is a memory barrier required to read a value that is atomically modified?

No, you don't need barriers, but your code is broken anyway if readers and writers call these functions in different threads. Especially if a reader calls the read function in a loop.

TL:DR: use C++11 std::atomic<long> m_value with return m_value++ in the increment and return m_value in the reader. That will give you sequential consistency in a data-race-free program: execution will work as if the threads ran with some interleaving of source order. (Unless you violate the rules and have other non-atomic shared data.) You definitely want to return a value from Increment if you want the incrementing threads to ever know what value they produced. Doing a separate load afterwards is totally broken for use cases like int sequence_num = shared_counter++;, where another thread's increment could become visible between count++; and tmp = count;.
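
For concreteness, a minimal sketch of that (the class and member names are just illustrative, not from the question):

#include <atomic>

class Counter {
    std::atomic<long> m_value{0};
public:
    long Increment() { return m_value++; }  // seq_cst atomic RMW; returns the pre-increment value
    long Read() const { return m_value; }   // seq_cst atomic load
};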

If you don't need such strong ordering with respect to operations on other objects in the same thread as the reader/writer, return m_value.load(std::memory_order_acquire) is enough for most uses, and m_value.fetch_add(1, std::memory_order_acq_rel). Very few programs actually need StoreLoad barriers anywhere; atomic RMWs can't actually reorder very much even with acq_rel. (On x86, those will both compile the same as if you'd used seq_cst.)
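
The same increment and read with those explicit weaker orders, again just a sketch (m_value shown as a namespace-scope variable for brevity):

#include <atomic>

std::atomic<long> m_value{0};

long Increment() {
    return m_value.fetch_add(1, std::memory_order_acq_rel);  // returns the old value, like m_value++ would
}

long Read() {
    return m_value.load(std::memory_order_acquire);
}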

You can't force ordering between threads; the load either sees the value or it doesn't, depending on whether the reading thread saw the invalidate from the writer before or after it took / tried to take a value for the load. The whole point of threads is that they don't run in lock-step with each other.



Data-race UB:

A loop reading m_value can hoist the load out of the loop since it's not atomic (or even volatile as a hack). This is data-race UB; compilers will break your reader. See this Q&A and Multithreading program stuck in optimized mode but runs normally in -O0
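
A hedged illustration of the kind of reader that breaks (variable names are mine):

#include <atomic>

long g_plain_counter = 0;                 // plain non-atomic shared variable
std::atomic<long> g_atomic_counter{0};

void broken_reader() {
    // Data-race UB: the compiler may hoist the load out of the loop and
    // spin forever on a stale register value.
    while (g_plain_counter == 0) { /* spin */ }
}

void working_reader() {
    // An atomic load has to be redone every iteration.
    while (g_atomic_counter.load(std::memory_order_relaxed) == 0) { /* spin */ }
}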

Barriers aren't the problem/solution here; what you'd actually want is just to force re-checking of memory (or rather of the cache-coherent view of memory that the current CPU sees; actual CPU caches like L1d and L2 are not a problem for this). That's not what barriers really do; they order this core's access to coherent cache. C++ threads only run across cores with coherent caches.

But seriously don't roll your own atomics without a very compelling reason. When to use volatile with multi threading? pretty much never. That answer explains cache coherency and that you don't need barriers to avoid seeing stale values.

In many real-world C++ implementations, something like std::atomic_thread_fence() will also be a "compiler barrier" that forces the compiler to reload even non-atomic vars from memory, even without volatile, but that's an implementation detail. So it may happen to work well enough, on some compilers for some ISAs. And still isn't fully safe against the compiler inventing multiple loads; See the LWN article Who's afraid of a big bad optimizing compiler? for examples with details; primarily aimed at how the Linux kernel rolls its own atomics with volatile, which is de-facto supported by GCC/clang.



"latest value"

Beginners often panic over this, and think that RMW operations are somehow better because of the way they're specified. Since they're a read + write tied together, and there is a modification order for every memory location separately, RMW operations necessarily have to wait for write access to a cache line, and that means serializing all writes and RMWs on a single location.
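
That serialization is exactly what makes something like a ticket counter work; a sketch of my own (not from the question):

#include <atomic>

std::atomic<long> ticket_counter{0};

long take_ticket() {
    // Each fetch_add is one step in ticket_counter's single modification order,
    // so no two threads can ever get the same value back, even with relaxed ordering.
    return ticket_counter.fetch_add(1, std::memory_order_relaxed);
}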

Plain loads of atomic variables are still guaranteed (by real implementations) to see values promptly. (ISO C++ only suggests that values should be seen in finite time, and promptly, but of course real implementations can do much better because they run on cache-coherent CPU hardware.)

There's no such thing as "immediate" between two threads; either a load in another thread sees a value stored, or it ran before the store became visible to other threads and didn't. With thread scheduling and so on, it's always possible that a thread will load a value but then not use it for a long time; it was fresh when it was loaded.

So this is pretty much irrelevant for correctness, and all that's left is worrying about inter-thread latency. That could in some cases be helped by barriers (by reducing contention from later memory operations, not by actively flushing your stores faster; barriers just wait for that to happen the normal way). That's usually a very minor effect, and not a reason to use extra barriers.

See MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?. And see my comments on https://github.com/dotnet/runtime/issues/67330#issuecomment-1083539281 and Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees? Often no, and if it does, not by much.

Certainly not enough to be worth slowing down the reader with lots of extra barriers just to make it look at this atomic variable later than other atomic variables, if you didn't need that ordering for correctness. Or slowing down the writer to make it sit there doing nothing to maybe let it complete an RFO a few cycles sooner instead of getting other useful work done.

If your use of threading is so bottlenecked on inter-core latency that it was worth it, that's probably a sign you need to rethink your design.

Without barriers or ordering, just std::atomic with memory_order_relaxed, you'll still normally see data on other cores within maybe 40 nanoseconds (on a modern x86 desktop/laptop), about the same as if both threads were using atomic RMWs. And it's not possible for it to be delayed for any significant amount of time, like a microsecond maybe if you create lots of contention with lots of earlier stores so they all take a long time to commit before this one can. You definitely don't have to worry about going a long time with stores not being visible. (This of course only applies to atomic or hand-rolled atomics with volatile. Plain non-volatile loads may only check once at the start of a loop, and then never again. That's why they're unusable for multithreading.)

Are memory barriers necessary for atomic reference counting shared immutable data?

On x86, it will turn into a lock prefixed assembly instruction, like LOCK XADD.

Being a single instruction, it is non-interruptible. As an added "feature", the lock prefix results in a full memory barrier:

"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.

A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and Java JITs on x86/x64, because mfence is slower on many CPUs even when it's guaranteed to be available, as in 64-bit mode. (Does lock xchg have the same behavior as mfence?)

So you have a full fence on x86 as an added bonus, whether you like it or not. :-)

On PPC, it is different. An LL/SC pair - lwarx & stwcx - with a subtraction inside can be used to load the memory operand into a register, subtract one, then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC can be interrupted (meaning it will fail and retry).

It also does not mean an automatic full fence.

This does not however compromise the atomicity of the counter in any way.

It just means that in the x86 case, you happen to get a fence as well, "for free".

On PPC, one can insert a (partial or) full fence by emitting a (lw)sync instruction.

All in all, explicit memory barriers are not necessary for the atomic counter to work properly.
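
For illustration (my own sketch, not from the original answer), a refcount decrement for shared immutable data is just such an atomic RMW; on x86 it compiles to a lock-prefixed instruction, on PPC to an lwarx/stwcx. retry loop:

#include <atomic>

struct Shared {
    std::atomic<long> refcount{1};
    // ... immutable payload ...
};

void release_ref(Shared* p) {
    // acq_rel is a common conservative choice so the delete can't be reordered
    // with earlier uses of the object; the decrement itself is atomic either way.
    if (p->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete p;
}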

C++ Memory Barriers for Atomics

Both MemoryBarrier (MSVC) and _mm_mfence (supported by several compilers) provide a hardware memory fence, which prevents the processor from moving reads and writes across the fence.

The main difference is that MemoryBarrier has platform-specific implementations for x86, x64 and IA64, whereas _mm_mfence specifically uses the mfence SSE2 instruction, so it's not always available.

On x86 and x64, MemoryBarrier is implemented with an xchg and a lock or respectively, and I have seen some claims that this is faster than mfence. However, my own benchmarks show the opposite, so apparently it's very much dependent on the processor model.

Another difference is that mfence can also be used for ordering non-temporal stores/loads (movntq etc).

GCC also has __sync_synchronize which generates a hardware fence.

asm volatile ("" : : : "memory") in GCC and _ReadWriteBarrier in MSVC only provide a compiler level memory fence, preventing the compiler from reordering memory accesses. That means the processor is still free to do reordering.

Compiler fences are generally used in combination with operations that have some kind of implicit hardware fence. E.g. on x86/x64 all stores have a release fence and loads have an acquire fence, so you just need a compiler fence when implementing load-acquire and store-release.
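
A rough pre-C++11-style sketch of that idea (x86/x64 and GCC/Clang inline asm assumed; modern code should just use std::atomic with release/acquire instead):

int payload;                        // data produced before publishing
volatile int ready_flag = 0;        // hand-rolled flag; works on x86 in practice

void publish(int value) {
    payload = value;                // plain store of the data
    asm volatile("" ::: "memory");  // compiler fence: don't sink the payload store below
    ready_flag = 1;                 // x86 stores already have release semantics in hardware
}

int consume() {                     // returns -1 if not ready yet
    if (ready_flag == 0)
        return -1;                  // x86 loads already have acquire semantics in hardware
    asm volatile("" ::: "memory");  // compiler fence: don't hoist the payload load above
    return payload;
}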

c++ atomic: would function call act as memory barrier?

A compiler barrier is not the same thing as a memory barrier. A compiler barrier prevents the compiler from moving code across the barrier. A memory barrier (loosely speaking) prevents the hardware from moving reads and writes across the barrier. For atomics you need both, and you also need to ensure that values don't get torn when read or written.
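
In standard C++ terms (a small demo of my own), the two kinds of barrier map onto different fences:

#include <atomic>

void barrier_demo() {
    // Compiler-only barrier: stops the compiler reordering memory accesses
    // across it, but emits no machine instruction.
    std::atomic_signal_fence(std::memory_order_seq_cst);

    // Real memory barrier: also orders the hardware (on x86 it compiles to
    // mfence or a dummy locked operation).
    std::atomic_thread_fence(std::memory_order_seq_cst);
}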

Understanding std::atomic memory barriers

No, but it's equivalent to this:

#include <iostream>
#include <atomic>

int main()
{
    std::atomic<int> a;
    int n = load();   // load() is from the question's code, not defined here

    std::atomic_thread_fence(std::memory_order_release);
    a.store(12345, std::memory_order_relaxed);
    n = 100;
}

(although the value is different than what you did up there). There must be an atomic store after the fence. Check the conditions here under "fence-fence synchronization" or "fence-atomic synchronization". Although you're not putting any ordering constraint on the store to a itself, the release fence covers it, and the store to n as well. That's what a fence does.
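
A hedged two-thread sketch of that fence-atomic synchronization (names are mine): the release fence before a relaxed store synchronizes with an acquire load that sees the stored value, so the plain data written before the fence is visible after the load:

#include <atomic>

int payload = 0;                   // non-atomic data published via the fence
std::atomic<int> flag{0};

void writer() {
    payload = 42;                                         // plain store
    std::atomic_thread_fence(std::memory_order_release);  // orders payload before flag
    flag.store(1, std::memory_order_relaxed);
}

void reader() {
    if (flag.load(std::memory_order_acquire) == 1) {      // fence-atomic synchronization
        int x = payload;                                   // guaranteed to see 42
        (void)x;
    }
}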

Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?

Basically no significant effect on inter-core latency, and definitely never worth using "blindly" without careful profiling, if you suspect there might be any contention from later loads missing in cache.

It's a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact barriers just make this core wait for something that was already going to happen on its own, before doing later loads and/or stores. A full barrier blocks later loads and stores until the store buffer is drained.
Size of store buffers on Intel hardware? What exactly is a store buffer?

In the bad old days before std::atomic, compiler barriers were one way to stop the compiler from keeping values in registers (private to a CPU core / thread, not coherent), but that's a compilation issue not asm. CPUs with non-coherent caches are possible in theory (where std::atomic would need to do explicit flushing to make a store visible), but in practice no implementation runs std::thread across cores with non-coherent caches.


If I don't use fences, how long could it take a core to see another core's writes? is highly related, I've written basically this answer at least a few times before. (But this looks like a good place for an answer specifically about this, without getting into the weeds of which barriers do what.)


There might be some very minor secondary effects of blocking later loads that could maybe compete with RFOs (for this core to get exclusive access to a cache line to commit a store). The CPU always tries to drain the store buffer as fast as possible (by committing to L1d cache). As soon as a store commits to L1d cache, it becomes globally visible to all other cores. (Because they're coherent; they'd still have to make a share request...)

Getting the current core to write-back some store data to L3 cache (especially in shared state) could reduce the miss penalty if the load on another core happens somewhat after this store commits. But there are no good ways to do that. Creating a conflict miss in L1d and L2 maybe, if producer performance is unimportant other than creating low latency for the next read.

On x86, Intel Tremont (low power Silvermont series) will introduce cldemote (_mm_cldemote) that writes back a line as far as an outer cache, but not all the way to DRAM. (clwb could possibly help, but does force the store to go all the way to DRAM. Also, the Skylake implementation is just a placeholder and works like clflushopt.)

  • Is there any way to write for Intel CPU direct core-to-core communication code?
  • How to force cpu core to flush store buffer in c?
  • x86 MESI invalidate cache line latency issue
  • Force a migration of a cache line to another core (not possible)

Fun fact: non-seq_cst stores/loads on PowerPC can store-forward between logical cores on the same physical core, making stores visible to some other cores before they become globally visible to all other cores. This is AFAIK the only real hardware mechanism for threads to not agree on a global order of stores to all objects. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?. On other ISAs, including ARMv8 and x86, it's guaranteed that stores become visible to all other cores at the same time (via commit to L1d cache).


For loads, CPUs already prioritize demand loads over any other memory accesses (because of course execution has to wait for them.) A barrier before a load could only delay it.

That might happen to be optimal by coincidence of timing, if that makes it see the store it was waiting for instead of going "too soon" and seeing the old cached boring value. But there's generally no reason to assume or ever predict that a pause or barrier could be a good idea ahead of a load.

A barrier after a load shouldn't help either. Later loads or stores might be able to start, but out-of-order CPUs generally do stuff in oldest-first priority so later loads probably can't fill up all the outstanding load buffers before this load gets a chance to get its load request sent off-core (assuming a cache miss because another core stored recently.)

I guess I could imagine a benefit to a later barrier if this load address wasn't ready for a while (pointer-chasing situation) and the max number of off-core requests were already in-flight when the address did become known.

Any possible benefit is almost certainly not worth it; if there was that much useful work independent of this load that it could fill up all the off-core request buffers (LFBs on Intel) then it might well not be on the critical path and it's probably a good thing to have those loads in flight.

std::atomic: Does the memory barrier hold up when task loops around?

No amount of barriers can help you avoid data-race UB if you begin another write of the non-atomic variables right after the release-store.

It will always be possible (and likely) for some non-atomic writes to a, b, and c to be "happening" while your reader is reading those variables, therefore in the C++ abstract machine you have data-race UB. (In your example: unsynced write+read of a, write+write and write+read of b, and write+write of c.)

Also, even without loops, your example would still not safely avoid data-race UB, because your TaskB accesses a,b, and c unconditionally after the flag.load. So you do that stuff whether or not you observe the data_ready = 1 signal from the writer saying that the vars are ready to be read.

Of course in practice on real implementations, repeatedly writing the same data is unlikely to cause problems here, except that the value read for b will depend on how the compiler optimizes. But that's because your example also writes.

Mainstream CPUs don't have hardware race detection, so it won't actually fault or something, and if you did actually wait for flag==1 and then just read, you would see the expected values even if the writer was running more assignments of the same values. (A DeathStation 9000 could implement those assignments by storing something else in that space temporarily so the bytes in memory are actually changing, not stable copies of the values before the first release-store, but that's not something that you'd expect a real compiler to do. I wouldn't bet on it, though, and this seems like an anti-pattern).


This is why lock-free queues use multiple array elements, or why a seqlock doesn't work this way. (A seqlock can't be implemented both safely and efficiently in ISO C++ because it relies on reading maybe-torn data and then detecting tearing; if you use narrow-enough relaxed atomics for the chunks of data, you're hurting efficiency.)

The whole idea of wanting to write again, maybe before a reader has finished reading, sounds a lot like you should be looking into the idea of a SeqLock. https://en.wikipedia.org/wiki/Seqlock and see the other links in my linked answer in the last paragraph.
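
A minimal single-writer SeqLock sketch (names and sizes are mine; the payload is kept as relaxed atomics to stay UB-free, per the caveat above):

#include <atomic>
#include <cstdint>

struct SeqLocked {
    std::atomic<uint32_t> seq{0};
    std::atomic<uint64_t> val1{0}, val2{0};          // payload chunks

    void write(uint64_t x, uint64_t y) {             // single writer assumed
        uint32_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);          // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);  // seq goes odd before payload changes
        val1.store(x, std::memory_order_relaxed);
        val2.store(y, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);          // even again: payload stable
    }

    bool try_read(uint64_t& x, uint64_t& y) const {  // caller retries on false
        uint32_t s1 = seq.load(std::memory_order_acquire);
        if (s1 & 1)
            return false;                            // writer is mid-update
        x = val1.load(std::memory_order_relaxed);
        y = val2.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);  // payload reads before the re-check
        return seq.load(std::memory_order_relaxed) == s1;     // unchanged => the read was consistent
    }
};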

C - volatile and memory barriers in lockless shared memory access?

  1. is my understanding correct and complete ?

Yeah, it looks that way, except for not mentioning that C11 <stdatomic.h> made all this obsolete for almost all purposes.

There are more bad/weird things that can happen without volatile (or better, _Atomic) that you didn't list: the LWN article Who's afraid of a big bad optimizing compiler? goes into detail about things like inventing extra loads (and expecting them both to read the same value). It's aimed at Linux kernel code, where C11 _Atomic isn't how they do things.

Other than the Linux kernel, new code should pretty much always use <stdatomic.h> instead of rolling your own atomics with volatile and inline asm for RMWs and barriers. But that does continue to work because all real-world CPUs that we run threads across have coherent shared memory, so making a memory access happen in the asm is enough for inter-thread visibility, like memory_order_relaxed. See When to use volatile with multi threading? (basically never, except in the Linux kernel or maybe a handful of other codebases that already have good implementations of hand-rolled stuff).

In ISO C11, it's data-race undefined behaviour for two threads to do unsynchronized read+write on the same object, but mainstream compilers do define the behaviour, just compiling the way you'd expect so hardware guarantees or lack thereof come into play.


Other than that, yeah, looks accurate except for your final question 2: there are use-cases for memory_order_relaxed atomics, which is like volatile with no barriers, e.g. an exit_now flag.
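
For example, a sketch of that exit_now use-case (shown with C++ <atomic>; the C11 <stdatomic.h> version is analogous):

#include <atomic>

std::atomic<bool> exit_now{false};

void worker() {
    while (!exit_now.load(std::memory_order_relaxed)) {
        // ... do a chunk of work ...
    }
}

void request_exit() {
    exit_now.store(true, std::memory_order_relaxed);  // no other data is published via this flag
}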

or are there are cases where only using barriers suffice ?

No, unless you get lucky and the compiler happens to generate correct asm anyway.

Or unless other synchronization means this code only runs while no other threads are reading/writing the object. (C++20 has std::atomic_ref<T> to handle the case where some parts of the code need to do atomic accesses to data, but other parts of your program don't, and you want to let them auto-vectorize or whatever. C doesn't have any such thing yet, other than using plain variables with/without GNU C __atomic_load_n() and other builtins, which is how C++ headers implement std::atomic<T>, and which is the same underlying support that C11 _Atomic compiles to. Probably also the C11 functions like atomic_load_explicit defined in stdatomic.h, but unlike C++, _Atomic is a true keyword not defined in any header.)
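
A small sketch of the C++20 std::atomic_ref<T> pattern mentioned above (my own example): plain data most of the time, atomic access only during the phase where other threads may touch it concurrently:

#include <atomic>
#include <vector>

void concurrent_phase(std::vector<int>& v, std::size_t i) {
    std::atomic_ref<int> ref(v[i]);                 // atomic access to an ordinary int
    ref.fetch_add(1, std::memory_order_relaxed);
}

void single_threaded_phase(std::vector<int>& v) {
    for (int& x : v)                                // plain accesses; free to auto-vectorize
        x *= 2;
}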

Synchronization with C++ atomic memory fence

Your code is overcomplicated. a=0 never changes so it always reads as 0. You might as well just have atomic<int> b=0; and only a single load that just returns b.load().

Assume t1 has finished f1 and then t2 just started f2, will t2 see b incremented?

There's no way for you to detect that this is how the timing worked out, unless you put t1.join() ahead of the std::thread t2(f2); construction. That would require that everything in thread 2 is sequenced after everything in thread 1. (I think that holds even without a seq_cst fence at the end of f1, but the fence doesn't hurt: thread::join makes sure everything done inside the thread is visible to code after the join.)
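
A hedged sketch of that join-based version (the f1/f2 bodies are my guesses at the shape of the code, not the original):

#include <atomic>
#include <thread>

std::atomic<int> b{0};

void f1() { b.fetch_add(1, std::memory_order_relaxed); }
void f2() { int n = b.load(std::memory_order_relaxed); (void)n; }  // must see 1 here

int main() {
    std::thread t1(f1);
    t1.join();             // everything in f1 happens-before join() returns...
    std::thread t2(f2);    // ...and the constructor happens-before f2 starts running
    t2.join();
}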

But yes, that ordering can happen by chance, and then of course it works.

There's no guarantee that's even a meaningful condition in C++ terms.

But sure for most (all?) real implementations it's something that can happen. And a thread_fence(mo_seq_cst) will compile to a full barrier that blocks that thread until the store commits (becomes globally visible to all threads). So execution can't leave f1 until reads from other threads can see the updated value of b. (The C++ standard defines ordering and fences in terms of creating synchronizes-with relationships, not in terms of compiling to full barriers that flush the store buffer. The standard doesn't mention a store buffer or StoreLoad reordering or any of the CPU memory-order things.)

Given the synthetic condition, the threads actually are ordered wrt. each other, and it works just like if everything had been done in a single thread.


The loads in diff() aren't ordered wrt. each other because they're both mo_relaxed. But a is never modified by any thread so the only question is whether b.load() can happen before the thread even started, before the f1 store is visible. In real implementations it can't because of what "and then t2 just started f2" means. If it could load the old value, then you wouldn't be able to say "and then", so it's almost a tautology.

The thread_fence(seq_cst) before the loads doesn't really help anything. I guess it stops b.load() from reordering with the thread-startup machinery.


