Does Hardware Memory Barrier Make Visibility of Atomic Operations Faster in Addition to Providing Necessary Guarantees

Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?

Basically no: there's no significant effect on inter-core latency, and a barrier is definitely never worth using "blindly", without careful profiling, even if you suspect there might be contention from later loads missing in cache.

It's a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact, barriers just make this core wait for something that was already going to happen on its own before doing later loads and/or stores; a full barrier blocks later loads and stores until the store buffer is drained.
See also: Size of store buffers on Intel hardware? What exactly is a store buffer?

In the bad old days before std::atomic, compiler barriers were one way to stop the compiler from keeping values in registers (private to a CPU core / thread, not coherent), but that's a compilation issue, not asm. CPUs with non-coherent caches are possible in theory (where std::atomic would need to do explicit flushing to make a store visible), but in practice no implementation runs std::thread across cores with non-coherent caches.


If I don't use fences, how long could it take a core to see another core's writes? is highly related; I've written basically this answer at least a few times before. (But this looks like a good place for an answer specifically about this, without getting into the weeds of which barriers do what.)


There might be some very minor secondary effects of blocking later loads that could maybe compete with RFOs (for this core to get exclusive access to a cache line to commit a store). The CPU always tries to drain the store buffer as fast as possible (by committing to L1d cache). As soon as a store commits to L1d cache, it becomes globally visible to all other cores. (Because they're coherent; they'd still have to make a share request...)

Getting the current core to write back some store data to L3 cache (especially in shared state) could reduce the miss penalty if the load on another core happens somewhat after this store commits. But there are no good ways to do that, other than maybe deliberately creating a conflict miss in L1d and L2, if producer performance is unimportant compared to low latency for the next read.

On x86, Intel Tremont (the low-power Silvermont series) introduces cldemote (_mm_cldemote), which writes back a line as far as an outer cache, but not all the way to DRAM. (clwb could possibly help, but it forces the store to go all the way to DRAM. Also, the Skylake implementation is just a placeholder and works like clflushopt.)
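As an illustration, here's a minimal sketch of how a producer might use that intrinsic; the variable name and surrounding code are hypothetical, and on CPUs without cldemote the hint should simply have no effect.

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_cldemote; typically needs -mcldemote to compile

std::atomic<int> shared_flag{0};   // hypothetical shared variable

void producer_publish(int v) {
    shared_flag.store(v, std::memory_order_release);
    _mm_cldemote(&shared_flag);    // hint: demote this line toward a shared outer cache level
}
```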

  • Is there any way to write for Intel CPU direct core-to-core communication code?
  • How to force cpu core to flush store buffer in c?
  • x86 MESI invalidate cache line latency issue
  • Force a migration of a cache line to another core (not possible)

Fun fact: non-seq_cst stores/loads on PowerPC can store-forward between logical cores on the same physical core, making stores visible to some other cores before they become globally visible to all other cores. This is AFAIK the only real hardware mechanism for threads to not agree on a global order of stores to all objects. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads? On other ISAs, including ARMv8 and x86, it's guaranteed that stores become visible to all other cores at the same time (via commit to L1d cache).


For loads, CPUs already prioritize demand loads over any other memory accesses (because of course execution has to wait for them.) A barrier before a load could only delay it.

That might happen to be optimal by coincidence of timing, if that makes it see the store it was waiting for instead of going "too soon" and seeing the old cached boring value. But there's generally no reason to assume or ever predict that a pause or barrier could be a good idea ahead of a load.

A barrier after a load shouldn't help either. Later loads or stores might be able to start, but out-of-order CPUs generally do stuff in oldest-first priority so later loads probably can't fill up all the outstanding load buffers before this load gets a chance to get its load request sent off-core (assuming a cache miss because another core stored recently.)

I guess I could imagine a benefit to a later barrier if this load address wasn't ready for a while (pointer-chasing situation) and the max number of off-core requests were already in-flight when the address did become known.

Any possible benefit is almost certainly not worth it; if there was that much useful work independent of this load that it could fill up all the off-core request buffers (LFBs on Intel) then it might well not be on the critical path and it's probably a good thing to have those loads in flight.

Does std::atomic guarantee that a store() operation is propagated (almost) immediately to other threads with load()?

The C++ standard says only this [C++20 intro.progress p18]:

An implementation should ensure that the last value (in modification order) assigned by an atomic or
synchronization operation will become visible to all other threads in a finite period of time.

Technically this is only a "should", and "finite time" is not very specific. But the C++ standard has to cover a wide range of implementations, so you can't expect it to specify a particular number of cycles or nanoseconds or what have you.

In practice, you can expect that a call to any atomic store function, even with memory_order_relaxed, will cause an actual machine store instruction to be executed. The value will not just be left in a register. After that, it's out of the compiler's hands and up to the CPU.

(Technically, if you had two or more stores in succession to the same object, with a bounded amount of other work done in between, the compiler would be allowed to optimize out all but the last one, on the basis that you couldn't have been sure anyway that any given load would happen at the right instant to see one of the other values. In practice I don't believe that any compilers currently do so.)
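As a hypothetical illustration of that allowance (not something current compilers actually do):

```cpp
#include <atomic>

std::atomic<int> x{0};

void two_stores() {
    x.store(1, std::memory_order_relaxed);
    // ... a bounded amount of other work ...
    x.store(2, std::memory_order_relaxed);   // a compiler could legally drop the store of 1,
                                             // since no load is guaranteed to observe it
}
```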

A reasonable expectation for typical CPU architectures is that the store will become globally visible "without unnecessary delay". The store may go into the core's local store buffer. The core will process store buffer entries as quickly as it can; it does not just let them sit there to age like a fine wine. But they still could take a while. For instance, if the cache line is currently held exclusive by another core, you will have to wait until it is released before your store can be committed.

Using stronger memory ordering will not speed up the process; the machine is already making its best efforts to commit the store. Indeed, a stronger memory ordering may actually slow it down; if the store was made with release ordering, then it must wait for all earlier stores in the buffer to commit before it can itself be committed. On strongly-ordered architectures like x86, every store is automatically release, so the store buffer always remains in strict order; but on a weakly ordered machine, using relaxed ordering may allow your store to "jump the queue" and reach L1 cache sooner than would otherwise have been possible.

What C++11 atomic operations/memory orders guarantee freshness?

There is nothing that will guarantee that: everything is about ordering. Even memory_order_seq_cst just guarantees that things happen in a single total order. In theory, the compiler/library/CPU could schedule every load from cancel_store for the very end of the program.

There is a general statement in 29.3p13 that

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

But there is no specification on what constitutes a "reasonable amount of time".

So: memory_order_relaxed should be just fine, but memory_order_seq_cst may work better on some platforms, as the cache line may be reloaded sooner.

Is a memory barrier required to read a value that is atomically modified?

No, you don't need barriers, but your code is broken anyway if readers and writers call these functions in different threads. Especially if a reader calls the read function in a loop.

TL:DR: use C++11 std::atomic<long> m_value with return m_value++ in the increment and return m_value in the reader. That will give you sequential consistency in a data-race-free program: execution will work as if threads ran with some interleaving of source order. (Unless you violate the rules and have other non-atomic shared data.) You definitely want to return a value from Increment, if you want threads doing increments to ever know what value they produced. Doing a separate load afterwards is totally broken for use cases like int sequence_num = shared_counter++; where another thread's increment could become visible between shared_counter++; and tmp = shared_counter;.

If you don't need such strong ordering with respect to operations on other objects in the same thread as the reader/writer, return m_value.load(std::memory_order_acquire) is enough for most uses, with m_value.fetch_add(1, std::memory_order_acq_rel) for the increment. Very few programs actually need StoreLoad barriers anywhere; atomic RMWs can't actually reorder very much even with acq_rel. (On x86, those will both compile the same as if you'd used seq_cst.)
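A minimal sketch of that suggestion (the class and member names are just for illustration):

```cpp
#include <atomic>

class Counter {
    std::atomic<long> m_value{0};
public:
    long Increment() {              // default seq_cst: simplest to reason about
        return m_value++;           // atomic RMW; each caller gets a unique old value back
    }
    long Read() const {
        return m_value;             // same as m_value.load(std::memory_order_seq_cst)
    }

    // Weaker-ordering variants mentioned above; on x86 these compile to the same asm.
    long IncrementAcqRel() { return m_value.fetch_add(1, std::memory_order_acq_rel); }
    long ReadAcquire() const { return m_value.load(std::memory_order_acquire); }
};
```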

You can't force ordering between threads; the load either sees the value or it doesn't, depending on whether the reading thread saw the invalidate from the writer before or after it took / tried to take a value for the load. The whole point of threads is that they don't run in lock-step with each other.



Data-race UB:

Since m_value isn't atomic (or even volatile as a hack), the compiler can hoist the load out of a loop reading it. This is data-race UB; compilers will break your reader. See Multithreading program stuck in optimized mode but runs normally in -O0
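For example, a hand-rolled reader like this sketch is exactly what a compiler is entitled to break:

```cpp
long m_value;                       // plain non-atomic shared variable: data-race UB

void wait_for_change(long old) {
    // The compiler may hoist the load out of the loop, effectively turning this into
    //   if (m_value == old) for (;;) {}
    // so the reader spins forever on a value cached in a register.
    while (m_value == old) { }
}
```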

Barriers aren't the problem/solution here; what's needed is just forced re-checking of memory (or rather of the cache-coherent view of memory that the current CPU sees; actual CPU caches like L1d and L2 are not a problem for this). That's not what barriers really do; they order this thread's accesses to coherent cache. C++ threads only run across cores with coherent caches.

But seriously, don't roll your own atomics without a very compelling reason. When to use volatile with multi threading? Pretty much never. That answer explains cache coherency and why you don't need barriers to avoid seeing stale values.

In many real-world C++ implementations, something like std::atomic_thread_fence() will also be a "compiler barrier" that forces the compiler to reload even non-atomic vars from memory, even without volatile, but that's an implementation detail. So it may happen to work well enough, on some compilers for some ISAs. And it still isn't fully safe against the compiler inventing multiple loads; see the LWN article Who's afraid of a big bad optimizing compiler? for examples with details. It's primarily aimed at how the Linux kernel rolls its own atomics with volatile, which is de-facto supported by GCC/clang.



"latest value"

Beginners often panic over this, and think that RMW operations are somehow better because of the way they're specified. Since they're a read + write tied together, and there is a modification order for every memory location separately, RMW operations necessarily have to wait for write access to a cache line, and that means serializing all writes and RMWs on a single location.

Plain loads of atomic variables are still guaranteed (by real implementations) to see values promptly. (ISO C++ only suggests that values should be seen in finite time, and promptly, but of course real implementations can do much better because they run on cache-coherent CPU hardware.)

There's no such thing as "immediate" between two threads; either a load in another thread sees a value stored, or it ran before the store became visible to other threads and didn't. With thread scheduling and so on, it's always possible that a thread will load a value but then not use it for a long time; it was fresh when it was loaded.

So this is pretty much irrelevant for correctness, and all that's left is worrying about inter-thread latency. That could in some cases be helped by barriers (by reducing contention from later memory operations, not by actively flushing your stores faster; barriers just wait for that to happen the normal way). So it's usually a very minor effect, and not a reason to use extra barriers.

See MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?. And see my comments on https://github.com/dotnet/runtime/issues/67330#issuecomment-1083539281 and Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees? Often no, and if it does, not by much.

Certainly not enough to be worth slowing down the reader with lots of extra barriers just to make it look at this atomic variable later than other atomic variables, if you didn't need that ordering for correctness. Or slowing down the writer to make it sit there doing nothing to maybe let it complete an RFO a few cycles sooner instead of getting other useful work done.

If your use of threading is so bottlenecked on inter-core latency that it was worth it, that's probably a sign you need to rethink your design.

Without barriers or ordering, just std::atomic with memory_order_relaxed, you'll still normally see data on other cores within maybe 40 nanoseconds (on a modern x86 desktop/laptop), about the same as if both threads were using atomic RMWs. And it's not possible for it to be delayed for any significant amount of time, like maybe a microsecond if you create lots of contention from lots of earlier stores so they all take a long time to commit before this one can. You definitely don't have to worry about going a long time with stores not being visible. (This of course only applies to std::atomic or hand-rolled atomics with volatile. Plain non-volatile loads may only check once at the start of a loop, and then never again; that's why they're unusable for multithreading.)
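If you want a ballpark number for your own machine, a rough sketch like the following ping-pong measures the average round trip using only relaxed operations. Everything here is hypothetical benchmark scaffolding, and you'd want to pin the two threads to different cores for the numbers to mean anything.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<int> ball{0};

int main() {
    constexpr int iters = 1'000'000;
    std::thread t([] {
        for (int i = 1; i < 2 * iters; i += 2) {
            while (ball.load(std::memory_order_relaxed) != i) {}   // wait for the odd value
            ball.store(i + 1, std::memory_order_relaxed);          // reply with the next even value
        }
    });
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 2 * iters; i += 2) {
        ball.store(i + 1, std::memory_order_relaxed);              // serve an odd value
        while (ball.load(std::memory_order_relaxed) != i + 2) {}   // wait for the reply
    }
    double ns = std::chrono::duration<double, std::nano>(
                    std::chrono::steady_clock::now() - start).count();
    std::printf("average round trip: ~%.0f ns\n", ns / iters);
    t.join();
}
```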

Does it make sense to use a relaxed load followed by a conditional fence, if I don't always need acquire semantics?

If most checks of done find it not-done, and this happens in a throughput-sensitive part of your program, then yes, this could make sense, even on ISAs where a separate barrier costs more. Perhaps a use-case like an exit-now flag that also signals some data or a pointer a thread will want. You check often, but the great majority of the time you don't exit and don't need later operations to wait for this load to complete.
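A minimal sketch of the pattern under discussion, reusing the done / value names from the question (the function names are just for illustration):

```cpp
#include <atomic>

std::atomic<bool> done{false};
int value;

void produce(int v) {
    value = v;
    done.store(true, std::memory_order_release);            // writer publishes with release
}

int poll() {                                                 // reader, called frequently
    if (done.load(std::memory_order_relaxed)) {
        std::atomic_thread_fence(std::memory_order_acquire); // ordering paid only on this path
        return value;
    }
    return -1;                                               // common fast path: no barrier executed
}
```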


This is a win on some ISAs (where a load(acquire) is already a load + barrier), but on others it's usually worse, especially if the case we care about most (the "fast path") is the one that loads value. (That's the situation on ISAs where a fence(acquire) is more expensive than a load(acquire), especially 32-bit ARM with the ARMv8 instructions available: lda is just an acquire load, but a fence is still a dmb ish full barrier.)

If the !done case is common and there's other work to do, then it's maybe worth considering the tradeoff, since std::memory_order_consume is not currently usable for its intended purpose. (See below re: memory dependency ordering solving this specific case without any barrier.)

For other common ISAs, no, it wouldn't make sense because it would make the "success" case slower, maybe much slower if it ended up with a full barrier. If that's the normal fast-path through the function, that would obviously be terrible.


On x86 there's no difference: fence(acquire) is a no-op, and load(acquire) uses the same asm as load(relaxed). That's why we say x86's hardware memory model is "strongly ordered". Most other mainstream ISAs aren't like this.

For some ISAs this is a pure win: those that implement done.load(acquire) as a plain load followed by the same barrier instruction that fence(acquire) would use (like RISC-V, or 32-bit ARM without ARMv8 instructions). They have to branch anyway, so it's just a matter of where we place the barrier relative to the branch. (Unless they choose to unconditionally load value and branchlessly select, like MIPS movn, which is allowed because they already load another member of that Worker object so it's known to be a valid pointer to a full object.)


AArch64 can do acquire loads quite cheaply, but an acquire barrier would be more expensive. (And it would happen on what would normally be the fast path; speeding up the "failure" path is normally not important.)

Instead of a barrier, a second load (this time with acquire) could possibly be better. If the flag can only change from 0 to 1, you don't even need to re-check its value; accesses to the same atomic object are ordered within the same thread.
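A sketch of that alternative, reusing the same done and value as above:

```cpp
int poll_second_load() {
    if (done.load(std::memory_order_relaxed)) {    // cheap check on the common path
        // Accesses to the same atomic object are ordered within one thread, so this
        // acquire load must see the store the relaxed load saw (or a later one);
        // for a flag that only goes 0 -> 1 its return value doesn't even need checking.
        (void)done.load(std::memory_order_acquire);
        return value;
    }
    return -1;
}
```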

(I had a Godbolt link with some examples for many ISAs, but a browser restart ate it.)



Memory dependency order could solve this problem with no barriers

Unfortunately std::memory_order_consume is temporarily deprecated, otherwise you could have the best of both worlds for this case, by creating an &value pointer with a data-dependency on done.load(consume). So the load of value (if done at all) would be dependency-ordered after the load from done, but other independent later loads wouldn't have to wait.

e.g. if ( (tmp = done.load(consume)) ) and return (&value)[tmp-1]. This is easy in asm, but without fully working consume support, compilers would optimize out the use of tmp in the side of the branch that can only be reached with tmp = true.

So the only ISA that actually needs to make this barrier tradeoff in asm is Alpha, but due to C++ limitations we can't easily take advantage of the hardware support that other ISAs offer.

If you're willing to use something that will work in practice despite not having guarantees, use std::atomic<int *> done = nullptr; and do a release-store of &value instead of =true. Then in the reader, do a relaxed load, and if(tmp) { return *tmp; } else { return -1; }. If the compiler can't prove that the only non-null pointer value is &value, it will need to keep the data dependency on the pointer load. (To stop it from proving that, perhaps include a set member function that stores an arbitrary pointer in done, which you never call.)
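A minimal sketch of that pointer-based workaround (the type and member names are illustrative, not from any particular codebase):

```cpp
#include <atomic>

struct Worker {
    int value = 0;
    std::atomic<int*> done{nullptr};            // pointer instead of a bool flag

    void publish(int v) {
        value = v;
        done.store(&value, std::memory_order_release);   // release-store &value instead of true
    }

    int read() {
        int* tmp = done.load(std::memory_order_relaxed);
        if (tmp) return *tmp;                   // data dependency on the loaded pointer
        return -1;
    }

    // Never called; its existence stops the compiler from proving that the only possible
    // non-null value of done is &value (which would let it drop the data dependency).
    void set(int* p) { done.store(p, std::memory_order_relaxed); }
};
```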

See C++11: the difference between memory_order_relaxed and memory_order_consume for details, and a link to Paul E. McKenney's CppCon 2016 talk where he explains what consume was supposed to be for, and how Linux RCU does use the kind of thing I suggested, with effectively relaxed loads and depending on the compiler to make asm with data dependencies. (Which requires being careful not to write things where it can optimize away the data dependency.)

Why set the stop flag using `memory_order_seq_cst`, if you check it with `memory_order_relaxed`?

mo_relaxed is fine for both load and store of a stop flag

There's also no meaningful latency benefit to stronger memory orders, even if latency of seeing a change to a keep_running or exit_now flag was important.

IDK why Herb thinks stop.store shouldn't be relaxed; in his talk, his slides have a comment that says // not relaxed on the assignment, but he doesn't say anything about the store side before moving on to "is it worth it".

Of course, the load runs inside the worker loop, but the store runs only once, and Herb really likes to recommend sticking with SC unless you have a performance reason that truly justifies using something else. I hope that wasn't his only reason; I find that unhelpful when trying to understand what memory order would actually be necessary and why. But anyway, I think it's either that or a mistake on his part.
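For reference, the whole pattern with relaxed on both sides looks something like this sketch:

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> stop{false};

void worker() {
    while (!stop.load(std::memory_order_relaxed)) {   // relaxed is fine for a pure stop flag
        // ... do a chunk of work ...
    }
}

int main() {
    std::thread t(worker);
    // ...
    stop.store(true, std::memory_order_relaxed);      // relaxed store is also fine here
    t.join();   // join provides all the synchronization needed for anything the worker wrote
}
```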


The ISO C++ standard doesn't say anything about how soon stores become visible or what might influence that, just two "should" recommendations. Section 6.9.2.3 Forward progress:

18. An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

And 33.5.4 Order and consistency [atomics.order] covering only atomics, not mutexes etc.:

11. Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

Another thread can loop arbitrarily many times before its load actually sees this store value, even if they're both seq_cst, assuming there's no other synchronization of any kind between them. Low inter-thread latency is a performance issue, not correctness / formal guarantee.

And non-infinite inter-thread latency is apparently only a "should" QOI (quality of implementation) issue. :P Nothing in the standard suggests that seq_cst would help on an implementation where store visibility could be delayed indefinitely, although one might guess that could be the case, e.g. on a hypothetical implementation with explicit cache flushes instead of cache coherency. (Although such an implementation is probably not practically usable in terms of performance with CPUs anything like what we have now; every release and/or acquire operation would have to flush the whole cache.)

On real hardware (which uses some form of MESI cache coherency), different memory orders for store or load don't make stores visible sooner in real time, they just control whether later operations can become globally visible while still waiting for the store to commit from the store buffer to L1d cache. (After invalidating any other copies of the line.)

Stronger orders, and barriers, don't make things happen sooner in an absolute sense, they just delay other things until they're allowed to happen relative to the store or load. (This is the case on all real-world CPUs AFAIK; they always try to make stores visible to other cores ASAP anyway, so the store buffer doesn't fill up and stall execution.)

See also (my similar answers on):

  • Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?
  • If I don't use fences, how long could it take a core to see another core's writes?
  • memory_order_relaxed and visibility
  • Thread synchronization: How to guarantee visibility of writes (it's a non-issue on current real hardware)

The second Q&A is about x86 where commit from the store buffer to L1d cache is in program order. That limits how far past a cache-miss store execution can get, and also any possible benefit of putting a release or seq_cst fence after the store to prevent later stores (and loads) from maybe competing for resources. (x86 microarchitectures will do RFO (read for ownership) before stores reach the head of the store buffer, and plain loads normally compete for resources to track RFOs we're waiting for a response to.) But these effects are extremely minor in terms of something like exiting another thread; only very small scale reordering.



because who cares if the thread stops with a slightly bigger delay.

More like, who cares if the thread gets more work done by not making loads/stores after the load wait for the check to complete. (Of course, this work will get discarded if it's in the shadow of a mis-speculated branch on the load result when we eventually load true.) The cost of rolling back to a consistent state after a branch mispredict is more or less independent of how much already-executed work had happened beyond the mispredicted branch. And it's a stop flag, so the total amount of wasted work costing cache/memory bandwidth for other CPUs is pretty minimal.

That phrasing makes it sound like an acquire load or release store would actually get the store seen sooner in absolute real time, rather than just relative to other code in this thread. (Which is not the case.)

The benefit is more instruction-level and memory-level parallelism across loop iterations when the load produces a false. And simply avoiding running extra instructions on ISAs where an acquire or especially an SC load needs extra instructions, especially expensive 2-way barrier instructions (like PowerPC isync/sync or especially ARMv7 dmb ish full barrier even for acquire), not like ARMv8 ldapr or x86 mov acquire-load instructions. (Godbolt)


BTW, Herb is right that the dirty flag can also be relaxed, only because of the thread.join sync between the reader and any possible writer. Otherwise yeah, release / acquire.

But in this case, dirty only needs to be atomic<> at all because of possible simultaneous writers all storing the same value, which ISO C++ still deems data-race UB, e.g. because of the theoretical possibility of hardware race detection that traps on conflicting non-atomic accesses. (Or software implementations like clang -fsanitize=thread.)


Fun fact: C++20 introduced std::stop_token for use as a stop or keep_running flag.
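A minimal sketch of what that looks like with std::jthread, which passes a std::stop_token to the thread function and requests stop in its destructor:

```cpp
#include <stop_token>
#include <thread>

void example() {
    std::jthread worker([](std::stop_token st) {
        while (!st.stop_requested()) {
            // ... do a chunk of work ...
        }
    });
    // worker.request_stop() can be called explicitly; the jthread destructor
    // also requests stop and then joins automatically.
}
```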

Guarantees on memory ordering and proper programming practice

Mastery of these kinds of operations is essential for building SMP operating systems and for communicating with certain kinds of hardware. The Linux kernel documentation provides an excellent overview of the subject along with the specific solutions used by the kernel. I highly recommend taking a look at their memory-barriers.txt file.

Does an atomic read guarantee reading of the latest value?

When we talk about memory access on modern architectures, we usually ignore the "exact location" the value is read from.

A read operation can fetch data from the cache (L1/L2/...), from RAM, or even from the hard drive (e.g. when the memory is swapped out).

These keywords tell the compiler which assembly operations to use when accessing the data.

volatile

A keyword that tells the compiler to always read the variable's value from memory, and never from a register.

This "memory" can still be the cache, but, in case that this "address" in the cache is considered "dirty", meaning that the value has changed by a different processor, the value will be reloaded.

This ensures we never read a stale value.

Clarification: According to the standard, if the volatile type is not a primitive whose read/write operations are atomic by nature (in terms of the assembly instructions that read/write it), one might read an intermediate value (the writer managed to write only half of the bytes by the time the reader read it). However, modern implementations do not behave this way.

atomic

When the compiler sees a load (read) operation, it basically does the exact same thing it would have done for a volatile value.

So, what is the difference???

The difference is cross-CPU write operations.
When working with a volatile variable, if CPU 1 sets the value, and CPU 2 reads it, the reader might read an old value.

But, how can that be? The volatile keyword promises that we won't read a stale value!

Well, that's because the writer didn't publish the value! And though the reader tries to read it, it reads the old one.

When the compiler stumbles upon a store (write) operation for an atomic variable it:

  • Sets the value atomically in memory
  • Announces that the value has changed

After the announcement, all the CPUs will know that they should re-read the value of the variable, because their copies of the cache line will have been invalidated.

This mechanism is very similar to operations performed on files. When your application writes to a file on the hard-drive, other applications may or may not see the new information, depending on whether or not your application flushed the data to the hard-drive.

If the data wasn't flushed, it merely resides somewhere in your application's caches and is visible only to the application itself. Once you flush it, anyone who opens the file will see the new state.

Clarification: Common modern compiler & cache implementations ensure correct publishing of volatile writes as well. However, this is NOT a reason to prefer volatile over std::atomic. For example, as some comments pointed out, Linux's atomic reads and writes for x86_64 are implemented using volatile.
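For completeness, the "publish" step this answer describes corresponds to the usual release/acquire pattern with std::atomic; a minimal sketch (variable names are illustrative):

```cpp
#include <atomic>

int payload;                          // plain data, written before it's published
std::atomic<bool> published{false};

void writer() {
    payload = 42;
    published.store(true, std::memory_order_release);   // "publish": an acquire load that
}                                                        // sees true also sees payload == 42

int reader() {
    if (published.load(std::memory_order_acquire))
        return payload;
    return 0;                         // not published yet
}
```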

Is memory barrier or atomic operation required in a busy-wait loop?

  1. Is volatile enough here or are there any architectures or compilers where memory or compiler barrier or atomic operation is required in the while loop?

Will the volatile code see the change? Yes, but not necessarily as quickly as if there were a memory barrier. At some point, some form of synchronization will occur, and the new state will be read from the variable, but there are no guarantees on how much has happened elsewhere in the code.

1.1 According to C++ standards?

From cppreference: memory_order

The memory model and memory orders define the generalized hardware that the code needs to work on. For a message to pass between threads of execution, an inter-thread happens-before relationship needs to occur. This requires one of the following:

  • A synchronizes-with B
  • A has a std::atomic operation before B
  • A indirectly synchronizes with B (through some X)
  • A is sequenced before X, which inter-thread happens before B
  • A inter-thread happens before X, and X inter-thread happens before B

As you are not establishing any of those relationships, there will be builds of your program that may fail on some current hardware.
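For example, a minimal sketch of establishing that relationship for a busy-wait loop (names are illustrative):

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> ready{false};
int shared_data;

void producer() {
    shared_data = 123;
    ready.store(true, std::memory_order_release);     // synchronizes-with the acquire load below
}

void busy_wait_consumer() {
    while (!ready.load(std::memory_order_acquire)) {  // establishes inter-thread happens-before
        std::this_thread::yield();                    // optional: be a bit nicer while spinning
    }
    // shared_data == 123 is guaranteed to be visible here
}
```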

In practice, the end of a time-slice will cause the memory to become coherent, or any form of barrier on the non-spinning thread will ensure that the caches are flushed.

I'm not sure what causes the volatile read to eventually see the "current value".


