Acquire/Release Semantics with Non-Temporal Stores on X64

I might be wrong about some things in this answer (proof-reading welcome from people who know this stuff!). It's based on reading the docs and Jeff Preshing's blog, not actual recent experience or testing.

Linus Torvalds strongly recommends against trying to invent your own locking, because it's so easy to get it wrong. It's more of an issue when writing portable code for the Linux kernel, rather than something that's x86-only, so I feel brave enough to try to sort things out for x86.


The normal way to use NT stores is to do a bunch of them in a row, like as part of a memset or memcpy, then an SFENCE, then a normal release store to a shared flag variable: done_flag.store(1, std::memory_order_release).
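For example, here's a minimal sketch of that pattern (the buffer arguments and done_flag are placeholders for this sketch; real code would handle alignment and a partial tail):

#include <immintrin.h>
#include <atomic>
#include <cstddef>

std::atomic<int> done_flag{0};   // placeholder flag for this sketch

// Assumes dst is 16-byte aligned and bytes is a multiple of 16.
void produce(char* dst, const char* src, std::size_t bytes) {
    for (std::size_t i = 0; i < bytes; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)(src + i));
        _mm_stream_si128((__m128i*)(dst + i), v);     // NT store
    }
    _mm_sfence();   // make the NT stores globally visible before the flag store
    done_flag.store(1, std::memory_order_release);    // normal release store
}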

Using a movnti store to the synchronization variable will hurt performance. You might want to use NT stores into the Foo it points to, but evicting the pointer itself from cache is perverse. (movnt stores evict the cache line if it was in cache to start with; see vol1 ch 10.4.6.2, Caching of Temporal vs. Non-Temporal Data.)

The whole point of NT stores is for use with Non-Temporal data, which won't be used again (by any thread) for a long time if ever. The locks that control access to shared buffers, or the flags that producers/consumers use to mark data as read, are expected to be read by other cores.

Your function names also don't really reflect what you're doing.

x86 hardware is extremely heavily optimized for doing normal (not NT) release-stores, because every normal store is a release-store. The hardware has to be good at it for x86 to run fast.

Using normal stores/loads only requires a trip to L3 cache, not to DRAM, for communication between threads on Intel CPUs. Intel's large inclusive L3 cache works as a backstop for cache-coherency traffic. Probing the L3 tags on a miss from one core will detect the fact that another core has the cache line in the Modified or Exclusive state. NT stores would require synchronization variables to go all the way out to DRAM and back for another core to see it.


Memory ordering for NT streaming stores

movnt stores can be reordered with other stores, but not with older reads.

Intel's x86 manual vol3, chapter 8.2.2 (Memory Ordering in P6 and More Recent Processor Families):

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads. (note the lack of exceptions).
  • Writes to memory are not reordered with other writes, with the following exceptions:

    • streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ,
      MOVNTDQ, MOVNTPS, and MOVNTPD); and
    • string operations (see Section 8.2.4.1). (note: From my reading of the docs, fast string and ERMSB ops still implicitly have a StoreStore barrier at the start/end. There's only potential reordering between the stores within a single rep movs or rep stos.)
  • ... stuff about clflushopt and the fence instructions

Update: there's also a note (in 8.1.2.2, Software Controlled Bus Locking) that says:

Do not implement semaphores using the WC memory type. Do not perform non-temporal stores to a cache line containing a location used to implement a semaphore.

This may just be a performance suggestion; they don't explain whether it can cause a correctness problem. Note that NT stores are not cache-coherent, though (data can sit in the line fill buffer even if conflicting data for the same line is present somewhere else in the system, or in memory). Maybe you could safely use NT stores as a release-store that synchronizes with regular loads, but would run into problems with atomic RMW ops like lock add dword [mem], 1.


Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order.

To block reordering with earlier stores, we need an SFENCE instruction, which is a StoreStore barrier even for NT stores. (And is also a barrier to some kinds of compile-time reordering, but I'm not sure if it blocks earlier loads from crossing the barrier.) Normal stores don't need any kind of barrier instruction to be release-stores, so you only need SFENCE when using NT stores.

For loads: The x86 memory model for WB (write-back, i.e. "normal") memory already prevents LoadStore reordering even for weakly-ordered stores, so we don't need an LFENCE for its LoadStore barrier effect, only a LoadStore compiler barrier before the NT store.
In gcc's implementation at least, std::atomic_signal_fence(std::memory_order_release) is a compiler-barrier even for non-atomic loads/stores, but atomic_thread_fence is only a barrier for atomic<> loads/stores (including mo_relaxed). Using an atomic_thread_fence still allows the compiler more freedom to reorder loads/stores to non-shared variables. See this Q&A for more.

#include <atomic>
#include <immintrin.h>

// The function can't be called release_store unless it actually is one
// (i.e. includes all necessary barriers).  Your original function should
// be called relaxed_store.
void NT_release_store(const Foo* f) {
    // _mm_lfence();  // make sure all reads from the locked region are already
    //                   globally visible.  Not needed: this is already guaranteed.
    std::atomic_thread_fence(std::memory_order_release);
                      // no insns emitted on x86 (since it assumes no NT stores),
                      // but still a compiler barrier for earlier atomic<> ops
    _mm_sfence();     // make sure all writes to the locked region are already
                      // globally visible, and don't reorder with the NT store
    _mm_stream_si64((long long int*)&gFoo, (int64_t)f);
}

This stores to the atomic variable (note the lack of dereferencing &gFoo). Your function stores to the Foo it points to, which is super weird; IDK what the point of that was. Also note that it compiles as valid C++11 code.

When thinking about what a release-store means, think about it as the store that releases the lock on a shared data structure. In your case, when the release-store becomes globally visible, any thread that sees it should be able to safely dereference it.


To do an acquire-load, just tell the compiler you want one.

x86 doesn't need any barrier instructions, but specifying mo_acquire instead of mo_relaxed gives you the necessary compiler-barrier. As a bonus, this function is portable: you'll get any and all necessary barriers on other architectures:

Foo* acquire_load() {
    return gFoo.load(std::memory_order_acquire);
}

You didn't say anything about storing gFoo in weakly-ordered WC (uncacheable write-combining) memory. It's probably really hard to arrange for your program's data segment to be mapped into WC memory... It would be a lot easier for gFoo to simply point to WC memory, after you mmap some WC video RAM or something. But if you want acquire-loads from WC memory, you probably do need LFENCE. IDK. Ask another question about that, because this answer mostly assumes you're using WB memory.

Note that using a pointer instead of a flag creates a data dependency. I think you should be able to use gFoo.load(std::memory_order_consume), which doesn't require barriers even on weakly-ordered CPUs (other than Alpha). Once compilers are sufficiently advanced to make sure they don't break the data dependency, they can actually make better code (instead of promoting mo_consume to mo_acquire). Read up on this before using mo_consume in production code, and especially be careful to note that testing it properly is impossible, because future compilers are expected to give weaker guarantees than current compilers do in practice.
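As a hedged sketch of what that looks like (some_field is a made-up member of Foo, just for illustration; in practice current compilers promote mo_consume to mo_acquire, so this is at least as strong as acquire_load()):

int consume_load_field() {
    Foo* f = gFoo.load(std::memory_order_consume);
    return f->some_field;   // data-dependent load: ordered after the pointer
                            // load even on weakly-ordered CPUs (except Alpha)
}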


Initially I was thinking that we did need LFENCE to get a LoadStore barrier. ("Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions". This in turn prevents them from passing (becoming globally visible before) reads that are before the LFENCE).

Note that LFENCE + SFENCE is still weaker than a full MFENCE, because it's not a StoreLoad barrier. SFENCE's own documentation says it's ordered wrt. LFENCE, but the table of the x86 memory model in Intel manual vol3 doesn't mention that. If SFENCE can't execute until after an LFENCE, then sfence / lfence might actually be a slower equivalent to mfence, but lfence / sfence / movnti would give release semantics without a full barrier. Note that the NT store could become globally visible after some following loads/stores, unlike a normal strongly-ordered x86 store.


Related: NT loads

In x86, every load has acquire semantics, except for loads from WC memory. SSE4.1 MOVNTDQA is the only non-temporal load instruction, and it isn't weakly ordered when used on normal (WriteBack) memory. So it's an acquire-load, too (when used on WB memory).
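If you want to see what that looks like in intrinsics, here's a sketch (on WB memory this behaves as a normal strongly-ordered load, and current CPUs ignore the hint there anyway, as discussed below):

#include <smmintrin.h>   // SSE4.1

// src must be 16-byte aligned.
__m128i nt_load_16(const void* src) {
    return _mm_stream_load_si128((__m128i*)src);
}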

Note that movntdq only has a store form, while movntdqa only has a load form. But apparently Intel couldn't just call them storentdqa and loadntdqa. They both have a 16B or 32B alignment requirement, so leaving off the a doesn't make a lot of sense to me. I guess SSE1 and SSE2 had already introduced some NT stores using the mov... mnemonic (like movntps), but no loads until years later in SSE4.1. (2nd-gen Core2: 45nm Penryn.)

The docs say MOVNTDQA doesn't change the ordering semantics for the memory type it's used on.

... An implementation may also make use of the non-temporal hint associated with this instruction if the memory source is WB (write back) memory type.

A processor’s implementation of the non-temporal hint does not override the effective memory type semantics, but the implementation of the hint is processor dependent. For example, a processor implementation may choose to ignore the hint and process the instruction as a normal MOVDQA for any memory type.

In practice, current Intel mainstream CPUs (Haswell, Skylake) seem to ignore the hint for PREFETCHNTA and MOVNTDQA loads from WB memory. See Do current x86 architectures support non-temporal loads (from "normal" memory)?, and also Non-temporal loads and the hardware prefetcher, do they work together? for more details.


Also, if you are using it on WC memory (e.g. copying from video RAM, like in this Intel guide):

Because the WC protocol uses a weakly-ordered memory consistency model, an MFENCE or locked instruction should be used in conjunction with MOVNTDQA instructions if multiple processors might reference the same WC memory locations or in order to synchronize reads of a processor with writes by other agents in the system.

That doesn't spell out how it should be used, though. And I'm not sure why they say MFENCE rather than LFENCE for reading. Maybe they're talking about a write-to-device-memory, read-from-device-memory situation where stores have to be ordered with respect to loads (StoreLoad barrier), not just with each other (StoreStore barrier).

I searched in Vol3 for movntdqa, and didn't get any hits (in the whole pdf). 3 hits for movntdq: All the discussion of weak ordering and memory types only talks about stores. Note that LFENCE was introduced long before SSE4.1. Presumably it's useful for something, but IDK what. For load ordering, probably only with WC memory, but I haven't read up on when that would be useful.


LFENCE appears to be more than just a LoadLoad barrier for weakly-ordered loads: it orders other instructions too. (Not the global-visibility of stores, though, just their local execution).

From Intel's insn ref manual:

Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

...

Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.

The entry for rdtsc suggests using LFENCE;RDTSC to prevent it from executing ahead of previous instructions, when RDTSCP isn't available (and the weaker ordering guarantee is ok: rdtscp doesn't stop following instructions from executing ahead of it). (CPUID is a common suggestion for serializing the instruction stream around rdtsc.)
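As a sketch, using the gcc/clang-style intrinsics:

#include <x86intrin.h>
#include <stdint.h>

static inline uint64_t rdtsc_ordered(void) {
    _mm_lfence();       // wait until all prior instructions have completed locally
    return __rdtsc();   // then read the time-stamp counter
}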

Acquire/Release semantics

The complete quote is:

If we let both threads run and find that r1 == 1, that serves as
confirmation that the value of A assigned in Thread 1 was passed
successfully to Thread 2. As such, we are guaranteed that r2 == 42.

The acquire-release semantics only guarantee that

  • A = 42 doesn't happen after Ready = 1 in thread 1
  • r2 = A doesn't happen before r1 = Ready in thread 2

So the value of r1 has to be checked in thread 2 to be sure that A has been written by thread 1. The scenario in the question can indeed happen, but r1 will be 0 in that case.
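Spelled out as C++ (a sketch of the scenario from the quote):

#include <atomic>
#include <cassert>

int A = 0;                        // plain non-atomic data
std::atomic<int> Ready{0};

void thread1() {
    A = 42;
    Ready.store(1, std::memory_order_release);  // A = 42 can't reorder after this
}

void thread2() {
    int r1 = Ready.load(std::memory_order_acquire);
    if (r1 == 1) {        // must check: r1 may still be 0
        int r2 = A;       // guaranteed 42: the acquire load
        assert(r2 == 42); // synchronized with the release store
    }
}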

How are barriers/fences and acquire/release semantics implemented microarchitecturally?

Much of this has been covered in other Q&As (especially the later C++ How is release-and-acquire achieved on x86 only using MOV?), but I'll give a summary here. Still, good question, it's useful to collect this all in one place.


On x86, every asm load is an acquire-load. To implement that efficiently, modern x86 HW speculatively loads earlier than allowed and then checks that speculation. (Potentially resulting in a memory-order mis-speculation pipeline nuke.) To track this, Intel calls the combination of load and store buffers the "Memory Order Buffer".

Weakly-ordered ISAs don't have to speculate, they can just load in any order.


x86 store ordering is maintained by only letting stores commit from the store buffer to L1d in program order.

On Intel CPUs at least, a store-buffer entry is allocated for a store when it issues (from the front-end into the ROB + RS). All uops need to have a ROB entry allocated for them, but some uops also need to have other resources allocated, like load or store buffer entries, RAT entries for registers they read/write, and so on.

So I think the store buffer itself is ordered. When a store-address or store-data uop executes, it merely writes an address or data into its already-allocated store-buffer entry. Since commit (freeing SB entries) and allocate are both in program order, I assume it's physically a circular buffer with a head and tail, like the ROB. (And unlike the RS).


Avoiding LoadStore reordering is basically free: a load can't retire until it's executed (taken data from the cache). A store can't commit until after it retires. In-order retirement automatically means that all previous loads are done before a store is "graduated" and ready for commit.

A weakly-ordered uarch that can in practice do load-store reordering might scoreboard loads as well as track them in the ROB: let them retire once they're known to be non-faulting, even if the data hasn't arrived.

This seems more likely on an in-order core, but IDK. So you could have a load that's retired but the register destination will still stall if anything tries to read it before the data actually arrives. We know that in-order cores do in practice work this way, not requiring loads to complete before later instructions can execute. (That's why software-pipelining using lots of registers is so valuable on such cores, e.g. to implement a memcpy. Reading a load result right away on an in-order core destroys memory parallelism.)

How is load->store reordering possible with in-order commit? goes into this more deeply, for in-order vs. out-of-order.



Barrier instructions

The only barrier instruction that does anything for regular stores is mfence which in practice stalls memory ops (or the whole pipeline) until the store buffer is drained. Are loads and stores the only instructions that gets reordered? covers the Skylake-with-updated-microcode behaviour of acting like lfence as well.

lfence mostly exists for the microarchitectural effect of blocking later instructions from even issuing until all previous instructions have left the out-of-order back-end (retired). The use-cases for lfence for memory ordering are nearly non-existent.

Related:

  • C++ How is release-and-acquire achieved on x86 only using MOV?
  • How is the transitivity/cumulativity property of memory barriers implemented micro-architecturally?
  • How many memory barriers instructions does an x86 CPU have?
  • How can I experience "LFENCE or SFENCE can not pass earlier read/write"
  • Does lock xchg have the same behavior as mfence?
  • Does the Intel Memory Model make SFENCE and LFENCE redundant?
  • Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths goes into a lot of detail about how LFENCE stops execution of later instructions, and what that means for performance.
  • When should I use _mm_sfence, _mm_lfence, and _mm_mfence? High-level languages have weaker memory models than x86, so you sometimes only need a barrier that compiles to no asm instructions. Using _mm_sfence() when you haven't used any NT stores just makes your code slower, for no benefit, compared to atomic_thread_fence(mo_release); see the sketch after this list.
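
The sketch promised above (payload and flag are placeholders for illustration):

#include <atomic>

std::atomic<int> flag{0};
int payload;

void publish(int v) {
    payload = v;
    // Compiles to zero barrier instructions on x86 (assuming no NT stores are
    // in flight), but still stops the compiler from reordering the stores.
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);
}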

What happens with a non-temporal store if the data is already in cache?

All of the behaviors you describe are sensible implementations of a non-temporal store. In practice, on modern x86 CPUs, the actual semantics are that there's no effect on the L1 cache but the L2 (and higher-level caches, if any) will not evict a cache line to store the non-temporal fetch results.

There is no data race because the caches are hardware-coherent. This coherence is not affected in any way by the decision to evict a cache line.

Non-temporal stores of portions of a packed double vector using SSE/AVX

There is a good reason to avoid using partial-register stores with the non-temporal hint. If you try to scatter many small pieces of data to completely unrelated memory locations, the CPU's write-combining buffers overflow, and you just get a usual write through the caches (probably with additional performance cost).

The correct way to use write combining (the non-temporal hint) is to fill up an entire cache line. So it is usual to combine data pieces into a complete register, then write it all at once with MOVNTDQ, as in the sketch below.
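For example (dst assumed 16-byte aligned):

#include <immintrin.h>

// Combine both doubles into one register, then do a single full-width NT
// store, instead of two partial NT stores that each touch the WC buffers.
void nt_store_pair(double* dst, double lo, double hi) {
    __m128d v = _mm_set_pd(hi, lo);   // hi goes in the high element
    _mm_stream_pd(dst, v);
}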

Acquire/release semantics with 4 threads

You are thinking in terms of sequential consistency, the strongest (and default) memory order. If this memory order is used, all accesses to atomic variables constitute a total order, and the assertion indeed cannot be triggered.

However, in this program, a weaker memory order is used (release stores and acquire loads). This means, by definition, that you cannot assume a total order of operations. In particular, you cannot assume that changes become visible to other threads in the same order. (Only a total order on each individual variable is guaranteed, for any atomic memory order including memory_order_relaxed.)

The stores to x and y occur on different threads, with no synchronization between them. The loads of x and y occur on different threads, with no synchronization between them. This means it is entirely allowed that thread c sees x && ! y and thread d sees y && ! x. (I'm just abbreviating the acquire-loads here, don't take this syntax to mean sequentially consistent loads.)
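Spelled out as a sketch (the classic 4-thread IRIW shape; the spin loops just wait for a store to become visible):

#include <atomic>

std::atomic<bool> x{false}, y{false};
bool c_saw_y, d_saw_x;   // results, for illustration

void writer_a() { x.store(true, std::memory_order_release); }
void writer_b() { y.store(true, std::memory_order_release); }

void reader_c() {
    while (!x.load(std::memory_order_acquire)) {}   // wait until x is visible
    c_saw_y = y.load(std::memory_order_acquire);
}

void reader_d() {
    while (!y.load(std::memory_order_acquire)) {}   // wait until y is visible
    d_saw_x = x.load(std::memory_order_acquire);
}

// With acquire/release, c_saw_y == false && d_saw_x == false is allowed.
// With memory_order_seq_cst on every access, that outcome is forbidden.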

Bottom line: Once you use a weaker memory order than sequentially consistent, you can kiss your notion of a global state of all atomics, that is consistent between all threads, goodbye. Which is exactly why so many people recommend sticking with sequential consistency unless you need the performance (BTW, remember to measure if it's even faster!) and are certain of what you are doing. Also, get a second opinion.

Now, whether you will get burned by this, is a different question. The standard simply allows a scenario where the assertion fails, based on the abstract machine that is used to describe the standard requirements. However, your compiler and/or CPU may not exploit this allowance for one reason or another. So it is possible that for a given compiler and CPU, you may never see that the assertion is triggered, in practice. Keep in mind that a compiler or CPU may always use a stricter memory order than the one you asked for, because this can never introduce violations of the minimum requirements from the standard. It may only cost you some performance – but that is not covered by the standard anyway.

UPDATE in response to comment: The standard defines no hard upper limit on how long it takes for one thread to see changes to an atomic by another thread. There is a recommendation to implementers that values should become visible eventually.

There are sequencing guarantees, but the ones pertinent to your example do not prevent the assertion from firing. The basic acquire-release guarantee is that if:

  • Thread e performs a release-store to an atomic variable x
  • Thread f performs an acquire-load from the same atomic variable
  • Then if the value read by f is the one that was stored by e, the store in e synchronizes-with the load in f. This means that any (atomic and non-atomic) store in e that was, in this thread, sequenced before the given store to x, is visible to any operation in f that is, in this thread, sequenced after the given load. [Note that there are no guarantees given regarding threads other than these two!]

So, there is no guarantee that f will read the value stored by e, as opposed to e.g. some older value of x. If it doesn't read the updated value, then also the load does not synchronize with the store, and there are no sequencing guarantees for any of the dependent operations mentioned above.

I liken atomics with a lesser memory order than sequentially consistent to the Theory of Relativity, where there is no global notion of simultaneity.

PS: That said, an atomic load cannot just read an arbitrary older value. For example, if one thread performs periodic increments (e.g. with release order) of an atomic<unsigned> variable, initialized to 0, and another thread periodically loads from this variable (e.g. with acquire order), then, except for eventual wrapping, the values seen by the latter thread must be monotonically increasing. But this follows from the given sequencing rules: Once the latter thread reads a 5, anything that happened before the increment from 4 to 5 is in the relative past of anything that follows the read of 5. In fact, a decrease other than wrapping is not even allowed for memory_order_relaxed, but this memory order does not make any promises to the relative sequencing (if any) of accesses to other variables.
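A sketch of that counter scenario (ignoring wrap-around; the loop bounds are arbitrary):

#include <atomic>
#include <cassert>

std::atomic<unsigned> counter{0};

void incrementer() {   // runs in one thread
    for (int i = 0; i < 1000; ++i)
        counter.fetch_add(1, std::memory_order_release);
}

void watcher() {       // runs in another thread
    unsigned prev = 0;
    for (int i = 0; i < 1000; ++i) {
        unsigned now = counter.load(std::memory_order_acquire);
        assert(now >= prev);   // the modification order of a single atomic
        prev = now;            // variable is seen consistently: no going back
    }
}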

Does the C++11 standard formally define acquire, release, and consume operations?

There's an informal summarized definition given in one of the notes:

performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A.

Besides that, the behavior of acquire and release operations is fully defined in 1.10, specifically in how they contribute to happens-before relationships. Any definition apart from behavior is useless.

Can atomic_thread_fence(acquire) prevent previous loads being reordered after itself?

After reading your question more carefully, it looks like your modernescpp link is making the same mistake that Preshing debunked in https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/ : fences are 2-way barriers, otherwise they'd be useless.

A relaxed load followed by an acquire fence is at least as strong as an acquire load. Anything in this thread after the acquire fence happens after the load, thus it can synchronize-with a release store (or a release fence + relaxed store) in another thread.
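As a sketch (ready_flag and data are placeholders; assume a writer thread stores data, then does a release store to ready_flag):

#include <atomic>

std::atomic<int> ready_flag{0};
int data;

int reader() {
    while (ready_flag.load(std::memory_order_relaxed) == 0) {}  // relaxed load
    std::atomic_thread_fence(std::memory_order_acquire);  // blocks LoadLoad and
                                                          // LoadStore reordering
    return data;   // safe: the fence makes the relaxed load synchronize-with
                   // the writer's release store
}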

But will it prevent store/loads before this fence being reordered after itself?

Stores: no, it's only an acquire fence.

Loads, yes. In terms of a memory model where there is coherent shared cache/memory, and we're limiting local reordering of access to that, an acquire fence blocks LoadLoad and LoadStore reordering. https://preshing.com/20130922/acquire-and-release-fences/

(This is not the way ISO C++'s formalism defines things. It works in terms of happens-before rules that order things relative to a load that saw a value from a store. In those terms, a relaxed load followed by an acquire fence can create a happens-before relationship with a release-sequence, so later code in this thread sees everything that happened before the store in the other thread.)


