When should I use _mm_sfence, _mm_lfence, and _mm_mfence

When should I use _mm_sfence, _mm_lfence, and _mm_mfence?

Caveat: I'm no expert in this. I'm still trying to learn this myself. But since no one has replied in the past two days, it seems experts on memory fence instructions are not plentiful. So here's my understanding ...

x86 can act as a weakly-ordered memory system when weakly-ordered stores (e.g. NT stores) are involved. That means your program may execute

array[idx+1] = something
idx++

but the change to idx may become globally visible (e.g. to threads/processes running on other processors) before the change to array. Placing sfence between the two statements ensures that the writes become visible in that order.

Meanwhile, another processor that runs

newestthing = array[idx]

may have the memory for array cached and see a stale copy, yet get the updated idx due to a cache miss.
The solution is to use lfence just beforehand to ensure the loads are synchronized.


Why does _mm_mfence() produce counts for the ALL_LOADS perf event?

_mm_mfence() compiles to just the mfence instruction, which is not a load or a store, architecturally speaking.

One or more of the uops that it decodes to may microarchitecturally run on a load port and get counted as a load, though.

What CPU are you using? If Skylake, I assume you have updated microcode, so mfence costs more than Agner Fog's tables list (and it blocks out-of-order exec of non-memory uops, like lfence does; see Are loads and stores the only instructions that gets reordered? Apparently some Intel CPUs before Skylake didn't do that for mfence).

How can I observe that LFENCE or SFENCE cannot pass an earlier read/write?

Related: Does the Intel Memory Model make SFENCE and LFENCE redundant?

sfence has no real effect unless you're using NT stores (see footnote 1). If you NT-store data and then store a pointer to that data (or a "ready" flag), a reader can see the old value for the data even if they see the new pointer / flag value. sfence can be used to ensure that the two stores become observable in program order.
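
For example, here's a minimal sketch of that pattern (the names buf and ready are mine, not from the question):

#include <immintrin.h>
#include <atomic>

alignas(16) int buf[1024];
std::atomic<bool> ready{false};

void producer() {
    // NT stores are weakly ordered: without a fence they could become
    // visible after the 'ready' store below.
    for (int i = 0; i < 1024; i += 4)
        _mm_stream_si128((__m128i*)&buf[i], _mm_set1_epi32(i));

    _mm_sfence();  // order the NT stores before the following normal store
    ready.store(true, std::memory_order_release);  // readers that see 'ready'
                                                   // will also see buf's contents
}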

lfence is useless for memory ordering unless you're doing NT loads from a WC memory region (like video RAM). You'll have a very hard time creating a case where commenting it out creates a detectable difference in memory ordering.

The main use for lfence is to serialize execution, not memory. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths


Since you asked about C, not just asm, there's a related answer about when you should use _mm_sfence() and the other intrinsics: When should I use _mm_sfence _mm_lfence and _mm_mfence. (Usually you really only need asm("" ::: "memory"); unless NT stores are in flight, because blocking compile-time reordering gives you acq / rel ordering without any runtime barrier instructions.)
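
A minimal GNU C sketch of that approach (plain stores plus a compile-time barrier; the names and the use of volatile are my own, and std::atomic with memory_order_release is the portable way to write this):

int payload;
volatile int flag = 0;

void publish(int x) {
    payload = x;
    asm("" ::: "memory");  // compiler barrier only: emits no instruction, but
                           // stops the compiler from reordering the two stores
    flag = 1;              // x86 hardware already keeps these stores in order
}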


Footnote 1: That's true for normal WB (WriteBack) memory cacheability settings. In user-space under a normal OS, that's what you always have unless you did something very special.

For other memory types (MTRR or PAT settings): NT stores on uncacheable memory have no special effect, and are still strongly ordered. NT stores on WC, WB, or WT memory (or normal stores to WC memory) are weakly ordered and make it useful to use sfence before storing a buffer_ready flag for another thread.

SSE4.1 movntdqa loads from WB memory are not weakly ordered. Unlike NT stores, the load doesn't override the memory type's ordering semantics. On current CPUs, nothing special happens at all on WB memory; they're just a less-efficient movdqa load. Only use them on WC memory.

Does the Intel Memory Model make SFENCE and LFENCE redundant?

Right, LFENCE and SFENCE are not useful in normal code because x86's acquire / release semantics for regular stores make them redundant unless you're using other special instructions or memory types.

The only fence that matters for normal lockless code is the full barrier (including StoreLoad) from a locked instruction, or a slow MFENCE. Prefer xchg over mov+mfence for sequential-consistency stores because it's faster (see Are loads and stores the only instructions that gets reordered?).

Does `xchg` encompass `mfence` assuming no non-temporal instructions? (yes, even with NT instructions, as long as there's no WC memory.)
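
To illustrate, here's a sketch of how that looks from C++ (the comments about what current gcc/clang emit are my observation, not a guarantee):

#include <atomic>

std::atomic<int> flag{0};

void seq_cst_store(int v) {
    flag.store(v, std::memory_order_seq_cst);   // typically compiled to xchg,
                                                // not mov + mfence
}

void release_store(int v) {
    flag.store(v, std::memory_order_release);   // just a plain mov: x86 stores
                                                // already have release semantics
}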


Jeff Preshing's Memory Reordering Caught in the Act article is an easier-to-read description of the same case Bartosz's post talks about, where you need a StoreLoad barrier like MFENCE. Only MFENCE will do; you can't construct MFENCE out of SFENCE + LFENCE. (Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?)

If you had questions after reading the link you posted, read Jeff Preshing's other blog posts. They gave me a good understanding of the subject. :) Although I think I found the tidbit about SFENCE/LFENCE normally being a no-op in Doug Lea's page. Jeff's posts didn't consider NT loads/stores.


Related: When should I use _mm_sfence _mm_lfence and _mm_mfence (my answer and @BeeOnRope's answer there are good). I wrote this answer much earlier, so parts of it show my inexperience at the time. That answer covers the C++ intrinsics and C++ compile-time memory ordering, which is not at all the same thing as x86 asm runtime memory ordering. But you still don't want _mm_lfence().


SFENCE is only relevant when using movnt (Non-Temporal) streaming stores, or when working with memory regions whose type is set to something other than the normal Write-Back. Or with clflushopt, which is kind of like a weakly-ordered store. NT stores bypass the cache as well as being weakly ordered. x86's normal memory model is strongly ordered, other than NT stores, WC (write-combining) memory, and ERMSB string ops (see below).

LFENCE is only useful for memory ordering with weakly-ordered loads, which are very rare. (Or possibly for LoadStore ordering of regular loads before NT stores?)

NT loads (movntdqa) from WB memory are still strongly ordered, even on a hypothetical future CPU that doesn't ignore the NT hint; the only way to do weakly-ordered loads on x86 is when reading from weakly-ordered memory (WC), and then I think only with movntdqa. This doesn't happen by accident in "normal" programs, so you only have to worry about this if you mmap video RAM or something.

(The primary use-case for lfence is not memory ordering at all, it's for serializing instruction execution, e.g. for Spectre mitigation, or with RDTSC. See Is LFENCE serializing on AMD processors? and the "linked questions" sidebar for that question.)
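
As a sketch of that serializing use (not a memory-ordering use), assuming GCC/Clang headers:

#include <x86intrin.h>   // __rdtsc
#include <emmintrin.h>   // _mm_lfence

unsigned long long read_tsc_ordered() {
    _mm_lfence();                      // wait for earlier instructions to finish executing
    unsigned long long t = __rdtsc();  // then read the timestamp counter
    _mm_lfence();                      // keep later instructions from starting early
    return t;
}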


Memory ordering in C++, and how it maps to x86 asm

I got curious about this a couple weeks ago, and posted a fairly detailed answer to a recent question:
Atomic operations, std::atomic<> and ordering of writes. I included lots of links to stuff about the memory model of C++ vs. hardware memory models.

If you're writing in C++, using std::atomic<> is an excellent way to tell the compiler what ordering requirements you have, so it doesn't reorder your memory operations at compile time. You can and should use weaker release or acquire semantics where appropriate, instead of the default sequential consistency, so the compiler doesn't have to emit any barrier instructions at all on x86. It just has to keep the ops in source order.
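
For instance, a minimal sketch (names mine) of a producer/consumer handoff where release/acquire is enough and x86 needs no barrier instructions:

#include <atomic>

int data;
std::atomic<bool> data_ready{false};

void writer() {
    data = 42;
    data_ready.store(true, std::memory_order_release);  // compiles to a plain mov on x86
}

int reader() {
    while (!data_ready.load(std::memory_order_acquire)) {}  // plain mov loads in a loop
    return data;  // guaranteed to read 42 once 'data_ready' is seen
}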


On a weakly ordered architecture like ARM or PPC, or x86 with movnt, you need a StoreStore barrier instruction between writing a buffer and setting a flag to indicate the data is ready. Also, the reader needs a LoadLoad barrier instruction between checking the flag and reading the buffer.

Not counting movnt, x86 already has LoadLoad barriers between every load, and StoreStore barriers between every store. (LoadStore ordering is also guaranteed). MFENCE is all 4 kinds of barriers, including StoreLoad, which is the only barrier x86 doesn't do by default. MFENCE makes sure loads don't use old prefetched values from before other threads saw your stores and potentially did stores of their own. (As well as being a barrier for NT store ordering and load ordering.)
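
A sketch of the StoreLoad case (the classic store-then-load-the-other-variable test; names mine):

#include <atomic>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1() {
    X.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // emits mfence (or a locked op)
    r1 = Y.load(std::memory_order_relaxed);
}

void thread2() {
    Y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = X.load(std::memory_order_relaxed);
}

// Without the full fences, r1 == 0 && r2 == 0 is possible: each load could
// take its value before the other thread's store became visible.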

Fun fact: x86 lock-prefixed instructions are also full memory barriers. They can be used as a substitute for MFENCE in old 32bit code that might run on CPUs not supporting it. lock add [esp], 0 is otherwise a no-op, and does the read/modify/write cycle on memory that's very likely hot in L1 cache and already in the M state of the MESI coherency protocol.
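
A GNU inline-asm sketch of that idiom (a hand-rolled full barrier; adding 0 leaves the stack word unchanged):

static inline void full_barrier(void) {
#if defined(__x86_64__)
    __asm__ __volatile__("lock; addl $0, (%%rsp)" ::: "memory", "cc");
#else
    __asm__ __volatile__("lock; addl $0, (%%esp)" ::: "memory", "cc");
#endif
}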

SFENCE is a StoreStore barrier. It's useful after NT stores to create release semantics for a following store.

LFENCE is a LoadLoad and also a LoadStore barrier. It's almost always irrelevant as a memory barrier because the only weakly-ordered loads on x86 are NT loads (movntdqa) from WC memory. (A loadNT / LFENCE / storeNT sequence prevents the store from becoming globally visible before the load. I think this could happen in practice if the load address was the result of a long dependency chain, or the result of another load that missed in cache.)
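
A sketch of that sequence, assuming src points into a WC mapping (e.g. mmapped video RAM); names are mine:

#include <immintrin.h>

void copy_one_vector(const void* src, __m128i* dst) {
    __m128i v = _mm_stream_load_si128((__m128i*)src);  // SSE4.1 NT load (weakly ordered only on WC)
    _mm_lfence();                                       // LoadLoad + LoadStore barrier
    _mm_stream_si128(dst, v);                           // the NT store can't pass the NT load
}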


ERMSB string operations

Fun fact #2 (thanks @EOF): The stores from ERMSB (Enhanced rep movsb/rep stosb on IvyBridge and later) are weakly-ordered (but not cache-bypassing). ERMSB builds on regular Fast-String Ops (wide stores from the microcoded implementation of rep stos/movsb that's been around since PPro).

Intel documents the fact that ERMSB stores "may appear to execute out of order" in section 7.3.9.3 of their Software Developer's Manual, vol. 1. They also say:

"Order-dependent code should write to a discrete semaphore variable after any string operations to allow correctly ordered data to be seen by all processors."

They don't mention any barrier instructions being necessary between the rep movsb and the store to a data_ready flag.

The way I read it, there's an implicit SFENCE after rep stosb / rep movsb (at least a fence for the string data, probably not other in-flight weakly ordered NT stores). Anyway, the wording implies that a write to the flag / semaphore becomes globally visible after all the string-move writes, so no SFENCE / LFENCE is needed in code that fills a buffer with a fast-string op and then writes a flag, or in code that reads it.

(LoadLoad ordering always happens, so you always see data in the order that other CPUs made it globally visible. i.e. using weakly-ordered stores to write a buffer doesn't change the fact that loads in other threads are still strongly ordered.)

Summary: use a normal store to write a flag indicating that a buffer is ready. Don't have readers just check the last byte of the block written with memset/memcpy.

I also think ERMSB stores prevent any later stores from passing them, so you still only need SFENCE if you're using movNT. i.e. the rep stosb as a whole has release semantics wrt. earlier instructions.
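
A sketch of that recommended pattern (names mine; memset here may be implemented with rep stosb internally):

#include <atomic>
#include <cstring>

char buf[4096];
std::atomic<bool> buf_ready{false};

void fill_and_publish() {
    std::memset(buf, 0xAB, sizeof(buf));                // may be a rep stosb under the hood
    buf_ready.store(true, std::memory_order_release);   // separate flag, normal store
}

// Readers should spin on buf_ready (acquire), not poke at the last byte of buf.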

There's an MSR bit that can be cleared to disable ERMSB, for the benefit of new servers that need to run old binaries that write a "data ready" flag as part of a rep stosb or rep movsb or something. (In that case I guess you get the old fast-string microcode, which may use an efficient cache protocol but does make all the stores appear to other cores in order.)


