How to Force Cache Coherency on a Multicore x86 CPU

Can I force cache coherency on a multicore x86 CPU?

volatile only forces your code to re-read the value; it cannot control where the value is read from. If the value was recently read by your code then it will probably be in cache, in which case volatile will force it to be re-read from cache, NOT from memory.

There are not a lot of cache coherency instructions in x86. There are prefetch instructions like prefetchnta, but those don't affect the memory-ordering semantics. prefetchnta used to be implemented by bringing the value into L1 cache without polluting L2, but things are more complicated on modern Intel designs with a large shared inclusive L3 cache.

x86 CPUs use a variation on the MESI protocol (MESIF for Intel, MOESI for AMD) to keep their caches coherent with each other (including the private L1 caches of different cores). A core that wants to write a cache line has to force other cores to invalidate their copy of it before it can change its own copy from Shared to Modified state.


You don't need any fence instructions (like MFENCE) to produce data in one thread and consume it in another on x86, because x86 loads/stores have acquire/release semantics built-in. You do need MFENCE (full barrier) to get sequential consistency. (A previous version of this answer suggested that clflush was needed, which is incorrect).

You do need to prevent compile-time reordering, because C++'s memory model is weakly-ordered. volatile is an old, bad way to do this; C++11 std::atomic is a much better way to write lock-free code.
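
For example, here's a minimal sketch of that producer/consumer pattern with C++11 std::atomic (the payload/ready names are just for illustration):

    #include <atomic>
    #include <thread>

    int payload = 0;                    // plain data produced by one thread
    std::atomic<bool> ready{false};     // flag that publishes the payload

    void producer() {
        payload = 42;                                   // plain store
        ready.store(true, std::memory_order_release);   // release: payload visible before the flag
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {}  // acquire: pairs with the release store
        // On x86 neither of these atomics needs MFENCE; the ordering only has to be
        // enforced at compile time. A seq_cst store would need MFENCE (or use XCHG).
        int v = payload;    // guaranteed to read 42
        (void)v;
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }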

Memory barriers force cache coherency?

As I understand it, synchronization primitives won't affect cache coherency at all. "Cache" is French for hidden; it's not supposed to be visible to the user. A cache coherency protocol should work without the programmer's involvement.

Synchronization primitives will affect the memory ordering, which is well defined and visible to the user through the processor's ISA.

A good source with detailed information is A Primer on Memory Consistency and Cache Coherence from the Synthesis Lectures on Computer Architecture collection.

EDIT: To clarify your doubt

The Wikipedia statement is slightly wrong. I think the confusion might come from the terms memory consistency and cache coherency. They don't mean the same thing.

The volatile keyword in C means that the variable is always read from memory (as opposed to a register) and that the compiler won't reorder loads/stores around it. It doesn't mean the hardware won't reorder the loads/stores. This is a memory consistency problem. When using weaker consistency models the programmer is required to use synchronization primitives to enforce a specific ordering. This is not the same as cache coherency. For example, if thread 1 modifies location A, then after this event thread 2 loads location A, it will receive an updated (consistent) value. This should happen automatically if cache coherency is used. Memory ordering is a different problem. You can check out the famous paper Shared Memory Consistency Models: A Tutorial for more information. One of the better known examples is Dekker's Algorithm which requires sequential consistency or synchronization primitives.
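
To make that concrete, here is a rough sketch of the store-then-load pattern at the heart of Dekker's algorithm (hypothetical flag0/flag1 names, not the full algorithm):

    #include <atomic>

    std::atomic<int> flag0{0}, flag1{0};
    int r0, r1;

    void thread0() {
        flag0.store(1, std::memory_order_seq_cst);   // "I want to enter"
        r0 = flag1.load(std::memory_order_seq_cst);  // check the other thread's flag
    }

    void thread1() {
        flag1.store(1, std::memory_order_seq_cst);
        r1 = flag0.load(std::memory_order_seq_cst);
    }

    // With seq_cst, the outcome r0 == 0 && r1 == 0 is impossible.
    // With release/acquire (or plain x86 MOVs), each load may execute before the
    // other thread's store has drained from its store buffer, so both can read 0 --
    // exactly the StoreLoad reordering that breaks Dekker's algorithm.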

EDIT2: I would like to clarify one thing. While my cache coherency example is correct, there is a situation where memory consistency might seem to overlap with it. This is when stores are executed in the processor but delayed on their way to the cache (they sit in a store queue/buffer). Since the processor's own cache hasn't received the updated value, the other caches won't either. This may seem like a cache coherency problem, but in reality it is not; it is part of the memory consistency model of the ISA. In this case synchronization primitives can be used to flush the store queue to the cache. With this in mind, the Wikipedia text that you highlighted in bold is correct, but this other part is still slightly wrong: "The keyword volatile does not guarantee a memory barrier to enforce cache-consistency." It should say: "The keyword volatile does not guarantee a memory barrier to enforce memory consistency."

What cache coherence solution do modern x86 CPUs use?

MESI is defined in terms of snooping a shared bus, but no, modern CPUs don't actually work that way. MESI states for each cache line can be tracked / updated with messages and a snoop filter (basically a directory) to avoid broadcasting those messages, which is what Intel (MESIF) and AMD (MOESI) actually do.

e.g. the shared inclusive L3 cache in Intel CPUs (before Skylake server) lets L3 tags act as a snoop filter; as well as tracking the MESI state, they also record which core # (if any) has a private copy of a line. (See also: Which cache mapping technique is used in intel core i7 processor?)

For example, consider a Sandybridge-family CPU with a ring bus (modern client chips, server chips up to Broadwell): core #0 reads a line that is currently in Modified state on core #1.

  • The read misses in L1d and L2 cache on core #0, so core #0 sends a request on the ring bus to the L3 slice that contains that line (indexed via a hash function of some physical address bits).

  • That slice of L3 gets the message and checks its tags. If it found the line in Shared state at this point, the response could go back over the bidirectional ring bus with the data.

  • Otherwise, the L3 tags tell it that core #1 has exclusive ownership of the line: it was handed out in Exclusive state and may have been promoted to Modified (= dirty).

  • L3 cache logic in that slice of L3 will generate a message to ask core #1 to write back that line.

  • The message arrives at the ring bus stop for core #1, and gets its L2 or L1d to write back that line.

    IDK if one ring bus message can be read directly by Core #0 as well as the relevant slice of L3 cache, or if the message might have to go all the way to the L3 slice and then to core #0 from there. (Worst case distance = basically all the way around the ring, instead of half, for a bidirectional ring.)

This is super hand-wavy; do not take my word for it on the exact details, but the general concept of sending messages like share-request, RFO, or write-back is the right mental model. BeeOnRope has an answer with a similar breakdown into steps that covers uops and the store buffer, as well as MESI / RFO.


In a similar case, core #1 could have silently dropped the line without having modified it, if it had only gotten Exclusive ownership but never written it. (Loads that miss in cache default to loading into Exclusive state so a separate store won't have to do an RFO for the same line.) In that case I assume the core that turns out not to have the line has to send a message back to indicate that. Or maybe it sends a message directly to one of the memory controllers that are also on the ring bus, instead of a round trip back to the L3 slice to force it to do that.

Obviously stuff like this can be happening in parallel for every core. (And each core can have multiple outstanding requests it's waiting for: memory level parallelism within a single core. On Intel, L2 superqueue has 16 entries on some microarchitectures, while there are 10 or 12 L1 LFBs.)

Quad-socket and higher systems have snoop filters between sockets; dual-socket Intel systems with E5-xxxx CPUs of Broadwell and earlier did just spam snoops to each other over the QPI links (unless you used a quad-socket-capable CPU (E7-xxxx) in a dual-socket system). Multi-socket is hard because missing in local L3 doesn't necessarily mean it's time to hit DRAM; another socket might have the line in Modified state.

Also related:

  • https://www.realworldtech.com/sandy-bridge/ Kanter's SnB write-up covers some of Intel's ring bus design, IIRC, although it's mostly about the internals of each core. The shared inclusive L3 was new in Nehalem (when Intel started using the "core i7" brand name): https://www.realworldtech.com/nehalem/
  • Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? - more hops on the ring bus for Intel CPUs with more cores hurts L3 and DRAM latency and therefore bandwidth = max-concurrency / latency.
  • What is the benefit of the MOESI cache coherency protocol over MESI? some more links.

x86 LOCK question on multi-core CPUs

It's about locking the memory bus for that address. The Intel 64 and IA-32 Architectures Software Developer's Manual - Volume 3A: System Programming Guide, Part 1 tells us:

7.1.4 Effects of a LOCK Operation on Internal Processor Caches

For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor.

For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow [its] cache coherency mechanism to insure that the operation is carried out atomically. This operation is called "cache locking." The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area. (emphasis added)

Here we learn that the P6 and newer chips are smart enough to determine if they really have to block off the bus or can just rely on intelligent caching. I think this is a neat optimization.
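
For illustration, a C++ std::atomic read-modify-write is the kind of operation that takes this cache-locking path on current CPUs (the counter name is hypothetical; the exact instruction chosen is up to the compiler):

    #include <atomic>

    std::atomic<int> counter{0};

    int add_one() {
        // Mainstream compilers typically emit a single `lock xadd` here. With the
        // operand in write-back memory and contained in one cache line, the CPU
        // holds the line in Modified state for the duration of the read-modify-write
        // ("cache locking") instead of asserting LOCK# on the bus.
        return counter.fetch_add(1, std::memory_order_seq_cst);
    }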

I discussed this more in my blog post "How Do Locks Lock?"

cache coherency of a shared boolean value c++11

No, you can't say that with certainty because no standard provides that guarantee. When you "synthesize" a guarantee by combining information from lots of places and ultimately rely on your inability to think of any way it can fail, you do not have certainty.

There are lots of examples of people who thought they had certainty in this way and then things failed in ways they could not think of. That said, I can't think of any way this could fail either, but I wouldn't rely on it.

Note that you should not expect memory operations to provide any ordering, just that a change in one thread will eventually be visible in another. In particular, you cannot assume that a thread that sees a change to a particular boolean will see any memory operations that appear prior to that one in the code. Compilers and CPUs are free to, and do in practice, reorder memory operations.

So even if it was guaranteed, you couldn't use it for much. Even using it for a boolean to shut a thread down like while (!shutdown) do_work(); in one thread and shutdown = true; in another is risky. If the compiler can prove that do_work() cannot modify shutdown, it can optimize out the check of shutdown and the loop may not ever terminate.
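
A sketch of the safer version of that shutdown loop, using std::atomic<bool> (do_work() is the hypothetical work function from the example above):

    #include <atomic>

    std::atomic<bool> shutdown{false};   // atomic: the compiler must actually re-check it each iteration

    void do_work();                      // hypothetical work function

    void worker() {
        // The change made by stop() is guaranteed to become visible here eventually.
        // relaxed is enough just to end the loop; use release/acquire if the worker
        // must also see data written before the flag was set.
        while (!shutdown.load(std::memory_order_relaxed))
            do_work();
    }

    void stop() {
        shutdown.store(true, std::memory_order_relaxed);
    }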

Does processor stall during cache coherence operation

All modern ISAs use (a variant of) MESI for cache coherency. This maintains a coherent shared view of memory (through cache) for all processors at all times.

See for example Can I force cache coherency on a multicore x86 CPU? It's a common misconception that stores go into cache while other cores still have old copies of the cache line, and then "cache coherence" has to happen.

But that's not the case: to modify a cache line, a CPU needs to have exclusive ownership of the line (Modified or Exclusive state of MESI). This is only possible after receiving responses to a Read For Ownership that invalidates all other copies of the cache line, if it was in Shared or Invalid state before. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads? for example.


However, memory models allow local reordering of stores and loads. Sequential consistency would be too slow, so CPUs always allow at least StoreLoad reordering. See also Is mov + mfence safe on NUMA? for lots of details about the TSO (total store order) memory model used on x86. Many other ISAs use an even weaker model.

For an unsynchronized reader in this case, there are three possibilities if the two threads are running on separate cores:

  • load(a) happens on core#2 before the cache line is invalidated, so it reads the old value and thus effectively happens before the a=1 store in the global order. The load can hit in L1d cache.
  • load(a) happens after core#1 has committed the store to its L1d cache but hasn't written it back yet. Core#2's read request triggers core#1 to write back to a shared level of cache (e.g. L3), and puts the line into Shared state. The load will definitely miss in L1d.
  • load(a) happens after write-back to memory or at least L3 has already happened, so it doesn't have to wait for core#1 to write-back. The load will miss in L1d unless hardware prefetch has brought it back in for some reason. But usually that only happens as part of sequential accesses (e.g. to an array).

So yes, the load will stall if the other core has already committed it to cache before this core tries to load it.
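
As a sketch of that scenario (a hypothetical variable a, written with relaxed atomics so the example is data-race-free while still leaving the timing to the hardware):

    #include <atomic>

    std::atomic<int> a{0};

    void writer() {                 // runs on core #1
        // Commits to core #1's L1d once the core owns the line (Exclusive/Modified state).
        a.store(1, std::memory_order_relaxed);
    }

    int reader() {                  // runs on core #2
        // May return 0 (load happens before core #1's RFO invalidates this core's copy)
        // or 1 (load happens after the line has been written back / shared again).
        return a.load(std::memory_order_relaxed);
    }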

See also Size of store buffers on Intel hardware? What exactly is a store buffer? for more about the effect of the store buffer on everything, including memory reordering.

It doesn't matter here because you have a write-only producer and a read-only consumer. The producer core doesn't wait for its store to become globally visible before continuing, and it can see its own store right away, before it becomes globally visible. It does matter when you have each thread looking at stores done by the other thread; then you need barriers, or sequentially-consistent atomic operations (which compilers implement with barriers). See https://preshing.com/20120515/memory-reordering-caught-in-the-act

See also Can num++ be atomic for 'int num'? for how an atomic RMW works with MESI; that's instructive for understanding the concept. (e.g. an atomic RMW can work by having a core hang on to a cache line in Modified state, and delay responding to RFO or share requests until the write part of the RMW has committed.)

Eliding cache snooping for thread-local memory

Most modern processors use a directory coherence protocol to maintain coherence between all the cores in the same NUMA node and another directory coherence protocol to maintain coherence between all NUMA nodes and IO hubs that are in the same coherence domain, where each NUMA node could be an active socket, part of an active socket, or a node controller. A brief introduction to coherence in real processors can be found at: Cache coherency(MESI protocol) between different levels of cache namely L1, L2 and L3.

Directory coherence protocols significantly reduce the need for broadcasting snoops because they provide additional coherence state per cache line to basically track who may possibly have a copy of the line. Unnecessary snoops can still occur in the following cases:

  • A line gets silently evicted from a core or NUMA node without notifying the directory controller.
  • The directory state may be protected with an error detection code. If the state is deemed corrupted, a broadcast is required.
  • Depending on the microarchitecture, the in-memory directory may not have the capability of tracking cache lines per NUMA node but rather at the granularity of "any other NUMA node."

The cost of unnecessary snooping is not just extra energy consumption, but also latency, because a request cannot be considered complete (non-speculatively) until all the coherence transactions have completed. This can significantly increase the time to complete a request, which in turn limits bandwidth because each outstanding request consumes certain hardware resources.

You don't have to worry about unnecessary snoops to cache lines storing thread-local variables as long as they are truly being used as thread-local and the thread that owns these variables rarely migrates between physical cores.
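
For illustration, this is the kind of layout that keeps per-thread data truly thread-local (hypothetical names; 64 bytes is assumed to be the cache-line size):

    #include <cstddef>

    // `thread_local` gives each thread its own instance, and alignas(64) keeps each
    // instance on its own cache line, so as long as the owning thread stays on one
    // core, no other core ever needs to request (snoop) that line.
    struct alignas(64) PerThreadStats {
        std::size_t items_processed = 0;
    };

    thread_local PerThreadStats stats;

    void record_item() {
        ++stats.items_processed;   // touches only this thread's cache line
    }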

Are snoop requests sent to all the cores in a multi node setup?

But if a cache line is in I (INVALID) state to begin with while none of the other cores have it in their L1/L2, once the cache line is requested from the home agent, will the load request also be broadcast to the other local cores?

This is an implementation detail and is not part of the QPI specification. On all Intel processors starting with Nehalem, whether the L3 cache is inclusive or non-inclusive, each caching agent on the on-die interconnect has an inclusive directory for tracking the cache lines that it owns (i.e., whose physical address is mapped to it). So a snoop is never broadcasted to all local cores unless the directory indicates that all of them need to be snooped. On a miss in the L3 cache, the request is sent to the home agent of the target cache line.

will the load request be broadcasted to cores on a different node also?

This is also an implementation detail. It depends on the coherence mode. If the processor supports an in-memory coherence directory and that directory is enabled, then there is no need to broadcast for every request. Some processors support opportunistic snoop broadcast (OSB). If OSB is enabled, the home agent may speculatively broadcast a snoop if bandwidth is available. This is done in parallel with the directory lookup operation. If the directory lookup result indicates that there is no need to snoop other NUMA nodes, the home agent sends the requested data back without waiting for the snoop responses, thereby reducing latency.


