Intel CPU Cache Policy

The write policy is not something you can query via CPUID or similar, nor can you configure your CPU to use one policy or the other, so there is no tool for querying it. What you can query is the cache associativity, the cache line size, and the cache size, for example via /proc/cpuinfo.
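
For example, here is a minimal sketch of querying exactly those parameters, assuming Linux with glibc (the _SC_LEVEL*_ sysconf names are a glibc extension and may return 0 when the information isn't exported):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("L1d size:      %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L1d assoc:     %ld-way\n",   sysconf(_SC_LEVEL1_DCACHE_ASSOC));
    printf("L2 size:       %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size:       %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}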

All Intel-compatible CPUs of the last two decades have used a write-back strategy for their caches (which presumes fetching a cache line first to allow partial writes). Of course that's the theory; reality is slightly more complex than that.

Virtually all processors (your model included) have one or several forms of write combining (or fill buffers, as Intel has called them since Merom), and all but the most antique Intel-compatible CPUs support uncached (non-temporal) writes from SSE registers, which again use a form of write combining. And then of course there are things like on-chip cache coherence protocols, snoop filtering and other mechanisms to ensure cache coherency both between cores of one processor and between different processors in a multi-processor system.
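
A minimal sketch of such an uncached (non-temporal) write from an SSE register, assuming SSE2 and a 16-byte-aligned destination (the function name and the alignment/size assumptions are illustrative):

#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_set1_epi32, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

void fill_nt(int32_t *dst, int32_t value, size_t n) {
    __m128i v = _mm_set1_epi32(value);
    /* dst is assumed 16-byte aligned and n a multiple of 4 (illustrative). */
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);  /* NT store: goes through WC buffers, bypasses the caches */
    _mm_sfence();   /* make the NT stores globally visible before later stores */
}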

Nevertheless -- the general cache policy is still write-back.

Which cache mapping technique is used in the Intel Core i7 processor?

Direct-mapped caches are basically never used in modern high-performance CPUs. The power savings are outweighed by the large advantage in hit rate for a set-associative cache of the same size, with only a bit more complexity in the control logic. Transistor budgets are very large these days.

It's very common for software to have at least a couple arrays that are a multiple of 4k apart from each other, which would create conflict misses in a direct-mapped cache. (Tuning code with more than a couple arrays can involve skewing them to reduce conflict misses, if a loop needs to iterate through all of them at once)
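
A hedged sketch of that skewing trick: if several arrays sit exactly a multiple of 4 KiB apart, corresponding elements map to the same cache set, so adding a small per-array offset breaks the alignment. The sizes and the 64-byte skew are illustrative values, not tuned ones:

#include <stdlib.h>

#define N    (1 << 20)   /* 1M floats = 4 MiB, a multiple of 4 KiB           */
#define SKEW 64          /* one cache line of extra offset per array         */

float *alloc_skewed(int which) {
    /* Over-allocate and offset each array by a different number of lines so
     * a[i], b[i], c[i] no longer share the same address bits 6..11.
     * (The original pointer would have to be kept around to free it.) */
    char *raw = malloc(N * sizeof(float) + 4 * SKEW);
    return (float *)(raw + which * SKEW);
}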

Modern CPUs are so fast that DRAM latency is over 200 core clock cycles, which is too big even for powerful out-of-order execution CPUs to hide very well on a cache miss.


Multi-level caches are essential (and used in all high-performance CPUs) to give the low latency (~4 cycles) / high throughput for the hottest data (e.g. up to 2 loads and 1 store per clock, with a 128-, 256- or even 512-bit path between the L1D cache and vector load/store execution units), while still being large enough to cache a reasonably sized working set. It's physically impossible to build one very large / very fast / highly-associative cache that performs as well as current multi-level caches for typical workloads; speed-of-light delays when data has to physically travel far are a problem. The power cost would be prohibitive as well. (In fact, power / power density is a major limiting factor for modern CPUs; see Modern Microprocessors: A 90-Minute Guide!.)

All levels of cache (except the uop cache) are physically indexed / physically tagged in all the x86 CPUs I'm aware of. L1D caches in most designs take their index bits from below the page offset, and thus are also VIPT allowing TLB lookup to happen in parallel with tag fetch, but without any aliasing problems. Thus, caches don't need to be flushed on context switches or anything. (See this answer for more about multi-level caches in general and the VIPT speed trick, and some cache parameters of some actual x86 CPUs.)
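
A tiny worked check of that VIPT condition, assuming a 32 KiB, 8-way L1d with 64-byte lines and 4 KiB pages (typical Intel values) and the GCC/Clang __builtin_ctz builtin: if index bits + offset bits fit within the page offset, the set index comes entirely from untranslated address bits, so indexing can start in parallel with the TLB lookup.

#include <stdio.h>

int main(void) {
    unsigned cache_size = 32 * 1024, ways = 8, line = 64, page = 4096;
    unsigned sets      = cache_size / (ways * line);   /* 64 sets              */
    unsigned idx_bits  = __builtin_ctz(sets);          /* 6 index bits         */
    unsigned off_bits  = __builtin_ctz(line);          /* 6 offset bits        */
    unsigned page_bits = __builtin_ctz(page);          /* 12 page-offset bits  */
    printf("index+offset = %u bits, page offset = %u bits -> %s\n",
           idx_bits + off_bits, page_bits,
           idx_bits + off_bits <= page_bits ? "VIPT works without aliasing"
                                            : "would alias");
    return 0;
}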


The private (per-core) L1D / L1I and L2 caches are traditional set-associative caches, often 8-way or 4-way for the small/fast caches. Cache line size is 64 bytes on all modern x86 CPUs. The data caches are write-back. (Except on AMD Bulldozer-family, where L1D is write-through with a small 4kiB write-combining buffer.)

http://www.7-cpu.com/ has good cache organization / latency numbers, and bandwidth, and TLB organization / performance numbers, for various microarchitectures, including many x86, like Haswell.

The "L0" decoded-uop cache in Intel Sandybridge-family is set-associative and virtually addressed. Up to 3 blocks of up to 6 uops can cache decode results from instructions in a 32-byte block of machine code. Related: Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs. (A uop cache is a big advance for x86: x86 instructions are variable-length and hard to decode fast / in parallel, so caching the internal decode results as well as the machine code (L1I$) has significant power and throughput advantages. Powerful decoders are still needed, because the uop cache isn't large; it's most effective in loops (including medium to large loops). This avoids the Pentium4 mistake (or limitation based on transitor size at the time) of having weak decoders and relying on the trace cache.)


Modern Intel (and AMD, I assume) L3 aka LLC aka last-level caches use an indexing function that isn't just a range of address bits: it's a hash function that better distributes things to reduce collisions from fixed strides. See According to Intel my cache should be 24-way associative though its 12-way, how is that?


From Nehalem onwards, Intel has used a large inclusive shared L3 cache, which filters coherency traffic between cores. i.e. when one core reads data which is in Modified state in L1d of another core, L3 tags say which core, so an RFO (Read For Ownership) can be sent only to that core, instead of broadcast. How are the modern Intel CPU L3 caches organized?. The inclusivity property is important, because it means no private L2 or L1 cache can have a copy of a cache line without L3 knowing about it. If it's in Exclusive or Modified state in a private cache, L3 will have Invalid data for that line, but the tags will still say which core might have a copy. Cores that definitely don't have a copy don't need to be sent a message about it, saving power and bandwidth over the internal links between cores and L3. See Why On-Chip Cache Coherence Is Here to Stay for more details about on-chip cache coherency in Intel "i7" (i.e. Nehalem and Sandybridge-family, which are different architectures but do use the same cache hierarchy).

Core2Duo had a shared last-level cache (L2), but was slow at generating RFO (Read-For-Ownership) requests on L2 misses. So bandwidth between cores with a small buffer that fits in L1d is as slow as with a large buffer that doesn't fit in L2 (i.e. DRAM speed). There's a fast range of sizes when the buffer fits in L2 but not L1d, because the writing core evicts its own data to L2, where the other core's loads can hit without generating an RFO request. (See Figure 3.27: Core 2 Bandwidth with 2 Threads in Ulrich Drepper's "What Every Programmer Should Know About Memory"; full version here.)


Skylake-AVX512 has larger per-core L2 (1MiB instead of 256k), and smaller L3 (LLC) slices per core. It's no longer inclusive. It uses a mesh network instead of a ring bus to connect cores to each other. See this AnandTech article (but it has some inaccuracies in the microarchitectural details on other pages, see the comment I left).

From Intel® Xeon® Processor Scalable Family Technical Overview

Due to the non-inclusive nature of LLC, the absence of a cache line in LLC does not indicate that the line is not present in private caches of any of the cores. Therefore, a snoop filter is used to keep track of the location of cache lines in the L1 or MLC of cores when it is not allocated in the LLC. On the previous-generation CPUs, the shared LLC itself took care of this task.

This "snoop-filter" is only useful if it can't have false negatives. It's ok to send an invalidate or RFO (MESI) to a core that doesn't have a copy of a line. It's not ok to let a core keep a copy of a line when another core is requesting exclusive access to it. So it may be a tag-inclusive tracker that knows which cores might have copies of which line, but which doesn't cache any data.

Or maybe the snoop filter can still be useful without being strictly inclusive of all L2 / L1 tags. I'm not an expert on multi-core / multi-socket snoop protocols. I think the same snoop filter may also help filter snoop requests between sockets. (In Broadwell and earlier, only quad-socket and higher Xeons have a snoop filter for inter-core traffic; dual-socket-only Broadwell Xeon and earlier don't filter snoop requests between the two sockets.)


AMD Ryzen uses separate L3 caches for clusters of cores, so data shared across many cores has to be duplicated in the L3 for each cluster. Also importantly, writes from a core in one cluster take longer to be visible to a core in another cluster, with the coherency requests having to go over an interconnect between clusters. (Similar to between sockets in a multi-socket Intel system, where each CPU package has its own L3.)

So this gives us NUCA (Non-Uniform Cache Access), analogous to the usual NUMA (Non-Uniform Memory Access) that you get in a multi-socket system where each processor has a memory controller built-in, and accessing local memory is faster than accessing memory attached to another socket.


Recent Intel multi-socket systems have configurable snoop modes so in theory you can tune the NUMA mechanism to work best for the workload you're running. See Intel's page about Broadwell-Xeon for a table + description of the available snoop modes.


Another advance / evolution is an adaptive replacement policy in the L3 on IvyBridge and later. This can reduce pollution when some data has temporal locality but other parts of the working set are much larger. (i.e. looping over a giant array with standard LRU replacement will evict everything, leaving L3 cache only caching data from the array that won't be touched again soon. Adaptive replacement tries to mitigate that problem.)


Further reading:

  • What Every Programmer Should Know About Memory?
  • Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? (Single-threaded memory bandwidth on many-core Xeon CPUs is limited by max_concurrency / latency, not DRAM bandwidth).
  • http://users.atw.hu/instlatx64/ for memory-performance timing results
  • http://www.7-cpu.com/ for cache / TLB organization and latency numbers.
  • http://agner.org/optimize/ for microarchitectural details (mostly about the execution pipeline, not memory), and asm / C++ optimization guides.
  • Stack Overflow's x86 tag wiki has a performance section, with links to those and more.

How can the L1, L2, L3 CPU caches be turned off on modern x86/amd64 chips?

Intel's manual 3A, Section 11.5.3, provides an algorithm to globally disable the caches:

11.5.3 Preventing Caching

To disable the L1, L2, and L3 caches after they have been enabled and have received cache fills, perform the following steps:

  1. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1 and the NW flag to 0.)
  2. Flush all caches using the WBINVD instruction.
  3. Disable the MTRRs and set the default memory type to uncached or set all MTRRs for the uncached memory
    type (see the discussion of the TYPE field and the E flag in Section 11.11.2.1,
    “IA32_MTRR_DEF_TYPE MSR”).

The caches must be flushed (step 2) after the CD flag is set to ensure system memory coherency. If the caches are
not flushed, cache hits on reads will still occur and data will be read from valid cache lines.

The intent of the three separate steps listed above addresses three distinct requirements: (i) discontinue new data
replacing existing data in the cache (ii) ensure data already in the cache are evicted to memory, (iii) ensure subsequent memory references observe UC memory type semantics. Different processor implementation of caching
control hardware may allow some variation of software implementation of these three requirements. See note below.

NOTES
Setting the CD flag in control register CR0 modifies the processor’s caching behaviour as indicated
in Table 11-5, but setting the CD flag alone may not be sufficient across all processor families to
force the effective memory type for all physical memory to be UC nor does it force strict memory
ordering, due to hardware implementation variations across different processor families. To force
the UC memory type and strict memory ordering on all of physical memory, it is sufficient to either
program the MTRRs for all physical memory to be UC memory type or disable all MTRRs.

For the Pentium 4 and Intel Xeon processors, after the sequence of steps given above has been
executed, the cache lines containing the code between the end of the WBINVD instruction and
before the MTRRS have actually been disabled may be retained in the cache hierarchy. Here, to remove code from the cache completely, a second WBINVD instruction must be executed after the
MTRRs have been disabled.

That's a long quote, but it boils down to this code:

;Step 1 - Enter no-fill mode
mov eax, cr0
or eax, 1<<30 ; Set bit CD
and eax, ~(1<<29) ; Clear bit NW
mov cr0, eax

;Step 2 - Invalidate all the caches
wbinvd

;All memory accesses now go from/to memory, but UC memory ordering may still not be enforced.

;For Atom processors we are done: UC semantics are automatically enforced.

;Step 3 - Disable all MTRRs by writing 0 to IA32_MTRR_DEF_TYPE (clears the E flag and the default type)
xor eax, eax
xor edx, edx
mov ecx, IA32_MTRR_DEF_TYPE ;MSR number is 2FFH
wrmsr

;P4 only, remove this code from the L1I
wbinvd

most of which is not executable from user mode (writing CR0 and MSRs requires ring 0).


AMD's manual, volume 2, provides a similar algorithm in Section 7.6.2:

7.6.2 Cache Control Mechanisms

The AMD64 architecture provides a number of mechanisms for controlling the cacheability of memory. These are described in the following sections.

Cache Disable. Bit 30 of the CR0 register is the cache-disable bit, CR0.CD. Caching is enabled
when CR0.CD is cleared to 0, and caching is disabled when CR0.CD is set to 1. When caching is
disabled, reads and writes access main memory.

Software can disable the cache while the cache still holds valid data (or instructions). If a read or write
hits the L1 data cache or the L2 cache when CR0.CD=1, the processor does the following:

  1. Writes the cache line back if it is in the modified or owned state.
  2. Invalidates the cache line.
  3. Performs a non-cacheable main-memory access to read or write the data.

If an instruction fetch hits the L1 instruction cache when CR0.CD=1, some processor models may read
the cached instructions rather than access main memory. When CR0.CD=1, the exact behavior of L2
and L3 caches is model-dependent, and may vary for different types of memory accesses.

The processor also responds to cache probes when CR0.CD=1. Probes that hit the cache cause the
processor to perform Step 1. Step 2 (cache-line invalidation) is performed only if the probe is
performed on behalf of a memory write or an exclusive read.

Writethrough Disable. Bit 29 of the CR0 register is the not writethrough disable bit, CR0.NW. In
early x86 processors, CR0.NW is used to control cache writethrough behavior, and the combination of
CR0.NW and CR0.CD determines the cache operating mode.

[...]

In implementations of the AMD64 architecture, CR0.NW is not used to qualify the cache operating
mode established by CR0.CD.

This translates to this code (very similar to Intel's):

;Step 1 - Disable the caches
mov eax, cr0
or eax, 1<<30
mov cr0, eax

;For some models we need to invalidate the L1I
wbinvd

;Step 2 - Disable speculative accesses
xor eax, eax
xor edx, edx
mov ecx, MTRRdefType ;MSR number is 2FFH
wrmsr

Caches can also be selectively disabled at:

  • Page level, with the attribute bits PCD (Page Cache Disable) and PWT (Page Write-Through) [Only for Pentium Pro and Pentium II].

    When both are clear, the relevant MTRR type is used; if PCD is set, caching of the page is disabled.
  • Page level, with the PAT (Page Attribute Table) mechanism.

    By filling IA32_PAT with caching types and using the bits PAT, PCD, PWT as a 3-bit index, it's possible to select one of the six caching types (UC-, UC, WC, WT, WP, WB); see the sketch after this list.
  • Using the MTRRs (fixed or variable).

    By setting the caching type to UC or UC- for specific physical address ranges.
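
A minimal sketch of the PAT lookup described above: the PAT, PCD and PWT bits of a page-table entry form a 3-bit index into the IA32_PAT MSR. The default entries below are the architectural power-on values documented in the SDM; an OS is free to reprogram them, so treat them as an assumption.

#include <stdio.h>

static const char *default_pat[8] = {
    "WB", "WT", "UC-", "UC",   /* entries 0-3 */
    "WB", "WT", "UC-", "UC"    /* entries 4-7 (the defaults repeat)          */
};

int pat_index(int pat, int pcd, int pwt) {
    return (pat << 2) | (pcd << 1) | pwt;   /* PAT is bit 2, PCD bit 1, PWT bit 0 */
}

int main(void) {
    /* A PTE with PAT=0, PCD=1, PWT=0 selects entry 2: UC- with the defaults. */
    printf("memory type: %s\n", default_pat[pat_index(0, 1, 0)]);
    return 0;
}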

Of these options only the page attributes can be exposed to user mode programs (see for example this).

What cache invalidation algorithms are used in actual CPU caches?

As hivert said - it's hard to get a clear picture on the specific algorithm, but one can deduce some of the information according to hints or clever reverse engineering.

You didn't specify which CPU you mean; each one can have a different policy (in fact even within the same CPU different cache levels may use different policies, not to mention TLBs and other associative arrays which also may have such policies). I did find a few hints about Intel (specifically Ivy Bridge), so we'll use this as a benchmark for industry-level "standards" (which may or may not apply elsewhere).

First, Intel presented some LRU related features here -
http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-1-Microprocessor/HC24.28.117-HotChips_IvyBridge_Power_04.pdf

Slide 46 mentions "Quad-Age LRU" - this is apparently an age-based LRU that assigns an "age" to each line according to its predicted importance. They mention that prefetches get a middle age, so demand accesses are probably allocated at a higher age (or lower, whichever survives longest), and all lines likely age gradually, so the oldest gets replaced first. Not as good as perfect "FIFO-like" LRU, but keep in mind that most caches don't implement true LRU anyway, but rather a complicated pseudo-LRU approximation, so this might be an improvement.

Another interesting mechanism mentioned there, which goes the extra mile beyond classic LRU, is adaptive fill policy. There's a pretty good analysis here - http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ , but in a nutshell (if the blog is correct, and he does seem to make a good match with his results), the cache dynamically chooses between two LRU policies, trying to decide whether the lines are going to be reused or not (and should be kept or not).

I guess this could answer your question about multiple LRU schemes to some extent. Implementing several schemes is probably hard and expensive in terms of HW, but when you have some policy that's complicated enough to have parameters, it's possible to use tricks like dynamic selection, set dueling, etc.
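
A toy model of set dueling (this is not Intel's actual hardware, just a sketch of the dynamic-selection idea): a few "leader" sets always use policy A, a few always use policy B, and a saturating counter tracks which leader group misses less; all remaining "follower" sets use whichever policy is currently winning. The constants are arbitrary.

#include <stdbool.h>

#define NUM_SETS  1024
#define LEADERS_A 32           /* sets 0..31 always run policy A             */
#define LEADERS_B 32           /* sets 32..63 always run policy B            */
#define PSEL_MAX  1023         /* 10-bit saturating selection counter        */

static int psel = PSEL_MAX / 2;

/* Called on every miss; misses in leader sets nudge the counter. */
void on_miss(int set) {
    if (set < LEADERS_A) {                       /* a policy-A leader missed */
        if (psel < PSEL_MAX) psel++;
    } else if (set < LEADERS_A + LEADERS_B) {    /* a policy-B leader missed */
        if (psel > 0) psel--;
    }
}

/* Follower sets ask which policy to use for the next replacement. */
bool use_policy_b(int set) {
    if (set < LEADERS_A) return false;
    if (set < LEADERS_A + LEADERS_B) return true;
    return psel > PSEL_MAX / 2;   /* A's leaders missing more => prefer B    */
}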

CPU cache inhibition

x86 has no way to do a store that bypasses or writes through L1D/L2 but not L3. There are NT stores which bypass all caches. Anything that forces a write-back to L3 also forces a write-back all the way to memory (e.g. the clwb instruction). Those are designed for non-volatile RAM use cases, or for non-coherent DMA, where it's important to get data committed to actual RAM.
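
A hedged sketch of forcing a range of lines all the way out to memory (the NVRAM / non-coherent-DMA case mentioned above). _mm_clwb needs the CLWB feature (compile with -mclwb); on older CPUs _mm_clflushopt or _mm_clflush would be the fallback. The function name is illustrative.

#include <immintrin.h>   /* _mm_clwb (CLWB), _mm_sfence */
#include <stddef.h>

void flush_range_to_memory(void *p, size_t len) {
    for (size_t off = 0; off < len; off += 64)    /* 64-byte cache lines      */
        _mm_clwb((char *)p + off);                /* write line back to memory; it may stay cached */
    _mm_sfence();                                 /* order the write-backs    */
}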

There's also no way to do a load that bypasses L1D (except from USWC memory with SSE4.1 movntdqa, but it's not "special" on other memory types). prefetchNTA can bypass L2, according to Intel's optimization manual.
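
A hedged sketch of prefetchNTA for a read-once pass over a big array, per the note above that it can bypass L2 according to Intel's optimization manual. The prefetch distance (8 lines ahead) is a guess, not a tuned value; prefetch is only a hint and never faults, so running past the end of the array is harmless.

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
#include <stddef.h>

double sum_read_once(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i % 8 == 0)                                    /* once per 64-byte line */
            _mm_prefetch((const char *)&a[i + 64], _MM_HINT_NTA);
        s += a[i];
    }
    return s;
}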

Prefetch on the core doing the read should be useful to trigger write-back from the other core into L3, and transfer into your own L1D. But that's only useful if you have the address ready before you want to do the load (dozens of cycles ahead of time for it to pay off).

Intel CPUs use a shared inclusive L3 cache as a backstop for on-chip cache coherency. 2-socket has to snoop the other socket, but Xeons that support more than 2P have snoop filters to track cache lines that move around.

When you read a line that was recently written by another core, it's always Invalid in your L1D. L3 is tag-inclusive, and its tags have extra info to track which core has the line. (This is true even if the line is in M state in an L1D somewhere, which requires it to be Invalid in L3, according to normal MESI.) Thus, after your cache-miss checks L3 tags, it triggers a request to the L1 that has the line to write it back to L3 cache (and maybe to send it directly to the core than wants it).

Skylake-X (Skylake-AVX512) doesn't have an inclusive L3 (it has a bigger private L2 and a smaller L3), but it still has a tag-inclusive structure to track which core has a line. It also uses a mesh instead of a ring bus, and L3 latency seems to be significantly worse than on Broadwell.


Possibly useful: map the latency-critical part of your shared memory region with a write-through cache policy. IDK if this patch ever made it into the mainline Linux kernel, but see this patch from HP: Support Write-Through mapping on x86. (The normal policy is WB.)

Also related: Main Memory and Cache Performance of Intel Sandy Bridge and AMD Bulldozer, an in-depth look at latency and bandwidth on 2-socket SnB, for cache lines in different starting states.

For more about memory bandwidth on Intel CPUs, see Enhanced REP MOVSB for memcpy, especially the Latency Bound Platforms section. (Having only 10 LFBs limits single-core bandwidth).


Related: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? has some experimental results for having one thread spam writes to a location while another thread reads it.

Note that the cache miss itself isn't the only effect. You also get a lot of machine_clears.memory_ordering from mis-speculation in the core doing the load. (x86's memory model is strongly ordered, but real CPUs speculatively load early and abort in the rare case where the cache line becomes invalid before the load was supposed to have "happened".)

How does the CPU cache affect the performance of a C program

The plots show the combination of several complex low-level effects (mainly cache thrashing and prefetching issues). I assume the target platform is a mainstream modern processor with 64-byte cache lines (typically an x86 one).

I can reproduce the problem on my i5-9600KF processor. Here is the resulting plot:

performance plot


First of all, when nj is small, the gap between fetched addresses (i.e. the stride) is small and cache lines are used relatively efficiently. For example, when nj = 1 the access is contiguous. In this case the processor can efficiently prefetch the cache lines from DRAM so as to hide its high latency, and there is good spatial locality since many contiguous items share the same cache line. When nj = 2, only half of each cache line is used, so the number of requested cache lines is twice as large for the same number of operations. That said, the time is not much bigger, because the relatively high latency of adding two floating-point numbers makes the code compute-bound. You can unroll the loop 4 times and use 4 different sum variables so that (mainstream modern) processors can add multiple values in parallel; a sketch is shown below. Note that most processors can also load multiple values from the cache per cycle. When nj = 4, a new cache line is requested every 2 cycles (since a double takes 8 bytes), and the required memory throughput can become so large that the computation becomes memory-bound.

One might expect the time to be stable for nj >= 8, since the number of requested cache lines should be the same, but in practice processors prefetch multiple contiguous cache lines so as not to pay the huge DRAM latency in this case. The number of prefetched cache lines is generally between 2 and 4 (AFAIK this prefetching strategy is disabled on Intel processors when the stride is bigger than 512 bytes, i.e. when nj >= 64). This explains why the timings increase sharply while nj < 32 and become relatively stable for 32 <= nj <= 256, with the exception of the peaks.
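
A sketch of that unrolling advice: four independent accumulators let the floating-point additions overlap instead of serializing on a single dependency chain. The function name and loop shape mirror the strided sum discussed here, not the original benchmark code.

#include <stddef.h>

double strided_sum(const double *a, size_t n, size_t nj) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 3 * nj < n; i += 4 * nj) {       /* 4 independent partial sums */
        s0 += a[i];
        s1 += a[i + nj];
        s2 += a[i + 2 * nj];
        s3 += a[i + 3 * nj];
    }
    for (; i < n; i += nj)                      /* remainder                  */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}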

The regular peaks that happen when nj is a multiple of 16 are due to a complex cache effect called cache thrashing. Modern caches are N-way set-associative, with N typically between 4 and 16. For example, here are the statistics for my i5-9600KF processor:

Cache 0: L1 data cache,        line size 64,  8-ways,    64 sets, size   32k
Cache 1: L1 instruction cache, line size 64,  8-ways,    64 sets, size   32k
Cache 2: L2 unified cache,     line size 64,  4-ways,  1024 sets, size  256k
Cache 3: L3 unified cache,     line size 64, 12-ways, 12288 sets, size 9216k

This means that two values fetched from DRAM with respective addresses A1 and A2 can conflict in my L1 cache if (A1 % 4096) / 64 == (A2 % 4096) / 64 (the 32 KiB, 8-way L1 has 64 sets of 64-byte lines, so the set index comes from address bits 6..11). In this case, the processor needs to choose which cache line to replace within a set of N=8 lines. There are many cache replacement policies and none is perfect. Thus, some useful cache lines are sometimes evicted too early, resulting in additional cache misses later. In pathological cases, many DRAM locations compete for the same cache lines, resulting in excessive cache misses. More information about this can be found in this post.
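
A small helper matching that conflict condition (parameters reflect the 32 KiB, 8-way, 64-byte-line L1d above, i.e. 64 sets):

#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 64
#define L1_SETS   64          /* 32 KiB / (8 ways * 64 B)                    */

static unsigned l1_set(uintptr_t addr) {
    return (addr / LINE_SIZE) % L1_SETS;      /* address bits 6..11           */
}

static bool same_l1_set(uintptr_t a1, uintptr_t a2) {
    return l1_set(a1) == l1_set(a2);          /* same set => candidates for conflict */
}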

Regarding the nj stride, the number of cache lines that can be effectively used in the L1 cache is limited. For example, if all fetched values have the same address modulo the cache size, then only N cache lines (i.e. 8 for my processor) can actually be used to store all the values. Having fewer cache lines available is a big problem since the prefetcher needs a fairly large space in the cache to store the many cache lines needed later. The smaller the number of concurrent fetches, the lower the memory throughput. This is especially true here since the latency of fetching one cache line from DRAM is several dozen nanoseconds (e.g. ~70 ns) while the bandwidth is dozens of GiB/s (e.g. ~40 GiB/s): dozens of cache lines (e.g. ~40) must be fetched concurrently to hide the latency and saturate the DRAM.

Here is a simulation of the number of cache lines that can actually be used in my L1 cache as a function of nj:

 nj  #cache-lines
1 512
2 512
3 512
4 512
5 512
6 512
7 512
8 512
9 512
10 512
11 512
12 512
13 512
14 512
15 512
16 256 <----
17 512
18 512
19 512
20 512
21 512
22 512
23 512
24 512
25 512
26 512
27 512
28 512
29 512
30 512
31 512
32 128 <----
33 512
34 512
35 512
36 512
37 512
38 512
39 512
40 512
41 512
42 512
43 512
44 512
45 512
46 512
47 512
48 256 <----
49 512
50 512
51 512
52 512
53 512
54 512
55 512
56 512
57 512
58 512
59 512
60 512
61 512
62 512
63 512
64 64 <----
==============
80 256
96 128
112 256
128 32
144 256
160 128
176 256
192 64
208 256
224 128
240 256
256 16
384 32
512 8
1024 4
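
A sketch that reproduces a count like the one above: walk the strided access pattern, record which L1 sets are touched, and report how many lines can live in L1 at once (touched sets times associativity). It assumes the same 32 KiB / 8-way / 64 B geometry as above and 8-byte elements; the iteration count is illustrative, and for very large strides the result also depends on the total array size, which isn't known here.

#include <stdbool.h>
#include <stdio.h>

#define SETS 64
#define WAYS 8

static int usable_lines(int nj) {
    bool touched[SETS] = { false };
    for (long i = 0; i < 100000; i++)             /* strided pattern a[i*nj]  */
        touched[(i * nj * 8 / 64) % SETS] = true; /* which L1 set does it hit? */
    int n = 0;
    for (int s = 0; s < SETS; s++) n += touched[s];
    return n * WAYS;                              /* usable lines = sets * ways */
}

int main(void) {
    for (int nj = 1; nj <= 64; nj++)
        printf("%3d %4d\n", nj, usable_lines(nj));
    return 0;
}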

We can see that the number of available cache lines is smaller when nj is a multiple of 16. In this case, the prefetcher preloads data into cache lines that are likely to be evicted early by subsequent fetches (done concurrently). Load instructions performed in the code are more likely to result in cache misses when the number of available cache lines is small. When a cache miss happens, the value then needs to be fetched again from the L2 or even the L3, resulting in slower execution. Note that the L2 cache is subject to the same effect, though it is less visible since it is larger. The L3 cache of modern x86 processors makes use of hashing to better distribute accesses and reduce collisions from fixed strides (at least on Intel processors and certainly on AMD too, though AFAIK this is not documented).

Here are the timings on my machine for some peaks:

  32 4.63600000e-03 4.62298020e-03 4.06400000e-03 4.97300000e-03
  48 4.95800000e-03 4.96994059e-03 4.60400000e-03 5.59800000e-03
  64 5.01600000e-03 5.00479208e-03 4.26900000e-03 5.33100000e-03
  96 4.99300000e-03 5.02284158e-03 4.94700000e-03 5.29700000e-03
 128 5.23300000e-03 5.26405941e-03 4.93200000e-03 5.85100000e-03
 192 4.76900000e-03 4.78833663e-03 4.60100000e-03 5.01600000e-03
 256 5.78500000e-03 5.81666337e-03 5.77600000e-03 6.35300000e-03
 384 5.25900000e-03 5.32504950e-03 5.22800000e-03 6.75800000e-03
 512 5.02700000e-03 5.05165347e-03 5.02100000e-03 5.34400000e-03
1024 5.29200000e-03 5.33059406e-03 5.28700000e-03 5.65700000e-03

As expected, the timings are overall bigger in practice for the cases where the number of available cache lines is much smaller. However, when nj >= 512 the results are surprising since they are significantly faster than the others. This is the case where the number of available cache lines equals the number of ways of associativity (N). My guess is that Intel processors detect this pathological case and optimize the prefetching so as to reduce the number of cache misses (e.g. using line-fill buffers to bypass the L1 cache).

Finally, for large nj strides, a bigger nj should result in higher overheads, mainly due to the translation lookaside buffer (TLB): there are more page addresses to translate with a bigger nj and the number of TLB entries is limited. In fact this is what I can observe on my machine: timings tend to increase slowly and in a very stable way, unlike on your target platform.

I cannot really explain this very strange behavior yet.
Here are some wild guesses:

  • The OS could tend to use more huge pages when nj is large (so as to reduce the overhead of the TLB) since wider blocks are allocated. This could result in more concurrency for the prefetcher, as AFAIK it cannot cross page boundaries. You can try to check the number of allocated (transparent) huge pages (by looking at AnonHugePages in /proc/meminfo on Linux), force them to be used in this case (using an explicit memmap), or possibly disable them; a sketch of such an experiment follows this list. My system appears to make use of 2 MiB transparent huge pages independently of the nj value.
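
A minimal sketch of that huge-page experiment, assuming Linux and glibc (MADV_HUGEPAGE / MADV_NOHUGEPAGE are Linux-specific, and whether THP is actually granted still depends on system configuration; it can be verified via AnonHugePages in /proc/meminfo or the process's smaps). The function name is illustrative.

#define _DEFAULT_SOURCE          /* for MADV_HUGEPAGE on glibc */
#include <stdlib.h>
#include <sys/mman.h>

/* Allocate 'bytes' of memory 2 MiB-aligned and hint the kernel to back it
 * (or not) with transparent huge pages, so the two cases can be timed. */
double *alloc_maybe_huge(size_t bytes, int want_huge) {
    void *p = NULL;
    if (posix_memalign(&p, 2 * 1024 * 1024, bytes))   /* 2 MiB alignment */
        return NULL;
    madvise(p, bytes, want_huge ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
    return (double *)p;
}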

