Why Does Linux Perf Use Event L1D.Replacement for "L1 Dcache Misses" on X86

Hardware cache events and perf

User @Margaret points towards a reasonable answer in the comments - read the kernel source to see the mapping for the PMU events.

We can check arch/x86/events/intel/core.c for the event definitions. I don't actually know if "core" here refers to the Core architecture, or just that this is the core file with most of the definitions - but in any case it's the file you want to look at.

The key part is this section, which defines skl_hw_cache_event_ids:

static __initconst const u64 skl_hw_cache_event_ids
                [PERF_COUNT_HW_CACHE_MAX]
                [PERF_COUNT_HW_CACHE_OP_MAX]
                [PERF_COUNT_HW_CACHE_RESULT_MAX] =
{
 [ C(L1D ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = 0x81d0,  /* MEM_INST_RETIRED.ALL_LOADS */
        [ C(RESULT_MISS)   ] = 0x151,   /* L1D.REPLACEMENT */
    },
    [ C(OP_WRITE) ] = {
        [ C(RESULT_ACCESS) ] = 0x82d0,  /* MEM_INST_RETIRED.ALL_STORES */
        [ C(RESULT_MISS)   ] = 0x0,
    },
    [ C(OP_PREFETCH) ] = {
        [ C(RESULT_ACCESS) ] = 0x0,
        [ C(RESULT_MISS)   ] = 0x0,
    },
 },
...

Decoding the nested initializers, you get that L1-dcache-loads corresponds to MEM_INST_RETIRED.ALL_LOADS and L1-dcache-load-misses to L1D.REPLACEMENT.

We can double check this with perf:

$ ocperf stat -e mem_inst_retired.all_loads,L1-dcache-loads,l1d.replacement,L1-dcache-load-misses,L1-dcache-loads,mem_load_retired.l1_hit head -c100M /dev/zero > /dev/null

Performance counter stats for 'head -c100M /dev/zero':

        11,587,793      mem_inst_retired_all_loads
        11,587,793      L1-dcache-loads
            20,233      l1d_replacement
            20,233      L1-dcache-load-misses     #    0.17% of all L1-dcache hits
        11,587,793      L1-dcache-loads
        11,495,053      mem_load_retired_l1_hit

0.024322360 seconds time elapsed

The "Hardware Cache" events show exactly the same values as using the underlying PMU events we guessed at by checking the source.

Why won't perf report dcache-store-misses?

Perf prints <not supported> for generic events that were requested by the user (or included in the default event set of perf stat) but are not mapped to real hardware PMU events on the current hardware. Your hardware has no exact match for the L1-dcache-store-misses generic event, so perf informs you that your request sudo perf stat -e L1-dcache-load-misses,L1-dcache-store-misses ./progB can't be fully implemented on the current machine.

Your CPU is "Product formerly Kaby Lake", which has a Skylake PMU according to the Linux kernel file arch/x86/events/intel/core.c:

#L4986
case INTEL_FAM6_KABYLAKE:
memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));

Line 420 of this file is the cache event mapping (generic perf event name to real hardware PMU event code) for the Skylake PMU, skl_hw_cache_event_ids. Your L1d load/store misses are the [ C(L1D ) ] - [ C(OP_READ) ] / [ C(OP_WRITE) ] - [ C(RESULT_MISS) ] fields of this strange data structure (= 0 means not mapped; skl_hw_cache_extra_regs at line 525 has additional umask settings for some events):

static ... const ... skl_hw_cache_event_ids ... =
{
 [ C(L1D ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = 0x81d0,  /* MEM_INST_RETIRED.ALL_LOADS */
        [ C(RESULT_MISS)   ] = 0x151,   /* L1D.REPLACEMENT */
    },
    [ C(OP_WRITE) ] = {
        [ C(RESULT_ACCESS) ] = 0x82d0,  /* MEM_INST_RETIRED.ALL_STORES */
        [ C(RESULT_MISS)   ] = 0x0,
    }, ...
 },

So, for Skylake, L1d misses are defined for loads (OP_READ) and not defined for stores (OP_WRITE), while L1d accesses are defined for both operations.

These generic events were probably created a long time ago, when the hardware had PMU events to implement them. For example, the Core 2 PMU has a mapping for both of them (arch/x86/events/intel/core.c, line 1254, the core2_hw_cache_event_ids constant): the L1d read miss is L1D_CACHE_LD.I_STATE and the L1d write miss is L1D_CACHE_ST.I_STATE. The perf subsystem in the kernel simply has to keep the many generic event names added in older versions for compatibility.

You should check the output of the sudo perf list cache command to select supported events for your CPU and its PMU. This command (in recent perf tool versions) will output only the mapped generic names and will also print hardware-specific event names. You should also check the Intel SDM and the optimization and performance-counter manuals to understand how loads and stores are implemented and which PMU events you should use to count them.
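For example, something along these lines lists only the cache events that are actually mapped on the running machine (the exact output format varies between perf versions, and perf list also accepts a glob to filter by name):

$ sudo perf list cache
$ sudo perf list 'L1-dcache*'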

While L1d store misses are not available on your CPU, you should think about what a store miss is and how it is implemented. The request will probably be passed to the next level of the cache/memory hierarchy, for example becoming an L2 store access. The perf generic event set is ugly (it was introduced in the era of the two-level caches of Core 2) and has only L1 and LLC (last level cache) events. It's not obvious how LLC should be mapped in the current era of a shared L3 - to L2 or to L3 (on Skylake the LLC is the L3). But the Intel-specific events should work.

Perf shows L1-dcache-load-misses in a block with no memory access

The event L1-dcache-load-misses is mapped to L1D.REPLACEMENT on Sandy Bridge and later microarchitectures (or to a similar event on older microarchitectures). This event doesn't support precise sampling, which means that a sample can point to an instruction that couldn't have generated the event being sampled. (Note that L1-dcache-load-misses is not supported on any current Atom.)

Starting with Linux 3.11, on a Haswell+ or Silvermont+ microarchitecture, samples can be captured with eventing instruction pointers by specifying a sampling event that meets the following two conditions:

  • The event supports precise sampling. You can use, for example, any of the events that represent memory uop or instruction retirement. The exact names and meanings of the events depend on the microarchitecture. Refer to the Intel SDM Volume 3 for more information. There is no event that supports precise sampling and has the same exact meaning as L1D.REPLACEMENT. On processors that support Extended PEBS, only a subset of PEBS events support precise sampling.
  • The precise sampling level is enabled on the event. In Linux perf, this can be done by appending ":pp" to the event name or raw event encoding, or "pp" after the terminating slash of a raw event specified in the PMU syntax. For example, on Haswell, the event mem_load_uops_retired.l1_miss:pp can be specified to Linux perf (see the example after this list).
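Putting the two conditions together, a minimal sampling session might look like this (the event name is the Haswell one mentioned above; on Skylake the corresponding event is mem_load_retired.l1_miss):

$ perf record -e mem_load_uops_retired.l1_miss:pp ./progB
$ perf report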

With such an event, when the event counter overflows, the PEBS hardware is armed, which means that it's now looking for the earliest possible opportunity to collect a precise sample. When there is at least one instruction that will cause an event during this window of time, the PEBS hardware will eventually be triggered by one of these instructions, with a bias toward high-latency instructions. When the instruction that triggers PEBS retires, the PEBS microcode routine executes and captures a PEBS record, which contains among other things the IP of the instruction that triggered PEBS (which is different from the architectural IP). The instruction pointer (IP) used by perf to display the results is this eventing IP. (I noticed there can be a negligible number of samples pointing to instructions that couldn't have caused the event.)

On older microarchitectures (before Haswell and Silvermont), the "pp" precise sampling level is also supported, but PEBS on these processors only captures the architectural IP, which points to the static instruction that immediately follows the PEBS triggering instruction in program order. Linux perf uses the LBR, if possible, which contains source-target IP pairs, to determine whether that captured IP is the target of a jump. If that is the case, it adds the source IP as the eventing IP to the sample record.

Some microarchitectures support one or more events with better sampling distribution (how much better depends on the microarchitecture, the event, the counter, and the instructions being executed at the time in which the counter is about to overflow). In Linux perf, precise distribution can be enabled, if supported, by specifying the precise level "ppp."

How does Linux perf calculate the cache-references and cache-misses events

The built-in perf events that you are interested in map to the following hardware performance monitoring events on your processor:

      523,288,816      cache-references          (architectural event: LLC Reference)
      205,331,370      cache-misses              (architectural event: LLC Misses)
      237,794,728      L1-dcache-load-misses     L1D.REPLACEMENT
    3,495,080,007      L1-dcache-loads           MEM_INST_RETIRED.ALL_LOADS
    2,039,344,725      L1-dcache-stores          MEM_INST_RETIRED.ALL_STORES
      531,452,853      L1-icache-load-misses     ICACHE_64B.IFTAG_MISS
       77,062,627      LLC-loads                 OFFCORE_RESPONSE (MSR bits 0, 16, 30-37)
       27,462,249      LLC-load-misses           OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37)
       15,039,473      LLC-stores                OFFCORE_RESPONSE (MSR bits 1, 16, 30-37)
        3,829,429      LLC-store-misses          OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37)

All of these events are documented in the Intel SDM Volume 3. For more information on how to map perf events to native events, see: Hardware cache events and perf and How does perf use the offcore events?.
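As a concrete illustration of that mapping, L1D.REPLACEMENT from the table above (event code 0x51, umask 0x01) can also be requested in perf's raw PMU syntax next to the generic name; the two counts should agree. The name= field is just a label of my choosing, and ./prog stands for whatever workload you are measuring:

$ perf stat -e L1-dcache-load-misses,cpu/event=0x51,umask=0x01,name=l1d_replacement/ ./prog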

But how does perf calculate the cache-misses event? From my understanding,
if the cache-misses counts the number of memory accesses that cannot
be served by the CPU cache, then shouldn't it be equal to
LLC-load-misses + LLC-store-misses? Clearly in my case, the
cache-misses is much higher than the Last-Level-Cache-Misses number.

LLC-load-misses and LLC-store-misses count only cacheable data read requests and RFO requests, respectively, that miss in the L3 cache. LLC-load-misses also includes reads for page walking. Both exclude hardware and software prefetching. (The difference compared to Haswell is that some types of prefetch requests are counted.)

cache-misses also includes prefetch requests and code fetch requests that miss in the L3 cache. All of these events only count core-originating requests. They include requests from uops irrespective of whether they end up retiring and irrespective of the source of the response. It's unclear to me how a prefetch promoted to demand is counted.

Overall, I think cache-misses is always larger than LLC-load-misses + LLC-store-misses and cache-references is always larger than LLC-loads + LLC-stores.

The same confusion goes to cache-references. It is much lower than
L1-dcache-loads and much higher than LLC-loads+LLC-stores

It's only guaranteed that cache-references is larger than cache-misses because the former counts requests irrespective of whether they miss the L3. It's normal for L1-dcache-loads to be larger than cache-references because core-originated loads occur only when you have load instructions and because, thanks to the cache locality exhibited by many programs, most loads are satisfied by the L1 or L2 and never reach the L3. But it's not necessarily always the case because of hardware prefetches.
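If you want to check these inequalities on your own workload, a single run with all six generic events is enough (assuming enough hardware counters are available, or accepting multiplexing; ./prog is a placeholder for the program under test):

$ perf stat -e cache-references,cache-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses ./prog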

The L1-* and LLC-* events are easy to understand, as I can tell they
are read from the hardware counters in CPU.

No, it's a trap. They are not easy to understand.

PMC to count if software prefetch hit L1 cache

The rdpmc instruction is not ordered with the events that may occur before it or after it in program order. A fully serializing instruction, such as cpuid, is required to obtain the desired ordering guarantees with respect to prefetcht0. The code should be as follows:

xor  %eax, %eax         # CPUID leaf eax=0 should be fast.  Doing this before each CPUID might be a good idea, but omitted for clarity
cpuid
xorl %ecx, %ecx
rdpmc
movl %eax, %edi # save RDPMC result before CPUID overwrites EAX..EDX
cpuid
prefetcht0 (%rsi)
cpuid
xorl %ecx, %ecx
rdpmc
testl %eax, %edi # CPUID doesn't affect FLAGS
cpuid

Each of the rdpmc instructions is sandwiched between cpuid instructions. This ensures that all events that occur between the two rdpmc instructions, and only those events, are counted.

The prefetch operation of the prefetcht0 instruction may either be ignored or performed. If it was performed, it may either hit in a cache line that is in a valid state in the L1D or not. These are the cases that have to be considered.

The sum of L2_RQSTS.SWPF_HIT and L2_RQSTS.SWPF_MISS cannot be used to count or derive the number of prefetcht0 hits in the L1D, but their sum can be subtracted from SW_PREFETCH_ACCESS.T0 to get an upper bound on the number of prefetcht0 hits in the L1D. With the properly serialized sequence shown above, I think the only case where a non-ignored prefetcht0 doesn't hit in the L1D and is not counted by the sum SWPF_HIT+SWPF_MISS is if the software prefetch operation hits in an LFB allocated for a hardware prefetch.

L1-DCACHE-LOAD-MISSES is just another name for L1D.REPLACEMENT. The event code and umask you've shown for L1-DCACHE-LOAD-MISSES are incorrect. The L1D.REPLACEMENT event only occurs if the prefetch operation misses in the L1D (which causes a request to be sent to the L2) and causes a valid line in the L1D to be replaced. Usually most fills cause a replacement, but the event still cannot be used to distinguish between a prefetcht0 that hits in the L1D, a prefetcht0 that hits in an LFB allocated for a hardware prefetch, and an ignored prefetcht0.

The event LOAD_HIT_PREFETCH.SWPF occurs when a demand load hits in an LFB allocated for a software prefetch. This is obviously not useful here.

The event L1D_PEND_MISS.PENDING (event=0x48, umask=0x01) should work. According to the documentation, this event increments the counter by the number of pending L1D misses every cycle. I think it works for demand loads and prefetches. This is really an approximation, so it may count even if there are zero pending L1D misses. But I think it can still be used to determine with very high confidence whether a single prefetcht0 missed in the L1D by following these steps (a program-level sanity check with perf is sketched after the list):

  • First, add the line uint64_t value = *(volatile uint64_t*)addr; just before the inline assembly. This is to increase the probability to near 100% that the line to be prefetched is in the L1D.
  • Second, measure the minimum increment of L1D_PEND_MISS.PENDING for a prefetcht0 that is very highly likely to hit in the L1D.
  • Run the experiment many times to build high confidence that the minimum increment is highly stable, to the extent that the same exact value is observed in almost every run.
  • Comment out the line added in the first step so that the prefetcht0 misses and check that the event count change is always or almost always larger than the minimum increment measured previously.
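As a rough program-level sanity check of the event before wiring it into the rdpmc sequence, you can also count it over many repetitions of the micro-benchmark in both configurations; ./prefetch_test and its flags are hypothetical names for a harness built around the code above, and l1d_pend_miss.pending is the symbolic name perf uses for the event=0x48,umask=0x01 encoding on CPUs that export it:

$ perf stat -r 100 -e l1d_pend_miss.pending ./prefetch_test --warm-line
$ perf stat -r 100 -e l1d_pend_miss.pending ./prefetch_test --cold-line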

So far, I've only been concerned with making a distinction between a prefetch that hits in the L1D and a non-ignored prefetch that misses in both the L1D and the LFBs. Now I'll consider the rest of the cases:

  • If the prefetch results in a page fault or if the memory type of the target cache line is WC or UC, the prefetch is ignored. I don't know whether the L1D_PEND_MISS.PENDING event can be used to distinguish between a hit and this case. You can run an experiment where the target address of the prefetch instruction is in a virtual page with no valid mapping, or is mapped to a kernel page. Check if the change in the event count is unique with high probability.
  • If no LFBs are available, the prefetch is ignored. This case can be eliminated by switching off the sibling logical core and using cpuid instead of lfence before the first rdpmc.
  • If the prefetch hits in an LFB allocated for an RFO, ItoM, or a hardware prefetch request, then the prefetch is effectively redundant. For all of these types of requests, the change in the L1D_PEND_MISS.PENDING count may or may not be distinguishable from a hit in the L1D. This case can be eliminated by using cpuid instead of lfence before the first rdpmc and turning off the two L1D hardware prefetchers.
  • I don't think a prefetch to a prefetchable memory type can hit in a WCB because changing the memory type of a location is a fully serializing operation, so this case is not a problem.

One obvious advantage of using L1D_PEND_MISS.PENDING instead of the sum SWPF_HIT+SWPF_MISS is the smaller number of events. Another advantage is that L1D_PEND_MISS.PENDING is supported on some of the earlier microarchitectures. Also, as discussed above, it can be more powerful. It works on my Haswell with a threshold of 69-70 cycles.

If the changes in the L1D_PEND_MISS.PENDING count in the different cases are not distinguishable, then the sum SWPF_HIT+SWPF_MISS can be used. These two events occur at the L2, so they only tell you whether the prefetch missed in the L1D and a request was sent to and accepted by the L2. If the request is rejected or hits in the L2's SQ, neither event may occur. In addition, all of the aforementioned cases will be indistinguishable from an L1D hit.

For normal demand loads, you can use MEM_LOAD_RETIRED.L1_HIT. If the load hits in the L1D, a single L1_HIT occurs. Otherwise, in any other case, no L1_HIT events occur, assuming that no other instruction between the two rdpmcs, such as cpuid, can generate L1_HIT events. You'll have to verify that cpuid doesn't generate L1_HIT events. Don't forget to count only user-mode events because an interrupt can occur between any two instructions and the interrupt handler may generate one or more L1_HIT events in kernel mode. While it's very unlikely, if you want to be 100% sure, check also whether the occurrence of an interrupt itself generates L1_HIT events.
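With the perf tool, the equivalent of counting only user-mode events is the :u modifier; for example (./prog is a placeholder for the program under test):

$ perf stat -e mem_load_retired.l1_hit:u ./prog

If you program the counter yourself for rdpmc, the same restriction corresponds to leaving the OS bit of IA32_PERFEVTSELx clear (or setting exclude_kernel when going through perf_event_open).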

How to interpret perf iTLB-loads, iTLB-load-misses

On your Broadwell processor, perf maps iTLB-loads to ITLB_MISSES.STLB_HIT, which represents the event of a TLB lookup that misses the L1 ITLB but hits the unified TLB for all page sizes, and iTLB-load-misses to ITLB_MISSES.MISS_CAUSES_A_WALK, which represents the event of a TLB lookup that misses both the L1 ITLB and the unified TLB (causing a page walk) for all page sizes. Therefore, iTLB-load-misses can be larger or smaller than or equal to iTLB-loads. They are independent events.
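You can confirm this mapping on your machine the same way as for the L1 events earlier, by counting the generic names and the hardware-specific names side by side (assuming your perf build ships the Intel event lists; otherwise use the raw encodings):

$ perf stat -e iTLB-loads,itlb_misses.stlb_hit,iTLB-load-misses,itlb_misses.miss_causes_a_walk ./prog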

Am I correctly reasoning about cache performance?

An instance of type SideProcessor has the following fields:

    std::atomic<bool> processingRequested;
#ifdef PAD_ALIGNMENT
    std::array<bool, 64> padding;
#endif
    std::array<int, 100> dataArr;

The size of processingRequested is probably one byte. Without PAD_ALIGNMENT, the compiler will probably arrange the fields such that the first few elements of dataArr are in the same 64-byte cache line as processingRequested. However, with PAD_ALIGNMENT, there will be a 64-byte gap between the two fields, so the first element of the array and processingRequested will be in different cache lines.

Considering the loop in processData in isolation, one would expect all 100 elements of dataArr to easily fit in the L1D cache, and so the vast majority of accesses should hit in the L1D. However, the main thread reads processingRequested in while (!(sideProcessor.isDone())) { } concurrently with the processing thread executing the loop in processData. Without PAD_ALIGNMENT, the main thread wants to read from the same cache line that the processing thread wants to both read and write. This results in a false sharing situation where the shared cache line repeatedly bounces between the private caches of the two cores on which the threads are running.

With false sharing between two cores in the same LLC sharing domain, there will be a negligible number of misses by the LLC (it can backstop the requests so they don't go to DRAM), but there will be a lot of read and RFO requests from the two cores. That's why the LLC miss event counts are small.
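If you want to observe the false sharing directly rather than inferring it from the LLC counts, recent perf versions include a dedicated cache-to-cache contention analysis mode (it relies on PEBS load-latency sampling, so availability depends on the CPU and the kernel):

$ perf c2c record ./prog
$ perf c2c report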

It appears to me that the compiler has unrolled the loop in processData four times and vectorized it using 16-byte SSE instructions. This would explain why the number of stores is close to a quarter of a billion. Without PAD_ALIGNMENT, the number of loads is about a billion; about a quarter of these are from the processing thread and most of the rest are from the main thread. The number of loads executed in while (!(sideProcessor.isDone())) { } depends on the time it takes to complete the execution of processData. So it makes sense that the number of loads is much smaller in the case of no false sharing (with PAD_ALIGNMENT).

In the case without PAD_ALIGNMENT, most of the L1-dcache-load-misses and LLC-loads events are from requests generated by the main thread, while most of the LLC-stores events are from requests generated by the processing thread. All of these requests are to the line containing processingRequested. It makes sense that LLC-stores is much larger than LLC-loads because the main thread accesses the line more rapidly than the processing thread, so it's more likely that the RFOs miss in the private caches of the core on which the processing thread is running. I think also that most of the L1-dcache-load-misses events represent loads from the main thread to the shared cache line. It looks like only a third of these loads miss in the private L2 cache, which suggests that the line is being prefetched into the L2. This can be verified by disabling the L2 prefetchers and checking whether L1-dcache-load-misses becomes about equal to LLC-loads (a sketch of that check follows below).
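A sketch of that check, assuming an Intel core where MSR 0x1a4 follows the commonly documented layout (bit 0 disables the L2 hardware prefetcher, bit 1 the L2 adjacent-line prefetcher) and that msr-tools is installed:

$ sudo modprobe msr
$ sudo rdmsr -a 0x1a4        # note the current value so it can be restored
$ sudo wrmsr -a 0x1a4 0x3    # set bits 0 and 1: disable both L2 prefetchers on all cores
$ perf stat -e L1-dcache-load-misses,LLC-loads ./prog
$ sudo wrmsr -a 0x1a4 0x0    # restore (assuming the original value was 0)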


