Why Doesn't Perf Report Cache Misses

Why doesn't perf report cache misses?

On my system, an Intel Xeon X5570 @ 2.93 GHz, I was able to get perf stat to report cache references and misses by requesting those events explicitly, like this:

perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations sleep 5

 Performance counter stats for 'sleep 5':

         10573 cache-references
          1949 cache-misses              #   18.434 % of all cache refs
       1077328 cycles                    #    0.000 GHz
        715248 instructions              #    0.66  insns per cycle
        151188 branches
           154 faults
             0 migrations

   5.002776842 seconds time elapsed

The default set of events did not include cache events, which matches your results; I don't know why:

perf stat -B sleep 5

 Performance counter stats for 'sleep 5':

      0.344308 task-clock                #    0.000 CPUs utilized
             1 context-switches          #    0.003 M/sec
             0 CPU-migrations            #    0.000 M/sec
           154 page-faults               #    0.447 M/sec
        977183 cycles                    #    2.838 GHz
        586878 stalled-cycles-frontend   #   60.06% frontend cycles idle
        430497 stalled-cycles-backend    #   44.05% backend cycles idle
        720815 instructions              #    0.74  insns per cycle
                                         #    0.81  stalled cycles per insn
        152217 branches                  #  442.095 M/sec
          7646 branch-misses             #    5.02% of all branches

   5.002763199 seconds time elapsed

Why won't perf report dcache-store-misses?

Perf prints <not supported> for generic events which were requested by the user or by the default event set (in perf stat) but which are not mapped to real hardware PMU events on the current hardware. Your hardware has no exact match for the L1-dcache-store-misses generic event, so perf informs you that your request sudo perf stat -e L1-dcache-load-misses,L1-dcache-store-misses ./progB can't be fully implemented on the current machine.

Your CPU is a "Product formerly Kaby Lake", which has a Skylake PMU according to the Linux kernel file arch/x86/events/intel/core.c:

/* arch/x86/events/intel/core.c, line 4986 */
case INTEL_FAM6_KABYLAKE:
        memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));

Line 420 of this file holds the cache event mapping (generic perf event name to real hardware PMU event code) for the Skylake PMU - skl_hw_cache_event_ids. Your L1d load/store misses are the [ C(L1D ) ] - [ C(OP_READ) ] / [ C(OP_WRITE) ] - [ C(RESULT_MISS) ] fields of this strange data structure (= 0 means "not mapped"; skl_hw_cache_extra_regs at line 525 has additional umask settings for the events):

static ... const ... skl_hw_cache_event_ids ... =
{
    [ C(L1D ) ] = {
        [ C(OP_READ) ] = {
            [ C(RESULT_ACCESS) ] = 0x81d0,  /* MEM_INST_RETIRED.ALL_LOADS */
            [ C(RESULT_MISS) ]   = 0x151,   /* L1D.REPLACEMENT */
        },
        [ C(OP_WRITE) ] = {
            [ C(RESULT_ACCESS) ] = 0x82d0,  /* MEM_INST_RETIRED.ALL_STORES */
            [ C(RESULT_MISS) ]   = 0x0,
        },
        ...
    },
};

So, for Skylake, the L1d miss is defined for loads (OP_READ) but not defined for stores (OP_WRITE), while L1d accesses are defined for both operations.
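If the C(...) indexing looks cryptic: C() is just a token-pasting shorthand over the generic-event enums from include/uapi/linux/perf_event.h (the kernel defines it this way in arch/x86/events/core.c). Here is a minimal self-contained sketch - not kernel code, the table slice is copied from the listing above - of how that structure is indexed; a zero entry is exactly what becomes <not supported>:

#include <stdio.h>
#include <linux/perf_event.h>   /* uapi enums: PERF_COUNT_HW_CACHE_* */

#define C(x) PERF_COUNT_HW_CACHE_##x    /* same shorthand the kernel uses */

/* toy copy of just the L1D row of skl_hw_cache_event_ids */
static const unsigned long long skl_l1d
        [PERF_COUNT_HW_CACHE_OP_MAX][PERF_COUNT_HW_CACHE_RESULT_MAX] = {
    [C(OP_READ)]  = { [C(RESULT_ACCESS)] = 0x81d0, [C(RESULT_MISS)] = 0x151 },
    [C(OP_WRITE)] = { [C(RESULT_ACCESS)] = 0x82d0, [C(RESULT_MISS)] = 0x0   },
};

int main(void)
{
    /* 0 = no hardware event for this combination -> <not supported> */
    printf("L1d load  miss -> raw event %#llx\n", skl_l1d[C(OP_READ)][C(RESULT_MISS)]);
    printf("L1d store miss -> raw event %#llx\n", skl_l1d[C(OP_WRITE)][C(RESULT_MISS)]);
    return 0;
}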

These generic events were probably created a long time ago, when hardware had PMU events to implement them. For example, the Core 2 PMU has a mapping for these events (arch/x86/events/intel/core.c, line 1254, the core2_hw_cache_event_ids const): the L1d read miss is L1D_CACHE_LD.I_STATE and the L1d write miss is L1D_CACHE_ST.I_STATE. The perf subsystem in the kernel simply had to keep the many generic event names added in old versions for compatibility.

You should check the output of the sudo perf list cache command to select supported events for your CPU and its PMU. This command (in recent perf tool versions) will output only the mapped generic names and will also print hardware-specific event names. You should also check the Intel SDM and the optimization and performance-counter manuals to understand how loads and stores are implemented and which PMU events you should use to count hardware events.
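You can also see the same mapping from user space with the raw perf_event_open(2) interface; the id | (op << 8) | (result << 16) encoding below comes from the man page. This is only a sketch: on a Skylake-derived PMU the store-miss event should fail to open (which perf stat renders as <not supported>), while the load-miss event should open fine.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

/* generic cache events are encoded as: id | (op << 8) | (result << 16) */
#define CACHE_EVENT(id, op, result)              \
    (PERF_COUNT_HW_CACHE_##id |                  \
     (PERF_COUNT_HW_CACHE_OP_##op << 8) |        \
     (PERF_COUNT_HW_CACHE_RESULT_##result << 16))

/* glibc has no perf_event_open() wrapper, so call the syscall directly */
static long perf_open(unsigned long long config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.config = config;
    attr.disabled = 1;
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    long load  = perf_open(CACHE_EVENT(L1D, READ,  MISS));
    long store = perf_open(CACHE_EVENT(L1D, WRITE, MISS));
    printf("L1-dcache-load-misses:  %s\n", load  >= 0 ? "ok" : "not supported");
    printf("L1-dcache-store-misses: %s\n", store >= 0 ? "ok" : "not supported");
    return 0;
}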

While L1d store misses are not available on your CPU, you should think about what a store miss is and how it is implemented. Probably the request will be passed to the next level of the cache/memory hierarchy, for example it becomes an L2 store access. The perf generic event set is ugly (it was introduced in the era of the two-level cache of Core 2) and has only L1 and LLC (last level cache) cache events. I'm not sure how LLC is mapped in the current era of shared L3 - is it L2 or L3 (on Skylake, LLC = L3)? But the Intel-specific events should work.

Linux perf reporting cache misses for unexpected instruction

About your example:

There are several instructions just before and at the position of the high counter:

        │       movsd  (%rcx,%rsi,8),%xmm0
  0.13  │       ucomis (%rcx,%rdx,8),%xmm0
 57.99  │    ↑  jbe    ff

"movsd" loads word from (%rcx,%rsi,8) (some array access) into xmm0 register, and "ucomis" loads another word from (%rcx,%rdx,8) and compares it with just loaded value in xmm0 register. "jbe" is conditional jump which depends on compare outcome.

Many modern Intel CPUs (and probably AMD too) can and will fuse (combine) some combinations of operations "into a single uop, CMP+JCC" (realworldtech.com/nehalem/5), and cmp + conditional jump is a very common instruction combination to be fused (you can check it with Intel's IACA simulation tool; use version 2.1 for your CPU). A fused pair may be reported incorrectly in perf/PMUs/PEBS, with most events skewed towards one of the two instructions.

This code probably means that the expression "dist[i] < dist[tmp]" generates two memory accesses, and both values are used in the ucomis instruction, which is (partially?) fused with the jbe conditional jump. Either dist[i] or dist[tmp] or both expressions generate a high number of misses. Any such miss will block ucomis from generating its result and block jbe from passing the next instruction on to execute (or from retiring predicted instructions). So, jbe may get all the fame of the high counters instead of the real memory-access instructions (and for a "far" event like a cache response there is some skew towards the last blocked instruction).

You may try to merge the visited[N] and dist[N] arrays into an array[N] of struct { int visited; float dist; } to force prefetching of array[i].dist when you access array[i].visited; or you may try to change the order of vertex accesses, renumber the graph vertices, or do a software prefetch of the next one or more elements. A sketch of the merged layout follows.
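This sketch only illustrates the layout change; the names visited/dist come from your expression, while N and the loop are hypothetical stand-ins for your actual code:

#define N 100000   /* hypothetical vertex count */

/* Merged layout: both fields of vertex i sit in the same cache line,
 * so loading v[i].visited effectively prefetches v[i].dist too. */
struct vertex_data {
    int   visited;
    float dist;
};

static struct vertex_data v[N];

/* hypothetical stand-in for the hot loop around "dist[i] < dist[tmp]" */
static int nearest_unvisited(int tmp)
{
    int best = tmp;
    for (int i = 0; i < N; i++)
        if (!v[i].visited && v[i].dist < v[best].dist)
            best = i;   /* one cache line now serves both field reads for i */
    return best;
}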


About problems with generic perf event names, and possible uncore skew.

The perf (perf_events) tool in Linux uses a predefined set of events when called as perf list, and some of the listed hardware events may be not implemented; others are mapped onto the current CPU's capabilities (and some mappings are not fully correct). Some basic info about the real PMU is in the guide you linked, https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf (but it has more details for the related Nehalem-EP variant).

For your Nehalem (Intel Core i5 750 with 8 MB of L3 cache and without multi-CPU/multi-socket/NUMA support), perf will map the standard ("Generic cache events") LLC-load-misses event to "OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS", as written in the best (and only) documentation of perf event mappings - the kernel source code:

http://elixir.free-electrons.com/linux/v4.8/source/arch/x86/events/intel/core.c#L1103

u64 nehalem_hw_cache_event_ids ...
    [ C(LL ) ] = {
        [ C(OP_READ) ] = {
            /* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */
            [ C(RESULT_ACCESS) ] = 0x01b7,
            /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */
            [ C(RESULT_MISS) ]   = 0x01b7,
            ...

/*
 * Nehalem/Westmere MSR_OFFCORE_RESPONSE bits;
 * See IA32 SDM Vol 3B 30.6.1.3
 */
#define NHM_DMND_DATA_RD    (1 << 0)
#define NHM_DMND_READ       (NHM_DMND_DATA_RD)
#define NHM_L3_MISS         (NHM_NON_DRAM|NHM_LOCAL_DRAM|NHM_REMOTE_DRAM|NHM_REMOTE_CACHE_FWD)
...

u64 nehalem_hw_cache_extra_regs
    ...
    [ C(LL ) ] = {
        [ C(OP_READ) ] = {
            [ C(RESULT_ACCESS) ] = NHM_DMND_READ|NHM_L3_ACCESS,
            [ C(RESULT_MISS) ]   = NHM_DMND_READ|NHM_L3_MISS,

I think this event is not precise: the CPU pipeline will post (out of order) a load request to the cache hierarchy and will keep executing other instructions. After some time (around 10 cycles to reach and get a response from L2, and 40 cycles to reach L3) there will be a response with the miss flag in the corresponding (offcore?) PMU, which increments the counter. On counter overflow, a profiling interrupt will be generated from this PMU. It takes several CPU clock cycles for the interrupt to reach the pipeline and stop it; the perf_events subsystem's handler then registers the current (interrupted) EIP/RIP instruction pointer and resets the PMU counter back to some negative value (for example, -100000 to get an interrupt for every 100000 L3 misses counted; use perf record -e LLC-load-misses -c 100000 to set the exact count, or perf will auto-tune the limit to reach some default frequency). The registered EIP/RIP is not the IP of the load instruction, and it may also not be the EIP/RIP of the instruction which wants to use the loaded data.
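As an illustration of that mechanism, here is a minimal sketch (not the perf tool's actual code) of what perf record -e LLC-load-misses -c 100000 asks the kernel for. The generic LL/read/miss encoding is from perf_event_open(2); on your Nehalem the kernel internally translates it into the OFFCORE_RESPONSE event (0x01b7 plus the extra-register bits shown above):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size          = sizeof(attr);
    attr.type          = PERF_TYPE_HW_CACHE;
    attr.config        = PERF_COUNT_HW_CACHE_LL |
                         (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                         (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.sample_period = 100000;          /* interrupt after every 100000 misses */
    attr.sample_type   = PERF_SAMPLE_IP;  /* record the interrupted RIP (with skid) */
    attr.disabled      = 1;

    long fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    /* the real tool mmap()s a ring buffer on fd and parses the
     * PERF_RECORD_SAMPLE records written by the interrupt handler */
    return fd < 0;
}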

But if your CPU is the only socket in the system and you access normal memory (not some mapped PCI Express space), an L3 miss will in fact be served by a local memory access, and there are some counters for this too... (https://software.intel.com/en-us/node/596851 - "Any memory requests missing here must be serviced by local or remote DRAM").

There are some listings of PMU events for your CPU:

  • Official Intel's "Intel® 64 and IA-32 Architectures Software Developer Manuals" (SDM): https://software.intel.com/en-us/articles/intel-sdm, Volume 3, Appendix A

    • 3B: https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf "18.8 PERFORMANCE MONITORING FOR PROCESSORS BASED ON INTEL® MICROARCHITECTURE CODE NAME NEHALEM" from page 213 "Vol 3B 18-35"
    • 3B: https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf "19.8 - Processors based on Intel® microarchitecture code name Nehalem" from page 365 and "Vol. 3B 19-61")
    • Some other volume for Offcore response encoding? Vol. 3A 18-26?
  • from oprofile http://oprofile.sourceforge.net/docs/intel-corei7-events.php

  • from libpfm4's showevtinfo: http://www.bnikolic.co.uk/blog/hpc-prof-events.html (note, this page shows a Sandy Bridge list; get libpfm4 and run it on your PC to get your list). There is also a check_events tool in libpfm4 to help you encode events as raw events for perf.
  • from VTune documentation: http://www.hpc.ut.ee/dokumendid/ips_xe_2015/vtune_amplifier_xe/documentation/en/help/reference/pmw_sp/events/offcore_response.html
  • from Nehalem PMU guide: https://software.intel.com/sites/default/files/m/5/2/c/f/1/30320-Nehalem-PMU-Programming-Guide-Core.pdf
  • the ocperf tool from Intel's perf developer Andi Kleen, part of his pmu-tools: https://github.com/andikleen/pmu-tools. ocperf is just a wrapper for perf; this package will download event descriptions, and any supported event name will be converted into the correct raw encoding for perf.

There should be some information about the ANY_LLC_MISS offcore PMU event implementation and a list of PEBS events for Nehalem, but I can't find it now.

I can recommend that you use ocperf from https://github.com/andikleen/pmu-tools with any PMU events of your CPU, without needing to encode them manually as raw events. There are some PEBS events in your CPU, and there is latency profiling / perf mem for some kinds of memory access profiling (some random perf mem PDFs: the 2012 post "perf: add memory access sampling support", RH 2013 - pg 26-30, still not documented in 2015 - sowa pg 19, ls /sys/devices/cpu/events). For newer CPUs there are newer tools like ucevent.
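At the API level, PEBS ("precise") sampling is requested through the precise_ip field of perf_event_attr; the perf tool's :p/:pp/:ppp event suffixes map to levels 1-3 (see perf_event_open(2)). A hedged fragment - the raw event code here is a placeholder, pick a PEBS-capable event for your CPU from the lists above:

#include <linux/perf_event.h>
#include <string.h>

void setup_precise(struct perf_event_attr *attr, unsigned long long raw_event)
{
    memset(attr, 0, sizeof(*attr));
    attr->size          = sizeof(*attr);
    attr->type          = PERF_TYPE_RAW;
    attr->config        = raw_event;   /* placeholder: some PEBS-capable event code */
    attr->sample_period = 100003;
    attr->sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR;  /* PEBS can supply the data address (used by perf mem) */
    attr->precise_ip    = 2;           /* the ":pp" level: request PEBS */
}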

I can also recommend that you try the cachegrind profiler/cache simulator tool of the valgrind program, with the kcachegrind GUI to view profiles. Valgrind-based profilers may help you get a basic idea of how the code works: they collect exact instruction execution counts for every instruction, and cachegrind also simulates a somewhat abstract multi-level cache. But a real CPU will execute several instructions per cycle (so the callgrind/cachegrind cost model of 1 instruction = 1 CPU clock cycle gives some error, and cachegrind's cache model does not have the same logic as a real cache). And all valgrind tools are dynamic binary instrumentation tools, which will slow your program down 20-30 times compared to a native run.

How does Linux perf calculate the cache-references and cache-misses events

The built-in perf events that you are interested in map to the following hardware performance monitoring events on your processor:

  523,288,816   cache-references         (architectural event: LLC Reference)
  205,331,370   cache-misses             (architectural event: LLC Misses)
  237,794,728   L1-dcache-load-misses    L1D.REPLACEMENT
3,495,080,007   L1-dcache-loads          MEM_INST_RETIRED.ALL_LOADS
2,039,344,725   L1-dcache-stores         MEM_INST_RETIRED.ALL_STORES
  531,452,853   L1-icache-load-misses    ICACHE_64B.IFTAG_MISS
   77,062,627   LLC-loads                OFFCORE_RESPONSE (MSR bits 0, 16, 30-37)
   27,462,249   LLC-load-misses          OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37)
   15,039,473   LLC-stores               OFFCORE_RESPONSE (MSR bits 1, 16, 30-37)
    3,829,429   LLC-store-misses         OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37)

All of these events are documented in the Intel manual Volume 3. For more information on how to map perf events to native events, see Hardware cache events and perf and How does perf use the offcore events?

But how does perf calculate cache-misses event? From my understanding,
if the cache-misses counts the number of memory accesses that cannot
be served by the CPU cache, then shouldn't it be equal to
LLC-load-misses + LLC-store-misses? Clearly in my case, the
cache-misses is much higher than the Last-Level-Cache-Misses number.

LLC-load-misses and LLC-store-misses count only cacheable data read requests and RFO requests, respectively, that miss in the L3 cache. LLC-load-misses also includes reads for page walking. Both exclude hardware and software prefetching. (The difference compared to Haswell is that some types of prefetch requests are counted.)

cache-misses also includes prefetch requests and code fetch requests that miss in the L3 cache. All of these events only count core-originating requests. They include requests from uops irrespective of whether they end up retiring and irrespective of the source of the response. It's unclear to me how a prefetch promoted to demand is counted.

Overall, I think cache-misses is always larger than LLC-load-misses + LLC-store-misses, and cache-references is always larger than LLC-loads + LLC-stores. Your numbers bear this out: 27,462,249 + 3,829,429 = 31,291,678 LLC data misses versus 205,331,370 cache-misses, and 77,062,627 + 15,039,473 = 92,102,100 LLC data references versus 523,288,816 cache-references.

The same confusion goes for cache-references. It is much lower than
L1-dcache-loads and much higher than LLC-loads+LLC-stores

It's only guaranteed that cache-references is larger than cache-misses, because the former counts requests irrespective of whether they miss the L3. It's normal for L1-dcache-loads to be larger than cache-references, because core-originated loads usually occur only when you have load instructions and because of the cache locality exhibited by many programs. But it's not necessarily always the case, because of hardware prefetches.

The L1-* and LLC-* events are easy to understand, as I can tell they
are read from the hardware counters in CPU.

No, it's a trap. They are not easy to understand.


