Perf Stat Does Not Count Memory-Loads But Counts Memory-Stores

The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_* are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.
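
For example (./myapp is just a placeholder for your program; the first command does not program the load event correctly, the second does):

perf stat -e mem-loads,mem-stores ./myapp
perf stat -e mem-loads:p,mem-stores ./myapp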

MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event and it only makes sense to count it at the precise level. According to this Intel article (emphasis mine):

When a user elects to sample one of these events, special hardware is used that can keep track of a data load from issue to completion. This is more complicated than simply counting instances of an event (as with normal event-based sampling), and so *only some loads are tracked*. Loads are randomly chosen, the latency determined for each, and the correct event(s) incremented (latency >4, >8, >16, etc). Due to the nature of the sampling for this event, *only a small percentage of an application's data loads can be tracked at any one time*.

As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means counts the total number of loads; it is not designed for that purpose at all.

If you want to determine which instructions in your code are issuing load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem, and it achieves that purpose by using this event.
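
A minimal perf-mem session looks like this (./myapp is a placeholder for your program):

perf mem record ./myapp
perf mem report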

If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.

On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.

So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These map to the same raw events you used in the answer you posted, but are more portable because they also work on AMD processors.
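
A minimal sketch of that counting setup (./myapp is a placeholder for your program; on Intel these program MEM_UOPS_RETIRED.ALL_LOADS and MEM_UOPS_RETIRED.ALL_STORES, as described above):

perf stat -e L1-dcache-loads,L1-dcache-stores ./myapp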

Inconsistent `perf annotate` memory load/store time reporting

Not exactly "memory" bound, but bound on the latency of store-forwarding. The i9-9900K and i7-7700 have exactly the same microarchitecture for each core, so that's not surprising :P https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Key_changes_from_Kaby_Lake. (Except possibly for an improvement in the hardware mitigation of Meltdown, and possibly fixing the loop buffer (LSD).)

Remember that when a perf event counter overflows and triggers a sample, the out-of-order superscalar CPU has to choose exactly one of the in-flight instructions to "blame" for this cycles event. Often this is the oldest un-retired instruction in the ROB, or the one after. Be very suspicious of cycles event samples at very small scales.
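
When the sampled event supports precise sampling (PEBS on Intel), adding a precision modifier reduces this skid. A minimal sketch, with ./myapp standing in for your program:

perf record -e cycles:pp ./myapp   # :pp requests more precise sample attribution (less skid)
perf annotate                      # browse per-instruction sample counts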

Perf never blames a load that was slow to produce a result; it usually blames the instruction that was waiting for the result (in this case an xor or add), and here sometimes the store consuming the result of that xor. These aren't cache-miss loads; store-forwarding latency is only about 3 to 5 cycles on Skylake (variable, and shorter if you don't try too soon: Loop with function call faster than an empty loop), so you do have loads completing at about 2 per 3 to 5 cycles.

You have two dependency chains through memory:

  • The longer one involves two RMWs of b. This chain is twice as long and will be the overall bottleneck for the loop.
  • The other involves one RMW of a (with an extra read each iteration, which can happen in parallel with the read that's part of the next a ^= i;).

The dep chain for i only involves registers and can run far ahead of the others; it's no surprise that add $0x1,%rax has no counts. Its execution cost is totally hidden in the shadow of waiting for loads.

I'm a bit surprised there are significant counts for mov %edx,a. Perhaps it sometimes has to wait for the older store uops involving b to run on the CPU's single store-data port. (Uops are dispatched to ports oldest-ready first: How are x86 uops scheduled, exactly?)

Uops can't retire until all previous uops have executed, so it could just be getting some skew from the store at the bottom of the loop. Uops retire in groups of 4, so if the mov %edx,b does retire, the already-executed cmp/jcc, the mov load of a, and the xor %eax,%edx can retire with it. Those are not part of the dep chain that waits for b, so they're always going to be sitting in the ROB waiting to retire whenever the b store is ready to retire. (This is guesswork about how mov %edx,a could be getting counts, despite not being part of a real bottleneck.)

The store-address uops should all run far ahead of the loop because they don't have to wait for previous iterations: RIP-relative addressing (footnote 1) is ready right away. And they can run on port 7, or compete with loads for ports 2 or 3. Same for the loads: they can execute right away and detect which store they're waiting for, with the load buffer monitoring it and ready to report when the data becomes ready after the store-data uop does eventually run.

Presumably the front-end will eventually bottleneck on allocating load buffer entries, and that's what will limit how many uops can be in the back-end, not ROB or RS size.

Footnote 1: Your annotated output only shows a, not a(%rip), which is odd; it doesn't matter whether you somehow got it to use 32-bit absolute addressing, or it's just a disassembly quirk failing to show RIP-relative.

Units of perf stat statistics

The unit is a single cache access, for loads, stores, references, and misses. Loads correspond to the count of load instructions executed by the processor; the same goes for stores. Misses is the count of loads and stores that could not get their data from the cache at that level: the L1 data cache for L1-dcache-* events, and the Last Level Cache (usually L2 or L3, depending on your platform) for cache-* events.

31 691 336 329   L1-dcache-loads
    44 227 451   L1-dcache-load-misses
15 596 746 809   L1-dcache-stores
    20 575 093   L1-dcache-store-misses

    26 542 169   cache-references
    13 410 669   cache-misses

Cycles is the total count of CPU clock ticks during which the CPU executed your program. If you have a 3 GHz CPU, there will be around 3 000 000 000 cycles per second at most. If the machine was busy, there will be fewer cycles available for your program.

36 859 313 200 cycles                            

This is the total count of instructions executed by your program:

75 952 288 765 instructions                      

(I will use the G suffix as an abbreviation for billion.)

From the numbers we can conclude: 76G instructions executed in 37G cycles (around 2 instructions per cycle, a rather high IPC). You gave no information about your CPU and its frequency, but assuming a 3 GHz CPU, the running time was near 12 seconds.
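
Spelled out, with the assumed 3 GHz clock:

IPC  = 75.95G instructions / 36.86G cycles ≈ 2.06
time ≈ 36.86G cycles / 3.0 GHz ≈ 12.3 seconds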

Of the 76G instructions, you have 31G load instructions (42%) and 15G store instructions (21%); so only 37% of the instructions were not memory instructions. I don't know what the size of the memory references was (byte loads and stores, 2-byte, or wide SSE moves), but 31G load instructions looks too high for a 750 MB file (the mean is 0.02 bytes per load, yet the shortest possible load or store is a single byte). So I think your program made several copies of the data, or the file was bigger.

750 MB in 12 seconds looks rather slow (60 MBytes/s), but this can be true if the first file was read and the second file was written to disk without caching by the Linux kernel (do you have an fsync() call in your program? Are you profiling your CPU or your HDD?). With cached files and/or a RAM drive (tmpfs, a filesystem stored in RAM), this speed should be much higher.

Modern versions of perf do some simple calculations in perf stat and may also print units, as shown here: http://www.bnikolic.co.uk/blog/hpc-prof-events.html

perf stat -d  md5sum *

   578.920753 task-clock                #    0.995 CPUs utilized
           211 context-switches         #    0.000 M/sec
             4 CPU-migrations           #    0.000 M/sec
           212 page-faults              #    0.000 M/sec
 1,744,441,333 cycles                   #    3.013 GHz                     [20.22%]
 1,064,408,505 stalled-cycles-frontend  #   61.02% frontend cycles idle    [30.68%]
   104,014,063 stalled-cycles-backend   #    5.96% backend cycles idle     [41.00%]
 2,401,954,846 instructions             #    1.38  insns per cycle
                                        #    0.44  stalled cycles per insn [51.18%]
    14,519,547 branches                 #   25.080 M/sec                   [61.21%]
       109,768 branch-misses            #    0.76% of all branches         [61.48%]
   266,601,318 L1-dcache-loads          #  460.514 M/sec                   [50.90%]
    13,539,746 L1-dcache-load-misses    #    5.08% of all L1-dcache hits   [50.21%]
             0 LLC-loads                #    0.000 M/sec                   [39.19%]
             0 LLC-load-misses          #    0.00% of all LL-cache hits    [ 9.63%]  (wrong event?)

   0.581869522 seconds time elapsed

UPDATE Apr 18, 2014

please explain why cache-references are not correlating with L1-dcache numbers

Cache-references DOES correlate with the L1-dcache numbers: cache-references is close to L1-dcache-store-misses or L1-dcache-load-misses. Why are the numbers not equal? Because in your CPU (Core i5-2320) there are 3 levels of cache: L1, L2, L3; and the LLC (last level cache) is L3. So a load or store instruction first tries to get/save its data in/from the L1 cache (L1-dcache-loads, L1-dcache-stores). If the address was not cached in L1, the request goes to L2 (L1-dcache-load-misses, L1-dcache-store-misses). In this run we have no exact data on how many requests were served by L2 (those counters were not included in the default set in perf stat), but we can assume that some loads/stores were served and some were not. The requests not served by L2 then go to L3 (the LLC), and we see that there were 26M references to L3 (cache-references) and half of them (13M) were L3 misses (cache-misses; served by main RAM). The other half were L3 hits.

44M + 20M = 64M misses from L1 were passed to L2. 26M requests were passed from L2 to L3; those are L2 misses. So 64M - 26M = 38 million requests were served by L2 (L2 hits).
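
The same bookkeeping, written out level by level:

L1 misses:  44M (loads) + 20M (stores) = 64M requests sent to L2
L2 misses:  26M (cache-references, i.e. accesses that reached L3)
L2 hits:    64M - 26M = 38M
L3 misses:  13M (cache-misses, served by RAM)
L3 hits:    26M - 13M = 13M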

how to interpret perf iTLB-loads,iTLB-load-misses

On your Broadwell processor, perf maps iTLB-loads to ITLB_MISSES.STLB_HIT, which represents the event of a TLB lookup that misses the L1 ITLB but hits the unified TLB for all page sizes, and iTLB-load-misses to ITLB_MISSES.MISS_CAUSES_A_WALK, which represents the event of a TLB lookup that misses both the L1 ITLB and the unified TLB (causing a page walk) for all page sizes. Therefore, iTLB-load-misses can be larger than, smaller than, or equal to iTLB-loads. They are independent events.
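
For example (./myapp is a placeholder for your program):

perf stat -e iTLB-loads,iTLB-load-misses ./myapp

Since the two map to independent hardware events, don't read iTLB-load-misses as a subset of iTLB-loads; a "miss ratio" computed from them can even exceed 100%.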

How does Linux perf calculate the cache-references and cache-misses events

The built-in perf events that you are interested in map to the following hardware performance monitoring events on your processor:

    523,288,816   cache-references        (architectural event: LLC Reference)
    205,331,370   cache-misses            (architectural event: LLC Misses)
    237,794,728   L1-dcache-load-misses   L1D.REPLACEMENT
  3,495,080,007   L1-dcache-loads         MEM_INST_RETIRED.ALL_LOADS
  2,039,344,725   L1-dcache-stores        MEM_INST_RETIRED.ALL_STORES
    531,452,853   L1-icache-load-misses   ICACHE_64B.IFTAG_MISS
     77,062,627   LLC-loads               OFFCORE_RESPONSE (MSR bits 0, 16, 30-37)
     27,462,249   LLC-load-misses         OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37)
     15,039,473   LLC-stores              OFFCORE_RESPONSE (MSR bits 1, 16, 30-37)
      3,829,429   LLC-store-misses        OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37)

All of these events are documented in the Intel manual Volume 3. For more information on how to map perf events to native events, see: Hardware cache events and perf and How does perf use the offcore events?.
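
As a cross-check, the two architectural LLC events can also be requested in raw form, using perf's r<umask><event> syntax (event 0x2E with umask 0x4F for references and umask 0x41 for misses, per the Intel manual; ./myapp is a placeholder). The symbolic and raw counts should match:

perf stat -e cache-references,cache-misses,r4f2e,r412e ./myapp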

But how does perf calculate cache-misses event? From my understanding,
if the cache-misses counts the number of memory accesses that cannot
be served by the CPU cache, then shouldn't it be equal to
LLC-loads-misses + LLC-store-misses? Clearly in my case, the
cache-misses is much higher than the Last-Level-Cache-Misses number.

LLC-load-misses and LLC-store-misses count only cacheable data read requests and RFO requests, respectively, that miss in the L3 cache. LLC-load-misses also includes reads for page walking. Both exclude hardware and software prefetching. (The difference compared to Haswell is that some types of prefetch requests are counted.)

cache-misses also includes prefetch requests and code fetch requests that miss in the L3 cache. All of these events count only core-originating requests. They include requests from uops irrespective of whether they end up retiring and irrespective of the source of the response. It's unclear to me how a prefetch promoted to a demand request is counted.

Overall, I think cache-misses is always larger than LLC-load-misses + LLC-store-misses and cache-references is always larger than LLC-loads + LLC-stores.
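
One way to observe these inequalities on your own workload is to count all six events in a single run (./myapp is a placeholder for your program):

perf stat -e cache-references,cache-misses,LLC-loads,LLC-stores,LLC-load-misses,LLC-store-misses ./myapp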

The same confusion goes for cache-references. It is much lower than
L1-dcache-loads and much higher than LLC-loads+LLC-stores

It's only guaranteed that cache-references is larger than cache-misses, because the former counts requests irrespective of whether they miss the L3. It's normal for L1-dcache-loads to be larger than cache-references, because core-originated loads usually occur only when you have load instructions and because of the cache locality exhibited by many programs. But it's not necessarily always the case, because of hardware prefetches.

The L1-* and LLC-* events are easy to understand, as I can tell they
are read from the hardware counters in CPU.

No, it's a trap. They are not easy to understand.

What is the meaning of Perf events: dTLB-loads and dTLB-stores?

When virtual memory is enabled, the virtual address of every single memory access needs to be looked up in the TLB to obtain the corresponding physical address and determine access permissions and privileges (or raise an exception in case of an invalid mapping). The dTLB-loads and dTLB-stores events represent a TLB lookup for a data memory load or store access, respectively. This is the perf definition of these events, but the exact meaning depends on the microarchitecture.

On Westmere, Skylake, Kaby Lake, Coffee Lake, Cannon Lake (and probably Ice Lake), dTLB-loads and dTLB-stores are mapped to MEM_INST_RETIRED.ALL_LOADS and MEM_INST_RETIRED.ALL_STORES, respectively. On Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Goldmont, Goldmont Plus, they are mapped to MEM_UOP_RETIRED.ALL_LOADS and MEM_UOP_RETIRED.ALL_STORES, respectively. On Core2, Nehalem, Bonnell, Saltwell, they are mapped to L1D_CACHE_LD.MESI and L1D_CACHE_ST.MESI, respectively. (Note that on Bonnell and Saltwell, the official names of the events are L1D_CACHE.LD and L1D_CACHE.ST and the event codes used by perf are only documented in the Intel manual Volume 3 and not in other Intel sources on performance events.) The dTLB-loads and dTLB-stores events are not supported on Silvermont and Airmont.

On all current AMD processors, dTLB-loads is mapped to LsDcAccesses and dTLB-stores is not supported. However, LsDcAccesses counts TLB lookups for both loads and stores. On processors from other vendors, dTLB-loads and dTLB-stores are not supported.
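
On a microarchitecture where the events are supported, they can be counted directly (./myapp is a placeholder for your program):

perf stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses ./myapp

Keep in mind that, per the mappings above, the counts mean retired instructions, retired uops, or L1D accesses depending on the microarchitecture.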

See Hardware cache events and perf for how to map perf core events to native events.

The dTLB-loads and dTLB-stores event counts for the same program on different microarchitectures can be different not only because of differences in the microarchitectures but also because the meaning of the events is itself different. Therefore, even if the microarchitectural behavior of the program turned out to be the same on the microarchitectures, the event counts can still be different. A brief description of the native events on all Intel microarchitectures can be found here and a more detailed description on some of the microarchitectures can be found here.

Related: how to interpret perf iTLB-loads,iTLB-load-misses.


