Hardware Cache Events and Perf

Hardware cache events and perf

User @Margaret points towards a reasonable answer in the comments - read the kernel source to see the mapping for the PMU events.

We can check arch/x86/events/intel/core.c for the event definitions. I don't actually know whether "core" here refers to the Core architecture, or just that this is the core file with most of the definitions - but in any case it's the file you want to look at.

The key part is this section, which defines skl_hw_cache_event_ids:

static __initconst const u64 skl_hw_cache_event_ids
                                [PERF_COUNT_HW_CACHE_MAX]
                                [PERF_COUNT_HW_CACHE_OP_MAX]
                                [PERF_COUNT_HW_CACHE_RESULT_MAX] =
{
 [ C(L1D ) ] = {
        [ C(OP_READ) ] = {
                [ C(RESULT_ACCESS) ] = 0x81d0,  /* MEM_INST_RETIRED.ALL_LOADS */
                [ C(RESULT_MISS)   ] = 0x151,   /* L1D.REPLACEMENT */
        },
        [ C(OP_WRITE) ] = {
                [ C(RESULT_ACCESS) ] = 0x82d0,  /* MEM_INST_RETIRED.ALL_STORES */
                [ C(RESULT_MISS)   ] = 0x0,
        },
        [ C(OP_PREFETCH) ] = {
                [ C(RESULT_ACCESS) ] = 0x0,
                [ C(RESULT_MISS)   ] = 0x0,
        },
 },
...

Decoding the nested initializers, you get that L1-dcache-loads corresponds to MEM_INST_RETIRED.ALL_LOADS and L1-dcache-load-misses to L1D.REPLACEMENT.
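As an aside, the way perf turns a name like L1-dcache-loads into an index into this table is documented for perf_event_open(): a PERF_TYPE_HW_CACHE event's config is (cache id) | (op id << 8) | (result id << 16), using the same constants as the C(...) indices in the kernel table. A tiny sketch of mine (not from the kernel source) that computes the two configs discussed here:

/* Sketch: build the PERF_TYPE_HW_CACHE config values for L1-dcache-loads
 * and L1-dcache-load-misses, as described in perf_event_open(2). */
#include <linux/perf_event.h>
#include <stdio.h>

int main(void)
{
        unsigned long long l1d_loads =
                PERF_COUNT_HW_CACHE_L1D |
                (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                (PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16);  /* L1-dcache-loads */

        unsigned long long l1d_load_misses =
                PERF_COUNT_HW_CACHE_L1D |
                (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);    /* L1-dcache-load-misses */

        printf("L1-dcache-loads       config = 0x%llx\n", l1d_loads);
        printf("L1-dcache-load-misses config = 0x%llx\n", l1d_load_misses);
        return 0;
}

On Skylake, the kernel then translates those configs through skl_hw_cache_event_ids into the raw encodings 0x81d0 and 0x151 shown above.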

We can double check this with perf:

$ ocperf stat -e mem_inst_retired.all_loads,L1-dcache-loads,l1d.replacement,L1-dcache-load-misses,L1-dcache-loads,mem_load_retired.l1_hit head -c100M /dev/zero > /dev/null

Performance counter stats for 'head -c100M /dev/zero':

11,587,793 mem_inst_retired_all_loads
11,587,793 L1-dcache-loads
20,233 l1d_replacement
20,233 L1-dcache-load-misses # 0.17% of all L1-dcache hits
11,587,793 L1-dcache-loads
11,495,053 mem_load_retired_l1_hit

0.024322360 seconds time elapsed

The "Hardware Cache" events show exactly the same values as using the underlying PMU events we guessed at by checking the source.

What do perf cache events mean?

You seem to think that the cache-misses event is the sum of all the other kinds of cache misses (L1-dcache-load-misses, and so on). That is actually not true.

The cache-misses event represents the number of memory accesses that could not be served by any level of the cache.

I admit that perf's documentation is not the best around.

However, one can learn quite a lot about it by reading the documentation of the perf_event_open() function (assuming you already have a good knowledge of how a CPU and a performance monitoring unit work; this is clearly not a computer architecture course):

http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html

For example, by reading it you can see that the cache-misses event shown by perf list corresponds to PERF_COUNT_HW_CACHE_MISSES.
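To make this concrete, here is a minimal sketch along the lines of the example in that documentation, counting cache-misses (PERF_TYPE_HARDWARE / PERF_COUNT_HW_CACHE_MISSES) around a piece of work; error handling is kept to a minimum:

/* Sketch: count cache-misses (PERF_COUNT_HW_CACHE_MISSES) for this process. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
        /* glibc provides no wrapper for this system call */
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
        struct perf_event_attr attr;
        uint64_t count;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* what perf list calls cache-misses */
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */, -1, 0);
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... workload to measure goes here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        if (read(fd, &count, sizeof(count)) == sizeof(count))
                printf("cache-misses: %llu\n", (unsigned long long)count);

        close(fd);
        return 0;
}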

only 2 PERF_TYPE_HW_CACHE events in perf event group

Note that perf does allow measuring more than 2 PERF_TYPE_HW_CACHE events at the same time; the exception is the measurement of LLC-cache events.

The expectation is that, with 4 general-purpose and 3 fixed-function hardware counters, 4 HW cache events (which default to RAW events) can be measured by perf without multiplexing, even with hyper-threading ON.

sudo perf stat -e L1-icache-load-misses,L1-dcache-stores,L1-dcache-load-misses,dTLB-load-misses sleep 2

Performance counter stats for 'sleep 2':

26,893 L1-icache-load-misses
98,999 L1-dcache-stores
14,037 L1-dcache-load-misses
723 dTLB-load-misses

2.001732771 seconds time elapsed

0.001217000 seconds user
0.000000000 seconds sys

The problem appears when you try to measure events targeting the LLC cache: only 2 LLC-specific events can be measured concurrently, without multiplexing.

sudo perf stat -e LLC-load-misses,LLC-stores,LLC-store-misses,LLC-loads sleep 2

Performance counter stats for 'sleep 2':

2,419 LLC-load-misses # 0.00% of all LL-cache hits
2,963 LLC-stores
<not counted> LLC-store-misses (0.00%)
<not counted> LLC-loads (0.00%)

2.001486710 seconds time elapsed

0.001137000 seconds user
0.000000000 seconds sys

CPUs belonging to the Skylake/Kaby Lake family of microarchitectures, and some others, allow you to measure OFFCORE_RESPONSE events. Monitoring OFFCORE_RESPONSE events requires programming extra MSRs, specifically MSR_OFFCORE_RSP0 (MSR address 1A6H) and MSR_OFFCORE_RSP1 (MSR address 1A7H), in addition to programming the pair of IA32_PERFEVTSELx and IA32_PMCx registers.

Each pair of IA32_PERFEVTSELx and IA32_PMCx registers will be associated with one of the above MSRs to measure LLC-cache events.

The definition of the OFFCORE_RESPONSE MSRs can be seen here.

static struct extra_reg intel_skl_extra_regs[] __read_mostly = {
        INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff8fffull, RSP_0),
        INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff8fffull, RSP_1),
        ........
};

The 0x01b7 in the INTEL_UEVENT_EXTRA_REG call refers to event-select 0xb7 and umask 0x01. This event code 0x01b7 maps to the LLC-cache events, as can be seen here:

 [ C(LL  ) ] = {
        [ C(OP_READ) ] = {
                [ C(RESULT_ACCESS) ] = 0x1b7,   /* OFFCORE_RESPONSE */
                [ C(RESULT_MISS)   ] = 0x1b7,   /* OFFCORE_RESPONSE */
        },
        [ C(OP_WRITE) ] = {
                [ C(RESULT_ACCESS) ] = 0x1b7,   /* OFFCORE_RESPONSE */
                [ C(RESULT_MISS)   ] = 0x1b7,   /* OFFCORE_RESPONSE */
        },
        [ C(OP_PREFETCH) ] = {
                [ C(RESULT_ACCESS) ] = 0x0,
                [ C(RESULT_MISS)   ] = 0x0,
        },
 },

Going by this table, the event 0x01b7 always maps to MSR_OFFCORE_RSP_0. The kernel loops through the array of "extra registers" above and associates event->config (which contains the raw event id) with the corresponding offcore response MSR.

So this would seem to mean that only one such event can be measured at a time, since only one MSR - MSR_OFFCORE_RSP_0 - can be mapped to an LLC-cache event. But that is not the case!

The offcore registers are symmetric in nature, so when the first MSR, MSR_OFFCORE_RSP_0, is busy, perf uses the second, alternative MSR, MSR_OFFCORE_RSP_1, for measuring another offcore LLC event. The function below handles that.

static int intel_alt_er(int idx, u64 config)
{
        int alt_idx = idx;

        if (!(x86_pmu.flags & PMU_FL_HAS_RSP_1))
                return idx;

        if (idx == EXTRA_REG_RSP_0)
                alt_idx = EXTRA_REG_RSP_1;

        if (idx == EXTRA_REG_RSP_1)
                alt_idx = EXTRA_REG_RSP_0;

        if (config & ~x86_pmu.extra_regs[alt_idx].valid_mask)
                return idx;

        return alt_idx;
}

The presence of only 2 offcore registers on the Kaby Lake family of microarchitectures hinders the ability to measure more than 2 LLC-cache events concurrently, without any multiplexing.

How does Linux perf calculate the cache-references and cache-misses events

The built-in perf events that you are interested in are mapping to the following hardware performance monitoring events on your processor:

   523,288,816      cache-references          (architectural event: LLC Reference)
   205,331,370      cache-misses              (architectural event: LLC Misses)
   237,794,728      L1-dcache-load-misses     L1D.REPLACEMENT
 3,495,080,007      L1-dcache-loads           MEM_INST_RETIRED.ALL_LOADS
 2,039,344,725      L1-dcache-stores          MEM_INST_RETIRED.ALL_STORES
   531,452,853      L1-icache-load-misses     ICACHE_64B.IFTAG_MISS
    77,062,627      LLC-loads                 OFFCORE_RESPONSE (MSR bits 0, 16, 30-37)
    27,462,249      LLC-load-misses           OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37)
    15,039,473      LLC-stores                OFFCORE_RESPONSE (MSR bits 1, 16, 30-37)
     3,829,429      LLC-store-misses          OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37)
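As an aside, perf passes the offcore response MSR value for these OFFCORE_RESPONSE events through the config1 field of perf_event_attr (on Intel PMUs this is exposed as the offcore_rsp format attribute). A rough sketch of what an LLC-loads-style raw event could look like, with the mask simply assembled from the bit positions listed above (bits 0, 16, 30-37); the exact bit meanings are in the Intel manual and the value here is illustrative only:

/* Sketch only: OFFCORE_RESPONSE (umask 0x01, event-select 0xb7) with an
 * offcore response mask in config1. The mask is assembled from the bit
 * positions quoted above for LLC-loads; treat it as an illustration,
 * not a verified encoding. */
#include <linux/perf_event.h>
#include <string.h>

static void setup_llc_loads_attr(struct perf_event_attr *attr)
{
        const unsigned long long rsp_mask =
                (1ULL << 0)  |          /* request-type bit 0        */
                (1ULL << 16) |          /* supplier-info bit 16      */
                (0xFFULL << 30);        /* bits 30-37                */

        memset(attr, 0, sizeof(*attr));
        attr->size    = sizeof(*attr);
        attr->type    = PERF_TYPE_RAW;
        attr->config  = 0x01b7;         /* umask 0x01, event-select 0xb7      */
        attr->config1 = rsp_mask;       /* ends up in MSR_OFFCORE_RSP_0 or _1 */
}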

All of these events are documented in the Intel manual Volume 3. For more information on how to map perf events to native events, see: Hardware cache events and perf and How does perf use the offcore events?.

But how does perf calculate cache-misses event? From my understanding,
if the cache-misses counts the number of memory accesses that cannot
be served by the CPU cache, then shouldn't it be equal to
LLC-load-misses + LLC-store-misses? Clearly in my case, the
cache-misses is much higher than the Last-Level-Cache-Misses number.

LLC-load-misses and LLC-store-misses count only cacheable data read requests and RFO requests, respectively, that miss in the L3 cache. LLC-load-misses also includes reads for page walking. Both exclude hardware and software prefetching. (The difference compared to Haswell is that some types of prefetch requests are counted.)

cache-misses also includes prefetch requests and code fetch requests that miss in the L3 cache. All of these events only count core-originating requests. They include requests from uops irrespective of whether they end up retiring, and irrespective of the source of the response. It's unclear to me how a prefetch promoted to demand is counted.

Overall, I think cache-misses is always larger than LLC-load-misses + LLC-store-misses and cache-references is always larger than LLC-loads + LLC-stores.
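Plugging in the numbers above: LLC-load-misses + LLC-store-misses = 27,462,249 + 3,829,429 ≈ 31.3 million, while cache-misses is about 205 million; likewise LLC-loads + LLC-stores ≈ 92.1 million against roughly 523 million cache-references. Both gaps are consistent with cache-misses and cache-references also counting prefetch and code fetch requests.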

The same confusion goes for cache-references. It is much lower than
L1-dcache-loads and much higher than LLC-loads + LLC-stores

It's only guaranteed that cache-references is larger than cache-misses, because the former counts requests irrespective of whether they miss the L3. It's normal for L1-dcache-loads to be larger than cache-references because core-originated loads usually occur only when you have load instructions and because of the cache locality exhibited by many programs. But it's not necessarily always the case because of hardware prefetches.

The L1-* and LLC-* events are easy to understand, as I can tell they
are read from the hardware counters in CPU.

No, it's a trap. They are not easy to understand.

Can't sample hardware cache events with linux perf

There is a difference in the perf evlist -vvv output of the three recorded files - one for a hardware cache event, one for a hardware cycles event, and one for a software event (cs, context switches):

echo '2^234567 %2' | perf record -e L1-dcache-stores -c 100 -o cache bc
echo '2^234567 %2' | perf record -e cycles -c 100 -o cycles bc
echo '2^234567 %2' | perf record -e cs -c 100 -o cs bc

perf evlist -vvv -i cache
L1-dcache-stores: sample_freq=100, type: 3, config: 256, size: 96, sample_type: IP|TID|TIME, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, enable_on_exec: 1, sample_id_all: 1, exclude_guest: 1
perf evlist -vvv -i cycles
cycles: sample_freq=100, size: 96, sample_type: IP|TID|TIME, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, enable_on_exec: 1, sample_id_all: 1, exclude_guest: 1
perf evlist -vvv -i cs
cs: sample_freq=100, type: 1, config: 3, size: 96, sample_type: IP|TID|TIME, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, enable_on_exec: 1, sample_id_all: 1, exclude_guest: 1

There are different types, and the types are defined as:

enum perf_type_id {
        PERF_TYPE_HARDWARE      = 0,
        PERF_TYPE_SOFTWARE      = 1,
        PERF_TYPE_TRACEPOINT    = 2,
        PERF_TYPE_HW_CACHE      = 3,
        PERF_TYPE_RAW           = 4,
        PERF_TYPE_BREAKPOINT    = 5,

        PERF_TYPE_MAX,                  /* non-ABI */
};
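With these definitions the evlist output above decodes directly: for L1-dcache-stores, type: 3 is PERF_TYPE_HW_CACHE and config: 256 is 0x100, i.e. cache id 0 (L1D), op 1 (OP_WRITE) in bits 8-15 and result 0 (RESULT_ACCESS) in bits 16-23; for cs, type: 1 is PERF_TYPE_SOFTWARE and config: 3 is PERF_COUNT_SW_CONTEXT_SWITCHES.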

Perf script has an output table which defines how to print events of every kind: http://lxr.free-electrons.com/source/tools/perf/builtin-script.c?v=3.16#L68

/* default set to maintain compatibility with current format */
static struct {
        bool            user_set;
        bool            wildcard_set;
        unsigned int    print_ip_opts;
        u64             fields;
        u64             invalid_fields;
} output[PERF_TYPE_MAX] = {

        [PERF_TYPE_HARDWARE] = {
                .user_set = false,

                .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
                              PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
                              PERF_OUTPUT_EVNAME | PERF_OUTPUT_IP |
                              PERF_OUTPUT_SYM | PERF_OUTPUT_DSO,

                .invalid_fields = PERF_OUTPUT_TRACE,
        },

        [PERF_TYPE_SOFTWARE] = {
                .user_set = false,

                .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
                              PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
                              PERF_OUTPUT_EVNAME | PERF_OUTPUT_IP |
                              PERF_OUTPUT_SYM | PERF_OUTPUT_DSO,

                .invalid_fields = PERF_OUTPUT_TRACE,
        },

        [PERF_TYPE_TRACEPOINT] = {
                .user_set = false,

                .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
                              PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
                              PERF_OUTPUT_EVNAME | PERF_OUTPUT_TRACE,
        },

        [PERF_TYPE_RAW] = {
                .user_set = false,

                .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
                              PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
                              PERF_OUTPUT_EVNAME | PERF_OUTPUT_IP |
                              PERF_OUTPUT_SYM | PERF_OUTPUT_DSO,

                .invalid_fields = PERF_OUTPUT_TRACE,
        },
};
So there are no instructions for printing any field from samples with type 3 (PERF_TYPE_HW_CACHE), and perf script does not print them. We can try to register this type in the output array and even push the patch to the kernel.
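A sketch of what such an entry could look like, modeled directly on the PERF_TYPE_HARDWARE entry above (untested; newer perf versions may already handle this):

        /* Possible addition to the output[] table above, copying the field
         * set used for PERF_TYPE_HARDWARE. Untested sketch. */
        [PERF_TYPE_HW_CACHE] = {
                .user_set = false,

                .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
                              PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
                              PERF_OUTPUT_EVNAME | PERF_OUTPUT_IP |
                              PERF_OUTPUT_SYM | PERF_OUTPUT_DSO,

                .invalid_fields = PERF_OUTPUT_TRACE,
        },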

definition of linux perf cache-misses event?

The cache-misses event corresponds to the misses in the last-level cache (LLC). Note that this is an architectural performance monitoring event, which is supposed to behave consistently across microarchitectures.

This can be verified from the source code, where cache-misses maps to the encoding 0x412e.

The first two hex digits of 0x412e are the umask (0x41) and the last two are the event select (0x2e).

From the Intel software developer's manual (look at the chapter on Performance Monitoring)

Last Level Cache Misses— Event select 2EH, Umask 41H

"This event counts each cache miss condition for references to the last level on-die cache. The event count may include speculation and cache line fills due to the first-level cache hardware prefetcher, but may exclude cache line fills due to other hardware-prefetchers."


