Hardware cache events and perf
User @Margaret points towards a reasonable answer in the comments - read the kernel source to see the mapping for the PMU events.
We can check arch/x86/events/intel/core.c for the event definitions. I don't actually know if "core" here refers to the Core architecture, or just that this is the core file with most definitions - but in any case it's the file you want to look at.
The key part is this section, which defines skl_hw_cache_event_ids:
static __initconst const u64 skl_hw_cache_event_ids
                [PERF_COUNT_HW_CACHE_MAX]
                [PERF_COUNT_HW_CACHE_OP_MAX]
                [PERF_COUNT_HW_CACHE_RESULT_MAX] =
{
 [ C(L1D ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = 0x81d0, /* MEM_INST_RETIRED.ALL_LOADS */
        [ C(RESULT_MISS) ] = 0x151, /* L1D.REPLACEMENT */
    },
    [ C(OP_WRITE) ] = {
        [ C(RESULT_ACCESS) ] = 0x82d0, /* MEM_INST_RETIRED.ALL_STORES */
        [ C(RESULT_MISS) ] = 0x0,
    },
    [ C(OP_PREFETCH) ] = {
        [ C(RESULT_ACCESS) ] = 0x0,
        [ C(RESULT_MISS) ] = 0x0,
    },
 },
...
Decoding the nested initializers, you get that L1-dcache-loads corresponds to MEM_INST_RETIRED.ALL_LOADS and L1-dcache-load-misses to L1D.REPLACEMENT.
We can double check this with perf:
$ ocperf stat -e mem_inst_retired.all_loads,L1-dcache-loads,l1d.replacement,L1-dcache-load-misses,L1-dcache-loads,mem_load_retired.l1_hit head -c100M /dev/zero > /dev/null
Performance counter stats for 'head -c100M /dev/zero':
11,587,793 mem_inst_retired_all_loads
11,587,793 L1-dcache-loads
20,233 l1d_replacement
20,233 L1-dcache-load-misses # 0.17% of all L1-dcache hits
11,587,793 L1-dcache-loads
11,495,053 mem_load_retired_l1_hit
0.024322360 seconds time elapsed
The "Hardware Cache" events show exactly the same values as using the underlying PMU events we guessed at by checking the source.
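The percentage perf prints next to L1-dcache-load-misses is simply the miss count divided by the load count. A minimal C sketch of that arithmetic, using the numbers from the run above (the names miss_percent and check_sample_run are made up for illustration, not part of perf):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: misses as a percentage of accesses.
 * Returns 0.0 when there were no accesses, to avoid dividing by zero. */
double miss_percent(uint64_t misses, uint64_t accesses)
{
    if (accesses == 0)
        return 0.0;
    return 100.0 * (double)misses / (double)accesses;
}

/* Recomputes the ratio for the sample run:
 * l1d.replacement / mem_inst_retired.all_loads. */
void check_sample_run(void)
{
    double pct = miss_percent(20233, 11587793);

    /* perf annotated this line with 0.17%, so the value must fall in this band. */
    assert(pct > 0.17 && pct < 0.18);
    printf("%.2f%% of L1-dcache loads missed\n", pct);
}
```

Calling check_sample_run() from a main() should reproduce the 0.17% figure shown in the perf output.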
What are perf cache events meaning?
You seem to think that the cache-misses event is the sum of all other kinds of cache misses (L1-dcache-load-misses, and so on). That is actually not true.
The cache-misses event represents the number of memory accesses that could not be served by any of the caches.
I admit that perf's documentation is not the best around.
However, you can learn quite a lot about it by reading the documentation of the perf_event_open() function (assuming that you already have a good knowledge of how a CPU and a performance monitoring unit work; this is clearly not a computer architecture course):
http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
For example, by reading it you can see that the cache-misses event shown by perf list corresponds to PERF_COUNT_HW_CACHE_MISSES.
only 2 PERF_TYPE_HW_CACHE events in perf event group
Note that perf does allow measuring more than 2 PERF_TYPE_HW_CACHE events at the same time; the exception is the measurement of LLC-cache events.
The expectation is that, with 4 general-purpose and 3 fixed-purpose hardware counters, 4 HW cache events (which default to RAW events) can be measured in perf without multiplexing, with hyper-threading ON.
sudo perf stat -e L1-icache-load-misses,L1-dcache-stores,L1-dcache-load-misses,dTLB-load-misses sleep 2
Performance counter stats for 'sleep 2':
26,893 L1-icache-load-misses
98,999 L1-dcache-stores
14,037 L1-dcache-load-misses
723 dTLB-load-misses
2.001732771 seconds time elapsed
0.001217000 seconds user
0.000000000 seconds sys
The problem appears when you try to measure events targeting the LLC-cache. perf seems able to measure only 2 LLC-cache specific events concurrently without multiplexing.
sudo perf stat -e LLC-load-misses,LLC-stores,LLC-store-misses,LLC-loads sleep 2
Performance counter stats for 'sleep 2':
2,419 LLC-load-misses # 0.00% of all LL-cache hits
2,963 LLC-stores
<not counted> LLC-store-misses (0.00%)
<not counted> LLC-loads (0.00%)
2.001486710 seconds time elapsed
0.001137000 seconds user
0.000000000 seconds sys
CPUs belonging to the Skylake/Kaby Lake family of microarchitectures, and some others, allow you to measure OFFCORE_RESPONSE events. Monitoring OFFCORE_RESPONSE events requires programming extra MSRs, specifically MSR_OFFCORE_RSP0 (MSR address 1A6H) and MSR_OFFCORE_RSP1 (MSR address 1A7H), in addition to programming the pair of IA32_PERFEVTSELx and IA32_PMCx registers.
Each pair of IA32_PERFEVTSELx and IA32_PMCx registers will be associated with one of the above MSRs to measure LLC-cache events.
The definition of the OFFCORE_RESPONSE MSRs can be seen here.
static struct extra_reg intel_skl_extra_regs[] __read_mostly = {
    INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff8fffull, RSP_0),
    INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff8fffull, RSP_1),
    ........
}
0x01b7 in the INTEL_UEVENT_EXTRA_REG call refers to event code b7 and umask 01. This event code 0x01b7 maps to LLC-cache events, as can be seen here -
 [ C(LL ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = 0x1b7, /* OFFCORE_RESPONSE */
        [ C(RESULT_MISS) ] = 0x1b7, /* OFFCORE_RESPONSE */
    },
    [ C(OP_WRITE) ] = {
        [ C(RESULT_ACCESS) ] = 0x1b7, /* OFFCORE_RESPONSE */
        [ C(RESULT_MISS) ] = 0x1b7, /* OFFCORE_RESPONSE */
    },
    [ C(OP_PREFETCH) ] = {
        [ C(RESULT_ACCESS) ] = 0x0,
        [ C(RESULT_MISS) ] = 0x0,
    },
 },
The event 0x01b7 will always map to MSR_OFFCORE_RSP_0, as can be seen here. The function linked above loops through the array of all the "extra registers" and associates event->config (which contains the raw event id) with the offcore response MSR.
So, this would mean only one event can be measured at a time, since only one MSR, MSR_OFFCORE_RSP_0, can be mapped to an LLC-cache event. But that is not the case!
The offcore registers are symmetric in nature, so when the first MSR, MSR_OFFCORE_RSP_0, is busy, perf uses the alternative MSR, MSR_OFFCORE_RSP_1, for measuring another offcore LLC event. This function here helps in doing that.
static int intel_alt_er(int idx, u64 config)
{
    int alt_idx = idx;

    if (!(x86_pmu.flags & PMU_FL_HAS_RSP_1))
        return idx;

    if (idx == EXTRA_REG_RSP_0)
        alt_idx = EXTRA_REG_RSP_1;

    if (idx == EXTRA_REG_RSP_1)
        alt_idx = EXTRA_REG_RSP_0;

    if (config & ~x86_pmu.extra_regs[alt_idx].valid_mask)
        return idx;

    return alt_idx;
}
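To see why two offcore MSRs cap the number of concurrent LLC events at two, here is a toy userspace model of the allocation. It is only a sketch under simplifying assumptions: struct offcore_msr and offcore_assign are hypothetical names, the real kernel also checks valid_mask and keeps reference counts, and the rule that two events needing the same MSR value can share a register reflects my reading of the shared-register code rather than anything quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_OFFCORE_MSRS 2  /* models MSR_OFFCORE_RSP_0 and MSR_OFFCORE_RSP_1 */

/* Toy model of one offcore response MSR: either free or holding one value. */
struct offcore_msr {
    bool     in_use;
    uint64_t value;
};

/*
 * Try to place an event that needs `value` programmed into an offcore MSR.
 * In the spirit of intel_alt_er(), fall through to the other register when
 * the first one is busy with a different value.
 * Returns the index of the MSR used, or -1 if the event cannot be scheduled.
 */
int offcore_assign(struct offcore_msr msrs[NUM_OFFCORE_MSRS], uint64_t value)
{
    assert(msrs != NULL);

    for (int i = 0; i < NUM_OFFCORE_MSRS; i++) {
        if (!msrs[i].in_use) {
            /* Free register: claim it for this value. */
            msrs[i].in_use = true;
            msrs[i].value = value;
            return i;
        }
        if (msrs[i].value == value)
            return i;  /* identical programming: share the register */
    }
    return -1;  /* both MSRs hold other values */
}
```

With both registers initially free, four events that each need a distinct value get indices 0, 1, -1 and -1 in this model, which matches the two counted and two "<not counted>" LLC events in the perf stat output earlier.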
Having only 2 offcore registers on the Kaby Lake family of microarchitectures limits perf to measuring at most 2 LLC-cache events concurrently without multiplexing.
How does Linux perf calculate the cache-references and cache-misses events
The built-in perf events that you are interested in map to the following hardware performance monitoring events on your processor:
Count           perf event              Native hardware event
  523,288,816   cache-references        (architectural event: LLC Reference)
  205,331,370   cache-misses            (architectural event: LLC Misses)
  237,794,728   L1-dcache-load-misses   L1D.REPLACEMENT
3,495,080,007   L1-dcache-loads         MEM_INST_RETIRED.ALL_LOADS
2,039,344,725   L1-dcache-stores        MEM_INST_RETIRED.ALL_STORES
  531,452,853   L1-icache-load-misses   ICACHE_64B.IFTAG_MISS
   77,062,627   LLC-loads               OFFCORE_RESPONSE (MSR bits 0, 16, 30-37)
   27,462,249   LLC-load-misses         OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37)
   15,039,473   LLC-stores              OFFCORE_RESPONSE (MSR bits 1, 16, 30-37)
    3,829,429   LLC-store-misses        OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37)
All of these events are documented in the Intel manual Volume 3. For more information on how to map perf events to native events, see: Hardware cache events and perf and How does perf use the offcore events?.
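The "MSR bits" lists in the table can be turned into the actual 64-bit values written to the offcore response MSR. The following C sketch does only that arithmetic; bit_range and the llc_*_mask functions are hypothetical helper names, and the valid-bit mask is the 0x3fffff8fff constant from the intel_skl_extra_regs definition quoted earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Valid-bit mask for the Skylake offcore response MSRs
 * (third argument of INTEL_UEVENT_EXTRA_REG in intel_skl_extra_regs). */
#define SKL_OFFCORE_VALID_MASK 0x3fffff8fffULL

/* Mask with bits lo..hi (inclusive) set; requires lo <= hi <= 63. */
uint64_t bit_range(unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi <= 63);
    unsigned width = hi - lo + 1;
    uint64_t ones = (width == 64) ? ~0ULL : ((1ULL << width) - 1);
    return ones << lo;
}

/* LLC-loads: bits 0, 16, 30-37 */
uint64_t llc_loads_mask(void)
{
    return bit_range(0, 0) | bit_range(16, 16) | bit_range(30, 37);
}

/* LLC-load-misses: bits 0, 17, 26-29, 30-37 */
uint64_t llc_load_misses_mask(void)
{
    return bit_range(0, 0) | bit_range(17, 17) | bit_range(26, 29) | bit_range(30, 37);
}

/* LLC-stores: bits 1, 16, 30-37 */
uint64_t llc_stores_mask(void)
{
    return bit_range(1, 1) | bit_range(16, 16) | bit_range(30, 37);
}

/* LLC-store-misses: bits 1, 17, 26-29, 30-37 */
uint64_t llc_store_misses_mask(void)
{
    return bit_range(1, 1) | bit_range(17, 17) | bit_range(26, 29) | bit_range(30, 37);
}
```

Under these bit lists, LLC-loads works out to 0x3FC0010001 and LLC-load-misses to 0x3FFC020001. All four values differ, and none sets a bit outside SKL_OFFCORE_VALID_MASK.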
But how does perf calculate cache-misses event? From my understanding,
if the cache-misses counts the number of memory accesses that cannot
be served by the CPU cache, then shouldn't it be equal to
LLC-loads-misses + LLC-store-misses? Clearly in my case, the
cache-misses is much higher than the Last-Level-Cache-Misses number.
LLC-load-misses and LLC-store-misses count only cacheable data read requests and RFO requests, respectively, that miss in the L3 cache. LLC-load-misses also includes reads for page walking. Both exclude hardware and software prefetching. (The difference compared to Haswell is that some types of prefetch requests are counted.)
cache-misses also includes prefetch requests and code fetch requests that miss in the L3 cache. All of these events only count core-originating requests. They include requests from uops irrespective of whether they end up retiring and irrespective of the source of the response. It's unclear to me how a prefetch promoted to demand is counted.
Overall, I think cache-misses is always larger than LLC-load-misses + LLC-store-misses and cache-references is always larger than LLC-loads + LLC-stores.
The same confusion goes to cache-reference. It is much lower than
L1-dcache-loads and much higher then LLC-loads+LLC-stores
It's only guaranteed that cache-references is larger than cache-misses because the former counts requests irrespective of whether they miss the L3. It's normal for L1-dcache-loads to be larger than cache-references because core-originated loads usually occur only when you have load instructions and because of the cache locality exhibited by many programs. But it's not necessarily always the case because of hardware prefetches.
The L1-* and LLC-* events are easy to understand, as I can tell they
are read from the hardware counters in CPU.
No, it's a trap. They are not easy to understand.
Can't sample hardware cache events with linux perf
There is a difference in the perf evlist -vvv output of three perf.data files: one for a cache event, the second for a software event, and the last for the hardware cycles event:
echo '2^234567 %2' | perf record -e L1-dcache-stores -c 100 -o cache bc
echo '2^234567 %2' | perf record -e cycles -c 100 -o cycles bc
echo '2^234567 %2' | perf record -e cs -c 100 -o cs bc
perf evlist -vvv -i cache
L1-dcache-stores: sample_freq=100, type: 3, config: 256, size: 96, sample_type: IP|TID|TIME, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, enable_on_exec: 1, sample_id_all: 1, exclude_guest: 1
perf evlist -vvv -i cycles
cycles: sample_freq=100, size: 96, sample_type: IP|TID|TIME, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, enable_on_exec: 1, sample_id_all: 1, exclude_guest: 1
perf evlist -vvv -i cs
cs: sample_freq=100, type: 1, config: 3, size: 96, sample_type: IP|TID|TIME, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, enable_on_exec: 1, sample_id_all: 1, exclude_guest: 1
There are different types, and the types are defined as:
enum perf_type_id {
    PERF_TYPE_HARDWARE = 0,
    PERF_TYPE_SOFTWARE = 1,
    PERF_TYPE_TRACEPOINT = 2,
    PERF_TYPE_HW_CACHE = 3,
    PERF_TYPE_RAW = 4,
    PERF_TYPE_BREAKPOINT = 5,

    PERF_TYPE_MAX, /* non-ABI */
};
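The "type: 3, config: 256" pair in the evlist output for L1-dcache-stores can be decoded by hand. For PERF_TYPE_HW_CACHE events, the perf_event_open documentation describes config as cache id | (op id << 8) | (result id << 16). The C sketch below mirrors that layout with local enums (the HWC_* names and both functions are made up here so the example builds without <linux/perf_event.h>; the numeric values follow the kernel's perf_hw_cache_* enums).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of perf_hw_cache_id, perf_hw_cache_op_id and
 * perf_hw_cache_op_result_id, in the kernel's order. */
enum hwc_id  { HWC_L1D, HWC_L1I, HWC_LL, HWC_DTLB, HWC_ITLB, HWC_BPU, HWC_NODE };
enum hwc_op  { HWC_OP_READ, HWC_OP_WRITE, HWC_OP_PREFETCH };
enum hwc_res { HWC_RESULT_ACCESS, HWC_RESULT_MISS };

/* Pack the three fields the way PERF_TYPE_HW_CACHE expects. */
uint64_t hw_cache_config(enum hwc_id id, enum hwc_op op, enum hwc_res res)
{
    return (uint64_t)id | ((uint64_t)op << 8) | ((uint64_t)res << 16);
}

/* Split a PERF_TYPE_HW_CACHE config back into its three one-byte fields. */
void hw_cache_decode(uint64_t config, unsigned *id, unsigned *op, unsigned *res)
{
    assert(id != NULL && op != NULL && res != NULL);
    *id  = (unsigned)(config & 0xff);
    *op  = (unsigned)((config >> 8) & 0xff);
    *res = (unsigned)((config >> 16) & 0xff);
}
```

Decoding 256 gives id 0 (L1D), op 1 (write) and result 0 (access), which is exactly L1-dcache-stores. By the same layout, L1-dcache-load-misses would be 0x10000.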
perf script has an output table which defines how to print events of every kind: http://lxr.free-electrons.com/source/tools/perf/builtin-script.c?v=3.16#L68
/* default set to maintain compatibility with current format */
static struct {
    bool user_set;
    bool wildcard_set;
    unsigned int print_ip_opts;
    u64 fields;
    u64 invalid_fields;
} output[PERF_TYPE_MAX] = {

    [PERF_TYPE_HARDWARE] = {
        .user_set = false,

        .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
                  PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
                  PERF_OUTPUT_EVNAME | PERF_OUTPUT_IP |
                  PERF_OUTPUT_SYM | PERF_OUTPUT_DSO,

        .invalid_fields = PERF_OUTPUT_TRACE,
    },

    [PERF_TYPE_SOFTWARE] = {
        .user_set = false,

        .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
                  PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
                  PERF_OUTPUT_EVNAME | PERF_OUTPUT_IP |
                  PERF_OUTPUT_SYM | PERF_OUTPUT_DSO,

        .invalid_fields = PERF_OUTPUT_TRACE,
    },

    [PERF_TYPE_TRACEPOINT] = {
        .user_set = false,

        .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
                  PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
                  PERF_OUTPUT_EVNAME | PERF_OUTPUT_TRACE,
    },

    [PERF_TYPE_RAW] = {
        .user_set = false,

        .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
                  PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
                  PERF_OUTPUT_EVNAME | PERF_OUTPUT_IP |
                  PERF_OUTPUT_SYM | PERF_OUTPUT_DSO,

        .invalid_fields = PERF_OUTPUT_TRACE,
    },
};
So, there are no instructions for printing any field from samples with type 3 (PERF_TYPE_HW_CACHE), and perf script does not print them. We can try to register this type in the output array and even push the patch to the kernel.
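What "registering this type in the output array" could look like is sketched below as a standalone C model, not as a patch to perf. Everything prefixed SK_ or sk_ is hypothetical: the flag bit values are placeholders (the real PERF_OUTPUT_* constants live in builtin-script.c), and reusing the hardware event's field set for the cache type is an assumption about what a fix might choose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of enum perf_type_id, so the sketch builds on its own. */
enum sk_type {
    SK_TYPE_HARDWARE, SK_TYPE_SOFTWARE, SK_TYPE_TRACEPOINT,
    SK_TYPE_HW_CACHE, SK_TYPE_RAW, SK_TYPE_BREAKPOINT, SK_TYPE_MAX
};

/* Placeholder flag bits standing in for PERF_OUTPUT_*. */
#define SK_OUT_COMM   (1u << 0)
#define SK_OUT_TID    (1u << 1)
#define SK_OUT_CPU    (1u << 2)
#define SK_OUT_TIME   (1u << 3)
#define SK_OUT_EVNAME (1u << 4)
#define SK_OUT_IP     (1u << 5)
#define SK_OUT_SYM    (1u << 6)
#define SK_OUT_DSO    (1u << 7)
#define SK_OUT_TRACE  (1u << 8)

#define SK_BASE_FIELDS (SK_OUT_COMM | SK_OUT_TID | SK_OUT_CPU | \
                        SK_OUT_TIME | SK_OUT_EVNAME)
#define SK_HW_FIELDS   (SK_BASE_FIELDS | SK_OUT_IP | SK_OUT_SYM | SK_OUT_DSO)

struct sk_output {
    uint64_t fields;
    uint64_t invalid_fields;
};

/* Shape of the table quoted above: no entry for type 3, so it stays zeroed. */
const struct sk_output sk_table_before[SK_TYPE_MAX] = {
    [SK_TYPE_HARDWARE]   = { SK_HW_FIELDS, SK_OUT_TRACE },
    [SK_TYPE_SOFTWARE]   = { SK_HW_FIELDS, SK_OUT_TRACE },
    [SK_TYPE_TRACEPOINT] = { SK_BASE_FIELDS | SK_OUT_TRACE, 0 },
    [SK_TYPE_RAW]        = { SK_HW_FIELDS, SK_OUT_TRACE },
};

/* Hypothetical fix: give the cache type the same fields as hardware events. */
const struct sk_output sk_table_after[SK_TYPE_MAX] = {
    [SK_TYPE_HARDWARE]   = { SK_HW_FIELDS, SK_OUT_TRACE },
    [SK_TYPE_SOFTWARE]   = { SK_HW_FIELDS, SK_OUT_TRACE },
    [SK_TYPE_TRACEPOINT] = { SK_BASE_FIELDS | SK_OUT_TRACE, 0 },
    [SK_TYPE_HW_CACHE]   = { SK_HW_FIELDS, SK_OUT_TRACE },
    [SK_TYPE_RAW]        = { SK_HW_FIELDS, SK_OUT_TRACE },
};

/* True when the table tells the printer to emit at least one field. */
bool sk_has_output_fields(const struct sk_output *table, int type)
{
    assert(type >= 0 && type < SK_TYPE_MAX);
    return table[type].fields != 0;
}
```

In this model sk_has_output_fields(sk_table_before, SK_TYPE_HW_CACHE) is false, while the same query on sk_table_after is true. The designated initializers leave any unlisted index zeroed, which is why a missing entry means "print nothing".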
definition of linux perf cache-misses event?
The cache-misses event corresponds to the misses in the last level cache (LLC). Note that this is an architectural performance monitoring event, which is supposed to behave consistently across microarchitectures.
This can be verified from the source code for cache-misses, which uses the raw code 0x412e.
The first 2 digits of the hexadecimal 0x412e refer to the umask (41) and the last 2 digits refer to the event select (2e).
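The split of 0x412e into umask and event select is plain bit arithmetic, sketched here in C. The function names are made up for illustration; the layout (event select in bits 0-7, umask in bits 8-15) is the one the sentence above describes.

```c
#include <assert.h>
#include <stdint.h>

/* Event select: the low byte of the raw code. */
uint8_t raw_event_select(uint64_t config)
{
    return (uint8_t)(config & 0xff);
}

/* Unit mask: the second byte of the raw code. */
uint8_t raw_umask(uint64_t config)
{
    return (uint8_t)((config >> 8) & 0xff);
}

/* Rebuild a raw code from its two parts. */
uint64_t raw_config(uint8_t event_select, uint8_t umask)
{
    return ((uint64_t)umask << 8) | event_select;
}

/* Round-trip sanity check: splitting and recombining must give back
 * the original 16-bit code. */
void raw_roundtrip_check(uint64_t config)
{
    assert(config <= 0xffff);
    assert(raw_config(raw_event_select(config), raw_umask(config)) == config);
}
```

For 0x412e this yields event select 0x2e and umask 0x41, matching the Intel manual entry. The same raw code can be passed to perf directly with the rNNN syntax, for example perf stat -e r412e.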
From the Intel software developer's manual (look at the chapter on Performance Monitoring)
Last Level Cache Misses— Event select 2EH, Umask 41H
"This event counts each cache miss condition for references to the last level on-die cache. The event count may include speculation and cache line fills due to the first-level cache hardware prefetcher, but may exclude cache line fills due to other hardware-prefetchers."