Can't sample hardware cache events with linux perf
There is a difference in the perf evlist -vvv output of three perf.data files: one for a hardware cache event, one for a hardware cycles event, and one for a software event:
echo '2^234567 %2' | perf record -e L1-dcache-stores -c 100 -o cache bc
echo '2^234567 %2' | perf record -e cycles -c 100 -o cycles bc
echo '2^234567 %2' | perf record -e cs -c 100 -o cs bc
perf evlist -vvv -i cache
L1-dcache-stores: sample_freq=100, type: 3, config: 256, size: 96, sample_type: IP|TID|TIME, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, enable_on_exec: 1, sample_id_all: 1, exclude_guest: 1
perf evlist -vvv -i cycles
cycles: sample_freq=100, size: 96, sample_type: IP|TID|TIME, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, enable_on_exec: 1, sample_id_all: 1, exclude_guest: 1
perf evlist -vvv -i cs
cs: sample_freq=100, type: 1, config: 3, size: 96, sample_type: IP|TID|TIME, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, enable_on_exec: 1, sample_id_all: 1, exclude_guest: 1
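For reference, the config value of a PERF_TYPE_HW_CACHE event is a packed triple (cache id, op id, result id), as documented in include/uapi/linux/perf_event.h. A small Python sketch decodes the "config: 256" shown above back into the generic event name:

```python
# Decode a PERF_TYPE_HW_CACHE config value, following the encoding
# documented in include/uapi/linux/perf_event.h:
#   config = (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
#            (perf_hw_cache_op_result_id << 16)
CACHE_IDS = ["L1D", "L1I", "LL", "DTLB", "ITLB", "BPU", "NODE"]
OP_IDS = ["READ", "WRITE", "PREFETCH"]
RESULT_IDS = ["ACCESS", "MISS"]

def decode_hw_cache_config(config):
    cache = CACHE_IDS[config & 0xFF]
    op = OP_IDS[(config >> 8) & 0xFF]
    result = RESULT_IDS[(config >> 16) & 0xFF]
    return f"{cache}-{op}-{result}"

print(decode_hw_cache_config(256))  # L1D-WRITE-ACCESS, i.e. L1-dcache-stores
```

So 256 = 0x100 is L1D / OP_WRITE / RESULT_ACCESS, matching L1-dcache-stores. (Similarly, the cs event's "type: 1, config: 3" is PERF_COUNT_SW_CONTEXT_SWITCHES in enum perf_sw_ids.)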
The three events have different type values, and the types are defined as:
enum perf_type_id {
	PERF_TYPE_HARDWARE   = 0,
	PERF_TYPE_SOFTWARE   = 1,
	PERF_TYPE_TRACEPOINT = 2,
	PERF_TYPE_HW_CACHE   = 3,
	PERF_TYPE_RAW        = 4,
	PERF_TYPE_BREAKPOINT = 5,

	PERF_TYPE_MAX, /* non-ABI */
};
perf script has an output table which defines how to print events of every type: http://lxr.free-electrons.com/source/tools/perf/builtin-script.c?v=3.16#L68
68 /* default set to maintain compatibility with current format */
69 static struct {
70 bool user_set;
71 bool wildcard_set;
72 unsigned int print_ip_opts;
73 u64 fields;
74 u64 invalid_fields;
75 } output[PERF_TYPE_MAX] = {
76
77 [PERF_TYPE_HARDWARE] = {
78 .user_set = false,
79
80 .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
81 PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
82 PERF_OUTPUT_EVNAME | PERF_OUTPUT_IP |
83 PERF_OUTPUT_SYM | PERF_OUTPUT_DSO,
84
85 .invalid_fields = PERF_OUTPUT_TRACE,
86 },
87
88 [PERF_TYPE_SOFTWARE] = {
89 .user_set = false,
90
91 .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
92 PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
93 PERF_OUTPUT_EVNAME | PERF_OUTPUT_IP |
94 PERF_OUTPUT_SYM | PERF_OUTPUT_DSO,
95
96 .invalid_fields = PERF_OUTPUT_TRACE,
97 },
98
99 [PERF_TYPE_TRACEPOINT] = {
100 .user_set = false,
101
102 .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
103 PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
104 PERF_OUTPUT_EVNAME | PERF_OUTPUT_TRACE,
105 },
106
107 [PERF_TYPE_RAW] = {
108 .user_set = false,
109
110 .fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
111 PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
112 PERF_OUTPUT_EVNAME | PERF_OUTPUT_IP |
113 PERF_OUTPUT_SYM | PERF_OUTPUT_DSO,
114
115 .invalid_fields = PERF_OUTPUT_TRACE,
116 },
117 };
118
So there are no instructions for printing any field of samples with type 3 (PERF_TYPE_HW_CACHE), and perf script does not print them. We can try to register this type in the output array and even push a patch to the kernel.
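A sketch of what such a patch could add to the output[] table above - a hypothetical PERF_TYPE_HW_CACHE entry copying the field set of the existing PERF_TYPE_HARDWARE entry (whether IP/SYM/DSO are the right defaults for cache samples is an assumption):

```c
	/* Hypothetical entry for the output[] table in builtin-script.c,
	 * mirroring the PERF_TYPE_HARDWARE defaults so hw-cache samples
	 * get printed instead of being silently skipped. */
	[PERF_TYPE_HW_CACHE] = {
		.user_set = false,

		.fields = PERF_OUTPUT_COMM | PERF_OUTPUT_TID |
			  PERF_OUTPUT_CPU | PERF_OUTPUT_TIME |
			  PERF_OUTPUT_EVNAME | PERF_OUTPUT_IP |
			  PERF_OUTPUT_SYM | PERF_OUTPUT_DSO,

		.invalid_fields = PERF_OUTPUT_TRACE,
	},
```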
Why won't perf report dcache-store-misses?
perf prints <not supported> for generic events which were requested by the user or by the default event set (in perf stat) but which are not mapped to real hardware PMU events on the current hardware. Your hardware has no exact match for the L1-dcache-store-misses generic event, so perf informs you that your request sudo perf stat -e L1-dcache-load-misses,L1-dcache-store-misses ./progB can't be fully implemented on the current machine.
Your CPU is "Product formerly Kaby Lake", which has the Skylake PMU according to the Linux kernel file arch/x86/events/intel/core.c:
#L4986
case INTEL_FAM6_KABYLAKE:
memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));
Line 420 of this file is the cache event mapping (generic perf event name to real hw PMU event code) for the Skylake PMU - skl_hw_cache_event_ids. Your L1d load/store misses are the [ C(L1D ) ] - [ C(OP_READ) ] / [ C(OP_WRITE) ] - [ C(RESULT_MISS) ] fields of this strange data structure (= 0 means "not mapped"; skl_hw_cache_extra_regs at L525 has additional umask settings for events):
static ... const... skl_hw_cache_event_ids ... =
{
[ C(L1D ) ] = {
[ C(OP_READ) ] = {
[ C(RESULT_ACCESS) ] = 0x81d0, /* MEM_INST_RETIRED.ALL_LOADS */
[ C(RESULT_MISS) ] = 0x151, /* L1D.REPLACEMENT */
},
[ C(OP_WRITE) ] = {
[ C(RESULT_ACCESS) ] = 0x82d0, /* MEM_INST_RETIRED.ALL_STORES */
[ C(RESULT_MISS) ] = 0x0,
}, ...
},
So, for Skylake, L1d misses are defined for loads (op_read) but not defined for stores (op_write), while L1d accesses are defined for both operations.
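A toy Python model of this lookup (table values are the skl entries quoted above; the function itself is an illustrative sketch, not perf's actual code) shows why one generic event resolves and the other cannot:

```python
# Toy model of the skl_hw_cache_event_ids lookup quoted above:
# a raw event code of 0 means "generic event not mapped on this PMU".
skl_l1d = {
    ("READ", "ACCESS"): 0x81d0,   # MEM_INST_RETIRED.ALL_LOADS
    ("READ", "MISS"): 0x151,      # L1D.REPLACEMENT
    ("WRITE", "ACCESS"): 0x82d0,  # MEM_INST_RETIRED.ALL_STORES
    ("WRITE", "MISS"): 0x0,       # not mapped
}

def resolve(op, result):
    code = skl_l1d[(op, result)]
    return hex(code) if code else "<not supported>"

print(resolve("READ", "MISS"))    # 0x151 (L1-dcache-load-misses works)
print(resolve("WRITE", "MISS"))   # <not supported> (L1-dcache-store-misses)
```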
These generic events were probably created a long time ago, when hardware had some PMU event to implement them. For example, the Core 2 PMU has a mapping for these events: arch/x86/events/intel/core.c line 1254, the core2_hw_cache_event_ids const - the l1d read miss is L1D_CACHE_LD.I_STATE, the l1d write miss is L1D_CACHE_ST.I_STATE. The perf subsystem in the kernel simply had to keep many generic event names, added in old versions, for compatibility.
You should check the output of the sudo perf list cache command to select supported events for your CPU and its PMU. This command (in recent perf tool versions) will output only mapped generic names and will also print hardware-specific event names. You should also check the Intel SDM, optimization and perfcounters manuals to understand how loads and stores are implemented and which PMU events you should use to count hardware events.
While L1d store misses are not available on your CPU, you should think about what a store miss is and how it is implemented. Probably such a request will be passed to some next level of the cache/memory hierarchy, for example becoming an L2 store access. The perf generic event set is ugly (it was introduced in the era of the 2-level cache of Core 2) and has only L1 and LLC (last level cache) cache events. It is not clear how LLC is mapped in the current era of shared L3 - is it L2 or L3 (Skylake's LLC = L3)? But Intel-specific events should work.
Linux perf reporting cache misses for unexpected instruction
About your example:
There are several instructions before and at the high counter:
│ movsd (%rcx,%rsi,8),%xmm0
0.13 │ ucomis (%rcx,%rdx,8),%xmm0
57.99 │ ↑ jbe ff
"movsd" loads a word from (%rcx,%rsi,8) (some array access) into the xmm0 register, and "ucomis" loads another word from (%rcx,%rdx,8) and compares it with the just-loaded value in the xmm0 register. "jbe" is a conditional jump which depends on the compare outcome.
Many modern Intel CPUs (and probably AMD too) can and will fuse (combine) some combinations of operations together (realworldtech.com/nehalem/5: "into a single uop, CMP+JCC"), and cmp + conditional jump is a very common instruction combination to be fused (you can check it with the Intel IACA simulating tool; use ver 2.1 for your CPU). A fused pair may be reported incorrectly by perf/PMUs/PEBS, with most events skewed towards one of the two instructions.
This code probably means that the expression "dist[i] < dist[tmp]" generates two memory accesses, and both values are used in the ucomis instruction, which is (partially?) fused with the jbe conditional jump. Either dist[i] or dist[tmp] or both expressions generate a high number of misses. Any such miss will block ucomis from generating its result and block jbe from handing over the next instruction to execute (or from retiring predicted instructions). So jbe may get all the fame of high counters instead of the real memory-access instructions (and for a "far" event like a cache response there is some skew towards the last blocked instruction).
You may try to merge the visited[N] and dist[N] arrays into an array[N] of struct { int visited; float dist; } to force prefetching of array[i].dist when you access array[i].visited; or you may try to change the order of vertex access, renumber graph vertices, or do some software prefetch for the next one or more elements (?)
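The array-merging suggestion can be sketched with ctypes mirroring the C struct layout (the struct name is hypothetical; sizes assume the usual 4-byte int and float of common ABIs):

```python
# Hypothetical sketch: merging the parallel visited[] and dist[] arrays
# into one array of structs, so a single cache-line fetch brings in both
# fields of an element together.
import ctypes

class Node(ctypes.Structure):
    _fields_ = [("visited", ctypes.c_int),   # 4 bytes
                ("dist", ctypes.c_float)]    # 4 bytes; no padding needed

print(ctypes.sizeof(Node))        # 8: both fields share 8 contiguous bytes
print(64 // ctypes.sizeof(Node))  # 8 complete elements per 64-byte cache line
```

With the merged layout, touching array[i].visited pulls array[i].dist into the same 64-byte line, so the later dist access is a guaranteed hit.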
About generic perf event-by-name problems and possible uncore skew.
The perf (perf_events) tool in Linux uses a predefined set of events when called as perf list, and some listed hardware events may not be implemented; others are mapped to current CPU capabilities (and some mappings are not fully correct). Some basic info about the real PMU is in https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf (but it has more details for the related Nehalem-EP variant).
For your Nehalem (Intel Core i5 750, with 8 MB of L3 cache and without multi-CPU/multi-socket/NUMA support) perf will map the standard ("Generic cache events") LLC-load-misses event to... "OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS", as written in the best documentation of perf event mappings (the only one) - the kernel source code:
http://elixir.free-electrons.com/linux/v4.8/source/arch/x86/events/intel/core.c#L1103
u64 nehalem_hw_cache_event_ids ...
[ C(LL ) ] = {
[ C(OP_READ) ] = {
/* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */
[ C(RESULT_ACCESS) ] = 0x01b7,
/* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */
[ C(RESULT_MISS) ] = 0x01b7,
...
/*
* Nehalem/Westmere MSR_OFFCORE_RESPONSE bits;
* See IA32 SDM Vol 3B 30.6.1.3
*/
#define NHM_DMND_DATA_RD (1 << 0)
#define NHM_DMND_READ (NHM_DMND_DATA_RD)
#define NHM_L3_MISS (NHM_NON_DRAM|NHM_LOCAL_DRAM|NHM_REMOTE_DRAM|NHM_REMOTE_CACHE_FWD)
...
u64 nehalem_hw_cache_extra_regs
..
[ C(LL ) ] = {
[ C(OP_READ) ] = {
[ C(RESULT_ACCESS) ] = NHM_DMND_READ|NHM_L3_ACCESS,
[ C(RESULT_MISS) ] = NHM_DMND_READ|NHM_L3_MISS,
I think this event is not precise: the CPU pipeline will post (out of order) a load request to the cache hierarchy and will execute other instructions. After some time (around 10 cycles to reach and get a response from L2, and 40 cycles to reach L3) there will be a response with a miss flag in the corresponding (offcore?) PMU to increment the counter. On counter overflow, a profiling interrupt will be generated from this PMU. In several CPU clock cycles it will reach the pipeline to interrupt it, and the perf_events subsystem's handler will handle this by registering the current (interrupted) EIP/RIP instruction pointer and resetting the PMU counter back to some negative value (for example, -100000 to get an interrupt for every 100000 L3 misses counted; use perf record -e LLC-load-misses -c 100000 to set an exact count, or perf will autotune the limit to get some default frequency). The registered EIP/RIP is not the IP of the load instruction, and it may also not be the EIP/RIP of the instruction which wants to use the loaded data.
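The negative-reload trick described above can be sketched numerically (the 48-bit counter width here is an assumption for illustration; the real width is reported by CPUID leaf 0xA):

```python
# Sketch of the "reset PMU counter to a negative value" trick: a 48-bit
# hardware counter programmed to 2**48 - period (i.e. -period in 48-bit
# two's complement) overflows after exactly `period` increments, which
# is what raises the sampling interrupt.
WIDTH = 48  # assumed counter width; varies per PMU version

def increments_until_overflow(period):
    counter = (1 << WIDTH) - period  # "-period" in two's complement
    n = 0
    while counter < (1 << WIDTH):    # count up until the overflow bit
        counter += 1
        n += 1
    return n

print(increments_until_overflow(100000))  # 100000
```

This is why perf record -c 100000 delivers one sample per 100000 counted events: the handler simply re-arms the counter with the same negative value each time.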
But if your CPU is the only socket in the system and you access normal memory (not some mapped PCI Express space), an L3 miss will in fact be implemented as a local memory access, and there are some counters for this... (https://software.intel.com/en-us/node/596851 - "Any memory requests missing here must be serviced by local or remote DRAM").
There are some listings of PMU events for your CPU:
- Official Intel "Intel® 64 and IA-32 Architectures Software Developer Manuals" (SDM): https://software.intel.com/en-us/articles/intel-sdm, Volume 3, Appendix A
- 3B: https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf "18.8 PERFORMANCE MONITORING FOR PROCESSORS BASED ON INTEL® MICROARCHITECTURE CODE NAME NEHALEM" from page 213 "Vol 3B 18-35"
- 3B: https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf "19.8 - Processors based on Intel® microarchitecture code name Nehalem" from page 365 and "Vol. 3B 19-61"
- Some other volume for offcore response encoding? Vol. 3A 18-26?
- from oprofile: http://oprofile.sourceforge.net/docs/intel-corei7-events.php
- from libpfm4's showevtinfo: http://www.bnikolic.co.uk/blog/hpc-prof-events.html (note, this page shows a Sandy Bridge list; get libpfm4 and run it on your PC to get your list). There is also a check_events tool in libpfm4 to help you encode events as raw for perf.
- from VTune documentation: http://www.hpc.ut.ee/dokumendid/ips_xe_2015/vtune_amplifier_xe/documentation/en/help/reference/pmw_sp/events/offcore_response.html
- from the Nehalem PMU guide: https://software.intel.com/sites/default/files/m/5/2/c/f/1/30320-Nehalem-PMU-Programming-Guide-Core.pdf
- the ocperf tool from Intel's perf developer Andi Kleen, part of his pmu-tools: https://github.com/andikleen/pmu-tools. ocperf is just a wrapper for perf; the package downloads event descriptions, and any supported event name will be converted into the correct raw encoding for perf.
There should be some information about ANY_LLC_MISS offcore PMU event implementation and list of PEBS events for Nhm, but I can't find it now.
I can recommend you to use ocperf from https://github.com/andikleen/pmu-tools with any PMU events of your CPU, without the need to encode them manually. There are some PEBS events in your CPU, and there is latency profiling / perf mem for some kinds of memory access profiling (some random perf mem pdfs: the 2012 post "perf: add memory access sampling support", RH 2013 - pg 26-30, still not documented in 2015 - sowa pg 19, ls /sys/devices/cpu/events). For newer CPUs there are newer tools like ucevent.
I can also recommend you to try the cachegrind profiler/cache simulator tool of the valgrind program, with the kcachegrind GUI to view profiles. Valgrind-based profilers may help you get a basic idea of how the code works: they collect exact instruction execution counts for every instruction, and cachegrind also simulates some abstract multi-level cache. But a real CPU will execute several instructions per cycle (so the callgrind/cachegrind cost model of 1 instruction = 1 CPU clock cycle has some error, and the cachegrind cache model does not have the same logic as a real cache). And all valgrind tools are dynamic binary instrumentation tools, which will slow down your program 20-30 times compared to a native run.
PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE concurrent monitoring
- The PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events are mapped to two sets of registers involved in performance monitoring. The first set of MSRs are called IA32_PERFEVTSELx, where x can vary from 0 to N-1, N being the total number of general-purpose counters available. PERFEVTSEL is short for "performance event select"; these registers specify various conditions on the fulfillment of which event counting will happen. The second set of MSRs are called IA32_PMCx, where x varies similarly to PERFEVTSEL. These PMC registers store the counts of performance monitoring events. Each PERFEVTSEL register is paired with a corresponding PMC register.
The mapping happens as follows:
At the initialization of the architecture-specific portion of the kernel, a PMU for measuring hardware-specific events is registered here with type PERF_TYPE_RAW. All PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events are mapped to PERF_TYPE_RAW events to identify the PMU, as can be seen here:
if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE)
type = PERF_TYPE_RAW;
The same architecture-specific initialization is responsible for setting up the addresses of the first/base registers of each of the aforementioned sets of performance monitoring event registers, here:
.eventsel = MSR_ARCH_PERFMON_EVENTSEL0,
.perfctr = MSR_ARCH_PERFMON_PERFCTR0,
The event_init function specific to the identified PMU is responsible for setting up and "reserving" the two sets of performance monitoring registers, as well as checking event constraints etc., here. The reservation happens here:
for (i = 0; i < x86_pmu.num_counters; i++) {
if (!reserve_perfctr_nmi(x86_pmu_event_addr(i)))
goto perfctr_fail;
}
for (i = 0; i < x86_pmu.num_counters; i++) {
if (!reserve_evntsel_nmi(x86_pmu_config_addr(i)))
goto eventsel_fail;
}
The value num_counters is the number of general-purpose counters as identified by the CPUID instruction.
In addition to this, there are a couple of extra registers that monitor offcore events (eg. the LLC-cache specific events).
In later versions of architectural performance monitoring, some of the hardware events are measured with the help of fixed-purpose registers, as seen here. These are the fixed-purpose registers:
#define MSR_ARCH_PERFMON_FIXED_CTR0 0x309
#define MSR_ARCH_PERFMON_FIXED_CTR1 0x30a
#define MSR_ARCH_PERFMON_FIXED_CTR2 0x30b
The PERF_TYPE_HARDWARE pre-defined events are all architectural performance monitoring events. These events are architectural because the behavior of each architectural performance event is expected to be consistent on all processors that support that event. All of the PERF_TYPE_HW_CACHE events are non-architectural, which means they are model-specific and may vary from one family of processors to another. For an Intel Kaby Lake machine that I have, a total of 20 PERF_TYPE_HW_CACHE events are pre-defined. The event constraints involved ensure that the 3 fixed-function counters available are mapped to 3 PERF_TYPE_HARDWARE architectural events. Only one event can be measured on each of the fixed-function counters, so we can discard them for our analysis. The other constraint is that only two events targeting the LLC caches can be measured at the same time, since there are only two OFFCORE RESPONSE registers. Also, the nmi-watchdog may pin an event to another counter from the family of general-purpose counters. If the nmi-watchdog is disabled, we are left with 4 general-purpose counters.
Given the constraints involved and the limited number of counters available, there is just no way to avoid multiplexing if all 20 hardware cache events are measured at the same time. Some workarounds to measure all the events, without incurring multiplexing and its errors, are:
3.1. Group all the PERF_TYPE_HW_CACHE events into groups of 4, such that all 4 events can be scheduled on the 4 general-purpose counters at the same time. Make sure there are no more than 2 LLC cache events in a group. Run the same profile and obtain the counts for each of the groups separately.
3.2. If all the PERF_TYPE_HW_CACHE events are to be monitored at the same time, then some of the multiplexing errors can be reduced by decreasing the value of perf_event_mux_interval_ms. It can be configured via a sysfs entry called /sys/devices/cpu/perf_event_mux_interval_ms. This value cannot be lowered beyond a point, as can be seen here.
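Workaround 3.1 can be sketched as a simple first-fit grouping (the event names and helper function are hypothetical; the constraints - groups of at most 4, at most 2 LLC events per group - are the ones stated above):

```python
# Partition hardware cache events into groups of at most 4 (one per
# general-purpose counter), with no more than 2 LLC events per group,
# so each group can be scheduled without multiplexing.
def group_events(events, group_size=4, max_llc=2):
    groups = []
    for ev in events:
        placed = False
        for g in groups:
            llc = sum(1 for e in g if e.startswith("LLC"))
            if len(g) < group_size and (not ev.startswith("LLC") or llc < max_llc):
                g.append(ev)
                placed = True
                break
        if not placed:
            groups.append([ev])  # open a new group
    return groups

# 6 illustrative LLC events plus 14 other cache events = 20 total
events = [f"LLC-ev{i}" for i in range(6)] + [f"L1-ev{i}" for i in range(14)]
groups = group_events(events)
for g in groups:
    assert len(g) <= 4 and sum(e.startswith("LLC") for e in g) <= 2
print(len(groups), "groups")  # 5 groups
```

Each resulting group can then be passed to perf stat with the brace group syntax, e.g. perf stat -e '{ev1,ev2,ev3,ev4}' ..., which asks the kernel to schedule the group as one atomic unit on the counters.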
- Monitoring up to 8 hardware or hardware-cache events would require hyperthreading to be disabled. Note that the information about the number of general-purpose counters available is retrieved using the CPUID instruction, and the number of such counters is set up in the architecture initialization portion of kernel startup via the early_initcall function. This can be seen here. Once the initialization is done, the kernel understands that only 4 counters are available, and any later changes in hyperthreading capabilities do not make any difference.