Using perf to monitor raw event counters
Ok, so I guess I figured it out.
For the Intel machine I use, the format is as follows: <umask><eventselector>, where both are hexadecimal values. The leading zeros of the umask can be dropped, but not those of the event selector.
So for the event 0xB0 with the umask 0x01, I can call:
perf record -e r1B0 ./mytestapp someargs
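Under this scheme the raw code is just the umask byte and the event-select byte concatenated. A minimal sketch of composing it (using the values from above):

```shell
# Compose a raw perf event code in the <umask><eventselector> scheme
# described above: two hex bytes, umask first.
umask=0x01
event=0xB0
printf 'r%02X%02X\n' "$umask" "$event"   # prints r01B0
```

perf also accepts the unpadded form r1B0, since the leading zeros of the umask may be dropped.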
I could not manage to find the exact parsing of it in the perf kernel code (any kernel hacker here?), but I found these sources:
- A description of the use of perf with raw events in the c't magazine 13/03 (subscription required), which describes some raw events along with their descriptions from the Intel Architecture Software Developer's Manual (Vol. 3B)
- A patch on the kernel mailing list discussing the proper way to document it. It notes that the pattern above "... was x86 specific and incomplete at that"
- (Updated) The man page of newer versions shows an example on Intel machines:
man perf-list
Update:
As pointed out in the comments (thank you!), the libpfm translator can be used to obtain the proper event descriptor. The website linked in the comments (Bojan Nikolic: How to monitor the full range of CPU performance events), discovered by user 'osgx', explains it in further detail.
Counting L3 cache access events on AMD Zen 2 processors
The L3 cache events can only be counted on the L3 PMU, as clearly indicated in both the physical mnemonic (L3PMCx01) and the logical mnemonic (Core::X86::Pmc::L3::L3RequestG1) of the event you want to measure. The L3 PMU is formally called L3PMC. This is similar to the cbox PMUs on Intel processors.
The default PMU in perf for raw events is cpu, which is the name the perf_events subsystem gives to the core PMU. An event specified using a raw event code without an explicit PMU, such as r8001, is equivalent to cpu/r8001/. The core event 0x001 represents the event Core::X86::Pmc::Core::FpSchedEmpty, and the umask 0x80 is undefined for this event (see Section 2.1.15.4.1). So you're counting an undefined event. In this case, if the event happened to be implemented but not documented, the event count may be nonzero, depending on whether it occurs during the execution of the program being profiled. Otherwise, the event count would be zero. perf_events doesn't stop you from counting undefined events.
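To see why r8001 lands on an undefined event, you can split the raw code back into its fields. A small sketch, assuming the same <umask><eventselector> byte layout discussed above:

```shell
# Decompose a raw event code: low byte = event select, next byte = umask.
code=0x8001
printf 'umask=0x%02X event=0x%03X\n' \
    $(( (code >> 8) & 0xFF )) \
    $(( code & 0xFF ))
# prints: umask=0x80 event=0x001
```

That is umask 0x80 applied to event select 0x001 (FpSchedEmpty), where that umask is not documented.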
Starting with upstream kernel version v5.4-rc1, the L3PMC is supported in perf_events under the name amd_l3. To determine whether you're using a kernel that supports this PMU, check whether it's enumerated by the command ls /sys/devices/*/format. If not supported, then you can't measure the L3 events on that kernel through perf.
If amd_l3 is supported, you have to explicitly specify the PMU, as in amd_l3/r8001/ or amd_l3/event=0x01,umask=0x80/, to have the event counted on the right PMU. Or you can just use the perf event name l3_request_g1.caching_l3_cache_accesses.
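Putting that together, a sketch of checking for the amd_l3 PMU and building the event specification (./mytestapp stands in for your own program):

```shell
# Build the explicit PMU event spec and only run perf if the kernel
# actually exposes the amd_l3 PMU in sysfs.
pmu=amd_l3
evspec="${pmu}/event=0x01,umask=0x80/"   # same event as amd_l3/r8001/
if [ -d "/sys/bus/event_source/devices/${pmu}" ]; then
    perf stat -e "${evspec}" -- ./mytestapp
else
    echo "no ${pmu} PMU: kernel too old or not a supported AMD part"
fi
```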
Do you know what the event L3RequestG1 represents? The documentation only describes it as "Caching: L3 cache accesses," which isn't very meaningful. It seems to me that the types of transactions it counts are a subset of those covered by the event L3LookupState. Table 19 in Section 2.1.15.2 says that L3 accesses and misses should be counted using rFF04 (L3LookupState) and r0106 (L3CombClstrState), respectively. Don't blindly expect that any of these events actually counts whatever you want to measure.
The PPR you linked is not for any Zen 2 processors; it's for some Zen and Zen+ processors (specifically models 00h-0Fh). You need to know the processor family and model to locate the right PPR.
only 2 PERF_TYPE_HW_CACHE events in perf event group
Note that perf does allow measuring more than 2 PERF_TYPE_HW_CACHE events at the same time; the exception is the measurement of LLC-cache events.
The expectation is that, with 4 general-purpose and 3 fixed-function hardware counters, 4 HW cache events (which default to RAW events) can be measured in perf without multiplexing, with hyper-threading ON.
sudo perf stat -e L1-icache-load-misses,L1-dcache-stores,L1-dcache-load-misses,dTLB-load-misses sleep 2
Performance counter stats for 'sleep 2':
26,893 L1-icache-load-misses
98,999 L1-dcache-stores
14,037 L1-dcache-load-misses
723 dTLB-load-misses
2.001732771 seconds time elapsed
0.001217000 seconds user
0.000000000 seconds sys
The problem appears when you try to measure events targeting the LLC-cache. perf seems to be able to measure only 2 LLC-cache-specific events concurrently, without multiplexing.
sudo perf stat -e LLC-load-misses,LLC-stores,LLC-store-misses,LLC-loads sleep 2
Performance counter stats for 'sleep 2':
2,419 LLC-load-misses # 0.00% of all LL-cache hits
2,963 LLC-stores
<not counted> LLC-store-misses (0.00%)
<not counted> LLC-loads (0.00%)
2.001486710 seconds time elapsed
0.001137000 seconds user
0.000000000 seconds sys
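A practical workaround (a sketch; the generic LLC event names are assumed to exist on your machine) is to split the four LLC events into pairs across separate runs, so that each run needs at most two offcore MSRs and no multiplexing:

```shell
# Run the four LLC events as two pairs instead of one group of four.
for pair in LLC-loads,LLC-load-misses LLC-stores,LLC-store-misses; do
    if command -v perf >/dev/null 2>&1; then
        perf stat -e "$pair" -- sleep 1 || echo "could not count $pair"
    else
        echo "perf not found; would run: perf stat -e $pair -- sleep 1"
    fi
done
```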
CPUs belonging to the Skylake/Kaby Lake family of microarchitectures, and some others, allow you to measure OFFCORE_RESPONSE events. Monitoring OFFCORE_RESPONSE events requires programming extra MSRs, specifically MSR_OFFCORE_RSP0 (MSR address 1A6H) and MSR_OFFCORE_RSP1 (MSR address 1A7H), in addition to programming the pair of IA32_PERFEVTSELx and IA32_PMCx registers.
Each pair of IA32_PERFEVTSELx and IA32_PMCx registers will be associated with one of the above MSRs to measure LLC-cache events.
The definition of the OFFCORE_RESPONSE MSRs can be seen here.
static struct extra_reg intel_skl_extra_regs[] __read_mostly = {
        INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff8fffull, RSP_0),
        INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff8fffull, RSP_1),
        ........
};
0x01b7 in the INTEL_UEVENT_EXTRA_REG call refers to event code 0xB7 and umask 0x01. This event code 0x01b7 maps to LLC-cache events, as can be seen here -
[ C(LL  ) ] = {
        [ C(OP_READ) ] = {
                [ C(RESULT_ACCESS) ] = 0x1b7, /* OFFCORE_RESPONSE */
                [ C(RESULT_MISS)   ] = 0x1b7, /* OFFCORE_RESPONSE */
        },
        [ C(OP_WRITE) ] = {
                [ C(RESULT_ACCESS) ] = 0x1b7, /* OFFCORE_RESPONSE */
                [ C(RESULT_MISS)   ] = 0x1b7, /* OFFCORE_RESPONSE */
        },
        [ C(OP_PREFETCH) ] = {
                [ C(RESULT_ACCESS) ] = 0x0,
                [ C(RESULT_MISS)   ] = 0x0,
        },
},
The event 0x01b7 will always map to MSR_OFFCORE_RSP_0, as can be seen here. The function specified above loops through the array of all the "extra registers" and associates event->config (which contains the raw event id) with the offcore response MSR.
So this would seem to mean only one event can be measured at a time, since only one MSR, MSR_OFFCORE_RSP_0, can be mapped to an LLC-cache event. But that is not the case!
The offcore registers are symmetric in nature, so when the first MSR, MSR_OFFCORE_RSP_0, is busy, perf uses the second, alternative MSR, MSR_OFFCORE_RSP_1, for measuring another offcore LLC event. The function here helps in doing that.
static int intel_alt_er(int idx, u64 config)
{
        int alt_idx = idx;

        if (!(x86_pmu.flags & PMU_FL_HAS_RSP_1))
                return idx;

        if (idx == EXTRA_REG_RSP_0)
                alt_idx = EXTRA_REG_RSP_1;

        if (idx == EXTRA_REG_RSP_1)
                alt_idx = EXTRA_REG_RSP_0;

        if (config & ~x86_pmu.extra_regs[alt_idx].valid_mask)
                return idx;

        return alt_idx;
}
The presence of only 2 offcore registers on the Kaby Lake family of microarchitectures prevents measuring more than 2 LLC-cache events concurrently without any multiplexing.
Performance monitoring with perf
ANSWER #1
Yes, mostly. perf report does show you a summary of the trace collected. Samples collected by perf record are saved into a binary file called, by default, perf.data. The perf report command reads this file and generates a concise execution profile. By default, samples are sorted by function, with the most-sampled functions first. You can also do much more detailed profiling using this report.
ANSWER #2
You should ideally use perf script -D to get a trace of all the data. The timestamp is in microseconds. However, in kernels newer than the one you specify, a command-line switch (--ns) lets you display the time in nanoseconds as well. Here is the source -
Timestamp
It is quite hard to tell without looking at what kind of "deltas" you are getting. Remember that the sampling period is usually tuned. There are two ways of specifying the rate at which to collect samples:
You can use perf record -c <count> (-c for count) to specify the period at which to collect samples. This means that for every c occurrences of the event you are measuring, you will get one sample. You can then modify the sampling period and test various values. For example, with -c 2, at every two occurrences of the event, the counter will overflow and you will record a sample.
The other way to express the sampling period is to specify the average rate of samples per second (frequency), which you can do using perf record -F. So perf record -F 1000 will record around 1000 samples per second, and these samples will be generated when the hardware/PMU counter corresponding to the event overflows. This means that the kernel dynamically adjusts the sampling period, and you will get samples at different, seemingly random moments.
You can see for yourself in code here:
How perf dynamically updates time
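The relationship between the two modes can be sketched with some quick arithmetic (the event counts are made up for illustration):

```shell
# Period mode: perf record -c 100 takes one sample per 100 events.
events=1000000
period=100
echo "expected samples: $((events / period))"       # prints 10000

# Frequency mode: perf record -F 1000 targets ~1000 samples/sec, so the
# kernel steers the period toward events_per_second / frequency.
events_per_sec=500000
freq=1000
echo "auto-tuned period: ~$((events_per_sec / freq))"   # prints ~500
```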
ANSWER #3
Why not? Ideally you should get the number of event samples collected if you do a perf report and just do a deeper analysis. Also, when you do a perf record and finish recording samples, you get a notification on the command line about the number of samples collected for the event you measured. (This may not be available in the kernel version you use; I would suggest switching to a newer Linux version if possible!) The number shown is the number of samples, i.e. the raw count - not the period.
If your period is 100, it means that for the whole duration of the trace, perf recorded every 100th event. That means, if a total of 1000 events happened over the trace duration, perf approximately sampled events 100, 200, 300, ..., 1000.
Yes, the samples recorded are not only from the application. In fact, you can use modifiers like perf record -e <event-name:u> or <event-name:k> (u for user space, k for kernel) to control where events are recorded. Additionally, perf records samples from shared libraries as well. (Please consult the perf man page for more details.)
As I said previously, perf report should be an ideal tool for counting the samples of the cycles event recorded by perf. The number of events collected/recorded is not exact, because it is simply not possible for the hardware to record every cycles event. Recording and preparing the details of all the events requires the kernel to maintain a ring buffer which gets written to periodically, as and when the counter overflows. This writing to the buffer happens via interrupts, which take up a fraction of CPU time; that time is lost and could have been used to record events, which are now missed because the CPU was busy servicing interrupts. Even so, you can get a really good estimate from perf.
CONCLUSION
perf does essentially what it intends to do given the limitations of the hardware resources we currently have at hand. I would suggest going through the man pages for each command to understand them better.
QUESTIONS
I assume you are looking at perf report. I also assume you are talking about the overhead % in perf report. Theoretically, it can be considered an arrangement of data from the highest to the lowest occurrence, as you specified. But there are many underlying details that you need to consider and understand to properly make sense of the output. The overhead % represents which function has the most overhead (in terms of the number of events that occurred in that function). There is also a parent-child relationship between all the functions and their overheads, based on which function calls which. Please use the Perf Report link to learn more.
As you already know, events are being sampled, not counted. So you cannot get the exact number of events, but you will get the number of samples, and, based on the tuned sampling frequency, you can also estimate the raw count of the number of events (everything should be available to you in the perf report output).
How to read performance counters on i5, i7 CPUs
Looks like PAPI has a very clean API and works just fine on Ubuntu 11.04.
Once it's installed, the following app will do what I wanted:
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

#define NUM_EVENTS 4

void matmul(const double *A, const double *B,
            double *C, int m, int n, int p)
{
    int i, j, k;
    for (i = 0; i < m; ++i)
        for (j = 0; j < p; ++j) {
            double sum = 0;
            for (k = 0; k < n; ++k)
                sum += A[i*n + k] * B[k*p + j];
            C[i*p + j] = sum;
        }
}

int main(void)
{
    const int size = 300;
    double a[size][size];
    double b[size][size];
    double c[size][size];
    int event[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_BR_MSP, PAPI_L1_DCM};
    long long values[NUM_EVENTS];

    /* Start counting events */
    if (PAPI_start_counters(event, NUM_EVENTS) != PAPI_OK) {
        fprintf(stderr, "PAPI_start_counters - FAILED\n");
        exit(1);
    }

    matmul((double *)a, (double *)b, (double *)c, size, size, size);

    /* Read the counters */
    if (PAPI_read_counters(values, NUM_EVENTS) != PAPI_OK) {
        fprintf(stderr, "PAPI_read_counters - FAILED\n");
        exit(1);
    }

    printf("Total instructions: %lld\n", values[0]);
    printf("Total cycles: %lld\n", values[1]);
    printf("Instr per cycle: %2.3f\n", (double)values[0] / (double)values[1]);
    printf("Branches mispredicted: %lld\n", values[2]);
    printf("L1 Cache misses: %lld\n", values[3]);

    /* Stop counting events */
    if (PAPI_stop_counters(values, NUM_EVENTS) != PAPI_OK) {
        fprintf(stderr, "PAPI_stop_counters - FAILED\n");
        exit(1);
    }

    return 0;
}
Tested this on an Intel Q6600; it supports up to 4 performance events. Your processor may support more or fewer.
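For reference, a build-and-run sketch (this assumes libpapi and its headers are installed in default locations; the file name papi_matmul.c is my own choice, not from the original post):

```shell
# Print the build command, and run it only if the toolchain and
# source file are actually present.
cmd="gcc -O2 -o papi_matmul papi_matmul.c -lpapi"
echo "$cmd"
if command -v gcc >/dev/null 2>&1 && [ -f papi_matmul.c ]; then
    $cmd && ./papi_matmul
fi
```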