perf_event_open - how to monitor multiple events
That's a bit tricky.
We create the first counter as usual, except that we additionally set PERF_FORMAT_GROUP and PERF_FORMAT_ID in read_format so that we can work with multiple counters simultaneously. This counter will be our group leader.
struct perf_event_attr pea;
int fd1, fd2;
uint64_t id1, id2;
memset(&pea, 0, sizeof(struct perf_event_attr));
pea.type = PERF_TYPE_HARDWARE;
pea.size = sizeof(struct perf_event_attr);
pea.config = PERF_COUNT_HW_CPU_CYCLES;
pea.disabled = 1;
pea.exclude_kernel = 1;
pea.exclude_hv = 1;
pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
fd1 = syscall(__NR_perf_event_open, &pea, 0, -1, -1, 0);
Next, we retrieve the identifier for the first counter:
ioctl(fd1, PERF_EVENT_IOC_ID, &id1);
The second (and all further) counters are created in the same fashion, with only one exception: we pass the fd1 value as the group-leader argument:
memset(&pea, 0, sizeof(struct perf_event_attr));
pea.type = PERF_TYPE_SOFTWARE;
pea.size = sizeof(struct perf_event_attr);
pea.config = PERF_COUNT_SW_PAGE_FAULTS;
pea.disabled = 1;
pea.exclude_kernel = 1;
pea.exclude_hv = 1;
pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
fd2 = syscall(__NR_perf_event_open, &pea, 0, -1, fd1, 0); // <-- here
ioctl(fd2, PERF_EVENT_IOC_ID, &id2);
Next we need to declare a data structure for reading multiple counters at once. The set of fields depends on the flags passed to perf_event_open; the manual page lists all the possible fields. In our case, we passed the PERF_FORMAT_ID flag, which adds an id field that lets us distinguish between the counters.
struct read_format {
uint64_t nr;
struct {
uint64_t value;
uint64_t id;
} values[]; /* flexible array member; 2 entries in our case */
};
Now we call standard profiling ioctls:
ioctl(fd1, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
ioctl(fd1, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
do_something();
ioctl(fd1, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
Finally, we read the counters from the group leader's file descriptor. Both counters are returned in the single read_format structure that we declared:
char buf[4096];
struct read_format* rf = (struct read_format*) buf;
uint64_t val1, val2;
uint64_t i;
read(fd1, buf, sizeof(buf));
for (i = 0; i < rf->nr; i++) {
if (rf->values[i].id == id1) {
val1 = rf->values[i].value;
} else if (rf->values[i].id == id2) {
val2 = rf->values[i].value;
}
}
printf("cpu cycles: %"PRIu64"\n", val1);
printf("page faults: %"PRIu64"\n", val2);
Below is the full program listing:
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <asm/unistd.h>
#include <errno.h>
#include <stdint.h>
#include <inttypes.h>
struct read_format {
uint64_t nr;
struct {
uint64_t value;
uint64_t id;
} values[];
};
void do_something() {
int i;
char* ptr;
ptr = malloc(100*1024*1024);
for (i = 0; i < 100*1024*1024; i++) {
ptr[i] = (char) (i & 0xff); // pagefault
}
free(ptr);
}
int main(int argc, char* argv[]) {
struct perf_event_attr pea;
int fd1, fd2;
uint64_t id1, id2;
uint64_t val1, val2;
char buf[4096];
struct read_format* rf = (struct read_format*) buf;
uint64_t i;
memset(&pea, 0, sizeof(struct perf_event_attr));
pea.type = PERF_TYPE_HARDWARE;
pea.size = sizeof(struct perf_event_attr);
pea.config = PERF_COUNT_HW_CPU_CYCLES;
pea.disabled = 1;
pea.exclude_kernel = 1;
pea.exclude_hv = 1;
pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
fd1 = syscall(__NR_perf_event_open, &pea, 0, -1, -1, 0);
if (fd1 == -1) {
perror("perf_event_open (group leader)");
exit(EXIT_FAILURE);
}
ioctl(fd1, PERF_EVENT_IOC_ID, &id1);
memset(&pea, 0, sizeof(struct perf_event_attr));
pea.type = PERF_TYPE_SOFTWARE;
pea.size = sizeof(struct perf_event_attr);
pea.config = PERF_COUNT_SW_PAGE_FAULTS;
pea.disabled = 1;
pea.exclude_kernel = 1;
pea.exclude_hv = 1;
pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
fd2 = syscall(__NR_perf_event_open, &pea, 0, -1, fd1 /*!!!*/, 0);
if (fd2 == -1) {
perror("perf_event_open (group member)");
exit(EXIT_FAILURE);
}
ioctl(fd2, PERF_EVENT_IOC_ID, &id2);
ioctl(fd1, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
ioctl(fd1, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
do_something();
ioctl(fd1, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
read(fd1, buf, sizeof(buf));
for (i = 0; i < rf->nr; i++) {
if (rf->values[i].id == id1) {
val1 = rf->values[i].value;
} else if (rf->values[i].id == id2) {
val2 = rf->values[i].value;
}
}
printf("cpu cycles: %"PRIu64"\n", val1);
printf("page faults: %"PRIu64"\n", val2);
return 0;
}
only 2 PERF_TYPE_HW_CACHE events in perf event group
Note that perf does allow measuring more than 2 PERF_TYPE_HW_CACHE events at the same time; the exception is the measurement of LLC-cache events.
With 4 general-purpose and 3 fixed-purpose hardware counters, the expectation is that 4 HW cache events (which default to RAW events in perf) can be measured without multiplexing, even with hyper-threading on:
sudo perf stat -e L1-icache-load-misses,L1-dcache-stores,L1-dcache-load-misses,dTLB-load-misses sleep 2
Performance counter stats for 'sleep 2':
26,893 L1-icache-load-misses
98,999 L1-dcache-stores
14,037 L1-dcache-load-misses
723 dTLB-load-misses
2.001732771 seconds time elapsed
0.001217000 seconds user
0.000000000 seconds sys
The problem appears when you try to measure events targeting the LLC cache: perf seems to measure only 2 LLC-specific events concurrently without multiplexing.
sudo perf stat -e LLC-load-misses,LLC-stores,LLC-store-misses,LLC-loads sleep 2
Performance counter stats for 'sleep 2':
2,419 LLC-load-misses # 0.00% of all LL-cache hits
2,963 LLC-stores
<not counted> LLC-store-misses (0.00%)
<not counted> LLC-loads (0.00%)
2.001486710 seconds time elapsed
0.001137000 seconds user
0.000000000 seconds sys
CPUs belonging to the Skylake/Kaby Lake family of microarchitectures (and some others) allow you to measure OFFCORE_RESPONSE events. Monitoring OFFCORE_RESPONSE events requires programming extra MSRs, specifically MSR_OFFCORE_RSP0 (MSR address 1A6H) and MSR_OFFCORE_RSP1 (MSR address 1A7H), in addition to programming the pair of IA32_PERFEVTSELx and IA32_PMCx registers.
Each IA32_PERFEVTSELx/IA32_PMCx pair used for an LLC-cache event is associated with one of these MSRs.
The definition of the OFFCORE_RESPONSE MSRs can be seen here:
static struct extra_reg intel_skl_extra_regs[] __read_mostly = {
INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff8fffull, RSP_0),
INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff8fffull, RSP_1),
........
}
The value 0x01b7 in the INTEL_UEVENT_EXTRA_REG call refers to event code 0xb7 with umask 0x01. This event code maps to the LLC-cache events, as can be seen here:
[ C(LL ) ] = {
[ C(OP_READ) ] = {
[ C(RESULT_ACCESS) ] = 0x1b7, /* OFFCORE_RESPONSE */
[ C(RESULT_MISS) ] = 0x1b7, /* OFFCORE_RESPONSE */
},
[ C(OP_WRITE) ] = {
[ C(RESULT_ACCESS) ] = 0x1b7, /* OFFCORE_RESPONSE */
[ C(RESULT_MISS) ] = 0x1b7, /* OFFCORE_RESPONSE */
},
[ C(OP_PREFETCH) ] = {
[ C(RESULT_ACCESS) ] = 0x0,
[ C(RESULT_MISS) ] = 0x0,
},
},
The event 0x01b7 always maps to MSR_OFFCORE_RSP_0 first, as can be seen here. The function referenced there loops through the array of all the "extra registers" and associates event->config (which contains the raw event id) with the offcore response MSR.
So this would mean that only one such event can be measured at a time, since only one MSR, MSR_OFFCORE_RSP_0, can be mapped to an LLC-cache event. But that is not the case!
The offcore registers are symmetric in nature, so when the first MSR, MSR_OFFCORE_RSP_0, is busy, perf uses the alternative MSR, MSR_OFFCORE_RSP_1, to measure another offcore LLC event. This function helps in doing that:
static int intel_alt_er(int idx, u64 config)
{
int alt_idx = idx;
if (!(x86_pmu.flags & PMU_FL_HAS_RSP_1))
return idx;
if (idx == EXTRA_REG_RSP_0)
alt_idx = EXTRA_REG_RSP_1;
if (idx == EXTRA_REG_RSP_1)
alt_idx = EXTRA_REG_RSP_0;
if (config & ~x86_pmu.extra_regs[alt_idx].valid_mask)
return idx;
return alt_idx;
}
The presence of only 2 offcore registers in the Kaby Lake family of microarchitectures hinders the ability to measure more than 2 LLC-cache events concurrently without multiplexing.
system wide perf_event_open
It looks like none of those options aggregates counts for you; they either count on one core or virtualize the counters across context switches.
If you look at what system-wide perf stat -a does (e.g. with strace -f perf stat), you can see that it calls perf_event_open once per event per core and then adds up the counts for each event across cores itself; the system-call API won't do that for you.
PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE concurrent monitoring
- The PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events are mapped to two sets of registers involved in performance monitoring. The first set of MSRs is called IA32_PERFEVTSELx, where x varies from 0 to N-1, N being the total number of general-purpose counters available. PERFEVTSEL is short for "performance event select"; these registers specify the conditions under which event counting happens. The second set of MSRs is called IA32_PMCx, where x varies in the same way. The PMC registers store the counts of the performance monitoring events. Each PERFEVTSEL register is paired with a corresponding PMC register.
The mapping happens as follows:
At the initialization of the architecture-specific portion of the kernel, a PMU for measuring hardware-specific events is registered here with type PERF_TYPE_RAW. All PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events are mapped to PERF_TYPE_RAW events to identify the PMU, as can be seen here:
if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE)
type = PERF_TYPE_RAW;
The same architecture-specific initialization is responsible for setting up the addresses of the first/base registers of each of the aforementioned sets of performance monitoring registers, here:
.eventsel = MSR_ARCH_PERFMON_EVENTSEL0,
.perfctr = MSR_ARCH_PERFMON_PERFCTR0,
The event_init function specific to the identified PMU is responsible for setting up and "reserving" the two sets of performance monitoring registers, as well as checking event constraints etc., here. The reservation happens here:
for (i = 0; i < x86_pmu.num_counters; i++) {
if (!reserve_perfctr_nmi(x86_pmu_event_addr(i)))
goto perfctr_fail;
}
for (i = 0; i < x86_pmu.num_counters; i++) {
if (!reserve_evntsel_nmi(x86_pmu_config_addr(i)))
goto eventsel_fail;
}
The value of num_counters is the number of general-purpose counters, as identified by the CPUID instruction.
In addition to this, there are a couple of extra registers that monitor offcore events (e.g. the LLC-cache-specific events).
In later versions of architectural performance monitoring, some of the hardware events are measured with the help of fixed-purpose registers, as seen here. These are the fixed-purpose registers:
#define MSR_ARCH_PERFMON_FIXED_CTR0 0x309
#define MSR_ARCH_PERFMON_FIXED_CTR1 0x30a
#define MSR_ARCH_PERFMON_FIXED_CTR2 0x30b
The PERF_TYPE_HARDWARE pre-defined events are all architectural performance monitoring events. These events are architectural because the behavior of each one is expected to be consistent on all processors that support it. All of the PERF_TYPE_HW_CACHE events are non-architectural, which means they are model-specific and may vary from one family of processors to another. For an Intel Kaby Lake machine that I have, a total of 20 PERF_TYPE_HW_CACHE events are pre-defined.
The event constraints involved ensure that the 3 fixed-function counters available are mapped to 3 PERF_TYPE_HARDWARE architectural events. Only one event can be measured on each of the fixed-function counters, so we can discard them for our analysis. The other constraint is that only two events targeting the LLC caches can be measured at the same time, since there are only two OFFCORE_RESPONSE registers. Also, the nmi-watchdog may pin an event to one of the general-purpose counters. If the nmi-watchdog is disabled, we are left with 4 general-purpose counters.
Given the constraints involved and the limited number of counters available, there is just no way to avoid multiplexing if all 20 hardware cache events are measured at the same time. Some workarounds to measure all the events without incurring multiplexing and its errors are:
3.1. Group the PERF_TYPE_HW_CACHE events into groups of 4, such that all 4 events in a group can be scheduled on the 4 general-purpose counters at the same time. Make sure there are no more than 2 LLC-cache events in a group. Run the same profile and obtain the counts for each of the groups separately.
3.2. If all the PERF_TYPE_HW_CACHE events are to be monitored at the same time, then some of the multiplexing error can be reduced by decreasing the value of perf_event_mux_interval_ms. It can be configured via a sysfs entry called /sys/devices/cpu/perf_event_mux_interval_ms. This value cannot be lowered beyond a point, as can be seen here.
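For example, the current interval can be inspected and lowered like this (the path is the one given above; writing requires root):

```shell
cat /sys/devices/cpu/perf_event_mux_interval_ms
echo 1 | sudo tee /sys/devices/cpu/perf_event_mux_interval_ms
```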
- Monitoring up to 8 hardware or hardware-cache events would require hyper-threading to be disabled. Note that the number of general-purpose counters available is retrieved using the CPUID instruction, and the counters are set up in the architecture-initialization portion of kernel startup via the early_initcall function. This can be seen here. Once the initialization is done, the kernel understands that only 4 counters are available, and later changes to hyper-threading capabilities do not make any difference.
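Whether SMT is currently active can be checked via sysfs on recent kernels (note that, per the above, toggling it after boot does not change the number of counters the kernel set up):

```shell
cat /sys/devices/system/cpu/smt/active    # 1 = hyper-threading on
echo off | sudo tee /sys/devices/system/cpu/smt/control
```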