Perf_Event_Open - How to Monitoring Multiple Events

perf_event_open - how to monitoring multiple events

That's a bit tricky.

We create first counter as usual. Additionally, we pass PERF_FORMAT_GROUP and PERF_FORMAT_ID to be able to work with multiple counters simultaneously. This counter will be our group leader.

struct perf_event_attr pea;
int fd1, fd2;
uint64_t id1, id2; 

memset(&pea, 0, sizeof(struct perf_event_attr));
pea.type = PERF_TYPE_HARDWARE;
pea.size = sizeof(struct perf_event_attr);
pea.config = PERF_COUNT_HW_CPU_CYCLES;
pea.disabled = 1;
pea.exclude_kernel = 1;
pea.exclude_hv = 1;
pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
fd1 = syscall(__NR_perf_event_open, &pea, 0, -1, -1, 0);

Next, we retrieve identifier for the first counter:

ioctl(fd1, PERF_EVENT_IOC_ID, &id1);

Second (and all further counters) are created in the same fashion with only one exception: we pass fd1 value as group leader argument:

memset(&pea, 0, sizeof(struct perf_event_attr));
pea.type = PERF_TYPE_SOFTWARE;
pea.size = sizeof(struct perf_event_attr);
pea.config = PERF_COUNT_SW_PAGE_FAULTS;
pea.disabled = 1;
pea.exclude_kernel = 1;
pea.exclude_hv = 1;
pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
fd2 = syscall(__NR_perf_event_open, &pea, 0, -1, fd1, 0); // <-- here
ioctl(fd2, PERF_EVENT_IOC_ID, &id2);

Next we need to declare a data structure to read multiple counters at once. You have to declare different set of fields depending on what flags you pass to perf_event_open. Manual page mentions all possible fields. In our case, we passed PERF_FORMAT_ID flag which adds id field. This will allow us to distinguish between different counters.

struct read_format {
    uint64_t nr;
    struct {
        uint64_t value;
        uint64_t id;
    } values[/*2*/];
};

Now we call standard profiling ioctls:

ioctl(fd1, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
ioctl(fd1, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
do_something();
ioctl(fd1, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

Finally, we read the counters from group leader file descriptor. Both counters are returned in single read_format structure that we declared:

char buf[4096];
struct read_format* rf = (struct read_format*) buf;
uint64_t val1, val2;

read(fd1, buf, sizeof(buf));
for (i = 0; i < rf->nr; i++) {
  if (rf->values[i].id == id1) {
    val1 = rf->values[i].value;
  } else if (rf->values[i].id == id2) {
    val2 = rf->values[i].value;
  }
}
printf("cpu cycles: %"PRIu64"\n", val1);
printf("page faults: %"PRIu64"\n", val2);

Below is the full program listing:

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <asm/unistd.h>
#include <errno.h>
#include <stdint.h>
#include <inttypes.h>

struct read_format {
  uint64_t nr;
  struct {
    uint64_t value;
    uint64_t id;
  } values[];
};

void do_something() {
  int i;
  char* ptr;

  ptr = malloc(100*1024*1024);
  for (i = 0; i < 100*1024*1024; i++) {
    ptr[i] = (char) (i & 0xff); // pagefault
  }
  free(ptr);
}

int main(int argc, char* argv[]) {
  struct perf_event_attr pea;
  int fd1, fd2;
  uint64_t id1, id2;
  uint64_t val1, val2;
  char buf[4096];
  struct read_format* rf = (struct read_format*) buf;
  int i;

  memset(&pea, 0, sizeof(struct perf_event_attr));
  pea.type = PERF_TYPE_HARDWARE;
  pea.size = sizeof(struct perf_event_attr);
  pea.config = PERF_COUNT_HW_CPU_CYCLES;
  pea.disabled = 1;
  pea.exclude_kernel = 1;
  pea.exclude_hv = 1;
  pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
  fd1 = syscall(__NR_perf_event_open, &pea, 0, -1, -1, 0);
  ioctl(fd1, PERF_EVENT_IOC_ID, &id1);

  memset(&pea, 0, sizeof(struct perf_event_attr));
  pea.type = PERF_TYPE_SOFTWARE;
  pea.size = sizeof(struct perf_event_attr);
  pea.config = PERF_COUNT_SW_PAGE_FAULTS;
  pea.disabled = 1;
  pea.exclude_kernel = 1;
  pea.exclude_hv = 1;
  pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
  fd2 = syscall(__NR_perf_event_open, &pea, 0, -1, fd1 /*!!!*/, 0);
  ioctl(fd2, PERF_EVENT_IOC_ID, &id2);


  ioctl(fd1, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
  ioctl(fd1, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
  do_something();
  ioctl(fd1, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);


  read(fd1, buf, sizeof(buf));
  for (i = 0; i < rf->nr; i++) {
    if (rf->values[i].id == id1) {
      val1 = rf->values[i].value;
    } else if (rf->values[i].id == id2) {
      val2 = rf->values[i].value;
    }
  }

  printf("cpu cycles: %"PRIu64"\n", val1);
  printf("page faults: %"PRIu64"\n", val2);

  return 0;
}

only 2 PERF_TYPE_HW_CACHE events in perf event group

Note that, perf does allow measuring more than 2 PERF_TYPE_HW_CACHE events at the same time, the exception being the measurement of LLC-cache events.

The expectation is that, when there are 4 general-purpose and 3 fixed-purpose
hardware counters, 4 HW cache events (which default to RAW events) in perf can be measured without multiplexing, with hyper-threading ON.

sudo perf stat -e L1-icache-load-misses,L1-dcache-stores,L1-dcache-load-misses,dTLB-load-misses sleep 2

 Performance counter stats for 'sleep 2':

            26,893      L1-icache-load-misses                                       
            98,999      L1-dcache-stores                                            
            14,037      L1-dcache-load-misses                                       
               723      dTLB-load-misses                                            

       2.001732771 seconds time elapsed

       0.001217000 seconds user
       0.000000000 seconds sys

The problem appears when you try to measure events targeting the LLC-cache. It seems to be measuring only 2 LLC-cache specific events, concurrently, without multiplexing.

sudo perf stat -e LLC-load-misses,LLC-stores,LLC-store-misses,LLC-loads sleep 2

 Performance counter stats for 'sleep 2':

             2,419      LLC-load-misses           #    0.00% of all LL-cache hits   
             2,963      LLC-stores                                                  
     <not counted>      LLC-store-misses                                              (0.00%)
     <not counted>      LLC-loads                                                     (0.00%)

       2.001486710 seconds time elapsed

       0.001137000 seconds user
       0.000000000 seconds sys

CPUs belonging to the skylake/kaby lake family of microarchitectures and some others, allow you to measure OFFCORE RESPONSE events. Monitoring OFFCORE_RESPONSE events requires programming extra MSRs, specifically, MSR_OFFCORE_RSP0 (MSR address 1A6H) and MSR_OFFCORE_RSP1 (MSR address 1A7H), in addition to programming the pair of IA32_PERFEVTSELx and IA32_PMCx registers.

Each pair of IA32_PERFEVTSELx and IA32_PMCx register will be associated with one of the above MSRs to measure LLC-cache events.

The definition of the OFFCORE_RESPONSE MSRs can be seen here.

static struct extra_reg intel_skl_extra_regs[] __read_mostly = {
    INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff8fffull, RSP_0),
    INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff8fffull, RSP_1),
    ........
}

0x01b7 in the INTEL_UEVENT_EXTRA_REG call refers to event-code b7 and umask 01. This event code 0x01b7 maps to LLC-cache events, as can be seen here -

[ C(LL  ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = 0x1b7,   /* OFFCORE_RESPONSE */
        [ C(RESULT_MISS)   ] = 0x1b7,   /* OFFCORE_RESPONSE */
    },
    [ C(OP_WRITE) ] = {
        [ C(RESULT_ACCESS) ] = 0x1b7,   /* OFFCORE_RESPONSE */
        [ C(RESULT_MISS)   ] = 0x1b7,   /* OFFCORE_RESPONSE */
    },
    [ C(OP_PREFETCH) ] = {
        [ C(RESULT_ACCESS) ] = 0x0,
        [ C(RESULT_MISS)   ] = 0x0,
    },
 },

The event 0x01b7 will always map to MSR_OFFCORE_RSP_0, as can be seen here. The function, specified above, loops through the array of all the "extra registers" and associates the event->config(which contains the raw event id) with the offcore response MSR.

So, this would mean only one event can be measured at a time, since only one MSR - MSR_OFFCORE_RSP_0 can be mapped to a LLC-cache event. But, that is not the case!

The offcore registers are symmetric in nature, so when the first MSR - MSR_OFFCORE_RSP_0 register is busy, perf uses the second alternative MSR, MSR_OFFCORE_RSP_1 for measuring another offcore LLC event. This function here helps in doing that.

static int intel_alt_er(int idx, u64 config)
{
    int alt_idx = idx;

    if (!(x86_pmu.flags & PMU_FL_HAS_RSP_1))
        return idx;

    if (idx == EXTRA_REG_RSP_0)
        alt_idx = EXTRA_REG_RSP_1;

    if (idx == EXTRA_REG_RSP_1)
        alt_idx = EXTRA_REG_RSP_0;

    if (config & ~x86_pmu.extra_regs[alt_idx].valid_mask)
        return idx;

    return alt_idx;
}

The presence of only 2 offcore registers, for Kaby-Lake family of microrarchitectures hinder the ability to target more than 2 LLC-cache event measurement concurrently, without any multiplexing.

system wide perf_event_open

It looks like none of those options aggregate counts for you, they either count on one core, or virtualize the counters across context switches.

If you look at what system-wide perf stat -a does (e.g. with strace -f perf stat), you can see it calls perf_event_open once per event per core. It has to add up the counts for an event across cores; the system-call API won't do that for you.

PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE concurrent monitoring

The PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events are mapped to two sets of registers involved in performance monitoring. The first set of MSRs are called IA32_PERFEVTSELx where x can vary from 0 to N-1, N being the total number of general purpose counters available. The PERFEVTSEL is a short for "performance event select", they specify various conditions on the fulfillment of which event counting will happen. The second set of MSRs are called IA32_PMCx, where x varies similarly as PERFEVTSEL. These PMC registers store the counts of performance monitoring events. Each PERFEVTSEL register is paired with a corresponding PMC register.

The mapping happens as follows-

At the initialization of the architecture specific portion of the kernel, a pmu for measuring hardware specific events is registered here with type PERF_TYPE_RAW. All PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events are mapped to PERF_TYPE_RAW events to identify the pmu, as can be seen here.

if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE)
        type = PERF_TYPE_RAW;

The same architecture specific initialization is responsible for setting up the addresses of the first/base registers of each of the aforementioned sets of performance monitoring event registers, here

    .eventsel       = MSR_ARCH_PERFMON_EVENTSEL0,
    .perfctr        = MSR_ARCH_PERFMON_PERFCTR0,

The event_init function specific to the PMU identified, is responsible for setting up and "reserving" the two sets of performance monitoring registers, as well as checking for event constraints etc., here. The reservation happens here.

for (i = 0; i < x86_pmu.num_counters; i++) {
        if (!reserve_perfctr_nmi(x86_pmu_event_addr(i)))
            goto perfctr_fail;
    }

    for (i = 0; i < x86_pmu.num_counters; i++) {
        if (!reserve_evntsel_nmi(x86_pmu_config_addr(i)))
            goto eventsel_fail;
    }

The value num_counters = number of general-purpose counters as identified by CPUID instruction.

In addition to this, there are a couple of extra registers that monitor offcore events (eg. the LLC-cache specific events).

In later versions of architectural performance monitoring, some of the hardware events are measured with the help of fixed-purpose registers, as seen here. These are the fixed-purpose registers -

#define MSR_ARCH_PERFMON_FIXED_CTR0 0x309
#define MSR_ARCH_PERFMON_FIXED_CTR1 0x30a
#define MSR_ARCH_PERFMON_FIXED_CTR2 0x30b

The PERF_TYPE_HARDWARE pre-defined events are all architectural performance monitoring events. These events are architectural, since the behavior of each architectural performance event is expected to be consistent on all processors that support that event. All of the PERF_TYPE_HW_CACHE events are non-architectural, which means they are model-specific and may vary from one family of processors to another.
For an Intel Kaby Lake machine that I have, a total of 20 PERF_TYPE_HW_CACHE events are pre-defined. The event constraints involved, ensure that the 3 fixed-function counters available are mapped to 3 PERF_TYPE_HARDWARE architectural events. Only one event can be measured on each of the fixed-function counters, so we can discard them for our analysis. The other constraint is that only two events targeting the LLC-caches, can be measured at the same time, since there are only two OFFCORE RESPONSE registers. Also, the nmi-watchdog may pin an event to another counter from the family of general-purpose counters. If the nmi-watchdog is disabled, we are left with 4 general purpose counters.

Given the constraints involved, and the limited number of counters available, there is just no way to avoid multiplexing if all the 20 hardware cache events are measured at the same time. Some workarounds to measure all the events, without incurring multiplexing and its errors, are -

3.1. Group all the PERF_TYPE_HW_CACHE events into groups of 4, such that all of the 4 events can be scheduled on each of the 4 general-purpose counters at the same time. Make sure there are no more than 2 LLC cache events in a group. Run the same profile and obtain the counts for each of the groups separately.

3.2. If all the PERF_TYPE_HW_CACHE events are to be monitored at the same time, then some of the errors with multiplexing can be reduced, by decreasing the value of perf_event_mux_interval_ms. It can be configured via a sysfs entry called /sys/devices/cpu/perf_event_mux_interval_ms. This value cannot be lowered beyond a point, as can be seen here.

Monitoring upto 8 hardware or hardware-cache events would require hyperthreading to be disabled. Note that, the information about the number of general purpose counters available are retrieved using the CPUID instruction and the number of such counters are setup at the architecture initialization portion of the kernel startup via the early_initcall function. This can be seen here. Once the initialization is done, the kernel understands that only 4 counters are available, and any changes in hyperthreading capabilities later, do not make any difference.

Perf_Event_Open - How to Monitoring Multiple Events