Which perf events can use PEBS?

There is a hack to support cycles:p on SandyBridge, which has no PEBS support for CPU_CLK_UNHALTED.*. The hack is implemented in the kernel part of perf in intel_pebs_aliases_snb(). When the user requests -e cycles, which is PERF_COUNT_HW_CPU_CYCLES (translated to CPU_CLK_UNHALTED.CORE), with a nonzero precise modifier, this function changes the hardware event to UOPS_RETIRED.ALL, which is PEBS-capable:

  [PERF_COUNT_HW_CPU_CYCLES]      = 0x003c,

static void intel_pebs_aliases_snb(struct perf_event *event)
{
    if ((event->hw.config & X86_RAW_EVENT_MASK) == 0x003c) {
        /*
         * Use an alternative encoding for CPU_CLK_UNHALTED.THREAD_P
         * (0x003c) so that we can use it with PEBS.
         *
         * The regular CPU_CLK_UNHALTED.THREAD_P event (0x003c) isn't
         * PEBS capable. However we can use UOPS_RETIRED.ALL
         * (0x01c2), which is a PEBS capable event, to get the same
         * count.
         *
         * UOPS_RETIRED.ALL counts the number of cycles that retires
         * CNTMASK micro-ops. By setting CNTMASK to a value (16)
         * larger than the maximum number of micro-ops that can be
         * retired per cycle (4) and then inverting the condition, we
         * count all cycles that retire 16 or less micro-ops, which
         * is every cycle.
         *
         * Thereby we gain a PEBS capable cycle counter.
         */
        u64 alt_config = X86_CONFIG(.event=0xc2, .umask=0x01, .inv=1, .cmask=16);

        alt_config |= (event->hw.config & ~X86_RAW_EVENT_MASK);
        event->hw.config = alt_config;
    }
}
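
As a sanity check of that encoding, the X86_CONFIG value above can be folded down by hand into a raw event code. A minimal user-space sketch (not kernel code), assuming the standard x86 PERFEVTSEL layout (bits 0-7 event select, bits 8-15 umask, bit 23 INV, bits 24-31 CMASK):

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    /* UOPS_RETIRED.ALL with INV=1, CMASK=16, as in intel_pebs_aliases_snb() */
    uint64_t event = 0xc2, umask = 0x01, inv = 1, cmask = 16;
    uint64_t config = event | (umask << 8) | (inv << 23) | (cmask << 24);

    /* Prints 0x108001c2; usable as a raw event, e.g. "perf stat -e r108001c2" */
    printf("raw config: 0x%" PRIx64 "\n", config);
    return 0;
}

The printed value, 0x108001c2, is the event-selection part of what perf ends up programming for cycles:p on SandyBridge.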

The intel_pebs_aliases_snb hack is registered in __init int intel_pmu_init(void) for case INTEL_FAM6_SANDYBRIDGE: / case INTEL_FAM6_SANDYBRIDGE_X: as

    x86_pmu.event_constraints = intel_snb_event_constraints;
    x86_pmu.pebs_constraints = intel_snb_pebs_event_constraints;
    x86_pmu.pebs_aliases = intel_pebs_aliases_snb;

pebs_aliases is called from intel_pmu_hw_config() when precise_ip is nonzero:

static int intel_pmu_hw_config(struct perf_event *event)
{
    ...
    if (event->attr.precise_ip) {
        ...
        if (x86_pmu.pebs_aliases)
            x86_pmu.pebs_aliases(event);
    }
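
For context, precise_ip is the field a profiler sets in perf_event_attr to reach this code path; it is what the :p / :pp suffixes translate to. A minimal hedged sketch of such an open call (open_precise_cycles is a hypothetical helper; perf_event_open has no glibc wrapper):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>

static int open_precise_cycles(pid_t pid)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;  /* "cycles", 0x003c on Intel */
    attr.sample_period = 100003;             /* PEBS requires a sampling event */
    attr.sample_type = PERF_SAMPLE_IP;
    attr.precise_ip = 2;                     /* the ":pp" modifier */
    attr.exclude_kernel = 1;

    /* On SNB, intel_pmu_hw_config() rewrites 0x003c during this call
       to the UOPS_RETIRED.ALL encoding shown above. */
    return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}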

The hack was implemented in 2012; see the LKML threads "[PATCH] perf, x86: Make cycles:p working on SNB" and "[tip:perf/core] perf/x86: Implement cycles:p for SNB/IVB", commit cccb9ba9e4ee0d750265f53de9258df69655c40b (http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=cccb9ba9e4ee0d750265f53de9258df69655c40b):

perf/x86: Implement cycles:p for SNB/IVB

Now that there's finally a chip with working PEBS (IvyBridge), we can
enable the hardware and implement cycles:p for SNB/IVB.

And I think there is no full list of such "precise" conversion hacks besides the Linux source code in arch/x86/events/intel/core.c: grep for static void intel_pebs_aliases (usually it is cycles:p / CPU_CLK_UNHALTED 0x003c that is implemented) and check intel_pmu_init for the actual model and the exact x86_pmu.pebs_aliases variant selected:

  • intel_pebs_aliases_core2: INST_RETIRED.ANY_P (0x00c0) with CNTMASK=16 instead of cycles:p
  • intel_pebs_aliases_snb: UOPS_RETIRED.ALL (0x01c2) with CNTMASK=16 instead of cycles:p
  • intel_pebs_aliases_precdist: for the highest values of precise_ip, INST_RETIRED.PREC_DIST (0x01c0) instead of cycles:ppp on SKL, IVB, HSW, BDW

Good resources on how to program PEBS (Precise event based sampling) counters?


Please don't mix tracing and timing measurements in a single run.

It is simply impossible to have both the fastest run of Spec and all memory accesses traced. Do one run for timing and another (longer, slower) run for memory access tracing.


In https://github.com/pyrovski/powertools the frequency of collected events is controlled by the reset_val argument of pebs_init:

https://github.com/pyrovski/powertools/blob/0f66c5f3939a9b7b88ec73f140f1a0892cfba235/msr_pebs.c#L72

void
pebs_init(int nRecords, uint64_t *counter, uint64_t *reset_val ){
// 1. Set up the precise event buffering utilities.
// a. Place values in the
// i. precise event buffer base,
// ii. precise event index
// iii. precise event absolute maximum,
// iv. precise event interrupt threshold,
// v. and precise event counter reset fields
// of the DS buffer management area.
//
// 2. Enable PEBS. Set the Enable PEBS on PMC0 flag
// (bit 0) in IA32_PEBS_ENABLE_MSR.
//
// 3. Set up the IA32_PMC0 performance counter and
// IA32_PERFEVTSEL0 for an event listed in Table
// 18-10.

// IA32_DS_AREA points to 0x58 bytes of memory.
// (11 entries * 8 bytes each = 88 bytes.)

// Each PEBS record is 0xB0 bytes long.
...
pds_area->pebs_counter0_reset = reset_val[0];
pds_area->pebs_counter1_reset = reset_val[1];
pds_area->pebs_counter2_reset = reset_val[2];
pds_area->pebs_counter3_reset = reset_val[3];
...

write_msr(0, PMC0, reset_val[0]);
write_msr(1, PMC1, reset_val[1]);
write_msr(2, PMC2, reset_val[2]);
write_msr(3, PMC3, reset_val[3]);
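
For reference, here is a sketch of the DS buffer management area layout those writes target, following the 64-bit DS save area in the Intel SDM, Vol. 3B (the struct and field names are illustrative, chosen to match the pds_area fields above; note that the twelve 8-byte fields total 0x60 bytes, and 0x58 is the offset of the last field):

#include <stdint.h>

struct debug_store {                      /* pointed to by IA32_DS_AREA */
    uint64_t bts_buffer_base;             /* 0x00 */
    uint64_t bts_index;                   /* 0x08 */
    uint64_t bts_absolute_maximum;        /* 0x10 */
    uint64_t bts_interrupt_threshold;     /* 0x18 */
    uint64_t pebs_buffer_base;            /* 0x20 */
    uint64_t pebs_index;                  /* 0x28 */
    uint64_t pebs_absolute_maximum;       /* 0x30 */
    uint64_t pebs_interrupt_threshold;    /* 0x38 */
    uint64_t pebs_counter0_reset;         /* 0x40 */
    uint64_t pebs_counter1_reset;         /* 0x48 */
    uint64_t pebs_counter2_reset;         /* 0x50 */
    uint64_t pebs_counter3_reset;         /* 0x58 */
};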

This project is a library for accessing PEBS, and there are no examples of its usage included in the project (as far as I found, there is only one disabled test in other projects by tpatki).

Check the Intel SDM, Vol. 3B (it is the only good resource for PEBS programming) for the meaning of the fields and for PEBS configuration and output:
https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-734.html

18.15.7 Processor Event-Based Sampling

PEBS permits the saving of precise architectural information associated with one or more performance events in the precise event records buffer, which is part of the DS save area (see Section 17.4.9, “BTS and DS Save Area”).
To use this mechanism, a counter is configured to overflow after it has counted a preset number of events. After the counter overflows, the processor copies the current state of the general-purpose and EFLAGS registers and instruction pointer into a record in the precise event records buffer. The processor then resets the count in the performance counter and restarts the counter. When the precise event records buffer is nearly full, an interrupt is generated, allowing the precise event records to be saved. A circular buffer is not supported for precise event records.

... After the PEBS-enabled counter has overflowed, PEBS record is recorded

(So the reset value is probably negative: equal to -1000 to get every 1000th event, -10 to get every 10th event. The counter increments, and the PEBS record is written at counter overflow.)

and https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-656.html 18.4.4 Processor Event Based Sampling (PEBS), "Table 18-10": only L1/L2/DTLB misses have a PEBS event on Intel Core. (Find the PEBS section for your CPU and search for memory events; PEBS-capable events are really rare.)

So, to have more events recorded you probably want to set the reset values in this function to a smaller absolute value, like -50 or -10. With PEBS this may work. Also try perf record -e cycles:upp -c 10: don't ask to profile the kernel at such a high frequency, only user space (:u), ask for precise sampling (:pp), and ask for a counter reset of -10 (-c 10). perf has all the PEBS mechanics implemented, both the MSR setup and the buffer parsing.
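
A sketch of that reset arithmetic (pebs_reset_for_period is a hypothetical helper; the counter width is CPU-specific, often 48 bits, and reported via CPUID leaf 0AH):

#include <stdint.h>

/* The counter counts up from the reset value and a PEBS record is written
   on overflow, so "every Nth event" means a reset of -N in counter-width
   two's-complement arithmetic. */
static uint64_t pebs_reset_for_period(uint64_t period, unsigned counter_bits)
{
    uint64_t mask = (1ULL << counter_bits) - 1;
    return (uint64_t)(-(int64_t)period) & mask;  /* e.g. -10 -> 0xFFFFFFFFFFF6 with 48 bits */
}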

Other good resources for the PMU (hardware performance monitoring unit) also come from Intel: the PMU Programming Guides. They have short and compact descriptions of both the usual PMU and PEBS. There is the public "Nehalem Core PMU" guide, most of which is still useful for newer CPUs: https://software.intel.com/sites/default/files/m/5/2/c/f/1/30320-Nehalem-PMU-Programming-Guide-Core.pdf (And there are uncore PMU guides, e.g. the E5-2600 Uncore PMU Guide, 2012: https://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf)

An external PDF about PEBS: https://www.blackhat.com/docs/us-15/materials/us-15-Herath-These-Are-Not-Your-Grand-Daddys-CPU-Performance-Counters-CPU-Hardware-Performance-Counters-For-Security.pdf#page=23 "PMCs: Setting Up for PEBS", from "Black Hat USA 2015 - These are Not Your Grand Daddy's CPU Performance Counters".


You may start with a short and simple program (not the ref inputs of a recent SpecCPU) and use the Linux perf tool (perf_events) to find an acceptable ratio of memory requests recorded to all memory requests. PEBS is used with perf by adding the :p or :pp suffix to the event specifier: perf record -e event:pp. Also try pmu-tools' ocperf.py for easier encoding of Intel event names.

Try to find the real (maximum) overhead with different recording ratios (1% / 10% / 50%) on memory tests like the following (the worst case for memory recording overhead; these sit at the left end of the arithmetic intensity scale of the Roofline model: STREAM is BLAS1, GUPS and memlat are almost SpMV; real tasks are usually not so far left on the scale). A minimal probe sketch follows the list:

  • STREAM (linear access to memory),
  • RandomAccess (GUPS),
  • a memory latency test (the memlat test of 7-Zip, or lat_mem_rd of lmbench).
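
As the promised probe sketch, here is a minimal STREAM-like triad loop (an illustration, not the official STREAM benchmark). Compile with gcc -O2 and compare run times with and without perf record at different -c periods:

#include <stdio.h>
#include <stdlib.h>

#define N (1L << 24)    /* 16M doubles per array, 3 x 128 MiB: far beyond L3 */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c)
        return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    for (int rep = 0; rep < 20; rep++)
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];   /* triad: 2 loads + 1 store per element */

    printf("%f\n", a[N / 2]);           /* keep the work from being optimized away */
    return 0;
}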

Do you want to trace every load/store instruction, or do you only want to record requests that missed all (or some) caches and were sent to the main RAM of the PC (or to L3)?

Why do you want no overhead and all memory accesses recorded? That is simply impossible, since every traced memory access requires several bytes of trace data to be written to memory. A PEBS record on these CPUs is 0xB0 = 176 bytes, so recording even 10% of, say, 10^9 accesses per second generates on the order of 17 GB/s of record traffic, comparable to the machine's memory bandwidth. So having memory tracing enabled (more than 10% of memory accesses traced) will clearly limit the available memory bandwidth, and the program will run slower. Even 1% tracing can be noticeable, but its effect (overhead) is smaller.

Your CPU, the E5-2620 v4, is 14 nm Broadwell-EP, so it may also have an early variant of Intel PT: https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt https://github.com/01org/processor-trace and especially Andi Kleen's blog on PT: http://halobates.de/blog/p/410 "Cheat sheet for Intel Processor Trace with Linux perf and gdb":

PT support in hardware: Broadwell (5th generation Core, Xeon v4). More overhead. No fine grained timing.
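
If PT is actually available (the intel_pt PMU shows up under /sys/bus/event_source/devices/intel_pt), a quick check with perf, per the cheat sheet above, would be something like:

perf record -e intel_pt//u ./app
perf script

(perf script decodes the trace; expect a large perf.data file even for short runs.)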

PS: Scholars who studied SpecCPU memory accesses worked with memory access dumps/traces, and the dumps were generated slowly:

  • http://www.bu.edu/barc2015/abstracts/Karsli_BARC_2015.pdf - LLC misses recorded for offline analysis; no timing was taken from the tracing runs
  • http://users.ece.utexas.edu/~ljohn/teaching/382m-15/reading/gove.pdf - all loads/stores instrumented to write into an additional huge tracing buffer, with periodic (rare) online aggregation. Such instrumentation is 2x slow or slower, especially for memory bandwidth / latency limited code.
  • http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf (by Aamer Jaleel of Intel Corporation, VSSAD) - Pin-based instrumentation: the program code was modified and instrumented to write memory access metadata into a buffer. Such instrumentation is 2x slow or slower, especially for memory bandwidth / latency limited code. The paper lists and explains the instrumentation overhead and caveats:

Instrumentation Overhead: Instrumentation involves injecting extra code dynamically or statically into the target application. The additional code causes an application to spend extra time in executing the original application ... Additionally, for multi-threaded applications, instrumentation can modify the ordering of instructions executed between different threads of the application. As a result, IDS with multi-threaded applications comes at the lack of some fidelity

Lack of Speculation: Instrumentation only observes instructions executed on the correct path of execution. As a result, IDS may not be able to support wrong-path ...

User-level Traffic Only: Current binary instrumentation tools only support user-level instrumentation. Thus, applications that are kernel intensive are unsuitable for user-level IDS.

Event-based sampling with the perf userland tool and PEBS

It turns out the specific event that the generic cache-misses maps to does not support PEBS. An alternative is to use one of the events that are supported by PEBS (see the list for the Nehalem architecture here) with an appropriate mask to narrow it down. Specifically, one could use MEM_LOAD_RETIRED:LLC_MISS, even though the event doesn't seem to be accurate on all occasions.

Why is perf not working for precise events in my Intel Skylake Server?

You cannot obtain precise event counts with perf stat.

perf stat runs in non-sampling (counting) mode, in which perf maintains a running count of all occurrences of the events. It does not make sense to record precise events in counting mode. Precise events, as Peter mentioned, help you correctly narrow down the instruction (actually the +1 instruction, counting from the instruction that triggers the PEBS assist) to which the record in a sample is attributed.

Also, the PEBS interrupt handler is known to conflict with the counter overflow NMI that runs when perf stat is used. For more background, look at this discussion.

For the above reasons, recording precise events has been disabled in non-sampling mode, as can be seen here.

/* There's no sense in having PEBS for non sampling events: */
if (!is_sampling_event(event))
        return -EINVAL;

You should use perf record to record precise events, since it seems that the event mem_load_l3_miss_retired.remote_dram already has PEBS support:

perf record -e mem_load_l3_miss_retired.remote_dram:p sleep 2
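
The sampling requirement is also easy to see programmatically. A minimal hedged sketch (plain cycles is used here for portability instead of the Skylake event): opening a precise event with no sample_period fails with EINVAL, per the kernel check quoted above:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.precise_ip = 1;        /* ":p", but no sample_period: a counting event */

    if (syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0) < 0)
        perror("perf_event_open");   /* expect EINVAL here */
    return 0;
}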

Which event should I use for -e option for perf to get function branch events?

perf list doesn't list the actual hardware events; it shows perf's predefined event list, and that list is not fully supported by any CPU. Some CPUs map several events to perf's predefined names; others map a different subset.

You should check the documentation of your CPU core (Qualcomm Krait 400) to find the actual hardware performance monitoring events (counters) and use them as raw events (the encoding for perf stat -e rXXXX, or for RAW in perf_event_attr, is architecture specific too). You can also try perf stat / perf stat -d to check which events from the default lists are counted (supported).
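
For illustration only (the event number below is a placeholder, not a verified Krait encoding), a raw event is requested by its hex number:

perf stat -e r10 ./app

with 0x10 replaced by whatever number the Krait documentation gives for the event you want.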

Your Nexus 5 is based on the Krait 400 CPU core.

There were some problems reported on Krait (How to get perf_event results for 2nd Nexus7 with Krait CPU), and that thread linked to a patch defining standard events for Krait:

http://www.serverphorums.com/read.php?12,850329

There are two sets of mappings from perf's predefined events to actual hardware events: one with support for the branch-instructions event and one without:

/*
+ * Krait HW events mapping
+ */
+static const unsigned krait_perf_map[PERF_COUNT_HW_MAX] = {
+ [PERF_COUNT_HW_CPU_CYCLES] = ARMV7_PERFCTR_CPU_CYCLES,
+ [PERF_COUNT_HW_INSTRUCTIONS] = ARMV7_PERFCTR_INSTR_EXECUTED,
+ [PERF_COUNT_HW_CACHE_REFERENCES] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_CACHE_MISSES] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = ARMV7_PERFCTR_PC_WRITE,
+ [PERF_COUNT_HW_BRANCH_MISSES] = ARMV7_PERFCTR_PC_BRANCH_MIS_PRED,
+ [PERF_COUNT_HW_BUS_CYCLES] = ARMV7_PERFCTR_CLOCK_CYCLES,
+};
+
+static const unsigned krait_perf_map_no_branch[PERF_COUNT_HW_MAX] = {
+ [PERF_COUNT_HW_CPU_CYCLES] = ARMV7_PERFCTR_CPU_CYCLES,
+ [PERF_COUNT_HW_INSTRUCTIONS] = ARMV7_PERFCTR_INSTR_EXECUTED,
+ [PERF_COUNT_HW_CACHE_REFERENCES] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_CACHE_MISSES] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_BRANCH_MISSES] = ARMV7_PERFCTR_PC_BRANCH_MIS_PRED,
+ [PERF_COUNT_HW_BUS_CYCLES] = ARMV7_PERFCTR_CLOCK_CYCLES,
+};
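
(With the second, no-branch mapping, requesting the predefined event, e.g. perf stat -e branch-instructions ./app, should report the event as <not supported> rather than return wrong counts.)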

According to the selection code, this is a feature of later versions of the Krait CPU:

+static int krait_pmu_init(struct arm_pmu *cpu_pmu)
+{
+ u32 id = read_cpuid_id() & 0xffffff00;
+
+ armv7pmu_init(cpu_pmu);
+ cpu_pmu->name = "ARMv7 Krait";
+ /* Some early versions of Krait don't support PC write events */
+ if (id == 0x511f0400 || id == 0x510f0600)
+ cpu_pmu->map_event = krait_map_event_no_branch;
+ else
+ cpu_pmu->map_event = krait_map_event;
+ cpu_pmu->num_events = armv7_read_num_pmnc_events();
+ cpu_pmu->set_event_filter = armv7pmu_set_event_filter;
+ return 0;
+}

As far as I can decode the cpuid values, Krait 400 and Krait 600 have no support for the branch-instructions PMU event (the PC write event).

Update: for your Nexus 5X, if it uses the ARM Cortex-A57 core, there is a list of raw events based on Table 11-24 of the "Cortex A57 Technical Reference Manual":

https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/lib/events/arm_cortex_a57_events.h

There is still no counter for all branches. There are BRANCH_MISPRED and BRANCH_PRED, but I have no access to the docs and don't know whether they count all branches or not.
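
If you want to experiment anyway, the libpfm4 table gives the raw encodings, so something like perf stat -e r10,r12 ./app would count them (assuming 0x10 is BRANCH_MISPRED and 0x12 is BRANCH_PRED in that table; check the header file). Whether their sum approximates all executed branches is exactly the open question above.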

How does perf record (or other profilers) pick which instruction to count as costing time?

(A quick, not super detailed answer; a more detailed one would be good if someone wants to write it.)

perf just uses the CPU's own hardware performance counters, which can be put into a mode where they record an event when the counter counts down to zero or up to a threshold, either raising an interrupt or writing an event into a buffer in memory (with PEBS precise events).

That event will include a code address that the CPU picked to associate with the event (i.e. the point at which the interrupt was raised), even for events like cycles which, unlike instructions, don't inherently have a specific instruction associated with them. The out-of-order exec back-end can have a couple hundred instructions in flight when the counter wraps, but it has to pick exactly one for any given sample.

Generally the CPU "blames" the instruction that was waiting for a slow-to-produce result, not the one producing it, especially cache-miss loads.

For an example on Intel x86 CPUs, see Why is this jump instruction so expensive when performing pointer chasing?, which also appears to depend on the effect of letting the last instruction in the ROB retire when an interrupt is raised. (Intel CPUs at least do seem to do that; it makes sense for ensuring forward progress even with a potentially slow instruction.)

In general there can be "skew" when a later instruction is blamed than the one actually taking the time, possibly with different causes. (Perhaps especially for uncore events, since they happen asynchronously to the core clock.)

Other related Q&As with interesting examples or other things

  • Inconsistent `perf annotate` memory load/store time reporting
  • Linux perf reporting cache misses for unexpected instruction
  • https://travisdowns.github.io/blog/2019/08/20/interrupts.html - some experiments into which instructions tend to get counts on Skylake.

