Read Performance Counters Periodically in Linux

It seems that the perf tool in Linux works by recording an event when the counters reach a specific value, rather than sampling at regular intervals.

The command perf record -e cycles,instructions -c 10000 records a sample every 10000 cycles and every 10000 instructions. It can be run against a new command or attached to an existing pid, and it writes to perf.data in the current directory.

Analyzing the data is another matter. Using perf script gets you quite close:

ls 16040 2152149.005813: cycles:          c113a068  ([kernel.kallsyms])
ls 16040 2152149.005820: cycles:          c1576af0  ([kernel.kallsyms])
ls 16040 2152149.005827: cycles:          c10ed6aa  ([kernel.kallsyms])
ls 16040 2152149.005831: instructions:          c1104b30  ([kernel.kallsyms])
ls 16040 2152149.005835: cycles:          c11777c1  ([kernel.kallsyms])
ls 16040 2152149.005842: cycles:          c10702a8  ([kernel.kallsyms])
...

You need to write a script that takes a batch of lines from that output and counts the 'cycles' and 'instructions' events in that batch; with -c 10000, each event stands for 10000 cycles or instructions. You can adjust the resolution by changing the -c 10000 parameter of the recording command.
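A minimal sketch of such a script follows. It assumes the perf script line layout shown above (timestamp followed by the event name), tolerates an optional numeric period column that some perf versions print, and multiplies each sample by the -c period. The function name count_events and its parameters are hypothetical, chosen only for this illustration.

```python
import re
from collections import Counter, defaultdict

# Matches "<seconds>.<fraction>: [<period>] <event>:" inside one perf script line.
SAMPLE_RE = re.compile(r"(?P<sec>\d+)\.(?P<frac>\d+):\s+(?:\d+\s+)?(?P<event>\S+?):\s")

def count_events(lines, window_us=1000, period=10000):
    """Return {window_index: Counter({event: estimated_count})}.

    window_us: width of one time window in microseconds.
    period:    the value passed to `perf record -c`; each sample stands
               for this many events.
    """
    buckets = defaultdict(Counter)
    first_us = None
    for line in lines:
        m = SAMPLE_RE.search(line)
        if m is None:
            continue  # skip "..." and anything that is not a sample line
        # Convert the timestamp to integer microseconds to avoid float rounding.
        frac = m.group("frac")[:6].ljust(6, "0")
        now_us = int(m.group("sec")) * 1_000_000 + int(frac)
        if first_us is None:
            first_us = now_us
        index = (now_us - first_us) // window_us
        buckets[index][m.group("event")] += period
    return dict(buckets)
```

Passing an open file or sys.stdin (fed from perf script) as lines gives one Counter per time window, which can then be printed in a vmstat-like fashion.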

I verified the analysis by running perf stat and perf record against ls /. perf stat reported 2 634 205 cycles and 1 725 255 instructions, while the perf script output had 410 cycles events and 189 instructions events, which at 10000 per event corresponds to about 4.1 million cycles and 1.89 million instructions. The smaller the -c value, the more overhead there seems to be in the cycles reading.

There is also a -F option to perf record, which samples at regular intervals. However, I could not find a way to retrieve the counter values when using this option.

Edit: perf stat apparently also works on pids, and captures data until Ctrl-C is pressed. It should be quite easy to modify the source so that it always captures for N seconds, and then run it in a loop.

Use linux perf utility to report counters every second like vmstat

perf stat has an interval-print option, -I N, which prints the counters every N milliseconds (N >= 10): http://man7.org/linux/man-pages/man1/perf-stat.1.html

  -I msecs, --interval-print msecs
       Print count deltas every N milliseconds (minimum: 10ms) The
       overhead percentage could be high in some cases, for instance
       with small, sub 100ms intervals. Use with caution. example: perf
       stat -I 1000 -e cycles -a sleep 5

  For best results it is usually a good idea to use it with interval
   mode like -I 1000, as the bottleneck of workloads can change often.

The results can also be emitted in machine-readable form with -x, and with -I the first field is a timestamp:

With -x, perf stat is able to output a not-quite-CSV format output ... optional usec time stamp in fractions of second (with -I xxx)

The periodic printing of vmstat and the sysstat-family tools (iostat, mpstat, etc.) corresponds to -I 1000 of perf stat (every second), for example system-wide (add -A to separate the counters per CPU):

  perf stat -a -I 1000
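To post-process the interval output, the -x form can be parsed with a few lines of Python. This is a sketch under assumptions: it expects the column order described in the CSV FORMAT section of the perf-stat man page when neither -A nor a --per-* option is used (timestamp, counter value, unit, event name, then further fields), and the function name parse_interval_csv is hypothetical.

```python
import csv
from collections import defaultdict

def _to_number(text):
    """Convert a counter field to int or float; None for '<not counted>' etc."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return None

def parse_interval_csv(lines, sep=","):
    """Group `perf stat -I <ms> -x<sep>` rows by interval timestamp.

    Assumed columns: timestamp, value, unit, event name, ... (no -A).
    Returns {timestamp_seconds: {event_name: value_or_None}}.
    """
    intervals = defaultdict(dict)
    for row in csv.reader(lines, delimiter=sep):
        if len(row) < 4 or row[0].lstrip().startswith("#"):
            continue  # blank lines, comments, or unexpected rows
        timestamp = float(row[0])  # seconds since counting started
        intervals[timestamp][row[3]] = _to_number(row[1])
    return dict(intervals)
```

Note that perf stat prints its counts to stderr unless -o is given, so the input would typically come from something like perf stat -a -I 1000 -x, -e cycles,instructions -o out.csv sleep 5.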

The option is implemented in builtin-stat.c (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8), in the __run_perf_stat function:

static int __run_perf_stat(int argc, const char **argv)
{
        int interval = stat_config.interval;

When perf stat -I 1000 is given a program to run (forks=1), for example perf stat -I 1000 sleep 10, this interval loop is used (ts is the millisecond interval converted to a struct timespec):

                enable_counters();
                if (interval) {
                        while (!waitpid(child_pid, &status, WNOHANG)) {
                                nanosleep(&ts, NULL);
                                process_interval();
                        }
                }
...
        disable_counters();

For the system-wide variant of hardware performance counter monitoring (forks=0) there is a different interval loop:

                enable_counters();
                while (!done) {
                        nanosleep(&ts, NULL);
                        if (interval)
                                process_interval();
                }
...
        disable_counters();

process_interval() (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8#L347), from the same file, calls read_counters(), which loops over the event list and invokes read_counter(); that in turn loops over all known threads and all CPUs and calls the actual reading function:

        for (thread = 0; thread < nthreads; thread++) {
                for (cpu = 0; cpu < ncpus; cpu++) {
...
                        count = perf_counts(counter->counts, cpu, thread);
                        if (perf_evsel__read(counter, cpu, thread, count))
                                return -1;

perf_evsel__read performs the real counter read while the program is still running:

int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
                     struct perf_counts_values *count)
{
        memset(count, 0, sizeof(*count));

        if (FD(evsel, cpu, thread) < 0)
                return -EINVAL;

        if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
                return -errno;

        return 0;
}
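The buffer filled by that read can be decoded as follows. This sketch assumes the event was opened with read_format set to PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING, so that the kernel returns three 64-bit values (value, time enabled, time running) as documented for perf_event_open. The scaling shown is the usual correction for multiplexed counters, not a copy of perf's own code, and the name decode_counter is hypothetical.

```python
import struct

# value, time_enabled, time_running: three native-endian unsigned 64-bit fields.
READ_FORMAT = struct.Struct("=QQQ")

def decode_counter(buf):
    """Decode one 24-byte read() result and scale it for multiplexing."""
    val, ena, run = READ_FORMAT.unpack(buf)
    if run == 0:
        return 0  # the event was never scheduled onto the PMU
    if run < ena:
        # The counter ran only part of the time it was enabled:
        # extrapolate to the whole enabled period.
        return int(val * ena / run + 0.5)
    return val
```

A periodic reader would call this on each interval's buffer and print the difference between consecutive results, which is what the "count deltas" of -I refer to.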

ARMv7 instructions to access performance counters directly from assembly language

There are some examples of direct PMU performance counters usage on ARM, for example

armv7: http://neocontra.blogspot.com/2013/05/user-mode-performance-counters-for.html

armv8: http://zhiyisun.github.io/2016/03/02/How-to-Use-Performance-Monitor-Unit-(PMU)-of-64-bit-ARMv8-A-in-Linux.html

So the first thing is to create a kernel module to enable user-mode access to PMU counters. Below is the code to set PMU register PMUSERENR_EL0 to enable user-mode access.

/* Enable user-mode access to counters. */
asm volatile("msr pmuserenr_el0, %0" : : "r"((u64)ARMV8_PMUSERENR_EN_EL0|ARMV8_PMUSERENR_ER|ARMV8_PMUSERENR_CR));

/* Performance Monitors Count Enable Set register: bit 31 enables the cycle counter, bits 30:0 (left disabled here) enable the event counters. Other event counters can also be enabled here. */
asm volatile("msr pmcntenset_el0, %0" : : "r" (ARMV8_PMCNTENSET_EL0_ENABLE));

/* Enable counters by setting the E bit in PMCR_EL0. */
u64 val = 0;
asm volatile("mrs %0, pmcr_el0" : "=r" (val));
asm volatile("msr pmcr_el0, %0" : : "r" (val|ARMV8_PMCR_E));

But performance counters are a privileged part of the system; by default they are accessible only from kernel mode. You can't just use assembly instructions in user-space code to access them: the only result you will get is SIGSEGV or another variant of permission denied. To enable access from user space, some work must be done in kernel mode. It can be done by any existing PMU driver, perf or oprofile (an older PMU access tool), or by a custom kernel module that enables user-space access to the PMU registers. But to compile your module you still need most of the kernel development infrastructure for your kernel (I expect that the standard chromebook kernel ships without the kernel includes and kbuild needed to build a module, and this kernel may not accept unsigned modules in its standard configuration).

What can you do:

Use another machine, something more recent than your outdated chromebook. Your project may have some machines available for remote access, or you can buy a small, popular ARM single-board computer running Linux (like a Raspberry Pi 3/4). Such a popular board will have a more recent ARM CPU core, and it will run Ubuntu or Debian.
Check the oprofile subsystem; it may be enabled in your kernel. The oprofile tools are older than perf but can access PMU counters too.
Recompile the Linux kernel with the perf_events subsystem enabled. You only need a correct kernel that boots on your chromebook, and any compiler to rebuild perf out-of-tree from https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/ (use any version of perf). Or use the perf_event_open syscall directly.
Check for the /lib/modules/`uname -r`/build directory. If it exists, you can try to build a custom kernel module to enable direct user-space access.
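The last check can be scripted. The sketch below only tests whether the build directory for the running kernel exists; the function name kernel_build_dir and its root parameter are hypothetical (root exists so the check can be pointed at another filesystem tree).

```python
import os
import platform

def kernel_build_dir(release=None, root="/"):
    """Return the kbuild directory for a kernel release, or None if absent.

    release defaults to the running kernel (same string as `uname -r`).
    """
    release = release or platform.release()
    path = os.path.join(root, "lib", "modules", release, "build")
    # isdir() follows symlinks; "build" is commonly a symlink to the headers.
    return path if os.path.isdir(path) else None
```

A result of None suggests that the kernel headers are not installed, in which case building a custom module on that machine is unlikely to work without further setup.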

TRM on pmcr_el0 and other PMU registers:
https://developer.arm.com/documentation/100442/0100/debug-registers/aarch64-pmu-registers/pmcr-el0--performance-monitors-control-register--el0
https://developer.arm.com/docs/ddi0595/h/aarch64-system-registers/pmcr_el0
https://developer.arm.com/docs/ddi0595/h/aarch32-system-registers/pmccntr
https://developer.arm.com/documentation/ddi0535/c/performance-monitoring-unit
and an overview: https://people.inf.ethz.ch/markusp/teaching/263-2300-ETH-spring14/slides/08-perfcounters.pdf

Performance Counters and IMC Counter Not Matching

Actually, it was mostly caused by the GPU device. This was the reason for the exclusion from the performance counters. Here is the relevant output for a sample execution of PCM on a relatively idle system with the resolution set to 3840x2160 and the refresh rate set to 60 using xrandr:

[Figure: Relevant PCM Output for High-Resolution Case]

And this is for the situation with resolution 800x600 and the same refresh rate (i.e., 60):

[Figure: Relevant PCM Output for Low-Resolution Case]

As can be seen, changing the screen resolution reduced read and IO traffic considerably (more than 100x!).

How does a system wide profiler (e.g. perf) correlate counters with instructions?

So I think there's some kernel module that launches software interrupts at a certain sampling rate.

Perf is not a module; it is part of the Linux kernel, implemented in kernel/events/core.c plus code for every supported architecture and CPU model, for example arch/x86/kernel/cpu/perf_event*.c. Oprofile, however, was a module that used a similar approach.

Perf generally works by asking the PMU (performance monitoring unit) of the CPU to generate an interrupt after N events of some hardware performance counter (Yokohama, slide 5: "• Interrupt when threshold reached: allows sampling"). It may be implemented as follows:

select some PMU counter
initialize it to -N, where N is the sampling period (we want an interrupt after N events, for example after 2 million cycles with perf record -c 2000000 -e cycles, or after some N computed and tuned by perf when no extra option is set or -F is given)
set this counter to the wanted event, and ask the PMU to generate an interrupt on overflow (ARCH_PERFMON_EVENTSEL_INT). The overflow will happen after N increments of our counter.
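The preload-and-overflow steps above can be simulated in a few lines. This is an illustration only: the 48-bit counter width is an assumption (the kernel takes the real width from the CPU and applies it through a mask), and the names preload, events_until_overflow and simulate are hypothetical.

```python
COUNTER_BITS = 48                       # assumed width of the hardware counter
CNTVAL_MASK = (1 << COUNTER_BITS) - 1   # keeps values inside the counter width

def preload(period):
    """Value to write so that the counter overflows after `period` events."""
    return (-period) & CNTVAL_MASK

def events_until_overflow(counter_value):
    """Number of increments left before the counter wraps to zero."""
    return (CNTVAL_MASK + 1) - counter_value

def simulate(period, total_events):
    """Count the overflow interrupts raised by `total_events` increments."""
    counter = preload(period)
    samples = 0
    for _ in range(total_events):
        counter = (counter + 1) & CNTVAL_MASK
        if counter == 0:                # wrapped: the PMU raises its interrupt
            samples += 1
            counter = preload(period)   # the handler re-arms the counter with -N
    return samples
```

With a period of 10000, 35000 events produce three overflows, and the last 5000 events stay in the counter until the next overflow.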

All modern Intel chips support this, for example Nehalem: https://software.intel.com/sites/default/files/76/87/30320 - Nehalem Performance Monitoring Unit Programming Guide

EBS - Event Based Sampling. A technique in which counters are pre-loaded with a large negative count, and they are configured to interrupt the processor on overflow. When the counter overflows the interrupt service routine capture profiling data.

So, when you use the hardware PMU, no additional work is done at the timer interrupt to read the hardware PMU counters. There is some work to save and restore the PMU state at task switch, but this (*_sched_in/*_sched_out in kernel/events/core.c) neither changes the PMU counter value for the current thread nor exports it to user space.

There is a handler, x86_pmu_handle_irq in arch/x86/kernel/cpu/perf_event.c, which finds the overflowed counter and calls perf_sample_data_init(&data, 0, event->hw.last_period); to record the current time, the IP of the last executed instruction (it can be inexact because of the out-of-order nature of most Intel microarchitectures; there is a limited workaround for some events: PEBS, perf record -e cycles:pp), stack trace data (if -g was used in record), etc. Then the handler resets the counter value to -N (x86_perf_event_set_period, wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask); note the minus before left).

The lower the sampling rate, the lower the profiler overhead.

Perf lets you set a target sampling rate with the -F option; -F 1000 means around 1000 irq/s. High rates are not recommended because of the high overhead. Ten years ago Intel VTune recommended no more than 1000 irq/s (http://www.cs.utah.edu/~mhall/cs4961f09/VTune-1.pdf "Try to get about a 1000 samples per second per logical CPU."), and perf usually doesn't allow too high a rate for non-root users (the rate is autotuned to a lower value when "perf interrupt took too long" is reported; check your dmesg. Also check sysctl -a|grep perf, for example kernel.perf_cpu_time_max_percent=25, which means that perf will try to use no more than 25% of the CPU).

Can you interrogate, for example, the task scheduler to find out what was running when you interrupted it?

No. But you can enable a tracepoint at sched_switch or another sched event (list all available ones with perf list 'sched:*') and use it as the profiling event for perf. You can even ask perf to record a stack trace at this tracepoint:

 perf record -a -g -e "sched:sched_switch" sleep 10

Won't that affect the execution of the scheduler?

An enabled tracepoint adds some perf event sampling work to the function that contains the tracepoint.

Is the list of task_struct objects available?

Only via ftrace...

Information about context switches

This is a software perf event: just a call to perf_sw_event with the PERF_COUNT_SW_CONTEXT_SWITCHES event from sched/core.c (indirectly). An example of a direct call is the migration software event in kernel/sched/core.c set_task_cpu(): p->se.nr_migrations++; perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);

PS: there are good slides on perf, ftrace and other profiling and tracing subsystems in Linux by Gregg: http://www.brendangregg.com/linuxperf.html

How to Configure and Sample Intel Performance Counters In-Process

It seems the best way -- for Linux at least -- is to use the msr device node.

You simply open a device node, seek to the address of the MSR required, and read or write 8 bytes.
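A sketch of that read in Python follows. It assumes the Linux msr driver is loaded and exposes /dev/cpu/<N>/msr (which normally requires root), and it uses IA32_FIXED_CTR0 (address 0x309, instructions retired) purely as an example register; that counter returns meaningful values only after it has been enabled through the corresponding control MSRs. The function name read_msr and its path parameter are hypothetical; path exists so that the function can be exercised without the real device.

```python
import os
import struct

IA32_FIXED_CTR0 = 0x309  # fixed-function counter 0: instructions retired

def read_msr(register, cpu=0, path=None):
    """Read one 64-bit MSR of the given CPU through the msr device node."""
    path = path or f"/dev/cpu/{cpu}/msr"
    fd = os.open(path, os.O_RDONLY)
    try:
        # The file offset is the MSR address; every access is 8 bytes wide.
        data = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError(f"short read from {path}")
    return struct.unpack("<Q", data)[0]  # x86 is little-endian
```

Sampling in-process then amounts to calling read_msr(IA32_FIXED_CTR0) at the points of interest and subtracting consecutive values; writing an MSR is the mirror image, using os.pwrite on a descriptor opened for writing.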

OpenBSD is harder, since (at the time of writing) there is no user-space proxy to the MSRs. So you would need to write a kernel module or implement a sysctl by hand.
