Read performance counters periodically in Linux
It seems that the perf tool in Linux works by recording an event when the counters reach a specific value, rather than sampling at regular intervals.
The command perf record -e cycles,instructions -c 10000 stores an event every 10000 cycles and every 10000 instructions. It can be run against a new command or an existing pid. It records to perf.data in the current directory.
Analyzing the data is another matter. Using perf script gets you quite close:
ls 16040 2152149.005813: cycles: c113a068 ([kernel.kallsyms])
ls 16040 2152149.005820: cycles: c1576af0 ([kernel.kallsyms])
ls 16040 2152149.005827: cycles: c10ed6aa ([kernel.kallsyms])
ls 16040 2152149.005831: instructions: c1104b30 ([kernel.kallsyms])
ls 16040 2152149.005835: cycles: c11777c1 ([kernel.kallsyms])
ls 16040 2152149.005842: cycles: c10702a8 ([kernel.kallsyms])
...
You need to write a script that takes a bunch of lines from that output and counts the number of 'cycles' and 'instructions' events in that set. You can adjust the resolution by changing the parameter -c 10000 in the recording command.
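A minimal sketch of such a script, in Python. It assumes the line layout shown in the sample output above (command name without spaces, pid, timestamp, event name); the function name count_events and the bucket_usec parameter are made up for this illustration and are not part of perf.

```python
import re
from collections import Counter, defaultdict

# Matches sample lines such as:
#   ls 16040 2152149.005813: cycles: c113a068 ([kernel.kallsyms])
# Assumption: the command name contains no spaces.
LINE_RE = re.compile(r"^\s*(\S+)\s+(\d+)\s+(\d+)\.(\d+):\s+([\w\-:.]+):\s")


def count_events(lines, bucket_usec=1000):
    """Count perf script samples per event name in fixed time buckets.

    Returns {bucket_index: Counter({event_name: number_of_samples})}.
    """
    buckets = defaultdict(Counter)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip "..." and anything that is not a sample line
        seconds, fraction, event = m.group(3), m.group(4), m.group(5)
        # Work in integer microseconds to avoid floating-point rounding.
        usec = int(seconds) * 1_000_000 + int(fraction.ljust(6, "0")[:6])
        buckets[usec // bucket_usec][event] += 1
    return dict(buckets)


if __name__ == "__main__":
    import sys

    # Usage sketch: perf script | python3 this_script.py
    for bucket, counts in sorted(count_events(sys.stdin).items()):
        print(bucket, dict(counts))
```

Multiplying each per-bucket sample count by the -c period gives an estimate of the raw event count in that time window.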
I verified the analysis by running perf stat and perf record against ls /. Stat reported 2 634 205 cycles, 1 725 255 instructions, while script output had 410 cycles events and 189 instructions events. The smaller the -c value, the more overhead there seems to be in the cycles reading.
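The comparison can be spelled out as a small calculation. This sketch only reuses the numbers quoted above and assumes each recorded event stands for 10000 occurrences (the -c value); the helper name estimate is made up for the illustration.

```python
PERIOD = 10_000  # the value passed to -c in the recording command

stat_counts = {"cycles": 2_634_205, "instructions": 1_725_255}  # from perf stat
sample_counts = {"cycles": 410, "instructions": 189}            # from perf script


def estimate(samples, period=PERIOD):
    """Each recorded sample stands for `period` occurrences of the event."""
    return samples * period


for event, samples in sample_counts.items():
    est = estimate(samples)
    ratio = est / stat_counts[event]
    print(f"{event}: estimated {est}, perf stat {stat_counts[event]}, ratio {ratio:.2f}")
```

The instruction estimate (1 890 000) is within about 10% of the perf stat figure, while the cycle estimate (4 100 000) is roughly 1.5 times higher, which fits the observation that the recording overhead shows up mostly in the cycles count.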
There is also a -F option to perf record, which samples at regular intervals. However, I could not find a way to retrieve the counter values when using this option.
Edit: perf stat apparently works on pids also, and captures data until ctrl-c is pressed. It should be quite easy to modify the source so that it always captures for N seconds and then run it in a loop.
Use linux perf utility to report counters every second like vmstat
perf stat has an "interval-print" option, -I N, where N is the interval in milliseconds; the counters are printed every N milliseconds (N>=10): http://man7.org/linux/man-pages/man1/perf-stat.1.html
-I msecs, --interval-print msecs
Print count deltas every N milliseconds (minimum: 10ms) The
overhead percentage could be high in some cases, for instance
with small, sub 100ms intervals. Use with caution. example: perf
stat -I 1000 -e cycles -a sleep 5
For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.
The results can also be produced in machine-readable form for importing, and with -I the first field is the timestamp:
With -x, perf stat is able to output a not-quite-CSV format output ... optional usec time stamp in fractions of second (with -I xxx)
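A hedged sketch of reading that output in Python. It assumes the column order described in the CSV FORMAT section of the man page for the case without -A or other per-CPU aggregation (timestamp, counter value, unit, event name, then further fields); the function name parse_interval_csv is made up for this example, and the column order should be checked against the perf version in use.

```python
import csv


def parse_interval_csv(lines, sep=","):
    """Yield (timestamp_seconds, event_name, value) from `perf stat -I N -x SEP` output.

    Assumed column order (no -A / per-core aggregation):
    timestamp, counter value, unit, event name, ...
    """
    for row in csv.reader(lines, delimiter=sep):
        if len(row) < 4 or row[0].lstrip().startswith("#"):
            continue  # blank lines and comment lines
        try:
            timestamp = float(row[0])
            value = float(row[1])
        except ValueError:
            continue  # e.g. "<not counted>" or "<not supported>"
        yield timestamp, row[3], value


if __name__ == "__main__":
    import sys

    # Usage sketch (perf stat prints to stderr unless -o is given):
    #   perf stat -a -I 1000 -x, -e cycles,instructions -o out.csv sleep 5
    #   python3 this_script.py < out.csv
    for timestamp, event, value in parse_interval_csv(sys.stdin):
        print(f"{timestamp:10.3f} {event:20s} {value:.0f}")
```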
The equivalent of the periodic printing done by vmstat and the sysstat-family tools (iostat, mpstat, etc.) is the -I 1000 option of perf stat (every second), for example system-wide (add -A to separate the counters per CPU):
perf stat -a -I 1000
The option is implemented in builtin-stat.c (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8), in the __run_perf_stat function:
static int __run_perf_stat(int argc, const char **argv)
{
	int interval = stat_config.interval;
	...
For perf stat -I 1000 with some program argument (forks=1), for example perf stat -I 1000 sleep 10, there is an interval loop (ts is the millisecond interval converted to struct timespec):
	enable_counters();

	if (interval) {
		while (!waitpid(child_pid, &status, WNOHANG)) {
			nanosleep(&ts, NULL);
			process_interval();
		}
	}
	...
	disable_counters();
For the system-wide variant of hardware performance counter monitoring (forks=0) there is another interval loop:
	enable_counters();
	while (!done) {
		nanosleep(&ts, NULL);
		if (interval)
			process_interval();
	}
	...
	disable_counters();
process_interval() (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8#L347) from the same file uses read_counters(), which loops over the event list and invokes read_counter(), which loops over all known threads and all cpus and calls the actual reading function:
	for (thread = 0; thread < nthreads; thread++) {
		for (cpu = 0; cpu < ncpus; cpu++) {
			...
			count = perf_counts(counter->counts, cpu, thread);
			if (perf_evsel__read(counter, cpu, thread, count))
				return -1;
perf_evsel__read is the real counter read, done while the program is still running:
int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
		     struct perf_counts_values *count)
{
	memset(count, 0, sizeof(*count));

	if (FD(evsel, cpu, thread) < 0)
		return -EINVAL;

	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
		return -errno;

	return 0;
}
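To illustrate what such a read returns, here is a hedged Python sketch that decodes the bytes read from a counter file descriptor. It assumes the event was opened with a read format that includes the enabled and running times, so that the buffer holds three native 64-bit values (value, time enabled, time running). The function names are made up for this example, and the scaling step is the usual estimate for multiplexed counters, not a copy of perf's code.

```python
import struct


def decode_counts(buf):
    """Decode 24 bytes read from a counter fd into (val, ena, run).

    Assumption: the event's read format yields value, time_enabled and
    time_running as three unsigned 64-bit integers in native byte order.
    """
    val, ena, run = struct.unpack("=QQQ", buf)
    return val, ena, run


def scaled_value(val, ena, run):
    """Estimate the full count when the counter ran only part of the time."""
    if run == 0:
        return 0          # counter was never scheduled on the PMU
    if run < ena:
        return round(val * ena / run)  # multiplexed: scale up
    return val


# Example with made-up numbers: counter active for half of the enabled time.
raw = struct.pack("=QQQ", 1000, 200, 100)
print(scaled_value(*decode_counts(raw)))
```

Interval printing then only needs the difference between two consecutive reads of the same counter.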
ARMv7 instructions to access performance counters directly from assembly language
There are some examples of direct PMU performance counters usage on ARM, for example
armv7: http://neocontra.blogspot.com/2013/05/user-mode-performance-counters-for.html
armv8: http://zhiyisun.github.io/2016/03/02/How-to-Use-Performance-Monitor-Unit-(PMU)-of-64-bit-ARMv8-A-in-Linux.html
So the first thing is to create a kernel module to enable user-mode access to PMU counters. Below is the code to set PMU register PMUSERENR_EL0 to enable user-mode access.
/*Enable user-mode access to counters. */
asm volatile("msr pmuserenr_el0, %0" : : "r"((u64)ARMV8_PMUSERENR_EN_EL0|ARMV8_PMUSERENR_ER|ARMV8_PMUSERENR_CR));
/* Performance Monitors Count Enable Set register bit 30:0 disable, 31 enable. Can also enable other event counters here. */
asm volatile("msr pmcntenset_el0, %0" : : "r" (ARMV8_PMCNTENSET_EL0_ENABLE));
/* Enable counters */
u64 val=0;
asm volatile("mrs %0, pmcr_el0" : "=r" (val));
asm volatile("msr pmcr_el0, %0" : : "r" (val|ARMV8_PMCR_E));
But performance counters are a privileged part of the system; by default they are only accessible from kernel mode. You can't just use assembly instructions in user-space code to access them: the only result you will get is SIGSEGV or another variant of permission denied. To enable access from user space, some work has to be done in kernel mode. It can be done by any existing PMU driver, perf or oprofile (an older PMU access tool), or by a custom kernel module that enables user-space access to the PMU registers. But to compile your own module you still need most of the kernel development infrastructure for your kernel (I expect that a standard chromebook kernel ships without the kernel includes and "kbuild" needed to build modules, and that it may not accept unsigned modules in its standard configuration).
What can you do:
- Use another machine, something more recent than your outdated chromebook. Your project may have some machines available for remote access, or you can try to buy a small and popular ARM single-board computer with Linux (like a Raspberry Pi 3/4). Such a popular board will have a more recent ARM CPU core, and it will run Ubuntu or Debian.
- Check the oprofile subsystem; it may be enabled in your kernel. The oprofile tools are older than perf but can access PMU counters too.
- Recompile the Linux kernel with the perf_events subsystem enabled. You only need a correct kernel that will boot on your chromebook, and any compiler to rebuild perf out-of-tree from https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/ (use any version of perf). Or use the perf_event_open syscall directly.
- Check for the /lib/modules/`uname -r`/build directory. If it exists, you can try to build a custom kernel module to enable direct user-space access.
TRM on pmcr_el0 and other PMU registers: https://developer.arm.com/documentation/100442/0100/debug-registers/aarch64-pmu-registers/pmcr-el0--performance-monitors-control-register--el0 https://developer.arm.com/docs/ddi0595/h/aarch64-system-registers/pmcr_el0 https://developer.arm.com/docs/ddi0595/h/aarch32-system-registers/pmccntr https://developer.arm.com/documentation/ddi0535/c/performance-monitoring-unit and some overview https://people.inf.ethz.ch/markusp/teaching/263-2300-ETH-spring14/slides/08-perfcounters.pdf
Performance Counters and IMC Counter Not Matching
Actually, it was mostly caused by the GPU device. This was the reason for exclusion from performance counters. Here is the relevant output for a sample execution of PCM on a relatively idle system with the resolution set to 3840x2160 and the refresh rate to 60 using xrandr:
And this is for the situation with resolution 800x600 and the same refresh rate (i.e., 60):
As can be seen, changing the screen resolution reduced read and IO traffic considerably (more than 100x!).
How does a system wide profiler (e.g. perf) correlate counters with instructions?
So I think there's some kernel module that launches software interrupts at a certain sampling rate.
Perf is not a module; it is part of the Linux kernel, implemented in kernel/events/core.c and, for every supported architecture and CPU model, in files such as arch/x86/kernel/cpu/perf_event*.c. Oprofile, however, was a module, with a similar approach.
Perf generally works by asking the PMU (Performance Monitoring Unit) of the CPU to generate an interrupt after N events of some hardware performance counter (Yokohama, slide 5 "• Interrupt when threshold reached: allows sampling"). Actually it may be implemented as:
- select some PMU counter
- initialize it to -N, where N is the sampling period (we want an interrupt after N events, for example after 2 million cycles with perf record -c 2000000 -e cycles, or some N computed and tuned by perf when no extra option is set or -F is given)
- set this counter to the wanted event, and ask the PMU to generate an interrupt on overflow (ARCH_PERFMON_EVENTSEL_INT). It will happen after N increments of our counter.
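The steps above can be sketched as a small simulation in Python. This is a toy model, not kernel code: the 48-bit counter width, the class name SamplingCounter and the context argument are assumptions made for the illustration.

```python
COUNTER_BITS = 48                  # assumed counter width for this sketch
MASK = (1 << COUNTER_BITS) - 1


class SamplingCounter:
    """Toy model of a PMU counter preloaded with -N that 'interrupts' on overflow."""

    def __init__(self, period):
        self.period = period
        self.samples = []          # what the interrupt handler recorded
        self._reload()

    def _reload(self):
        # Preload with -N, expressed in two's complement within the counter width.
        self.value = (-self.period) & MASK

    def count(self, n_events, context):
        """Advance the counter by n_events, recording one sample per overflow."""
        while n_events > 0:
            to_overflow = (MASK + 1) - self.value   # increments left until wrap
            if n_events < to_overflow:
                self.value += n_events
                return
            n_events -= to_overflow
            self.samples.append(context)            # the "interrupt handler" runs
            self._reload()                          # and re-arms the counter


counter = SamplingCounter(period=2_000_000)
counter.count(5_000_000, context="ip=0x1234")
print(len(counter.samples))  # 2 overflows; 1 000 000 events remain until the next
```

No periodic polling of the counter is needed in this scheme: the only work happens when the counter wraps around.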
All modern Intel chips support this, for example, Nehalem: https://software.intel.com/sites/default/files/76/87/30320 - Nehalem Performance Monitoring Unit Programming Guide
EBS - Event Based Sampling. A technique in which counters are pre-loaded with a large negative count, and they are configured to interrupt the processor on overflow. When the counter overflows the interrupt service routine capture profiling data.
So, when you use the hardware PMU, there is no additional work at the timer interrupt to read the hardware PMU counters. There is some work to save/restore the PMU state at task switch, but this (*_sched_in/*_sched_out of kernel/events/core.c) will neither change the PMU counter value for the current thread nor export it to user space.
There is a handler, arch/x86/kernel/cpu/perf_event.c: x86_pmu_handle_irq, which finds the overflowed counter and calls perf_sample_data_init(&data, 0, event->hw.last_period); to record the current time, the IP of the last executed instruction (it can be inexact because of the out-of-order nature of most Intel microarchitectures; there is a limited workaround for some events: PEBS, perf record -e cycles:pp), stacktrace data (if -g was used in record), etc. Then the handler resets the counter value to -N (x86_perf_event_set_period, wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask); note the minus before left).
The lower the sampling rate, the lower the profiler overhead.
Perf allows you to set a target sampling rate with the -F option; -F 1000 means around 1000 irq/s. High rates are not recommended due to high overhead. Ten years ago Intel VTune recommended not more than 1000 irq/s (http://www.cs.utah.edu/~mhall/cs4961f09/VTune-1.pdf "Try to get about a 1000 samples per second per logical CPU."); perf usually doesn't allow too high a rate for non-root users (it is autotuned to a lower rate when "perf interrupt took too long", check your dmesg; also check sysctl -a|grep perf, for example kernel.perf_cpu_time_max_percent=25, which means that perf will try to use not more than 25 % of CPU).
Can you interrogate for example the task scheduler to find out what was running when you interrupted him?
No. But you can enable a tracepoint at sched_switch or another sched event (list all available in sched: perf list 'sched:*'), and use it as the profiling event for perf. You can even ask perf to record a stacktrace at this tracepoint:
perf record -a -g -e "sched:sched_switch" sleep 10
Won't that affect the execution of the scheduler
An enabled tracepoint will add some perf event sampling work to the function that contains the tracepoint.
Is the list of task_struct objects available?
Only via ftrace...
Information about context switches
This is a software perf event, just a call to perf_sw_event with the PERF_COUNT_SW_CONTEXT_SWITCHES event from sched/core.c (indirectly). Example of a direct call, the migration software event: kernel/sched/core.c set_task_cpu(): p->se.nr_migrations++; perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
PS: there are good slides on perf, ftrace and other profiling and tracing subsystems in Linux by Gregg: http://www.brendangregg.com/linuxperf.html
How to Configure and Sample Intel Performance Counters In-Process
It seems the best way -- for Linux at least -- is to use the msr device node.
You simply open a device node, seek to the address of the MSR required, and read or write 8 bytes.
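A minimal sketch of that access pattern in Python, assuming the msr driver is loaded (modprobe msr), the process has sufficient privileges, and the device nodes live at /dev/cpu/N/msr. The function names and the path_template parameter are made up for this example.

```python
import os
import struct


def read_msr(register, cpu=0, path_template="/dev/cpu/{cpu}/msr"):
    """Read one 64-bit MSR: the file offset is the MSR address."""
    fd = os.open(path_template.format(cpu=cpu), os.O_RDONLY)
    try:
        data = os.pread(fd, 8, register)   # seek + read of 8 bytes in one call
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError(f"short read from MSR {register:#x}")
    return struct.unpack("=Q", data)[0]


def write_msr(register, value, cpu=0, path_template="/dev/cpu/{cpu}/msr"):
    """Write one 64-bit MSR: the file offset is the MSR address."""
    fd = os.open(path_template.format(cpu=cpu), os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("=Q", value), register)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # 0x10 is IA32_TIME_STAMP_COUNTER on x86; reading it needs root.
    print(hex(read_msr(0x10)))
```

Configuring a performance counter this way means writing the event select MSR first and then reading the matching counter MSR; the addresses for both come from the Intel manuals for the CPU model in question.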
OpenBSD is harder, since (at the time of writing) there is no user-space proxy to the MSRs. So you would need to write a kernel module or implement a sysctl by hand.