Perf Stat for Part of Program

using Perf stat to profile both process and system-wide events simultaneously

The system-wide option -a is an attribute of the perf command itself rather than an attribute of an event to be profiled, so you can't collect event counts per-process and system-wide using a single perf command.

However, you can run multiple perf processes simultaneously with overlapping events. For example, if you want to profile a specific process and the entire system, you can launch two perf stat commands, one for each purpose. Internally, the perf_event subsystem schedules all the enabled events on each core. If all specified events cannot be scheduled at the same time, either because there are conflicts between the constraints of some events or the number of hardware events is larger than the number of performance monitoring countering registers, multiplexing is used. Each event from each perf stat command requires a dedicated PMC.

On modern Intel processors, you can have 8 general-purpose PMCs per logical core instead of just 4 if you disable hyperthreading (or if HT is not supported by the processor model). This may help eliminate or reduce multiplexing.

Otherwise, I don't think there is an easy to go about this.

What does perf stat output mean?


[root@root test]# perf stat -a -e "r81d0","r82d0" -v ./a
r81d0: 71800964 1269047979 1269006431
r82d0: 26655201 1284214869 1284214869

This is output from verbose option, as defined in tools/perf/builtin-stat.c file of the kernel:

391 /*
392 * Read out the results of a single counter:
393 * aggregate counts across CPUs in system-wide mode
394 */
395 static int read_counter_aggr(struct perf_evsel *counter)

408 if (verbose) {
409 fprintf(output, "%s: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
410 perf_evsel__name(counter), count[0], count[1], count[2]);
411 }

count is from struct perf_counts_values, defined as http://lxr.free-electrons.com/source/tools/perf/util/evsel.h?v=3.18#L12 with array of three uint64_t values, named as val, ena, run

Three count values are filled by kernel and are read from fd, opened with perf_event_open() syscall. There is related part of man perf_event_open: http://man7.org/linux/man-pages/man2/perf_event_open.2.html

   read_format
This field specifies the format of the data returned by
read(2) on a perf_event_open() file descriptor.

PERF_FORMAT_TOTAL_TIME_ENABLED
Adds the 64-bit time_enabled field. This can be used
to calculate estimated totals if the PMU is
overcommitted and multiplexing is happening.

PERF_FORMAT_TOTAL_TIME_RUNNING
Adds the 64-bit time_running field. This can be used
to calculate estimated totals if the PMU is
overcommitted and multiplexing is happening. ...

perf stat enables all TIME flags if scale is true -

298         if (scale)
299 attr->read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
300 PERF_FORMAT_TOTAL_TIME_RUNNING;

So, first counter is raw event count; second is proportional to the time when this event was collected and last is proportional to the total running time. This is needed when you asks perf to stat on high number of events, which can't be monitored at once (hardware usually has up to 5-7 performance monitors). In such case in-kernel perf will run subsets of required event for some parts of execution; and subsets will be changed several times. With ena and run counts, perf can estimate how inaccurate event monitoring was in case of multiplexing.

 Performance counter stats for './a':

71,800,964 r81d0 [100.00%]
26,655,201 r82d0

And in your case two events were mapped in the same time without needs of multiplexing; your ena and run counters are close. And print_aggr function prints their ratio:

1137                                 val += counter->counts->cpu[cpu].val;
1138 ena += counter->counts->cpu[cpu].ena;
1139 run += counter->counts->cpu[cpu].run;

Print_noise will output in case of -r N option to rerun task N times to get statistics (man: --repeat=<n> repeat command and print average + stddev (max: 100))

1176                                 print_noise(counter, 1.0);

And there is [100.00%] printer:

1178                                 if (run != ena)
1179 fprintf(output, " (%.2f%%)",
1180 100.0 * run / ena);

It will not print 100% if both run and ena times are equal, and your r82d0 event have equal. Your r81d0 event have slightly different run and ena, so 100% is printed in one line.

I know that perf stat -d can be inaccurate, because it asks for too much events; and there will be not 100% mulitplexing, but something like 53%. It means "this event was counted only in 53% of the program runtime in some random parts of it"; and if you program have several separated computing stages, events with low run/ena ratio will be less accurate.

Using perf probe to monitor performance stats during a particular function

I think that the instructions you are following are not yet included into the mainline Linux kernel. As a consequence, perf is telling you that the events are not supported: perf doesn't know the "toggle" mechanism mentioned on this page.

I can see two workarounds:

  1. If you have access to the source code you want to profile you can use the perf_event_open system call directly from your source code to start and stop counting on function entry and exit.
  2. Clone jolsa repository git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/jolsa/perf switch the core_toggle branch git co remotes/origin/perf/core_toggle and then compile and run the kernel with this support.

Regarding 2, I am not familiar at all with kernel versions and development and I think that this solution may be quie complex to use and maintain. Maybe you should ask on the perf users mailing list if there are any plans for the toggle feature to be integrated into the mainline kernel.



Related Topics



Leave a reply



Submit