what's the meaning of cycles annotation in perf stat
This is something I dislike about perf: the documentation and manual pages are outdated, and searching for the meaning of some values is quite complicated. I searched for them once, so here are my findings:
what's the meaning of 1.478 GHz
To my knowledge, the value after # is a recalculation of the native counter value (the value in the first column) into a user-readable form. This value should roughly correspond to the clock speed of your processor:
grep MHz /proc/cpuinfo
should give a similar value. It is printed from tools/perf/util/stat-shadow.c.
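The recalculation described above can be sketched as follows. This is not perf's actual code, just the arithmetic: the raw cycle count divided by the time the counter was running, in nanoseconds, gives GHz directly. The numbers below are made-up examples.

```python
# Sketch of how the "# X.XXX GHz" annotation next to the cycles event
# is derived: raw counter value / time the counter ran (nanoseconds).
cycles = 1_478_000_000           # raw counter value (first column)
time_running_ns = 1_000_000_000  # counter active for 1 second

ghz = cycles / time_running_ns   # cycles per nanosecond == GHz
print(f"# {ghz:.3f} GHz")        # -> "# 1.478 GHz"
```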
and [46.17%] in the cycles annotation?
This value corresponds to the portion of time the hardware counter was active. Perf allows you to start more hardware counters than the CPU can run at once and multiplexes them at runtime, which makes things easier for the programmer.
I didn't find the actual place in the code, but it is described in one recently proposed patch (as part of the CSV format):
+ - percentage of measurement time the counter was running
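When a counter runs for only part of the measurement like this, perf extrapolates the raw count to the full run. A hedged sketch of that scale-up, with made-up example numbers:

```python
# If a counter was running only 46.17% of the time (the [46.17%]
# annotation), the raw count is scaled up to estimate the full-run total.
raw_count = 500_000_000       # events observed while the counter ran
running_fraction = 0.4617     # time_running / time_enabled

estimated_total = raw_count / running_fraction
print(round(estimated_total))
```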
Understanding Linux perf FP counters and computation of FLOPS in a C++ program
The normal way for C++ compilers to do FP math on x86-64 is with scalar versions of SSE instructions, e.g. addsd xmm0, [rdi]
(https://www.felixcloutier.com/x86/addsd). Only legacy 32-bit builds default to using the x87 FPU for scalar math.
If your compiler didn't manage to auto-vectorize anything (e.g. you didn't use g++ -O3 -march=native), and the only math you do is with double, not float, then all the math operations will be done with scalar-double instructions.
Each such instruction will be counted by the fp_arith_inst_retired.double, .scalar, and .scalar-double events. They overlap; they're basically sub-filters of the same event. (FMA operations count as two, even though they're still only one instruction, so these are FLOP counts, not uops or instructions.)
So you have 4,493,140,957 FLOPs over 65.86 seconds. 4493140957 / 65.86 / 1e9 ~= 0.0682 GFLOP/s, i.e. very low.
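That arithmetic, written out:

```python
# FLOP count divided by elapsed seconds gives FLOP/s; divide by 1e9
# for GFLOP/s. Values are the ones from the answer above.
flops = 4_493_140_957   # scalar-double FLOPs counted by perf
seconds = 65.86

gflops = flops / seconds / 1e9
print(f"{gflops:.4f} GFLOP/s")   # -> "0.0682 GFLOP/s"
```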
If you had had any counts for 128b_packed_double, you'd multiply those by 2. As noted in the perf list description: "each count represents 2 computation operations, one for each element" because a 128-bit vector holds two 64-bit double elements. So each count for this event is 2 FLOPs. Similarly for the others, follow the scale factors described in the perf list output, e.g. times 8 for 256b_packed_single.
So you do need to separate the SIMD events by type and width, but you could just look at .scalar without separating single and double.
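Putting the scale factors together, a total FLOP count can be sketched like this. The event names match the fp_arith_inst_retired sub-events and the multipliers follow the perf list descriptions (elements per vector); the counts themselves are made-up examples.

```python
# Total FLOPs = sum over events of (count * elements per operation).
counts = {
    "scalar_single":      1_000,
    "scalar_double":      2_000,
    "128b_packed_double":   500,   # 2 doubles per 128-bit vector
    "128b_packed_single":   400,   # 4 floats per 128-bit vector
    "256b_packed_double":   300,   # 4 doubles per 256-bit vector
    "256b_packed_single":   200,   # 8 floats per 256-bit vector
}
scale = {
    "scalar_single": 1, "scalar_double": 1,
    "128b_packed_double": 2, "128b_packed_single": 4,
    "256b_packed_double": 4, "256b_packed_single": 8,
}
total_flops = sum(counts[e] * scale[e] for e in counts)
print(total_flops)   # -> 8400
```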
See also FLOP measurement, one of the duplicates of FLOPS in Python using a Haswell CPU (Intel Core Processor (Haswell, no TSX)), which was linked on your previous question.
(36.37%) is how much of the total time that event was programmed on a HW counter. You used more events than there are counters, so perf multiplexed them for you, swapping every so often and extrapolating based on that statistical sampling to estimate the total over the run time. See Perf tool stat output: multiplex and scaling of "cycles".
You could get exact counts for the non-zero non-redundant events by leaving out the ones that are zero for a given build.
Performance monitoring with perf
ANSWER #1
Yes, mostly. perf report does show you a summary of the trace collected. Samples collected by perf record are saved into a binary file called, by default, perf.data. The perf report command reads this file and generates a concise execution profile. By default, samples are sorted with the functions having the most samples first. However, you can also do much more detailed profiling using this report.
ANSWER #2
You should ideally use perf script -D to get a trace of all data. The timestamp is in microseconds, although in kernels newer than the one you specify, a command-line switch (-ns) lets you display the time in nanoseconds as well. Here is the source -
Timestamp
It is quite hard to tell without looking at what kind of "deltas" you are getting. Remember that the period for collecting samples is usually tuned. There are two ways of specifying the rate at which to collect samples:
You can use perf record -c (c for count) to specify the period at which to collect samples. This means that for every c occurrences of the event you are measuring, you will record one sample. You can then modify the sampling period and test various values. With -c 2, for example, at every two occurrences of the event being measured, the counter will overflow and you will record a sample.
The other way to express the sampling period is to specify the average rate of samples per second (frequency), which you can do using perf record -F. So perf record -F 1000 will record around 1000 samples per second, and these samples are generated when the hardware/PMU counter corresponding to the event overflows. This means that the kernel dynamically adjusts the sampling period, and you will get samples at effectively random moments in time.
You can see for yourself in code here:
How perf dynamically updates time
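The idea behind that dynamic adjustment can be sketched in a few lines. This is a toy model, not perf's real algorithm: pick the next counter overflow period from the recently observed event rate so that samples arrive at roughly the requested frequency.

```python
# Toy sketch of dynamic period adjustment for `perf record -F`:
# period = observed event rate / desired sample rate (at least 1).
def next_period(events_per_sec: float, target_hz: int = 1000) -> int:
    return max(1, round(events_per_sec / target_hz))

print(next_period(3_000_000_000))  # busy CPU -> large period (3000000)
print(next_period(50_000))         # near-idle -> small period (50)
```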
ANSWER #3
Why not? Ideally you should get the number of event samples collected if you do a perf report and just do a deeper analysis. Also, when you do a perf record and finish recording samples, you get a notification on the command line about the number of samples collected for the event you measured. (This may not be available in the kernel version you use; I would suggest switching to a newer Linux version if possible!) The number of samples is the raw sample count, not the period.
If your period is 100, it means that for the whole duration of the trace, perf recorded every 100th event. That means if a total of 1000 events happened over the trace duration, perf approximately collected events 1, 100, 200, 300, ..., 1000.
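In other words, with a fixed period you can extrapolate the raw event total from the sample count:

```python
# With a fixed sampling period (e.g. perf record -c 100), the event
# total is approximately samples * period.
period = 100     # one sample recorded per 100 events
samples = 10     # samples collected during the trace

estimated_events = samples * period
print(estimated_events)   # -> 1000
```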
Yes, the samples recorded are not only from the application. In fact, you can use event modifiers like perf record -e <event-name>:u or -e <event-name>:k (u for user space and k for kernel) to restrict where events are recorded. Additionally, perf records samples from shared libraries as well. (Please consult the perf man page for more details.)
As I said previously, perf report should be the ideal tool for calculating the number of samples of the cycles event recorded by perf. The number of events collected/recorded is not exact, because it is simply not possible for the hardware to record every cycles event. Recording and preparing details of all the events requires the kernel to maintain a ring buffer, which gets written to periodically whenever the counter overflows. This writing happens via interrupts, which take up a fraction of CPU time; that time is lost and could have been used to record events, which are dropped because the CPU was busy servicing interrupts. Even so, perf gives you a really good estimate.
CONCLUSION
perf does essentially what it intends to do given the limitations of the hardware resources we currently have at hand. I would suggest going through the man pages for each command to understand them better.
QUESTIONS
I assume you are looking at perf report, and that you are talking about the overhead % in perf report. Theoretically, it can be considered an arrangement of data from the highest to the lowest occurrence, as you specified. But there are many underlying details that you need to consider and understand to properly make sense of the output. It represents which function has the most overhead (in terms of the number of events that occurred in that function). There is also a parent-child relationship between all the functions and their overheads, based on which function calls which. Please use the Perf Report link to understand more.
As you know already, events are being sampled, not counted. So you cannot get the number of events exactly, but you will get the number of samples, and based on the tuned frequency of collecting samples, you can also estimate the raw count of the number of events (everything should be available to you in the perf report output).