linux perf: how to interpret and find hotspots
You should give hotspot a try:
https://www.kdab.com/hotspot-gui-linux-perf-profiler/
It's available on github: https://github.com/KDAB/hotspot
For example, it can generate flame graphs for you.
Understanding Linux perf FP counters and computation of FLOPS in a C++ program
The normal way for C++ compilers to do FP math on x86-64 is with scalar versions of SSE instructions, e.g. addsd xmm0, [rdi] (https://www.felixcloutier.com/x86/addsd). Only legacy 32-bit builds default to using the x87 FPU for scalar math.
If your compiler didn't manage to auto-vectorize anything (e.g. you didn't use g++ -O3 -march=native), and the only math you do is with double, not float, then all the math operations will be done with scalar-double instructions.
Each such instruction will be counted by the fp_arith_inst_retired.double, .scalar, and .scalar-double events. They overlap; they are basically sub-filters of the same event. (FMA operations count as two, even though they're still only one instruction, so these are FLOP counts, not uops or instructions.)
So you have 4,493,140,957 FLOPs over 65.86 seconds: 4493140957 / 65.86 / 1e9 ~= 0.0682 GFLOP/s, i.e. very low.
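That arithmetic can be sanity-checked with a one-liner (the counts are the ones quoted above):

```shell
# GFLOP/s = total FLOPs / elapsed seconds / 1e9
awk 'BEGIN { printf "%.4f GFLOP/s\n", 4493140957 / 65.86 / 1e9 }'
```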
If you had had any counts for 128b_packed_double, you'd multiply those by 2. As noted in the perf list description: "each count represents 2 computation operations, one for each element" because a 128-bit vector holds two 64-bit double elements. So each count for this event is 2 FLOPs. Similarly for the others, follow the scale factors described in the perf list output, e.g. times 8 for 256b_packed_single.
So you do need to separate the SIMD events by type and width, but you could just look at .scalar without separating single and double.
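Putting the scale factors together, the total FLOP count and throughput can be computed from the per-event counts like this (a sketch: only the scalar count is filled in from the numbers above; the packed counts are placeholders to replace with your own perf stat output):

```shell
# FLOPs per count for each fp_arith_inst_retired sub-event
scalar=4493140957   # .scalar (single or double):  x1
pd128=0             # .128b_packed_double:         x2 (two doubles per vector)
ps128=0             # .128b_packed_single:         x4
pd256=0             # .256b_packed_double:         x4
ps256=0             # .256b_packed_single:         x8
seconds=65.86

awk -v a="$scalar" -v b="$pd128" -v c="$ps128" -v d="$pd256" -v e="$ps256" -v t="$seconds" \
    'BEGIN {
        flops = a + 2*b + 4*c + 4*d + 8*e
        printf "%.0f FLOPs, %.4f GFLOP/s\n", flops, flops / t / 1e9
    }'
```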
See also FLOP measurement, one of the duplicates of FLOPS in Python using a Haswell CPU (Intel Core Processor (Haswell, no TSX)), which was linked on your previous question.
(36.37%) is how much of the total time that event was programmed on a HW counter. You used more events than there are counters, so perf multiplexed them for you, swapping every so often and extrapolating based on that statistical sampling to estimate the total over the run-time. See Perf tool stat output: multiplex and scaling of "cycles".
You could get exact counts for the non-zero non-redundant events by leaving out the ones that are zero for a given build.
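For example (event names vary by CPU generation; ./my_program is a placeholder), a second run that requests only the events that came back non-zero keeps each one on a hardware counter for the whole run, so no scaling is needed:

```shell
# Count just the two events of interest instead of the whole family
perf stat -e fp_arith_inst_retired.scalar_double,fp_arith_inst_retired.128b_packed_double ./my_program
```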
How to collect some readable stack traces with perf?
perf does offer call-stack recording with three different techniques:
- By default it uses the frame pointer (fp). This is generally supported and performs well, but it doesn't work with certain optimizations. Compile your applications with -fno-omit-frame-pointer etc. to make sure it works well.
- dwarf uses a dump of the stack for each sample for post-processing. That has a significant performance penalty.
- Modern systems can use the hardware-supported last branch record, lbr.
The stack is accessible in perf analysis tools such as perf report or perf script.
For more details check out man perf-record.
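The three techniques above are selected with the --call-graph option to perf record (./my_program is a placeholder for your workload):

```shell
# Frame-pointer unwinding (the default); build with -fno-omit-frame-pointer
perf record --call-graph fp ./my_program

# DWARF: copy a chunk of the stack with every sample and unwind it offline
perf record --call-graph dwarf ./my_program

# Hardware Last Branch Record (recent Intel CPUs)
perf record --call-graph lbr ./my_program

# Then inspect the recorded call stacks
perf report
```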
How to interpret LLC-Load-Misses from perf stats
The number of LLC-load-misses should be interpreted as the number of loads that miss in the last-level cache (typically the L3 on modern Intel chips) over the interval measured.
At the level this is measured, I believe loads going to the same cache line have already been "combined" by the line fill buffers: if you access several values in the same cache line which isn't present in the LLC, these all "miss" from the point of view of your process (the use of any of those values will wait for the full miss duration), but I believe this is only counted as one miss for the LLC-load-misses counter.
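To collect this counter (a sketch; ./my_program is a placeholder), it is useful to count LLC-loads alongside the misses so you can also compute the miss ratio:

```shell
# perf prints the miss percentage relative to LLC-loads automatically
perf stat -e LLC-loads,LLC-load-misses ./my_program
```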
Understanding perf detail when comparing two different implementations of a BFS algorithm
Unfortunately perf stat often doesn't give enough information to really determine where the bottleneck in your application is. It is possible for two applications to have wildly different underlying bottlenecks but very similar perf stat profiles. For example, two applications may have the same number or fraction of L2 cache misses, and yet one might be dominated by this effect and the other may be almost unaffected, depending on the amount and nature of overlapping work.
So if you try to analyze in depth from these high-level counters, you are often just taking stabs in the dark. Still, we can make a few observations. You mention:
The optimized algorithm has a much higher last level cache miss which
I thought would explain the primary issue with BFS / graph algorithms
with low locality but performance seems to be unaffected by this and
my unoptimized is significantly lower.
First, LLC misses are ~620 million for the optimized algorithm and ~380 million for your algorithm, but you are running the optimized algorithm 10 times in this benchmark and yours only once. So the optimized algorithm has perhaps 62 million misses per run, and your algorithm has six times that number of LLC misses. Yes, your algorithm has a lower LLC miss rate, but the absolute number of LLC misses is what counts for performance. The lower miss rate just means that you are making even more total accesses than the 6x figure suggests: basically you make many, many more memory accesses than the optimized version, which leads to a higher hit rate but more total misses.
All of this points to your unoptimized algorithm accessing more total memory, or perhaps accessing it in a much more cache-unfriendly fashion. That would also explain the much higher number of page faults. Overall, both algorithms have low IPC, and yours is particularly low (0.49 IPC); given that there aren't branch prediction problems, and that you've already identified these as graph algorithms with locality/memory-access problems, stalls while waiting for memory are very likely.
Luckily, there is a better way than just trying to reverse-engineer what the bottleneck might be from perf stat output. Intel has developed a whole methodology which tries to do this type of top-down analysis in a way that determines the true bottlenecks. It's not perfect, but it's far and away better than looking at the plain perf stat counters. VTune isn't free, but you can get a similar analysis based on the same methodology using Andi Kleen's toplev. I highly recommend you start there.
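A typical starting invocation, assuming you've cloned pmu-tools (./my_program is a placeholder; check the toplev documentation for the options supported on your CPU):

```shell
# Level-1 top-down analysis: classify the workload as front-end bound,
# back-end bound, bad speculation, or retiring
./pmu-tools/toplev.py -l1 ./my_program
```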