What Are Stalled-Cycles-Frontend and Stalled-Cycles-Backend in 'Perf Stat' Result

what's the meaning of cycles annotation in perf stat

This is a thing I hate very much on perf, that the documentation and manual pages are outdated and searching for meaning of some values is pretty complicated. I did search for them once so I add my findings:

what's the meaning of 1.478 GHz

To my knowledge, the value after # is recalculation of the native counter value (the value in the first column) to the user-readable form. This value should roughly correspond to clock speed of your processor:

grep MHz /proc/cpuinfo

should give similar value. It is printed from tools/perf/util/stat-shadow.c.

and [46.17%] in cycles's annotation?

This value should correspond the the portion of time, the hardware counter was active. Perf allows to start more hardware counters and multiplex them in runtime, which makes it easier for programmer.

I didn't found the actual place in the code, but it is described in one recently proposed patch as (part of csv format):

+   - percentage of measurement time the counter was running

Interpretation of perf stat output

The first thing you want to make sure is that no other compute intensive process is running on your machine. That's a server CPU so I thought that could be a problem.

If you use multi-threading in your program, and you distribute equal amount of work between threads, you might be interested collecting metrics only on one CPU.

I suggest disabling hyper-threading in the optimization phase as it can lead to confusion when interpreting the profiling metrics. (e.g. increased #cycles spent in the back-end). Also if you distribute work to 3 threads, you have a high chance that 2 threads will share the resources of one core and the 3rd will have the entire core for itself - and it will be faster.

Perf has never been very good at explaining the metrics. Judging by the order of magnitude, the cache references are the L2 misses that hit the LLC. A high LLC miss number compared with LLC references is not always a bad thing if the number of LLC references / #Instructions is low. In your case, you have 0.018 so that means that most of your data is being used from L2. The high LLC miss ratio means that you still need to get data from RAM and write it back.

Regarding #Cycles BE and FE bound, I'm a bit concerned about the values because they don't sum to 100% and to the total number of cycles. You have 8G but staying 6G cycles in the FE and 4G cycles in the BE. That does not seem very correct.

If the FE cycles is high, that means you have misses in the instruction cache or bad branch speculation. If the BE cycles is high, that means you wait for data.

Anyway, regarding your question. The most relevant metric to asses the performance of your code is Instructions / Cycle (IPC). Your CPU can execute up to 4 instructions / cycle. You only execute 0.8. That means resources are underutilized, except for the case where you have many vector instructions. After IPC you need to check branch misses and L1 misses (data and code) because those generate most penalties.

A final suggestion: you may be interested in trying Intel's vTune Amplifier. It gives a much better explaining on the metrics and points you to the eventual problems in your code.



Related Topics



Leave a reply



Submit