Finding Threading Bottlenecks and Optimizing for Wall-Time with Perf


KDAB Hotspot is a GUI that can analyze perf record output; it can also show context switches and core utilization if the profile was recorded with -e sched:sched_switch --switch-events --sample-cpu.

[screenshot: KDAB Hotspot showing context switches and core utilization]
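A minimal recording sketch, assuming hotspot is installed and on PATH and that ./myapp is a placeholder workload:

perf record -e cycles -e sched:sched_switch --switch-events --sample-cpu ./myapp
hotspot perf.data    # open the recording in KDAB Hotspot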

Performance monitoring with perf

ANSWER #1

Yes, mostly. perf report shows you a summary of the trace collected. Samples collected by perf record are saved into a binary file called, by default, perf.data. The perf report command reads this file and generates a concise execution profile. By default, samples are sorted by function, with the most-sampled functions first. However, you can also do much more detailed profiling with this report.
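For example, a minimal sketch (./myapp is a placeholder workload):

perf record ./myapp     # samples the default cycles event into perf.data
perf report --stdio     # text summary, hottest functions first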

ANSWER #2

You should ideally use perf script -D to get a dump of all the trace data. The timestamp is in microseconds. In kernels newer than the one you mention, a command-line switch (--ns) lets you display the timestamps in nanoseconds as well. Here is the source -

Timestamp
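For example (both read the perf.data in the current directory; --ns needs a reasonably recent perf):

perf script -D | head      # raw dump of the recorded perf.data contents
perf script --ns | head    # same trace with nanosecond-resolution timestamps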

It is quite hard to tell without looking at what kind of "deltas" you are getting. Remember that the period at which samples are collected is usually tuned. There are two ways of specifying the rate at which to collect samples:

You can use perf record -c (for count) to specify the period at which to collect samples. This means that for every c occurrences of the event you are measuring, you will record one sample. You can then modify the sampling period and test various values. With -c 2, for example, at every two occurrences of the event being measured the counter overflows and a sample is recorded.

The other way to express the sampling period is to specify the average rate of samples per second (the frequency), which you can do with perf record -F. So perf record -F 1000 will record around 1000 samples per second, and these samples are generated when the hardware/PMU counter corresponding to the event overflows. This means the kernel dynamically adjusts the sampling period, and the samples end up being taken at somewhat random moments.
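A sketch of the two modes side by side (the event and values are only illustrative, and ./myapp is a placeholder):

perf record -e cycles -c 10000 ./myapp   # fixed period: one sample every 10,000 cycles events
perf record -e cycles -F 1000 ./myapp    # fixed frequency: ~1000 samples/s, period tuned by the kernel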

You can see for yourself in code here:

How perf dynamically updates time

ANSWER #3

Why not? Ideally you should get the number of event samples collected if you run perf report and do a deeper analysis. Also, when perf record finishes recording, it prints a notification on the command line with the number of samples collected for the event you measured. (This may not be available in the kernel version you are using; I would suggest switching to a newer Linux version if possible!) The number of samples is the raw sample count, not the period.

If your period is 100, it means that for the whole duration of the trace, perf recorded every 100th event. So if a total of 1000 events happened during the trace, perf collected roughly events 100, 200, 300, ..., 1000 - about 10 samples.

Yes, the samples recorded are not only from the application. In fact, you can use modifiers like perf record -e <event-name>:u or <event-name>:k (u for user space, k for kernel) to restrict where events are recorded. Additionally, perf records samples from shared libraries as well. (Please consult the perf man page for more details.)
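For instance (cycles is just an example event, ./myapp a placeholder):

perf record -e cycles:u ./myapp   # user-space samples only
perf record -e cycles:k ./myapp   # kernel-space samples only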

As I said previously, perf report is the right tool to see the number of samples of the cycles event recorded by perf. The number of events collected/recorded is not exact, because it is simply not possible for the hardware to record every cycle event. Recording and preparing details of all the events requires the kernel to maintain a ring buffer which is written to periodically, whenever the counter overflows. This writing happens via interrupts, which take up a fraction of CPU time; that time is lost and could have been used to record events, which are now missed because the CPU was busy servicing interrupts. Even so, perf gives you a really good estimate.

CONCLUSION

perf does essentially what it intends to do given the limitations of the hardware resources we currently have at hand. I would suggest going through the man pages for each command to understand them better.

QUESTIONS


  1. I assume you are looking at perf report, and that you are talking about the overhead % column. Theoretically, it can be considered an arrangement of data from the highest to the lowest number of occurrences, as you said. But there are many underlying details you need to consider and understand to properly make sense of the output. The column shows which function has the most overhead (in terms of the number of events that occurred in that function). There is also a parent-child relationship between the functions and their overheads, based on which function calls which (see the sketch after this list). Please use the Perf Report link to understand more.

  2. As you already know, events are being sampled, not counted. So you cannot get the exact number of events, but you will get the number of samples, and based on the tuned sampling period/frequency you can also estimate the raw event count (everything should be available to you in the perf report output).
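A minimal sketch covering both points (./myapp is a placeholder; the flags are standard perf report options):

perf record -g ./myapp              # record with call graphs
perf report -n --children --stdio   # -n adds a sample-count column, --children attributes overhead to callers too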

Thread Utilization profiling on Linux

Oracle's Developer Studio Performance Analyzer might do exactly what you're looking for. (If you were running on Solaris, I know it would; I've never used it on Linux, though, and I don't have access right now to a suitable Linux system to try it on.)

This is a screenshot of a multithreaded IO test program, running on an x86 Solaris 11 system:

[screenshot: multithreaded IO performance test program]

Note that you can see the call stack of every thread and see exactly how the threads interact - in the posted example, you can see where the threads that actually perform the IO start, and you can follow each of the threads as they run.

This is a view that shows exactly where thread 2 is at the highlighted moment:

[screenshot: where thread 2 is at the highlighted moment]

This view has synchronization event view enabled, showing that thread 2 is stuck in a sem_wait call for the highlighted period. Note the additional rows of graphical data, showing the synchronization events (sem_wait(), pthread_cond_wait(), pthread_mutex_lock() etc):

[screenshot: synchronization event view]

Other views include a call tree:

[screenshot: call tree]

a thread overview (not very useful with only a handful of threads, but likely very useful if you have hundreds or more):

[screenshot: thread overview]

and a view showing function CPU utilization:

[screenshot: function CPU utilization]

And you can see how much time is spent on each line of code:

[screenshot: time spent per line of code]

Unsurprisingly, a process that's writing a large file to test IO performance spent almost all its time in the write() function.

The full Oracle brief is at https://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf

Quick usage overview:

  • collect performance data using the collect utility. See https://docs.oracle.com/cd/E77782_01/html/E77798/afadm.html#scrolltoc
  • Start the analyzer GUI to analyze the data collected above.
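A rough sketch of that flow, assuming the Developer Studio bin directory is on PATH and ./myapp is a placeholder (test.1.er is the default experiment name collect creates):

collect ./myapp        # record an experiment into test.1.er
analyzer test.1.er     # open the experiment in the Performance Analyzer GUI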

Optimizing double-long conversion with memory operand using perf?

To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question?

I think there are two main reasons for this instruction to be slow:

  1. There is a dependency chain, and the latency of this instruction is a problem since the processor is waiting on it to execute other instructions.
  2. There is a cache miss (saturating the memory with such instructions is improbable unless many cores are doing memory-bound operations).

First of all, tracking what is going on for a specific instruction is hard (especially if the instruction is not executed many times). You need to use precise events to track the root of the problem, that is, events for which the exact instruction address that caused the event is available. Only a (small) subset of all events are precise ones.

Regarding (1), the latency of the instruction should be about 12 cycles on both architectures (although it might be slightly higher on the AMD processor, I do not expect a 44% difference). The target processors are able to execute multiple instructions at the same time in a given cycle. Instructions are executed on different ports and are also pipelined. The port usage matters in understanding what is going on. This means all the instructions in the hot loop matter; you cannot isolate this specific instruction. Modern processors are insanely complex, so a basic analysis can be tricky. On Ice Lake processors, you can measure the average port usage with events like UOPS_DISPATCHED.PORT_XXX where XXX can be 0, 1, 2_3, 4_9, 5, 6, 7_8. Only the first three matter for this instruction. The EXE_ACTIVITY.XXX events may also be useful. You should check whether a port is saturated, and which one. AFAIK, none of these events are precise, so you can only analyse a block of code (typically the hot loop). On Zen 3, the relevant ports are FP23 and FP45. I do not know which events are useful on this architecture (I am not very familiar with it).
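A hedged sketch of how one might count those port events over the whole program (the lowercase spellings are how the Ice Lake events usually appear in perf list; exact names depend on the CPU and perf version, and ./myapp is a placeholder):

perf list 'uops_dispatched*'      # check which port events this machine exposes
perf stat -e uops_dispatched.port_0,uops_dispatched.port_1,uops_dispatched.port_2_3 ./myapp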

On Ice Lake, you can check the FRONTEND_RETIRED.LATENCY_GE_XXX events where XXX is a power-of-two integer (these should be precise ones, so you can see whether this instruction is the one driving the events). This helps you see whether the front-end or the back-end is the limiting factor.

Regarding (2), you can check the latency of the memory accesses as well as the number of L1/L2/L3 cache hits/misses. On Ice Lake, you can use events like MEM_LOAD_RETIRED.XXX where XXX can be, for example, L1_MISS, L1_HIT, L2_MISS, L2_HIT, L3_MISS and L3_HIT. Still on Ice Lake, it may be useful to track the latency of the memory operations with MEM_TRANS_RETIRED.LOAD_LATENCY_GT_XXX where XXX is again a power-of-two integer.
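For example, a sketch of a precise-sampling run (the :pp modifier requests precise samples; event availability again depends on the CPU, and ./myapp is a placeholder):

perf record -e mem_load_retired.l1_miss:pp -e mem_load_retired.l3_miss:pp ./myapp
perf report --stdio       # which functions take the misses
perf annotate --stdio     # drill down to the instructions that caused them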

You can also use LLVM-MCA to simulate the scheduling of the loop's instructions statically on the target architecture (it does not consider branches). This is very useful for understanding in depth what the scheduler can do, and it is pretty easy to use.
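A small sketch of that, assuming the hot loop lives in a file called hot_loop.c (a placeholder) and clang/llvm-mca are installed:

clang -O2 -S -o hot_loop.s hot_loop.c                # get the assembly of the hot loop
llvm-mca -mcpu=icelake-server -timeline hot_loop.s   # simulate scheduling and port pressure on Ice Lake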

What could explain this discrepancy?

The latency and reciprocal throughput should be about the same on the two platforms, or at least close. That being said, for the same core count, the two certainly do not operate at the same frequency. If the difference is not coming from that, then I doubt this instruction alone is actually the problem (tricky scheduling issues, wrong/inaccurate profiling results, etc.).

CPU counter results show 1.5% cache misses per instruction

The thing is, the cache-misses event is certainly not very informative here. It refers to last-level cache (L3) misses, so it does not give any information about L1/L2 misses (the events above do).
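A sketch with the generic cache events, which separate the levels a bit better (./myapp is a placeholder; not every machine exposes all of these):

perf stat -e cache-misses,LLC-load-misses,L1-dcache-loads,L1-dcache-load-misses ./myapp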

how should I proceed to optimize this?

If the code is latency bound, the solution is to first break any dependency chains in this loop. Unrolling the loop and rewriting it to make it more SIMD-friendly can help a lot (the reciprocal throughput of this instruction is about 1 cycle, as opposed to 12 cycles of latency, so there is room for improvement here).
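As a first, cheap experiment, one can check whether the compiler already unrolls and vectorizes the loop when allowed to (the flags below are generic GCC options, hot_loop.c is a placeholder):

gcc -O3 -march=native -funroll-loops -fopt-info-vec -c hot_loop.c   # -fopt-info-vec reports which loops were vectorized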

If the code is memory bound, then you should care about data locality. Data should fit in the L1 cache if possible. There are many techniques for that, but it is hard to guide you without more context. They include, for example, sorting data, reordering loop iterations, and using smaller data types.

There are many possible sources of weird, unexpected behaviour. If such a thing happens, it is nearly impossible to understand what is going on without the exact code executed. All details matter in this case.

Measure the time to reach the main function using perf?

First, you have to consider that perf doesn't really measure time - it records events. Now you can do some profiling and look at call stacks and derive some information about initialization, but in order to measure a specific time, we need to record the beginning and end timestamp.

In case of the time to reach the main function, we can use

1) A dynamic tracepoint on main:

$ sudo perf probe -x ./gctor main
Added new event:
  probe_gctor:main  (on main in ./gctor)

You can now use it in all perf tools, such as:

perf record -e probe_gctor:main -aR sleep 1

This does require pretty high privileges, I'll just use root in the example.

2) A sensible point for the "start" of your binary.

I suggest the tracepoint syscalls:sys_exit_execve. This is basically right after perf record started to execute your binary. This works in my version (5.3.7) - if it doesn't for you, you may need to tinker around. You could of course just use -e cycles, but then you get spammed later on with events you don't want.

Putting it together:

sudo perf record -e probe_gctor:main -e syscalls:sys_exit_execve ./gctor
^ this is what perf probe told you earlier

And then look at it with perf script --header

# time of first sample : 77582.919313
# time of last sample : 77585.150377
# sample duration : 2231.064 ms
[....]
# ========
#
gctor 238828 [007] 77582.919313: syscalls:sys_exit_execve: 0x0
gctor 238828 [001] 77585.150377: probe_gctor:main: (5600ea33414d)

You can either compute it from these two samples, or use the sample duration if there are really only the two samples in your trace.
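A small sketch for computing the delta automatically (it assumes perf script's default output format, where the timestamp is the fourth field, followed by a colon):

perf script | awk '/sys_exit_execve/ { sub(":", "", $4); start = $4 }
                   /probe_gctor:main/ { sub(":", "", $4); end = $4 }
                   END { printf "time to reach main: %.3f ms\n", (end - start) * 1000 }'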

For completeness: Here's a way to do it with gdb:

gdb ./gctor -ex 'b main' -ex 'python import time' -ex 'python ts=time.time()' -ex 'run' -ex 'python print(time.time()-ts)'

This is much less accurate, has about 100 ms overhead on my system, but it doesn't require higher privileges. You could of course improve on this by just building your own runner with fork/ptrace/exec in C.

Measuring daemon CPU utilization over a portion of its wall clock run time

The /proc/<pid>/stat file contains the necessary information - you're after the utime and stime fields. Those are cumulative counters of the process's user-mode and kernel-mode CPU time used; read them at the start of the measuring interval, then read them again at the end and calculate the difference.

That will give you used CPU time in jiffies. To determine the total elapsed wallclock time in jiffies (so you can convert to an average utilisation), sum the numbers on the cpu0 line in /proc/stat (before and after, just like /proc/<pid>/stat).
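A minimal shell sketch of that measurement (PID 1234 and the 60-second window are placeholders; it assumes the daemon's executable name contains no spaces, since spaces would shift the field numbers in /proc/<pid>/stat):

pid=1234                                     # placeholder: PID of the daemon
cpu_jiffies()   { awk '{ print $14 + $15 }' /proc/$pid/stat; }    # utime + stime (fields 14 and 15)
total_jiffies() { awk '/^cpu0 / { t = 0; for (i = 2; i <= NF; i++) t += $i; print t }' /proc/stat; }
c1=$(cpu_jiffies); t1=$(total_jiffies)
sleep 60                                     # the measurement interval
c2=$(cpu_jiffies); t2=$(total_jiffies)
echo "average CPU utilisation: $(( 100 * (c2 - c1) / (t2 - t1) ))%"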

This is the layout of the first few fields in /proc/<pid>/stat, from Documentation/filesystems/proc.txt in the Linux source:

Table 1-3: Contents of the stat files (as of 2.6.22-rc3)
..............................................................................
Field     Content
pid       process id
tcomm     filename of the executable
state     state (R is running, S is sleeping, D is sleeping in an
          uninterruptible wait, Z is zombie, T is traced or stopped)
ppid      process id of the parent process
pgrp      pgrp of the process
sid       session id
tty_nr    tty the process uses
tty_pgrp  pgrp of the tty
flags     task flags
min_flt   number of minor faults
cmin_flt  number of minor faults with child's
maj_flt   number of major faults
cmaj_flt  number of major faults with child's
utime     user mode jiffies
stime     kernel mode jiffies
cutime    user mode jiffies with child's
cstime    kernel mode jiffies with child's

