Source Line Numbers in Perf Call Graph

Source line numbers in perf call graph?

I accidentally found it loosely documented in perf script, but it applies to other commands as well: -F option accepts srcline. So you can do -F+srcline to add the line number to existing columns.

Example: perf report -g fractal -F+period,srcline

Samples: 22K of event 'cycles:u', Event count (approx.): 13031011295
Children Self Period Source:Line Command Shared Object Symbol
+ 99.98% 38.76% 5051224000 test.cpp:7 a a [.] fib
+ 96.42% 0.00% 0 _start+94372992700461 a a [.] _start
+ 96.42% 0.00% 0 __libc_start_main+140304673091826 a libc-2.29.so [.] __libc_start_main
+ 96.42% 0.00% 0 test.cpp:13 a a [.] main
+ 21.47% 21.47% 2797741850 test.cpp:8 a a [.] fib
+ 16.69% 16.69% 2174469736 test.cpp:4 a a [.] fib
+ 16.37% 16.36% 2132462705 test.cpp:6 a a [.] fib
+ 6.69% 6.69% 871128215 test.cpp:5 a a [.] fib

Perf Imprecise Call-Graph Report

  1. The section of the manual you quoted talks about BTS, not LBR: they are not the same thing. Later in that same thread you quoted Andi Kleen seems to indicate that the LBR snap time is actually the moment of the PMI (the interrupt that runs the handler) and not the PEBS moment. So I think all three stack approaches have the same problem.

  2. DWARF stack captures definitely do not correspond exactly to the PEBS entry. The PEBS event is recorded by the hardware at runtime, and then only some time later is the CPU interrupted, at which point the stack is unwound. If the PEBS buffer is configured to hold only a single entry, these two things should at least be close and if you are lucky, the PEBS IP will be in the same function that is still at the top of the stack when the handler runs. In that case, the stack is basically correct. Since perf shows you the actual PEBS IP at the top, plus the frames below that from the capture, this ends up working in that case.

  3. If you aren't lucky, the function will have changed between the PEBS capture and the handler running. In this case you get a franken-stack that doesn't make sense: the top function may not be callable from the second-from-the-top function (or something). It is not totally corrupted: it's just that everything except the top frame comes from a point after the PEBS stack was captured, and the top frame comes from PEBS, or something like that. This applies also to --call-graph fp, for the same reasons.

Most likely you never saw an invalid IP because perf shows the IP from the PEBS sample (that's the theme of that whole thread). I think if you look into the raw sample, you can see both the PEBS IP, and the handler IP, and you can see they usually won't match.

Overall, you can trust the backtraces for "time" or "cycle" profiling since they are in some sense an accurate sampling representation of execution time: it's just that they don't correspond to the PEBS moment but some time later (but why is that later time any worse than the PEBS time). Basically for this type of profiling you don't really need PEBS at all.

If you are using a different type of event, and you want fine-grained accounting of where the event took place, that's what PEBS is for. You often don't need a stack trace: just the top frame is enough. If you want stack traces, use them, but know they come from a moment in time a bit later, or use --lbr (if that works).

linux perf: how to interpret and find hotspots

You should give hotspot a try:
https://www.kdab.com/hotspot-gui-linux-perf-profiler/

It's available on github: https://github.com/KDAB/hotspot

It is for example able to generate flamegraphs for you.

flamegraph

Weird Backtrace in Perf

TL;DR perf backtracing process may stop at some function if there is no frame pointer saved in the stack or no CFI tables for dwarf method. Recompile libraries with -fno-omit-frame-pointer or with -g or get debuginfo. With release binaries and libs perf often will stop backtrace early without chance to reach main() or _start or clone()/start_thread() top functions.

perf profiling tool in Linux is statistical sampling profiler (without binary instrumentation): it programs software timer or event source or hardware performance monitoring unit (PMU) to generate periodic interrupt. In your example
-c 10000 -e mem_load_uops_retired.l3_miss:uppp is used to select hardware PMU in x86_64 in some kind of PEBS mode (https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR) to generate interrupt after 10000 of mem_load_uops_retired (with l3_miss mask). Generated interrupt is handled by Linux Kernel (perf_events subsystem, kernel/events and arch/x86/events). In this handler PMU is reset (reprogrammed) to generate next interrupt after 10000 more events and sample is generated. Sample data dump is saved into perf.data file by perf report command, but every wake of tool can save thousands of samples; samples can be read by perf script or perf script -D.

perf_events interrupt handler, something near __perf_event_overflow of kernel/events/core.c, has full access to the registers of current function, and has some time to do additional data retrieval to record current time, pid, etc. Part of such process is https://en.wikipedia.org/wiki/Call_stack data collection. But with x86_64 and -fomit-frame-pointer (often enabled for many system libraries of Debian/Ubuntu/others) there is no default place in registers or in function stack to store frame pointers:

https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/Optimize-Options.html#index-fomit_002dframe_002dpointer-692

-fomit-frame-pointer
Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and
restore frame pointers; it also makes an extra register available in
many functions. It also makes debugging impossible on some machines.

Starting with GCC version 4.6, the default setting (when not optimizing for size) for 32-bit Linux x86 and 32-bit Darwin x86
targets has been changed to -fomit-frame-pointer. The default can be
reverted to -fno-omit-frame-pointer by configuring GCC with the
--enable-frame-pointer configure option.

With frame pointers saved in the function stack backtracing/unwinding is easy. But for some functions modern gcc (and other compilers) may not generate frame pointer. So backtracing code like in perf_events handler either will stop backtrace at such function or needs another method of frame pointer recovery. Option -g method (--call-graph) of perf record selects the method to be used. It is documented in man perf-record http://man7.org/linux/man-pages/man1/perf-record.1.html:

--call-graph Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Default is "fp".

Allows specifying "fp" (frame pointer) or "dwarf" (DWARF's CFI -
Call Frame Information) or "lbr" (Hardware Last Branch Record
facility) as the method to collect the information used to show the
call graphs.

In some systems, where binaries are build with gcc

--fomit-frame-pointer, using the "fp" method will produce bogus call graphs, using "dwarf", if available (perf tools linked to the
libunwind or libdw library) should be used instead. Using the "lbr"
method doesn't require any compiler options. It will produce call
graphs from the hardware LBR registers. The main limitation is that
it is only available on new Intel platforms, such as Haswell. It
can only get user call chain. It doesn't work with branch stack
sampling at the same time.

When "dwarf" recording is used, perf also records (user) stack dump
when sampled. Default size of the stack dump is 8192 (bytes). User
can change the size by passing the size after comma like

"--call-graph dwarf,4096".

So, dwarf method reuses CFI tables to find stack frame sizes and find caller's stack frame. I'm not sure are CFI tables stripped from release libraries by default or not; but debuginfo probably will have them. LBR will not help because it is rather short hardware buffer. Dwarf split processing (kernel handler saves part of stack and perf user-space tool will parse it with libdw+libunwind) may lose some parts of call stack, so try also to increase dwarf stack dumps by using --call-graph dwarf,10240 or --call-graph dwarf,81920 etc.

Backtracing is implemented in arch-dependent part of perf_events: arch/x86/events/core.c:perf_callchain_user(); called from kernel/events/callchain.c:get_perf_callchain() <- perf_callchain <- perf_prepare_sample <-
__perf_event_output <- *(event->overflow_handler) <- READ_ONCE(event->overflow_handler)(event, data, regs); of __perf_event_overflow.

Gregg did warn about incomplete call stacks of perf: http://www.brendangregg.com/blog/2014-06-22/perf-cpu-sample.html

Incomplete stacks usually mean -fomit-frame-pointer was used – a compiler optimization that makes little positive difference in the real world, but breaks stack profilers. Always compile with -fno-omit-frame-pointer. More recent perf has a -g dwarf option, to use the alternate libunwind/dwarf method for retrieving stacks.

I also did write about backtraces in perf with some additional links: How does linux's perf utility understand stack traces?



Related Topics



Leave a reply



Submit