How to Get the Python Call Stack with Linux perf

How to collect some readable stack traces with perf?

perf offers call stack recording with three different techniques:

  • By default it uses the frame pointer (fp). This is generally supported and performs well, but it doesn't work with certain optimizations. Compile your applications with -fno-omit-frame-pointer etc. to make sure it works well.
  • dwarf uses a dump of the stack for each sample for post-processing. That has a significant performance penalty.
  • Modern systems can use the hardware-supported last branch record (lbr).

The stack is accessible in perf analysis tools such as perf report or perf script.

For more details check out man perf-record.
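As a small sketch of how the three modes are selected on the command line, the helper below only builds the argument list for perf record and does not run anything. The helper name perf_record_argv and the target ./6_stl.bin are placeholders for illustration; --call-graph and -F are the perf record options already used in this thread.

```python
import shlex

# The three call-graph techniques described above.
CALL_GRAPH_MODES = ("fp", "dwarf", "lbr")


def perf_record_argv(mode, command, frequency=None):
    """Build (but do not run) a perf record command line for one mode.

    perf_record_argv is a hypothetical helper name used for this sketch.
    """
    if mode not in CALL_GRAPH_MODES:
        raise ValueError(f"unknown call-graph mode: {mode!r}")
    argv = ["perf", "record", "--call-graph", mode]
    if frequency is not None:
        argv += ["-F", str(frequency)]  # sampling frequency in Hz
    argv += ["--"] + list(command)
    return argv


# Print one command line per mode so they can be compared or pasted.
for mode in CALL_GRAPH_MODES:
    print(shlex.join(perf_record_argv(mode, ["./6_stl.bin"], frequency=1000)))
```

Recording the same workload once per mode and comparing the perf report output is a quick way to see which technique gives usable stacks on a given machine.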

How can you get frame-pointer perf call stacks/flamegraphs involving the C++ standard library?

With your code on Ubuntu 20.04 x86_64, perf record --call-graph fp (with and without -e cycles:u) gives me a similar flamegraph, as viewed with https://speedscope.app (prepare the data with perf script > out.txt and select out.txt in the web app).

Is it possible to get correct fp call stacks with libstdc++ without compiling it myself (which seems like a lot of work)?

No. The 'fp' call-graph method is implemented in the Linux kernel in a very simple way: https://elixir.bootlin.com/linux/v5.4/C/ident/perf_callchain_user - https://elixir.bootlin.com/linux/v5.4/source/arch/x86/events/core.c#L2464

perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
{
    ...
    fp = (unsigned long __user *)regs->bp;
    perf_callchain_store(entry, regs->ip);
    ...
    // where max_stack is probably around 127 = PERF_MAX_STACK_DEPTH https://elixir.bootlin.com/linux/v5.4/source/include/uapi/linux/perf_event.h#L1021
    while (entry->nr < entry->max_stack) {
        ...
        if (!valid_user_frame(fp, sizeof(frame)))
            break;
        bytes = __copy_from_user_nmi(&frame.next_frame, fp, sizeof(*fp));
        bytes = __copy_from_user_nmi(&frame.return_address, fp + 1, sizeof(*fp));

        perf_callchain_store(entry, frame.return_address);
        fp = (void __user *)frame.next_frame;
    }
}

It can't find correct frames for -fomit-frame-pointer compiled code.
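To make the failure mode concrete, here is a toy Python model of the same loop. It is only a sketch: memory is a dict from address to the 8-byte word stored there, and all addresses and the function names memcmp, helper, main and _start are made up for illustration. When the sampled function has not set up its own frame, the frame pointer register still points at its caller's frame, so the walk reads the caller's return address and the caller itself disappears from the chain.

```python
MAX_STACK = 127  # stands in for PERF_MAX_STACK_DEPTH


def walk_frame_pointers(memory, ip, fp, max_stack=MAX_STACK):
    """Return [ip, return addresses...] by following saved frame pointers."""
    chain = [ip]
    while len(chain) < max_stack:
        # Toy analogue of valid_user_frame(): both words must be readable.
        if fp not in memory or fp + 8 not in memory:
            break
        next_fp = memory[fp]             # saved caller frame pointer at [fp]
        return_address = memory[fp + 8]  # return address at [fp + 8]
        chain.append(return_address)
        fp = next_fp
    return chain


# Hypothetical code addresses for the call chain _start -> main -> helper -> memcmp.
SYMBOLS = {0x401000: "memcmp", 0x402000: "helper", 0x403000: "main", 0x404000: "_start"}

# Hypothetical stack contents: each frame holds (saved fp, return address).
memory = {
    0x7000: 0x0,    0x7008: 0x404000,  # main's frame, returns into _start
    0x6000: 0x7000, 0x6008: 0x403000,  # helper's frame, returns into main
    0x5000: 0x6000, 0x5008: 0x402000,  # memcmp's frame, returns into helper
}


def names(chain):
    return [SYMBOLS[address] for address in chain]


# memcmp keeps a frame pointer: the register points at memcmp's own frame.
with_fp = names(walk_frame_pointers(memory, ip=0x401000, fp=0x5000))

# memcmp built without a frame pointer: the register still holds helper's frame.
without_fp = names(walk_frame_pointers(memory, ip=0x401000, fp=0x6000))

print(with_fp)
print(without_fp)
```

In the second walk helper is missing, which is the same kind of gap as the main -> __memcmp_avx2_movbe stack discussed next.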

For the incorrect main -> __memcmp_avx2_movbe call stacks, the perf.data file contains only the call chain generated by the kernel: there is no copy of a user stack fragment and no register data:

setarch x86_64 -R env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/debug:${LD_LIBRARY_PATH} perf record -F 1000 --call-graph fp  -- ./6_stl.bin
perf script -D | less

869122666352078 0xae0 [0x58]: PERF_RECORD_SAMPLE(IP, 0x4002): 12267/12267: 0x7ffff7d51670 period: 2332683 addr: 0
... FP chain: nr:5
..... 0: fffffffffffffe00
..... 1: 00007ffff7d51670
..... 2: 0000555555556452
..... 3: 00007ffff7be90fb
..... 4: 00005555555564de
... thread: 6_stl.bin:12267
...... dso: /usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so
6_stl.bin 12267 869122.666352: 2332683 cycles:
7ffff7d51670 __memcmp_avx2_movbe+0x140 (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
555555556452 main+0x12 (/home/user/so/68259699/6_stl.bin)
7ffff7be90fb __libc_start_main+0x10b (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
5555555564de _start+0x2e (/home/user/so/68259699/6_stl.bin)

So, with this method the user-space perf tool has no additional information it could use to fix the call stack. With the dwarf method, every sample event carries the registers and a partial dump of the user stack.

Gdb has full access to the live process and can use any information: all registers, any amount of the user process stack, and additional debug info for the program and its libraries. An advanced, slow backtrace in gdb is not limited by time, security, or an uninterruptible context. The Linux kernel must record a perf sample quickly; it can't access swapped-out data, debug sections, or debug info files, and it should not do complex parsing (which could have bugs).

A debug version of libstdc++ may help (sudo apt install libstdc++6-9-dbg), but it is slow. And it did not help me recover the lost backtrace through the asm-implemented __memcmp_avx2_movbe (libc: sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S).

If you want full backtraces, I think you need to find a way to recompile the world (or at least all libraries used by your target application). That will probably be easier on something like Gentoo, Arch or Alpine than on Ubuntu.

If you are interested only in performance, why do you want the flamegraph? A flat profile will catch most of the performance data, and a non-ideal flamegraph can be useful too.
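For reference, the symbolized lines at the end of the perf script dump above can be collapsed into the one-line-per-stack "folded" text that flamegraph tools usually take as input. The function below is a rough sketch of that idea, not a replacement for the existing stack-collapse scripts: the name fold_perf_script is made up here, and the regular expressions assume frame lines of the form "address symbol+offset (dso)" as in the dump, with the first token of the header line taken as the command name.

```python
import re
from collections import Counter

# A frame line looks like: "7ffff7d51670 __memcmp_avx2_movbe+0x140 (/path/libc.so)"
FRAME_RE = re.compile(r"^\s*([0-9a-f]+)\s+(.+?)\s+\(([^()]*)\)\s*$")
OFFSET_RE = re.compile(r"\+0x[0-9a-f]+$")


def fold_perf_script(text):
    """Collapse perf script text into {"comm;root;...;leaf": sample_count}."""
    counts = Counter()
    comm, frames = None, []

    def flush():
        nonlocal comm, frames
        if comm is not None and frames:
            # perf prints the leaf first; folded stacks are written root first.
            counts[";".join([comm] + frames[::-1])] += 1
        comm, frames = None, []

    for line in text.splitlines():
        if not line.strip():
            flush()  # a blank line ends the current sample
            continue
        match = FRAME_RE.match(line)
        if match and comm is not None:
            frames.append(OFFSET_RE.sub("", match.group(2)))
        else:
            flush()  # anything else is treated as a new sample header
            comm = line.split()[0]
    flush()
    return dict(counts)


sample = """\
6_stl.bin 12267 869122.666352: 2332683 cycles:
    7ffff7d51670 __memcmp_avx2_movbe+0x140 (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
    555555556452 main+0x12 (/home/user/so/68259699/6_stl.bin)
    7ffff7be90fb __libc_start_main+0x10b (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
    5555555564de _start+0x2e (/home/user/so/68259699/6_stl.bin)
"""

for stack, count in fold_perf_script(sample).items():
    print(stack, count)
```

Even with frames missing between main and __memcmp_avx2_movbe, the folded line still attributes the sample to the right leaf function, which is why a non-ideal flamegraph remains usable.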

How does linux's perf utility understand stack traces?

There is a short introduction to stack traces in perf by Brendan Gregg:
http://www.brendangregg.com/perf.html

4.4 Stack Traces

Always compile with frame pointers. Omitting frame pointers is an evil compiler optimization that breaks debuggers, and sadly, is often the default. Without them, you may see incomplete stacks from perf_events ... There are two ways to fix this: either using dwarf data to unwind the stack, or returning the frame pointers.

Dwarf

Since about the 3.9 kernel, perf_events has supported a workaround for missing frame pointers in user-level stacks: libunwind, which uses dwarf. This can be enabled using "-g dwarf".
... compiler optimizations (-O2), which in this case has omitted the frame pointer. ... recompiling .. with -fno-omit-frame-pointer:

Languages that are not C-style may use a different frame format, or may omit frame pointers too:

4.3. JIT Symbols (Java, Node.js)

Programs that have virtual machines (VMs), like Java's JVM and node's v8, execute their own virtual processor, which has its own way of executing functions and managing stacks. If you profile these using perf_events, you'll see symbols for the VM engine .. perf_events has JIT support to solve this, which requires the VM to maintain a /tmp/perf-PID.map file for symbol translation.

Note that Java may not show full stacks to begin with, due to hotspot on x86 omitting the frame pointer (just like gcc). On newer versions (JDK 8u60+), you can use the -XX:+PreserveFramePointer option to fix this behavior, ...
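The /tmp/perf-PID.map file mentioned in the quote is a plain text file with one symbol per line: start address and size in hexadecimal (without a 0x prefix), followed by the symbol name. The sketch below writes and resolves such entries in Python to show the idea. The helper names and the two JIT symbols are made up for illustration, and real VMs emit this file themselves.

```python
import bisect


def format_map_line(start, size, name):
    """One map entry: hex start address, hex size, symbol name."""
    return f"{start:x} {size:x} {name}"


def parse_perf_map(text):
    """Parse map text into a sorted list of (start, size, name) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)  # the symbol name may contain spaces
        if len(parts) == 3:
            entries.append((int(parts[0], 16), int(parts[1], 16), parts[2]))
    return sorted(entries)


def resolve(entries, address):
    """Return the symbol whose [start, start + size) range covers address."""
    starts = [start for start, _, _ in entries]
    index = bisect.bisect_right(starts, address) - 1
    if index >= 0:
        start, size, name = entries[index]
        if address < start + size:
            return name
    return None


# Hypothetical JIT-compiled code regions.
map_text = "\n".join([
    format_map_line(0x7F0000001000, 0x40, "jit_compiled_loop"),
    format_map_line(0x7F0000002000, 0x100, "Interpreter::run"),
])
entries = parse_perf_map(map_text)

print(map_text)
print(resolve(entries, 0x7F0000001010))
```

This only covers symbol names. It does not fix the stack walk itself, which is what the frame pointer option in the quote addresses.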

Gregg's blog post about Java and stack traces:
http://techblog.netflix.com/2015/07/java-in-flames.html ("Fixing Frame Pointers" - fixed in some JDK8 versions and in JDK9 by adding option on program start)

Now, your questions:

How does linux's perf utility understand stack traces?

The perf utility basically (in early versions) just parses data returned by the Linux kernel subsystem "perf_events" (or sometimes "events"), accessed through the perf_event_open syscall. For call stack traces there are the sample_type options PERF_SAMPLE_CALLCHAIN / PERF_SAMPLE_STACK_USER:

sample_type
    PERF_SAMPLE_CALLCHAIN
        Records the callchain (stack backtrace).

    PERF_SAMPLE_STACK_USER (since Linux 3.7)
        Records the user level stack, allowing stack unwinding.
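sample_type is a bitmask, so the two approaches differ only in which bits are requested. The sketch below mirrors the first PERF_SAMPLE_* bit positions from include/uapi/linux/perf_event.h as a Python IntFlag and decodes two example masks. The two masks are illustrative assumptions, not the exact sets that perf record requests for fp and dwarf mode.

```python
from enum import IntFlag


class PerfSampleType(IntFlag):
    """First PERF_SAMPLE_* bits, in the order of the uapi header."""
    IP = 1 << 0
    TID = 1 << 1
    TIME = 1 << 2
    ADDR = 1 << 3
    READ = 1 << 4
    CALLCHAIN = 1 << 5
    ID = 1 << 6
    CPU = 1 << 7
    PERIOD = 1 << 8
    STREAM_ID = 1 << 9
    RAW = 1 << 10
    BRANCH_STACK = 1 << 11
    REGS_USER = 1 << 12
    STACK_USER = 1 << 13


def describe(sample_type):
    """List the names of the known bits set in a sample_type mask."""
    return [flag.name for flag in PerfSampleType if flag & sample_type]


# Illustrative masks (assumptions for this sketch, not what perf sets exactly):
# the kernel walks the chain itself and stores only addresses ...
kernel_walk = PerfSampleType.IP | PerfSampleType.TID | PerfSampleType.CALLCHAIN
# ... or it copies registers and a stack fragment for user-space unwinding.
user_unwind = (PerfSampleType.IP | PerfSampleType.TID
               | PerfSampleType.REGS_USER | PerfSampleType.STACK_USER)

print(hex(kernel_walk), describe(kernel_walk))
print(hex(user_unwind), describe(user_unwind))
```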

Does the Linux kernel natively understand stack traces?

It may or may not, depending on your CPU architecture and on whether support is implemented for it. The function that samples the callchain (reads the call stack of a live process) is defined in the architecture-independent part of the kernel as __weak with an empty body:

http://lxr.free-electrons.com/source/kernel/events/callchain.c?v=4.4#L26

__weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
                                  struct pt_regs *regs)
{
}

__weak void perf_callchain_user(struct perf_callchain_entry *entry,
                                struct pt_regs *regs)
{
}

In the 4.4 kernel the user-space callchain sampler is redefined in the architecture-dependent part of the kernel for x86/x86_64, ARC, SPARC, ARM/ARM64, Xtensa, Tilera TILE, PowerPC, Imagination Meta:

http://lxr.free-electrons.com/ident?v=4.4;i=perf_callchain_user

arch/x86/kernel/cpu/perf_event.c, line 2279
arch/arc/kernel/perf_event.c, line 72
arch/sparc/kernel/perf_event.c, line 1829
arch/arm/kernel/perf_callchain.c, line 62
arch/xtensa/kernel/perf_event.c, line 339
arch/tile/kernel/perf_event.c, line 995
arch/arm64/kernel/perf_callchain.c, line 109
arch/powerpc/perf/callchain.c, line 490
arch/metag/kernel/perf_callchain.c, line 59

Reading the call chain from the user stack may not be trivial for some architectures and/or for some modes.

What CPU architecture do you use? Which languages and VMs are involved?

Where can I read more about how a tool is able to introspect into stack traces of processes, even if processes are written in completely different languages?

You may try gdb and/or the debuggers for the language, the backtrace function of libc, or the read-only unwinding support in libunwind (there is a local backtrace example in libunwind, show_backtrace()).

They may have better support for frame parsing, or better integration with the language's virtual machine or with unwind info. If gdb (with the backtrace command) or other debuggers can't get stack traces from the running program, there may be no way of getting a stack trace at all.

If they can get a call trace but perf can't (even after recompiling with -fno-omit-frame-pointer for C/C++), it may be possible to add support for that combination of architecture + frame format to perf_events and perf.

There are several blogs with some info about generic backtracing problems and solutions:

  • http://eli.thegreenplace.net/2015/programmatic-access-to-the-call-stack-in-c/ - local backtrace with libunwind
  • http://codingrelic.geekhold.com/2009/05/pre-mortem-backtracing.html - gcc's __builtin_return_address(N) vs glibc's backtrace() vs libunwind's local backtrace
  • http://lucumr.pocoo.org/2014/10/30/dont-panic/ - backtrace and unwinding in Rust
  • https://github.com/gperftools/gperftools/wiki/gperftools'-stacktrace-capturing-methods-and-their-issues - the same backtracing problem in the gperftools software-timer based profiler library

Dwarf support for perf_events/perf:

  • https://lwn.net/Articles/499116/ [RFCv4 00/16] perf: Add backtrace post dwarf unwind, May 2012
  • https://lwn.net/Articles/507753/ [PATCHv7 00/17] perf: Add backtrace post dwarf unwind, Jul 2012
  • https://wiki.linaro.org/LEG/Engineering/TOOLS/perf-callstack-unwinding - Dwarf unwinding on ARM 7/8 for perf
  • https://wiki.linaro.org/KenWerner/Sandbox/libunwind#libunwind_ARM_unwind_methods - non-dwarf methods too

