How to Get Complete Stack Dump from Profiler in Every Sample for Use in Flame Graph

How to get complete stack dump from profiler in every sample for use in flame graph?

Try Linux perf_events (aka the "perf" command), which is part of the mainline Linux kernel and is usually installed via the linux-tools-common (or similar) package. I often use it to create flame graphs on Linux.

I wrote up some instructions for creating flame graphs with perf at: http://www.brendangregg.com/perf.html#FlameGraphs
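The step between perf script and the flame graph renderer is "folding": each sample's stack becomes one line of the form root;...;leaf followed by a count. Below is a minimal Python sketch of that folding step, not the stackcollapse-perf.pl script itself. The function names (parse_frame, fold_perf_script) are hypothetical, the sample text is a single perf script record of the kind perf prints for a user-space sample, and the sketch assumes samples are separated by blank lines and that frame lines are indented. It also drops the process name, which the real script can prepend.

```python
from collections import Counter

# One sample as printed by `perf script` (header line, then indented frames, leaf first).
SAMPLE = "\n".join([
    "6_stl.bin 12267 869122.666352:    2332683 cycles: ",
    "            7ffff7d51670 __memcmp_avx2_movbe+0x140 (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)",
    "            555555556452 main+0x12 (/home/user/so/68259699/6_stl.bin)",
    "            7ffff7be90fb __libc_start_main+0x10b (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)",
    "            5555555564de _start+0x2e (/home/user/so/68259699/6_stl.bin)",
])


def parse_frame(line):
    """Return the symbol name from one indented frame line (hypothetical helper)."""
    _addr, _, rest = line.strip().partition(" ")
    dso_start = rest.rfind(" (")            # trailing " (/path/to/dso)"
    if dso_start != -1:
        rest = rest[:dso_start]
    head, sep, _ = rest.rpartition("+0x")   # trailing "+0x<offset>"
    return head if sep else rest


def fold_perf_script(text):
    """Collapse perf script text into a Counter of 'root;...;leaf' -> sample count."""
    folded = Counter()
    frames = []  # frames of the sample being read, leaf first

    def flush():
        if frames:
            folded[";".join(reversed(frames))] += 1
            frames.clear()

    for line in text.splitlines():
        if not line.strip():
            flush()                          # blank line ends a sample
        elif line[0].isspace():
            frames.append(parse_frame(line))
        else:
            flush()                          # unindented header starts a new sample
    flush()
    return folded


if __name__ == "__main__":
    for stack, count in fold_perf_script(SAMPLE).items():
        print(stack, count)
```

Each printed line is a folded stack plus its count, which is the input format the flame graph renderer consumes.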

How to get Java profiling dump for creating flame graphs on the mac?

I created 2 little shell scripts based on @cello's answer. They generate hot/cold flame graphs.

Get them from this Gist.

Usage:

ps ax | grep java # find the PID of your process
./profile.sh 20402 stacks.txt
./gen.sh stacks.txt

Alternatively, to measure an application from startup (in this case, my gradle build, which also needed to be run in another directory and with some input stream), I used:

cd ../my-project; ./gradlew --no-daemon clean build < /dev/zero & cd -; ./profile.sh $! stacks.txt
./gen.sh stacks.txt

Results:

flame graphs

In this example, I can clearly see that my application is I/O bound (notice the blue bars on top).

Construct flamegraph with start and end timestamps

a) No, because flamegraphs need call stacks, and b) flamegraphs are pretty but useless for finding speedups. Speed problems easily hide in them, and they usually ignore I/O.

Generate Flame Graph from JFR dump with IntelliJ IDEA

Sure, you can open any existing JFR report (Main menu/Run/Open Profiler Snapshot), and if it contains sampling data, a flame graph will also be available.

How can you get frame-pointer perf call stacks/flamegraphs involving the C++ standard library?

With your code on Ubuntu 20.04 x86_64, perf record --call-graph fp, with and without -e cycles:u, gives me a similar flamegraph when viewed with https://speedscope.app (prepare the data with perf script > out.txt and select out.txt in the web app).

Is it possible to get correct fp call stacks with libstdc++ without compiling it myself (which seems like a lot of work)?

No, the 'fp' call-graph method is implemented in the Linux kernel in a very simple way: https://elixir.bootlin.com/linux/v5.4/C/ident/perf_callchain_user - https://elixir.bootlin.com/linux/v5.4/source/arch/x86/events/core.c#L2464

perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
{ 
    ...
    fp = (unsigned long __user *)regs->bp;
    perf_callchain_store(entry, regs->ip);
    ...
    // max_stack is probably around 127 (PERF_MAX_STACK_DEPTH): https://elixir.bootlin.com/linux/v5.4/source/include/uapi/linux/perf_event.h#L1021
    while (entry->nr < entry->max_stack) {
        ...
        if (!valid_user_frame(fp, sizeof(frame)))
            break;
        bytes = __copy_from_user_nmi(&frame.next_frame, fp, sizeof(*fp));
        bytes = __copy_from_user_nmi(&frame.return_address, fp + 1, sizeof(*fp));

        perf_callchain_store(entry, frame.return_address);
        fp = (void __user *)frame.next_frame;
    }
}

This walk simply follows the chain of saved frame pointers, so it can't find correct frames in code compiled with -fomit-frame-pointer.
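To illustrate why a frame goes missing, here is a small Python simulation of the same loop. It is only a sketch under assumptions: the stack is modelled as a dict from address to value, the addresses and the helper function are made up, and return addresses are written as strings for readability. The assumed frame layout is the usual x86-64 one with frame pointers (saved caller rbp at [rbp], return address at [rbp+8]). When the leaf function never pushes rbp, the register still points at its caller's frame, so the walk jumps from the leaf straight to the caller's caller.

```python
WORD = 8  # bytes per stack slot on x86-64


def walk_fp(memory, ip, bp, max_stack=127):
    """Follow saved frame pointers the way the kernel loop does (simplified)."""
    chain = [ip]
    fp = bp
    while len(chain) < max_stack:
        # Stand-in for valid_user_frame(): both slots must be readable.
        if fp not in memory or fp + WORD not in memory:
            break
        next_frame = memory[fp]
        return_address = memory[fp + WORD]
        chain.append(return_address)
        fp = next_frame
    return chain


def build_stack(leaf_has_frame):
    """Model _start -> main -> helper -> leaf. All addresses are hypothetical."""
    main_fp, helper_fp, leaf_fp = 0x7FFC0100, 0x7FFC00E0, 0x7FFC00C0
    memory = {
        main_fp: 0,                      # outermost frame: no valid parent
        main_fp + WORD: "_start+0x2e",
        helper_fp: main_fp,              # helper saved main's rbp
        helper_fp + WORD: "main+0x12",
    }
    if leaf_has_frame:
        memory[leaf_fp] = helper_fp      # leaf saved helper's rbp
        memory[leaf_fp + WORD] = "helper+0x20"
        return memory, leaf_fp
    # Leaf never touched rbp: the register still points at helper's frame.
    return memory, helper_fp


if __name__ == "__main__":
    for has_frame in (True, False):
        memory, bp = build_stack(has_frame)
        print(has_frame, walk_fp(memory, "leaf+0x140", bp))
```

With leaf_has_frame=False the result skips helper entirely, which is the same shape as a stack that goes from main directly to an assembly routine.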

For the incorrect call stacks showing main -> __memcmp_avx2_movbe, the perf.data file contains only the call chain generated by the kernel, with no copy of a user stack fragment and no register data:

setarch x86_64 -R env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/debug:${LD_LIBRARY_PATH} perf record -F 1000 --call-graph fp  -- ./6_stl.bin
perf script -D | less

869122666352078 0xae0 [0x58]: PERF_RECORD_SAMPLE(IP, 0x4002): 12267/12267: 0x7ffff7d51670 period: 2332683 addr: 0
... FP chain: nr:5
.....  0: fffffffffffffe00
.....  1: 00007ffff7d51670
.....  2: 0000555555556452
.....  3: 00007ffff7be90fb
.....  4: 00005555555564de
 ... thread: 6_stl.bin:12267
 ...... dso: /usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so
6_stl.bin 12267 869122.666352:    2332683 cycles: 
            7ffff7d51670 __memcmp_avx2_movbe+0x140 (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
            555555556452 main+0x12 (/home/user/so/68259699/6_stl.bin)
            7ffff7be90fb __libc_start_main+0x10b (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
            5555555564de _start+0x2e (/home/user/so/68259699/6_stl.bin)

So, with this method, the user-space perf tool has no additional information it could use to fix the call stack. With the dwarf method, every sample event carries the registers and a partial dump of the user stack.
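The raw FP chain from such a dump can be decoded by hand to confirm that it holds nothing but addresses. The following Python sketch does that for the excerpt above. The function names are hypothetical, and the line format is assumed from this one excerpt (perf script -D output may differ between perf versions). The first entry, fffffffffffffe00, is not an address: it is the PERF_CONTEXT_USER marker ((__u64)-512 in include/uapi/linux/perf_event.h), which says that the entries after it are user-space addresses.

```python
import re

U64 = 1 << 64
PERF_CONTEXT_MAX = U64 - 4095            # entries at or above this are markers
CONTEXT_NAMES = {U64 - 128: "kernel", U64 - 512: "user"}

# Raw callchain section copied from the `perf script -D` excerpt.
DUMP = "\n".join([
    "... FP chain: nr:5",
    ".....  0: fffffffffffffe00",
    ".....  1: 00007ffff7d51670",
    ".....  2: 0000555555556452",
    ".....  3: 00007ffff7be90fb",
    ".....  4: 00005555555564de",
])

ENTRY = re.compile(r"^\.+\s+\d+:\s+([0-9a-f]{16})\s*$")


def parse_fp_chain(dump_text):
    """Return the raw 64-bit values of the first 'FP chain' found in the dump."""
    lines = dump_text.splitlines()
    for i, line in enumerate(lines):
        header = re.search(r"FP chain: nr:(\d+)", line)
        if header is None:
            continue
        values = []
        for entry in lines[i + 1 : i + 1 + int(header.group(1))]:
            match = ENTRY.match(entry.strip())
            if match is None:
                break
            values.append(int(match.group(1), 16))
        return values
    return []


def label_contexts(values):
    """Drop context markers and tag each address with the context it belongs to."""
    context = "unknown"
    labelled = []
    for value in values:
        if value >= PERF_CONTEXT_MAX:
            context = CONTEXT_NAMES.get(value, hex(value))
        else:
            labelled.append((context, value))
    return labelled


if __name__ == "__main__":
    for context, address in label_contexts(parse_fp_chain(DUMP)):
        print(context, hex(address))
```

The four user addresses are everything the sample contains, so there is nothing left for a user-space unwinder to work with afterwards.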

Gdb has full access to the live process and can use any information: all registers, any amount of the process's stack, and additional debug info for the program and its libraries. A thorough but slow backtrace in gdb is not constrained by time, by security restrictions, or by running in an uninterruptible context. The Linux kernel, by contrast, must record a perf sample quickly, cannot access swapped-out data, debug sections, or debug info files, and should not do complex parsing (which could have bugs).

A debug version of libstdc++ may help (sudo apt install libstdc++6-9-dbg), but it is slow. It also did not help me recover the lost backtrace through the asm-implemented __memcmp_avx2_movbe (libc: sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S).

If you want full backtraces, I think you need to find a way to recompile the world (or at least all the libraries used by your target application). That will probably be easier not with Ubuntu but with something like Gentoo, Arch, or Alpine.

If you are interested only in performance, why do you want the flamegraph? A flat profile will capture most of the performance data; a non-ideal flamegraph can be useful too.
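By "flat profile" I mean counting samples per function where the function is the innermost frame, ignoring callers. That only needs the leaf of each stack, so it is unaffected by broken unwinding. A minimal Python sketch, assuming folded stacks of the form root;...;leaf with a sample count; the function name and the counts in the example are made up for illustration:

```python
from collections import Counter


def flat_profile(folded):
    """Return (symbol, self_samples, percent) rows, hottest first.

    `folded` maps a 'root;...;leaf' stack string to its sample count.
    Only the leaf frame of each stack is used.
    """
    self_counts = Counter()
    for stack, count in folded.items():
        leaf = stack.rsplit(";", 1)[-1]
        self_counts[leaf] += count
    total = sum(self_counts.values())
    return [(symbol, n, 100.0 * n / total) for symbol, n in self_counts.most_common()]


if __name__ == "__main__":
    # Hypothetical sample counts, not measured data.
    example = {
        "_start;__libc_start_main;main;__memcmp_avx2_movbe": 70,
        "_start;__libc_start_main;main": 20,
        "_start;__libc_start_main;main;helper": 10,
    }
    for symbol, samples, percent in flat_profile(example):
        print(f"{percent:6.2f}%  {samples:6d}  {symbol}")
```

Even when the middle of a stack is wrong or missing, the leaf frame comes from the sampled instruction pointer, so these per-function totals stay meaningful.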

asyncprofiler malloc undefined category

The container environment is not relevant here.

It seems that libc (where the malloc implementation resides) on your system is compiled without frame pointers, so the standard stack walking mechanism in the kernel is unable to find the parent of the malloc frame.

I've recently implemented an alternative stack walking algorithm that relies on DWARF unwinding information. The new version has not been released yet, but you may try to build it from sources. Or, for your convenience, I prepared the new build here: async-profiler-2.6-dwarf-linux-x64.tar.gz

Then add the --cstack dwarf option, and all malloc stack traces should be in place.
