How to get complete stack dump from profiler in every sample for use in flame graph?
Try Linux perf_events (aka the "perf" command), which is part of the mainline Linux kernel, and usually installed via the linux-tools-common (or similar) package. I often use it to create flame graphs on Linux.
I wrote up some instructions for creating flame graphs with perf at: http://www.brendangregg.com/perf.html#FlameGraphs
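Before a flame graph can be drawn, the sampled stacks are usually collapsed into a "folded" text format: one line per unique stack, frames joined by semicolons from root to leaf, followed by the sample count. The following is a minimal sketch of that folding step only. It is a simplified stand-in for the stackcollapse scripts in the FlameGraph repository, not their actual implementation, and the function name fold_stacks and the sample data are made up for illustration.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse sampled stacks into folded lines: 'root;...;leaf count'.

    Each sample is a list of frame names ordered leaf-first, the way
    perf script prints them, so it is reversed before joining.
    """
    counts = Counter(";".join(reversed(stack)) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Hypothetical samples, leaf frame first.
samples = [
    ["memcmp", "main", "_start"],
    ["memcmp", "main", "_start"],
    ["read", "main", "_start"],
]
folded = fold_stacks(samples)
for line in folded:
    print(line)
```

Identical stacks merge into a single line with a higher count, which is what sets the width of a box in the rendered graph.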
How to get Java profiling dump for creating flame graphs on the mac?
I created 2 little shell scripts based on @cello's answer. They generate hot/cold flame graphs.
Get them from this Gist.
Usage:
ps ax | grep java # find the PID of your process
./profile.sh 20402 stacks.txt
./gen.sh stacks.txt
Alternatively, to measure an application from startup (in this case, my Gradle build, which also needed to run in another directory and with an input stream), I used:
cd ../my-project; ./gradlew --no-daemon clean build < /dev/zero & cd -; ./profile.sh $! stacks.txt
./gen.sh stacks.txt
Results:
In this example, I can clearly see that my application is I/O bound (notice blue bars on top).
Construct flamegraph with start and end timestamps
a) No, because flamegraphs need call stacks, and b) flamegraphs are pretty but useless for finding speedups. Speed problems easily hide in them, and they usually ignore I/O.
Generate Flame Graph from JFR dump with IntelliJ IDEA
Sure, you can open any existing JFR report (Main menu/Run/Open Profiler Snapshot), and if it contains sampling data, a flame graph will also be available.
How can you get frame-pointer perf call stacks/flamegraphs involving the C++ standard library?
With your code on Ubuntu 20.04 x86_64, using perf record --call-graph fp with and without -e cycles:u, I get a similar flamegraph when viewed with https://speedscope.app (prepare the data with perf script > out.txt and select out.txt in the web app).
Is it possible to get correct fp call stacks with libstdc++ without compiling it myself (which seems like a lot of work)?
No. The 'fp' call-graph method is implemented in the Linux kernel in a very simple way: https://elixir.bootlin.com/linux/v5.4/C/ident/perf_callchain_user - https://elixir.bootlin.com/linux/v5.4/source/arch/x86/events/core.c#L2464
perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
{
    ...
    fp = (unsigned long __user *)regs->bp;
    perf_callchain_store(entry, regs->ip);
    ...
    // where max_stack is probably around 127 = PERF_MAX_STACK_DEPTH https://elixir.bootlin.com/linux/v5.4/source/include/uapi/linux/perf_event.h#L1021
    while (entry->nr < entry->max_stack) {
        ...
        if (!valid_user_frame(fp, sizeof(frame)))
            break;
        bytes = __copy_from_user_nmi(&frame.next_frame, fp, sizeof(*fp));
        bytes = __copy_from_user_nmi(&frame.return_address, fp + 1, sizeof(*fp));
        perf_callchain_store(entry, frame.return_address);
        fp = (void __user *)frame.next_frame;
    }
}
This loop only follows the chain of saved frame pointers, so it can't find correct frames for code compiled with -fomit-frame-pointer.
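To see why frames go missing, the kernel loop can be imitated on a toy stack. This is only a sketch under simplifying assumptions: memory is a plain dictionary, return addresses are written as symbolic strings, and the call chain _start -> main -> compare -> std_equal -> memcmp is hypothetical, with compare and std_equal invented for illustration. Only main and compare are assumed to maintain a frame pointer.

```python
def walk_frame_pointers(memory, ip, bp, max_stack=127):
    """Mimic the kernel loop: record ip, then follow saved-rbp links.

    memory maps a stack address to the 8-byte value stored there.
    A frame is assumed to hold the caller's saved rbp at [fp] and the
    return address at [fp + 8], as on x86-64 with frame pointers.
    """
    chain = [ip]
    fp = bp
    while len(chain) < max_stack:
        if fp not in memory or fp + 8 not in memory:
            break  # stand-in for valid_user_frame() failing
        chain.append(memory[fp + 8])  # frame.return_address
        fp = memory[fp]               # frame.next_frame
    return chain

# Hypothetical stack: _start -> main -> compare -> std_equal -> memcmp.
# std_equal and memcmp do not set up a frame, so rbp still points at
# compare's frame when the sample lands inside memcmp.
memory = {
    0x7000: 0x7020, 0x7008: "main+0x12",    # compare's frame
    0x7020: 0,      0x7028: "_start+0x2e",  # main's frame
}
chain = walk_frame_pointers(memory, ip="memcmp+0x140", bp=0x7000)
print(chain)
```

The walker records the sampled instruction pointer and then only return addresses, so the result jumps from memcmp straight to main: compare and std_equal never appear, which is the same shape as the main -> __memcmp_avx2_movbe stacks discussed here.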
For the incorrect call stacks with main -> __memcmp_avx2_movbe, the perf.data file contains only the call chain generated by the kernel, with no copy of a user stack fragment and no register data:
setarch x86_64 -R env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/debug:${LD_LIBRARY_PATH} perf record -F 1000 --call-graph fp -- ./6_stl.bin
perf script -D | less
869122666352078 0xae0 [0x58]: PERF_RECORD_SAMPLE(IP, 0x4002): 12267/12267: 0x7ffff7d51670 period: 2332683 addr: 0
... FP chain: nr:5
..... 0: fffffffffffffe00
..... 1: 00007ffff7d51670
..... 2: 0000555555556452
..... 3: 00007ffff7be90fb
..... 4: 00005555555564de
... thread: 6_stl.bin:12267
...... dso: /usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so
6_stl.bin 12267 869122.666352: 2332683 cycles:
7ffff7d51670 __memcmp_avx2_movbe+0x140 (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
555555556452 main+0x12 (/home/user/so/68259699/6_stl.bin)
7ffff7be90fb __libc_start_main+0x10b (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
5555555564de _start+0x2e (/home/user/so/68259699/6_stl.bin)
So with this method the user-space perf tool has no additional information it could use to fix the call stack. With the dwarf method, every sample event carries the registers and a partial dump of the user stack.
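The raw dump above shows everything the fp method stores: a flat list of addresses. A small sketch of reading that list back is given below. The helper name parse_fp_chain is made up, and the regular expression assumes the dump layout shown above. Entry 0, fffffffffffffe00, is not a code address but the PERF_CONTEXT_USER marker ((__u64)-512 in perf_event.h), so the sketch drops every value at or above PERF_CONTEXT_MAX.

```python
import re

PERF_CONTEXT_MAX = (1 << 64) - 4095   # (__u64)-4095 in perf_event.h
PERF_CONTEXT_USER = (1 << 64) - 512   # (__u64)-512, i.e. 0xfffffffffffffe00

def parse_fp_chain(dump_text):
    """Return the addresses of an 'FP chain' block, minus context markers."""
    pattern = r"^\.{5}\s+\d+:\s+([0-9a-f]{16})\s*$"
    entries = [int(m.group(1), 16)
               for m in re.finditer(pattern, dump_text, re.M)]
    # Values at or above PERF_CONTEXT_MAX mark a context switch
    # (user, kernel, ...) rather than a return address.
    return [addr for addr in entries if addr < PERF_CONTEXT_MAX]

dump = """\
... FP chain: nr:5
..... 0: fffffffffffffe00
..... 1: 00007ffff7d51670
..... 2: 0000555555556452
..... 3: 00007ffff7be90fb
..... 4: 00005555555564de
... thread: 6_stl.bin:12267
...... dso: /usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so
"""
addresses = parse_fp_chain(dump)
print([hex(a) for a in addresses])
```

Four addresses remain, matching the four symbolized frames that perf script prints for this sample. Nothing else is available to the user-space tool for repairing the chain.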
Gdb has full access to the live process and can use any information: all registers, any amount of the process's stack, and additional debug info for the program and its libraries. An advanced, slow backtrace in gdb is also not limited by time, security, or an uninterruptible context. The Linux kernel, by contrast, must record a perf sample in a short time; it can't access swapped-out data, debug sections, or debug info files, and it should not do complex parsing (which can have bugs).
A debug version of libstdc++ may help (sudo apt install libstdc++6-9-dbg), but it is slow. It also did not help me recover the lost backtrace of the asm-implemented __memcmp_avx2_movbe (libc: sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S).
If you want a full backtrace, I think you will have to find out how to recompile the world (or at least all libraries used by your target application). That will probably be easier on something like Gentoo, Arch, or Alpine than on Ubuntu.
If you are interested only in performance, why do you want the flamegraph? A flat profile will catch most of the performance data, and a non-ideal flamegraph can be useful too.
asyncprofiler malloc undefined category
The container environment is not relevant here.
It seems that libc (where the malloc implementation resides) on your system is compiled without frame pointers, so the standard stack walking mechanism in the kernel is unable to find the parent of the malloc frame.
I've recently implemented an alternative stack walking algorithm that relies on DWARF unwinding information. The new version has not been released yet, but you may try to build it from sources. Or, for your convenience, I prepared a build of the new version here: async-profiler-2.6-dwarf-linux-x64.tar.gz
Then add the --cstack dwarf option, and all malloc stack traces should be in place.