Gdb Backtrace by Walking Frame Pointers

How can I examine the stack frame with GDB?

For the current stack frame:

info frame lists general info about the frame (where things start in memory, etc.)
info args lists arguments to the function
info locals lists local variables stored in the frame

GDB - How does it know about function calling stack?

Debuggers like GDB have two primary means of walking the stack in order to print a backtrace. They either assume the value in the frame pointer register (RBP) is a pointer to start of a linked list of stack frames, or they use special unwind info stored in the executable that describes how to walk (unwind) the stack.

Using the frame pointer

When using the frame pointer, the assumption is that it points to where the current function saved the value of its caller's frame pointer. It also assumes that just before that saved frame pointer is where the return address for the current function is stored. So that's how it knows both what the RBP value of the calling function was, and what function called the current function, which it can easily determine from the return address. It can then find all the previous stack frames and functions on the stack by walking the linked RBP values.

However, this assumes that functions use the frame pointer this way, and generally there's no guarantee that they will. Basically it's assuming that the function prologue and epilogue looks something like this:

func:
    push %rbp         # save previous frame pointer
    mov  %rsp, %rbp   # new frame pointer points to previous value
    sub  $24, %rsp    # allocate stack space for this funciton

    ...

    pop %rbp          # restore previous frame pointer
    ret

But when optimizing many compilers won't do this, because they rarely need to use a frame pointer and instead will treat RBP like any other general purpose register and use it for something else.

Using unwind info

So to generate a backtrace across functions that don't use RBP as a frame pointer, a debugger can potentially use unwind information. This is special data stored in executables (and dynamic libraries) that describes exactly how to virtually undo all the stack operations performed by a function at any point in the execution of that function. The format and location of the unwind info varies based on the executable format and CPU type. For ELF x86-64 executables the unwind info is stored in the .eh_frame section in a format based on the DWARF debugging format's unwind info. This format is too complex to describe here, but you can read the System V AMD64 ABI for more details.

Weird Backtrace in Perf

TL;DR perf backtracing process may stop at some function if there is no frame pointer saved in the stack or no CFI tables for dwarf method. Recompile libraries with -fno-omit-frame-pointer or with -g or get debuginfo. With release binaries and libs perf often will stop backtrace early without chance to reach main() or _start or clone()/start_thread() top functions.

perf profiling tool in Linux is statistical sampling profiler (without binary instrumentation): it programs software timer or event source or hardware performance monitoring unit (PMU) to generate periodic interrupt. In your example
-c 10000 -e mem_load_uops_retired.l3_miss:uppp is used to select hardware PMU in x86_64 in some kind of PEBS mode (https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR) to generate interrupt after 10000 of mem_load_uops_retired (with l3_miss mask). Generated interrupt is handled by Linux Kernel (perf_events subsystem, kernel/events and arch/x86/events). In this handler PMU is reset (reprogrammed) to generate next interrupt after 10000 more events and sample is generated. Sample data dump is saved into perf.data file by perf report command, but every wake of tool can save thousands of samples; samples can be read by perf script or perf script -D.

perf_events interrupt handler, something near __perf_event_overflow of kernel/events/core.c, has full access to the registers of current function, and has some time to do additional data retrieval to record current time, pid, etc. Part of such process is https://en.wikipedia.org/wiki/Call_stack data collection. But with x86_64 and -fomit-frame-pointer (often enabled for many system libraries of Debian/Ubuntu/others) there is no default place in registers or in function stack to store frame pointers:

https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/Optimize-Options.html#index-fomit_002dframe_002dpointer-692

-fomit-frame-pointer
Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and
restore frame pointers; it also makes an extra register available in
many functions. It also makes debugging impossible on some machines.

Starting with GCC version 4.6, the default setting (when not optimizing for size) for 32-bit Linux x86 and 32-bit Darwin x86
targets has been changed to -fomit-frame-pointer. The default can be
reverted to -fno-omit-frame-pointer by configuring GCC with the
--enable-frame-pointer configure option.

With frame pointers saved in the function stack backtracing/unwinding is easy. But for some functions modern gcc (and other compilers) may not generate frame pointer. So backtracing code like in perf_events handler either will stop backtrace at such function or needs another method of frame pointer recovery. Option -g method (--call-graph) of perf record selects the method to be used. It is documented in man perf-record http://man7.org/linux/man-pages/man1/perf-record.1.html:

--call-graph Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Default is "fp".

Allows specifying "fp" (frame pointer) or "dwarf" (DWARF's CFI -
Call Frame Information) or "lbr" (Hardware Last Branch Record
facility) as the method to collect the information used to show the
call graphs.

In some systems, where binaries are build with gcc

--fomit-frame-pointer, using the "fp" method will produce bogus call graphs, using "dwarf", if available (perf tools linked to the
libunwind or libdw library) should be used instead. Using the "lbr"
method doesn't require any compiler options. It will produce call
graphs from the hardware LBR registers. The main limitation is that
it is only available on new Intel platforms, such as Haswell. It
can only get user call chain. It doesn't work with branch stack
sampling at the same time.

When "dwarf" recording is used, perf also records (user) stack dump
when sampled. Default size of the stack dump is 8192 (bytes). User
can change the size by passing the size after comma like

"--call-graph dwarf,4096".

So, dwarf method reuses CFI tables to find stack frame sizes and find caller's stack frame. I'm not sure are CFI tables stripped from release libraries by default or not; but debuginfo probably will have them. LBR will not help because it is rather short hardware buffer. Dwarf split processing (kernel handler saves part of stack and perf user-space tool will parse it with libdw+libunwind) may lose some parts of call stack, so try also to increase dwarf stack dumps by using --call-graph dwarf,10240 or --call-graph dwarf,81920 etc.

Backtracing is implemented in arch-dependent part of perf_events: arch/x86/events/core.c:perf_callchain_user(); called from kernel/events/callchain.c:get_perf_callchain() <- perf_callchain <- perf_prepare_sample <-
__perf_event_output <- *(event->overflow_handler) <- READ_ONCE(event->overflow_handler)(event, data, regs); of __perf_event_overflow.

Gregg did warn about incomplete call stacks of perf: http://www.brendangregg.com/blog/2014-06-22/perf-cpu-sample.html

Incomplete stacks usually mean -fomit-frame-pointer was used – a compiler optimization that makes little positive difference in the real world, but breaks stack profilers. Always compile with -fno-omit-frame-pointer. More recent perf has a -g dwarf option, to use the alternate libunwind/dwarf method for retrieving stacks.

I also did write about backtraces in perf with some additional links: How does linux's perf utility understand stack traces?

When do we create base pointer in a function - before or after local variables?

In one case, it is done before the local variables.

_start is not a function. It's your entry point. There's no return address, and no caller's value of %ebp to save.

The i386 System V ABI doc suggests (in section 2.3.1 Initial Stack and Register State) that you might want to zero %ebp to mark the deepest stack frame. (i.e. before your first call instruction, so the linked list of saved ebp values has a NULL terminator when that first function pushes the zeroed ebp. See below).

Does C have a specific convention when it creates the machine code?

No, unlike in some other x86 systems, the i386 System V ABI doesn't require much about your stack-frame layout. (Linux uses the System V ABI / calling convention, and the book you're using (PGU) is for Linux.)

In some calling conventions, setting up ebp is not optional, and the function entry sequence has to push ebp just below the return address. This creates a linked list of stack frames which allows an exception handler (or debugger) to backtrace up the stack. (How to generate the backtrace by looking at the stack values?). I think this is required in 32-bit Windows code for SEH (structured exception handling), at least in some cases, but IDK the details.

The i386 SysV ABI defines an alternate mechanism for stack unwinding which makes frame pointers optional, using metadata in another section (.eh_frame and .eh_frame_hdr which contains metadata created by .cfi_... assembler directives, which in theory you could write yourself if you wanted stack-unwinding through your function to work. i.e. if you were calling any C++ code which expected throw to work.)

If you want to use the traditional frame-walking in current gdb, you have to actually do it yourself by defining a GDB function like gdb backtrace by walking frame pointers or Force GDB to use frame-pointer based unwinding. Or apparently if your executable has no .eh_frame section at all, gdb will use the EBP-based stack-walking method.

If you compile with gcc -fno-omit-frame-pointer, your call stack will have this linked-list property, because when C compilers do make proper stack frames, they push ebp first.

IIRC, perf has a mode for using the frame-pointer chain to get backtraces while profiling, and apparently this can be more reliable than the default .eh_frame stuff for correctly accounting which functions are responsible for using the most CPU time. (Or causing the most cache misses, branch mispredicts, or whatever else you're counting with performance counters.)

Wouldn't both just work even if I mix them up in different functions of a program? One function does it before, the other does it after etc.

Yes, it would work fine. In fact setting up ebp at all is optional, but when writing by hand it's easier to have a fixed base (unlike esp which moves around when you push/pop).

For the same reason, it's easier to stick to the convention of mov %esp, %ebp after one push (of the old %ebp), so the first function arg is always at ebp+8. See What is stack frame in assembly? for the usual convention.

But you could maybe save code size by having ebp point in the middle of some space you reserved, so all the memory addressable with an ebp + disp8 addressing mode is usable. (disp8 is a signed 8-bit displacement: -128 to +124 if we're limiting to 4-byte aligned locations). This saves code bytes vs. needing a disp32 to reach farther. So you might do

bigfunc:
    push   %ebp
    lea    -112(%esp), %ebp   # first arg at ebp+8+112 = 120(%ebp)
    sub    $236, %esp         # locals from -124(%ebp) ... 108(%ebp)
                              # saved EBP at 112(%ebp), ret addr at 116(%ebp)
                              # 236 was chosen to leave %esp 16-byte aligned.

Or delay saving any registers until after reserving space for locals, so we aren't using up any of the locations (other than the ret addr) with saved values we never want to address.

bigfunc2:                     # first arg at 4(%esp)
    sub    $252, %esp         # first arg at 252+4(%esp)
    push   %ebp               # first arg at 252+4+4(%esp)
    lea    140(%esp), %ebp    # first arg at 260-140 = 120(%ebp)

    push   %edi              # save the other call-preserved regs
    push   %esi
    push   %ebx
             # %esp is 16-byte aligned after these pushes, in case that matters

(Remember to be careful how you restore registers and clean up. You can't use leave because esp = ebp isn't right. With the "normal" stack frame sequence, you might restore other pushed registers (from near the saved EBP) with mov, then use leave. Or restore esp to point at the last push (with add), and use pop instructions.)

But if you're going to do this, there's no advantage to using ebp instead of ebx or something. In fact, there's a disadvantage to using ebp: the 0(%ebp) addressing mode requires a disp8 of 0, instead of no displacement, but %ebx wouldn't. So use %ebp for a non-pointer scratch register. Or at least one that you don't dereference without a displacement. (This quirk is irrelevant with a real frame pointer: (%ebp) is the saved EBP value. And BTW, the encoding that would mean (%ebp) with no displacement is how the ModRM byte encodes a disp32 with no base register, like (12345) or my_label)

These example are pretty artifical; you usually don't need that much space for locals unless it's an array, and then you'd use indexed addressing modes or pointers, not just a disp8 relative to ebp. But maybe you need space for a few 32-byte AVX vectors. In 32-bit code with only 8 vector registers, that's plausible.

AVX512 compressed disp8 mostly defeats this argument for 64-byte AVX512 vectors, though. (But AVX512 in 32-bit mode can still only use 8 vector registers, zmm0-zmm7, so you could easily need to spill some. You only get x/ymm8-15 and zmm8-31 in 64-bit mode.)