Understanding Stack Frame of Function Call in C/C++

Explain the concept of a stack frame in a nutshell

A stack frame is a frame of data that gets pushed onto the stack. In the case of a call stack, a stack frame would represent a function call and its argument data.

If I remember correctly, the function return address is pushed onto the stack first, then the arguments and space for local variables. Together, they make the "frame," although this is likely architecture-dependent. The processor knows how many bytes are in each frame and moves the stack pointer accordingly as frames are pushed and popped off the stack.

EDIT:

There is a big difference between higher-level call stacks and the processor's call stack.

When we talk about a processor's call stack, we are talking about working with addresses and values at the byte/word level in assembly or machine code. There are "call stacks" when talking about higher-level languages, but they are a debugging/runtime tool managed by the runtime environment so that you can log what went wrong with your program (at a high level). At this level, things like line numbers and method and class names are often known. By the time the processor gets the code, it has absolutely no concept of these things.

How exactly does the callstack work?

The call stack could also be called a frame stack.

The things that are stacked after the LIFO principle are not the local variables but the entire stack frames ("calls") of the functions being called. The local variables are pushed and popped together with those frames in the so-called function prologue and epilogue, respectively.

Inside the frame the order of the variables is completely unspecified; Compilers "reorder" the positions of local variables inside a frame appropriately to optimize their alignment so the processor can fetch them as quickly as possible. The crucial fact is that the offset of the variables relative to some fixed address is constant throughout the lifetime of the frame - so it suffices to take an anchor address, say, the address of the frame itself, and work with offsets of that address to the variables. Such an anchor address is actually contained in the so-called base or frame pointer which is stored in the EBP register. The offsets, on the other hand, are clearly known at compile time and are therefore hardcoded into the machine code.

This graphic from Wikipedia shows what the typical call stack is structured like¹:

Picture of a stack

Add the offset of a variable we want to access to the address contained in the frame pointer and we get the address of our variable. So shortly said, the code just accesses them directly via constant compile-time offsets from the base pointer; It's simple pointer arithmetic.

Example

#include <iostream>

int main()
{
    char c = std::cin.get();
    std::cout << c;
}

gcc.godbolt.org gives us

main:
    pushq   %rbp
    movq    %rsp, %rbp
    subq    $16, %rsp

    movl    std::cin, %edi
    call    std::basic_istream<char, std::char_traits<char> >::get()
    movb    %al, -1(%rbp)
    movsbl  -1(%rbp), %eax
    movl    %eax, %esi
    movl    std::cout, %edi
    call    [... the insertion operator for char, long thing... ]

    movl    $0, %eax
    leave
    ret

.. for main. I divided the code into three subsections.
The function prologue consists of the first three operations:

Base pointer is pushed onto the stack.
The stack pointer is saved in the base pointer
The stack pointer is subtracted to make room for local variables.

Then cin is moved into the EDI register² and get is called; The return value is in EAX.

So far so good. Now the interesting thing happens:

The low-order byte of EAX, designated by the 8-bit register AL, is taken and stored in the byte right after the base pointer: That is -1(%rbp), the offset of the base pointer is -1. This byte is our variable c. The offset is negative because the stack grows downwards on x86. The next operation stores c in EAX: EAX is moved to ESI, cout is moved to EDI and then the insertion operator is called with cout and c being the arguments.

Finally,

The return value of main is stored in EAX: 0. That is because of the implicit return statement.
You might also see xorl rax rax instead of movl.
leave and return to the call site. leave is abbreviating this epilogue and implicitly
- Replaces the stack pointer with the base pointer and
- Pops the base pointer.

After this operation and ret have been performed, the frame has effectively been popped, although the caller still has to clean up the arguments as we're using the cdecl calling convention. Other conventions, e.g. stdcall, require the callee to tidy up, e.g. by passing the amount of bytes to ret.

Frame Pointer Omission

It is also possible not to use offsets from the base/frame pointer but from the stack pointer (ESB) instead. This makes the EBP-register that would otherwise contain the frame pointer value available for arbitrary use - but it can make debugging impossible on some machines, and will be implicitly turned off for some functions. It is particularly useful when compiling for processors with only few registers, including x86.

This optimization is known as FPO (frame pointer omission) and set by -fomit-frame-pointer in GCC and -Oy in Clang; note that it is implicitly triggered by every optimization level > 0 if and only if debugging is still possible, since it doesn't have any costs apart from that.
For further information see here and here.

¹ As pointed out in the comments, the frame pointer is presumably meant to point to the address after the return address.

² Note that the registers that start with R are the 64-bit counterparts of the ones that start with E. EAX designates the four low-order bytes of RAX. I used the names of the 32-bit registers for clarity.

confusion about function call stack

Your drawing is not correct. The local stack variables for a function are all below any return addresses. Otherwise, as you have observed, the locals would get lost when you call a function.

It should be like this:

| parameters for foo() <int x>  |
| return address of foo()       |
| local of spam() <int c>       |
| local of spam() <int b>       |
| local of spam() <int a>       |
| parameters for spam() <None>  |
| return address of spam()      |
| locals of main() <None>       | 
| parameters for main() <None>  |

I think the confusion is that you believe that variable declarations are treated as statements and executed in order. In fact the compiler will typically analyse a function to decide how much stack space is needed for all the local variables. Then it emits code to adjust the stack pointer accordingly and that adjustment is made on entry to the function. Any calls to other functions can then push onto the stack without interfering with this function's stack frame.

what's the difference between callstack and stack?

Stack denotes a data structure which consists of a set of usually similar items and has the following abilities:

push: adds an item to the "top" of the stack, which will become the new top
pop: removes the item from the top, thus the item which was previously the top will be the top again; this operation returns the item you remove
top: gets the top item of the stack, without modifying your stack

Your memory has a section called "the stack", where, as you correctly understood, functions are stored. That is the call stack. However, stack and call stack are not 100% equivalent, since stacks are used in many other cases, basically whenever you need a LIFO (Last In First Out) processing. It's used at the LIFO policy of the CPU, for example, or, when you do a depth-first search in a graph. In short, stack is a data structure, which can be applied in many cases, the call stack in memory being a prominent example.

So, why stack is in use to store function calls in memory. Let's take into consideration embedded function calls, like:

f1(f2(...(fn(x))...))

It's a rule of thumb that in order to evaluate fi, 1 <= i < n, you need to evaluate fj, where 1 < j <= n, assuming that i < j. As a result, f1 is the first to be called, but the last to be evaluated. Hence, you have a LIFO processing, which is best done via using a stack.

Is the stack frame required for all functions in C on x86-64?

No, the stack frame is, at least in theory, not always required. An optimizing compiler might in some cases avoid using the call stack. Notably when it is able to inline a called function (in some specific call site), or when the compiler successfully detects a tail call (which reuses the caller's frame).

Read the ABI of your platform to understand requirements related to the stack.

You might try to compile your program with link time optimization (e.g. compile and link with gcc -flto -O2) to get more optimizations.

In principle, one could imagine a compiler clever enough to (for some programs) avoid using any call stack.

BTW, I just compiled a naive recursive long fact(int n) factorial function with GCC 7.1 (on Debian/Sid/x86-64) at -O3 (i.e. gcc -fverbose-asm -S -O3 fact.c). The resulting assembler code fact.s contains no call machine instruction.

Base of Global Call Stack in C/C++

First of all, there is no such thing as a "global call stack". Each thread has a stack, and the stack for the main thread is often looking quite different from the thread of any thread spawned later on. And mostly, each of these "stacks" is just an arbitrary memory segment currently declared to be used as such, sub-allocated from any arbitrary suitable memory pool.

And due to compiler optimizations, many function calls will not even end up on the stack, usually. Meaning there isn't necessarily a distinguishable stack frame. You are only guaranteed that you can reference variables you put on the stack, but not that the compiler must preserve anything you didn't explicitly reference.

There is not even a guarantee that the memory layout for your call stack must even be organized in distinguishable frames. Function pointers are never guaranteed to be part of the stack frame, just happens to be an implementation detail in architectures where data and function pointers may co-exist in the address space. (As there are architectures which require return addresses to be stored in a different address space than the data used in the call stack.)

That aside, yes, there is code which is executed outside of the main() function. Specifically initializers for global static variables, code to set up the runtime environment (env, call parameters, stdin/stdout) etc.

E.g. when having linked to libc, there is __libc_start_main which will call your main function after initialization is done. And clean up when your main function returns.

__libc_start_main is about the point where "stack" starts being used, as far as you can see from within the program. That's not actually true though, there has already been some loader code been executed in kernel space, for reserving memory for your process to operate in initially (including memory for the future stack), initializing registers and memory to well defined values etc.

Right before actually "starting" your process, after dropping out of kernel mode, arbitrary pointers to a future stack, and the first instruction of your program, are loaded into the corresponding processor registers. Effectively, that's where __libc_start_main (or any other initialization function, depending on your runtime) starts running, and the stack visible to you starts building up.

Getting back into the kernel usually involves an interrupt now, which doesn't follow the stack either, but may just directly access processor registers to simply swap the contents of the corresponding processor registers. (E.g. if you call a function from the kernel, the memory required by the call stack inside the function call is not allocated from your stack, but from one you don't even have access to.)

Either way, everything that happens before main() is called, and whenever you enter a syscall, is implementation dependent, and you are not guaranteed any specific observable behavior. And messing around with processor registers, and thereby alternating the program flow, is also far outside defined behavior as far as a pure C / C++ run time is concerned.

Is new stack frame allocated for the block(conditional or unconditional) code in c/c++?

First, note this is a pure implementation detail, the compiler is free to do what it will so long as the result behaves as-if the original code were executed. Stack frames are simply a detail real hardware use to get there.

That said, most real-world compilers will simply calculate all the stack space a given function might need and allocate that on function entry. It won't necessarily all be used (a particular block might not even be entered on a given function call, as an example) but the stack space for the variables that block requires will still have been reserved as part of the function prologue. This is done rather than mess with changing the stack space at several points in a function execution.