Why Does GCC 4.x Reserve 8 Bytes of Stack by Default on Linux When Calling a Method

Why does GCC allocate more stack memory than needed?

(This answer is a summary of comments posted above by Antti Haapala, klutt and Peter Cordes.)

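GCC allocates more space than "necessary" in order to ensure that the stack is properly aligned for the call to proc. The call instruction that entered the current function pushed an 8-byte return address, so on entry the stack pointer is 8 bytes off a 16-byte boundary; to restore 16-byte alignment before the next call, it must be adjusted by a multiple of 16, plus 8 (i.e. by an odd multiple of 8), counting any pushes. See: Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?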
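The "multiple of 16, plus 8" rule can be sketched as a small calculation. This is only an illustration: frame_adjust is a hypothetical helper, not something GCC exposes, and it assumes the function pushes no registers in its prologue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a GCC API): smallest amount to subtract from RSP
   in a function that needs `locals` bytes of stack and then makes a call.
   Assumption: no registers are pushed in the prologue. */
size_t frame_adjust(size_t locals)
{
    /* On entry RSP % 16 == 8, because the caller's `call` pushed an 8-byte
       return address onto a 16-byte-aligned stack. The adjustment therefore
       has to be 8 mod 16 as well, and at least `locals`. */
    size_t adjust = (locals + 7) / 16 * 16 + 8;
    assert(adjust >= locals && adjust % 16 == 8);
    return adjust;
}
```

Under these assumptions, frame_adjust(0) and frame_adjust(8) both give 8, while frame_adjust(16) gives 24: a function with 16 bytes of locals cannot get away with sub $16, %rsp before a call.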

What's strange is that the code in the book doesn't keep the stack aligned this way; the code as shown would violate the ABI and, if proc actually relies on proper stack alignment (e.g. by using aligned SSE2 instructions), it may crash.

So it appears that either the code in the book was incorrectly copied from compiler output, or else the authors of the book are using some unusual compiler flags which alter the ABI.

Modern GCC 11.2 emits nearly identical asm (Godbolt) using -Og -mpreferred-stack-boundary=3 -maccumulate-outgoing-args. Of these, -mpreferred-stack-boundary=3 changes the ABI to maintain only 2^3-byte stack alignment, down from the default 2^4. (Code compiled this way can't safely call anything compiled normally, even standard library functions.) -maccumulate-outgoing-args used to be the default in older GCC, but modern CPUs have a "stack engine" that makes push/pop single-uop, so that option isn't the default anymore; using push for stack args saves a bit of code size.

One difference from the book's asm is a movl $0, %eax before the call: because there's no prototype, the caller has to assume the function might be variadic and pass AL = the number of FP args in XMM registers. (A prototype that matches the args passed would prevent that.) The other instructions are all the same, and in the same order, as in whatever older GCC version the book used, except for the choice of registers after call proc returns: it ends up using movslq %edx, %rdx instead of cltq (which sign-extends EAX into RAX).

CS:APP 3e global edition is notorious for errors in practice problems introduced by the publisher (not the authors), but apparently this code is present in the North American edition, too. So this may be the author's mistake / choice to use actual compiler output with weird options. Unlike some of the bad global edition practice problems, this code could have come unmodified from some GCC version, but only with non-standard options.

Related: Why does GCC allocate more space than necessary on the stack, beyond what's needed for alignment? - GCC has a missed-optimization bug where it sometimes reserves an additional 16 bytes that it truly didn't need to. That's not what's happening here, though.

Why does the compiler reserve a little stack space but not the whole array size?

Below the stack pointer there is a 128-byte red zone that a function may use without adjusting rsp. Since main calls no other function, it has no need to move the stack pointer by the full size of the array (not that the extra would matter in this case). It only subtracts enough from rsp that the rest of the array falls within the red zone.

You can see the difference by adding a function call to main:

int test() {
  int arr[120];
  return arr[0]+arr[119];
}

int main() {
  int arr[120];
  test();
  return arr[0]+arr[119];
}

This gives:

test:
  push rbp
  mov rbp, rsp
  sub rsp, 360
  mov edx, DWORD PTR [rbp-480]
  mov eax, DWORD PTR [rbp-4]
  add eax, edx
  leave
  ret
main:
  push rbp
  mov rbp, rsp
  sub rsp, 480
  mov eax, 0
  call test
  mov edx, DWORD PTR [rbp-480]
  mov eax, DWORD PTR [rbp-4]
  add eax, edx
  leave
  ret

You can see that the main function subtracts by 480 because it needs the array to be in its stack space, but test doesn't need to because it doesn't call any functions.

The additional usage of array elements does not significantly change the output, but it was added to make it clear that it's not pretending that those elements don't exist.
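The numbers in the listing can be checked with a little arithmetic. The sketch below is only an illustration: leaf_frame_covers is a hypothetical helper, and it assumes the push rbp / mov rbp, rsp / sub rsp, N prologue shown above.

```c
#include <assert.h>
#include <stdbool.h>

enum { RED_ZONE_BYTES = 128 };  /* red zone size in the x86-64 System V ABI */

/* Hypothetical helper: after `push rbp; mov rbp, rsp; sub rsp, sub_amount`,
   RSP equals RBP - sub_amount, and the red zone extends down to RSP - 128.
   A local that starts `depth_below_rbp` bytes below RBP is safe in a leaf
   function if it does not start below the bottom of the red zone. */
bool leaf_frame_covers(long sub_amount, long depth_below_rbp)
{
    assert(sub_amount >= 0 && depth_below_rbp >= 0);
    return depth_below_rbp <= sub_amount + RED_ZONE_BYTES;
}
```

For test, leaf_frame_covers(360, 480) holds because 360 + 128 = 488 covers the array start at rbp-480. The check only applies to leaf functions: main makes a call, which would overwrite anything below rsp, so it has to subtract the full 480.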

gcc 8.2+ doesn't always align the stack before a call on x86?

Since you provided a definition of the function in the same translation unit, apparently GCC sees that the function doesn't care about stack alignment and doesn't bother much with it. And apparently this basic inter-procedural analysis / optimization (IPA) is on by default even at -O0.

It turns out this option has an obvious name, which I found by searching for "ipa" options in the manual: -fipa-stack-alignment, which is on by default even at -O0. Manually turning it off with -fno-ipa-stack-alignment results in what you expected: a second sub whose value depends on the number of pushes (Godbolt), making sure ESP is aligned by 16 before a call, as modern Linux versions of the i386 SysV ABI require.
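How the second sub depends on the number of pushes can be sketched as follows. call_padding is a hypothetical helper, and the sketch assumes ESP is already 16-byte aligned just before the padding is applied; real frames may differ.

```c
#include <assert.h>

/* Hypothetical helper: bytes to subtract from ESP before pushing `nargs`
   4-byte arguments, so that ESP is 16-byte aligned again at the call.
   Assumption: ESP is 16-byte aligned before the padding is applied. */
unsigned call_padding(unsigned nargs)
{
    unsigned pushed = 4u * nargs;               /* bytes the pushes will use */
    unsigned pad = (16u - pushed % 16u) % 16u;  /* fill up to the next 16 */
    assert((pushed + pad) % 16u == 0);
    return pad;
}
```

With two pointer args this gives 8 (sub esp, 8 followed by two pushes), and with a third arg it drops to 4, which is the kind of change in the sub amount described above.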

Or if you change the definition to just a declaration, then the resulting asm is as expected, fully respecting -mpreferred-stack-boundary.

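// definition visible to the caller: GCC may skip aligning the stack
void callee(void* a, void* b) {
}

// declaration only: GCC aligns the stack as expected
void callee(void* a, void* b);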

Compiling with -fPIC also stops GCC from assuming anything about the callee, since it then has to respect the possibility of function interposition (e.g. via LD_PRELOAD).

Without compiling for a shared library, GCC is allowed to assume that any definition it sees for a global function is the definition, thanks to ISO C's one-definition-rule.

If you use __attribute__((noipa)) on the function definition, then call sites won't assume anything based on the definition. Just like if you'd renamed the definition (so you could still look at it) and provided only a declaration of the name the caller uses.

If you just want to stop inlining, you can use __attribute__((noinline,noclone)) instead, to still allow the callsite to be like it would if the optimizer simply chose not to inline, but could still see this definition. That may or may not be what you want.
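A minimal sketch of both attribute choices follows. The function names and the NOIPA / NOINLINE_NOCLONE macros are made up for illustration; the guards are there because noipa needs GCC 8 or later and clang does not accept noclone.

```c
#include <assert.h>  /* only for the check mentioned in the usage note */

/* Hypothetical wrapper macros: fall back to nothing where the attributes
   are not available (noipa: GCC 8+; noclone: GCC only). */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
#define NOIPA __attribute__((noipa))
#else
#define NOIPA
#endif
#if defined(__GNUC__) && !defined(__clang__)
#define NOINLINE_NOCLONE __attribute__((noinline, noclone))
#else
#define NOINLINE_NOCLONE
#endif

/* Call sites must treat this like an external function with only a declaration. */
NOIPA void callee_opaque(void *a, void *b)
{
    (void)a;
    (void)b;
}

/* Not inlined or cloned, but callers may still use what they learn from the body. */
NOINLINE_NOCLONE int callee_visible(int x)
{
    return x + 1;
}

int caller(void)
{
    int local = 41;
    callee_opaque(&local, &local);  /* compiler cannot assume local is unchanged */
    return callee_visible(local);
}
```

Here assert(caller() == 42) passes; the interesting part is the asm at each call site, which you can compare with and without the attributes.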

See also How to remove "noise" from GCC/clang assembly output? re: writing functions whose asm is interesting to look at, and compiler options.

And BTW, I found it easiest to change the declaration / definition to variadic, so I could add or remove args with only a change to the caller. With a definition present, I could still reproduce your result: the sub amount doesn't change even when an extra arg changes the number of pushes. With just a declaration, it does change.

void callee(void* a, ...)  // {}   // comment out a body or not
;

Stack allocation, why the extra space?

It's alignment. I assumed for some reason that esp would be aligned from the start, which it clearly isn't.

gcc aligns stack frames to 16 bytes by default, which is what happened.

20 bytes are reserved on the stack for no apparent reason when C code is compiled into machine code

Without optimizations, the stack is used to save and restore the base pointer. Under the x86_64 calling conventions (https://en.wikipedia.org/wiki/X86_calling_conventions), the stack must be aligned to a 16-byte boundary when calling functions, so this is most likely what happens in your case. At least, this is what happens in my case when I compile your code on my system. Here is the asm:

callee_fun(int): # @callee_fun(int)
  pushq %rbp
  movq %rsp, %rbp
  movl %edi, -4(%rbp)
  movl -4(%rbp), %eax
  popq %rbp
  retq
caller_fun(): # @caller_fun()
  pushq %rbp
  movq %rsp, %rbp
  subq $16, %rsp
  movl $57054, %edi # imm = 0xDEDE
  callq callee_fun(int)
  movl %eax, -4(%rbp) # 4-byte Spill
  addq $16, %rsp
  popq %rbp
  retq

It is worth noting that when optimizations are turned on, the stack is not used or modified at all:

callee_fun(int): # @callee_fun(int)
  movl %edi, %eax
  retq
caller_fun(): # @caller_fun()
  retq
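Both listings are consistent with source along the following lines. This is a hypothetical reconstruction from the asm, since the asker's actual code is not quoted here; the demangled labels such as callee_fun(int) suggest it was compiled as C++, but the same source is valid C.

```c
#include <assert.h>  /* only for the check mentioned in the usage note */

/* Hypothetical reconstruction: returns its argument unchanged,
   matching `movl %edi, %eax; retq` in the optimized listing. */
int callee_fun(int a)
{
    return a;
}

/* Hypothetical reconstruction: the result of the call is discarded, which
   would explain why the optimized caller_fun is a lone `retq`. */
void caller_fun(void)
{
    callee_fun(0xDEDE);  /* 0xDEDE == 57054, the immediate loaded into %edi */
}
```

A quick sanity check is assert(callee_fun(0xDEDE) == 57054), which matches the imm = 0xDEDE comment in the unoptimized listing.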

Last but not least, when studying asm listings, don't disassemble an object file or executable. Instead, have your compiler generate an assembly listing. This will give you much more context.

If you are using gcc, a good command to do this would be

gcc -fverbose-asm -S -O

Why does my buffer have more memory allocated on the stack than I asked for?

I am guessing you are using gcc and compiling without optimizations, like this (godbolt).

There are a couple things going on here:

First, when compiling without optimizations, the compiler tries to ensure that every local variable has an address in memory, so that it can easily be inspected or modified by a debugger. This includes function parameters, which on x86-64 are otherwise passed in registers. So the compiler needs to allocate additional stack space where the argc and argv parameters can be "spilled". You can see the spilling at lines 5 and 6 of the assembly:

        movl    %edi, -516(%rbp)
        movq    %rsi, -528(%rbp)

If you look carefully, you may note that the compiler wasted 4 bytes by placing argc (from %edi) at address -516(%rbp) when -520(%rbp) was otherwise available. It's not entirely clear why, but after all, it's not optimizing! So that gets us to 516 bytes.

The other issue is that the x86-64 ABI requires 16-byte stack alignment; see Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?. In this case, to make a long story short, it implies that our stack adjustment needs to be a multiple of 16 bytes. (The return address and pushed rbp add a further 16 bytes which doesn't disturb this alignment.) So our 516 must be rounded up to the next multiple of 16, which is 528.

If the compiler had been more careful and not wasted that 4 bytes in between argc and argv, we could have got away with only 512 bytes. One benefit of using 528, though, is that the buffer str ends up 16-byte aligned. This isn't required for an array of char, whose minimum alignment is just 1, but it can make it more efficient for string functions like strcpy to use fast SIMD algorithms. I am not sure if the compiler is doing this deliberately or if it's just a coincidence.
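The rounding in the last two paragraphs is ordinary round-up-to-a-multiple arithmetic. As a sketch (round_up is a hypothetical helper, not something the compiler exposes):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of align,
   where align must be a power of two (16 for the stack here). */
size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}
```

round_up(516, 16) is 528, the adjustment the compiler used, while round_up(512, 16) stays at 512, which is the size that would have sufficed without the 4 wasted bytes.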