Trying to Understand Gcc's Complicated Stack-Alignment at the Top of Main That Copies the Return Address

Trying to understand gcc's complicated stack-alignment at the top of main that copies the return address

I've had a go at it:

;# As you have already noticed, the compiler wants to align the stack
;# pointer on a 16 byte boundary before it pushes anything. That's
;# because certain instructions' memory access needs to be aligned
;# that way.
;# So in order to first save the original offset of esp (+4), it
;# executes the first instruction:
lea    ecx,[esp+0x4]

;# Now alignment can happen. Without the previous insn the next one
;# would have made the original esp unrecoverable:
and    esp,0xfffffff0

;# Next it pushes the return addresss and creates a stack frame. I
;# assume it now wants to make the stack look like a normal
;# subroutine call:
push   DWORD PTR [ecx-0x4]
push   ebp
mov    ebp,esp

;# Remember that ecx is still the only value that can restore the
;# original esp. Since ecx may be garbled by any subroutine calls,
;# it has to save it somewhere:
push   ecx

Why is GCC pushing an extra return address on the stack?

Update: gcc8 simplifies this at least for normal use-cases (-fomit-frame-pointer, and no alloca or C99 VLAs that require variable-size allocation). Perhaps motivated by increasing usage of AVX leading to more functions wanting a 32-byte aligned local or array.

Also, probably a duplicate of What's up with gcc weird stack manipulation when it wants extra stack alignment?

This complicated prologue is fine if it only ever runs a couple times (e.g. at the start of main in 32-bit code), but the more it appears the more worthwhile it is to optimize it. GCC sometimes still over-aligns the stack in functions where all >16-byte aligned objects are optimized into registers, which is a missed optimization already but less bad when the stack alignment is cheaper.

gcc makes some clunky code when aligning the stack within a function, even with optimization enabled. I have a possible theory (see below) on why gcc might be copying the return address to just above where it saves ebp to make a stack frame (and yes, I agree that's what gcc is doing). It doesn't look necessary in this function, and clang doesn't do anything like that.

Besides that, the nonsense with ecx is probably just gcc not optimizing away unneeded parts of its align-the-stack boilerplate. (The pre-alignment value of esp is needed to reference args on the stack, so it makes sense that it puts the address of the first would-be arg into a register).

You see the same thing with optimization in 32-bit code (where gcc makes a main that doesn't assume 16B stack alignment, even though the current version of the ABI requires that at process startup, and the CRT code that calls main either aligns the stack itself or preserves the initial alignment provided by the kernel, I forget). You also see this in functions that align the stack to more than 16B (e.g. functions that use __m256 types, sometimes even if they never spill them to the stack. Or functions with an array declared with C++11 alignas(32), or any other way of requesting alignment.) In 64-bit code, gcc always seems to use r10 for this, not rcx.

There's nothing required for ABI compliance about the way gcc does it, because clang does something much simpler.

I added an aligned variable (with volatile as a simple way to force the compiler to actually reserve aligned space for it on the stack, instead of optimizing it away). I put your code on the Godbolt compiler explorer, to look at the asm with -O3. I see the same behaviour from gcc 4.9, 5.3, and 6.1, but different behaviour with clang.

int main(){
    __attribute__((aligned(32))) volatile int v = 1;
    return 0;
}

Clang3.8's -O3 -m32 output is functionally identical to its -m64 output. Note that -O3 enables -fomit-frame-pointer, but some functions make stack frames anyway.

    push    ebp
    mov     ebp, esp                # make a stack frame *before* aligning, so ebp-relative addressing can only access stack args, not aligned locals.
    and     esp, -32
    sub     esp, 32                 # esp is 32B aligned with 32 or 48B above esp reserved (depending on incoming alignment)
    mov     dword ptr [esp], 1      # store v
    xor     eax, eax                # return 0
    mov     esp, ebp                # leave
    pop     ebp
    ret

gcc's output is nearly the same between -m32 and -m64, but it puts v in the red-zone with -m64 so the -m32 output has two extra instructions:

    # gcc 6.1 -m32 -O3 -fverbose-asm.  Most of gcc's comment lines are empty.  I guess that means it has no idea why it's emitting those insns :P
    lea     ecx, [esp+4]      #,   get a pointer to where the first arg would be
    and     esp, -32  #,          align
    xor     eax, eax  #           return 0
    push    DWORD PTR [ecx-4]       #  No clue WTF this is for; this looks batshit insane, but happens even in 64bit mode.
    push    ebp     #             make a stackframe, even though -fomit-frame-pointer is on by default and we can already restore the original esp from ecx (unlike clang)
    mov     ebp, esp  #,
    push    ecx     #             save the old esp value (even though this function doesn't clobber ecx...)
    sub     esp, 52   #,          reserve space for v  (not present with -m64)
    mov     DWORD PTR [ebp-56], 1     # v,
    add     esp, 52   #,          unreserve (not present with -m64)
    pop     ecx       #           restore ecx (even though nothing clobbered it)
    pop     ebp       #           at least it knows it can just pop instead of `leave`
    lea     esp, [ecx-4]      #,  restore pre-alignment esp
    ret

It seems that gcc wants to make its stack frame (with push ebp) after aligning the stack. I guess that makes sense, so it can reference locals relative to ebp. Otherwise it would have to use esp-relative addressing, if it wanted aligned locals.

My theory on why gcc does this:

The extra copy of the return address after aligning but before pushing ebp means that the return address is copied to the expected place relative to the saved ebp value (and the value that will be in ebp when child functions are called). So this does potentially help code that wants to unwind the stack by following the linked list of stack frames, and looking at return-addresses to find out what function is involved.

I'm not sure whether this matters with modern stack-unwind info that allows stack-unwinding (backtraces / exception handling) with -fomit-frame-pointer. (It's metadata in the .eh_frame section. This is what the .cfi_* directives around every modification to esp are for.) I should look at what clang does when it has to align the stack in a non-leaf function.

The original value of esp would be needed inside the function to reference function args on the stack. I think gcc doesn't know how to optimize away unneeded parts of its align-the-stack method. (e.g. out main doesn't look at its args (and is declared not to take any))

This kind of code-gen is typical of what you see in a function that needs to align the stack; it's not extra weird because of using a volatile with automatic storage.

What does it mean to align the stack?

Assume the stack looks like this on entry to _main (the address of the stack pointer is just an example):

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230

Push %ebp, and subtract 8 from %esp to reserve some space for local variables:

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230
|      %ebp       |
+-----------------+  <--- 0xbfff122c
:    reserved     :
:     space       :
+-----------------+  <--- 0xbfff1224

Now, the andl instruction zeroes the low 4 bits of %esp, which may decrease it; in this particular example, it has the effect of reserving an additional 4 bytes:

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230
|      %ebp       |
+-----------------+  <--- 0xbfff122c
:    reserved     :
:     space       :
+ - - - - - - - - +  <--- 0xbfff1224
:   extra space   :
+-----------------+  <--- 0xbfff1220

The point of this is that there are some "SIMD" (Single Instruction, Multiple Data) instructions (also known in x86-land as "SSE" for "Streaming SIMD Extensions") which can perform parallel operations on multiple words in memory, but require those multiple words to be a block starting at an address which is a multiple of 16 bytes.

In general, the compiler can't assume that particular offsets from %esp will result in a suitable address (because the state of %esp on entry to the function depends on the calling code). But, by deliberately aligning the stack pointer in this way, the compiler knows that adding any multiple of 16 bytes to the stack pointer will result in a 16-byte aligned address, which is safe for use with these SIMD instructions.

gcc subtracting from esp before call

As I mentioned in my comments:

The first few lines (plus the push ecx) are to ensure the stack is aligned on a 16-byte boundary which is required by the Linux System V i386 ABI. The pop ecx and lea before the ret in main is to undo that alignment work.

@RossRidge has provided a link to another Stackoverflow answer that details this quite well.

In this case you seem to be doing real mode development. GCC isn't well suited for this but it can work and I will assume you know what you are doing. I mention some of the pitfalls of using -m16 in this Stackoverflow answer. I put this warning in that answer regarding real mode development with GCC:

There are so many pitfalls in doing this that I recommend against it.

If you remain undeterred and wish to continue forward you can do a few things to minimize the code. The 16-byte alignment of the stack at the point a function call is made is part of the more recent Linux System V i386 ABIs. Since you are generating code for a non-Linux environment you can change the stack alignment to 4 using compiler option -mpreferred-stack-boundary=2 . The GCC manual says:

-mpreferred-stack-boundary=num

Attempt to keep the stack boundary aligned to a 2 raised to num byte boundary. If -mpreferred-stack-boundary is not specified, the default is 4 (16 bytes or 128 bits).

If we add that to your GCC command we get gcc -Wall -T lscript.ld -m16 -nostdlib main.c hello.c -o main.o -mpreferred-stack-boundary=2:

00001000 <main>:
    1000:       66 55                   push   ebp
    1002:       66 89 e5                mov    ebp,esp
    1005:       66 e8 04 00 00 00       call   100f <hello>
    100b:       66 5d                   pop    ebp
    100d:       66 c3                   ret

0000100f <hello>:
    100f:       66 55                   push   ebp
    1011:       66 89 e5                mov    ebp,esp
    1014:       66 5d                   pop    ebp
    1016:       66 c3                   ret

Now all the extra alignment code to get it on a 16-byte boundary has disappeared. We are left with typical function frame pointer prologue and epilogue code. This is often in the form of push ebp and mov ebp,esp pop ebp. we can remove these with the -fomit-frame-pointer define in the GCC manual as:

The option -fomit-frame-pointer removes the frame pointer for all functions which might make debugging harder.

If we add that option we get gcc -Wall -T lscript.ld -m16 -nostdlib main.c hello.c -o main.o -mpreferred-stack-boundary=2 -fomit-frame-pointer:

00001000 <main>:
    1000:       66 e8 02 00 00 00       call   1008 <hello>
    1006:       66 c3                   ret

00001008 <hello>:
    1008:       66 c3                   ret

You can then optimize for size with -Os. The GCC manual says this:

-Os

Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.

This has a side effect that main will be placed into a section called .text.startup. If we display both with objdump -w -j .text -j .text.startup -D -mi386 -Maddr16,data16,intel main.o we get:

Disassembly of section .text:

00001000 <hello>:
    1000:       66 c3                   ret

Disassembly of section .text.startup:

00001002 <main>:
    1002:       e9 fb ff                jmp    1000 <hello>

If you have functions in separate objects you can alter the calling convention so the first 3 Integer class parameters are passed in registers rather than the stack. The Linux kernel uses this method as well. Information on this can be found in the GCC documentation:

regparm (number)

On the Intel 386, the regparm attribute causes the compiler to pass arguments number one to number if they are of integral type in registers EAX, EDX, and ECX instead of on the stack. Functions that take a variable number of arguments will continue to be passed all of their arguments on the stack.

I wrote a Stackoverflow answer with code that uses __attribute__((regparm(3))) that may be a useful source of further information.

Other Suggestions

I recommend you consider compiling each object individually rather than altogether. This is also advantageous since it can be more easily be done in a Makefile later on.

If we look at your command line with the extra options mentioned above you'd have:

gcc -Wall -T lscript.ld -m16 -nostdlib main.c hello.c -o main.o \
    -mpreferred-stack-boundary=2 -fomit-frame-pointer -Os

I recommend you do it this way:

gcc -c -Os -Wall -m16 -ffreestanding -nostdlib -mpreferred-stack-boundary=2 \
    -fomit-frame-pointer main.c -o main.o
gcc -c -Os -Wall -m16 -ffreestanding -nostdlib -mpreferred-stack-boundary=2 \
    -fomit-frame-pointer hello.c -o hello.o

The -c option (I added it to the beginning) forces the compiler to just generate the object file from the source and not to perform linking. You will also notice the -T lscript.ld has been removed. We have created .o files above. We can now use GCC to link all of them together:

gcc -ffreestanding -nostdlib -Wl,--build-id=none -m16 \
    -Tlscript.ld main.o hello.o -o main.elf

The -ffreestanding will force the linker to not use the C runtime, the -Wl,--build-id=none will tell the compiler not to generate some noise in the executable for build notes. In order for this to really work you'll need a slightly more complex linker script that places the .text.startup before .text. This script also adds the .data section, the .rodata and .bss sections. The DISCARD option removes exception handling data and other unneeded information.

ENTRY(main)
SECTIONS{

    .text 0x1000 : SUBALIGN(4) {
        *(.text.startup);
        *(.text);
    }
    .data : SUBALIGN(4) {
        *(.data);
        *(.rodata);
    }
    .bss : SUBALIGN(4) {
        __bss_start = .;
        *(COMMON);
        *(.bss);
    }
    . = ALIGN(4);
    __bss_end = .;

    /DISCARD/ : {
        *(.eh_frame);
        *(.comment);
        *(.note.gnu.build-id);
    }
}

If we look at a complete OBJDUMP with objdump -w -D -mi386 -Maddr16,data16,intel main.elf we would see:

Disassembly of section .text:

00001000 <main>:
    1000:       e9 01 00                jmp    1004 <hello>
    1003:       90                      nop

00001004 <hello>:
    1004:       66 c3                   ret

If you want to convert main.elf to a binary file that you can place in a disk image and read it (ie. via BIOS interrupt 0x13), you can create it this way:

objcopy -O binary main.elf main.bin

If you dump main.bin with NDISASM using ndisasm -b16 -o 0x1000 main.bin you'd see:

00001000  E90100            jmp word 0x1004
00001003  90                nop
00001004  66C3              o32 ret

Cross Compiler

I can't stress this enough but you should consider using a GCC cross compiler. The OSDev Wiki has information on building one. It also has this to say about why:

Why do I need a Cross Compiler?

You need to use a cross-compiler unless you are developing on your own operating system. The compiler must know the correct target platform (CPU, operating system), otherwise you will run into trouble. If you use the compiler that comes with your system, then the compiler won't know it is compiling something else entirely. Some tutorials suggest using your system compiler and passing a lot of problematic options to the compiler. This will certainly give you a lot of problems in the future and the solution is build a cross-compiler.

Why segmentation fault doesn't occur with smaller stack boundary?

You're not overwriting the saved eip, it's true. But you are overwriting a pointer that the function is using to find the saved eip. You can actually see this in your i f output; look at "Previous frame's sp" and notice how the two low bytes are 00 35; ASCII 0x35 is 5 and 00 is the terminating null. So although the saved eip is perfectly intact, the machine is fetching its return address from somewhere else, thus the crash.

In more detail:

GCC apparently doesn't trust the startup code to align the stack to 16 bytes, so it takes matters into its own hands (and $0xfffffff0,%esp). But it needs to keep track of the previous stack pointer value, so that it can find its parameters and the return address when needed. This is the lea 0x4(%esp),%ecx, which loads ecx with the address of the dword just above the saved eip on the stack. gdb calls this address "Previous frame's sp", I guess because it was the value of the stack pointer immediately before the caller executed its call main instruction. I will call it P for short.

After aligning the stack, the compiler pushes -0x4(%ecx) which is the argv parameter from the stack, for easy access since it's going to need it later. Then it sets up its stack frame with push %ebp; mov %esp, %ebp. We can keep track of all addresses relative to %ebp from now on, in the way compilers usually do when not optimizing.

The push %ecx a couple lines down stores the address P on the stack at offset -0x8(%ebp). The sub $0x20, %esp makes 32 more bytes of space on the stack (ending at -0x28(%ebp)), but the question is, where in that space does buffer end up being placed? We see it happen after the call to dumb_function, with lea -0x20(%ebp), %eax; push %eax; this is the first argument to strcpy being pushed, which is buffer, so indeed buffer is at -0x20(%ebp), not at -0x28 as you might have guessed. So when you write 24 (=0x18) bytes there, you overwrite two bytes at -0x8(%ebp) which is our stored P pointer.

It's all downhill from here. The corrupted value of P (call it Px) is popped into ecx, and just before the return, we do lea -0x4(%ecx), %esp. Now %esp is garbage and points somewhere bad, so the following ret is sure to lead to trouble. Maybe Px points to unmapped memory and just attempting to fetch the return address from there causes the fault. Maybe it points to readable memory, but the address fetched from that location does not point to executable memory, so the control transfer faults. Maybe the latter does point to executable memory, but the instructions located there are not the ones we want to be executing.

If you take out the call to dumb_function(), the stack layout changes slightly. It's no longer necessary to push ebx around the call to dumb_function(), so the P pointer from ecx now winds up at -4(%ebp), there are 4 bytes of unused space (to maintain alignment), and then buffer is at -0x20(%ebp). So your two-byte overrun goes into space that's not used at all, hence no crash.

And here is the generated assembly with -mpreferred-stack-boundary=2. Now there is no need to re-align the stack, because the compiler does trust the startup code to align the stack to at least 4 bytes (it would be unthinkable for this not to be the case). The stack layout is simpler: push ebp, and subtract 24 more bytes for buffer. Thus your overrun overwrites two bytes of the saved ebp. This is eventually popped from the stack back into ebp, and so main returns to its caller with a value in ebp that is
not the same as on entry. That's naughty, but it so happens that the system startup code doesn't use the value in ebp for anything (indeed in my tests it is set to 0 on entry to main, likely to mark the top of the stack for backtraces), and so nothing bad happens afterwards.

I'm confused about GNU Assembler starting and ending section in a function declaration

This is a function perilogue. The Frequently Given Answer giving the gen on function perilogues explains what's happening, stack frames, and the frame pointer register.

Help with understanding a very basic main() disassembly in GDB

Stack frames

The code at the beginning of the function body:

push  %ebp
mov   %esp, %ebp

is to create the so-called stack frame, which is a "solid ground" for referencing parameters and objects local to the procedure. The %ebp register is used (as its name indicates) as a base pointer, which points to the base (or bottom) of the local stack inside the procedure.

After entering the procedure, the stack pointer register (%esp) points to the return address stored on the stack by the call instruction (it is the address of the instruction just after the call). If you'd just invoke ret now, this address would be popped from the stack into the %eip (instruction pointer) and the code would execute further from that address (of the next instruction after the call). But we don't return yet, do we? ;-)

You then push %ebp register to save its previous value somewhere and not lose it, because you'll use it for something shortly. (BTW, it usually contains the base pointer of the caller function, and when you peek that value, you'll find a previously stored %ebp, which would be again a base pointer of the function one level higher, so you can trace the call stack that way.) When you save the %ebp, you can then store the current %esp (stack pointer) there, so that %ebp will point to the same address: the base of the current local stack. The %esp will move back and forth inside the procedure when you'll be pushing and popping values on the stack or reserving & freeing local variables. But %ebp will stay fixed, still pointing to the base of the local stack frame.

Accessing parameters

Parameters passed to the procedure by the caller are "burried just uner the ground" (that is, they have positive offsets relative to the base, because stack grows down). You have in %ebp the address of the base of the local stack, where lies the previous value of the %ebp. Below it (that is, at 4(%ebp) lies the return address. So the first parameter will be at 8(%ebp), the second at 12(%ebp) and so on.

Local variables

And local variables could be allocated on the stack above the base (that is, they'd have negative offsets relative to the base). Just subtract N to the %esp and you've just allocated N bytes on the stack for local variables, by moving the top of the stack above (or, precisely, below) this region :-) You can refer to this area by negative offsets relative to %ebp, i.e. -4(%ebp) is the first word, -8(%ebp) is second etc. Remember that (%ebp) points to the base of the local stack, where the previous %ebp value has been saved. So remember to restore the stack to the previous position before you try to restore the %ebp through pop %ebp at the end of the procedure. You can do it two ways:

1. You can free only the local variables by adding back the N to the %esp (stack pointer), that is, moving the top of the stack as if these local variables had never been there. (Well, their values will stay on the stack, but they'll be considered "freed" and could be overwritten by subsequent pushes, so it's no longer safe to refer them. They're dead bodies ;-J )

2. You can flush the stack down to the ground and free all local space by simply restoring the %esp from the %ebp which has been fixed earlier to the base of the stack. It'll restore the stack pointer to the state it has just after entering the procedure and saving the %esp into %ebp. It's like loading the previously saved game when you've messed something ;-)

Turning off frame pointers

It's possible to have a less messy assembly from gcc -S by adding a switch -fomit-frame-pointer. It tells GCC to not assemble any code for setting/resetting the stack frame until it's really needed for something. Just remember that it can confuse debuggers, because they usually depend on the stack frame being there to be able to track up the call stack. But it won't break anything if you don't need to debug this binary. It's perfectly fine for release targets and it saves some spacetime.

Call Frame Information

Sometimes you can meet some strange assembler directives starting from .cfi interleaved with the function header. This is a so-called Call Frame Information. It's used by debuggers to track the function calls. But it's also used for exception handling in high-level languages, which needs stack unwinding and other call-stack-based manipulations. You can turn it off too in your assembly, by adding a switch -fno-dwarf2-cfi-asm. This tells the GCC to use plain old labels instead of those strange .cfi directives, and it adds a special data structures at the end of your assembly, refering to those labels. This doesn't turn off the CFI, just changes the format to more "transparent" one: the CFI tables are then visible to the programmer.

Trying to Understand Gcc's Complicated Stack-Alignment at the Top of Main That Copies the Return Address