Trying to understand gcc's complicated stack-alignment at the top of main that copies the return address
I've had a go at it:
;# As you have already noticed, the compiler wants to align the stack
;# pointer on a 16 byte boundary before it pushes anything. That's
;# because certain instructions' memory access needs to be aligned
;# that way.
;# So in order to first save the original offset of esp (+4), it
;# executes the first instruction:
lea ecx,[esp+0x4]
;# Now alignment can happen. Without the previous insn the next one
;# would have made the original esp unrecoverable:
and esp,0xfffffff0
;# Next it pushes the return addresss and creates a stack frame. I
;# assume it now wants to make the stack look like a normal
;# subroutine call:
push DWORD PTR [ecx-0x4]
push ebp
mov ebp,esp
;# Remember that ecx is still the only value that can restore the
;# original esp. Since ecx may be garbled by any subroutine calls,
;# it has to save it somewhere:
push ecx
Why is GCC pushing an extra return address on the stack?
Update: gcc8 simplifies this at least for normal use-cases (-fomit-frame-pointer
, and no alloca
or C99 VLAs that require variable-size allocation). Perhaps motivated by increasing usage of AVX leading to more functions wanting a 32-byte aligned local or array.
Also, probably a duplicate of What's up with gcc weird stack manipulation when it wants extra stack alignment?
This complicated prologue is fine if it only ever runs a couple times (e.g. at the start of main
in 32-bit code), but the more it appears the more worthwhile it is to optimize it. GCC sometimes still over-aligns the stack in functions where all >16-byte aligned objects are optimized into registers, which is a missed optimization already but less bad when the stack alignment is cheaper.
gcc makes some clunky code when aligning the stack within a function, even with optimization enabled. I have a possible theory (see below) on why gcc might be copying the return address to just above where it saves ebp
to make a stack frame (and yes, I agree that's what gcc is doing). It doesn't look necessary in this function, and clang doesn't do anything like that.
Besides that, the nonsense with ecx
is probably just gcc not optimizing away unneeded parts of its align-the-stack boilerplate. (The pre-alignment value of esp
is needed to reference args on the stack, so it makes sense that it puts the address of the first would-be arg into a register).
You see the same thing with optimization in 32-bit code (where gcc makes a main
that doesn't assume 16B stack alignment, even though the current version of the ABI requires that at process startup, and the CRT code that calls main
either aligns the stack itself or preserves the initial alignment provided by the kernel, I forget). You also see this in functions that align the stack to more than 16B (e.g. functions that use __m256
types, sometimes even if they never spill them to the stack. Or functions with an array declared with C++11 alignas(32)
, or any other way of requesting alignment.) In 64-bit code, gcc always seems to use r10
for this, not rcx
.
There's nothing required for ABI compliance about the way gcc does it, because clang does something much simpler.
I added an aligned variable (with volatile
as a simple way to force the compiler to actually reserve aligned space for it on the stack, instead of optimizing it away). I put your code on the Godbolt compiler explorer, to look at the asm with -O3
. I see the same behaviour from gcc 4.9, 5.3, and 6.1, but different behaviour with clang.
int main(){
__attribute__((aligned(32))) volatile int v = 1;
return 0;
}
Clang3.8's -O3 -m32
output is functionally identical to its -m64
output. Note that -O3
enables -fomit-frame-pointer
, but some functions make stack frames anyway.
push ebp
mov ebp, esp # make a stack frame *before* aligning, so ebp-relative addressing can only access stack args, not aligned locals.
and esp, -32
sub esp, 32 # esp is 32B aligned with 32 or 48B above esp reserved (depending on incoming alignment)
mov dword ptr [esp], 1 # store v
xor eax, eax # return 0
mov esp, ebp # leave
pop ebp
ret
gcc's output is nearly the same between -m32
and -m64
, but it puts v
in the red-zone with -m64
so the -m32
output has two extra instructions:
# gcc 6.1 -m32 -O3 -fverbose-asm. Most of gcc's comment lines are empty. I guess that means it has no idea why it's emitting those insns :P
lea ecx, [esp+4] #, get a pointer to where the first arg would be
and esp, -32 #, align
xor eax, eax # return 0
push DWORD PTR [ecx-4] # No clue WTF this is for; this looks batshit insane, but happens even in 64bit mode.
push ebp # make a stackframe, even though -fomit-frame-pointer is on by default and we can already restore the original esp from ecx (unlike clang)
mov ebp, esp #,
push ecx # save the old esp value (even though this function doesn't clobber ecx...)
sub esp, 52 #, reserve space for v (not present with -m64)
mov DWORD PTR [ebp-56], 1 # v,
add esp, 52 #, unreserve (not present with -m64)
pop ecx # restore ecx (even though nothing clobbered it)
pop ebp # at least it knows it can just pop instead of `leave`
lea esp, [ecx-4] #, restore pre-alignment esp
ret
It seems that gcc wants to make its stack frame (with push ebp
) after aligning the stack. I guess that makes sense, so it can reference locals relative to ebp
. Otherwise it would have to use esp
-relative addressing, if it wanted aligned locals.
My theory on why gcc does this:
The extra copy of the return address after aligning but before pushing ebp
means that the return address is copied to the expected place relative to the saved ebp
value (and the value that will be in ebp
when child functions are called). So this does potentially help code that wants to unwind the stack by following the linked list of stack frames, and looking at return-addresses to find out what function is involved.
I'm not sure whether this matters with modern stack-unwind info that allows stack-unwinding (backtraces / exception handling) with -fomit-frame-pointer
. (It's metadata in the .eh_frame
section. This is what the .cfi_*
directives around every modification to esp
are for.) I should look at what clang does when it has to align the stack in a non-leaf function.
The original value of esp
would be needed inside the function to reference function args on the stack. I think gcc doesn't know how to optimize away unneeded parts of its align-the-stack method. (e.g. out main
doesn't look at its args (and is declared not to take any))
This kind of code-gen is typical of what you see in a function that needs to align the stack; it's not extra weird because of using a volatile
with automatic storage.
What does it mean to align the stack?
Assume the stack looks like this on entry to _main
(the address of the stack pointer is just an example):
| existing |
| stack content |
+-----------------+ <--- 0xbfff1230
Push %ebp
, and subtract 8 from %esp
to reserve some space for local variables:
| existing |
| stack content |
+-----------------+ <--- 0xbfff1230
| %ebp |
+-----------------+ <--- 0xbfff122c
: reserved :
: space :
+-----------------+ <--- 0xbfff1224
Now, the andl
instruction zeroes the low 4 bits of %esp
, which may decrease it; in this particular example, it has the effect of reserving an additional 4 bytes:
| existing |
| stack content |
+-----------------+ <--- 0xbfff1230
| %ebp |
+-----------------+ <--- 0xbfff122c
: reserved :
: space :
+ - - - - - - - - + <--- 0xbfff1224
: extra space :
+-----------------+ <--- 0xbfff1220
The point of this is that there are some "SIMD" (Single Instruction, Multiple Data) instructions (also known in x86-land as "SSE" for "Streaming SIMD Extensions") which can perform parallel operations on multiple words in memory, but require those multiple words to be a block starting at an address which is a multiple of 16 bytes.
In general, the compiler can't assume that particular offsets from %esp
will result in a suitable address (because the state of %esp
on entry to the function depends on the calling code). But, by deliberately aligning the stack pointer in this way, the compiler knows that adding any multiple of 16 bytes to the stack pointer will result in a 16-byte aligned address, which is safe for use with these SIMD instructions.
gcc subtracting from esp before call
As I mentioned in my comments:
The first few lines (plus the push ecx) are to ensure the stack is aligned on a 16-byte boundary which is required by the Linux System V i386 ABI. The
pop ecx
andlea
before theret
in main is to undo that alignment work.
@RossRidge has provided a link to another Stackoverflow answer that details this quite well.
In this case you seem to be doing real mode development. GCC isn't well suited for this but it can work and I will assume you know what you are doing. I mention some of the pitfalls of using -m16
in this Stackoverflow answer. I put this warning in that answer regarding real mode development with GCC:
There are so many pitfalls in doing this that I recommend against it.
If you remain undeterred and wish to continue forward you can do a few things to minimize the code. The 16-byte alignment of the stack at the point a function call is made is part of the more recent Linux System V i386 ABIs
. Since you are generating code for a non-Linux environment you can change the stack alignment to 4 using compiler option -mpreferred-stack-boundary=2
. The GCC manual says:
-mpreferred-stack-boundary=num
Attempt to keep the stack boundary aligned to a 2 raised to num byte boundary. If -mpreferred-stack-boundary is not specified, the default is 4 (16 bytes or 128 bits).
If we add that to your GCC command we get gcc -Wall -T lscript.ld -m16 -nostdlib main.c hello.c -o main.o -mpreferred-stack-boundary=2
:
00001000 <main>:
1000: 66 55 push ebp
1002: 66 89 e5 mov ebp,esp
1005: 66 e8 04 00 00 00 call 100f <hello>
100b: 66 5d pop ebp
100d: 66 c3 ret
0000100f <hello>:
100f: 66 55 push ebp
1011: 66 89 e5 mov ebp,esp
1014: 66 5d pop ebp
1016: 66 c3 ret
Now all the extra alignment code to get it on a 16-byte boundary has disappeared. We are left with typical function frame pointer prologue and epilogue code. This is often in the form of push ebp
and mov ebp,esp
pop ebp
. we can remove these with the -fomit-frame-pointer
define in the GCC manual as:
The option -fomit-frame-pointer removes the frame pointer for all functions which might make debugging harder.
If we add that option we get gcc -Wall -T lscript.ld -m16 -nostdlib main.c hello.c -o main.o -mpreferred-stack-boundary=2 -fomit-frame-pointer
:
00001000 <main>:
1000: 66 e8 02 00 00 00 call 1008 <hello>
1006: 66 c3 ret
00001008 <hello>:
1008: 66 c3 ret
You can then optimize for size with -Os
. The GCC manual says this:
-Os
Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
This has a side effect that main
will be placed into a section called .text.startup
. If we display both with objdump -w -j .text -j .text.startup -D -mi386 -Maddr16,data16,intel main.o
we get:
Disassembly of section .text:
00001000 <hello>:
1000: 66 c3 ret
Disassembly of section .text.startup:
00001002 <main>:
1002: e9 fb ff jmp 1000 <hello>
If you have functions in separate objects you can alter the calling convention so the first 3 Integer class parameters are passed in registers rather than the stack. The Linux kernel uses this method as well. Information on this can be found in the GCC documentation:
regparm (number)
On the Intel 386, the regparm attribute causes the compiler to pass arguments number one to number if they are of integral type in registers EAX, EDX, and ECX instead of on the stack. Functions that take a variable number of arguments will continue to be passed all of their arguments on the stack.
I wrote a Stackoverflow answer with code that uses __attribute__((regparm(3))) that may be a useful source of further information.
Other Suggestions
I recommend you consider compiling each object individually rather than altogether. This is also advantageous since it can be more easily be done in a Makefile
later on.
If we look at your command line with the extra options mentioned above you'd have:
gcc -Wall -T lscript.ld -m16 -nostdlib main.c hello.c -o main.o \
-mpreferred-stack-boundary=2 -fomit-frame-pointer -Os
I recommend you do it this way:
gcc -c -Os -Wall -m16 -ffreestanding -nostdlib -mpreferred-stack-boundary=2 \
-fomit-frame-pointer main.c -o main.o
gcc -c -Os -Wall -m16 -ffreestanding -nostdlib -mpreferred-stack-boundary=2 \
-fomit-frame-pointer hello.c -o hello.o
The -c
option (I added it to the beginning) forces the compiler to just generate the object file from the source and not to perform linking. You will also notice the -T lscript.ld
has been removed. We have created .o
files above. We can now use GCC to link all of them together:
gcc -ffreestanding -nostdlib -Wl,--build-id=none -m16 \
-Tlscript.ld main.o hello.o -o main.elf
The -ffreestanding
will force the linker to not use the C runtime, the -Wl,--build-id=none
will tell the compiler not to generate some noise in the executable for build notes. In order for this to really work you'll need a slightly more complex linker script that places the .text.startup
before .text
. This script also adds the .data
section, the .rodata
and .bss
sections. The DISCARD option removes exception handling data and other unneeded information.
ENTRY(main)
SECTIONS{
.text 0x1000 : SUBALIGN(4) {
*(.text.startup);
*(.text);
}
.data : SUBALIGN(4) {
*(.data);
*(.rodata);
}
.bss : SUBALIGN(4) {
__bss_start = .;
*(COMMON);
*(.bss);
}
. = ALIGN(4);
__bss_end = .;
/DISCARD/ : {
*(.eh_frame);
*(.comment);
*(.note.gnu.build-id);
}
}
If we look at a complete OBJDUMP with objdump -w -D -mi386 -Maddr16,data16,intel main.elf
we would see:
Disassembly of section .text:
00001000 <main>:
1000: e9 01 00 jmp 1004 <hello>
1003: 90 nop
00001004 <hello>:
1004: 66 c3 ret
If you want to convert main.elf
to a binary file that you can place in a disk image and read it (ie. via BIOS interrupt 0x13), you can create it this way:
objcopy -O binary main.elf main.bin
If you dump main.bin
with NDISASM using ndisasm -b16 -o 0x1000 main.bin
you'd see:
00001000 E90100 jmp word 0x1004
00001003 90 nop
00001004 66C3 o32 ret
Cross Compiler
I can't stress this enough but you should consider using a GCC cross compiler. The OSDev Wiki has information on building one. It also has this to say about why:
Why do I need a Cross Compiler?
You need to use a cross-compiler unless you are developing on your own operating system. The compiler must know the correct target platform (CPU, operating system), otherwise you will run into trouble. If you use the compiler that comes with your system, then the compiler won't know it is compiling something else entirely. Some tutorials suggest using your system compiler and passing a lot of problematic options to the compiler. This will certainly give you a lot of problems in the future and the solution is build a cross-compiler.
Why segmentation fault doesn't occur with smaller stack boundary?
You're not overwriting the saved eip, it's true. But you are overwriting a pointer that the function is using to find the saved eip. You can actually see this in your i f
output; look at "Previous frame's sp" and notice how the two low bytes are 00 35
; ASCII 0x35 is 5
and 00
is the terminating null. So although the saved eip is perfectly intact, the machine is fetching its return address from somewhere else, thus the crash.
In more detail:
GCC apparently doesn't trust the startup code to align the stack to 16 bytes, so it takes matters into its own hands (and $0xfffffff0,%esp
). But it needs to keep track of the previous stack pointer value, so that it can find its parameters and the return address when needed. This is the lea 0x4(%esp),%ecx
, which loads ecx with the address of the dword just above the saved eip on the stack. gdb calls this address "Previous frame's sp", I guess because it was the value of the stack pointer immediately before the caller executed its call main
instruction. I will call it P for short.
After aligning the stack, the compiler pushes -0x4(%ecx)
which is the argv
parameter from the stack, for easy access since it's going to need it later. Then it sets up its stack frame with push %ebp; mov %esp, %ebp
. We can keep track of all addresses relative to %ebp
from now on, in the way compilers usually do when not optimizing.
The push %ecx
a couple lines down stores the address P on the stack at offset -0x8(%ebp)
. The sub $0x20, %esp
makes 32 more bytes of space on the stack (ending at -0x28(%ebp)
), but the question is, where in that space does buffer
end up being placed? We see it happen after the call to dumb_function
, with lea -0x20(%ebp), %eax; push %eax
; this is the first argument to strcpy
being pushed, which is buffer
, so indeed buffer
is at -0x20(%ebp)
, not at -0x28
as you might have guessed. So when you write 24 (=0x18
) bytes there, you overwrite two bytes at -0x8(%ebp)
which is our stored P pointer.
It's all downhill from here. The corrupted value of P (call it Px) is popped into ecx, and just before the return, we do lea -0x4(%ecx), %esp
. Now %esp
is garbage and points somewhere bad, so the following ret
is sure to lead to trouble. Maybe Px
points to unmapped memory and just attempting to fetch the return address from there causes the fault. Maybe it points to readable memory, but the address fetched from that location does not point to executable memory, so the control transfer faults. Maybe the latter does point to executable memory, but the instructions located there are not the ones we want to be executing.
If you take out the call to dumb_function()
, the stack layout changes slightly. It's no longer necessary to push ebx around the call to dumb_function()
, so the P pointer from ecx now winds up at -4(%ebp)
, there are 4 bytes of unused space (to maintain alignment), and then buffer
is at -0x20(%ebp)
. So your two-byte overrun goes into space that's not used at all, hence no crash.
And here is the generated assembly with -mpreferred-stack-boundary=2
. Now there is no need to re-align the stack, because the compiler does trust the startup code to align the stack to at least 4 bytes (it would be unthinkable for this not to be the case). The stack layout is simpler: push ebp, and subtract 24 more bytes for buffer
. Thus your overrun overwrites two bytes of the saved ebp. This is eventually popped from the stack back into ebp, and so main
returns to its caller with a value in ebp that is
not the same as on entry. That's naughty, but it so happens that the system startup code doesn't use the value in ebp for anything (indeed in my tests it is set to 0 on entry to main, likely to mark the top of the stack for backtraces), and so nothing bad happens afterwards.
I'm confused about GNU Assembler starting and ending section in a function declaration
This is a function perilogue. The Frequently Given Answer giving the gen on function perilogues explains what's happening, stack frames, and the frame pointer register.
Help with understanding a very basic main() disassembly in GDB
Stack frames
The code at the beginning of the function body:
push %ebp
mov %esp, %ebp
is to create the so-called stack frame, which is a "solid ground" for referencing parameters and objects local to the procedure. The %ebp
register is used (as its name indicates) as a base pointer, which points to the base (or bottom) of the local stack inside the procedure.
After entering the procedure, the stack pointer register (%esp
) points to the return address stored on the stack by the call instruction (it is the address of the instruction just after the call). If you'd just invoke ret
now, this address would be popped from the stack into the %eip
(instruction pointer) and the code would execute further from that address (of the next instruction after the call
). But we don't return yet, do we? ;-)
You then push %ebp
register to save its previous value somewhere and not lose it, because you'll use it for something shortly. (BTW, it usually contains the base pointer of the caller function, and when you peek that value, you'll find a previously stored %ebp
, which would be again a base pointer of the function one level higher, so you can trace the call stack that way.) When you save the %ebp
, you can then store the current %esp
(stack pointer) there, so that %ebp
will point to the same address: the base of the current local stack. The %esp
will move back and forth inside the procedure when you'll be pushing and popping values on the stack or reserving & freeing local variables. But %ebp
will stay fixed, still pointing to the base of the local stack frame.
Accessing parameters
Parameters passed to the procedure by the caller are "burried just uner the ground" (that is, they have positive offsets relative to the base, because stack grows down). You have in %ebp
the address of the base of the local stack, where lies the previous value of the %ebp
. Below it (that is, at 4(%ebp)
lies the return address. So the first parameter will be at 8(%ebp)
, the second at 12(%ebp)
and so on.
Local variables
And local variables could be allocated on the stack above the base (that is, they'd have negative offsets relative to the base). Just subtract N to the %esp
and you've just allocated N
bytes on the stack for local variables, by moving the top of the stack above (or, precisely, below) this region :-) You can refer to this area by negative offsets relative to %ebp
, i.e. -4(%ebp)
is the first word, -8(%ebp)
is second etc. Remember that (%ebp)
points to the base of the local stack, where the previous %ebp
value has been saved. So remember to restore the stack to the previous position before you try to restore the %ebp
through pop %ebp
at the end of the procedure. You can do it two ways:
1. You can free only the local variables by adding back the N
to the %esp
(stack pointer), that is, moving the top of the stack as if these local variables had never been there. (Well, their values will stay on the stack, but they'll be considered "freed" and could be overwritten by subsequent pushes, so it's no longer safe to refer them. They're dead bodies ;-J )
2. You can flush the stack down to the ground and free all local space by simply restoring the %esp
from the %ebp
which has been fixed earlier to the base of the stack. It'll restore the stack pointer to the state it has just after entering the procedure and saving the %esp
into %ebp
. It's like loading the previously saved game when you've messed something ;-)
Turning off frame pointers
It's possible to have a less messy assembly from gcc -S
by adding a switch -fomit-frame-pointer
. It tells GCC to not assemble any code for setting/resetting the stack frame until it's really needed for something. Just remember that it can confuse debuggers, because they usually depend on the stack frame being there to be able to track up the call stack. But it won't break anything if you don't need to debug this binary. It's perfectly fine for release targets and it saves some spacetime.
Call Frame Information
Sometimes you can meet some strange assembler directives starting from .cfi
interleaved with the function header. This is a so-called Call Frame Information. It's used by debuggers to track the function calls. But it's also used for exception handling in high-level languages, which needs stack unwinding and other call-stack-based manipulations. You can turn it off too in your assembly, by adding a switch -fno-dwarf2-cfi-asm
. This tells the GCC to use plain old labels instead of those strange .cfi
directives, and it adds a special data structures at the end of your assembly, refering to those labels. This doesn't turn off the CFI, just changes the format to more "transparent" one: the CFI tables are then visible to the programmer.
Related Topics
How to Avoid Transparent_Hugepage/Defrag Warning from Mongodb
"Max Open Files" for Working Process
Linux Time Command Microseconds or Better Accuracy
Should I Use Libc++ or Libstdc++
Grep a Large List Against a Large File
What Is the Purpose of the "-I" and "-T" Options for the "Docker Exec" Command
Why Do Shells Ignore Sigint and Sigquit in Backgrounded Processes
How to Get 'Find' to Ignore .Svn Directories
Difference Between Posix Aio and Libaio on Linux
Pack Shared Libraries into the Elf
What's the Point of Eval/Bash -C as Opposed to Just Evaluating a Variable
Multiple Websites on Nginx & Sites-Available
Why Do We Need a Bootloader in an Embedded Device
How to Allocate, in User Space, a Non Cacheable Block of Memory on Linux