GCC -mpreferred-stack-boundary option

I want to know what the -mpreferred-stack-boundary option is used for during compilation in the GNU debugger.

The option has absolutely nothing to do with the debugger.

It affects the generated code in your binary. By default, GCC arranges things so that the stack pointer is aligned on a 16-byte boundary at every call, so every function can rely on that alignment on entry (offset only by the return address the call just pushed). This matters if you have local variables that need 16-byte alignment and enable SSE2 instructions.

If you change the default to, e.g., -mpreferred-stack-boundary=2, GCC will align the stack pointer on a 4-byte boundary. This reduces the stack requirements of your routines, but the program can crash if your code (or code you call) uses SSE2 instructions that expect 16-byte-aligned stack data, so it is generally not safe.
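As a rough sketch of where the savings come from, the helpers below round a frame's local storage up to the boundary. The function names are hypothetical, and the model is deliberately simplified: it ignores the return address, saved registers and outgoing arguments that a real frame also contains. Here num is the value given to -mpreferred-stack-boundary, and the boundary is 2 raised to num bytes.

```c
#include <stddef.h>

/* Boundary in bytes for -mpreferred-stack-boundary=num: 2 raised to num. */
static size_t boundary_bytes(unsigned num)
{
    return (size_t)1 << num;
}

/* Round a frame's local storage up to the boundary. Simplified model:
 * ignores the return address, saved registers and outgoing arguments. */
static size_t padded_frame_size(size_t locals, unsigned num)
{
    size_t b = boundary_bytes(num);
    return (locals + b - 1) & ~(b - 1);
}
```

Under this model, 20 bytes of locals become a 32-byte frame with the default num=4 but stay at 20 bytes with num=2, so deep call chains can use noticeably less stack.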

Understanding stack alignment enforcement

Looking at -O0-generated machine code is usually a futile exercise. The compiler will emit whatever works, in the simplest possible way. This often leads to bizarre artifacts.

Stack alignment only refers to alignment of the stack frame. It is not directly related to the alignment of objects on the stack. GCC will allocate on-stack objects with the required alignment. This is simpler if GCC knows that the stack frame already provides sufficient alignment, but if not, GCC will use a frame pointer and perform explicit alignment.
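A minimal sketch of that explicit alignment, assuming a C11 compiler: the local below asks for 64-byte alignment, more than the default stack boundary guarantees, so GCC cannot rely on the incoming frame alignment and has to realign the frame itself. The helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if p is a multiple of align (align must be a power of two). */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* The local asks for more alignment than the stack boundary provides,
 * so GCC realigns this frame explicitly (typically by setting up a frame
 * pointer and ANDing the stack pointer with -64). */
static int local_is_64_aligned(void)
{
    _Alignas(64) unsigned char block[64];
    memset(block, 0xAB, sizeof block);
    return is_aligned(block, 64) && block[63] == 0xAB;
}
```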

Why segmentation fault doesn't occur with smaller stack boundary?

You're not overwriting the saved eip, it's true. But you are overwriting a pointer that the function uses to find the saved eip. You can actually see this in your i f (info frame) output: look at "Previous frame's sp" and notice that its two low bytes are 00 35. ASCII 0x35 is the character '5', and 00 is the terminating null. So although the saved eip is perfectly intact, the machine fetches its return address from somewhere else, hence the crash.


In more detail:

GCC apparently doesn't trust the startup code to align the stack to 16 bytes, so it takes matters into its own hands (and $0xfffffff0,%esp). But it needs to keep track of the previous stack pointer value, so that it can find its parameters and the return address when needed. This is the lea 0x4(%esp),%ecx, which loads ecx with the address of the dword just above the saved eip on the stack. gdb calls this address "Previous frame's sp", I guess because it was the value of the stack pointer immediately before the caller executed its call main instruction. I will call it P for short.
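The address bookkeeping can be modelled with plain 32-bit integers. This is a sketch: the helper names are hypothetical, and the entry value of %esp used with them (0xbffff3cc) is an arbitrary example, since real values vary from run to run.

```c
#include <stdint.h>

/* lea 0x4(%esp),%ecx: P is the address of the dword just above the saved eip. */
static uint32_t prev_frame_sp(uint32_t esp_at_entry)
{
    return esp_at_entry + 4;
}

/* and $0xfffffff0,%esp: round the stack pointer down to a multiple of 16. */
static uint32_t align_esp_16(uint32_t esp)
{
    return esp & 0xfffffff0u;
}
```

Rounding down is safe because the stack grows toward lower addresses: when %esp is already 4-byte aligned, at most 12 bytes are skipped, and they are simply left unused.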

After aligning the stack, the compiler pushes -0x4(%ecx), which is a copy of the return address (the saved eip), so that the realigned frame has the usual layout of return address followed by saved %ebp. Then it sets up its stack frame with push %ebp; mov %esp, %ebp. From now on we can track all addresses relative to %ebp, the way compilers usually do when not optimizing.

The push %ecx a couple of lines down stores the address P on the stack at offset -0x8(%ebp). The sub $0x20, %esp makes 32 more bytes of space on the stack (ending at -0x28(%ebp)), but the question is: where in that space does buffer end up? We see it after the call to dumb_function, with lea -0x20(%ebp), %eax; push %eax. This pushes the first argument to strcpy, which is buffer, so buffer is indeed at -0x20(%ebp), not at -0x28 as you might have guessed. Since -0x20 + 0x18 = -0x8, anything written past the first 24 (0x18) bytes of buffer lands on the stored P pointer at -0x8(%ebp), and your overrun writes two bytes there (the final '5' and the terminating null seen in the i f output).
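The overwrite can be reproduced safely with an ordinary byte array standing in for the stack region from -0x20(%ebp) up to %ebp. This is a sketch under stated assumptions: the helper names, the P value 0xbffff3d0 and the 25-character input are hypothetical (the input is chosen to end in '5' to match the i f output), and the copy mirrors what strcpy does, string plus terminating null, without overflowing anything real.

```c
#include <stdint.h>
#include <string.h>

/* Byte model of the stack from -0x20(%ebp) up to %ebp: index 0 is
 * -0x20(%ebp), where buffer starts, and P is stored at -0x8(%ebp). */
enum { REGION_SIZE = 0x20, P_INDEX = 0x20 - 0x8 };

/* x86 is little-endian: the least significant byte is at the lowest address. */
static void store_le32(unsigned char *dst, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        dst[i] = (unsigned char)(v >> (8 * i));
}

static uint32_t load_le32(const unsigned char *src)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v |= (uint32_t)src[i] << (8 * i);
    return v;
}

/* Store P, copy input plus its terminating null into "buffer" as strcpy
 * would, and return what P holds afterwards. Bytes that would fall past
 * the modelled region are dropped instead of overflowing the array. */
static uint32_t p_after_copy(uint32_t p, const char *input)
{
    unsigned char region[REGION_SIZE] = {0};
    size_t n = strlen(input) + 1;
    if (n > REGION_SIZE)
        n = REGION_SIZE;
    store_le32(region + P_INDEX, p);
    memcpy(region, input, n);
    return load_le32(region + P_INDEX);
}
```

With the 25-character input "1234567890123456789012345", the two low bytes of P become 0x35 and 0x00 while the high bytes survive, giving 0xbfff0035 in this model, which matches the "00 35" pattern in the i f output.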

It's all downhill from here. The corrupted value of P (call it Px) is popped into ecx, and just before the return, we do lea -0x4(%ecx), %esp. Now %esp is garbage and points somewhere bad, so the following ret is sure to lead to trouble. Maybe Px points to unmapped memory and just attempting to fetch the return address from there causes the fault. Maybe it points to readable memory, but the address fetched from that location does not point to executable memory, so the control transfer faults. Maybe the latter does point to executable memory, but the instructions located there are not the ones we want to be executing.
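The epilogue arithmetic shows why a corrupted P is fatal even though the saved eip itself is intact. A sketch with hypothetical helper names and example addresses:

```c
#include <stdint.h>

/* Prologue: lea 0x4(%esp),%ecx, where %esp points at the saved eip on entry. */
static uint32_t p_from_entry_esp(uint32_t esp_at_entry)
{
    return esp_at_entry + 4;
}

/* Epilogue: lea -0x4(%ecx),%esp. The following ret pops its return
 * address from this location, so it is only right if P is intact. */
static uint32_t esp_before_ret(uint32_t p)
{
    return p - 4;
}
```

With an intact P the round trip gives back exactly the entry %esp, so ret pops the real saved eip. With a corrupted value such as 0xbfff0035, ret would read its return address from 0xbfff0031, an address that is not even 4-byte aligned and nowhere near the real saved eip.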


If you take out the call to dumb_function(), the stack layout changes slightly. It's no longer necessary to push ebx around the call to dumb_function(), so the P pointer from ecx now winds up at -4(%ebp), there are 4 bytes of unused space (to maintain alignment), and then buffer is at -0x20(%ebp). So your two-byte overrun goes into space that's not used at all, hence no crash.
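Whether an overrun hurts comes down to interval overlap between the bytes written and the saved slot. A sketch with a hypothetical helper, assuming 26 bytes are written starting at buffer (a 25-character string plus its null, as inferred from the 00 35 bytes above) and a 4-byte slot for the saved value:

```c
/* Number of bytes shared by the ranges [a, a + alen) and [b, b + blen).
 * Offsets are relative to %ebp and may be negative. */
static long overlap_bytes(long a, long alen, long b, long blen)
{
    long lo = a > b ? a : b;
    long a_end = a + alen, b_end = b + blen;
    long hi = a_end < b_end ? a_end : b_end;
    return hi > lo ? hi - lo : 0;
}
```

With P at -0x8(%ebp), overlap_bytes(-0x20, 26, -0x8, 4) gives 2; with P at -4(%ebp) it gives 0. Assuming buffer starts at -0x18(%ebp) in the -mpreferred-stack-boundary=2 build, overlap_bytes(-0x18, 26, 0, 4) also gives 2, the two clobbered bytes of the saved ebp.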

And here is the generated assembly with -mpreferred-stack-boundary=2. Now there is no need to re-align the stack, because the compiler does trust the startup code to align the stack to at least 4 bytes (it would be unthinkable for this not to be the case). The stack layout is simpler: push ebp, and subtract 24 more bytes for buffer. Thus your overrun overwrites two bytes of the saved ebp. This is eventually popped from the stack back into ebp, and so main returns to its caller with a value in ebp that is not the same as on entry. That's naughty, but it so happens that the system startup code doesn't use the value in ebp for anything (indeed in my tests it is set to 0 on entry to main, likely to mark the top of the stack for backtraces), and so nothing bad happens afterwards.

What's the purpose of stack pointer alignment in the prologue of main()

The System V AMD64 ABI (x86-64 ABI) requires 16-byte stack alignment. A double requires 8-byte alignment, and aligned SSE loads and stores (such as movaps) require 16-byte-aligned operands.
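As an illustration, GCC's vector_size extension (a GCC/clang feature, not something the ABI text above mentions) gives a type with the shape of an SSE register. A sketch with hypothetical function names: GCC must place such locals on 16-byte boundaries, realigning the frame if it cannot rely on the incoming stack alignment.

```c
#include <stdint.h>

/* Four floats in one 16-byte value, the shape of an SSE register. */
typedef float v4sf __attribute__((vector_size(16)));

/* Nonzero if a local v4sf sits on a 16-byte boundary (it always should). */
static int vec_local_aligned(void)
{
    v4sf v = {0.0f, 0.0f, 0.0f, 0.0f};
    return ((uintptr_t)&v % 16) == 0;
}

/* Element-wise arithmetic on vector locals. GCC may keep these on the
 * stack and access them with aligned SSE moves (movaps), which fault
 * if the address is not 16-byte aligned. */
static float sum_doubled(float x)
{
    v4sf a = {x, x, x, x};
    v4sf b = a + a;
    return b[0] + b[1] + b[2] + b[3];
}
```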

The GCC documentation points this out in its description of the -mpreferred-stack-boundary option:

-mpreferred-stack-boundary=num

Attempt to keep the stack boundary aligned to a 2 raised to num byte boundary. If -mpreferred-stack-boundary is not specified, the default is 4 (16 bytes or 128 bits).

Warning: When generating code for the x86-64 architecture with SSE extensions disabled, -mpreferred-stack-boundary=3 can be used to keep the stack boundary aligned to 8 byte boundary. Since x86-64 ABI require 16 byte stack alignment, this is ABI incompatible and intended to be used in controlled environment where stack space is important limitation. This option leads to wrong code when functions compiled with 16 byte stack alignment (such as functions from a standard library) are called with misaligned stack. In this case, SSE instructions may lead to misaligned memory access traps. In addition, variable arguments are handled incorrectly for 16 byte aligned objects (including x87 long double and __int128), leading to wrong results. You must build all modules with -mpreferred-stack-boundary=3, including any libraries. This includes the system libraries and startup modules.

How can I tell GCC not to align main's stack to 16-byte boundary?

You can effectively disable the 16-byte stack alignment by telling GCC to align to 4 bytes (2 raised to 2) rather than 16:

-mpreferred-stack-boundary=2

There will likely be some performance implications.


Caveat from Peter Cordes:

This will violate the ABI for the whole rest of the program, not maintaining the 16-byte alignment that the Linux version of the i386 SysV ABI guarantees on program entry and before calls to other functions. So compiler-generated code using SSE instructions may segfault. It may be possible with a function attribute on main to tell it that the incoming stack alignment is correct. This might not be a problem for your use-case, but it's an important caveat. Also it can break _Atomic uint64_t local vars that other threads get references to.

gcc 8.2+ doesn't always align the stack before a call on x86?

Since you provided a definition of the function in the same translation unit, apparently GCC sees that the function doesn't care about stack alignment and doesn't bother much with it. And apparently this basic inter-procedural analysis / optimization (IPA) is on by default even at -O0.

It turns out this option even has an obvious name, which I found by searching the manual for "ipa" options: -fipa-stack-alignment, which is on by default even at -O0. Manually turning it off with -fno-ipa-stack-alignment gives what you expected: a second sub whose value depends on the number of pushes (Godbolt), making sure ESP is aligned to 16 bytes before a call, as modern Linux versions of the i386 SysV ABI require.


Or if you change the definition to just a declaration, then the resulting asm is as expected, fully respecting -mpreferred-stack-boundary.

void callee(void* a, void* b) {
}

to

void callee(void* a, void* b);
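A minimal test file for reproducing the difference, as a sketch: the file name stack.c, the caller function and the call counter are hypothetical additions, and the commands in the comment use only flags mentioned in this answer plus standard GCC switches (-m32 for i386 code, -S to stop after generating assembly).

```c
/* stack.c (hypothetical name). Compare the asm for caller() from:
 *   gcc -m32 -O0 -S stack.c
 *   gcc -m32 -O0 -S -fno-ipa-stack-alignment stack.c
 * and from a version where callee() is only declared, not defined. */

static int callee_calls; /* hypothetical counter, only so the call is observable */

/* Definition visible in this translation unit: it needs no particular
 * stack alignment, which GCC's IPA can notice and exploit at the call site. */
void callee(void *a, void *b)
{
    (void)a;
    (void)b;
    callee_calls++;
}

int caller(void)
{
    callee(0, 0);
    return callee_calls;
}
```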

Using -fPIC also forces GCC not to assume anything about the callee, because it must then respect the possibility of function interposition (e.g. via LD_PRELOAD).

Without compiling for a shared library, GCC is allowed to assume that any definition it sees for a global function is the definition, thanks to ISO C's one-definition-rule.


If you use __attribute__((noipa)) on the function definition, call sites won't assume anything based on the definition. That is just like renaming the definition (so you can still look at its asm) and giving the caller only a declaration of the name it uses.

If you just want to stop inlining, you can use __attribute__((noinline,noclone)) instead, to still allow the callsite to be like it would if the optimizer simply chose not to inline, but could still see this definition. That may or may not be what you want.
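A side-by-side sketch of the two attributes. The function names are hypothetical; noipa and noclone are GCC attributes (noipa needs GCC 8 or newer), and clang may warn about and ignore ones it does not recognize.

```c
/* noipa: disables interprocedural optimization between this function and
 * its callers, as if its body were not available when compiling them. */
__attribute__((noipa)) int add_noipa(int a, int b)
{
    return a + b;
}

/* noinline, noclone: never inlined or cloned, but callers may still use
 * other facts the optimizer derives from this definition. */
__attribute__((noinline, noclone)) int add_noinline(int a, int b)
{
    return a + b;
}
```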

See also How to remove "noise" from GCC/clang assembly output? re: writing functions whose asm is interesting to look at, and compiler options.


And BTW, I found it easiest to change the declaration / definition to variadic, so I could add or remove args with only a change to the caller. I was still able to reproduce your result of that not changing the sub amount even when the push amount changes with an extra arg, when there's a definition, but not with just a declaration.

void callee(void* a, ...)  // {}   // comment out a body or not
;

