Stack Alignment on X86

Responsibility of stack alignment in 32-bit x86 assembly

GCC only does this extra stack alignment in main; that function is special. You won't see it if you look at code-gen for any other function, unless you have a local with alignas(32) or something.

GCC is just taking a defensive approach with -m32, by not assuming that main is called with a properly 16B-aligned stack. Or this special treatment is left over from when -mpreferred-stack-boundary=4 was only a good idea, not the law.

The i386 System V ABI has guaranteed/required for years that ESP+4 is 16B-aligned on entry to a function. (i.e. ESP must be 16B-aligned before a CALL instruction, so args on the stack start at a 16B boundary. This is the same as for x86-64 System V.)

The ABI also guarantees that new 32-bit processes start with ESP aligned on a 16B boundary (e.g. at _start, the ELF entry point, where ESP points at argc, not a return address), and the glibc CRT code maintains that alignment.

As far as the calling convention is concerned, EBP is just another call-preserved register. But yes, compiler output with -fno-omit-frame-pointer does take care to push ebp before other call-preserved registers (like EBX) so the saved EBP values form a linked list. (Because it also does the mov ebp, esp part of setting up a frame pointer after that push.)


Perhaps gcc is defensive because an extremely ancient Linux kernel (from before that revision to the i386 ABI, when the required alignment was only 4B) could violate that assumption, and it's only an extra couple instructions that run once in the life-time of the process (assuming the program doesn't call main recursively).


Unlike gcc, clang assumes the stack is properly aligned on entry to main. (clang also assumes that narrow args have been sign or zero-extended to 32 bits, even though the current ABI revision doesn't specify that behaviour (yet). gcc and clang both emit code that does in the caller side, but only clang depends on it in the callee. This happens in 64-bit code, but I didn't check 32-bit.)

Look at compiler output on http://gcc.godbolt.org/ for main and functions other than main if you're curious.


I just updated the ABI links in the x86 tag wiki the other day. http://x86-64.org/ is still dead and seems to be not coming back, so I updated the System V links to point to the PDFs of the current revision in HJ Lu's github repo, and his page with links.

Note that the last version on SCO's site is not the current revision, and doesn't include the 16B-stack-alignment requirement.

I think some BSD versions still don't require / maintain 16-byte stack alignment.

What does aligning the stack mean in assembly?

Addressing is generally byte-based. A unique address points at a byte (which can be the first byte in a word or doubleword, etc, but referenced to that address).

With any numbering system the least significant digit holds the value base to the power 0 (the number 1). The next least base to the power 1, the next base to the power 2. In decimal this is the ones column the tens column the hundreds column. In binary ones, twos, fours... Alignment means evenly divisible by which also means the least significant digits are zeros.

You are always "aligned" on a byte boundary but a 16 bit boundary in binary means the least significant bit is zero, 32 bit aligned two zeros and so on.

0x1234 aligned on both a 16 and 32 bit boundary but not 64 bit

0x1235 not aligned (byte alignment really isn't a thing)

0x1236 aligned on a 16 bit boundary

0x1230 four zeros so 16, 32, 64, 128 BITS not bytes. 2,4,8,16 bytes.

The why is for performance reasons all memories have a fixed width as well as data buses, you can't magically add or remove wires in the logic once implemented, there is a physical limit, you can choose to not use all of them as part of the design but you can't add any.

So while the x86 buses are wider, let's say you had a 32 bit wide data bus as well as a 32 bit wide memory (think cache but also dram but we don't access dram directly in general).

If I want to save the 16 bits 0xAABB to address 0x1001 in a little endian machine then 0x1001 will get 0xBB and 0x1002 will get 0xAA. If I had a 32 bit data bus and a 32 bit memory on the far side of it then I could move those 16 bits if I designed the bus for this, by writing 0xXXAABBXX to address 0x1000 with a byte lane mask of 0b0110 telling the memory controller to use the 32 bits of memory associated with the BYTE based address 0x1000, and the byte lane mask on the bus telling the controller only save the middle two bytes, the outer two are don't cares.

The memory is a fixed width generally so all transactions must be full width it would read the 32 bits modify the 16 in the middle with 0xAABB and write the 32 bits back. This is of course inefficient. Even worse would be to write 0xAABB to 0x1003 that would be two bus transactions one for 0xBBXXXXXX at address 0x1000 and one for 0xXXXXXXAA at address 0x1004. That is a lot of extra cycles both on the bus and the read-modify-writes on the memory.

Now the stack alignment rules are not going to prevent read-modify-writes on writes. For the cases where larger transfers happen there are opportunities for a performance gain, for example if the bus were 32 bits and the memory and you did a 64 bit transfer to address 0x1000, that can based on bus design look like a single transfer with a length of two. The bus handshake happens then two back to back clocks the data moves, rather than handshakes and one width of the bus of data for a smaller transfer. So you get a gain there if the memory is 32 bits wide then it is two writes without a read-modify-write into the sram in the cache. Pretty clean, want to avoid the read-modify-writes.

Now do this for a while as things evolve and the hardware and the tools desire a stack alignment.

Depending on the instruction set, clearly here you are asking x86, but as a programmer you can sometimes choose to say push a byte on the stack and then adjust it to align it. Or if you are making room for local variables, depending on the instruction set (if the stack pointer is general purpose enough to be able to do math on it) you can simply subtract, so sub sp,#8 is the same as pushing two 32 bit items to the stack simply to make room for two 32 bit items.

If the rule is say 32 bit alignment and you push a byte, then you need to adjust the stack pointer by 3 to make the total change in the stack pointer a multiple of 4 bytes (32 bits).

How you know how much is you simply count it up. If it is 16 byte alignment and you push 4 then you need to push 12 more or adjust the stack pointer by 12 more.

The key here is that if everyone agrees to keep the stack aligned then you don't actually have to look at the lower bits of the stack pointer, you just keep track of what you are pushing and popping before calling something else.

If the stack is shared with the interrupt handlers (not really in your current x86 running an operating system, but still possible and possible in many other use cases for general purpose processors) I have not seen that this rule applies there as you will see the compiler do a less than aligned size push or pop then adjust with other pushes or pops or subtraction or addition. If an interrupt occurred between those the handler would see an unaligned stack.

Some architectures will fault on unaligned accesses, a further reason for keeping the stack aligned.

If your code is not messing with the stack then you don't need to mess with the stack (pointer). Only if you use the stack in your code by allocating space on the stack (pushes or math on the stack pointer), do you need to care and you need to know what the convention of the compiler you are linking this code with and conform to that. If this is all assembly language and no compiler then you decide the convention yourself and basically do whatever you want within the limitations of the processor itself.

From your title question it has nothing to do with assembly at all, nor machine code. It has to do with your code and what it does. The assembly language is simply a language in which you convey how much you want to adjust the stack pointer, the instruction doesn't care or know about any such things it takes the constant provided and uses it against the register. Assembly is one of the few if not the only that allows you to do math on the stack pointer register, so there is that connection. But alignment and assembly are not related.

gcc 8.2+ doesn't always align the stack before a call on x86?

Since you provided a definition of the function in the same translation unit, apparently GCC sees that the function doesn't care about stack alignment and doesn't bother much with it. And apparently this basic inter-procedural analysis / optimization (IPA) is on by default even at -O0.

Turns out this option even has an obvious name when I searched for "ipa" options in the manual: -fipa-stack-alignment is on by default even at -O0. Manually turning it off with -fno-ipa-stack-alignment results in what you expected, a second sub whose value depends on the number of pushes (Godbolt), making sure ESP is aligned by 16 before a call like modern Linux versions of the i386 SysV ABI use.


Or if you change the definition to just a declaration, then the resulting asm is as expected, fully respecting -mpreferred-stack-boundary.

void callee(void* a, void* b) {
}

to

void callee(void* a, void* b);

Using -fPIC also forces GCC to not assume anything about the callee, so it does respect the possibility of function interposition (e.g. via LD_PRELOAD) with the appropriate option.

Without compiling for a shared library, GCC is allowed to assume that any definition it sees for a global function is the definition, thanks to ISO C's one-definition-rule.


If you use __attribute__((noipa)) on the function definition, then call sites won't assume anything based on the definition. Just like if you'd renamed the definition (so you could still look at it) and provided only a declaration of the name the caller uses.

If you just want to stop inlining, you can use __attribute__((noinline,noclone)) instead, to still allow the callsite to be like it would if the optimizer simply chose not to inline, but could still see this definition. That may or may not be what you want.

See also How to remove "noise" from GCC/clang assembly output? re: writing functions whose asm is interesting to look at, and compiler options.


And BTW, I found it easiest to change the declaration / definition to variadic, so I could add or remove args with only a change to the caller. I was still able to reproduce your result of that not changing the sub amount even when the push amount changes with an extra arg, when there's a definition, but not with just a declaration.

void callee(void* a, ...)  // {}   // comment out a body or not
;

Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?

Note that the current version of the i386 System V ABI used on Linux also requires 16-byte stack alignment1. See https://sourceforge.net/p/fbc/bugs/659/ for some history, and my comment on https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838#c91 for an attempt at summarizing the unfortunate history of how i386 GNU/Linux + GCC accidentally got into a situation where a backwards-incompat change to the i386 System V ABI was the lesser of two evils.

Windows x64 also requires 16-byte stack alignment before a call, presumably for similar motivations as x86-64 System V.

Also, semi-related: x86-64 System V requires that global arrays of 16 bytes and larger be aligned by 16. Same for local arrays of >= 16 bytes or variable size, although that detail is only relevant across functions if you know that you're being passed the address of the start of an array, not a pointer into the middle. (Different memory alignment for different buffer sizes). It doesn't let you make any extra assumptions about an arbitrary int *.


SSE2 is baseline for x86-64, and making the ABI efficient for types like __m128, and for compiler auto-vectorization, was one of the design goals, I think. The ABI has to define how such args are passed as function args, or by reference.

16-byte alignment is sometimes useful for local variables on the stack (especially arrays), and guaranteeing 16-byte alignment means compilers can get it for free whenever it's useful, even if the source doesn't explicitly request it.

If the stack alignment relative to a 16-byte boundary wasn't known, every function that wanted an aligned local would need an and rsp, -16, and extra instructions to save/restore rsp after an unknown offset to rsp (either 0 or -8) e.g. using up rbp for a frame pointer.

Without AVX, memory source operands have to be 16-byte aligned. e.g. paddd xmm0, [rsp+rdi] faults if the memory operand is misaligned. So if alignment isn't known, you'd have to either use movups xmm1, [rsp+rdi] / paddd xmm0, xmm1, or write a loop prologue / epilogue to handle the misaligned elements. For local arrays that the compiler wants to auto-vectorize over, it can simply choose to align them by 16.

Also note that early x86 CPUs (before Nehalem / Bulldozer) had a movups instruction that's slower than movaps even when the pointer does turn out to be aligned. (I.e. unaligned loads/stores on aligned data was extra slow, as well as preventing folding loads into an ALU instruction.) (See Agner Fog's optimization guides, microarch guide, and instruction tables for more about all of the above.)

These factors are why a guarantee is more useful than just "usually" keeping the stack aligned. Being allowed to make code which actually faults on a misaligned stack allows more optimization opportunities.

Aligned arrays also speed up vectorized memcpy / strcmp / whatever functions that can't assume alignment, but instead check for it and can jump straight to their whole-vector loops.

From a recent version of the x86-64 System V ABI (r252):

An array uses the same alignment as its elements, except that a local or global
array variable of length at least 16 bytes or a C99 variable-length array variable
always has alignment of at least 16 bytes.4

4 The alignment requirement allows the use of SSE instructions when operating on the array.
The compiler cannot in general calculate the size of a variable-length array (VLA), but it is expected
that most VLAs will require at least 16 bytes, so it is logical to mandate that VLAs have at
least a 16-byte alignment.

This is a bit aggressive, and mostly only helps when functions that auto-vectorize can be inlined, but usually there are other locals the compiler can stuff into any gaps so it doesn't waste stack space. And doesn't waste instructions as long as there's a known stack alignment. (Obviously the ABI designers could have left this out if they'd decided not to require 16-byte stack alignment.)



Spill/reload of __m128

Of course, it makes it free to do alignas(16) char buf[1024]; or other cases where the source requests 16-byte alignment.

And there are also __m128 / __m128d / __m128i locals. The compiler may not be able to keep all vector locals in registers (e.g. spilled across a function call, or not enough registers), so it needs to be able to spill/reload them with movaps, or as a memory source operand for ALU instructions, for efficiency reasons discussed above.

Loads/stores that actually are split across a cache-line boundary (64 bytes) have significant latency penalties, and also minor throughput penalties on modern CPUs. The load needs data from 2 separate cache lines, so it takes two accesses to the cache. (And potentially 2 cache misses, but that's rare for stack memory.)

I think movups already had that cost baked in for vectors on older CPUs where it's expensive, but it still sucks. Spanning a 4k page boundary is much worse (on CPUs before Skylake), with a load or store taking ~100 cycles if it touches bytes on both sides of a 4k boundary. (Also needs 2 TLB checks.) Natural alignment makes splits across any wider boundary impossible, so 16-byte alignment was sufficient for everything you can do with SSE2.


max_align_t has 16-byte alignment in the x86-64 System V ABI, because of long double (10-byte/80-bit x87). It's defined as padded to 16 bytes for some weird reason, unlike in 32-bit code where sizeof(long double) == 10. x87 10-byte load/store is quite slow anyway (like 1/3rd the load throughput of double or float on Core2, 1/6th on P4, or 1/8th on K8), but maybe cache-line and page split penalties were so bad on older CPUs that they decided to define it that way. I think on modern CPUs (maybe even Core2) looping over an array of long double would be no slower with packed 10-byte, because the fld m80 would be a bigger bottleneck than a cache-line split every ~6.4 elements.

Actually, the ABI was defined before silicon was available to benchmark on (back in ~2000), but those K8 numbers are the same as K7 (32-bit / 64-bit mode is irrelevant here). Making long double 16-byte does make it possible to copy a single one with movaps, even though you can't do anything with it in XMM registers. (Except manipulate the sign bit with xorps / andps / orps.)

Related: this max_align_t definition means that malloc always returns 16-byte aligned memory in x86-64 code. This lets you get away with using it for SSE aligned loads like _mm_load_ps, but such code can break when compiled for 32-bit where alignof(max_align_t) is only 8. (Use aligned_alloc or whatever.)


Other ABI factors include passing __m128 values on the stack (after xmm0-7 have the first 8 float / vector args). It makes sense to require 16-byte alignment for vectors in memory, so they can be used efficiently by the callee, and stored efficiently by the caller. Maintaining 16-byte stack alignment at all times makes it easy for functions that need to align some arg-passing space by 16.

There are types like __m128 that the ABI guarantees have 16-byte alignment. If you define a local and take its address, and pass that pointer to some other function, that local needs to be sufficiently aligned. So maintaining 16-byte stack alignment goes hand in hand with giving some types 16-byte alignment, which is obviously a good idea.

These days, it's nice that atomic<struct_of_16_bytes> can cheaply get 16-byte alignment, so lock cmpxchg16b doesn't ever cross a cache line boundary. For the really rare case where you have an atomic local with automatic storage, and you pass pointers to it to multiple threads...



Footnote 1: 32-bit Linux

Not all 32-bit platforms broke backwards compatibility with existing binaries and hand-written asm the way Linux did; some like i386 NetBSD still only use the historical 4-byte stack alignment requirement from the original version of the i386 SysV ABI.

The historical 4-byte stack alignment was also insufficient for efficient 8-byte double on modern CPUs. Unaligned fld / fstp are generally efficient except when they cross a cache-line boundary (like other loads/stores), so it's not horrible, but naturally-aligned is nice.

Even before 16-byte alignment was officially part of the ABI, GCC used to enable -mpreferred-stack-boundary=4 (2^4 = 16-bytes) on 32-bit. This currently assumes the incoming stack alignment is 16 bytes (even for cases that will fault if it's not), as well as preserving that alignment. I'm not sure if historical gcc versions used to try to preserve stack alignment without depending on it for correctness of SSE code-gen or alignas(16) objects.

ffmpeg is one well-known example that depends on the compiler to give it stack alignment: what is "stack alignment"?, e.g. on 32-bit Windows.

Modern gcc still emits code at the top of main to align the stack by 16 (even on Linux where the ABI guarantees that the kernel starts the process with an aligned stack), but not at the top of any other function. You could use -mincoming-stack-boundary to tell gcc how aligned it should assume the stack is when generating code.

Ancient gcc4.1 didn't seem to really respect __attribute__((aligned(16))) or 32 for automatic storage, i.e. it doesn't bother aligning the stack any extra in this example on Godbolt, so old gcc has kind of a checkered past when it comes to stack alignment. I think the change of the official Linux ABI to 16-byte alignment happened as a de-facto change first, not a well-planned change. I haven't turned up anything official on when the change happened, but somewhere between 2005 and 2010 I think, after x86-64 became popular and the x86-64 System V ABI's 16-byte stack alignment proved useful.

At first it was a change to GCC's code-gen to use more alignment than the ABI required (i.e. using a stricter ABI for gcc-compiled code), but later it was written in to the version of the i386 System V ABI maintained at https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI (which is official for Linux at least).


@MichaelPetch and @ThomasJager report that gcc4.5 may have been the first version to have -mpreferred-stack-boundary=4 for 32-bit as well as 64-bit. gcc4.1.2 and gcc4.4.7 on Godbolt appear to behave that way, so maybe the change was backported, or Matt Godbolt configured old gcc with a more modern config.

What does it mean to align the stack?

Assume the stack looks like this on entry to _main (the address of the stack pointer is just an example):

|    existing     |
| stack content |
+-----------------+ <--- 0xbfff1230

Push %ebp, and subtract 8 from %esp to reserve some space for local variables:

|    existing     |
| stack content |
+-----------------+ <--- 0xbfff1230
| %ebp |
+-----------------+ <--- 0xbfff122c
: reserved :
: space :
+-----------------+ <--- 0xbfff1224

Now, the andl instruction zeroes the low 4 bits of %esp, which may decrease it; in this particular example, it has the effect of reserving an additional 4 bytes:

|    existing     |
| stack content |
+-----------------+ <--- 0xbfff1230
| %ebp |
+-----------------+ <--- 0xbfff122c
: reserved :
: space :
+ - - - - - - - - + <--- 0xbfff1224
: extra space :
+-----------------+ <--- 0xbfff1220

The point of this is that there are some "SIMD" (Single Instruction, Multiple Data) instructions (also known in x86-land as "SSE" for "Streaming SIMD Extensions") which can perform parallel operations on multiple words in memory, but require those multiple words to be a block starting at an address which is a multiple of 16 bytes.

In general, the compiler can't assume that particular offsets from %esp will result in a suitable address (because the state of %esp on entry to the function depends on the calling code). But, by deliberately aligning the stack pointer in this way, the compiler knows that adding any multiple of 16 bytes to the stack pointer will result in a 16-byte aligned address, which is safe for use with these SIMD instructions.

Understanding stack alignment

rsp % 16 == 0 at _start - that's the OS entry point. It's not a function (there's no return address on the stack, instead RSP points at argc).
Unlike functions, RSP is aligned by 16 on entry to _start, as specified by the x86-64 System V ABI.

From _start, you're ready to call a function right away, without having to adjust the stack, because the stack should be aligned before call. call itself will add 8B of return address, and you can expect the rsp % 16 == 8 upon entry, one more push away from 16-byte alignment. That's guaranteed upon entry to any function1.

Upon app entry, you can trust the kernel to give you 16-byte RSP alignment, or you could align the stack manually with and rsp, -16 before calling any other code conforming to ABI. (Or if you plan to use C runtime lib, then the entry point of your app code should be main, and let libc's crt startup code code run as _start. main is a normal function like any other, so RSP & 0xF == 0x8 on entry to it when it's eventually called.)

Footnote 1: Unless you build with special options that change the ABI, like -mpreferred-stack-boundary=3 instead of the default 4. But that would make it unsafe to call functions in any code compiled without that. For example glibc scanf Segmentation faults when called from a function that doesn't align RSP



Now, after pushing the content of rsp became 0x7fffffffdce8. Is it a violation of the alignment requirements?

Yes, if you would at that point call some more complex function like for example printf with non trivial arguments (so it would use SSE instruction for implementation), it will highly likely segfault.


About push byte 0xFF:

That's not legal instruction in 64b mode (not even in 16 and 32 bit modes) (not legal in the sense of byte operand target size, byte immediate as source value is legal, but operand size can be only 16, 32 or 64 bits), so the NASM will guess the target size (any from legal ones, naturally picking qword in 64b mode), and use the guessed target size with the imm8 from source.

BTW use -w+all option to make the NASM emit (sort of weird, but at least you can investigate) warning in such case:

warning: signed byte value exceeds bounds

For example legit push word 0xFF would push only two bytes to stack, of word value 0x00FF.


How to align the stack: if you already know initial alignment, just adjust as needed before calling some ABI requiring subroutine (in common 64b code that is usually as simple as either not pushing anything, or doing one more redundant push, like push rbp).

If you are not sure about alignment, use some spare register to store original rsp (often rbp is used, so it also functions as stack frame pointer), and then and rsp,-16 to clear the bottom bits.

Keep in mind, when creating your own ABI conforming subroutines, that stack was aligned before call, so it is -8B upon entry. Again simple push rbp is often enough to resolve several issues at the same time, preserving rbp value (so mov rbp, rsp is possible "for free") and aligning stack for rest of subroutine.


EDIT: about encoding, source size, and immediate size...

Unfortunately I'm not 100% sure about how exactly this is supposed to be defined in NASM, but I think actually the push definition is so complex, that it breaks NASM syntax a bit (exhausting the current syntax to a point where you can't specify whether you mean operand size, or source immediate size, so it is silently assumed the size specifier is operand size mainly and affects immediate in certain cases).

By using push byte 0xFF the NASM will take the byte part ALSO as "operand size", not just as immediate size. And byte is not legal operand size for push, so NASM will instead choose qword as by default in 64b mode. Then it will also consider the byte as immediate size, and sign-extend the 0xFF to qword. I.e. this looks to me as a bit of undefined behaviour. NASM creators probably don't expect you to specify immediate size, because the NASM optimizes for size, so when you do push word -1, it will assemble that as "push word operand imm8". You can override that the other way, to make sure you get imm16 by push strict word -1.

See the machine code produced by the various combinations (in 64b mode) (some of them speaking strictly are worth at least of warning, or even error, like "strict qword" producing only imm32, not imm64 (as imm64 opcode does not exist of course) ... not even mentioning that the dword variants are effectively qword operand sizes, you can't use 32b operand size in 64b mode):

 6 00000000 6AFF                            push    -1
7 00000002 6AFF push strict byte 0xFF
8 ****************** warning: signed byte value exceeds bounds
9 00000004 6AFF push byte 0xFF
10 ****************** warning: signed byte value exceeds bounds
11 00000006 6AFF push strict byte -1
12 00000008 6AFF push byte -1
13 0000000A 6668FF00 push strict word 0xFF
14 0000000E 6668FF00 push word 0xFF
15 00000012 6668FFFF push strict word -1
16 00000016 666AFF push word -1
17 00000019 68FF000000 push strict dword 0xFF
18 0000001E 68FF000000 push dword 0xFF
19 00000023 68FFFFFFFF push strict dword -1
20 00000028 6AFF push dword -1
21 0000002A 68FF000000 push strict qword 0xFF
22 0000002F 68FF000000 push qword 0xFF
23 00000034 68FFFFFFFF push strict qword -1
24 00000039 6AFF push qword -1

Anyway, I guess not too many people are bothered by this, as in 64b mode you usually want qword push (rsp -= 8) with immediate encoded in shortest possible way, so you just write push -1 and let the NASM handle the imm8 optimization itself, expecting rsp to change by -8 of course. And in other case, they probably expect you to know legal operand sizes, and not to use byte at all.



Related Topics



Leave a reply



Submit