Responsibility of Stack Alignment in 32-Bit X86 Assembly

Responsibility of stack alignment in 32-bit x86 assembly

GCC only does this extra stack alignment in main; that function is special. You won't see it if you look at code-gen for any other function, unless you have a local with alignas(32) or something.

GCC is just taking a defensive approach with -m32, by not assuming that main is called with a properly 16B-aligned stack. Or this special treatment is left over from when -mpreferred-stack-boundary=4 was only a good idea, not the law.

The i386 System V ABI has guaranteed/required for years that ESP+4 is 16B-aligned on entry to a function. (i.e. ESP must be 16B-aligned before a CALL instruction, so args on the stack start at a 16B boundary. This is the same as for x86-64 System V.)

The ABI also guarantees that new 32-bit processes start with ESP aligned on a 16B boundary (e.g. at _start, the ELF entry point, where ESP points at argc, not a return address), and the glibc CRT code maintains that alignment.

As far as the calling convention is concerned, EBP is just another call-preserved register. But yes, compiler output with -fno-omit-frame-pointer does take care to push ebp before other call-preserved registers (like EBX) so the saved EBP values form a linked list. (Because it also does the mov ebp, esp part of setting up a frame pointer after that push.)

Perhaps gcc is defensive because an extremely ancient Linux kernel (from before that revision to the i386 ABI, when the required alignment was only 4B) could violate that assumption, and it's only an extra couple instructions that run once in the life-time of the process (assuming the program doesn't call main recursively).

Unlike gcc, clang assumes the stack is properly aligned on entry to main. (clang also assumes that narrow args have been sign or zero-extended to 32 bits, even though the current ABI revision doesn't specify that behaviour (yet). gcc and clang both emit code that does in the caller side, but only clang depends on it in the callee. This happens in 64-bit code, but I didn't check 32-bit.)

Look at compiler output on http://gcc.godbolt.org/ for main and functions other than main if you're curious.

I just updated the ABI links in the x86 tag wiki the other day. http://x86-64.org/ is still dead and seems to be not coming back, so I updated the System V links to point to the PDFs of the current revision in HJ Lu's github repo, and his page with links.

Note that the last version on SCO's site is not the current revision, and doesn't include the 16B-stack-alignment requirement.

I think some BSD versions still don't require / maintain 16-byte stack alignment.

What's the purpose of stack pointer alignment in the prologue of main()

The System V AMD64 ABI (x86-64 ABI) requires 16-byte stack alignment. double requires 8-byte alignment and SSE extensions require 16-byte alignment.

gcc documentation points it in its documentation for -mpreferred-stack-boundary option:

-mpreferred-stack-boundary=num

Attempt to keep the stack boundary aligned to a 2 raised to num byte boundary. If -mpreferred-stack-boundary is not specified, the default is 4 (16 bytes or 128 bits).

Warning: When generating code for the x86-64 architecture with SSE extensions disabled, -mpreferred-stack-boundary=3 can be used to keep the stack boundary aligned to 8 byte boundary. Since x86-64 ABI require 16 byte stack alignment, this is ABI incompatible and intended to be used in controlled environment where stack space is important limitation. This option leads to wrong code when functions compiled with 16 byte stack alignment (such as functions from a standard library) are called with misaligned stack. In this case, SSE instructions may lead to misaligned memory access traps. In addition, variable arguments are handled incorrectly for 16 byte aligned objects (including x87 long double and __int128), leading to wrong results. You must build all modules with -mpreferred-stack-boundary=3, including any libraries. This includes the system libraries and startup modules.

gcc x86-32 stack alignment and calling printf

The i386 System V ABI does guarantee / require 16 byte stack alignment before a call, like I said at the top of my answer that you linked. (Unless you're calling a private helper function, in which case you can make up your own rules for alignment, arg-passing, and which registers are clobbered for that function.)

Functions are allowed to crash or misbehave if you violate this ABI requirement, but are not required to. e.g. scanf in x86-64 Ubuntu glibc (as compiled by recent gcc) only recently started doing that: scanf Segmentation faults when called from a function that doesn't change RSP

Functions can depend on stack alignment for performance (to align a double or array of doubles to avoid cache-line splits when accessing them).

Usually the only case where a function depends on stack alignment for correctness is when compiled to use SSE/SSE2, so it can use 16-byte alignment-required loads/stores to copy a struct or array (movaps or movdqa), or to actually auto-vectorize a loop over a local array.

I think Ubuntu doesn't compile their 32-bit libraries with SSE (except functions like memcpy that use runtime dispatching), so they can still work on ancient CPUs like Pentium II. Multiarch libraries on an x86-64 system should assume SSE2, but with 4-byte pointers it's less likely that 32-bit functions would have 16 byte structs to copy.

Anyway, whatever the reason, obviously printf in your 32-bit build of glibc doesn't actually depend on 16-byte stack alignment for correctness, so it doesn't fault even when you misalign the stack.

Why is 0x2014 pushed instead of 0x14? What is 0x201d?

0x14 (decimal 20) is the value in memory at that location. It will be loaded at runtime, because you used push r/m32, not push $20 (or an assemble time constant like .equ testint, 20 or testint = 20).

You used gcc -m32 to make a PIE (Position Independent Executable), which is relocated at runtime, because that's the default on Ubuntu's gcc.

0x2014 is the offset relative to the start of the file. If you disassemble at runtime after running the program, you'll see a real address.

Same for call 54b. It's presuambly a call to the PLT (which is near the start of the file / text segment, hence the low address).

If you disassembled with objdump -drwC, you'd see symbol relocation info. (I like -Mintel as well, but beware it's MASM-like, not NASM).

You can link with gcc -m32 -no-pie to make classic position-dependent executables. I'd definitely recommend that especially for 32-bit code, and especially if you're compiling C, use gcc -m32 -no-pie -fno-pie to get non-PIE code-gen as well as linking into a non-PIE executable. (see 32-bit absolute addresses no longer allowed in x86-64 Linux? for more about PIEs.)

Understanding stack alignment

rsp % 16 == 0 at _start - that's the OS entry point. It's not a function (there's no return address on the stack, instead RSP points at argc).
Unlike functions, RSP is aligned by 16 on entry to _start, as specified by the x86-64 System V ABI.

From _start, you're ready to call a function right away, without having to adjust the stack, because the stack should be aligned before call. call itself will add 8B of return address, and you can expect the rsp % 16 == 8 upon entry, one more push away from 16-byte alignment. That's guaranteed upon entry to any function¹.

Upon app entry, you can trust the kernel to give you 16-byte RSP alignment, or you could align the stack manually with and rsp, -16 before calling any other code conforming to ABI. (Or if you plan to use C runtime lib, then the entry point of your app code should be main, and let libc's crt startup code code run as _start. main is a normal function like any other, so RSP & 0xF == 0x8 on entry to it when it's eventually called.)

Footnote 1: Unless you build with special options that change the ABI, like -mpreferred-stack-boundary=3 instead of the default 4. But that would make it unsafe to call functions in any code compiled without that. For example glibc scanf Segmentation faults when called from a function that doesn't align RSP

Now, after pushing the content of rsp became 0x7fffffffdce8. Is it a violation of the alignment requirements?

Yes, if you would at that point call some more complex function like for example printf with non trivial arguments (so it would use SSE instruction for implementation), it will highly likely segfault.

About push byte 0xFF:

That's not legal instruction in 64b mode (not even in 16 and 32 bit modes) (not legal in the sense of byte operand target size, byte immediate as source value is legal, but operand size can be only 16, 32 or 64 bits), so the NASM will guess the target size (any from legal ones, naturally picking qword in 64b mode), and use the guessed target size with the imm8 from source.

BTW use -w+all option to make the NASM emit (sort of weird, but at least you can investigate) warning in such case:

warning: signed byte value exceeds bounds

For example legit push word 0xFF would push only two bytes to stack, of word value 0x00FF.

How to align the stack: if you already know initial alignment, just adjust as needed before calling some ABI requiring subroutine (in common 64b code that is usually as simple as either not pushing anything, or doing one more redundant push, like push rbp).

If you are not sure about alignment, use some spare register to store original rsp (often rbp is used, so it also functions as stack frame pointer), and then and rsp,-16 to clear the bottom bits.

Keep in mind, when creating your own ABI conforming subroutines, that stack was aligned before call, so it is -8B upon entry. Again simple push rbp is often enough to resolve several issues at the same time, preserving rbp value (so mov rbp, rsp is possible "for free") and aligning stack for rest of subroutine.

EDIT: about encoding, source size, and immediate size...

Unfortunately I'm not 100% sure about how exactly this is supposed to be defined in NASM, but I think actually the push definition is so complex, that it breaks NASM syntax a bit (exhausting the current syntax to a point where you can't specify whether you mean operand size, or source immediate size, so it is silently assumed the size specifier is operand size mainly and affects immediate in certain cases).

By using push byte 0xFF the NASM will take the byte part ALSO as "operand size", not just as immediate size. And byte is not legal operand size for push, so NASM will instead choose qword as by default in 64b mode. Then it will also consider the byte as immediate size, and sign-extend the 0xFF to qword. I.e. this looks to me as a bit of undefined behaviour. NASM creators probably don't expect you to specify immediate size, because the NASM optimizes for size, so when you do push word -1, it will assemble that as "push word operand imm8". You can override that the other way, to make sure you get imm16 by push strict word -1.

See the machine code produced by the various combinations (in 64b mode) (some of them speaking strictly are worth at least of warning, or even error, like "strict qword" producing only imm32, not imm64 (as imm64 opcode does not exist of course) ... not even mentioning that the dword variants are effectively qword operand sizes, you can't use 32b operand size in 64b mode):

 6 00000000 6AFF                            push    -1
 7 00000002 6AFF                            push    strict byte 0xFF
 8          ******************       warning: signed byte value exceeds bounds
 9 00000004 6AFF                            push    byte 0xFF
10          ******************       warning: signed byte value exceeds bounds
11 00000006 6AFF                            push    strict byte -1
12 00000008 6AFF                            push    byte -1
13 0000000A 6668FF00                        push    strict word 0xFF
14 0000000E 6668FF00                        push    word 0xFF
15 00000012 6668FFFF                        push    strict word -1
16 00000016 666AFF                          push    word -1
17 00000019 68FF000000                      push    strict dword 0xFF
18 0000001E 68FF000000                      push    dword 0xFF
19 00000023 68FFFFFFFF                      push    strict dword -1
20 00000028 6AFF                            push    dword -1
21 0000002A 68FF000000                      push    strict qword 0xFF
22 0000002F 68FF000000                      push    qword 0xFF
23 00000034 68FFFFFFFF                      push    strict qword -1
24 00000039 6AFF                            push    qword -1

Anyway, I guess not too many people are bothered by this, as in 64b mode you usually want qword push (rsp -= 8) with immediate encoded in shortest possible way, so you just write push -1 and let the NASM handle the imm8 optimization itself, expecting rsp to change by -8 of course. And in other case, they probably expect you to know legal operand sizes, and not to use byte at all.

If you think this is not acceptable, I would raise this on the NASM forum/bugzilla/somewhere, how it is supposed to work exactly. As far as I'm personally concerned, the current behaviour is "good enough" for me (makes both sense, plus I give quick look to listing file from time to time to verify there's no nasty surprise in the machine code bytes and it landed as expected). That said, I mostly code size intros, so I know about every byte produced and it's purpose. If the NASM would suddenly produce imm16 instead of expected imm8, I would see it on the binary size and investigate.

What does aligning the stack mean in assembly?

Addressing is generally byte-based. A unique address points at a byte (which can be the first byte in a word or doubleword, etc, but referenced to that address).

With any numbering system the least significant digit holds the value base to the power 0 (the number 1). The next least base to the power 1, the next base to the power 2. In decimal this is the ones column the tens column the hundreds column. In binary ones, twos, fours... Alignment means evenly divisible by which also means the least significant digits are zeros.

You are always "aligned" on a byte boundary but a 16 bit boundary in binary means the least significant bit is zero, 32 bit aligned two zeros and so on.

0x1234 aligned on both a 16 and 32 bit boundary but not 64 bit

0x1235 not aligned (byte alignment really isn't a thing)

0x1236 aligned on a 16 bit boundary

0x1230 four zeros so 16, 32, 64, 128 BITS not bytes. 2,4,8,16 bytes.

The why is for performance reasons all memories have a fixed width as well as data buses, you can't magically add or remove wires in the logic once implemented, there is a physical limit, you can choose to not use all of them as part of the design but you can't add any.

So while the x86 buses are wider, let's say you had a 32 bit wide data bus as well as a 32 bit wide memory (think cache but also dram but we don't access dram directly in general).

If I want to save the 16 bits 0xAABB to address 0x1001 in a little endian machine then 0x1001 will get 0xBB and 0x1002 will get 0xAA. If I had a 32 bit data bus and a 32 bit memory on the far side of it then I could move those 16 bits if I designed the bus for this, by writing 0xXXAABBXX to address 0x1000 with a byte lane mask of 0b0110 telling the memory controller to use the 32 bits of memory associated with the BYTE based address 0x1000, and the byte lane mask on the bus telling the controller only save the middle two bytes, the outer two are don't cares.

The memory is a fixed width generally so all transactions must be full width it would read the 32 bits modify the 16 in the middle with 0xAABB and write the 32 bits back. This is of course inefficient. Even worse would be to write 0xAABB to 0x1003 that would be two bus transactions one for 0xBBXXXXXX at address 0x1000 and one for 0xXXXXXXAA at address 0x1004. That is a lot of extra cycles both on the bus and the read-modify-writes on the memory.

Now the stack alignment rules are not going to prevent read-modify-writes on writes. For the cases where larger transfers happen there are opportunities for a performance gain, for example if the bus were 32 bits and the memory and you did a 64 bit transfer to address 0x1000, that can based on bus design look like a single transfer with a length of two. The bus handshake happens then two back to back clocks the data moves, rather than handshakes and one width of the bus of data for a smaller transfer. So you get a gain there if the memory is 32 bits wide then it is two writes without a read-modify-write into the sram in the cache. Pretty clean, want to avoid the read-modify-writes.

Now do this for a while as things evolve and the hardware and the tools desire a stack alignment.

Depending on the instruction set, clearly here you are asking x86, but as a programmer you can sometimes choose to say push a byte on the stack and then adjust it to align it. Or if you are making room for local variables, depending on the instruction set (if the stack pointer is general purpose enough to be able to do math on it) you can simply subtract, so sub sp,#8 is the same as pushing two 32 bit items to the stack simply to make room for two 32 bit items.

If the rule is say 32 bit alignment and you push a byte, then you need to adjust the stack pointer by 3 to make the total change in the stack pointer a multiple of 4 bytes (32 bits).

How you know how much is you simply count it up. If it is 16 byte alignment and you push 4 then you need to push 12 more or adjust the stack pointer by 12 more.

The key here is that if everyone agrees to keep the stack aligned then you don't actually have to look at the lower bits of the stack pointer, you just keep track of what you are pushing and popping before calling something else.

If the stack is shared with the interrupt handlers (not really in your current x86 running an operating system, but still possible and possible in many other use cases for general purpose processors) I have not seen that this rule applies there as you will see the compiler do a less than aligned size push or pop then adjust with other pushes or pops or subtraction or addition. If an interrupt occurred between those the handler would see an unaligned stack.

Some architectures will fault on unaligned accesses, a further reason for keeping the stack aligned.

If your code is not messing with the stack then you don't need to mess with the stack (pointer). Only if you use the stack in your code by allocating space on the stack (pushes or math on the stack pointer), do you need to care and you need to know what the convention of the compiler you are linking this code with and conform to that. If this is all assembly language and no compiler then you decide the convention yourself and basically do whatever you want within the limitations of the processor itself.

From your title question it has nothing to do with assembly at all, nor machine code. It has to do with your code and what it does. The assembly language is simply a language in which you convey how much you want to adjust the stack pointer, the instruction doesn't care or know about any such things it takes the constant provided and uses it against the register. Assembly is one of the few if not the only that allows you to do math on the stack pointer register, so there is that connection. But alignment and assembly are not related.

Responsibility of Stack Alignment in 32-Bit X86 Assembly