What Is "Stack Alignment"

What is stack alignment?

Alignment of variables in memory (a short history).

In the past, computers had an 8-bit data bus. This means that 8 bits of information could be moved per clock cycle, which was fine at the time.

Then came 16-bit computers. For backward compatibility and other reasons, the 8-bit byte was kept and the 16-bit word was introduced. Each word was 2 bytes, and 16 bits of information could be moved per clock cycle. But this posed a small problem.

Let's look at a memory map:

+----+
|0000| 
|0001|
+----+
|0002|
|0003|
+----+
|0004|
|0005|
+----+
| .. |

At each address there is a byte which can be accessed individually.
But words can only be fetched at even addresses. So if we read a word at 0000, we read the bytes at 0000 and 0001. But if we want to read the word at position 0001, we need two read accesses: first 0000,0001 and then 0002,0003, of which we keep only 0001,0002.

Of course this took extra time, and that was not appreciated. That is why alignment was invented: word variables are stored at word boundaries and byte variables at byte boundaries.
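To put a number on the extra accesses, here is a minimal C sketch. The helper name reads_needed and the model of a bus that only fetches naturally aligned chunks are my own assumptions for illustration, not a description of any particular machine.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bus-sized reads needed to fetch `size` bytes starting at `addr`,
 * assuming memory can only be read in naturally aligned chunks of `bus_bytes`
 * bytes (a power of two). Hypothetical helper, for illustration only. */
unsigned reads_needed(uint32_t addr, uint32_t size, uint32_t bus_bytes)
{
    assert(size > 0);
    assert(bus_bytes != 0 && (bus_bytes & (bus_bytes - 1)) == 0);

    uint32_t first_chunk = addr / bus_bytes;              /* chunk holding the first byte */
    uint32_t last_chunk  = (addr + size - 1) / bus_bytes; /* chunk holding the last byte */
    return last_chunk - first_chunk + 1;
}
```

With a 2-byte bus, a word at 0000 sits in one chunk, while a word at 0001 straddles two chunks and so costs two reads.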

For example, if we have a structure with a byte field (B) and a word field (W) (and a very naive compiler), we get the following:

+----+
|0000| B
|0001| W
+----+
|0002| W
|0003|
+----+

Which is not fun. But when using word alignment we find:

+----+
|0000| B
|0001| -
+----+
|0002| W
|0003| W
+----+

Here memory is sacrificed for access speed.
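You can watch a C compiler make the same trade. This is only a sketch: the struct name record and the helper padding_before_w are made up for the example, and the exact amount of padding is up to the platform ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A byte field followed by a word field, as in the diagrams above. */
struct record {
    uint8_t  b;  /* B: 1 byte  */
    uint16_t w;  /* W: 2 bytes */
};

/* Bytes of padding the compiler inserted between B and W. */
size_t padding_before_w(void)
{
    size_t offset = offsetof(struct record, w);

    /* W has to start on a multiple of its own alignment. */
    assert(offset % _Alignof(uint16_t) == 0);
    return offset - sizeof(uint8_t);
}
```

On the common ABIs where a 16-bit integer has 2-byte alignment, I would expect one byte of padding, matching the "-" slot in the second diagram.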

You can imagine that with double words (4 bytes) or quad words (8 bytes) this is even more important. That's why most modern compilers let you choose which alignment to use when compiling the program.

What does it mean to align the stack?

Assume the stack looks like this on entry to _main (the address of the stack pointer is just an example):

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230

Push %ebp, and subtract 8 from %esp to reserve some space for local variables:

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230
|      %ebp       |
+-----------------+  <--- 0xbfff122c
:    reserved     :
:     space       :
+-----------------+  <--- 0xbfff1224

Now, the andl instruction zeroes the low 4 bits of %esp, which may decrease it; in this particular example, it has the effect of reserving an additional 4 bytes:

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230
|      %ebp       |
+-----------------+  <--- 0xbfff122c
:    reserved     :
:     space       :
+ - - - - - - - - +  <--- 0xbfff1224
:   extra space   :
+-----------------+  <--- 0xbfff1220

The point of this is that there are some "SIMD" (Single Instruction, Multiple Data) instructions (also known in x86-land as "SSE" for "Streaming SIMD Extensions") which can perform parallel operations on multiple words in memory, but require those multiple words to be a block starting at an address which is a multiple of 16 bytes.

In general, the compiler can't assume that particular offsets from %esp will result in a suitable address (because the state of %esp on entry to the function depends on the calling code). But, by deliberately aligning the stack pointer in this way, the compiler knows that adding any multiple of 16 bytes to the stack pointer will result in a 16-byte aligned address, which is safe for use with these SIMD instructions.
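The masking step can be sketched in C. The function name align_down is hypothetical, and I am assuming the mask in the andl is -16 (0xfffffff0), which is what clearing the low 4 bits amounts to.

```c
#include <assert.h>
#include <stdint.h>

/* Round `addr` down to a multiple of `alignment` (a power of two).
 * With alignment == 16 this clears the low 4 bits of the address. */
uint32_t align_down(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return addr & ~(alignment - 1);
}
```

Feeding it the example value 0xbfff1224 gives 0xbfff1220, the extra 4 bytes shown in the last diagram. An address that is already aligned comes back unchanged.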

What does aligning the stack mean in assembly?

Addressing is generally byte-based. A unique address points at a byte (which can be the first byte in a word or doubleword, etc, but referenced to that address).

In any positional numbering system the least significant digit holds the base to the power 0 (the ones), the next digit the base to the power 1, and the next the base to the power 2. In decimal these are the ones, tens and hundreds columns; in binary the ones, twos, fours and so on. Aligned means evenly divisible by a power of the base, which also means the least significant digits are zeros.

You are always "aligned" on a byte boundary. A 16-bit boundary means the least significant bit of the address is zero, a 32-bit boundary means the two least significant bits are zero, and so on.

0x1234 is aligned on both a 16-bit and a 32-bit boundary, but not on a 64-bit one

0x1235 is not aligned (byte alignment really isn't a thing)

0x1236 is aligned on a 16-bit boundary only

0x1230 ends in four zero bits, so it is aligned for 16, 32, 64 and 128 BITS, not bytes; that is 2, 4, 8 and 16 bytes.
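The same checks written as C, as a small sketch. The helper is_aligned is my own name; it takes the alignment in bytes, so 16, 32, 64 and 128 bits become 2, 4, 8 and 16.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if `addr` is a multiple of `alignment_bytes` (a power of two),
 * i.e. if the low log2(alignment_bytes) bits of the address are zero. */
bool is_aligned(uint32_t addr, uint32_t alignment_bytes)
{
    assert(alignment_bytes != 0 &&
           (alignment_bytes & (alignment_bytes - 1)) == 0);
    return (addr & (alignment_bytes - 1)) == 0;
}
```

For example, is_aligned(0x1234, 4) holds but is_aligned(0x1234, 8) does not, and is_aligned(0x1230, 16) holds.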

The why is performance. All memories have a fixed width, and so do data buses. You can't magically add or remove wires in the logic once it is implemented; there is a physical limit. A design can choose not to use all of the wires, but it can't add any.

So while the x86 buses are wider, let's say you had a 32-bit-wide data bus and a 32-bit-wide memory (think cache, but also DRAM, although in general we don't access DRAM directly).

If I want to save the 16 bits 0xAABB to address 0x1001 on a little-endian machine, then 0x1001 will get 0xBB and 0x1002 will get 0xAA. If I had a 32-bit data bus and a 32-bit memory on the far side of it, I could move those 16 bits (if I designed the bus for this) by writing 0xXXAABBXX to address 0x1000 with a byte lane mask of 0b0110. The address tells the memory controller to use the 32 bits of memory associated with the BYTE-based address 0x1000, and the byte lane mask tells it to save only the middle two bytes; the outer two are don't-cares.

The memory is generally a fixed width, so all transactions must be full width: it would read the 32 bits, modify the 16 in the middle with 0xAABB, and write the 32 bits back. This is of course inefficient. Even worse would be to write 0xAABB to 0x1003. That would be two bus transactions, one for 0xBBXXXXXX at address 0x1000 and one for 0xXXXXXXAA at address 0x1004. That is a lot of extra cycles, both on the bus and in the read-modify-writes on the memory.

Now, the stack alignment rules are not going to prevent read-modify-writes on writes. But where larger transfers happen there are opportunities for a performance gain. For example, if the bus and the memory were both 32 bits wide and you did a 64-bit transfer to address 0x1000, that can, depending on the bus design, look like a single transfer with a length of two: the bus handshake happens once and then the data moves on two back-to-back clocks, rather than one handshake for each bus width of data. And since the memory is 32 bits wide, that is two writes into the SRAM in the cache without a read-modify-write. Pretty clean; you want to avoid the read-modify-writes.

Do this for a while as things evolve, and both the hardware and the tools come to want an aligned stack.
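The byte lane masks in these two cases can be worked out mechanically. The C sketch below assumes the 32-bit bus described above, with bit i of the mask standing for the byte at (word address + i). The function names first_lane_mask and spill_lane_mask are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-lane mask for the first 32-bit bus word touched by a write of
 * `size` bytes (1..4) at `addr`. Bit i set means lane i is written. */
unsigned first_lane_mask(uint32_t addr, uint32_t size)
{
    assert(size >= 1 && size <= 4);
    unsigned offset = addr & 3u;                   /* byte position inside the word */
    unsigned lanes  = ((1u << size) - 1u) << offset;
    return lanes & 0xFu;                           /* keep only this word's 4 lanes */
}

/* Byte-lane mask for the following bus word; 0 means no second
 * transaction is needed. */
unsigned spill_lane_mask(uint32_t addr, uint32_t size)
{
    assert(size >= 1 && size <= 4);
    unsigned offset = addr & 3u;
    unsigned lanes  = ((1u << size) - 1u) << offset;
    return lanes >> 4;                             /* lanes that fell off the top */
}
```

A 2-byte write at 0x1001 yields mask 0b0110 with nothing spilled, while the same write at 0x1003 yields 0b1000 for the word at 0x1000 plus 0b0001 for the word at 0x1004, which is the second bus transaction.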

Depending on the instruction set (clearly you are asking about x86 here), as a programmer you can sometimes choose to, say, push a byte on the stack and then adjust the stack pointer to align it. Or, if you are making room for local variables and the stack pointer is general purpose enough that you can do math on it, you can simply subtract: sub sp,#8 is the same as pushing two 32-bit items onto the stack simply to make room for two 32-bit items.

If the rule is say 32 bit alignment and you push a byte, then you need to adjust the stack pointer by 3 to make the total change in the stack pointer a multiple of 4 bytes (32 bits).

How do you know how much? You simply count it up. If the rule is 16-byte alignment and you push 4 bytes, then you need to push 12 more or adjust the stack pointer by 12 more.

The key here is that if everyone agrees to keep the stack aligned then you don't actually have to look at the lower bits of the stack pointer, you just keep track of what you are pushing and popping before calling something else.
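That counting is simple enough to write down. A sketch in C, where padding_needed is a made-up helper and the alignment is assumed to be a power of two:

```c
#include <assert.h>
#include <stdint.h>

/* Extra bytes to subtract from the stack pointer so that the total change
 * (`bytes_pushed` plus the result) is a multiple of `alignment`. */
uint32_t padding_needed(uint32_t bytes_pushed, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    uint32_t remainder = bytes_pushed & (alignment - 1);
    return (alignment - remainder) & (alignment - 1);
}
```

Pushing 1 byte under a 4-byte rule needs 3 more, and pushing 4 bytes under a 16-byte rule needs 12 more, as in the two examples above. Nothing here looks at the stack pointer itself, only at what was pushed.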

If the stack is shared with the interrupt handlers (not really the case on a current x86 running an operating system, but still possible there and in many other use cases for general purpose processors), I have not seen this rule applied. You will see the compiler do a push or pop of less than the aligned size and then adjust with other pushes, pops, subtraction or addition. If an interrupt occurred between those, the handler would see an unaligned stack.

Some architectures will fault on unaligned accesses, a further reason for keeping the stack aligned.

If your code is not messing with the stack then you don't need to mess with the stack (pointer). Only if you use the stack in your code by allocating space on it (pushes or math on the stack pointer) do you need to care, and then you need to know the convention of the compiler whose code you are linking with and conform to it. If this is all assembly language and no compiler, then you decide the convention yourself and basically do whatever you want within the limitations of the processor itself.

As for your title question, this has nothing to do with assembly at all, nor with machine code. It has to do with your code and what it does. Assembly language is simply the language in which you convey how much you want to adjust the stack pointer. The instruction doesn't know or care about any of this; it takes the constant provided and applies it to the register. Assembly is one of the few languages, if not the only one, that lets you do math on the stack pointer register, so there is that connection. But alignment and assembly are not related.

Understanding stack alignment enforcement

Looking at -O0-generated machine code is usually a futile exercise. The compiler will emit whatever works, in the simplest possible way. This often leads to bizarre artifacts.

Stack alignment only refers to alignment of the stack frame. It is not directly related to the alignment of objects on the stack. GCC will allocate on-stack objects with the required alignment. This is simpler if GCC knows that the stack frame already provides sufficient alignment, but if not, GCC will use a frame pointer and perform explicit alignment.
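A quick way to see the difference between frame alignment and object alignment is to ask for an over-aligned local and check its address. This is a sketch that assumes a C11 compiler such as GCC or Clang that honours _Alignas(32) on automatic variables; the function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a local requested with 32-byte alignment really lands on a
 * 32-byte boundary, whatever the stack pointer was on entry. */
bool local_is_32_byte_aligned(void)
{
    _Alignas(32) unsigned char buffer[32];
    buffer[0] = 0;                          /* touch the object so it is used */
    return ((uintptr_t)buffer % 32u) == 0;
}

/* Self-check: the compiler, not the caller, is responsible for this. */
void check_local_alignment(void)
{
    assert(local_is_32_byte_aligned());
}
```

The incoming stack is at most guaranteed to be 16-byte aligned here, so for the 32-byte request the compiler has to realign inside the function, which is the explicit alignment mentioned above.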

Understanding stack alignment

rsp % 16 == 0 at _start - that's the OS entry point. It's not a function (there's no return address on the stack, instead RSP points at argc).
Unlike functions, RSP is aligned by 16 on entry to _start, as specified by the x86-64 System V ABI.

From _start, you're ready to call a function right away, without having to adjust the stack, because the stack should be aligned before call. call itself will add 8B of return address, and you can expect the rsp % 16 == 8 upon entry, one more push away from 16-byte alignment. That's guaranteed upon entry to any function¹.

Upon app entry, you can trust the kernel to give you 16-byte RSP alignment, or you could align the stack manually with and rsp, -16 before calling any other code conforming to the ABI. (Or, if you plan to use the C runtime lib, the entry point of your app code should be main, and you let libc's crt startup code run as _start. main is a normal function like any other, so RSP & 0xF == 0x8 on entry to it when it's eventually called.)

Footnote 1: Unless you build with special options that change the ABI, like -mpreferred-stack-boundary=3 instead of the default 4. But that would make it unsafe to call functions in any code compiled without that option. For an example, see "glibc scanf Segmentation faults when called from a function that doesn't align RSP".

Now, after pushing, the content of rsp became 0x7fffffffdce8. Is it a violation of the alignment requirements?

Yes. If at that point you called some more complex function, for example printf with non-trivial arguments (so that it would use SSE instructions in its implementation), it would very likely segfault.
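The bookkeeping can be modelled with plain integers. The sketch below is my own model, not anything from the ABI document: the helper names are hypothetical and 0x7fffffffdcf0 is just an example of an aligned starting value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The ABI wants rsp to be a multiple of 16 immediately before a call. */
bool ready_to_call(uint64_t rsp)
{
    return (rsp & 15u) == 0;
}

/* Model of a qword push: rsp drops by 8. */
uint64_t after_push(uint64_t rsp)
{
    return rsp - 8u;
}

/* Model of call: it pushes an 8-byte return address, so the callee
 * starts with rsp % 16 == 8. */
uint64_t after_call(uint64_t rsp)
{
    assert(ready_to_call(rsp));  /* the caller must have aligned the stack */
    return rsp - 8u;
}
```

ready_to_call(0x7fffffffdce8) is false, which is the violation asked about, and one more push makes it true again. Starting from an aligned value, after_call leaves rsp 8 bytes off alignment, as expected on entry to any function.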

About push byte 0xFF:

That's not a legal instruction in 64b mode (nor in 16- or 32-bit mode). It is not legal in the sense of a byte operand size: a byte immediate as the source value is legal, but the operand size can only be 16, 32 or 64 bits. So NASM will guess the operand size (any of the legal ones, naturally picking qword in 64b mode) and use the guessed size with the imm8 from the source.

BTW, use the -w+all option to make NASM emit a warning in such a case (it is sort of weird, but at least you can investigate):

warning: signed byte value exceeds bounds

For example, the legitimate push word 0xFF would push only two bytes onto the stack, the word value 0x00FF.

How to align the stack: if you already know the initial alignment, just adjust as needed before calling a subroutine that requires the ABI alignment (in common 64b code that is usually as simple as either not pushing anything, or doing one more redundant push, like push rbp).

If you are not sure about the alignment, use some spare register to store the original rsp (often rbp is used, so it also functions as the stack frame pointer), and then and rsp,-16 to clear the bottom bits.

Keep in mind, when creating your own ABI-conforming subroutines, that the stack was aligned before the call, so it is off by 8 bytes upon entry. Again, a simple push rbp is often enough to resolve several issues at the same time: it preserves the rbp value (so mov rbp, rsp is possible "for free") and aligns the stack for the rest of the subroutine.
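Here is that prologue modelled in C. It is a sketch of the common push rbp / sub rsp pattern under my own assumptions: the locals area is rounded up to a multiple of 16, and the names round_up_16 and rsp_after_prologue are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round a local-variable area up to a multiple of 16 bytes. */
uint64_t round_up_16(uint64_t bytes)
{
    return (bytes + 15u) & ~(uint64_t)15u;
}

/* Model of: push rbp; mov rbp, rsp; sub rsp, round_up_16(locals). */
uint64_t rsp_after_prologue(uint64_t rsp_on_entry, uint64_t locals)
{
    assert((rsp_on_entry & 15u) == 8u);  /* 8 bytes off alignment on entry */
    uint64_t rsp = rsp_on_entry - 8u;    /* push rbp restores 16-byte alignment */
    rsp -= round_up_16(locals);          /* a multiple of 16 keeps it aligned */
    return rsp;
}
```

With an entry value of 0x7fffffffdce8 and 20 bytes of locals the model ends at 0x7fffffffdcc0, which is 16-byte aligned and therefore safe for a further call.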

EDIT: about encoding, source size, and immediate size...

Unfortunately I'm not 100% sure how exactly this is supposed to be defined in NASM, but I think the push definition is so complex that it breaks NASM syntax a bit. It exhausts the current syntax to the point where you can't specify whether you mean the operand size or the source immediate size, so NASM silently assumes the size specifier is mainly the operand size, and lets it affect the immediate in certain cases.

With push byte 0xFF, NASM will take the byte part ALSO as the "operand size", not just as the immediate size. byte is not a legal operand size for push, so NASM will instead choose qword, the default in 64b mode. Then it will also treat byte as the immediate size and sign-extend the 0xFF to qword. I.e. this looks to me like a bit of undefined behaviour. The NASM creators probably don't expect you to specify the immediate size, because NASM optimizes for size: when you write push word -1, it will assemble that as "push word operand imm8". You can override that the other way, to make sure you get imm16, with push strict word -1.

See the machine code produced by the various combinations in 64b mode. Strictly speaking, some of them deserve at least a warning, or even an error, like "strict qword" producing only imm32 and not imm64 (an imm64 opcode does not exist, of course). And that is not even mentioning that the dword variants are effectively qword operand sizes, since you can't use a 32b operand size in 64b mode:

 6 00000000 6AFF                            push    -1
 7 00000002 6AFF                            push    strict byte 0xFF
 8          ******************       warning: signed byte value exceeds bounds
 9 00000004 6AFF                            push    byte 0xFF
10          ******************       warning: signed byte value exceeds bounds
11 00000006 6AFF                            push    strict byte -1
12 00000008 6AFF                            push    byte -1
13 0000000A 6668FF00                        push    strict word 0xFF
14 0000000E 6668FF00                        push    word 0xFF
15 00000012 6668FFFF                        push    strict word -1
16 00000016 666AFF                          push    word -1
17 00000019 68FF000000                      push    strict dword 0xFF
18 0000001E 68FF000000                      push    dword 0xFF
19 00000023 68FFFFFFFF                      push    strict dword -1
20 00000028 6AFF                            push    dword -1
21 0000002A 68FF000000                      push    strict qword 0xFF
22 0000002F 68FF000000                      push    qword 0xFF
23 00000034 68FFFFFFFF                      push    strict qword -1
24 00000039 6AFF                            push    qword -1
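The pattern in the listing (6A when the value fits a signed byte, 68 otherwise, unless strict forces the longer form) comes down to two small rules, sketched here in C. The function names are mine; this models the arithmetic only, not NASM's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sign-extend an 8-bit immediate to 64 bits, as happens to the imm8 of
 * the short push encoding (opcode 6A). */
int64_t sign_extend_imm8(uint8_t imm)
{
    int64_t value = imm;
    if (imm & 0x80u)
        value -= 256;      /* top bit set: the byte encodes a negative number */
    return value;
}

/* True if `value` survives a round trip through a sign-extended imm8. */
bool fits_imm8(int64_t value)
{
    return value >= -128 && value <= 127;
}

/* Self-check against the listing: -1 gets the short form, 0xFF does not,
 * and the byte 0xFF sign-extends to -1 rather than 255. */
void check_against_listing(void)
{
    assert(fits_imm8(-1));
    assert(!fits_imm8(0xFF));
    assert(sign_extend_imm8(0xFF) == -1);
}
```

This is why push word -1 shrinks to 666AFF while push word 0xFF needs the imm16 form 6668FF00, and why push byte 0xFF ends up pushing -1 with a warning.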

Anyway, I guess not too many people are bothered by this, as in 64b mode you usually want a qword push (rsp -= 8) with the immediate encoded in the shortest possible way. So you just write push -1 and let NASM handle the imm8 optimization itself, expecting rsp to change by -8 of course. In the other cases, they probably expect you to know the legal operand sizes and not to use byte at all.

If you think this is not acceptable, I would raise the question of how it is supposed to work exactly on the NASM forum/bugzilla/somewhere. As far as I'm personally concerned, the current behaviour is "good enough" for me: it makes sense, and I take a quick look at the listing file from time to time to verify that there's no nasty surprise in the machine code bytes and that everything landed as expected. That said, I mostly code size intros, so I know about every byte produced and its purpose. If NASM suddenly produced imm16 instead of the expected imm8, I would see it in the binary size and investigate.

What's the purpose of stack pointer alignment in the prologue of main()

The System V AMD64 ABI (x86-64 ABI) requires 16-byte stack alignment. double requires 8-byte alignment and SSE extensions require 16-byte alignment.

The GCC documentation notes this under the -mpreferred-stack-boundary option:

-mpreferred-stack-boundary=num

Attempt to keep the stack boundary aligned to a 2 raised to num byte boundary. If -mpreferred-stack-boundary is not specified, the default is 4 (16 bytes or 128 bits).

Warning: When generating code for the x86-64 architecture with SSE extensions disabled, -mpreferred-stack-boundary=3 can be used to keep the stack boundary aligned to 8 byte boundary. Since x86-64 ABI require 16 byte stack alignment, this is ABI incompatible and intended to be used in controlled environment where stack space is important limitation. This option leads to wrong code when functions compiled with 16 byte stack alignment (such as functions from a standard library) are called with misaligned stack. In this case, SSE instructions may lead to misaligned memory access traps. In addition, variable arguments are handled incorrectly for 16 byte aligned objects (including x87 long double and __int128), leading to wrong results. You must build all modules with -mpreferred-stack-boundary=3, including any libraries. This includes the system libraries and startup modules.
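The relationship between num and the boundary is just a power of two. A tiny C sketch follows; both helper names are hypothetical, and the 16-byte threshold is the x86-64 ABI requirement quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Stack boundary in bytes for a given -mpreferred-stack-boundary=num. */
unsigned stack_boundary_bytes(unsigned num)
{
    assert(num < 32u);   /* keep the shift well defined */
    return 1u << num;
}

/* True if that boundary satisfies the 16-byte x86-64 ABI requirement. */
bool meets_x86_64_abi(unsigned num)
{
    return stack_boundary_bytes(num) >= 16u;
}
```

So the default of 4 gives 16 bytes and passes, while 3 gives 8 bytes and falls short, which is the ABI-incompatible case the warning describes.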

Stack alignment and ISR

For ARMv7M (including the Cortex-M3 in your atsam3x) the stack alignment in interrupt handlers is controlled by hardware.

Firstly, it is impossible to ever have the stack pointer aligned any worse than 4 bytes. This is because the bottom two bits of the stack pointer are always zero and no instruction can ever change them. The compiler knows this and so if you create char[3] it rounds it up to 4 bytes.

If the STKALIGN bit of the CCR control register is 0 then this is all that happens. The stack pointer is aligned to a multiple of 4 bytes on entry to an interrupt handler function.

If the STKALIGN bit is 1 then the hardware automatically aligns the stack to an 8-byte boundary on entry to an interrupt.

On Cortex-M3 the reset value of the CCR.STKALIGN is 1, and ARM strongly recommend that you do not change it.
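What the hardware does on exception entry can be modelled in a few lines of C. This is a sketch under my own assumptions: a basic 8-word exception frame (32 bytes, no floating-point context), which is not stated above, and a made-up function name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the stack pointer after exception entry on ARMv7-M.
 * Assumes the basic frame of 8 stacked words (32 bytes). */
uint32_t sp_after_exception_entry(uint32_t sp, bool stkalign)
{
    assert((sp & 3u) == 0);        /* SP is always at least 4-byte aligned */
    uint32_t frame = sp - 32u;     /* room for the 8 stacked registers */
    if (stkalign)
        frame &= ~4u;              /* clear bit 2: force 8-byte alignment */
    return frame;
}
```

Starting from 0x20001004 the model gives 0x20000FE0 with STKALIGN set (8-byte aligned) and 0x20000FE4 with it clear (only 4-byte aligned).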

In the ARM ABI it is the responsibility of the caller to align the stack. This is because there is a 50:50 chance that it knows it is already aligned without doing anything, so this is much more efficient.

If your compiler is configured to generate code for the ARM ABI then it will assume that the stack is correctly aligned to an 8-byte boundary on entry to any externally linked function and not generate any code to align it again in the called function.

On ARMv7M (and v6M) it is normal and correct to use a bare function as an interrupt handler. There is no ISR prolog/epilog as mentioned in some of the comments.

All of this combined means that as long as your compiler is configured to use the ARM ABI, and as long as you haven't changed the default value of CCR.STKALIGN then your stack will always be correctly aligned.

Correct stack alignment for call to printf?

I have seen examples where the stack pointer/esp is decremented by 4 before calling printf and re-adjusted by 12 after calling printf:

According to a comment in another question, the stack shall be aligned at 16 bytes on systems (e.g. libraries, operating systems) that can use SSE instructions.

Assuming the stack pointer is aligned correctly when the function (main) is called, the call instruction subtracts 4 bytes from esp, so the sub and push instructions must together subtract exactly 12, 28, 44 ... bytes from esp to keep the stack pointer aligned correctly.

sub esp, 8

Obviously, in this case the compiler is not told to care for a 16-byte stack alignment.

And obviously the compiler allocates more stack than necessary in this case.

I have just told the compiler to generate an 8-byte and a 16-byte alignment for the stack; all other compiler options (and of course the source code) were the same.

The difference was that in the case of the 8-byte alignment, the compiler generated sub esp, 4, in the case of the 16-byte alignment sub esp, 20.

Obviously, this is a problem in the compiler optimization:

If sub esp,20 aligns the stack to 16 bytes, sub esp, 4 will also align to 16 bytes.

And using the "align to 8 byte" option shows that it would definitely be possible to do a sub esp, 4 instead of a sub esp, 20.

This shows that some compilers reserve more stack than necessary for some unknown purposes.
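The arithmetic in this answer can be double-checked with a short C sketch. It assumes the 32-bit model described above (esp 16-byte aligned before the call to main, 4 bytes of return address), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if esp is 16-byte aligned again after the 4-byte return address
 * plus `bytes_subtracted` more bytes of sub/push. */
bool aligned_before_next_call(uint32_t bytes_subtracted)
{
    return ((4u + bytes_subtracted) & 15u) == 0;
}

/* True if two adjustments leave esp with the same alignment modulo 16. */
bool same_effect_mod_16(uint32_t a, uint32_t b)
{
    return (a & 15u) == (b & 15u);
}

/* Self-check of the numbers used in the answer. */
void check_answer_arithmetic(void)
{
    assert(aligned_before_next_call(12u));
    assert(same_effect_mod_16(4u, 20u));
}
```

The valid totals come out as 12, 28, 44 and so on, and sub esp, 4 and sub esp, 20 are interchangeable as far as 16-byte alignment goes; they only differ in how much stack they reserve.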

Understanding stack allocation and alignment

I think you're missing the fact that there is no requirement for all stack variables to be individually aligned to 16-byte boundaries.
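A small C sketch makes the point: each local only has to satisfy the alignment of its own type. The function names are made up, and the exact alignment values depend on the platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if each local sits at an address that satisfies the alignment of
 * its own type. None of these types is required to be 16-byte aligned. */
bool locals_meet_own_alignment(void)
{
    char   c = 0;
    int    i = 0;
    double d = 0.0;

    return ((uintptr_t)&c % _Alignof(char))   == 0
        && ((uintptr_t)&i % _Alignof(int))    == 0
        && ((uintptr_t)&d % _Alignof(double)) == 0;
}

/* Self-check: a char never needs more than byte alignment. */
void check_local_requirements(void)
{
    assert(_Alignof(char) == 1);
    assert(locals_meet_own_alignment());
}
```

The 16-byte rule applies to the stack pointer at call boundaries; inside the frame the compiler is free to pack a char next to an int as long as each one meets its own requirement.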
