Why Does The Stack Have to Be Page Aligned

What does it mean to align the stack?

Assume the stack looks like this on entry to _main (the address of the stack pointer is just an example):

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230

Push %ebp, and subtract 8 from %esp to reserve some space for local variables:

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230
|      %ebp       |
+-----------------+  <--- 0xbfff122c
:    reserved     :
:     space       :
+-----------------+  <--- 0xbfff1224

Now, the andl instruction zeroes the low 4 bits of %esp, which may decrease it; in this particular example, it has the effect of reserving an additional 4 bytes:

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230
|      %ebp       |
+-----------------+  <--- 0xbfff122c
:    reserved     :
:     space       :
+ - - - - - - - - +  <--- 0xbfff1224
:   extra space   :
+-----------------+  <--- 0xbfff1220

The point of this is that there are some "SIMD" (Single Instruction, Multiple Data) instructions (also known in x86-land as "SSE" for "Streaming SIMD Extensions") which can perform parallel operations on multiple words in memory, but require those multiple words to be a block starting at an address which is a multiple of 16 bytes.

In general, the compiler can't assume that particular offsets from %esp will result in a suitable address (because the state of %esp on entry to the function depends on the calling code). But, by deliberately aligning the stack pointer in this way, the compiler knows that adding any multiple of 16 bytes to the stack pointer will result in a 16-byte aligned address, which is safe for use with these SIMD instructions.
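The masking step can be checked with a few lines of C. This is only a sketch: align_down is a made-up helper name, and the 16 stands for the 16-byte boundary you get by clearing the low 4 bits, as the andl in the example does.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a multiple of align (align must be a power of two).
   Clearing the low bits like this is what the andl does to %esp. */
static uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}
```

With the example addresses above, align_down(0xbfff1224, 16) gives 0xbfff1220, i.e. 4 extra bytes are reserved. An address that is already aligned is left unchanged.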

What is stack alignment?

Alignment of variables in memory (a short history).

In the past, computers had an 8-bit data bus. This means that 8 bits of information could be processed each clock cycle, which was fine at the time.

Then came 16-bit computers. For backward compatibility and other reasons, the 8-bit byte was kept and the 16-bit word was introduced. Each word was 2 bytes, and 16 bits of information could be processed each clock cycle. But this posed a small problem.

Let's look at a memory map:

+----+
|0000| 
|0001|
+----+
|0002|
|0003|
+----+
|0004|
|0005|
+----+
| .. |

At each address there is a byte which can be accessed individually. But words can only be fetched at even addresses. So if we read a word at 0000, we read the bytes at 0000 and 0001. But if we want to read the word at position 0001, we need two read accesses: first 0000,0001 and then 0002,0003, of which we keep only 0001,0002.

Of course this took some extra time, and that was not appreciated. That is why alignment was invented: we store word variables at word boundaries and byte variables at byte boundaries.
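The two-access read of a misaligned word can be modelled in C. This is a toy model, not how any real bus is programmed: fetch_aligned_word and read_word are invented names, and the byte order is assumed to be little-endian.

```c
#include <assert.h>
#include <stdint.h>

/* Toy memory model: only 16-bit words at even addresses can be fetched. */
static uint16_t fetch_aligned_word(const uint8_t *mem, unsigned addr)
{
    assert((addr & 1u) == 0);
    return (uint16_t)(mem[addr] | (mem[addr + 1] << 8)); /* little-endian */
}

/* Read a 16-bit word at any address and report how many fetches it took. */
static uint16_t read_word(const uint8_t *mem, unsigned addr, int *fetches)
{
    if ((addr & 1u) == 0) {
        *fetches = 1;
        return fetch_aligned_word(mem, addr);
    }
    /* Misaligned: fetch the two neighbouring words and keep the middle bytes. */
    uint16_t lo = fetch_aligned_word(mem, addr - 1); /* bytes addr-1, addr   */
    uint16_t hi = fetch_aligned_word(mem, addr + 1); /* bytes addr+1, addr+2 */
    *fetches = 2;
    return (uint16_t)((lo >> 8) | ((hi & 0xFF) << 8));
}
```

Reading at address 0 costs one fetch. Reading at address 1 costs two, which is the extra time that alignment avoids.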

For example, if we have a structure with a byte field (B) and a word field (W) (and a very naive compiler), we get the following:

+----+
|0000| B
|0001| W
+----+
|0002| W
|0003|
+----+

This is not fun, because W straddles two words and needs two accesses. But when using word alignment we find:

+----+
|0000| B
|0001| -
+----+
|0002| W
|0003| W
+----+

Here memory is sacrificed for access speed.

You can imagine that when using double words (4 bytes) or quad words (8 bytes) this is even more important. That's why most modern compilers let you choose which alignment to use when compiling the program.
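Besides compiler options, C11 also lets you request an alignment in the source itself. The following is a small sketch of that, assuming a C11 compiler. The struct Packet and the choice of 8 bytes are purely illustrative.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct: `value` would naturally need 2-byte alignment,
   but we ask the compiler for an 8-byte boundary instead. */
struct Packet {
    uint8_t tag;
    alignas(8) uint16_t value;
};

/* A struct is at least as strictly aligned as its strictest member. */
static_assert(alignof(struct Packet) >= 8, "Packet should be 8-byte aligned");

/* Offset of `value` from the start of the struct, in bytes. */
static size_t value_offset(void)
{
    return offsetof(struct Packet, value);
}
```

The compiler inserts padding after tag so that value_offset() is a multiple of 8. That is the same memory-for-speed trade described above, just with a larger boundary.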

What's the purpose of page-aligned allocation?

It is probably related to paging and to the virtual address space. The page size is tied to the MMU, so it is usually constrained by the hardware.

On some operating systems, some functions or system calls want (i.e. require) a page-aligned pointer. For example, on Linux: mmap(2) (notably when you use it with MAP_FIXED, it wants a genuine page-aligned address), mprotect(2), madvise(2), mlock(2), mremap(2); see also the related shm_overview(7).

Some very low-level IO operations might prefer page-aligned buffers (e.g. run faster with them), perhaps send(2) or a direct write(2) to some block device, because the kernel might avoid some block copy (e.g. do some DMA) and could special-case page-aligned data.
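One way to get such a buffer from user space is posix_memalign(3) with the page size reported by sysconf(3). This is a minimal sketch, assuming a POSIX system. The helper names alloc_page_aligned and is_page_aligned are made up for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for posix_memalign and sysconf */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate `pages` whole pages starting on a page boundary.
   Returns NULL on failure; release the buffer with free(). */
static void *alloc_page_aligned(size_t pages)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *p = NULL;
    if (page_size <= 0 || pages == 0)
        return NULL;
    if (posix_memalign(&p, (size_t)page_size, pages * (size_t)page_size) != 0)
        return NULL;
    return p;
}

/* True if p is a multiple of the system page size. */
static int is_page_aligned(const void *p)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    return ((uintptr_t)p % (uintptr_t)page_size) == 0;
}
```

A buffer obtained this way is suitably aligned for interfaces that want a page-aligned address. Whether a given call actually runs faster with it depends on the kernel and the device.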

Why does the stack have to be page aligned?

There is a limit in do_page_fault() as to how far outside the stack vma you can be before it considers the access a bad one. Perhaps you're hitting that?

What does aligning the stack mean in assembly?

Addressing is generally byte-based. A unique address points at a byte (which can be the first byte in a word or doubleword, etc, but referenced to that address).

With any positional numbering system the least significant digit has the weight base to the power 0 (the number 1), the next digit base to the power 1, the next base to the power 2. In decimal these are the ones column, the tens column, the hundreds column. In binary: ones, twos, fours... Alignment means the address is evenly divisible by some power of two, which also means its least significant bits are zeros.

You are always "aligned" on a byte boundary, but alignment on a 16 bit boundary means the least significant bit is zero, 32 bit alignment means the two least significant bits are zero, and so on.

0x1234 aligned on both a 16 and 32 bit boundary but not 64 bit

0x1235 not aligned (byte alignment really isn't a thing)

0x1236 aligned on a 16 bit boundary

0x1230 four zeros so 16, 32, 64, 128 BITS not bytes. 2,4,8,16 bytes.
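The low-bits test is a one-liner in C. Here is a sketch, with is_aligned as an invented helper that takes the alignment in bytes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of `bytes`, where `bytes` is a power of two.
   Equivalent to checking that the low log2(bytes) bits are all zero. */
static bool is_aligned(uintptr_t addr, uintptr_t bytes)
{
    assert(bytes != 0 && (bytes & (bytes - 1)) == 0);
    return (addr & (bytes - 1)) == 0;
}
```

Applied to the four addresses above: 0x1234 passes for 2 and 4 bytes but not 8, 0x1235 fails for 2, 0x1236 passes only for 2, and 0x1230 passes for 2, 4, 8 and 16.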

The why is performance. All memories have a fixed width, and so do data buses. You can't magically add or remove wires in the logic once it is implemented; there is a physical limit. You can choose not to use all of them as part of the design, but you can't add any.

So while the x86 buses are wider, let's say you had a 32 bit wide data bus as well as a 32 bit wide memory (think cache but also dram but we don't access dram directly in general).

If I want to save the 16 bits 0xAABB to address 0x1001 in a little endian machine then 0x1001 will get 0xBB and 0x1002 will get 0xAA. If I had a 32 bit data bus and a 32 bit memory on the far side of it then I could move those 16 bits if I designed the bus for this, by writing 0xXXAABBXX to address 0x1000 with a byte lane mask of 0b0110 telling the memory controller to use the 32 bits of memory associated with the BYTE based address 0x1000, and the byte lane mask on the bus telling the controller only save the middle two bytes, the outer two are don't cares.

The memory is generally a fixed width, so all transactions must be full width: it would read the 32 bits, modify the 16 in the middle with 0xAABB, and write the 32 bits back. This is of course inefficient. Even worse would be to write 0xAABB to 0x1003: that would be two bus transactions, one for 0xBBXXXXXX at address 0x1000 and one for 0xXXXXXXAA at address 0x1004. That is a lot of extra cycles, both on the bus and in the read-modify-writes on the memory.
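The transaction count and the byte lane mask for such a write can be worked out mechanically. The sketch below assumes the 32 bit bus from the example. The function names are made up, and lane 0 is taken to be the byte at the lowest address.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 32-bit bus transactions needed to write `size` bytes at `addr`:
   one per aligned 4-byte word that the write touches. */
static unsigned bus_transactions(uint32_t addr, uint32_t size)
{
    assert(size > 0);
    uint32_t first_word = addr / 4;
    uint32_t last_word = (addr + size - 1) / 4;
    return last_word - first_word + 1;
}

/* Byte lane mask for the first transaction: bit i is set when the byte
   at (word base + i) is written. */
static unsigned first_lane_mask(uint32_t addr, uint32_t size)
{
    unsigned mask = 0;
    uint32_t lane = addr % 4;
    for (uint32_t i = 0; i < size && lane + i < 4; i++)
        mask |= 1u << (lane + i);
    return mask;
}
```

For the 16 bit write to 0x1001 this gives one transaction with mask 0b0110. For the write to 0x1003 it gives two transactions, the first with mask 0b1000.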

Now, the stack alignment rules are not going to prevent all read-modify-writes on writes. But where larger transfers happen there are opportunities for a performance gain. For example, if the bus and the memory were both 32 bits wide and you did a 64 bit transfer to address 0x1000, that can, depending on the bus design, look like a single transfer with a length of two. The bus handshake happens once and then the data moves on two back-to-back clocks, rather than a handshake for each bus width of data. And since the memory is 32 bits wide, that is two writes into the SRAM in the cache without a read-modify-write. Pretty clean; you want to avoid the read-modify-writes.

Now carry this on for a while as things evolve, and both the hardware and the tools come to want an aligned stack.

Depending on the instruction set (clearly you are asking about x86 here), as a programmer you can sometimes choose to, say, push a byte on the stack and then adjust the stack pointer to align it. Or, if you are making room for local variables and the stack pointer is general purpose enough to do math on, you can simply subtract: sub sp,#8 is the same as pushing two 32 bit items to the stack simply to make room for two 32 bit items.

If the rule is say 32 bit alignment and you push a byte, then you need to adjust the stack pointer by 3 to make the total change in the stack pointer a multiple of 4 bytes (32 bits).

How do you know how much? You simply count it up. If the rule is 16 byte alignment and you push 4 bytes, then you need to push 12 more or adjust the stack pointer by 12 more.
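The counting can be written down as a tiny helper. This is just an illustration of the arithmetic, and padding_needed is not a name from any toolchain.

```c
#include <assert.h>
#include <stddef.h>

/* Extra bytes to add after pushing `pushed` bytes so that the total change
   in the stack pointer is a multiple of `align`. */
static size_t padding_needed(size_t pushed, size_t align)
{
    assert(align > 0);
    return (align - pushed % align) % align;
}
```

So padding_needed(1, 4) is 3 and padding_needed(4, 16) is 12, matching the two cases above. If the amount pushed is already a multiple of the alignment, the result is 0.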

The key here is that if everyone agrees to keep the stack aligned then you don't actually have to look at the lower bits of the stack pointer, you just keep track of what you are pushing and popping before calling something else.

If the stack is shared with the interrupt handlers (not really the case on a current x86 running an operating system, but still possible, and possible in many other use cases for general purpose processors), I have not seen this rule applied: you will see the compiler do a push or pop smaller than the alignment and then adjust with other pushes or pops, or with subtraction or addition. If an interrupt occurred between those, the handler would see an unaligned stack.

Some architectures will fault on unaligned accesses, a further reason for keeping the stack aligned.

If your code is not messing with the stack then you don't need to mess with the stack (pointer). Only if you use the stack in your code by allocating space on it (pushes or math on the stack pointer) do you need to care, and then you need to know the convention of the compiler whose output you are linking this code with, and conform to that. If this is all assembly language and no compiler, then you decide the convention yourself and basically do whatever you want within the limitations of the processor itself.

As for your title question, it has nothing to do with assembly at all, nor with machine code. It has to do with your code and what it does. Assembly language is simply the language in which you convey how much you want to adjust the stack pointer; the instruction doesn't know or care about any such thing, it takes the constant provided and applies it to the register. Assembly is one of the few languages, if not the only one, that allows you to do math on the stack pointer register, so there is that connection. But alignment and assembly are not related.

Why does the MIPS stack pointer need to be kept double word aligned?

The MIPS architecture can only access data types in memory that are evenly aligned with their size.

"See MIPS Run" by Dominic Sweetman says on page 320:

At the point where a subroutine is called, sp must be eight-byte-aligned, matching the alignment of the largest basic types - a long long integer, or a floating-point double. The eight-byte alignment is not required by 32-bit MIPS integer hardware, but it is essential for compatibility...

Thus, if you never try to push a double to the stack, you can very well live with 4-byte alignment on a 32-bit system. Whether your OS can, is another question, though.

What is the alignment of data on the stack?

There are two reasons to align data:

Hardware requirement. Some machines can only access data in memory if it's properly aligned. Sure, you could perform multiple reads and use some bit arithmetic to emulate reading from any address, but that would be devastating to performance.
Performance. Even if a machine can access any data at any address, it might perform better if the data is suitably aligned.

Of course, this could vary by machine, but "suitably aligned" usually means the address of an N bit datum is evenly divisible by N/8.

So, on a machine where alignment matters, a 32-bit int would be placed at a memory address divisible by 4, a 64-bit pointer would be placed at a memory address divisible by 8, etc.

You can see this in structures.

#include <stdint.h>
#include <stdio.h>

typedef struct {
   uint32_t u32;
   void*    p;
   uint8_t  u8;
} Struct;

int main(void) {
   Struct s;
   printf("%p\n", (void*)&s.u32);
   printf("%p\n", (void*)&s.p);
   printf("%p\n", (void*)&s.u8);
   printf("%p\n", (void*)(&s+1));
   printf("0x%zx\n", sizeof(s));
}

$ gcc -Wall -Wextra -pedantic a.c -o a && ./a
0x7ffef5f775d0
0x7ffef5f775d8
0x7ffef5f775e0
0x7ffef5f775e8
0x18

This means we have this:

 0 1 2 3 4 5 6 7 8 9 a b c d e f 0 1 2 3 4 5 6 7
+-------+-------+---------------+-+-------------+ 
| u32   |XXXXXXX| p             |*|XXXXXXXXXXXXX|   * = u8 
+-------+-------+---------------+-+-------------+   X = unused

Note the wasted space between u32 and p. This is so p is properly aligned.

Also note the wasted space after u8. This is so the structure itself is properly aligned when you have an array of them. Without this final padding, the u32 and p of the second element of the array wouldn't be properly aligned.

Finally, note that using

typedef struct {
   uint32_t u32;
   uint8_t  u8;
   void*    p;
} Struct;

would have resulted in a smaller structure.

 0 1 2 3 4 5 6 7 8 9 a b c d e f 
+-------+-+-----+---------------+
| u32   |*|XXXXX| p             |   * = u8 
+-------+-+-----+---------------+   X = unused
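The two layouts can also be compared without printing addresses. In this sketch the type names Original and Reordered and the helper padding_bytes are made up, and the exact sizes depend on the platform's ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two member orderings discussed above, under illustrative names. */
typedef struct { uint32_t u32; void *p; uint8_t u8; } Original;
typedef struct { uint32_t u32; uint8_t u8; void *p; } Reordered;

/* Padding = size of the whole struct minus the sum of its member sizes. */
static size_t padding_bytes(size_t struct_size)
{
    size_t members = sizeof(uint32_t) + sizeof(void *) + sizeof(uint8_t);
    assert(struct_size >= members);
    return struct_size - members;
}
```

On the x86-64 machine used for the output above, sizeof(Original) is 24 (0x18) and sizeof(Reordered) would be 16, so the padding drops from 11 bytes to 3. On a platform with 4-byte pointers the two sizes may well be equal.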

What are benefits of allocating a page-aligned memory chunk?

The "traditional" way to allocate memory is to have it in a contiguous address space (the "heap", growing upwards by calls to sbrk()). Each time you hit a page boundary, there will be a page fault and you get mapped a new page. There are two consequences of this strategy:

Pages can only be freed when all allocations inside that page are freed AND when all other allocations are mapped to lower addresses (the typical effect of heap fragmentation).
Larger allocations might occupy one page more than strictly needed (if they start somewhere in the middle of a page).

So this strategy is only suitable for smaller blocks of memory where you don't want to "waste" a whole page for each allocation.

For bigger chunks, it's better to use mmap(), which maps new pages for you directly, so you get "page aligned memory". This way, your allocation doesn't share pages with other allocations. As soon as you don't need the memory any more, you can give it back to the OS. Note that many malloc() implementations choose automatically whether to allocate using sbrk() or mmap(), depending on the size of the desired allocation.
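A direct mmap() allocation might look like the sketch below. It assumes Linux (MAP_ANONYMOUS is not in POSIX), and map_pages and unmap_pages are hypothetical helper names.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* System page size in bytes. */
static size_t page_size_bytes(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    return (size_t)page_size;
}

/* Map `pages` fresh, zero-filled, private pages. Returns NULL on failure. */
static void *map_pages(size_t pages)
{
    if (pages == 0)
        return NULL;
    void *p = mmap(NULL, pages * page_size_bytes(), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Give the pages back to the OS. Returns 0 on success, like munmap(2). */
static int unmap_pages(void *p, size_t pages)
{
    return munmap(p, pages * page_size_bytes());
}
```

The address returned by map_pages() is always a multiple of the page size, and unmap_pages() returns the memory to the OS immediately, without waiting for neighbouring allocations to be freed.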