How Is Stack Memory Allocated When Using 'Push' or 'Sub' X86 Instructions

How is Stack memory allocated when using 'push' or 'sub' x86 instructions?

How does the Linux kernel avoid the stack overwriting the text (instructions)?

I'm asking about how the kernel enforces the stack size of user programs.

There's a growth limit, set with ulimit -s for the main stack, that will stop the stack from getting anywhere near .text. (And the guard pages below that make sure there's a segfault if the stack does overflow past the growth limit.) See How is Stack memory allocated when using 'push' or 'sub' x86 instructions?. (Or for thread stacks (not the main thread), stack memory is just a normal mmap allocation with no growth; the only lazy allocation is physical pages to back the virtual ones.)

Also, .text is a read+exec mapping of the executable, so there's no way to modify it without calling mprotect first. (It's a private mapping, so doing so would only affect the pages in memory, not the actual file on disk. This is how text relocations work: runtime fixups for absolute addresses, applied by the dynamic linker.)
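As a rough illustration of that (a hedged sketch, not from any of the linked answers): on x86-64 Linux, making a page of .text writable means issuing an mprotect system call first, along these lines, where some_code is a hypothetical label inside the executable's own .text:

lea  rdi, [rel some_code]   ; address inside .text (hypothetical label)
and  rdi, -4096             ; round down to a page boundary
mov  esi, 4096              ; length: one page
mov  edx, 7                 ; PROT_READ | PROT_WRITE | PROT_EXEC
mov  eax, 10                ; __NR_mprotect on x86-64
syscall                     ; after this succeeds, stores to that page are allowed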

The actual mechanism for limiting growth is simply to not extend the mapping and allocate a new page when the process triggers a hardware page fault below the existing stack area and the growth limit has been reached. The page fault is then treated as invalid, instead of as a soft (aka minor) fault like the normal stack-growth case, so a SIGSEGV is delivered.


If a program used alloca or a C99 VLA with an unchecked size, malicious input could make it jump over any guard pages and into some other read/write mapping such as .data or stuff that's dynamically allocated.

To harden buggy code against that so it segfaults instead of actually allowing a stack clash attack, there are compiler options that make it touch every intervening page as the stack grows, so it's certain to set off the "tripwire" in the form of an unmapped guard page below the stack-growth limit. See Linux process stack overrun by local variables (stack guarding)
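As a minimal sketch of that idea (not GCC's exact code sequence): with RAX assumed to hold an allocation size already rounded up to a multiple of the 4 KiB page size, page-by-page probing looks something like this:

probe_page:
    sub   rsp, 4096          ; move the stack pointer down one page at a time
    or    qword [rsp], 0     ; touch the new page, so an unmapped guard page faults here
    sub   rax, 4096
    jnz   probe_page         ; repeat until the whole allocation has been probed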

If you set ulimit -s unlimited, could you maybe grow the stack into some other mapping, if Linux truly does allow unlimited growth in that case without reserving a guard page as you approach another mapping?

What is the function of the push / pop instructions used on registers in x86 assembly?

Pushing a value (not necessarily stored in a register) means writing it to the stack.

Popping means restoring whatever is on top of the stack into a register. They're basic instructions:

push 0xdeadbeef      ; push a value to the stack
pop  eax             ; eax is now 0xdeadbeef

; swap contents of registers
push eax
mov eax, ebx
pop ebx
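Ignoring details like FLAGS and the push esp special case (there's more on that further down), a conceptual sketch of what the 32-bit forms do is:

; push eax is roughly:
sub  esp, 4
mov  [esp], eax

; pop eax is roughly:
mov  eax, [esp]
add  esp, 4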

How do programs know how much space to allocate for local variables on the stack?

The compiler knows because it looked at the source code (or actually its internal representation of the logic after parsing it) and added up the total size needed for all the things it had to allocate stack space for. It also has to get RSP 16-byte aligned before making any call, given that RSP % 16 == 8 on function entry.

So alignment is one reason compilers may reserve more than the function actually uses, but compiler missed-optimization bugs can also make them waste space: it's common for GCC to waste an extra 16 bytes, although that's not happening here.
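As a hedged sketch of how that adds up (a hypothetical function, not any particular compiler's output): a non-leaf function that only needs 4 bytes for an int local might still reserve 24, so that RSP % 16 == 0 at the call:

my_func:                      ; on entry, RSP % 16 == 8 (the call pushed a return address)
    sub   rsp, 24             ; 4 bytes of locals, rounded up to keep the call site aligned
    mov   dword [rsp+12], 0   ; the int local lives somewhere in the reserved space
    call  other_func          ; hypothetical callee; RSP % 16 == 0 here
    add   rsp, 24
    ret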

Yes, modern compilers parse the entire function (actually the whole source file) before emitting any code for it. That's kind of the point of an ahead-of-time optimizing compiler, so it's designed around doing that, even if you make a debug build. By comparison, TCC, the Tiny C Compiler, is one-pass: it leaves a spot in its function prologue and goes back to fill in the total size after reaching the bottom of the function in the source code. See Tiny C Compiler's generated code emits extra (unnecessary?) NOPs and JMPs - when that number happens to be zero, there's still a sub esp, 0 there. (TCC only targets 32-bit mode.)

Related: Function Prologue and Epilogue in C


In leaf functions, compilers targeting the x86-64 System V ABI can use the red zone (the 128 bytes below RSP), avoiding the need to reserve as much (or any) stack space even if there are some locals they choose to spill/reload (e.g. any at all in unoptimized code). That doesn't apply to kernel code, or other code compiled with -mno-red-zone. See also Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?
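A hedged sketch of what that looks like (hypothetical leaf function, x86-64 System V): locals can be spilled below RSP without ever adjusting it, because signal handlers aren't allowed to clobber the 128-byte red zone:

leaf_sum:                     ; hypothetical int leaf_sum(int a, int b)
    mov   [rsp-4], edi        ; spill the first arg into the red zone; no sub rsp needed
    mov   [rsp-8], esi        ; spill the second arg
    mov   eax, [rsp-4]
    add   eax, [rsp-8]        ; return a + b in EAX
    ret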

Or in Windows x64, callers need to reserve shadow space for their callee to use, which also gives small functions the chance not to spend any instructions moving RSP around, just using the shadow space above their return address. But for non-leaf functions, this means reserving at least 32 bytes of shadow space plus whatever they need for alignment or locals. See for example Shadow space example
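A hedged sketch of that arrangement (hypothetical caller and callee, Windows x64):

caller:
    sub   rsp, 40             ; 32 bytes of shadow space + 8 to keep RSP 16-byte aligned
    mov   ecx, 1              ; first integer arg goes in RCX
    call  small_callee        ; hypothetical callee
    add   rsp, 40
    ret

small_callee:
    mov   [rsp+8], ecx        ; the arg's home slot, inside the caller's shadow space
    mov   eax, [rsp+8]        ; a small leaf callee never had to adjust RSP itself
    ret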

In standard calling conventions for ISAs other than x86-64, other rules (different alignment requirements, red-zone rules, and so on) may come into play that affect how much space a function reserves.


Note that in 64-bit code, leave pops RBP, not EBP, and that ret pops into RIP, not EIP.
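For example, a conventional RBP frame in 64-bit code (a generic sketch, not your compiler's exact output) ends like this:

func_with_frame:
    push  rbp
    mov   rbp, rsp
    sub   rsp, 16             ; reserve space for locals
    mov   dword [rbp-4], 0    ; a local variable
    leave                     ; mov rsp, rbp / pop rbp: restores 64-bit RBP
    ret                       ; pops the 64-bit return address into RIP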

Also, mov ecx,DWORD PTR [rbp-0x4] is not variable initialization. That's a load, from uninitialized memory into a register. Probably you did something like int a,b,c; without initializers, then passed them as args to printf.

How does the stack pointer register work?

I think that's supposed to read

sub  sp, 2       ; AX is only 2 bytes wide, not 4
mov  [sp], ax    ; store to memory, not writing the register

That is, put the value of ax into the memory pointed to by sp.

Perhaps your sub sp, 4 came from pushing a 32-bit register? The stack pointer always decreases by the push operand-size.
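For example, in 16-bit code (assuming that's the mode in question; just a sketch):

push ax           ; SP -= 2: 16-bit operand-size
push eax          ; SP -= 4: 32-bit push (operand-size prefix), one way to end up at sub sp, 4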

(Note that push doesn't modify FLAGS, unlike sub, so this pseudocode isn't exactly equivalent; it also breaks down for the push sp case. See Intel's manual, or this Q&A, for pseudocode that works even in those cases.)

Why do function parameters occupy at least 4 bytes of stack on x86?

Such behaviour is normally governed by the Application Binary Interface (ABI), and the most widely used x86 ABIs (Win32 and Sys V) just require that each parameter occupies at least 4 bytes. This is mainly because most x86 implementations suffer performance penalties if data is not properly aligned. While your example would not "de-align" the stack, a subroutine taking only three byte-sized parameters would do so. Of course, one could define special rules in the ABI to overcome this, but it complicates things for little gain.
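For illustration (a sketch of a 32-bit cdecl call, with a hypothetical callee): four char arguments still get four full dword slots, conventionally written as four pushes of 32-bit registers:

push edx          ; 4th byte-sized arg, in its own 4-byte slot (only the low byte matters)
push ecx          ; 3rd
push ebx          ; 2nd
push eax          ; 1st
call take4chars   ; hypothetical void take4chars(char, char, char, char)
add  esp, 16      ; cdecl: the caller removes 4 slots * 4 bytes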

Also keep in mind that the x86 ABIs were designed around 1990. At that time, the number of instructions was a pretty good measure of the speed of a piece of code. Your example requires one extra instruction compared with four pushes if para1-para4 are located in registers, and five extra instructions in the worst case where all parameters must be loaded from memory (x86 supports pushing memory operands directly).

Further, in your example, you trade saving 12 bytes on the stack for 14 extra code bytes: your code sequence requires 18 bytes of code when para1-para4 (e.g. in AL-DL) are located in registers, while four pushes require 4 bytes. So overall, you reduce the memory footprint only if your code recurses, so that many stack frames are live at the same time.


