What C/C++ Compiler Can Use Push Pop Instructions For Creating Local Variables, Instead of Just Increasing Esp Once

What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?

You're right, push is a minor missed-optimization with all 4 major x86 compilers. There's some code-size, and thus indirectly some performance, to be had. Or maybe more directly a small amount of performance in some cases, e.g. saving a sub rsp instruction.

But if you're not careful, you can make things slower with extra stack-sync uops by mixing push with [rsp+x] addressing modes. pop doesn't sound useful, just push. As the forum thread you linked suggests, you only use this for the initial store of locals; later reloads and stores should use normal addressing modes like [rsp+8]. We're not talking about trying to avoid mov loads/stores entirely, and we still want random access to the stack slots where we spilled local variables from registers!

Modern code generators avoid using PUSH. It is inefficient on today's processors because it modifies the stack pointer, that gums-up a super-scalar core. (Hans Passant)

This was true 15 years ago, but compilers are once again using push when optimizing for speed, not just code-size. Compilers already use push/pop for saving/restoring call-preserved registers they want to use, like rbx, and for pushing stack args (mostly in 32-bit mode; in 64-bit mode most args fit in registers). Both of these things could be done with mov, but compilers use push because it's more efficient than sub rsp,8 / mov [rsp], rbx. gcc has tuning options to avoid push/pop for these cases, enabled for -mtune=pentium3 and -mtune=pentium, and similar old CPUs, but not for modern CPUs.

Intel since Pentium-M and AMD since Bulldozer(?) have a "stack engine" that tracks the changes to RSP with zero latency and no ALU uops, for PUSH/POP/CALL/RET. Lots of real code was still using push/pop, so CPU designers added hardware to make it efficient. Now we can use them (carefully!) when tuning for performance. See Agner Fog's microarchitecture guide and instruction tables, and his asm optimization manual. They're excellent. (And other links in the x86 tag wiki.)

It's not perfect; reading RSP directly (when the offset from the value in the out-of-order core is nonzero) does cause a stack-sync uop to be inserted on Intel CPUs. e.g. push rax / mov [rsp-8], rdi is 3 total fused-domain uops: 2 stores and one stack-sync.

On function entry, the "stack engine" is already in a non-zero-offset state (from the call in the parent), so using some push instructions before the first direct reference to RSP costs no extra uops at all. (Unless we were tailcalled from another function with jmp, and that function didn't pop anything right before jmp.)

It's kind of funny that compilers have been using dummy push/pop instructions just to adjust the stack by 8 bytes for a while now, because it's so cheap and compact (if you're doing it once, not 10 times to allocate 80 bytes), but aren't taking advantage of it to store useful data. The stack is almost always hot in cache, and modern CPUs have excellent store/load bandwidth to L1d.


int extfunc(int *,int *);

void foo() {
    int a=1, b=2;
    extfunc(&a, &b);
}

compiles with clang6.0 -O3 -march=haswell on the Godbolt compiler explorer. See that link for all the rest of the code, and many other missed optimizations and silly code-gen (see my comments in the C source pointing out some of them):

    # compiled for the x86-64 System V calling convention:
    # integer args in rdi, rsi  (, rdx, rcx, r8, r9)
    push    rax                    # clang / ICC ALREADY use push instead of sub rsp,8
    lea     rdi, [rsp + 4]
    mov     dword ptr [rdi], 1     # 6 bytes: opcode + modrm + imm32
    mov     rsi, rsp               # special case for lea rsi, [rsp + 0]
    mov     dword ptr [rsi], 2
    call    extfunc(int*, int*)
    pop     rax                    # and POP instead of add rsp,8
    ret

And very similar code with gcc, ICC, and MSVC, sometimes with the instructions in a different order, or gcc reserving an extra 16B of stack space for no reason. (MSVC reserves more space because it's targeting the Windows x64 calling convention which reserves shadow space instead of having a red-zone).

clang saves code-size by using the LEA results for store addresses instead of repeating RSP-relative addresses (SIB+disp8). ICC and clang put the variables at the bottom of the space they reserved, so one of the addressing modes avoids a disp8. (With 3 variables, reserving 24 bytes instead of 8 was necessary, and clang didn't take advantage then.) gcc and MSVC miss this optimization.

But anyway, more optimal would be:

    push    2                      # only 2 bytes
    lea     rdi, [rsp + 4]
    mov     dword ptr [rdi], 1
    mov     rsi, rsp               # special case for lea rsi, [rsp + 0]
    call    extfunc(int*, int*)
      # ... later accesses would use [rsp] and [rsp+4] if needed, not pop
    pop     rax                    # alternative to add rsp,8
    ret

The push is an 8-byte store, and we overlap half of it. This is not a problem: CPUs can store-forward the unmodified low half efficiently even after storing the high half. Overlapping stores in general are not a problem, and in fact glibc's well-commented memcpy implementation uses two (potentially) overlapping loads + stores for small copies (up to the size of 2x xmm registers at least), to load everything then store everything without caring about whether or not there's overlap.
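The glibc-style overlapping small-copy trick mentioned above can be sketched in portable C. This is an illustrative sketch, not glibc's actual code: for a copy of 9 to 16 bytes, do two possibly-overlapping 8-byte loads, then two 8-byte stores. Loading everything before storing anything keeps it correct even though the two halves overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the glibc-style small-copy technique (not glibc's real code):
 * two possibly-overlapping 8-byte loads, then two 8-byte stores.
 * Valid for 8 <= n <= 16. */
static void copy_8_to_16(void *dst, const void *src, size_t n) {
    uint64_t lo, hi;
    memcpy(&lo, src, 8);                        /* first 8 bytes              */
    memcpy(&hi, (const char *)src + n - 8, 8);  /* last 8 bytes, may overlap  */
    memcpy(dst, &lo, 8);                        /* store everything after     */
    memcpy((char *)dst + n - 8, &hi, 8);        /* loading everything         */
}
```

Compilers turn each fixed-size memcpy here into a single 8-byte load or store, so the whole copy is branchless regardless of n within the supported range.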

Note that in 64-bit mode, 32-bit push is not available. So we still have to reference rsp directly for the upper half of the qword. But if our variables were uint64_t, or we didn't care about making them contiguous, we could just use push.

We have to reference RSP explicitly in this case to get pointers to the locals for passing to another function, so there's no getting around the extra stack-sync uop on Intel CPUs. In other cases maybe you just need to spill some function args for use after a call. (Although normally compilers will push rbx and mov rbx,rdi to save an arg in a call-preserved register, instead of spilling/reloading the arg itself, to shorten the critical path.)

I chose 2x 4-byte args so we could reach a 16-byte alignment boundary with 1 push, so we can optimize away the sub rsp, ## (or dummy push) entirely.
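The alignment arithmetic behind that choice can be sketched directly. In the x86-64 System V ABI, RSP % 16 == 8 on function entry (the call pushed an 8-byte return address), so the total stack adjustment before the next call must be ≡ 8 (mod 16). This helper (hypothetical, for illustration) finds the smallest adjustment that both covers the locals and restores alignment:

```c
#include <assert.h>

/* x86-64 System V sketch: on entry RSP % 16 == 8 (return address pushed),
 * so total adjustment before the next call must be == 8 (mod 16).
 * Returns the smallest adjustment >= locals_bytes satisfying that. */
static int stack_adjust_for(int locals_bytes) {
    int adj = locals_bytes;
    while (adj % 16 != 8)
        adj++;
    return adj;
}
```

For 2x 4-byte locals (8 bytes) the answer is 8, i.e. exactly one push; for 3x 4-byte locals (12 bytes) it jumps to 24, the 8 + 16*1 figure used in the 3-arg example below.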

I could have used mov rax, 0x0000000200000001 / push rax, but 10-byte mov r64, imm64 takes 2 entries in the uop cache, and a lot of code-size.
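The equivalence that mov r64, imm64 / push rax relies on is easy to check in C: on a little-endian target (an assumption; this fails on big-endian), one 8-byte store of the merged constant writes the same bytes as two adjacent 4-byte stores of 1 and 2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Little-endian sketch: a single qword store of 0x0000000200000001 lays
 * down a = 1 at the low address and b = 2 four bytes above it, the same
 * bytes as the two separate dword stores in the example. */
static int merged_store_matches(void) {
    uint64_t merged = 0x0000000200000001ULL;
    int32_t  two[2];
    memcpy(two, &merged, 8);   /* the one 8-byte "push" */
    return two[0] == 1 && two[1] == 2;
}
```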

gcc7 does know how to merge two adjacent stores, but chooses not to do that for mov in this case. If both constants had needed 32-bit immediates, it would have made sense. But if the values weren't actually constant at all, and came from registers, this wouldn't work while push / mov [rsp+4] would. (It wouldn't be worth merging values in a register with SHL + SHLD or whatever other instructions to turn 2 stores into 1.)

If you need to reserve space for more than one 8-byte chunk, and don't have anything useful to store there yet, definitely use sub instead of multiple dummy PUSHes after the last useful PUSH. But if you have useful stuff to store, push imm8 or push imm32, or push reg are good.

We can see more evidence of compilers using "canned" sequences with ICC output: it uses lea rdi, [rsp] in the arg setup for the call. It seems they didn't think to look for the special case of the address of a local being pointed to directly by a register, with no offset, allowing mov instead of lea. (mov is definitely not worse, and better on some CPUs.)


An interesting example of not making locals contiguous is a version of the above with 3 args, int a=1, b=2, c=3;. To maintain 16B alignment, we now need to offset 8 + 16*1 = 24 bytes, so we could do

bar3:
    push    3
    push    2                      # don't interleave mov in here; extra stack-sync uops
    push    1
    mov     rdi, rsp
    lea     rsi, [rsp+8]
    lea     rdx, [rdi+16]          # relative to RDI to save a byte, with probably no extra
                                   # latency even if MOV isn't zero-latency; not on the critical path
    call    extfunc3(int*,int*,int*)
    add     rsp, 24
    ret

This is significantly smaller code-size than compiler-generated code, because mov [rsp+16], 2 has to use the mov r/m32, imm32 encoding, using a 4-byte immediate because there's no sign_extended_imm8 form of mov.

push imm8 is extremely compact, 2 bytes. mov dword ptr [rsp+8], 1 is 8 bytes: opcode + modrm + SIB + disp8 + imm32. (RSP as a base register always needs a SIB byte; the ModRM encoding with base=RSP is the escape code for a SIB byte existing. Using RBP as a frame pointer allows more compact addressing of locals (by 1 byte per insn), but takes 3 extra instructions to set up / tear down, and ties up a register. But it avoids further access to RSP, avoiding stack-sync uops. It could actually be a win sometimes.)

One downside to leaving gaps between your locals is that it may defeat load or store merging opportunities later. If you (the compiler) need to copy 2 locals somewhere, you may be able to do it with a single qword load/store if they're adjacent. Compilers don't consider all the future tradeoffs for the function when deciding how to arrange locals on the stack, as far as I know. We want compilers to run quickly, and that means not always back-tracking to consider every possibility for rearranging locals, or various other things. If looking for an optimization would take quadratic time, or multiply the time taken for other steps by a significant constant, it had better be an important optimization. (IDK how hard it might be to implement a search for opportunities to use push, especially if you keep it simple and don't spend time optimizing the stack layout for it.)

However, assuming there are other locals which will be used later, we can allocate them in the gaps between any we spill early. So the space doesn't have to be wasted, we can simply come along later and use mov [rsp+12], eax to store between two 32-bit values we pushed.


A tiny array of long, with non-constant contents

int ext_longarr(long *);
void longarr_arg(long a, long b, long c) {
    long arr[] = {a,b,c};
    ext_longarr(arr);
}

gcc/clang/ICC/MSVC follow their normal pattern, and use mov stores:

longarr_arg(long, long, long):       # @longarr_arg(long, long, long)
    sub     rsp, 24
    mov     rax, rsp                 # this is clang being silly
    mov     qword ptr [rax], rdi     # it could have used [rsp] for the first store at least,
    mov     qword ptr [rax + 8], rsi # so it didn't need 2 reg,reg MOVs to avoid clobbering RDI before storing it.
    mov     qword ptr [rax + 16], rdx
    mov     rdi, rax
    call    ext_longarr(long*)
    add     rsp, 24
    ret

But it could have stored an array of the args like this:

longarr_arg_handtuned:
    push    rdx
    push    rsi
    push    rdi                    # leave stack 16B-aligned
    mov     rdi, rsp
    call    ext_longarr(long*)
    add     rsp, 24
    ret

With more args, we start to get more noticeable benefits especially in code-size when more of the total function is spent storing to the stack. This is a very synthetic example that does nearly nothing else. I could have used volatile int a = 1;, but some compilers treat that extra-specially.


Reasons for not building stack frames gradually

(probably wrong) Stack unwinding for exceptions, and debug formats, I think don't support arbitrary playing around with the stack pointer. So at least before making any call instructions, a function is supposed to have offset RSP as much as it's going to for all future function calls in this function.

But that can't be right, because alloca and C99 variable-length arrays would violate that. There may be some kind of toolchain reason outside the compiler itself for not looking for this kind of optimization.

This gcc mailing list post about disabling -maccumulate-outgoing-args for tune=default (in 2014) was interesting. It pointed out that more push/pop led to larger unwind info (.eh_frame section), but that's metadata that's normally never read (if no exceptions), so larger total binary but smaller / faster code. Related: this shows what -maccumulate-outgoing-args does for gcc code-gen.

Obviously the examples I chose were trivial, where we're pushing the input parameters unmodified. More interesting would be when we calculate some things in registers from the args (and data they point to, and globals, etc.) before having a value we want to spill.

If you have to spill/reload anything between function entry and later pushes, you're creating extra stack-sync uops on Intel. On AMD, it could still be a win to do push rbx / blah blah / mov [rsp-32], eax (spill to the red zone) / blah blah / push rcx / imul ecx, [rsp-24], 12345 (reload the earlier spill from what's still the red-zone, with a different offset).

Mixing push and [rsp] addressing modes is less efficient (on Intel CPUs because of stack-sync uops), so compilers would have to carefully weigh the tradeoffs to make sure they're not making things slower. sub / mov is well-known to work well on all CPUs, even though it can be costly in code-size, especially for small constants.

"It's hard to keep track of the offsets" is a totally bogus argument. It's a computer; re-calculating offsets from a changing reference is something it has to do anyway when using push to put function args on the stack. I think compilers could run into problems (i.e. need more special-case checks and code, making them compile slower) if they had more than 128B of locals, so you couldn't always mov store below RSP (into what's still the red-zone) before moving RSP down with future push instructions.
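The bookkeeping dismissed above really is trivial: a toy sketch (hypothetical names, not any real compiler's internals) can record each local's position as a fixed offset below the entry RSP, and recompute the current [rsp+disp] displacement from however much the frame has grown so far.

```c
#include <assert.h>

/* Toy offset-tracking sketch: each "push reg" allocates 8 bytes and
 * records the slot's depth below the entry RSP.  The [rsp+disp]
 * displacement for any slot is then frame_size minus that depth. */
#define MAX_SLOTS 16
static int slot_depth[MAX_SLOTS];  /* bytes below entry RSP, per local */
static int nslots, frame_size;     /* frame_size = total RSP adjustment */

static int push_local(void) {      /* models one push: alloc + store    */
    frame_size += 8;
    slot_depth[nslots] = frame_size;
    return nslots++;
}

static int rsp_disp(int slot) {    /* current displacement for [rsp+disp] */
    return frame_size - slot_depth[slot];
}
```

After three pushes (as in the bar3 example above), the most recent push is at [rsp], the previous at [rsp+8], the first at [rsp+16] — exactly the mov rdi, rsp / lea rsi, [rsp+8] / +16 pattern.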

Compilers already consider multiple tradeoffs, but currently growing the stack frame gradually isn't one of the things they consider. push wasn't as efficient before Pentium-M introduced the stack engine, so efficient push even being available is a somewhat recent change as far as redesigning how compilers make stack-layout choices.

Having a mostly-fixed recipe for prologues and for accessing locals is certainly simpler.

Why can variables be initialized (and used) without their declaration and definition being run?

Jumping between asm() statements is not supported by GCC;

your code has undefined behaviour.
Literally anything is allowed to happen.

There's no __builtin_unreachable() after it, and you didn't even use asm goto("" :::: label) (GCC manual) to tell it about a C label the asm statement might or might not jump to.
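For reference, the supported way looks like this: asm goto is a GCC/Clang extension that declares which C labels the asm might transfer control to. This sketch uses an empty template (so execution always falls through), but the compiler can no longer optimize as if the asm statement can't jump to out:

```c
#include <assert.h>

/* asm goto sketch (GCC/Clang extension): the label list after the 4th
 * colon tells the compiler the asm may jump there.  The empty template
 * never actually jumps, so this falls through and returns 9000. */
static int asm_goto_demo(void) {
    int ptr = 0;
    __asm__ goto("" : /* no outputs */ : /* no inputs */ : /* no clobbers */ : out);
    ptr = 9000;   /* compiler can't assume this is always reached */
out:
    return ptr;
}
```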

Whatever happens in practice with different versions of gcc/clang and different optimization levels when you do that is a coincidence / implementation detail / result of whatever the optimizer actually did.

For example, with optimization enabled it would do constant-propagation assuming that the int ptr=9000; statement would be reached, because it's allowed to assume that execution comes out the end of the first asm statement.

You'd have to look at the compiler's full asm output to see what actually happened. e.g. https://godbolt.org/z/MbGhEnK3b shows GCC -O0 and -O2. With -O0 you do indeed get it reading uninitialized stack space since it jumps over a mov DWORD PTR [rbp-4], 9000, and with -O2 you get constant-propagation: mov esi, 9000 before the call std::basic_ostream<char,... operator <<(int) overload.

because the memory isn't allocated

Space for it actually is allocated in the function prologue; compilers don't generate code to move the stack pointer every time they encounter a declaration inside a scope. They allocate space once at the start of a function. Even the one-pass Tiny C Compiler works this way, not using a separate push to alloc+init separate int vars. (This is actually a missed optimization in some cases when push would be useful to alloc + init in one instruction: What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?)
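That allocate-once strategy is simple to sketch: walk the declarations, assign each local a frame offset, and emit a single sub in the prologue. This is a toy illustration with invented names, not TCC's or any real compiler's code:

```c
#include <assert.h>

/* Toy sketch of one-pass frame layout: accumulate local sizes, hand out
 * [ebp-4], [ebp-8], ... style offsets, and return the single amount a
 * prologue would subtract with "sub esp, N" -- once, not per declaration. */
static int layout_frame(const int sizes[], int n, int offsets_out[]) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += sizes[i];
        offsets_out[i] = -total;   /* offset from the frame pointer */
    }
    return total;                  /* one sub esp, total in the prologue */
}
```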


Even moreso than most other kinds of C undefined behaviour, this is not something the compiler can actually detect at run-time to warn you about. asm statements just insert text into GCC's asm output which is fed to the assembler. You need to accurately describe to the compiler what the asm does (using constraints and things like asm goto) to give the compiler enough information to generate correct code around your asm statement.

GCC does not parse the instructions in the asm template, it just copies it directly to the asm output. (Or for Extended asm, substitutes the %0, %1 etc. operands with text generated according to the operand constraints.)

Why use push/pop instead of sub and mov?

If you compile with -mtune=pentium3 or something earlier than -mtune=pentium-m, GCC will do code-gen like you imagined, because on those old CPUs push/pop really does decode to a separate ALU operation on the stack pointer as well as a load/store. (You'll have to use -m32, or -march=nocona (64-bit P4 Prescott) because those old CPUs also don't support x86-64). Why does gcc use movl instead of push to pass function args?

But Pentium-M introduced a "stack engine" in the front-end that eliminates the stack-adjustment part of stack ops like push/call/ret/pop. It effectively renames the stack pointer with zero latency. See Agner Fog's microarch guide and What is the stack engine in the Sandybridge microarchitecture?

As a general trend, any instruction that's in widespread use in existing binaries will motivate CPU designers to make it fast. For example, Pentium 4 tried to get everyone to stop using INC/DEC; that didn't work; current CPUs do partial-flag renaming better than ever. Modern x86 transistor and power budgets can support that kind of complexity, at least for the big-core CPUs (not Atom / Silvermont). Unfortunately I don't think there's any hope in sight for the false dependencies (on the destination) for instructions like sqrtss or cvtsi2ss, though.


Using the stack pointer explicitly in an instruction like add rsp, 8 requires the stack engine in Intel CPUs to insert a sync uop to update the out-of-order back-end's value of the register. Same if the internal offset gets too large.

In fact pop dummy_register is more efficient than add rsp, 8 or add esp,4 on modern CPUs, so compilers will typically use that to pop one stack slot with the default tuning, or with -march=sandybridge for example. Why does this function push RAX to the stack as the first operation?

See also What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once? re: using push to initialize local variables on the stack instead of sub rsp, n / mov. That could be a win in some cases, especially for code-size with small values, but compilers don't do it.


Also, no, GCC / clang won't make code that's exactly like what you show.

If they need to save registers around a function call, they will typically do that using mov to memory. Or mov to a call-preserved register that they saved at the top of the function, and will restore at the end.

I've never seen GCC or clang push multiple call-clobbered registers before a function call, other than to pass stack args. And definitely not multiple pops afterwards to restore into the same (or different) registers. Spill/reload inside a function typically uses mov. This avoids the possibility of push/pop inside a loop (except for passing stack args to a call), and allows the compiler to do branching without having to worry about matching pushes with pops. Also it reduces complexity of stack-unwind metadata that has to have an entry for every instruction that moves RSP. (Interesting tradeoff between instruction count vs. metadata and code size for using RBP as a traditional frame pointer.)

Something like your code-gen could be seen with call-preserved registers + some reg-reg moves in a tiny function that just called another function and then returned an __int128 that was a function arg in registers. So the incoming RSI:RDI would need to be saved, to return in RDX:RAX.

Or if you store to a global or through a pointer after a non-inline function call, the compiler would also need to save the function args until after the call.

All the calculations take place in registers. Why is the stack not storing the result of the register computation here

All the calculations take place in registers.

Well yes, but they're stored afterwards.

Using memory-destination add instead of just using the accumulator register (EAX) would be an optimization. And one that's impossible when the result needs to be in a different location than any of the inputs to an expression.

Why is the stack not storing the result of the register computation here

It is, just not with push


You compiled with optimization disabled (debug mode) so every C object really does have its own address in the asm, and is kept in sync between C statements. i.e. no keeping C variables in registers. (Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?). This is one reason why debug mode is extra slow: it's not just avoiding optimizations, it's forcing store/reload.

But the compiler uses mov not push because it's not a function arg. That's a missed optimization that all compilers share, but in this case it's not even trying to optimize. (What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?). It would certainly be possible for the compiler to reserve space for c in the same instruction as storing it, using push. But compilers instead do stack allocation for all locals on entry to a function, with one sub esp, constant.

Somewhere before the mov dword ptr [c],eax that spills c to its stack slot, there's a sub esp, 12 or something that reserves stack space for c. In this exact case, MSVC uses a dummy push to reserve 4 bytes space, as an optimization over sub esp, 4.

In the MSVC asm output, the compiler will emit a c = ebp-4 line or something that defines c as a text substitution for ebp-4. If you looked at disassembly you'd just see [ebp-4] or whatever addressing mode, instead of the symbolic name.

In MSVC asm output, don't assume that [c] refers to static storage. It's actually still stack space as expected, but using a symbolic name for the offset.


Putting your code on the Godbolt compiler explorer with 32-bit MSVC 19.22, we get the following asm, which only uses symbolic asm constants for the offset, not the whole addressing mode. So [c] in other listings might just be that form over-simplified even further.

_c$ = -4                                          ; size = 4
_a$ = 8                                           ; size = 4
_b$ = 12                                          ; size = 4
int AddMe(int,int) PROC                           ; AddMe
        push    ebp
        mov     ebp, esp                 ; set up a legacy frame pointer
        push    ecx                      ; RESERVE 4B OF STACK SPACE FOR c

        mov     eax, DWORD PTR _a$[ebp]
        add     eax, DWORD PTR _b$[ebp]  ; c = a+b
        mov     DWORD PTR _c$[ebp], eax  ; spill c to the stack

        mov     eax, DWORD PTR _c$[ebp]  ; reload it as the return value

        mov     esp, ebp                 ; restore ESP
        pop     ebp                      ; tear down the stack frame
        ret     0
int AddMe(int,int) ENDP                           ; AddMe

Why do we use sub esp, 4 instead of push a register in assembly?

TL:DR: Clang already does this. GCC doesn't except at -Os. I haven't benchmarked.


Code-size isn't everything. A dummy push is still a real store that takes up a store-buffer entry until it commits to cache. In fact code-size is usually the last thing to worry about, only when all else is equal (number of front-end uops, avoiding back-end bottlenecks, avoiding any performance pitfalls).

Historically (16-bit x86 before CPUs had caches), push cx would probably not have been faster than sub sp, 2 (3 bytes) or dec sp / dec sp (2 bytes) on those old CPUs where memory bandwidth was the main factor in performance (including for code-fetch). Optimizing for speed on 8088 especially is approximately the same as optimizing for code size.

The reason xor eax,eax is still preferred is that later CPUs were able to make it still at least as fast even apart from the code-size advantage. What is the best way to set a register to zero in x86 assembly: xor, mov or and?


On later CPUs like PPro, push decoded to multiple uops (to adjust ESP and separately to store). So on those CPUs, despite the smaller code size, it costs more in the front-end. Or on P5 Pentium (which didn't decode complex instructions into multiple uops), push stalled the pipeline temporarily and was often avoided by compilers even when the store-to-memory side effect was desired.

But finally, around Pentium-M, CPUs got a "stack engine" that handles the ESP-update part of stack operations outside of the out-of-order back-end, making it single-uop and zero latency (for the dep chain through ESP). As you can see from that link, the stack-sync uops that the stack engine has to insert sometimes do make sub esp,4 cost more than push, if you weren't already going to reference esp directly in the back-end before the next stack op (like call).

IDK if it really would have been a good idea to start using dummy push ecx on CPUs that old, or if limited store-buffer sizes meant that it wasn't a good idea to use up execution resources on doing dummy stores, even to cache lines that were almost certainly hot (the top of the stack).

But anyway, modern compilers do use this peephole optimization, especially in 64-bit mode where needing to adjust the stack by only one push is common. Modern CPUs have large store buffers.

void foo();

int bar() {
    foo();
    return 0;
}

Clang has been doing this for several years. e.g. with current clang 10.0 -O3 (optimize for speed over size) on Godbolt

bar():
    push    rax
    call    foo()
    xor     eax, eax
    pop     rcx
    ret

GCC does this at -Os, but not at -O3. (I tried with -march=skylake; it still chooses to use sub.)

It's less easy to construct a case where sub esp,4 would be useful, but this works:

int bar() {
    volatile int arr[1] = {0};
    return 0;
}

clang10.0 -m32 -O3 -mtune=skylake

bar():                                 # @bar()
    push    eax
    mov     dword ptr [esp], 0         # missed optimization for push 0
    xor     eax, eax
    pop     ecx
    ret

Unfortunately compilers don't spot the fact that push 0 could have both initialized and reserved space for the volatile int object, replacing both the push eax and the mov dword [esp], 0. See: What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?


