Why does this function push RAX to the stack as the first operation?

The x86-64 System V ABI requires that the stack pointer is aligned to 16 bytes immediately before a call instruction.

The call itself pushes an 8-byte return address on the stack, which breaks that alignment, so on function entry RSP is 8 bytes away from a 16-byte boundary. The compiler needs to move RSP by another 8 bytes (or 8 plus a multiple of 16) to realign the stack before the next call.

(The ABI design choice of requiring alignment before a call instead of after has the minor advantage that if any args were passed on the stack, this choice makes the first arg 16B-aligned.)
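The alignment arithmetic can be checked with a small Python model. This is only a sketch: the starting RSP value is a made-up address chosen to be 16-byte aligned, and the helper names are hypothetical.

```python
ALIGNMENT = 16            # bytes; required immediately before a call
RETURN_ADDRESS_SIZE = 8   # bytes pushed by call in 64-bit mode


def rsp_after_call(rsp_before_call: int) -> int:
    """RSP as the callee sees it on entry: call pushed a return address."""
    return rsp_before_call - RETURN_ADDRESS_SIZE


def rsp_after_push(rsp: int) -> int:
    """RSP after pushing one 8-byte register, for example push rax."""
    return rsp - 8


def is_aligned(rsp: int) -> bool:
    """True if RSP is on a 16-byte boundary."""
    return rsp % ALIGNMENT == 0


# Hypothetical caller stack pointer, aligned as the ABI requires before a call.
caller_rsp = 0x7FFF_FFFF_E000
entry_rsp = rsp_after_call(caller_rsp)      # misaligned by 8 on function entry
realigned_rsp = rsp_after_push(entry_rsp)   # one dummy push restores alignment

print(is_aligned(caller_rsp), is_aligned(entry_rsp), is_aligned(realigned_rsp))
```

A single dummy push is enough because it moves RSP by the same 8 bytes that the return address did.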

Pushing a don't-care value works well, and can be more efficient than sub rsp, 8 on CPUs with a stack engine.

Why use push/pop instead of sub and mov?

If you compile with -mtune=pentium3 or something earlier than -mtune=pentium-m, GCC will do code-gen like you imagined, because on those old CPUs push/pop really does decode to a separate ALU operation on the stack pointer as well as a load/store. (You'll have to use -m32, or -march=nocona (64-bit P4 Prescott), because those old CPUs also don't support x86-64.) See also "Why does gcc use movl instead of push to pass function args?"

But Pentium-M introduced a "stack engine" in the front-end that eliminates the stack-adjustment part of stack ops like push/call/ret/pop. It effectively renames the stack pointer with zero latency. See Agner Fog's microarch guide and "What is the stack engine in the Sandybridge microarchitecture?"

As a general trend, any instruction that's in widespread use in existing binaries will motivate CPU designers to make it fast. For example, Pentium 4 tried to get everyone to stop using INC/DEC; that didn't work; current CPUs do partial-flag renaming better than ever. Modern x86 transistor and power budgets can support that kind of complexity, at least for the big-core CPUs (not Atom / Silvermont). Unfortunately I don't think there's any hope in sight for the false dependencies (on the destination) for instructions like sqrtss or cvtsi2ss, though.

Using the stack pointer explicitly in an instruction like add rsp, 8 requires the stack engine in Intel CPUs to insert a sync uop to update the out-of-order back-end's value of the register. Same if the internal offset gets too large.
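The sync-uop behavior can be pictured with a toy model in Python. This is a deliberately simplified sketch, not a description of any specific CPU: it assumes the front-end tracks a single running offset, that any explicit RSP use with a nonzero offset costs exactly one sync uop, and it ignores the "offset too large" case. The instruction labels and the function name are hypothetical.

```python
# Stack ops whose RSP adjustment the stack engine absorbs (bytes added to RSP).
STACK_OPS = {"push": -8, "pop": +8, "call": -8, "ret": +8}


def count_sync_uops(instructions):
    """Count sync uops in a toy stack-engine model.

    Each item is either a key of STACK_OPS or the label "explicit_rsp",
    which stands for any instruction that names RSP directly
    (add rsp, 8 / mov rbp, rsp / an [rsp+8] addressing mode, ...).
    """
    pending_offset = 0   # offset the front-end has not yet given the back-end
    sync_uops = 0
    for kind in instructions:
        if kind in STACK_OPS:
            pending_offset += STACK_OPS[kind]
        elif kind == "explicit_rsp":
            if pending_offset != 0:
                sync_uops += 1       # back-end needs the up-to-date RSP value
                pending_offset = 0
        else:
            raise ValueError(f"unknown instruction kind: {kind!r}")
    return sync_uops


print(count_sync_uops(["push", "explicit_rsp"]))   # explicit use after a push
print(count_sync_uops(["push", "pop"]))            # stack ops only
```

In this model a push followed by add rsp, 8 pays for a sync uop, while a push followed by a pop into a dummy register does not, which is the effect the paragraph above describes.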

In fact pop dummy_register is more efficient than add rsp, 8 or add esp, 4 on modern CPUs, so compilers will typically use that to pop one stack slot with the default tuning, or with -march=sandybridge for example. See "Why does this function push RAX to the stack as the first operation?"

See also "What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?" re: using push to initialize local variables on the stack instead of sub rsp, n / mov. That could be a win in some cases, especially for code-size with small values, but compilers don't do it.

Also, no, GCC / clang won't make code that's exactly like what you show.

If they need to save registers around a function call, they will typically do that using mov to memory. Or mov to a call-preserved register that they saved at the top of the function, and will restore at the end.

I've never seen GCC or clang push multiple call-clobbered registers before a function call, other than to pass stack args. And definitely not multiple pops afterwards to restore into the same (or different) registers. Spill/reload inside a function typically uses mov. This avoids the possibility of push/pop inside a loop (except for passing stack args to a call), and allows the compiler to do branching without having to worry about matching pushes with pops. Also it reduces complexity of stack-unwind metadata that has to have an entry for every instruction that moves RSP. (Interesting tradeoff between instruction count vs. metadata and code size for using RBP as a traditional frame pointer.)

Something like your code-gen could be seen with call-preserved registers + some reg-reg moves in a tiny function that just called another function and then returned an __int128 that was a function arg in registers. So the incoming RSI:RDI would need to be saved, to return in RDX:RAX.

Or if you store to a global or through a pointer after a non-inline function call, the compiler would also need to save the function args until after the call.

In x86-64 do we always do pushq when we want to push something on the stack?

The entire register is call-preserved, not just the low dword or word. Normal functions always save/restore the whole qword register because that's the only safe thing to do, and it's also efficient enough that there's no reason to create a mechanism for functions to know when they could do anything else.

It's always efficient to read a full register after the 32-bit low half was written because 32-bit register writes implicitly zero-extend to 64-bit. Reading a 64-bit register after the caller wrote the low 8 or 16 bits could cause a partial-register stall on Intel P6-family microarchitectures, if the caller was careless about how it used the register before making a call. On modern uarches (not Intel P6), the 8/16-bit operand size register write already paid whatever merging penalty might have existed (typically a false dependency). (I'm glossing over a couple of details, like partial AH renaming still being a thing on modern Intel, including Skylake.)
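The architectural difference between 32-bit writes and 8/16-bit writes can be shown with a short Python model of a single 64-bit register. It is a sketch of the resulting register value only, not of any performance behavior, and the function name is hypothetical. The 8-bit case models the low byte (like AL), not the high-byte registers such as AH.

```python
MASK64 = (1 << 64) - 1


def write_register(old_value: int, new_bits: int, operand_size: int) -> int:
    """Return the full 64-bit register value after a write of operand_size bits."""
    if operand_size == 64:
        return new_bits & MASK64
    if operand_size == 32:
        # 32-bit writes zero-extend: the old upper half is discarded.
        return new_bits & 0xFFFF_FFFF
    if operand_size in (8, 16):
        # 8/16-bit writes merge into the old value: the upper bits survive.
        low_mask = (1 << operand_size) - 1
        return (old_value & ~low_mask & MASK64) | (new_bits & low_mask)
    raise ValueError(f"unsupported operand size: {operand_size}")


old = 0xDEAD_BEEF_CAFE_BABE
print(hex(write_register(old, 0x1, 32)))
print(hex(write_register(old, 0x1, 16)))
print(hex(write_register(old, 0x1, 8)))
```

Because the 8/16-bit result depends on the old value, the write has an input dependency on the previous contents of the register; the 32-bit write does not.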

While you could move the stack pointer with sub $24, %rsp and use movl or movb to store the 32-bit or 8-bit low parts of some registers, that's only safe if you know something about how your caller uses registers and want to optimize accordingly. (Making your function dependent on the caller's internals, not just the ABI). Even if that was an option for some helper function, it normally wouldn't be worth it to reduce the footprint of your stack frame by a few bytes.

(It's rare for functions to be using 16-bit data, but 8-bit data is not rare. bool and char are common. Compilers usually use movzx aka movzbl loads from memory to zero-extend to full registers, and can often use 32-bit operand size to avoid actually dealing with partial register shenanigans. But they wouldn't care if you saved/restored only the low 8 bits with a mov store / movzbl reload, for registers where a compiler is keeping a zero-extended bool or char.)

Are pushl and pushw ever used in x86-64?

pushl literally doesn't exist in 64-bit mode; 32-bit operand-size for push is not encodable even with a REX.W=0 prefix.

pushw is encodable but never used by compilers in 32 or 64-bit mode. (And generally not useful or recommended for humans, except for weird corner cases or hacks like maybe shellcode. I did use it once when code-golfing (optimizing for code size), merging two 16-bit values into one register for adler-32.)

If a compiler did want to do word or dword stores, (e.g. in unoptimized builds to spill incoming register args), it would just use movw or movl.

You generally want to keep the stack aligned by 16 so you're ready to make another function call; that's why I suggested sub $24, %rsp above. (On function entry, RSP points at the return address your caller pushed. RSP+8 and RSP-8 are 16-byte aligned.)
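To see where numbers like 24 come from, here is a small Python sketch that computes the smallest sub amount that both reserves enough space and leaves RSP 16-byte aligned. It assumes the sub is the first thing that moves RSP after function entry (no earlier pushes), and the function name is hypothetical.

```python
def aligned_stack_adjustment(bytes_needed: int) -> int:
    """Smallest n >= bytes_needed such that sub $n, %rsp at function entry
    leaves RSP 16-byte aligned.

    On entry RSP is 8 bytes past a 16-byte boundary (the return address),
    so n must be 8 plus a multiple of 16.
    """
    if bytes_needed < 0:
        raise ValueError("bytes_needed must be non-negative")
    return ((bytes_needed + 7) // 16) * 16 + 8


for needed in (0, 8, 20, 24, 25):
    print(needed, aligned_stack_adjustment(needed))
```

Three 8-byte slots need exactly 24 bytes, which already has the right form; asking for 16 bytes would also have to be rounded up to 24.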

pushq %reg is very efficient on modern CPUs: decodes to a single uop on CPUs with a stack engine (that handles the RSP updates) outside the OoO exec back-end. It's so efficient that clang uses push %rax or other dummy register instead of sub $8, %rsp when it only needs to move the stack pointer by 8 bytes, e.g. to realign the stack before another call.

pushq %reg is a 1 byte instruction (or 2 bytes for r8..r15 including a REX prefix)
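The size claim follows from the encoding: push of a 64-bit register is opcode 0x50 plus the low three bits of the register number, with a REX.B prefix (0x41) for r8..r15. A minimal Python sketch of that encoding, with a hypothetical helper name, and the 4-byte sub $8, %rsp encoding for comparison:

```python
# Register numbers 0..15 in x86-64 encoding order.
REGISTERS = [
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
]


def encode_push(reg: str) -> bytes:
    """Machine code for pushq %reg in 64-bit mode."""
    number = REGISTERS.index(reg)          # raises ValueError for unknown names
    opcode = 0x50 + (number & 7)
    if number >= 8:
        return bytes([0x41, opcode])       # REX.B selects r8..r15
    return bytes([opcode])


# sub $8, %rsp: REX.W, opcode 0x83 /5, ModRM 0xEC (rsp), 8-bit immediate.
SUB_RSP_8 = bytes([0x48, 0x83, 0xEC, 0x08])

print(encode_push("rax").hex(), encode_push("r15").hex(), SUB_RSP_8.hex())
```

So a dummy push saves 2 or 3 bytes of code compared with the explicit sub, in addition to avoiding the explicit RSP reference.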

What to do with an empty pop?

You can either use add $8, %rsp or simply pop into a register whose value you don't care about, like pop %rcx.

The latter is slightly preferable on recent systems due to the shorter code size and quirks of the "stack engine" (explicit use of RSP can make Intel CPUs insert a stack-sync uop), but the former is not too bad either.

Some compilers (especially clang) do use dummy push/pop by default when they need to move RSP by exactly 8; see "Why does this function push RAX to the stack as the first operation?"

Also note that your pop is redundant with mov %rbp, %rsp since you're using RBP as a frame pointer anyway. And that an _exit system call doesn't care where RSP is pointing.