What's the Difference Between 'push' and 'pushq' in AT&T Assembly

What's the difference between 'push' and 'pushq' in AT&T assembly?

I'm not sure which assembler you're using, but this holds for GAS (the GNU Assembler), which uses AT&T syntax: GAS instruction mnemonics are generally suffixed with one of the letters "b", "s", "w", "l", "q" or "t" to indicate the size of the operand being manipulated.

b = byte (8-bit)
s = short (16-bit integer) or single (32-bit floating point)
w = word (16-bit)
l = long (32-bit integer or 64-bit floating point)
q = quad (64-bit)
t = ten bytes (80-bit floating point)

If the suffix is not specified, and there are no memory operands for the instruction, GAS infers the operand size from the size of the destination register operand (the final operand).
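As a rough illustration of the inference rule above, here is a minimal GAS sketch for x86-64. The label suffix_demo is a made-up name, and the routine assumes %rdi holds a valid writable pointer when it is called.

```asm
    .text
    .globl  suffix_demo          # hypothetical label, not from the original answer
suffix_demo:
    movq    %rax, %rcx           # explicit q suffix: 64-bit move
    mov     %rax, %rcx           # same instruction: size inferred from the 64-bit registers
    mov     %eax, %ecx           # inferred as movl (32-bit)
    mov     %al, %cl             # inferred as movb (8-bit)
    movl    $1, (%rdi)           # immediate to memory: no register implies a size, so a suffix is given
    movb    $1, (%rdi)           # same operands, but a 1-byte store
    ret
```

The last two lines are the case where the suffix carries real information: the operands are identical and only the suffix decides how many bytes are stored.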

pushq $0x0 simply pushes 8 zero bytes onto the stack. For push %r9, no suffix is needed: GAS infers the 64-bit operand size from the fact that %r9 is a 64-bit register, and pushes its value onto the stack.

An interesting fact about the stack is that it grows down, so the zero bytes pushed first end up at a higher address than the value of %r9. This is a common source of confusion: in memory, the value of %r9 actually sits below the zero bytes.
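The layout described above can be sketched like this. The label layout_demo is hypothetical, and the comments show the stack as it would look right after the two pushes.

```asm
    .text
    .globl  layout_demo          # hypothetical label
layout_demo:
    pushq   $0x0                 # RSP -= 8, then 8 zero bytes are stored at (%rsp)
    push    %r9                  # same as pushq %r9: RSP -= 8, then R9 is stored at (%rsp)

    # Stack now, with addresses increasing upward:
    #   8(%rsp)   00 00 00 00 00 00 00 00    pushed first, higher address
    #   0(%rsp)   value of %r9               pushed last, lower address

    add     $16, %rsp            # drop both slots so ret finds the return address
    ret
```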

In x86-64 do we always do pushq when we want to push something on the stack?

The entire register is call-preserved, not just the low dword or word. Normal functions always save/restore the whole qword register because that's the only safe thing to do, and it's also efficient enough that there's no reason to create a mechanism for functions to know when they could do anything else.

It's always efficient to read a full register after the 32-bit low half was written because 32-bit register writes implicitly zero-extend to 64-bit. Reading a 64-bit register after the caller wrote the low 8 or 16 bits could cause a partial-register stall on Intel P6-family microarchitectures, if the caller was careless about how it used the register before making a call. On modern uarches (not Intel P6), the 8/16-bit operand-size register write already paid whatever merging penalty might have existed (typically a false dependency). (I'm glossing over a couple of details, like partial AH renaming still being a thing on modern Intel, including Skylake.)

While you could move the stack pointer with sub $24, %rsp and use movl or movb to store the 32-bit or 8-bit low parts of some registers, that's only safe if you know something about how your caller uses registers and want to optimize accordingly. (Making your function dependent on the caller's internals, not just the ABI). Even if that was an option for some helper function, it normally wouldn't be worth it to reduce the footprint of your stack frame by a few bytes.
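To make the contrast concrete, here is a hedged sketch of both approaches. The labels helper, save_full and save_low32 are invented for illustration, and helper stands in for any callee.

```asm
    .text
helper:                          # hypothetical callee
    ret

    .globl  save_full
save_full:                       # the normal way: save and restore all 64 bits
    push    %rbx                 # entry RSP is 8 mod 16, so RSP is 16-byte aligned after this
    mov     %edi, %ebx           # keep the first argument across the call
    call    helper
    mov     %ebx, %eax
    pop     %rbx                 # caller's full RBX is back
    ret

    .globl  save_low32
save_low32:                      # only safe if the caller never relies on RBX's upper half
    sub     $24, %rsp            # 8 (return address) + 24 = 32, so RSP is aligned
    movl    %ebx, 8(%rsp)        # saves only the low dword
    mov     %edi, %ebx
    call    helper
    mov     %ebx, %eax
    movl    8(%rsp), %ebx        # the reload zero-extends: upper 32 bits of RBX become 0
    add     $24, %rsp
    ret
```

The second routine breaks the ABI contract whenever the caller had a full 64-bit value in RBX, which is why the first form is the one you'd normally see.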

(It's rare for functions to be using 16-bit data, but 8-bit data is not rare. bool and char are common. Compilers usually use movzx aka movzbl loads from memory to zero-extend to full registers, and can often use 32-bit operand size to avoid actually dealing with partial register shenanigans. But they wouldn't care if you saved/restored only the low 8 bits with a mov store / movzbl reload, for registers where a compiler is keeping a zero-extended bool or char.)

Are pushl and pushw ever used in x86-64?

pushl literally doesn't exist in 64-bit mode; 32-bit operand-size for push is not encodeable even with a REX.W=0 prefix.

pushw is encodeable but never used by compilers in 32 or 64-bit mode. (And generally not useful or recommended for humans, except for weird corner cases or hacks like maybe shellcode. I did use it once when code-golfing (optimizing for code size), merging two 16-bit values into one register for adler-32.)
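A small sketch of which push sizes are available in 64-bit mode, assuming GAS. The label push_sizes is hypothetical, and the byte encodings in the comments are the standard short forms for an 8-bit immediate.

```asm
    .text
    .globl  push_sizes           # hypothetical label
push_sizes:
    pushq   $1                   # 6a 01       RSP -= 8
    push    $1                   # no suffix: treated as pushq in 64-bit mode
    pushw   $1                   # 66 6a 01    RSP -= 2, the stack is now misaligned
    add     $18, %rsp            # undo 8 + 8 + 2 bytes before returning
    # pushl $1                   # left commented out: GAS is expected to reject it in 64-bit mode
    ret
```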

If a compiler did want to do word or dword stores (e.g. in unoptimized builds to spill incoming register args), it would just use movw or movl.

You generally want to keep the stack aligned by 16 so you're ready to make another function call; that's why I suggested sub $24, %rsp above. (On function entry, RSP points at the return address your caller pushed. RSP+8 and RSP-8 are 16-byte aligned.)

pushq %reg is very efficient on modern CPUs: decodes to a single uop on CPUs with a stack engine (that handles the RSP updates) outside the OoO exec back-end. It's so efficient that clang uses push %rax or other dummy register instead of sub $8, %rsp when it only needs to move the stack pointer by 8 bytes, e.g. to realign the stack before another call.

pushq %reg is a 1-byte instruction (or 2 bytes for r8..r15, including a REX prefix).
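For comparison, here are the two ways of moving RSP by 8 before a call, with their encodings in comments. The labels are invented, and popping into %rcx is just one plausible choice of a dead register, not a claim about what any particular clang version emits.

```asm
    .text
helper:                          # hypothetical callee
    ret

    .globl  align_with_sub
align_with_sub:                  # on entry RSP is 8 mod 16
    sub     $8, %rsp             # 48 83 ec 08   (4 bytes)
    call    helper               # RSP was 16-byte aligned at the call
    add     $8, %rsp             # 48 83 c4 08   (4 bytes)
    ret

    .globl  align_with_push
align_with_push:
    push    %rax                 # 50            (1 byte), the stored value is never read
    call    helper
    pop     %rcx                 # 59            (1 byte), discards the dummy slot
    ret

    # For r8..r15 a REX prefix is needed: push %r9 encodes as 41 51 (2 bytes).
```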

Difference between push myVar , push [myVar] and push OFFSET myVar

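Assuming MASM-style syntax (the OFFSET keyword suggests it), push myVar pushes the value stored in the variable myVar onto the stack.

push [myVar] does the same thing. The brackets only make the memory access explicit; they don't add a second level of indirection, so if myVar holds a pointer, the pointer itself is pushed, not the data it points to. (NASM is different: there a bare push myVar pushes the address, and the brackets are what make it a memory load.)

push OFFSET myVar pushes the address of myVar onto the stack.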

What is the function of the push / pop instructions used on registers in x86 assembly?

Pushing a value (not necessarily one stored in a register) means writing it to the top of the stack.

Popping means loading whatever is on top of the stack into a register and removing it from the stack. Here are some basic examples:

push 0xdeadbeef      ; push a value to the stack
pop eax              ; eax is now 0xdeadbeef

; swap contents of registers
push eax
mov eax, ebx
pop ebx
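Since the rest of this page is about AT&T syntax, here is a rough x86-64 GAS equivalent of the snippet above. The label and the constants are made up, and it uses %rcx and %rdx rather than %rbx so that no call-preserved register is clobbered.

```asm
    .text
    .globl  swap_demo            # hypothetical label
swap_demo:
    pushq   $0x1234              # push an immediate (a sign-extended 32-bit value in 64-bit mode)
    pop     %rax                 # RAX is now 0x1234

    mov     $1, %ecx
    mov     $2, %edx
    # swap the contents of RCX and RDX through the stack
    push    %rcx                 # save the old RCX
    mov     %rdx, %rcx           # RCX = old RDX
    pop     %rdx                 # RDX = old RCX, so RCX = 2 and RDX = 1
    ret
```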

What is callq instruction?

It's just call. Use Intel-syntax disassembly if you want to be able to look up instructions in the Intel/AMD manuals. (objdump -drwC -Mintel, GDB set disassembly-flavor intel, GCC -masm=intel)

The q operand-size suffix does technically apply (it pushes a 64-bit return address and treats RIP as a 64-bit register), but there's no way to override it with instruction prefixes. i.e. calll and callw aren't encodeable in 64-bit mode according to Intel's manual, so it's just annoying that some AT&T syntax tools show it as callq instead of call. This of course applies to retq as well.

Different tools are different in 32 vs. 64-bit mode. (Godbolt)

gcc -S: always call/ret. Nice.
clang -S: callq/retq and calll/retl. At least it's consistently annoying.
objdump -d: callq/retq (explicit 64-bit) and call/ret (implicit for 32-bit). Inconsistent and kinda dumb because 64-bit has no choice of operand-size, but 32-bit does. (Not a useful choice, though: callw truncates EIP to 16 bits.)
Although on the other hand, the default operand size (without a REX.W prefix) for most instructions in 64-bit mode is still 32. But add $1, (%rdi) needs an operand-size suffix; the assembler won't pick 32-bit for you if nothing implies one. OTOH, push is implicitly pushq, even though pushw $1 and pushq $1 are both encodeable (and usable in practice) in 64-bit mode.

GAS in 64-bit mode will assemble callw foo / foo: to 66 e8 00 00, but my Skylake CPU single-steps it as a 6-byte instruction, consuming 2 bytes of 00 after it. And changing RSP by 8. So it decodes it as callq with a rel32=0, ignoring the 66 operand-size prefix. So even though there's no choice of operand-size, GNU Binutils thinks there is. (Tested with GAS 2.38). So it's still odd that it uses suffixes in 64-bit mode but not 32, since it thinks the situation is the same in both modes.

Clang and llvm-objdump -d have the same bug, assembling / disassembling callw in 64-bit mode.

AMD's manual says 64-bit mode can't use 32-bit operand-size, but does not mention any limitation on using 16-bit operand-size. So perhaps GAS and LLVM are correct for AMD CPUs, and there is still the same choice of 66 prefix or not, as in 32-bit mode. (You could test by seeing if RIP = 0x1004 after single-stepping callw foo / foo: in a static executable, instead of 0x401006, with the .text section starting at 0x401000.)
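If you want to try the experiment described above, a minimal static test program could look like the sketch below. The file name, the build command in the comment, and the trailing exit sequence are my own assumptions; the interesting part is what RIP and RSP look like after single-stepping the first instruction in a debugger.

```asm
# callw.s (hypothetical file name). One possible build:
#   as callw.s -o callw.o && ld callw.o -o callw
    .text
    .globl  _start
_start:
    callw   foo                  # reported above to assemble to 66 e8 00 00
foo:
    .byte   0, 0                 # if the 66 prefix is ignored, these complete a rel32 of 0
    mov     $60, %eax            # Linux x86-64 exit system call number
    xor     %edi, %edi           # exit status 0
    syscall
```

On a CPU that ignores the prefix, the call should land on the mov and the program should exit normally. On a CPU that honored 16-bit operand-size, RIP would be truncated and the program would most likely fault instead.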

NASM's ndisasm -b64 assumes that a 66 prefix will be ignored in 64-bit mode, disassembling 66E800000000 as call qword 0x18c (it doesn't understand ELF metadata, so I just padded with nops and found it in disassembly of a .o as if it were a flat binary, hence the unusual address.)

From Intel's instruction-set ref manual (linked above):

For a near call absolute, an absolute offset is specified indirectly in a general-purpose register or a memory location (r/m16, r/m32, or r/m64).
The operand-size attribute determines the size of the target operand (16, 32 or 64 bits). When in 64-bit mode, the operand size for near call (and all near branches) is forced to 64-bits.

for rel32 ... As with absolute offsets, the operand-size attribute determines the size of the target operand (16, 32, or 64 bits). In 64-bit mode the target operand will always be 64-bits because the operand size is forced to 64-bits for near branches.

In 32-bit mode, you can encode a 16-bit call rel16 that truncates EIP to 16 bits, or a call r/m16 that uses an absolute 16-bit address. But as the manual says, the operand-size is fixed in 64-bit mode.

This is unlike the situation with push, where it defaults to 64-bit in 64-bit mode, but can be overridden to 16 with an operand-size prefix. (But not to 32 with a REX.W=0). So pushq and pushw are both available, but only callq.

Why does gcc push %rbx at the beginning of main?

The calling convention GCC follows dictates how the stack and registers are used. The contract between caller and callee on 32-bit x86:

    * after call instruction:
          o %eip points at first instruction of function
          o %esp+4 points at first argument
          o %esp points at return address 
    * after ret instruction:
          o %eip contains return address
          o %esp points at arguments pushed by caller
          o called function may have trashed arguments
          o %eax contains return value (or trash if function is void)
          o %ecx, %edx may be trashed
          o %ebp, %ebx, %esi, %edi must contain contents from time of call 
    * Terminology:
          o %eax, %ecx, %edx are "caller save" registers
          o %ebp, %ebx, %esi, %edi are "callee save" registers

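The main function is like any other function in this context. The same rule carries over to x86-64, where %rbx is likewise callee-saved: gcc decided to use it for intermediate calculations, so it saves the caller's value on entry and restores it before returning.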
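A hand-written sketch of the pattern, not actual gcc output: main keeps argc in %rbx across a call, so it has to save and restore the caller's %rbx. The msg label and the choice of puts are illustrative assumptions; this should build with something like gcc file.s on x86-64 Linux.

```asm
    .section .rodata
msg:                             # hypothetical string
    .asciz  "hello"

    .text
    .globl  main
main:
    push    %rbx                 # callee-saved: preserve the caller's value (also aligns RSP to 16)
    mov     %edi, %ebx           # argc survives the call because puts must preserve RBX too
    lea     msg(%rip), %rdi      # first argument for puts
    call    puts@PLT
    mov     %ebx, %eax           # return argc as the exit status
    pop     %rbx                 # restore the caller's RBX
    ret

    .section .note.GNU-stack,"",@progbits
```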
