Difference Between Retq and Ret

What is the difference between retq and ret?

In long (64-bit) mode, you return (ret) by popping a quadword address from the stack to %rip.

In 32-bit mode, you return (ret) by popping a dword address from the stack to %eip.

Some tools like objdump -d call the first one retq. It's just a name, the instruction encoding is the same either way (C3).

Retq instruction, where does it return

After studying assembly code, here are my thoughts,
let's look at a sample:

fun:
push %rbp
mov %rsp,%rbp
...
...
pop %rbp
retq

main:
...
...
callq  "address" <fun>
...
...

We can see there is a instruction before retq. The pop %rbp (sometimes it is a leave instruction but they are similar) instruction will

save the content of current stack pointer %rsp to base stack pointer %rbp.
move the %rsp pointer to previous address on stack.

For example: before pop command, the %rsp pointed to 0x0000 0000 0000 00D0. After the pop command it points to 0x0000 0000 0000 00D8 (assume the stack grows from high address to low address).

After the pop command, now %rsp points to a new address and retq takes this address as return address.

What does `rep ret` mean?

There's a whole blog named after this instruction. And the first post describes the reason behind it: http://repzret.org/p/repzret/

Basically, there was an issue in the AMD's branch predictor when a single-byte ret immediately followed a conditional jump as in the code you quoted (and a few other situations), and the workaround was to add the rep prefix, which is ignored by CPU but fixes the predictor penalty.

repz ret: why all the hassle?

Branch misprediction
The reason for all the hoopla is the cost of branch mispredictions.

When a branch comes around the CPU predicts the branch taken and preloads these instructions in the pipeline.

If the prediction is wrong the pipeline needs to be cleared and new instructions loaded.

This can take up to number_of_stages_in_pipeline cycles plus any cycles needed to load the data from the cache. 14 to 25 cycles per misprediction is typical.

Reason: processor design
The reason K8 and K10 suffer from this is because of a nifty optimization by AMD.

AMD K8 and K10 will pre-decode instructions in the cache and keep track of their length in the CPU L1 instruction cache.

In order to do this it has extra bits.

For every 128 bits (16 bytes) of instructions there are 76 bits of additional data stored.

The following table details this:

Data             Size       Notes
-------------------------------------------------------------------------
Instructions     128 bits   The data as read from memory
Parity bits      8 bits     One parity bit for every 16 bits
Pre-decode       56 bits    3 bits per byte (start, end, function) 
                            + 4 bit per 16 byte line
Branch selectors 16 bits    2 bits for each 2 bytes of instruction code

Total            204 bits   128 instructions, 76 metadata

Because all this data is stored in the L1 instruction cache the K8/10 cpu has to spend a lot less work on decode and branch prediction. This saves on silicon.

And because AMD does not have as big a transistor's budget as Intel it needs to work smarter.

However if the code is esp. tight a jump and a ret might occupy the same two byte slot, meaning that there the RET gets predicted as NOT taken (because the jump following it is).

By making the RET occupy two bytes REP RET this can never occur and a RET will always be predicted OK.

Intel does not have this problem, but (used to) suffer(s) from a limited number of prediction slots, which AMD does not.

nop ret
There is never a reason to do nop ret. This is two instructions wasting an extra cycle to execute the nop and the ret might still 'pair' with a jump.

If you want to align use a REP MOV instead or use a multibyte nop.

Closing remarks
Only the local branch prediction is stored with instructions in the cache.

There is a separate Global branch prediction table as well.

What is callq instruction?

It's just call. Use Intel-syntax disassembly if you want to be able to look up instructions in the Intel/AMD manuals. (objdump -drwC -Mintel, GBD set disassembly-flavor intel, GCC -masm=intel)

The q operand-size suffix does technically apply (it pushes a 64-bit return address and treats RIP as a 64-bit register), but there's no way to override it with instruction prefixes. i.e. calll and callw aren't encodeable in 64-bit mode according to Intel's manual, so it's just annoying that some AT&T syntax tools show it as callq instead of call. This of course applies to retq as well.

Different tools are different in 32 vs. 64-bit mode. (Godbolt)

gcc -S: always call/ret. Nice.
clang -S: callq/retq and calll/retl. At least it's consistently annoying.
objdump -d: callq/retq (explicit 64-bit) and call/ret (implicit for 32-bit). Inconsistent and kinda dumb because 64-bit has no choice of operand-size, but 32-bit does. (Not a useful choice, though: callw truncates EIP to 16 bits.)
Although on the other hand, the default operand size (without a REX.W prefix) for most instructions in 64-bit mode is still 32. But add $1, (%rdi) needs an operand-size suffix; the assembler won't pick 32-bit for you if nothing implies one. OTOH, push is implicitly pushq, even though pushw $1 and pushq $1 are both encodeable (and usable in practice) in 64-bit mode.

GAS in 64-bit mode will assemble callw foo / foo: to 66 e8 00 00, but my Skylake CPU single-steps it as a 6-byte instruction, consuming 2 bytes of 00 after it. And changing RSP by 8. So it decodes it as callq with a rel32=0, ignoring the 66 operand-size prefix. So even though there's no choice of operand-size, GNU Binutils thinks there is. (Tested with GAS 2.38). So it's still odd that it uses suffixes in 64-bit mode but not 32, since it thinks the situation is the same in both modes.

Clang and llvm-objdump -d have the same bug, assembling / disassembling callw in 64-bit mode.

AMD's manual says 64-bit mode can't use 32-bit operand-size, but does not mention any limitation on using 16-bit operand-size. So perhaps GAS and LLVM are correct for AMD CPUs, and there is still the same choice of 66 prefix or not, as in 32-bit mode. (You could test by seeing if RIP = 0x1004 after single-stepping callw foo / foo: in a static executable, instead of 0x401006, with the .text section starting at 0x401000.)

NASM's ndisasm -b64 assumes that a 66 prefix will be ignored in 64-bit mode, disassembling 66E800000000 as call qword 0x18c (it doesn't understand ELF metadata, so I just padded with nops and found it in disassembly of a .o as if it were a flat binary, hence the unusual address.)

From Intel's instruction-set ref manual (linked above):

For a near call absolute, an absolute offset is specified indirectly in a general-purpose register or a memory location (r/m16, r/m32, or r/m64).
The operand-size attribute determines the size of the target operand (16, 32 or 64 bits). When in 64-bit mode, the operand size for near call (and all near branches) is forced to 64-bits.

for rel32 ... As with absolute offsets, the operand-size attribute determines the size of the target operand (16, 32, or 64 bits). In 64-bit mode the target operand will always be 64-bits because the operand size is forced to 64-bits for near branches.

In 32-bit mode, you can encode a 16-bit call rel16 that truncates EIP to 16 bits, or a call r/m16 that uses an absolute 16-bit address. But as the manual says, the operand-size is fixed in 64-bit mode.

This is unlike the situation with push, where it defaults to 64-bit in 64-bit mode, but can be overridden to 16 with an operand-size prefix. (But not to 32 with a REX.W=0). So pushq and pushw are both available, but only callq.

Understand what the assemble code for getbuf does

Can someone please help me understand this assemble code?
24 bit for buffer? but what's rsp?

rsp is the stack pointer. Stack is some area in the memory used for temporary data storage by functions. On many CPUs (including x86), the call instruction also stores the address of the instruction which shall be executed after the ret or retq instruction to the stack.

On x86 CPUs, the rsp (32-bit: esp) register contains an address. The memory before that address is free, the memory at that address and after that address is used.

If you need 100 bytes of temporary memory, you subtract 100 from the rsp register; doing so, you indicate that these 100 bytes of memory are used (by your code). As soon as you don't need the memory any longer, you restore the old value of rsp.

Because retq assumes that rsp points to the memory where call stored the address, you have to restore the old value of rsp before you execute retq.

not sure what does gets do

gets is a function intended for C programming. It reads in one line of text (e.g. from the keyboard). In your case, the line is written to the stack. Unlike fgets this function will not check if there is enough space in the memory!

If the line read is longer than 24 bytes, the gets function will overwrite the data in the memory that was "used" before the sub $0x18, %rsp instruction.

And as I already wrote, the address written by call and read by retq is stored there!

In other words: If the line read in by gets is too long, the address written by call is overwritten and the retq instruction will not return to the calling function but it will jump to some wrong address in the memory.

(I hope this answers Question 2.)

I thought we returned already

The xchg %ax %ax is not really an instruction here, but simply some dummy data. It is inserted by some compilers because these compilers always generate functions that are a multiple of (for example) two bytes long.

but I am confused what's the second column for example in the 1st line we have 1 48 38 ec

This is the bytes (in hexa-decimal notation) in RAM that represent the assembler instructions in the 3rd column.

Example: The hexa-decimal bytes 48 38 ec in RAM will be interpreted as sub $0x18, %rsp by the CPU.

(Please also note tum_'s comment: The data shown in your example is obviously wrong: 48 38 ec is obviously not the bytes representing the instruction sub $0x18, %rsp.)

what is jmpl instruction in x86?

An l operand-size suffix implies an indirect jmp, unlike with calll main which is still a relative near-call. This inconsistency is pure insanity in AT&T syntax design.

(And since you're using it with an operand like main, it becomes a memory-indirect jump, doing a data load from main and using that as the new EIP value.)

You never need to use the jmpl mnemonic, you can and should indicate indirect jumps using * on the operand. Like jmp *%eax to set EIP = EAX, or jmp *4(%edi, %ecx, 4) to index a jump table, or jmp *func_pointer. Using jmpl is optional in all of these.

You could use jmpw *%ax to truncate EIP to a 16-bit value. That assembles to 66 ff e0 jmpw *%ax)

Compare What is callq instruction? and What is the difference between retq and ret?, that's just the operand-size suffix behaving like you expected it would, same as plain call or plain ret. But jmp is different.

semi-related: far jmp or call (to a new CS:[ER]IP) in AT&T syntax is ljmp / lcall. These are very different.

It's also insane that GAS accepts jmpl main as equivalent to jmpl *main. It only warns instead of erroring.

$ gcc -no-pie -fno-pie -m32 jmp.s 
jmp.s: Assembler messages:
jmp.s:3: Warning: indirect jmp without `*'

And then disassembling it to see what we got, with objdump -drwC a.out:

08049156 <main>:                                          # corresponding source line (added by hand)
 8049156:       ff 25 56 91 04 08       jmp    *0x8049156    # jmpl main
 804915c:       ff 25 56 91 04 08       jmp    *0x8049156    # jmp  *main
 8049162:       ff 25 56 91 04 08       jmp    *0x8049156    # jmpl *main

08049168 <foo>:
 8049168:       e8 fb ff ff ff          call   8049168 <foo> # calll foo
 804916d:       ff 15 68 91 04 08       call   *0x8049168    # calll *foo
 8049173:       ff 15 68 91 04 08       call   *0x8049168    # call  *foo

We get the same thing if we replace l with q in the source, and built without -m32 (using the default -m64). Including the same warning about a missing *. But the disassembly has an explicit jmpq and callq on every instruction. (Except for a relative direct jmp I added, which uses the jmp mnemonic in the disassembly.)

It's like objdump thinks 32-bit is the default operand-size for jmp/call in both 32 and 64-bit mode, so it wants to always use a q suffix in 64-bit, but leaves it implicit in 32-bit mode. Anyway, that's just disassembly choice between implicit / explicit size suffixes, no weirdness for a programmer writing source code.

Other AT&T-syntax assemblers:

Clang's built-in assembler does reject jmpl main, requiring jmpl *main.
```
$ clang -m32 jmp.s
jmp.s:3:8: error: invalid operand for instruction
  jmpl main
       ^~~~
```
calll main is the same as call main. call *main and calll *main are both accepted for indirect jumps.
YASM's GAS-syntax mode assembles jmpl main to a near relative jmp, like jmp main! So it disagrees with gcc/clang about jmpl implying indirect. (Very few people use YASM in GAS mode; and these days its maintenance hasn't kept up with NASM for new instructions like AVX512. I like YASM's good defaults for long NOPs, but otherwise I'd recommend NASM.)