What is the difference between retq and ret?
In long (64-bit) mode, you return (ret) by popping a quadword address from the stack to %rip.
In 32-bit mode, you return (ret) by popping a dword address from the stack to %eip.
Some tools like objdump -d call the first one retq. It's just a name; the instruction encoding is the same either way (C3).
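As a rough illustration of the point above, here is a minimal Python model of a near ret in both modes. The ret function, the flat memory image, and the address 0x401136 are hypothetical stand-ins chosen for the sketch; only the pop widths (quadword vs. dword) come from the answer.

```python
import struct

def ret(stack: bytes, sp: int, mode_bits: int):
    """Model a near ret: pop one return address and advance the stack pointer.

    stack is a flat byte image of memory and sp an offset into it
    (hypothetical stand-ins for real memory and %rsp / %esp).
    """
    size = 8 if mode_bits == 64 else 4        # quadword vs. dword
    fmt = "<Q" if mode_bits == 64 else "<I"   # x86 is little-endian
    (target,) = struct.unpack_from(fmt, stack, sp)
    return target, sp + size                  # new instruction pointer, new stack pointer

# The same bytes on the stack, consumed by the same C3 opcode in two modes.
memory = struct.pack("<Q", 0x401136)          # made-up return address
rip, rsp = ret(memory, 0, 64)                 # pops 8 bytes
eip, esp = ret(memory, 0, 32)                 # pops 4 bytes
print(hex(rip), rsp)
print(hex(eip), esp)
```

The only difference between the two calls is how many bytes are popped, which is what the q suffix spells out.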
Retq instruction, where does it return
After studying the assembly code, here are my thoughts.
Let's look at a sample:
fun:
    push %rbp
    mov %rsp,%rbp
    ...
    ...
    pop %rbp
    retq
main:
    ...
    ...
    callq "address" <fun>
    ...
    ...
We can see there is an instruction before retq. The pop %rbp instruction (sometimes it is a leave instruction, which first copies %rbp back into %rsp and then does the same pop) will
- load the quadword at the top of the stack (the address held in %rsp) into the base pointer %rbp, restoring the caller's saved value.
- move the %rsp pointer to the previous slot on the stack, 8 bytes higher.
For example: before the pop command, %rsp pointed to 0x0000 0000 0000 00D0. After the pop command it points to 0x0000 0000 0000 00D8 (assuming the stack grows from high addresses to low addresses).
After the pop command, %rsp points to the slot where callq stored the return address, and retq pops the value stored there into %rip.
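The sequence can be traced with a toy stack model. This is only a sketch: the Cpu class, the starting register values, and the return address 0x401234 are made up for illustration, chosen so that %rsp is 0xD0 before the pop and 0xD8 after it, as in the example above.

```python
class Cpu:
    """Toy machine: memory is a dict of 8-byte slots keyed by address."""
    def __init__(self, rsp, rbp):
        self.rsp, self.rbp, self.rip = rsp, rbp, None
        self.mem = {}

    def push(self, value):
        self.rsp -= 8                 # the stack grows toward lower addresses
        self.mem[self.rsp] = value

    def pop(self):
        value = self.mem[self.rsp]
        self.rsp += 8
        return value

cpu = Cpu(rsp=0xE0, rbp=0x1F0)        # hypothetical state in main
cpu.push(0x401234)                    # callq: return address stored at 0xD8
cpu.push(cpu.rbp)                     # push %rbp: caller's %rbp stored at 0xD0
cpu.rbp = cpu.rsp                     # mov %rsp,%rbp
# ... function body ...
rsp_before_pop = cpu.rsp
cpu.rbp = cpu.pop()                   # pop %rbp: restores the caller's %rbp
rsp_before_ret = cpu.rsp
cpu.rip = cpu.pop()                   # retq: pops the return address into %rip

print(hex(rsp_before_pop), hex(rsp_before_ret), hex(cpu.rip))
```

Note that retq does not jump to the address in %rsp itself; it jumps to the value stored at that address.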
What does `rep ret` mean?
There's a whole blog named after this instruction. And the first post describes the reason behind it: http://repzret.org/p/repzret/
Basically, there was an issue in AMD's branch predictor when a single-byte ret immediately followed a conditional jump, as in the code you quoted (and a few other situations), and the workaround was to add the rep prefix, which is ignored by the CPU but fixes the predictor penalty.
repz ret: why all the hassle?
Branch misprediction
The reason for all the hoopla is the cost of branch mispredictions.
When the CPU reaches a branch, it predicts the outcome and preloads the instructions from the predicted path into the pipeline.
If the prediction is wrong, the pipeline needs to be cleared and new instructions loaded.
This can take up to number_of_stages_in_pipeline cycles plus any cycles needed to load the data from the cache. 14 to 25 cycles per misprediction is typical.
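To get a feel for what that penalty means on average, here is a back-of-the-envelope sketch. The 5% misprediction rate is a made-up figure for illustration only; the 14 and 25 cycle penalties are the typical range quoted above.

```python
def cycles_lost_per_branch(mispredict_rate, penalty_cycles):
    """Average cycles wasted per executed branch."""
    return mispredict_rate * penalty_cycles

rate = 0.05                                   # hypothetical: 1 branch in 20 mispredicted
low = cycles_lost_per_branch(rate, 14)        # short pipeline end of the range
high = cycles_lost_per_branch(rate, 25)       # long pipeline end of the range
print(f"{low:.2f} to {high:.2f} cycles lost per branch on average")
```

Even at a modest misprediction rate, every branch costs the better part of a cycle on average, which is why a ret that is systematically mispredicted is worth a one-byte prefix.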
Reason: processor design
K8 and K10 suffer from this because of a nifty optimization by AMD.
AMD K8 and K10 pre-decode instructions and keep track of their length in the CPU's L1 instruction cache.
To do this, the cache has extra bits.
For every 128 bits (16 bytes) of instructions there are 76 bits of additional data stored.
The following table details this:
Data              Size      Notes
-------------------------------------------------------------------------
Instructions      128 bits  The data as read from memory
Parity bits         8 bits  One parity bit for every 16 bits
Pre-decode         52 bits  3 bits per byte (start, end, function)
                            + 4 bits per 16-byte line
Branch selectors   16 bits  2 bits for each 2 bytes of instruction code
Total             204 bits  128 instructions, 76 metadata
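Working the Notes column through for one 16-byte line reproduces the totals. This is just the table's arithmetic written out; the variable names are mine. Note that the pre-decode formula (3 bits per byte plus 4 bits per line) comes to 52 bits, which is the value consistent with 76 bits of metadata.

```python
LINE_BYTES = 16                                  # one 16-byte line of instructions

instruction_bits = LINE_BYTES * 8                # the code itself
parity_bits = instruction_bits // 16             # one parity bit per 16 bits
predecode_bits = 3 * LINE_BYTES + 4              # 3 bits per byte + 4 bits per line
branch_selector_bits = 2 * (LINE_BYTES // 2)     # 2 bits per 2-byte slot

metadata_bits = parity_bits + predecode_bits + branch_selector_bits
total_bits = instruction_bits + metadata_bits
print(parity_bits, predecode_bits, branch_selector_bits, metadata_bits, total_bits)
```

The branch selectors are the part that matters for rep ret: there is one selector per two-byte slot, so two branches inside the same slot have to share it.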
Because all this data is stored in the L1 instruction cache, the K8/K10 CPU has to do a lot less work on decode and branch prediction. This saves on silicon.
And because AMD does not have as big a transistor budget as Intel, it needs to work smarter.
However, if the code is especially tight, a jump and a ret might occupy the same two-byte slot, meaning that the RET gets predicted as NOT taken (because the jump sharing that slot is).
By making the RET occupy two bytes (REP RET), this can never occur and a RET will always be predicted OK.
Intel does not have this problem, but (used to) suffer(s) from a limited number of prediction slots, which AMD does not.
nop ret
There is never a reason to do nop ret. This is two instructions, wasting an extra cycle to execute the nop, and the ret might still 'pair' with a jump.
If you want to align, use a REP MOV instead or use a multibyte nop.
Closing remarks
Only the local branch prediction is stored with instructions in the cache.
There is a separate Global branch prediction table as well.
What is callq instruction?
It's just call. Use Intel-syntax disassembly if you want to be able to look up instructions in the Intel/AMD manuals. (objdump -drwC -Mintel, GDB set disassembly-flavor intel, GCC -masm=intel)
The q operand-size suffix does technically apply (it pushes a 64-bit return address and treats RIP as a 64-bit register), but there's no way to override it with instruction prefixes. i.e. calll and callw aren't encodeable in 64-bit mode according to Intel's manual, so it's just annoying that some AT&T syntax tools show it as callq instead of call. This of course applies to retq as well.
Different tools are different in 32 vs. 64-bit mode. (Godbolt)
- gcc -S: always call/ret. Nice.
- clang -S: callq/retq and calll/retl. At least it's consistently annoying.
- objdump -d: callq/retq (explicit 64-bit) and call/ret (implicit for 32-bit). Inconsistent and kinda dumb because 64-bit has no choice of operand-size, but 32-bit does. (Not a useful choice, though: callw truncates EIP to 16 bits.)
  Although on the other hand, the default operand size (without a REX.W prefix) for most instructions in 64-bit mode is still 32. But add $1, (%rdi) needs an operand-size suffix; the assembler won't pick 32-bit for you if nothing implies one. OTOH, push is implicitly pushq, even though pushw $1 and pushq $1 are both encodeable (and usable in practice) in 64-bit mode.
GAS in 64-bit mode will assemble callw foo / foo: to 66 e8 00 00, but my Skylake CPU single-steps it as a 6-byte instruction, consuming 2 bytes of 00 after it, and changing RSP by 8. So it decodes it as callq with a rel32=0, ignoring the 66 operand-size prefix. So even though there's no choice of operand-size, GNU Binutils thinks there is. (Tested with GAS 2.38). So it's still odd that it uses suffixes in 64-bit mode but not 32, since it thinks the situation is the same in both modes.
Clang and llvm-objdump -d have the same bug, assembling / disassembling callw in 64-bit mode.
AMD's manual says 64-bit mode can't use 32-bit operand-size, but does not mention any limitation on using 16-bit operand-size. So perhaps GAS and LLVM are correct for AMD CPUs, and there is still the same choice of 66 prefix or not, as in 32-bit mode. (You could test by seeing if RIP = 0x1004 after single-stepping callw foo / foo: in a static executable, instead of 0x401006, with the .text section starting at 0x401000.)
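The two possible outcomes of that test can be worked out by hand; the sketch below does the arithmetic for both interpretations of 66 e8 00 00. It is a model of the two decodings, not a statement about what any particular CPU does; decode_call is a hypothetical helper, and it assumes the two bytes following the instruction are 00, as in the single-step experiment described above.

```python
import struct

def decode_call(code: bytes, addr: int, honor_66_prefix: bool):
    """Return (instruction_length, target) for a 66 e8 ... call located at addr."""
    assert code[0] == 0x66 and code[1] == 0xE8
    if honor_66_prefix:
        # 16-bit operand size: rel16 displacement, IP truncated to 16 bits
        (rel,) = struct.unpack_from("<h", code, 2)
        length = 4
        target = (addr + length + rel) & 0xFFFF
    else:
        # prefix ignored: rel32 displacement, full 64-bit RIP
        (rel,) = struct.unpack_from("<i", code, 2)
        length = 6
        target = (addr + length + rel) & 0xFFFFFFFFFFFFFFFF
    return length, target

code = bytes.fromhex("66e80000") + bytes(2)   # assumption: two 00 bytes follow
text_start = 0x401000

honored = decode_call(code, text_start, honor_66_prefix=True)
ignored = decode_call(code, text_start, honor_66_prefix=False)
print(honored[0], hex(honored[1]))
print(ignored[0], hex(ignored[1]))
```

The first line corresponds to a CPU that honors the prefix (RIP = 0x1004), the second to the Skylake behaviour described earlier (RIP = 0x401006).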
NASM's ndisasm -b64 assumes that a 66 prefix will be ignored in 64-bit mode, disassembling 66E800000000 as call qword 0x18c (it doesn't understand ELF metadata, so I just padded with nops and found it in disassembly of a .o as if it were a flat binary, hence the unusual address.)
From Intel's instruction-set ref manual (linked above):
For a near call absolute, an absolute offset is specified indirectly in a general-purpose register or a memory location (r/m16, r/m32, or r/m64).
The operand-size attribute determines the size of the target operand (16, 32 or 64 bits). When in 64-bit mode, the operand size for near call (and all near branches) is forced to 64-bits.
for rel32 ... As with absolute offsets, the operand-size attribute determines the size of the target operand (16, 32, or 64 bits). In 64-bit mode the target operand will always be 64-bits because the operand size is forced to 64-bits for near branches.
In 32-bit mode, you can encode a 16-bit call rel16 that truncates EIP to 16 bits, or a call r/m16 that uses an absolute 16-bit address. But as the manual says, the operand-size is fixed in 64-bit mode.
This is unlike the situation with push, where it defaults to 64-bit in 64-bit mode, but can be overridden to 16 with an operand-size prefix. (But not to 32 with a REX.W=0). So pushq and pushw are both available, but only callq.
Understand what the assembly code for getbuf does
Can someone please help me understand this assembly code?
24 bit for buffer? but what's rsp?
rsp is the stack pointer. The stack is an area in memory used for temporary data storage by functions. On many CPUs (including x86), the call instruction also stores on the stack the address of the instruction which shall be executed after the ret or retq instruction.
On x86 CPUs, the rsp (32-bit: esp) register contains an address. The memory below that address (at lower addresses) is free; the memory at that address and above it is used.
If you need 100 bytes of temporary memory, you subtract 100 from the rsp register; doing so, you indicate that these 100 bytes of memory are used (by your code). As soon as you don't need the memory any longer, you restore the old value of rsp.
Because retq assumes that rsp points to the memory where call stored the address, you have to restore the old value of rsp before you execute retq.
not sure what does gets do
gets is a function intended for C programming. It reads in one line of text (e.g. from the keyboard). In your case, the line is written to the stack. Unlike fgets, this function will not check if there is enough space in the memory!
If the line read is longer than 24 bytes, the gets function will overwrite the data in the memory that was "used" before the sub $0x18, %rsp instruction.
And as I already wrote, the address written by call and read by retq is stored there!
In other words: If the line read in by gets is too long, the address written by call is overwritten and the retq instruction will not return to the calling function but it will jump to some wrong address in the memory.
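The overwrite can be simulated safely in Python. This is a sketch under stated assumptions: the frame layout (a 24-byte buffer directly below the 8-byte return address) follows the sub $0x18, %rsp described above, while make_frame, unchecked_gets, and the return address 0x401234 are hypothetical names and values chosen for the illustration.

```python
import struct

BUF_SIZE = 0x18   # 24 bytes reserved by sub $0x18, %rsp

def make_frame(return_address):
    # Lowest address first: the 24-byte buffer, then the 8-byte address stored by call.
    return bytearray(BUF_SIZE) + bytearray(struct.pack("<Q", return_address))

def unchecked_gets(frame, line: bytes):
    """Copy the line plus a terminating NUL with no bounds check, as gets does."""
    for i, byte in enumerate(line + b"\x00"):
        frame[i] = byte

def saved_return_address(frame):
    return struct.unpack_from("<Q", frame, BUF_SIZE)[0]

frame = make_frame(0x401234)
unchecked_gets(frame, b"A" * 23)             # 23 characters + NUL fit exactly
intact = saved_return_address(frame)

frame = make_frame(0x401234)
unchecked_gets(frame, b"A" * 24 + b"BBBB")   # 4 bytes + NUL spill past the buffer
clobbered = saved_return_address(frame)

print(hex(intact), hex(clobbered))
```

In the second case retq would pop 0x42424242 (the bytes of "BBBB") instead of the address that call stored.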
(I hope this answers Question 2.)
I thought we returned already
The xchg %ax,%ax is not really an instruction that gets executed here, but simply padding. (It is a two-byte NOP, encoded as 66 90.) Some compilers insert such padding after the ret so that the next function starts at an aligned address.
but I am confused what's the second column for example in the 1st line we have 1 48 38 ec
These are the bytes (in hexadecimal notation) in RAM that represent the assembler instructions in the 3rd column.
Example: The hexadecimal bytes 48 38 ec in RAM will be interpreted as sub $0x18, %rsp by the CPU.
(Please also note tum_'s comment: The data shown in your example is obviously wrong: 48 38 ec is obviously not the bytes representing the instruction sub $0x18, %rsp. The correct encoding is 48 83 ec 18.)
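For reference, sub $0x18, %rsp encodes as 48 83 ec 18. The sketch below picks those four bytes apart to show how the CPU gets from bytes to the instruction. It handles only this one instruction form (REX prefix, opcode 83, register operand, 8-bit immediate); the lookup tables and variable names are mine.

```python
GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]    # opcode 83 /0../7
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]    # register numbers 0..7

insn = bytes.fromhex("4883ec18")
rex, opcode, modrm, imm8 = insn

rex_w = (rex >> 3) & 1      # REX.W = 1 selects 64-bit operand size
mod = modrm >> 6            # 3 means the operand is a register, not memory
reg = (modrm >> 3) & 7      # with opcode 83 this field selects the operation
rm = modrm & 7              # register number of the destination

text = f"{GROUP1[reg]} ${imm8:#x}, %{REG64[rm]}"
print(text)
```

The last byte, 0x18, is the immediate: 24 in decimal, the size of the buffer.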
what is jmpl instruction in x86?
An l operand-size suffix implies an indirect jmp, unlike with calll main which is still a relative near-call. This inconsistency is pure insanity in AT&T syntax design.
(And since you're using it with an operand like main, it becomes a memory-indirect jump, doing a data load from main and using that as the new EIP value.)
You never need to use the jmpl mnemonic; you can and should indicate indirect jumps using * on the operand. Like jmp *%eax to set EIP = EAX, or jmp *4(%edi, %ecx, 4) to index a jump table, or jmp *func_pointer. Using jmpl is optional in all of these.
You could use jmpw *%ax to truncate EIP to a 16-bit value. (That assembles to 66 ff e0 jmpw *%ax.)
Compare What is callq instruction? and What is the difference between retq and ret?, that's just the operand-size suffix behaving like you expected it would, same as plain call or plain ret. But jmp is different.
semi-related: far jmp or call (to a new CS:[ER]IP) in AT&T syntax is ljmp / lcall. These are very different.
It's also insane that GAS accepts jmpl main as equivalent to jmpl *main. It only warns instead of erroring.
$ gcc -no-pie -fno-pie -m32 jmp.s
jmp.s: Assembler messages:
jmp.s:3: Warning: indirect jmp without `*'
And then disassembling it to see what we got, with objdump -drwC a.out:
08049156 <main>: # corresponding source line (added by hand)
8049156: ff 25 56 91 04 08 jmp *0x8049156 # jmpl main
804915c: ff 25 56 91 04 08 jmp *0x8049156 # jmp *main
8049162: ff 25 56 91 04 08 jmp *0x8049156 # jmpl *main
08049168 <foo>:
8049168: e8 fb ff ff ff call 8049168 <foo> # calll foo
804916d: ff 15 68 91 04 08 call *0x8049168 # calll *foo
8049173: ff 15 68 91 04 08 call *0x8049168 # call *foo
We get the same thing if we replace l with q in the source, and build without -m32 (using the default -m64). Including the same warning about a missing *. But the disassembly has an explicit jmpq and callq on every instruction. (Except for a relative direct jmp I added, which uses the jmp mnemonic in the disassembly.)
It's like objdump thinks 32-bit is the default operand-size for jmp/call in both 32 and 64-bit mode, so it wants to always use a q suffix in 64-bit, but leaves it implicit in 32-bit mode. Anyway, that's just disassembly choice between implicit / explicit size suffixes, no weirdness for a programmer writing source code.
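The difference between the relative and the memory-indirect forms is visible in the raw bytes of the 32-bit objdump listing above. The sketch below decodes two of its lines; the byte strings and addresses are copied from that listing, while the helper names are mine.

```python
import struct

def rel32_target(addr, insn):
    """Target of an e8 rel32 call located at addr: next address + signed offset."""
    (rel,) = struct.unpack_from("<i", insn, 1)
    return (addr + len(insn) + rel) & 0xFFFFFFFF

def abs32_operand(insn):
    """The absolute address that an ff 25 / ff 15 instruction loads its target from."""
    (disp,) = struct.unpack_from("<I", insn, 2)
    return disp

call_foo = bytes.fromhex("e8fbffffff")      # 8049168: calll foo
jmp_main = bytes.fromhex("ff2556910408")    # 8049156: jmpl main

direct_target = rel32_target(0x8049168, call_foo)
load_address = abs32_operand(jmp_main)
print(hex(direct_target), hex(load_address))
```

calll foo branches straight to foo (offset -5 from the next instruction), whereas jmpl main does not jump to main: it loads 4 bytes from the address of main and jumps to whatever value it finds there.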
Other AT&T-syntax assemblers:
- Clang's built-in assembler does reject jmpl main, requiring jmpl *main.
  $ clang -m32 jmp.s
  jmp.s:3:8: error: invalid operand for instruction
   jmpl main
        ^~~~
  calll main is the same as call main. call *main and calll *main are both accepted for indirect calls.
- YASM's GAS-syntax mode assembles jmpl main to a near relative jmp, like jmp main! So it disagrees with gcc/clang about jmpl implying indirect. (Very few people use YASM in GAS mode; and these days its maintenance hasn't kept up with NASM for new instructions like AVX512. I like YASM's good defaults for long NOPs, but otherwise I'd recommend NASM.)