When does Linux x86-64 syscall clobber %r8, %r9 and %r10?
Only 32-bit system calls (e.g. via int 0x80) in 64-bit mode step on those registers, along with R11. (What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?).
syscall properly saves/restores all regs including R8, R9, and R10, so user-space using it can assume they keep their values, except the RAX return value. (The kernel's syscall entry point even saves RCX and R11, but at that point they've already been overwritten by the syscall instruction itself with the original RIP and before-masking RFLAGS value.)
Those, with R11, are the non-legacy registers that are call-clobbered in the function-calling convention, so compiler-generated code for C functions inside the kernel naturally preserves R12-R15, even if an asm entry point didn't save them.
Currently the 64-bit int 0x80 entry point just pushes 0 for the call-clobbered R8-R11 registers in the process-state struct that it will restore from before returning to user space, instead of the original register values.
Historically, the int 0x80 entry point from 32-bit user-space didn't save/restore those registers at all. So their values were whatever compiler-generated kernel code left sitting around. This was thought to be innocent because 32-bit mode can't read those registers, until it was realized that user-space can far-jump to 64-bit mode, using the same CS value that the kernel uses for normal 64-bit user-space processes, selecting that system-wide GDT entry. So there was an actual info leak of kernel data, which was fixed by zeroing those registers.
IDK whether there used to be or still is a separate entry point from 64-bit user-space vs. 32-bit, or how they differ in struct pt_regs layout. The historical situation where int 0x80 leaked r8..r11 wouldn't have made sense for 64-bit user-space; that leak would have been obvious. So if they're unified now, they must not have been in the past.
Why do x86-64 Linux system calls modify RCX, and what does the value mean?
The system call return value is in rax, as always. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64.
Note that sys_brk has a slightly different interface than the brk / sbrk POSIX functions; see the C library/kernel differences section of the Linux brk(2) man page. Specifically, Linux sys_brk sets the program break; the arg and return value are both pointers. See Assembly x86 brk() call use. That answer needs upvotes because it's the only good one on that question.
The other interesting part of your question is:
I do not quite understand the value in the rcx register in this case
You're seeing the mechanics of how the syscall / sysret instructions are designed to allow the kernel to resume user-space execution but still be fast.
syscall doesn't do any loads or stores, it only modifies registers. Instead of using special registers to save a return address, it simply uses regular integer registers.
It's not a coincidence that RCX=RIP and R11=RFLAGS after the kernel returns to your user-space code. The only way for this not to be the case is if a ptrace system call modified the process's saved rcx or r11 value while it was inside the kernel. (ptrace is the system call gdb uses). In that case, Linux would use iret instead of sysret to return to user space, because the slower general-case iret can do that. (See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for some walk-through of Linux's system-call entry points. Mostly the entry points from 32-bit processes, not from syscall in a 64-bit process, though.)
Instead of pushing a return address onto the kernel stack (like int 0x80 does), syscall:
- sets RCX=RIP, R11=RFLAGS (so it's impossible for the kernel to even see the original values of those regs before you executed syscall).
- masks RFLAGS with a pre-configured mask from a config register (the IA32_FMASK MSR). This lets the kernel disable interrupts (IF) until it's done swapgs and setting rsp to point to the kernel stack. Even with cli as the first instruction at the entry point, there'd be a window of vulnerability. You also get cld for free by masking off DF so rep movs / stos go upward even if user-space had used std.
  Fun fact: AMD's first proposed syscall / swapgs design didn't mask RFLAGS, but they changed it after feedback from kernel developers on the amd64 mailing list (in ~2000, a couple years before the first silicon).
- jumps to the configured syscall entry point (setting RIP = IA32_LSTAR; the new CS selector comes from IA32_STAR). The old CS value isn't saved anywhere, I think.
- It doesn't do anything else; the kernel has to use swapgs to get access to an info block where it saved the kernel stack pointer, because rsp still has its value from user-space.
So the design of syscall requires a system-call ABI that clobbers registers, and that's why the values are what they are.
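To see this from user space, here is a minimal sketch, assuming Linux x86-64 and a GNU C compatible compiler. The helper name capture_after_syscall is hypothetical and not from the original answer. It lists RCX and R11 as outputs instead of clobbers, and also records the address of the instruction right after syscall so the two can be compared.

```c
#include <stdint.h>
#include <sys/syscall.h>   // SYS_getpid

// Hypothetical helper (an illustration, not part of the original answer):
// runs getpid through the raw syscall instruction and reports what RCX and
// R11 hold afterwards. Assumes Linux x86-64 and GNU C inline asm.
static long capture_after_syscall(uint64_t *rcx_out, uint64_t *r11_out,
                                  uint64_t *next_insn_out)
{
    long ret;
    uint64_t rcx, next_insn;
    register uint64_t r11 __asm__("r11");   // no single-letter constraint for R11

    __asm__ volatile(
        "syscall\n"
        "1:\n\t"
        "lea 1b(%%rip), %[next]"   // address of label 1: the insn after syscall
        : "=a"(ret), "=c"(rcx), "=r"(r11), [next] "=r"(next_insn)
        : "a"((long)SYS_getpid)
        : "memory");

    *rcx_out = rcx;              // expected: the user-space return address
    *r11_out = r11;              // expected: the RFLAGS value at the syscall
    *next_insn_out = next_insn;
    return ret;
}
```

On a normal (untraced) run, the RCX value should equal the recorded address, and the R11 value should look like a plausible RFLAGS image (bit 1 is always set).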
FreeBSD syscall clobbering more registers than Linux? Inline asm different behaviour between optimization levels
Optimization level determines where clang decides to keep its loop counter: in memory (unoptimized) or in a register, in this case r8d (optimized). R8D is a logical choice for the compiler: it's a call-clobbered reg it can use without saving at the start/end of main, and you've told it all the registers it could use without a REX prefix (like ECX) are either inputs / outputs or clobbers for the asm statement.
Note: if FreeBSD is like MacOS, system call error / no-error status is returned in CF (the carry flag), not via RAX being in the -4095..-1 range. In that case, you'd want a GCC6 flag-output operand like "=@ccc" (err) for int err (#ifdef __GCC_ASM_FLAG_OUTPUTS__ - example) or a setc %cl in the template to materialize a boolean manually. (CL is a good choice because you can just use it as an output instead of a clobber.)
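As a rough illustration of the two options, here is a sketch that reads CF after a plain add instruction rather than after syscall, so it doesn't depend on any particular kernel's error convention. The function name add_reports_carry is made up for this example; the "=@ccc" path needs a compiler that defines __GCC_ASM_FLAG_OUTPUTS__, and the fallback materializes the flag with setc.

```c
#include <stdint.h>

// Illustration only: returns the carry flag produced by a + b and stores the
// (wrapped) sum. The same operand pattern could read CF after a syscall on a
// kernel that reports errors in CF; that part is an assumption, not shown here.
static int add_reports_carry(uint64_t a, uint64_t b, uint64_t *sum)
{
    int carry;
#ifdef __GCC_ASM_FLAG_OUTPUTS__
    // Flag-output operand: the compiler reads CF directly, no setc needed.
    __asm__("add %[b], %[a]"
            : [a] "+r"(a), "=@ccc"(carry)
            : [b] "r"(b));
#else
    // Fallback: materialize CF into a byte register by hand.
    unsigned char c;
    __asm__("add %[b], %[a]\n\t"
            "setc %[c]"
            : [a] "+r"(a), [c] "=r"(c)
            : [b] "r"(b)
            : "cc");
    carry = c;
#endif
    *sum = a;
    return carry;
}
```

With the flag-output form the compiler can branch on CF directly (e.g. emit jc) instead of testing a materialized boolean.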
FreeBSD's syscall handling trashes R8, R9, and R10, in addition to the bare minimum clobbering that Linux does: RAX (retval) and RCX / R11. (The syscall instruction itself uses them to save RIP / RFLAGS so the kernel can find its way back to user-space, so the kernel never even sees the original values.)
Possibly also RDX, we're not sure; the comments call it "return value 2" (i.e. as part of a RDX:RAX return value?). We also don't know what future-proof ABI guarantees FreeBSD intends to maintain in future kernels.
You can't assume R8-R10 are zero after syscall because they're actually preserved instead of zeroed when tracing / single-stepping. (Because then the kernel chooses not to return via sysret, for the same reason as Linux: hardware / design bugs make that unsafe if registers might have been modified by ptrace while inside the system call. e.g. attempting to sysret with a non-canonical RIP will #GP in ring 0 (kernel mode) on Intel CPUs! That's a disaster because RSP = user stack at that point.)
The relevant kernel code is the sysret path (well spotted by @NateEldredge; I found the syscall entry point by searching for swapgs, but hadn't gotten to looking at the return path).
The function-call-preserved registers don't need to be restored by that code because calling a C function didn't destroy them in the first place, and the code does restore the function-call-clobbered "legacy" registers RDI, RSI, and RDX.
R8-R11 are the registers that are call-clobbered in the function-calling convention, and that are outside the original 8 x86 registers. So that's what makes them "special". (R11 doesn't get zeroed; syscall/sysret uses it for RFLAGS, so that's the value you'll find there after syscall.)
Zeroing is slightly faster than loading them, and in the normal case (syscall instruction inside a libc wrapper function) you're about to return to a caller that's only assuming the function-calling convention, and thus will assume that R8-R11 are trashed (same for RDI, RSI, RDX, and RCX, although FreeBSD does bother to restore those for some reason.)
This zeroing only happens when not single-stepping or tracing (e.g. truss or GDB si). The syscall entry point into an amd64 kernel (Github) does save all the incoming registers, so they're available to be restored by other ways out of the kernel.
Updated asm() wrapper
// Should be fixed for FreeBSD, plus other improvements
#include <sys/types.h>     // ssize_t, size_t
#include <sys/syscall.h>   // SYS_write

ssize_t sys_write(int fd, const void *data, size_t size){
    register ssize_t res __asm__("rax");
    register int arg0 __asm__("edi") = fd;
    register const void *arg1 __asm__("rsi") = data;  // you can use real types
    register size_t arg2 __asm__("rdx") = size;
    __asm__ __volatile__(
        "syscall"
                    // RDX *maybe* clobbered
        : "=a" (res), "+r" (arg2)
                           // RDI, RSI preserved
        : "a" (SYS_write), "r" (arg0), "r" (arg1)
          // An arg in R10, R8, or R9 definitely would be
        : "rcx", "r11", "memory", "r8", "r9", "r10"    ////// The fix: r8-r10
        // see below for a version that avoids the "memory" clobber with a dummy input operand
    );
    return res;
}
Use "+r" output/input operands with any args that need register long arg3 asm("r10") or similar for r8 or r9.
This is inside a wrapper function so the modified value of the C variables get thrown away, forcing repeated calls to set up the args every time. That would be the "defensive" approach until another answer identifies more definitely-non-trashed registers.
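A sketch of that defensive approach for a generic six-argument wrapper might look like the following. The name raw_syscall6 is hypothetical, and treating every argument register as possibly modified is a deliberately pessimistic assumption, not a statement about what any particular kernel guarantees. The return value is whatever the kernel leaves in RAX.

```c
#include <sys/syscall.h>   // SYS_* numbers

// Hypothetical generic wrapper: every argument register is an in/out ("+")
// operand, so the compiler assumes none of them survive the system call.
static long raw_syscall6(long nr, long a0, long a1, long a2,
                         long a3, long a4, long a5)
{
    // R10, R8 and R9 have no single-letter constraints, so use register-asm locals.
    register long r10 __asm__("r10") = a3;
    register long r8  __asm__("r8")  = a4;
    register long r9  __asm__("r9")  = a5;
    long ret;

    __asm__ volatile("syscall"
        : "=a"(ret),
          "+D"(a0), "+S"(a1), "+d"(a2),      // RDI, RSI, RDX
          "+r"(r10), "+r"(r8), "+r"(r9)
        : "a"(nr)
        : "rcx", "r11", "memory");           // generic wrapper: keep "memory"
    return ret;
}
```

Because the modified copies of the arguments are locals that die at the end of the wrapper, the caller simply sets the registers up again on the next call.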
I did break *0x000000000020192b then info registers when break happened. r8 is zero. Program still gets stuck in this case
I assume that r8 wasn't zero before you did that GDB continue across the syscall instruction. Yes, that test confirms that the FreeBSD kernel is trashing r8 when not single-stepping. (And behaving in a way that matches what we see in the source code.)
Note that you can tell the compiler that a write system call only reads memory (not writes) using a dummy "m" input operand instead of a "memory" clobber. That would let it hoist the store of c out of the loop. (How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)
i.e. "m"(*(const char (*)[size]) data) as an input instead of a "memory" clobber.
If you're going to write specific wrappers for each syscall you use, instead of a generic wrapper you use for every 3-operand syscall that just casts all operands to unsigned long, this is the advantage you can get from doing that.
Speaking of which, there's absolutely no point in making your syscall args all be long; making user-space sign-extend int fd into a 64-bit register is just wasted instructions. The kernel ABI will (almost certainly) ignore the high bytes of registers for narrow args, like Linux does. (Again, unless you're making a generic syscall3 wrapper that you just use with different SYS_ numbers to define write, read, and other 3-operand system calls; then you would cast everything to register-width and just use a "memory" clobber).
I made these changes for my modified version below.
Also note that for RDI, RSI, and RDX, there are specific-register letter constraints which you can use instead of register-asm locals, just like you're doing for the return value in RAX ("=a"). BTW, you don't really need a matching constraint for the call number, just use an "a" input; it's easier to read because you don't need to look at another operand to check that you're matching the right output.
// assuming RDX *is* clobbered.
// could remove the + if it isn't.
ssize_t sys_write(int fd, const void *data, size_t size)
{
    // register long arg3 __asm__("r10") = ??;
    // register-asm is useful for R8 and up
    ssize_t res;
    __asm__ __volatile__("syscall"
                    // RDX
        : "=a" (res), "+d" (size)
        // EAX/RAX         RDI       RSI
        : "a" (SYS_write), "D" (fd), "S" (data),
          "m" (*(const char (*)[size]) data) // tells compiler this mem is an input
        : "rcx", "r11"    //, "memory"
#ifndef __linux__
          , "r8", "r9", "r10"   // Linux always restores these
#endif
    );
    return res;
}
Some people prefer register ... asm("") for all the operands because you get to use the full register name, and don't have to remember the totally-non-obvious "D" for RDI/EDI/DI/DIL vs. "d" for RDX/EDX/DX/DL.
Why do x86-64 Linux system calls work with 6 registers set?
System calls accept up to 6 arguments, passed in registers (almost the same registers as the SysV x64 C ABI, with r10 replacing rcx, but they are callee-preserved in the syscall case), and "extra" arguments are simply ignored.
Some specific answers to your questions below.
The src/internal/x86_64/syscall.s is just a "thunk" which shifts all the arguments into the right place. That is, it converts from a C-ABI function which takes the syscall number and 6 more arguments, into a "syscall ABI" function with the same 6 arguments and the syscall number in rax. It works "just fine" for any number of arguments - the additional register movement will simply be ignored by the syscall if those arguments aren't used.
Since in the C-ABI all the argument registers are considered scratch (i.e., caller-save), clobbering them is harmless if you assume this __syscall method is called from C. In fact the kernel makes stronger guarantees about clobbered registers, clobbering only rcx and r11, so assuming the C calling convention is safe but pessimistic. In particular, the code calling __syscall as implemented here will unnecessarily save any argument and scratch registers per the C ABI, despite the kernel's promise to preserve them.
The arch/x86_64/syscall_arch.h file is pretty much the same thing, but in a C header file. Here, you want all seven versions (for zero to six arguments) because modern C compilers will warn or error if you call a function with the wrong number of arguments. So there is no real option to have "one function to rule them all" as in the assembly case. This also has the advantage of doing less work for syscalls that take fewer than 6 arguments.
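For a feel of what such per-arity wrappers look like, here is a sketch of three of the seven. It is written for this discussion rather than copied from the header, the my_syscallN names are hypothetical, and it assumes Linux x86-64, where only RAX, RCX and R11 need to be treated as modified.

```c
#include <sys/syscall.h>   // SYS_* numbers

// Hypothetical per-arity wrappers (names invented for this sketch). Each one
// only mentions the argument registers it actually uses, so the compiler does
// no extra register shuffling for short calls.
static inline long my_syscall0(long n)
{
    long ret;
    __asm__ volatile("syscall"
        : "=a"(ret)
        : "a"(n)
        : "rcx", "r11", "memory");
    return ret;
}

static inline long my_syscall1(long n, long a1)
{
    long ret;
    __asm__ volatile("syscall"
        : "=a"(ret)
        : "a"(n), "D"(a1)
        : "rcx", "r11", "memory");
    return ret;
}

static inline long my_syscall3(long n, long a1, long a2, long a3)
{
    long ret;
    __asm__ volatile("syscall"
        : "=a"(ret)
        : "a"(n), "D"(a1), "S"(a2), "d"(a3)
        : "rcx", "r11", "memory");
    return ret;
}
```

Calling my_syscall1 with three arguments is a compile-time error, which is the type-checking benefit the separate functions buy over a single variadic-style thunk.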
Your listed questions, answered:
- Why can I pass more parameters than the system call takes?
Because the calling convention is mostly register-based and caller cleanup. You can always pass more arguments in this situation (including in the C ABI) and the other arguments will simply be ignored by the callee. Since the syscall mechanism is generic at the C and .asm level, there is no real way the compiler can ensure you are passing the right number of arguments - you need to pass the right syscall id and the right number of arguments. If you pass fewer, the kernel will see garbage, and if you pass more, they will be ignored.
- Is this reasonable, documented behavior?
Yes, sure - because the whole syscall mechanism is a "generic gate" into the kernel. 99% of the time you aren't going to use that: glibc wraps the vast majority of interesting syscalls in C ABI wrappers with the correct signature so you don't have to worry about it. Those are the ways that syscall access happens safely.
- What am I supposed to set the unused registers to?
You don't set them to anything. If you use the C prototypes from arch/x86_64/syscall_arch.h the compiler just takes care of it for you (it doesn't set them to anything) and if you are writing your own asm, you don't set them to anything (and you should assume they are clobbered after the syscall).
- What will the kernel do with the registers it doesn't use?
It is free to use all the registers it wants, but will adhere to the kernel calling convention which is that on x86-64 all registers other than rax, rcx and r11 are preserved (which is why you see rcx and r11 in the clobber list in the C inline asm).
- Is the seven function approach faster by virtue of having less instructions?
Yes, but the difference is very small since reg-reg mov instructions usually have zero latency and high throughput (up to 4/cycle) on recent Intel architectures. So moving an extra 6 registers perhaps takes something like 1.5 cycles for a syscall that is usually going to take at least 50 cycles even if it does nothing. So the impact is small, but probably measurable (if you measure very carefully!).
- What happens to the other registers in those functions?
I'm not sure what you mean exactly, but the kernel can use the other registers just like any GP registers, as long as it preserves their values (e.g., by pushing them on the stack and then popping them later).
Clang 11 and GCC 8 O2 Breaks Inline Assembly
- You can't use extended asm in a naked function, only basic asm, according to the gcc manual. You don't need to inform the compiler of clobbered registers (since it won't do anything about them anyway; in a naked function you are responsible for all register management). And passing the address of entry in an extended operand is unnecessary; just do jmp entry.
  (In my tests your code doesn't compile at all, so I assume you weren't showing us your exact code - next time please do, so as to avoid wasting people's time.)
- Linux x86-64 syscall system calls are allowed to clobber the rcx and r11 registers, so you need to add those to the clobber lists of your system calls.
- You align the stack to a 16-byte boundary before jumping to entry. However, the 16-byte alignment rule is based on the assumption that you will be calling the function with call, which would push an additional 8 bytes onto the stack. As such, the called function actually expects the stack to initially be, not a multiple of 16, but 8 more or less than a multiple of 16. So you are actually aligning the stack incorrectly, and this can be a cause of all sorts of mysterious trouble.
  So either replace your jmp with call, or else subtract a further 8 bytes from rsp (or just push some 64-bit register of your choice).
- Style note: unsigned long is already 64 bits on Linux x86-64, so it would be more idiomatic to use that in place of unsigned long long everywhere.
- General hint: learn about register constraints in extended asm. You can have the compiler load your desired registers for you, instead of writing instructions in your asm to do it yourself. So your exit function could instead look like:
void exit(unsigned long status) {
    asm volatile("syscall"
        : // no outputs
        : "a"(60), "D"(status)
        : "rcx", "r11");
}
This in particular saves you a few instructions, since status is already in the %rdi register on function entry. With your original code, the compiler has to move it somewhere else so that you can then load it into %rdi yourself.
- Your open function always returns 1, which will typically not be the fd that was actually opened. So if your program is run with standard output redirected, your program will write to the redirected stdout, instead of to the tty as it seems to want to do. Indeed, this makes the open syscall completely pointless, because you never use the file you opened.
  You should arrange for open to return the value that was actually returned by the system call, which will be left in the %rax register when syscall returns. You can use an output operand to have this stored in a temporary variable (which the compiler will likely optimize out), and return that. You'll need to use a digit constraint since it is going in the same register as an input operand. I leave this as an exercise for you. It would likewise be nice if your write function actually returned the number of bytes written.
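For readers who want to check their attempt at that exercise, one possible shape is sketched below. It assumes Linux x86-64, uses the hypothetical name my_open so it doesn't collide with libc, and uses the matching ("digit") constraint "0" to put the call number in the same register as the output.

```c
#include <fcntl.h>         // O_RDONLY
#include <sys/syscall.h>   // SYS_open

// Sketch of an open wrapper that returns what the kernel left in RAX:
// a file descriptor on success, or a negative errno value on failure.
static long my_open(const char *path, int flags, int mode)
{
    long ret;
    __asm__ volatile("syscall"
        : "=a"(ret)                       // operand 0: result in RAX
        : "0"((long)SYS_open),            // "0" = same register as operand 0
          "D"(path), "S"(flags), "d"(mode)
        : "rcx", "r11", "memory");        // kernel reads the path string
    return ret;
}
```

A caller would then pass the returned descriptor to its write wrapper instead of hard-coding 1.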
For temporary registers in the asm statement, should I use clobber or dummy output?
You should normally let the compiler pick registers for you, using an early-clobber dummy output with any required constraints1. This gives it flexibility to do register allocation for the function.
1 e.g. you can use +&Q to get one of RAX/RBX/RCX/RDX: registers that have an AH/BH/CH/DH. If you wanted to unpack 8-bit fields with movzbl %h[input], %[high_byte] ; movzbl %b[input], %[low_byte] ; shr $16, %[input], you'd need a register that has its 2nd 8-bit chunk aliased to a high-8 register.
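A cut-down sketch of that footnote's idea, without the shr step and therefore with a plain "Q" input rather than "+&Q", could look like this. The function name is invented for the example. The destination of the high-byte load is also restricted to "Q" here, since an instruction that reads AH/BH/CH/DH can't also name a register that needs a REX prefix.

```c
#include <stdint.h>

// Sketch: split the low 16 bits of `in` into two zero-extended bytes using the
// %b (low byte) and %h (high byte) operand modifiers.
static void unpack_low_two_bytes(uint32_t in, uint32_t *low_byte,
                                 uint32_t *high_byte)
{
    uint32_t lo, hi;
    __asm__("movzbl %h[in], %[hi]\n\t"
            "movzbl %b[in], %[lo]"
            : [hi] "=&Q"(hi),     // early-clobber: written before `in` is last read
              [lo] "=r"(lo)
            : [in] "Q"(in));      // must be a/b/c/d so %h[in] exists
    *low_byte = lo;
    *high_byte = hi;
}
```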
Out of curiosity, when we consider a calling convention of amd64, some registers can be freely used inside the functions; and we could implement some functions by only using those registers inside the asm statement. Why allowing the compiler to choose the registers to be used is better than the mentioned one?
Because functions can inline, maybe into a loop that calls other functions, thus the compiler would want to give it inputs in call-preserved registers. If you were writing a stand-alone function that the compiler always has to call, all you get from inline asm instead of stand-alone is the compiler handling calling-convention differences and C++ name-mangling.
Or maybe the surrounding code uses some instructions that require fixed registers, like cl for shift counts or RDX:RAX for div.
when should I use the clobber list? ...
such as syscall instruction requires its parameter should be located in register rdi rsi rdx r10 r8 r9??
Normally you'd use input constraints instead, so only the syscall instruction itself is inside the inline asm. But syscall (the instruction itself) clobbers RCX and R11, so system calls made using it unavoidably destroy user-space's RCX and R11. There's no point using dummy outputs for these, unless you have a use for the return address (RCX) or RFLAGS (R11). So yes, clobbers are useful here.
#include <stddef.h>
#include <asm/unistd.h>
// the compiler will emit all the necessary MOV instructions
//static inline
size_t sys_write(int fd, const char *buf, size_t len) {
    size_t retval;
    asm volatile("syscall"
        : "=a"(retval)                  // EDI     RSI       RDX
        : "a"(__NR_write), "D"(fd), "S"(buf), "d"(len)
        , "m"(*(char (*)[len]) buf)     // dummy memory input: the asm statement reads this memory
        : "rcx", "r11"                  // clobbered by syscall
        // , "memory"                   // would be needed if we didn't use a dummy memory input
    );
    return retval;
}
A non-inline version of this compiles as follows (with gcc -O3 on the Godbolt compiler explorer), because the function-calling convention nearly matches the system-call convention:
sys_write(int, char const*, unsigned long):
        movl    $1, %eax
        syscall
        ret
It would have been really silly to use clobbers on any of the input registers and put a mov inside the asm:
size_t dumb_sys_write(int fd, const char *buf, size_t len) {
    size_t retval;
    asm volatile(
        "mov %[fd], %%edi\n\t"
        "mov %[buf], %%rsi\n\t"
        "mov %[len], %%rdx\n\t"
        "syscall"
        : "=a"(retval)                  // EDI     RSI       RDX
        : "a"(__NR_write), [fd]"r"(fd), [buf]"r"(buf), [len]"r"(len)
        , "m"(*(char (*)[len]) buf)     // dummy memory input: the asm statement reads this memory
        : "rdi", "rsi", "rdx", "rcx", "r11"
        // , "memory"                   // would be needed if we didn't use a dummy memory input
    );
    // if(retval > -4096ULL) errno = -retval;
    return retval;
}
dumb_sys_write(int, char const*, unsigned long):
        movl    %edi, %r9d
        movq    %rsi, %r8
        movq    %rdx, %r10
        movl    $1, %eax        # compiler generated before this
        # from inline asm
        mov %r9d, %edi
        mov %r8, %rsi
        mov %r10, %rdx
        syscall
        # end of inline asm
        ret
And besides that, you're not letting the compiler take advantage of the fact that syscall doesn't clobber any of its input registers. The compiler might well still want len in a register, and using a pure input constraint lets it know that the value will still be there afterwards.
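The commented-out errno check in dumb_sys_write hints at how a raw Linux return value is usually turned into the familiar -1 / errno convention. A small sketch of that step, with the hypothetical helper name syscall_ret and assuming the Linux convention that errors come back as values in the -4095..-1 range:

```c
#include <errno.h>

// Sketch: convert a raw Linux system-call return value into the libc-style
// convention (return -1 and set errno on failure).
static long syscall_ret(unsigned long raw)
{
    if (raw > -4096UL) {          // i.e. raw is in [-4095, -1] viewed as signed
        errno = (int)-raw;        // the error code is the negated return value
        return -1;
    }
    return (long)raw;             // success: pass the value through unchanged
}
```

A wrapper such as sys_write could return syscall_ret(retval) if callers expect libc-like behaviour; keeping the raw value is equally valid and avoids touching errno.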
You might also use clobbers if you're using any instructions that implicitly use certain registers, but neither the input nor output of those instructions is a direct input or output of the asm statement. That would be rare, though, unless you're writing a whole loop or large block of code in inline asm.
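As one small instance of that situation, cpuid always writes EAX, EBX, ECX and EDX, even when the caller only wants one of them. In this sketch (the function name is made up for the example) only EAX is an operand, and the other three registers are declared as clobbers.

```c
// Sketch: query the highest supported basic CPUID leaf (leaf 0 returns it in
// EAX). EBX/ECX/EDX receive the vendor string, which this function ignores,
// so they are clobbers rather than outputs.
static unsigned cpuid_max_basic_leaf(void)
{
    unsigned eax = 0;                         // leaf number in, max leaf out
    __asm__ volatile("cpuid"
        : "+a"(eax)
        :
        : "ebx", "ecx", "edx");
    return eax;
}
```

If the vendor string were wanted too, those registers would become "=b", "=c" and "=d" outputs instead of clobbers.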
Or maybe if you're wrapping a call instruction. (It's hard to do this safely, especially because of the red-zone, but people do try to do this). You don't get to choose which registers the code clobbers, so you just tell the compiler about it.
Acceptability of regular usage of r10 and r11
Both r10 and r11 are call-clobbered registers, aka volatile; you can overwrite them without saving/restoring in any leaf or non-leaf function. That's what C compilers do, and expect from functions their code calls, because that's what the ABI doc says: What registers are preserved through a linux x86-64 function call
Your caller will expect them to hold garbage after return. Just like arg-passing registers such as RDI or RCX. (And RDX if it's not part of a wide RDX:RAX return value.)
The x86-64 System V ABI doesn't name its calling convention "cdecl". It's just the x86-64 SysV calling convention. The string "cdecl" doesn't appear in the ABI doc.
r11 is a temporary, aka call-clobbered register. r11 is never used for passing or returning anything, so it's safe even for wrapper / trampoline / hook functions to clobber it even if they want to forward all args and return all return values, unlike any other register. For example, lazy dynamic linker code.
r10 is also call-clobbered. The ABI says "used for passing a function’s static chain pointer". In languages that use nested functions, this is an extra incoming arg to such functions so they can find the local vars of the outer scope. A pointer-to-nested-function needs a code pointer and a static chain pointer for the caller to pass when dereferencing.
It's a "chain" because there can be multiple levels of nesting with their stack frames forming a linked list, and "static" because it's based on lexical scope, not the call stack. (Thanks @Raymond Chen for explaining the terminology.)
But like all arg-passing registers to function calls (not system calls) in x86-64 System V, it's call-clobbered (like in most calling conventions generally). You only need to worry about this usage of r10 if you're hooking or wrapping nested functions, ones defined inside another function. If you're just writing a function that's called normally, it's a pure temporary.
GCC does use r10 as part of its trampoline for function pointers to GNU C nested functions, for a pointer to the stack frame of the outer scope. The trampoline of machine code on the stack is a hack, but this is indeed a static chain pointer; languages with proper support for nested functions (unlike C and C++) would probably have the caller aware of it (like a lambda / closure) and passing a value in r10 when using a pointer to a nested function.
In a normal function, RBX, RBP, and RSP are call-preserved, along with R12..R15. All others can be clobbered without saving/restoring. (That includes xmm/ymm0..15 and zmm0..31 / k0..7, and the mmx/x87 stack, and the condition codes in RFLAGS).
Note that r8..15 need a REX prefix, even with 32-bit operand-size (like xor r10d, r10d). If you have some 64-bit non-pointer integers, then sure keep them in r8..r11 because you always need a REX prefix for 64-bit operand-size any time you use those values anyway.
Smaller code-size is usually not worse, and sometimes helps with decode and uop-cache density, and L1i cache density. RAX, RCX, RDX, RSI, RDI should be your first choices for scratch regs. (And use 32-bit operand-size unless you need 64-bit. e.g. xor eax,eax is the correct way to zero RAX. Silvermont doesn't recognize xor r10,r10 as a zeroing idiom, so use xor r10d,r10d even though it doesn't save code size.)
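To make the size difference concrete, here are the machine-code bytes for the four zeroing forms mentioned above, written out as C arrays. The encodings are hand-assembled for illustration (an assembler may pick an equivalent alternative opcode), and has_rex_prefix is a hypothetical helper that only checks for a leading REX byte (0x40..0x4F in 64-bit mode).

```c
#include <stddef.h>

// Hand-assembled encodings, for illustration of the REX-prefix cost.
static const unsigned char xor_eax_eax[]   = { 0x31, 0xC0 };        // xor eax, eax
static const unsigned char xor_rax_rax[]   = { 0x48, 0x31, 0xC0 };  // xor rax, rax   (REX.W)
static const unsigned char xor_r10d_r10d[] = { 0x45, 0x31, 0xD2 };  // xor r10d, r10d (REX.R + REX.B)
static const unsigned char xor_r10_r10[]   = { 0x4D, 0x31, 0xD2 };  // xor r10, r10   (REX.W + R + B)

// Returns 1 if the first byte is a REX prefix in 64-bit mode.
static int has_rex_prefix(const unsigned char *insn)
{
    return (insn[0] & 0xF0) == 0x40;
}
```

Only the eax form avoids the prefix; the r10d and r10 forms are the same length, which is why the Silvermont note above is about the zeroing idiom rather than about code size.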