Why do x86-64 Linux system calls modify RCX, and what does the value mean?
The system call return value is in rax, as always. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64.
Note that sys_brk has a slightly different interface than the brk / sbrk POSIX functions; see the C library/kernel differences section of the Linux brk(2) man page. Specifically, Linux sys_brk sets the program break; the arg and return value are both pointers. See Assembly x86 brk() call use. That answer needs upvotes because it's the only good one on that question.
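To make the pointer-in, pointer-out interface of the raw call concrete, here is a minimal sketch for x86-64 Linux with GCC or Clang. The helper names (raw_syscall1, raw_brk, current_break) are made up for illustration; only the SYS_brk number comes from the system headers.

```c
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_brk */

/* Hypothetical helper: one-argument raw system call, bypassing libc. */
static long raw_syscall1(long nr, long a1)
{
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1)
                     : "rcx", "r11", "memory");
    return ret;
}

/* Raw sys_brk: takes the requested break address and returns the
   resulting break. If the request cannot be honored, the kernel
   returns the unchanged current break rather than -1. */
static uintptr_t raw_brk(uintptr_t requested)
{
    return (uintptr_t)raw_syscall1(SYS_brk, (long)requested);
}

/* Address 0 is never a valid break, so this request always "fails"
   and therefore reports the current program break. */
static uintptr_t current_break(void)
{
    return raw_brk(0);
}
```

Calling current_break() is the raw-syscall equivalent of sbrk(0) in libc; the POSIX brk() wrapper instead turns the result into 0 / -1 plus errno.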
The other interesting part of your question is:
I do not quite understand the value in the rcx register in this case
You're seeing the mechanics of how the syscall / sysret instructions are designed to allow the kernel to resume user-space execution but still be fast.
syscall doesn't do any loads or stores, it only modifies registers. Instead of using special registers to save a return address, it simply uses regular integer registers.
It's not a coincidence that RCX=RIP and R11=RFLAGS after the kernel returns to your user-space code. The only way for this not to be the case is if a ptrace system call modified the process's saved rcx or r11 value while it was inside the kernel. (ptrace is the system call gdb uses). In that case, Linux would use iret instead of sysret to return to user space, because the slower general-case iret can do that. (See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for some walk-through of Linux's system-call entry points. Mostly the entry points from 32-bit processes, not from syscall in a 64-bit process, though.)
Instead of pushing a return address onto the kernel stack (like int 0x80 does), syscall:
- sets RCX=RIP, R11=RFLAGS (so it's impossible for the kernel to even see the original values of those regs before you executed syscall).
- masks RFLAGS with a pre-configured mask from a config register (the IA32_FMASK MSR). This lets the kernel disable interrupts (IF) until it's done swapgs and setting rsp to point to the kernel stack. Even with cli as the first instruction at the entry point, there'd be a window of vulnerability. You also get cld for free by masking off DF so rep movs / stos go upward even if user-space had used std.
  Fun fact: AMD's first proposed syscall / swapgs design didn't mask RFLAGS, but they changed it after feedback from kernel developers on the amd64 mailing list (in ~2000, a couple years before the first silicon).
- jumps to the configured syscall entry point (setting RIP from the IA32_LSTAR MSR; the new CS and SS selectors come from IA32_STAR). The old CS value isn't saved anywhere, I think.
It doesn't do anything else: the kernel has to use swapgs to get access to an info block where it saved the kernel stack pointer, because rsp still has its value from user-space.
So the design of syscall requires a system-call ABI that clobbers registers, and that's why the values are what they are.
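You can watch this from user space. The sketch below (x86-64 Linux, GCC or Clang, AT&T-syntax inline asm) makes a getpid system call and captures RCX and R11 right after it, along with the address of the instruction following syscall. The struct and function names are hypothetical, chosen only for this illustration.

```c
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_getpid */

struct after_syscall {
    long     rax;       /* return value (the PID here) */
    uint64_t rcx;       /* what the syscall instruction left in RCX */
    uint64_t r11;       /* what the syscall instruction left in R11 */
    uint64_t next_rip;  /* address of the instruction after syscall */
};

static struct after_syscall capture_after_getpid(void)
{
    struct after_syscall out;
    long rax = SYS_getpid;
    uint64_t rcx, next_rip;
    register uint64_t r11 __asm__("r11");   /* pin this variable to R11 */

    __asm__ volatile(
        "syscall\n"
        "1:\n\t"                            /* label = return address */
        "lea 1b(%%rip), %[next]"            /* load that address for comparison */
        : "+a"(rax), "=c"(rcx), "=r"(r11), [next] "=&r"(next_rip)
        :
        : "memory");

    out.rax = rax;
    out.rcx = rcx;
    out.r11 = r11;
    out.next_rip = next_rip;
    return out;
}
```

On a normal (untraced) run, out.rcx should equal out.next_rip, and out.r11 should look like an RFLAGS value (bit 1, which is architecturally always set, will be 1).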
When does Linux x86-64 syscall clobber %r8, %r9 and %r10?
Only 32-bit system calls (e.g. via int 0x80) in 64-bit mode step on those registers, along with R11. (What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?).
syscall properly saves/restores all regs including R8, R9, and R10, so user-space using it can assume they keep their values, except the RAX return value. (The kernel's syscall entry point even saves RCX and R11, but at that point they've already been overwritten by the syscall instruction itself with the original RIP and before-masking RFLAGS value.)
Those, with R11, are the non-legacy registers that are call-clobbered in the function-calling convention, so compiler-generated code for C functions inside the kernel naturally preserves R12-R15, even if an asm entry point didn't save them.
Currently the 64-bit int 0x80 entry point just pushes 0 for the call-clobbered R8-R11 registers in the process-state struct that it will restore from before returning to user space, instead of the original register values.
Historically, the int 0x80 entry point from 32-bit user-space didn't save/restore those registers at all. So their values were whatever compiler-generated kernel code left sitting around. This was thought to be innocent because 32-bit mode can't read those registers, until it was realized that user-space can far-jump to 64-bit mode, using the same CS value that the kernel uses for normal 64-bit user-space processes, selecting that system-wide GDT entry. So there was an actual info leak of kernel data, which was fixed by zeroing those registers.
IDK whether there used to be or still is a separate entry point from 64-bit user-space vs. 32-bit, or how they differ in struct pt_regs layout. The historical situation where int 0x80 leaked r8..r11 wouldn't have made sense for 64-bit user-space; that leak would have been obvious. So if they're unified now, they must not have been in the past.
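The claim that the 64-bit syscall path leaves R8, R9 and R10 alone is easy to check. This is a small sketch for x86-64 Linux with GCC or Clang; the struct and function names are invented for the example, and getpid is used only because it takes no arguments, so the register contents are irrelevant to the kernel.

```c
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_getpid */

struct arg_regs { uint64_t r8, r9, r10; };

/* Put three caller-chosen values in R8, R9 and R10, make a system
   call with the syscall instruction, and read the registers back. */
static struct arg_regs regs_after_getpid(uint64_t a, uint64_t b, uint64_t c)
{
    register uint64_t r8  __asm__("r8")  = a;
    register uint64_t r9  __asm__("r9")  = b;
    register uint64_t r10 __asm__("r10") = c;
    long rax = SYS_getpid;

    __asm__ volatile("syscall"
                     : "+a"(rax), "+r"(r8), "+r"(r9), "+r"(r10)
                     :
                     : "rcx", "r11", "memory");   /* the only real clobbers */

    struct arg_regs out = { r8, r9, r10 };
    return out;
}
```

With the 64-bit syscall ABI the three values come back unchanged; the same experiment through int 0x80 from 64-bit code would not be expected to preserve them, per the text above.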
Why do x86-64 Linux system calls work with 6 registers set?
System calls accept up to 6 arguments, passed in registers (almost the same registers as the SysV x64 C ABI, with r10 replacing rcx, but they are callee-preserved in the syscall case), and "extra" arguments are simply ignored.
Some specific answers to your questions below.
The src/internal/x86_64/syscall.s is just a "thunk" which shifts all the arguments into the right place. That is, it converts from a C-ABI function which takes the syscall number and 6 more arguments, into a "syscall ABI" function with the same 6 arguments and the syscall number in rax. It works "just fine" for any number of arguments - the additional register movement will simply be ignored by the syscall if those arguments aren't used.
Since in the C-ABI all the argument registers are considered scratch (i.e., caller-save), clobbering them is harmless if you assume this __syscall method is called from C. In fact the kernel makes stronger guarantees about clobbered registers, clobbering only rcx and r11, so assuming the C calling convention is safe but pessimistic. In particular, the code calling __syscall as implemented here will unnecessarily save any argument and scratch registers per the C ABI, despite the kernel's promise to preserve them.
The arch/x86_64/syscall_arch.h file is pretty much the same thing, but in a C header file. Here, you want all seven versions (for zero to six arguments) because modern C compilers will warn or error if you call a function with the wrong number of arguments. So there is no real option to have "one function to rule them all" as in the assembly case. This also has the advantage of doing less work for syscalls that take fewer than 6 arguments.
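For reference, here is roughly what the zero-argument and six-argument variants of such a header could look like. This is a sketch in the same spirit, not a copy of the file discussed above; the names raw_syscall0 and raw_syscall6 are hypothetical. It assumes x86-64 Linux and GCC or Clang.

```c
#include <sys/syscall.h>   /* SYS_getpid and friends */

/* Zero-argument variant: only the syscall number in RAX. */
static long raw_syscall0(long nr)
{
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr)
                     : "rcx", "r11", "memory");
    return ret;
}

/* Six-argument variant: args go in RDI, RSI, RDX, R10, R8, R9.
   Note R10 (not RCX) carries the 4th argument. */
static long raw_syscall6(long nr, long a1, long a2, long a3,
                         long a4, long a5, long a6)
{
    register long r10 __asm__("r10") = a4;
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3),
                       "r"(r10), "r"(r8), "r"(r9)
                     : "rcx", "r11", "memory");
    return ret;
}
```

Calling raw_syscall6(SYS_getpid, 1, 2, 3, 4, 5, 6) returns the same PID as raw_syscall0(SYS_getpid): the kernel never looks at argument registers that the particular system call doesn't use. The clobber list names only rcx and r11 (plus memory), which is the stronger guarantee the kernel gives compared with the C calling convention.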
Your listed questions, answered:
- Why can I pass more parameters than the system call takes?
Because the calling convention is mostly register-based and caller cleanup. You can always pass more arguments in this situation (including in the C ABI) and the other arguments will simply be ignored by the callee. Since the syscall mechanism is generic at the C and .asm level, there is no real way the compiler can ensure you are passing the right number of arguments - you need to pass the right syscall id and the right number of arguments. If you pass fewer, the kernel will see garbage, and if you pass more, they will be ignored.
- Is this reasonable, documented behavior?
Yes, sure - because the whole syscall mechanism is a "generic gate" into the kernel. 99% of the time you aren't going to use that: glibc wraps the vast majority of interesting syscalls in C ABI wrappers with the correct signature so you don't have to worry about it. Those are the ways that syscall access happens safely.
- What am I supposed to set the unused registers to?
You don't set them to anything. If you use the C prototypes in arch/x86_64/syscall_arch.h the compiler just takes care of it for you (it doesn't set them to anything) and if you are writing your own asm, you don't set them to anything (and you should assume they are clobbered after the syscall).
- What will the kernel do with the registers it doesn't use?
It is free to use all the registers it wants, but will adhere to the kernel calling convention which is that on x86-64 all registers other than rax, rcx and r11 are preserved (which is why you see rcx and r11 in the clobber list in the C inline asm).
- Is the seven function approach faster by virtue of having less instructions?
Yes, but the difference is very small since reg-reg mov instructions usually have zero latency and high throughput (up to 4/cycle) on recent Intel architectures. So moving an extra 6 registers perhaps takes something like 1.5 cycles for a syscall that is usually going to take at least 50 cycles even if it does nothing. So the impact is small, but probably measurable (if you measure very carefully!).
- What happens to the other registers in those functions?
I'm not sure what you mean exactly, but the kernel can use the other registers just like any GP registers, as long as it preserves their values (e.g., by pushing them on the stack and then popping them later).
Win64 and Linux-x86_64 Calling Convention Unused registers modified or not
A register's status as call-preserved or call-clobbered never depends on the number of args actually passed by the caller and/or expected by the callee, in any calling convention for any ISA I've looked at, and certainly not any of the standard ones on x86.
But yes the calling conventions for raw system calls are different from those for functions, even for presumably thin wrapper functions.
All standard user-space function calling conventions have all the arg-passing registers (and stack slots) as call-clobbered. So if your asm uses call, that's what you need to expect.
The system-calling conventions on mainstream OSes preserve all registers (except the return value). (But on x86-64, only after syscall itself overwrites RCX and R11, because that happens before the kernel gets control.) If you directly use syscall or int 0x80 or whatever, that's what you should expect.
Note that Windows does not have a stable system-call ABI across kernel versions and doesn't document the raw system calls, so in normal Windows code you're always making DLL function calls, never raw system calls. People have reverse-engineered the system calls for different Windows versions, though.
MacOS also doesn't officially have a stable/documented syscall ABI, but in practice Darwin basically does, at least for the normal POSIX open/read/write/close/exit calls that toy programs use.
- What registers are preserved through a linux x86-64 function call
- What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64
- https://packagecloud.io/blog/the-definitive-guide-to-linux-system-calls/
- Where is the x86-64 System V ABI documented?
- Windows system calls
Why Assembly x86_64 syscall parameters are not in alphabetical order like i386
The x86-64 System V ABI was designed to minimize instruction-count (and to some degree code-size) in SPECint as compiled by the version of gcc that was current before the first AMD64 CPUs were sold. See this answer for some history and list-archive links.
Since 5 minutes before I thought all registers were the same but they were used differently because of a convention. Now all things changed for me
x86-64 is not fully orthogonal. Some instructions implicitly use specific registers. e.g. push implicitly uses rsp as the stack pointer, shl edx, cl is only usable with a shift count in cl (until BMI2 shlx).
More rarely used: widening mul rdi does rdx:rax = rax*rdi. The rep-string instructions implicitly use RDI, RSI, and RCX, although they're often not worth using.
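These implicit operands show up directly as register constraints when you wrap the instructions in C inline asm. A small sketch for x86-64 with GCC or Clang (AT&T syntax); the type and function names are made up for this example.

```c
#include <stdint.h>

struct u128 { uint64_t hi, lo; };

/* One-operand mul: the other factor must be in RAX and the 128-bit
   product always lands in RDX:RAX, hence the "a" and "d" constraints. */
static struct u128 widening_mul(uint64_t a, uint64_t b)
{
    struct u128 p;
    __asm__("mulq %[b]"
            : "=a"(p.lo), "=d"(p.hi)
            : "a"(a), [b] "r"(b)
            : "cc");
    return p;
}

/* Legacy variable-count shift: the count must be in CL, hence "c". */
static uint64_t shl_by_cl(uint64_t value, uint8_t count)
{
    __asm__("shlq %%cl, %0"
            : "+r"(value)
            : "c"(count)
            : "cc");
    return value;
}
```

The compiler is free to pick any register for the "r" operands, but it has no choice for the "a", "d" and "c" ones, which is exactly the kind of pressure on RCX and RDX that a calling convention has to take into account.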
It turns out that choosing the arg-passing registers so that functions that passed their args to memcpy could inline it as rep movs was useful in the metric Jan Hubicka was using, thus rdi and rsi were chosen as the first two args. But leaving rcx unused until the 4th arg was better, because cl is needed for variable-count shifts. (And most functions don't happen to use their 3rd arg as a shift count.) (Probably older GCC versions inlined memcpy or memset as rep movs more aggressively; it's usually not worth it vs. SIMD for small arrays these days.)
The x86-64 System V ABI uses almost the same calling convention for functions as it does for system calls. This is not a coincidence: it means the implementation for a libc wrapper function like mmap can be:
mmap:
    mov   r10, rcx          ; syscall destroys rcx and r11; 4th arg passed in r10 for syscalls
    mov   eax, __NR_mmap
    syscall
    cmp   rax, -4096
    ja    .set_errno_and_stuff
    ret
This is a tiny advantage, but there's really no reason not to do this. It also saves a few instructions inside the kernel setting up the arg-passing registers before dispatching to the C implementation of the system call in the kernel. (See this answer for a look at some kernel side of system call handling. Mostly about the int 0x80 handler, but I think I mentioned the 64-bit syscall handler and that it dispatches to a table of functions directly from asm.)
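The cmp rax, -4096 / ja pair in that wrapper is the usual test for an error return: treated as unsigned, only the values -4095 through -1 compare above -4096, and those are negated errno codes. A C sketch of the same check follows, for x86-64 Linux with GCC or Clang; the names raw_syscall1, to_libc_result and close_fd are hypothetical stand-ins for whatever .set_errno_and_stuff does in a real libc.

```c
#include <errno.h>
#include <sys/syscall.h>   /* SYS_close */

/* Hypothetical helper: one-argument raw system call. */
static long raw_syscall1(long nr, long a1)
{
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1)
                     : "rcx", "r11", "memory");
    return ret;
}

/* Same comparison as `cmp rax, -4096` / `ja`: an unsigned compare
   that is true only for raw values in the range -4095 .. -1. */
static long to_libc_result(long raw)
{
    if ((unsigned long)raw > (unsigned long)-4096L) {
        errno = (int)-raw;   /* raw is -errno */
        return -1;
    }
    return raw;              /* success: pass the value through */
}

/* Example wrapper: close() built on the raw call. */
static long close_fd(int fd)
{
    return to_libc_result(raw_syscall1(SYS_close, fd));
}
```

For instance, close_fd(-1) returns -1 and leaves EBADF in errno, while a large valid return value such as a high mmap address is passed through untouched because it is not in the top 4095 values of the address space.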
The syscall instruction itself destroys RCX and R11 (to save user-space RIP and RFLAGS without needing microcode to set up the kernel stack) so the conventions can't be identical unless the user-space convention avoided RCX and R11. But RCX is a handy register whose low half can be used without a REX prefix so that probably would have been worse than leaving it as a call-clobbered pure scratch like R11. Also, the user-space convention uses R10 as a "static chain" pointer for languages with first-class nested functions (not C/C++).
Having the first 4 args able to avoid a REX prefix is probably best for overall code-size, and using RBX or RBP instead of RCX would be weird. Having a couple call-preserved registers that don't need a REX prefix (EBX/EBP) is good.
See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 for the function-call and system-call conventions.
The i386 system call convention is the clunky and inconvenient one: ebx is call-preserved, so almost every syscall wrapper needs to save/restore ebx, except for calls with no args like getpid. (And for that you don't even need to enter the kernel, just call into the vDSO: see The Definitive Guide to Linux System Calls (on x86) for more about vDSO and tons of other stuff.)
But the i386 function-calling convention passes all args on the stack, so glibc wrapper functions still need to mov every arg anyway.
Also note that the "natural" order of x86 registers is EAX, ECX, EDX, EBX, according to their numeric codes in machine code, and also the order that pusha / popa use. See Why are first four x86 GPRs named in such unintuitive order?.