Linux X64: Why Does R10 Come Before R8 and R9 in Syscalls

When does Linux x86-64 syscall clobber %r8, %r9 and %r10?

Only 32-bit system calls (e.g. via int 0x80) in 64-bit mode step on those registers, along with R11. (What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?).

syscall properly saves/restores all regs including R8, R9, and R10, so user-space using it can assume they keep their values, except the RAX return value. (The kernel's syscall entry point even saves RCX and R11, but at that point they've already been overwritten by the syscall instruction itself with the original RIP and before-masking RFLAGS value.)
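A quick way to convince yourself of this is to park sentinel values in R8, R9 and R10 around a syscall and check them afterwards. This is only a sketch, assuming x86-64 Linux and GCC or Clang (GNU C inline asm); the function name and the sentinel values are made up for illustration, and 39 is __NR_getpid in the 64-bit table.

```c
#include <stdint.h>

/* Returns 1 if r8, r9 and r10 still hold their sentinels after a syscall.
   Hypothetical helper, not from any library. */
static int regs_survive_syscall(void)
{
    register uint64_t r8  __asm__("r8")  = 0x1111;
    register uint64_t r9  __asm__("r9")  = 0x2222;
    register uint64_t r10 __asm__("r10") = 0x3333;
    uint64_t rax = 39;                 /* __NR_getpid: takes no arguments */

    __asm__ volatile("syscall"
                     : "+a"(rax), "+r"(r8), "+r"(r9), "+r"(r10)
                     :
                     : "rcx", "r11", "memory"); /* the only regs syscall itself destroys */

    return r8 == 0x1111 && r9 == 0x2222 && r10 == 0x3333;
}
```

Only RAX (return value), RCX and R11 appear as modified, which matches the clobber list you see in hand-written syscall wrappers.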


R8-R10, along with R11, are the non-legacy registers that are call-clobbered in the function-calling convention, so compiler-generated code for C functions inside the kernel naturally preserves R12-R15, even if an asm entry point didn't save them.

Currently the 64-bit int 0x80 entry point just pushes 0 for the call-clobbered R8-R11 registers in the process-state struct that it will restore from before returning to user space, instead of the original register values.

Historically, the int 0x80 entry point from 32-bit user-space didn't save/restore those registers at all. So their values were whatever compiler-generated kernel code left sitting around. This was thought to be innocent because 32-bit mode can't read those registers, until it was realized that user-space can far-jump to 64-bit mode, using the same CS value that the kernel uses for normal 64-bit user-space processes, selecting that system-wide GDT entry. So there was an actual info leak of kernel data, which was fixed by zeroing those registers.

IDK whether there used to be or still is a separate entry point from 64-bit user-space vs. 32-bit, or how they differ in struct pt_regs layout. The historical situation where int 0x80 leaked r8..r11 wouldn't have made sense for 64-bit user-space; that leak would have been obvious. So if they're unified now, they must not have been in the past.

Why is RCX not used for passing parameters to system calls, being replaced with R10?

x86-64 system calls use the syscall instruction. This instruction saves the return address in rcx and then loads rip from the IA32_LSTAR MSR, so rcx is destroyed as soon as syscall executes. That's why rcx had to be replaced in the system-call ABI.

The same syscall instruction also saves rflags into r11 and then masks rflags using the IA32_FMASK MSR. This is why the kernel can't preserve the caller's r11.

So these changes reflect how the syscall mechanism works: the kernel is forced to declare rcx and r11 as not preserved, and can't even use them for parameter passing.
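To make the r10-for-rcx substitution concrete, here is a rough sketch of a 4-argument raw syscall wrapper in GNU C inline asm, assuming x86-64 Linux. The names raw_syscall4 and query_sigmask are hypothetical. It uses rt_sigprocmask (call number 14), whose 4th argument is the sigset size: the kernel returns -EINVAL unless that is 8, so the result shows whether the value placed in r10 really arrived as arg 4.

```c
#include <stdint.h>

/* Hypothetical wrapper: args 1-3 go in rdi, rsi, rdx as in a function call,
   but the 4th goes in r10 because syscall overwrites rcx with the return RIP. */
static long raw_syscall4(long nr, long a1, long a2, long a3, long a4)
{
    register long r10 __asm__("r10") = a4;
    long ret;

    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3), "r"(r10)
                     : "rcx", "r11", "memory");
    return ret;                        /* raw kernel result: negative errno on error */
}

/* rt_sigprocmask(SIG_BLOCK = 0, NULL, old, sigsetsize): only reads the mask. */
static long query_sigmask(uint64_t *old, long sigsetsize)
{
    return raw_syscall4(14, 0, 0, (long)old, sigsetsize);
}
```

With sigsetsize = 8 this should return 0; with any other size it should return -22 (-EINVAL), which would not happen if the 4th argument had been left in rcx.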

Reference: Intel's Instruction Set Reference, look for SYSCALL.

Why do x86-64 Linux system calls modify RCX, and what does the value mean?

The system call return value is in rax, as always. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64.

Note that sys_brk has a slightly different interface than the brk / sbrk POSIX functions; see the C library/kernel differences section of the Linux brk(2) man page. Specifically, Linux sys_brk sets the program break; the arg and return value are both pointers. See Assembly x86 brk() call use. That answer needs upvotes because it's the only good one on that question.


The other interesting part of your question is:

I do not quite understand the value in the rcx register in this case

You're seeing the mechanics of how the syscall / sysret instructions are designed to allow the kernel to resume user-space execution but still be fast.

syscall doesn't do any loads or stores, it only modifies registers. Instead of using special registers to save a return address, it simply uses regular integer registers.

It's not a coincidence that RCX=RIP and R11=RFLAGS after the kernel returns to your user-space code. The only way for this not to be the case is if a ptrace system call modified the process's saved rcx or r11 value while it was inside the kernel. (ptrace is the system call gdb uses). In that case, Linux would use iret instead of sysret to return to user space, because the slower general-case iret can do that. (See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for some walk-through of Linux's system-call entry points. Mostly the entry points from 32-bit processes, not from syscall in a 64-bit process, though.)


Instead of pushing a return address onto the kernel stack (like int 0x80 does), syscall:

  • sets RCX=RIP, R11=RFLAGS (so it's impossible for the kernel to even see the original values of those regs before you executed syscall).

  • masks RFLAGS with a pre-configured mask from a config register (the IA32_FMASK MSR). This lets the kernel disable interrupts (IF) until it's done swapgs and setting rsp to point to the kernel stack. Even with cli as the first instruction at the entry point, there'd be a window of vulnerability. You also get cld for free by masking off DF so rep movs / stos go upward even if user-space had used std.

    Fun fact: AMD's first proposed syscall / swapgs design didn't mask RFLAGS, but they changed it after feedback from kernel developers on the amd64 mailing list (in ~2000, a couple years before the first silicon).

  • jumps to the configured syscall entry point (setting RIP from the IA32_LSTAR MSR, with CS and SS loaded from fixed selectors in the IA32_STAR MSR). The old CS value isn't saved anywhere, I think.

  • doesn't do anything else: the kernel has to use swapgs to get access to an info block where it saved the kernel stack pointer, because rsp still has its value from user-space.

So the design of syscall requires a system-call ABI that clobbers registers, and that's why the values are what they are.
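You can observe those leftover values from user-space. The following is a sketch under the assumption of x86-64 Linux with GCC or Clang, and with nothing like ptrace rewriting the saved registers; the function name is invented. It places a local label right after syscall and compares its address with RCX, then checks that R11 looks like a user-space RFLAGS image (reserved bit 1 and IF both set).

```c
#include <stdint.h>

/* Returns 1 if RCX holds the address of the instruction after `syscall`
   and R11 holds a plausible saved RFLAGS. Hypothetical helper. */
static int syscall_leaves_rip_and_rflags(void)
{
    uint64_t rax = 39;                     /* __NR_getpid */
    uint64_t rcx, next;
    register uint64_t r11 __asm__("r11");

    __asm__ volatile("syscall\n\t"
                     "1:\n\t"              /* first instruction after syscall */
                     "lea 1b(%%rip), %[next]"
                     : "+a"(rax), "=c"(rcx), "=r"(r11), [next] "=r"(next)
                     :
                     : "memory");

    /* 0x002 = always-one reserved bit, 0x200 = IF (interrupts enabled) */
    return rcx == next && (r11 & 0x202) == 0x202;
}
```

The lea doesn't touch RCX, R11 or the flags, so what the function inspects is exactly what the kernel's sysret path left behind.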

Acceptability of regular usage of r10 and r11

The x86-64 System V ABI doesn't call its calling convention "cdecl". It's just the x86-64 SysV calling convention. The string "cdecl" doesn't appear in the ABI doc.

r11 is a temporary, aka call-clobbered register.

r10 is also a call-clobbered register. The ABI says "used for passing a function’s static chain pointer", but C doesn't use this and code generated by gcc and clang does freely use r10 without saving/restoring it. The ABI's table of register usage lists r10 as not preserved across function calls so a leaf function can always clobber it. (Which registers to use as temporaries when writing AMD64 SysV assembly?)

gcc does use r10 as part of its trampoline for function pointers to GNU C nested functions, for a pointer to the stack frame of the outer scope. The trampoline of machine code on the stack is a hack, but this is indeed a static chain pointer; languages with proper support for nested functions would probably have the caller aware of it (like a lambda / closure) and passing a value in r10 when using a pointer to a nested function.

Non-leaf functions do not need to pass on their incoming r10 to their children unless they're "nested functions" in a language that supports that sort of thing (not C or C++). Therefore r10 is also a pure temporary in normal circumstances.


r10 and r11 are not arg-passing registers, unlike the other call-clobbered registers, so "wrapper" functions can use them (especially r11) without saving/restoring anything.
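In GNU C inline asm this shows up as nothing more than a clobber list. A toy sketch (the function name and the arithmetic are made up purely for illustration, and a compiler would of course do this without asm): r10 and r11 are used as scratch and merely declared clobbered, with no push/pop around them.

```c
#include <stdint.h>

/* Computes a * b + c using r10 / r11 as temporaries. Hypothetical example. */
static uint64_t mul_add_scratch(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t out;

    __asm__("mov  %[a], %%r10\n\t"
            "mov  %[b], %%r11\n\t"
            "imul %%r11, %%r10\n\t"       /* r10 = r10 * r11 */
            "add  %[c], %%r10\n\t"        /* r10 += c */
            "mov  %%r10, %[out]"
            : [out] "=r"(out)
            : [a] "r"(a), [b] "r"(b), [c] "r"(c)
            : "r10", "r11", "cc");        /* call-clobbered: nothing to save or restore */
    return out;
}
```

Because neither register carries an argument, the compiler never has to move an incoming value out of the way, which is the same property that makes them convenient in wrapper functions.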

In a normal function, RBX, RBP, and RSP are call-preserved, along with R12..R15. All others can be clobbered without saving/restoring. (That includes xmm/ymm0..15 and zmm0..31, and the x87 stack, and the condition codes in RFLAGS).


Note that r8..15 need a REX prefix, even with 32-bit operand-size (like xor r10d, r10d). If you have some 64-bit non-pointer integers, then sure keep them in r8..r11 because you always need a REX prefix for 64-bit operand-size any time you use those values anyway.

Smaller code-size is usually not worse, and sometimes helps with decode and uop-cache density, and L1i cache density. RAX, RCX, RDX, RSI, RDI should be your first choices for scratch regs. (And use 32-bit operand-size unless you need 64-bit. e.g. xor eax,eax is the correct way to zero RAX. Silvermont doesn't recognize xor r10,r10 as a zeroing idiom, so use xor r10d,r10d even though it doesn't save code size.)

If you do run out of low registers, ideally use r10 / r11 for things that will normally be used with 64-bit operand-size (or VEX prefixes) anyway. e.g. pointers to 64-bit data or pointers to pointers. mov eax, [r10] needs a REX prefix while mov eax, [rdi] doesn't. But mov rax, [rdi] and mov r8, [r10] are the same size.

It's hard to gain much because you often need to use different values together in different combinations, like eventually using cmp eax, r10d or whatever, but if you want to go all-out on optimizing, then think about code-size. Maybe also think about where the instruction boundaries are and how it will fit into the uop cache.

See the x86 tag wiki, and especially http://agner.org/optimize/ for tips on writing efficient code.

On x64 Linux, what is the difference between syscall, int 0x80 and ret to exit a program?

If you use printf or other libc functions, it's best to ret from main or call exit. (Which are equivalent; main's caller will call the libc exit function.)

If not, i.e. if you were only making other raw system calls like write with syscall, it's also appropriate and consistent to exit that way. But either way, ret or call exit are 100% fine in main.

If you want to work without libc at all, e.g. put your code under _start: instead of main: and link with ld or gcc -static -nostdlib, then you can't use ret. Use mov eax, 231 (__NR_exit_group) / syscall.

main is a real & normal function like any other (called with a valid return address), but _start (the process entry point) isn't. On entry to _start, the stack holds argc and argv, so trying to ret would set RIP=argc, and then code-fetch would segfault on that unmapped address. See Nasm segmentation fault on RET in _start.



System call vs. ret-from-main

Exiting via a system call is like calling _exit() in C - skip atexit() and libc cleanup, notably not flushing any buffered stdout output (line buffered on a terminal, full-buffered otherwise).
This leads to symptoms such as Using printf in assembly leads to empty output when piping, but works on the terminal (or if your output doesn't end with \n, even on a terminal.)
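The same effect is easy to reproduce from C, since exit() and _exit() are the libc-level equivalents of ret-from-main and a raw exit system call. A minimal sketch, assuming a POSIX system with default stdio buffering; the helper name bytes_seen_after is made up. A child redirects stdout to a pipe, prints "hello" with no newline, and leaves one way or the other while the parent counts what arrives.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns how many bytes the parent received from the child's stdout.
   Hypothetical helper for illustration only. */
static long bytes_seen_after(int use_libc_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    fflush(NULL);                    /* don't let the child inherit pending output */

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        printf("hello");             /* no newline: stays in the stdio buffer */
        if (use_libc_exit)
            exit(0);                 /* libc cleanup runs and flushes stdout */
        _exit(0);                    /* straight to the kernel: buffer is discarded */
    }

    close(fds[1]);
    char buf[16];
    long total = 0;
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        total += n;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return total;
}
```

With exit() the parent should see all 5 bytes; with _exit() it should see none, which is the "empty output when piping" symptom.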

main is a function, called (indirectly) from CRT startup code. (Assuming you link your program normally, like you would a C program.) Your hand-written main works exactly like a compiler-generated C main function would. Its caller (__libc_start_main) really does do something like int result = main(argc, argv); exit(result);, e.g. call rax (pointer passed by _start) / mov edi, eax / call exit.

So returning from main is exactly[1] like calling exit.

  • Syscall implementation of exit() for a comparison of the relevant C functions, exit vs. _exit vs. exit_group and the underlying asm system calls.

  • C question: What is the difference between exit and return? is primarily about exit() vs. return, although there is mention of calling _exit() directly, i.e. just making a system call. It's applicable because C main compiles to an asm main just like you'd write by hand.

Footnote 1: You can invent a hypothetical intentionally weird case where it's different. e.g. you used stack space in main as your stdio buffer with sub rsp, 1024 / mov rsi, rsp / ... / call setvbuf. Then returning from main would involve putting RSP above that buffer, and __libc_start_main's call to exit could overwrite some of that buffer with return addresses and locals before execution reached the fflush cleanup. This mistake is more obvious in asm than C because you need leave or mov rsp, rbp or add rsp, 1024 or something to point RSP at your return address.

In C++, return from main runs destructors for its locals (before global/static exit stuff), exit doesn't. But that just means the compiler makes asm that does more stuff before actually running the ret, so it's all manual in asm, like in C.

The other difference is of course the asm / calling-convention details: exit status in EAX (return value) or EDI (first arg), and of course to ret you have to have RSP pointing at your return address, like it was on function entry. With call exit you don't, and you can even do a conditional tailcall of exit like jne exit. Since it's a noreturn function, you don't really need RSP pointing at a valid return address. (RSP should be aligned by 16 before a call, though, or RSP%16 = 8 before a tailcall, matching the alignment after call pushes a return address. It's unlikely that exit / fflush cleanup will do any alignment-required stores/loads to the stack, but it's a good habit to get this right.)

(This whole footnote is about ret vs. call exit, not syscall, so it's a bit of a tangent from the rest of the answer. You can also run syscall without caring where the stack-pointer points.)



SYS_exit vs. SYS_exit_group raw system calls

The raw SYS_exit system call is for exiting the current thread, like pthread_exit().

(eax=60 / syscall, or eax=1 / int 0x80).

SYS_exit_group is for exiting the whole program, like _exit.

(eax=231 / syscall, or eax=252 / int 0x80).

In a single-threaded program you can use either, but conceptually exit_group makes more sense to me if you're going to use raw system calls. glibc's _exit() wrapper function actually uses the exit_group system call (since glibc 2.3). See Syscall implementation of exit() for more details.
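For the single-threaded case, a small sketch using libc's generic syscall() wrapper shows both calls ending the process with the requested status. This assumes x86-64 Linux with glibc or musl; child_status_via is a made-up helper name, and the SYS_... constants come from sys/syscall.h.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                  /* makes sure syscall() is declared */
#endif
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a single-threaded child that leaves via the raw system call `nr`
   and returns the exit status the parent observes. Hypothetical helper. */
static int child_status_via(long nr, int status)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        syscall(nr, status);         /* SYS_exit or SYS_exit_group: no libc cleanup */
        _exit(127);                  /* only reached if the raw call somehow returned */
    }

    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) < 0)
        return -1;
    return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

On x86-64, SYS_exit is 60 and SYS_exit_group is 231; the two only behave differently once the process has more than one thread.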

However, nearly all the hand-written asm you'll ever see uses SYS_exit[1]. It's not "wrong", and SYS_exit is perfectly acceptable for a program that didn't start more threads. Especially if you're trying to save code size with xor eax,eax / inc eax (3 bytes in 32-bit mode) or push 60 / pop rax (3 bytes in 64-bit mode), while push 231/pop rax would be even larger than mov eax,231 because it doesn't fit in a signed imm8.

Note 1: (Usually actually hard-coding the number, not using __NR_... constants from asm/unistd.h or their SYS_... names from sys/syscall.h)

And historically, it's all there was. Note that in unistd_32.h, __NR_exit has call number 1, but __NR_exit_group = 252 wasn't added until years later when the kernel gained support for tasks that share virtual address space with their parent, aka threads started by clone(2). This is when SYS_exit conceptually became "exit current thread". (But one could easily and convincingly argue that in a single-threaded program, SYS_exit does still mean exit the whole program, because it only differs from exit_group if there are multiple threads.)

To be honest, I've never used eax=252 / int 0x80 in anything, only ever eax=1. It's only in 64-bit code where I often use mov eax,231 instead of mov eax,60 because neither number is "simple" or memorable the way 1 is, so might as well be a cool guy and use the "modern" exit_group way in my single-threaded toy program / experiment / microbenchmark / SO answer. :P (If I didn't enjoy tilting at windmills, I wouldn't spend so much time on assembly, especially on SO.)

And BTW, I usually use NASM for one-off experiments so it's inconvenient to use pre-defined symbolic constants for call numbers; with GCC to preprocess a .S before running GAS you can make your code self-documenting with #include <sys/syscall.h> so you can use mov $SYS_exit_group, %eax (or $__NR_exit_group), or mov eax, __NR_exit_group with .intel_syntax noprefix.



Don't use the 32-bit int 0x80 ABI in 64-bit code:

What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? explains what happens if you use the COMPAT_IA32_EMULATION int 0x80 ABI in 64-bit code.

It's totally fine for just exiting, as long as your kernel has that support compiled in, otherwise it will segfault just like any other random int number like int 0x7f. (e.g. on WSL1, or people that built custom kernels and disabled that support.)

But the only reason you'd do it that way in asm would be so you could build the same source file with nasm -felf32 or nasm -felf64. (You can't use syscall in 32-bit code, except on some AMD CPUs which have a 32-bit version of syscall. And the 32-bit ABI uses different call numbers anyway so this wouldn't let the same source be useful for both modes.)


Related:

  • Why am I allowed to exit main using ret? (CRT startup code calls main, you're not returning directly to the kernel.)
  • Nasm segmentation fault on RET in _start - you can't ret from _start
  • Using printf in assembly leads to empty output when piping, but works on the terminal - stdout buffer (not) flushing with raw system call exit
  • Syscall implementation of exit() - call exit vs. mov eax,60/syscall (_exit) vs. mov eax,231/syscall (exit_group).
  • Can't call C standard library function on 64-bit Linux from assembly (yasm) code - modern Linux distros config GCC in a way that call exit or call puts won't link with nasm -felf64 foo.asm && gcc foo.o.
  • Is main() really start of a C++ program? - Ciro's answer is a deep dive into how glibc + its CRT startup code actually call main (including x86-64 asm disassembly in GDB), and shows the glibc source code for __libc_start_main.
  • Linux x86 Program Start Up or - How the heck do we get to main()? - 32-bit asm, and more detail than you'll probably want until you're a lot more comfortable with asm, but if you've ever wondered why CRT runs so much code before getting to main, that covers what's happening at a level that's a couple steps up from using GDB with starti (stop at the process entry point, e.g. in the dynamic linker's _start) and stepi until you get to your own _start or main.
  • https://stackoverflow.com/tags/x86/info lots of good links about this and everything else.

