Why Does a Syscall Clobber Rcx and R11

When does Linux x86-64 syscall clobber %r8, %r9 and %r10?

Only 32-bit system calls (e.g. via int 0x80) in 64-bit mode step on those registers, along with R11. (What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?).

syscall properly saves/restores all regs including R8, R9, and R10, so user-space using it can assume they keep their values, except the RAX return value. (The kernel's syscall entry point even saves RCX and R11, but at that point they've already been overwritten by the syscall instruction itself with the original RIP and before-masking RFLAGS value.)
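As a rough way to check this from user-space, here is a minimal sketch (Linux x86-64 only, GNU C inline asm). The function name and the sentinel values are made up for illustration. It puts known values in R8, R9 and R10, makes a getpid call with syscall, and reports whether the values survived.

```c
#include <stdint.h>
#include <sys/syscall.h>   // SYS_getpid

// Hypothetical helper: returns 1 if R8, R9 and R10 hold the same values
// after a raw getpid system call as they did before it.
int r8_r10_survive_syscall(void)
{
    register uint64_t r8v  __asm__("r8")  = 0x1111111111111111ULL;
    register uint64_t r9v  __asm__("r9")  = 0x2222222222222222ULL;
    register uint64_t r10v __asm__("r10") = 0x3333333333333333ULL;
    long ret;

    // "+r" makes each register both an input and an output, so the
    // compiler has to re-read them after the asm statement.
    __asm__ volatile("syscall"
        : "=a"(ret), "+r"(r8v), "+r"(r9v), "+r"(r10v)
        : "a"((long)SYS_getpid)
        : "rcx", "r11", "memory");   // only RCX and R11 are declared clobbered

    (void)ret;
    return r8v  == 0x1111111111111111ULL
        && r9v  == 0x2222222222222222ULL
        && r10v == 0x3333333333333333ULL;
}
```

On a Linux kernel this should return 1. On FreeBSD (discussed further down) the same check would be expected to fail when not single-stepping.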


Those, with R11, are the non-legacy registers that are call-clobbered in the function-calling convention, so compiler-generated code for C functions inside the kernel naturally preserves R12-R15, even if an asm entry point didn't save them.

Currently the 64-bit int 0x80 entry point just pushes 0 for the call-clobbered R8-R11 registers in the process-state struct that it will restore from before returning to user space, instead of the original register values.

Historically, the int 0x80 entry point from 32-bit user-space didn't save/restore those registers at all. So their values were whatever compiler-generated kernel code left sitting around. This was thought to be innocent because 32-bit mode can't read those registers, until it was realized that user-space can far-jump to 64-bit mode, using the same CS value that the kernel uses for normal 64-bit user-space processes, selecting that system-wide GDT entry. So there was an actual info leak of kernel data, which was fixed by zeroing those registers.

IDK whether there used to be or still is a separate entry point from 64-bit user-space vs. 32-bit, or how they differ in struct pt_regs layout. The historical situation where int 0x80 leaked r8..r11 wouldn't have made sense for 64-bit user-space; that leak would have been obvious. So if they're unified now, they must not have been in the past.

Why do x86-64 Linux system calls modify RCX, and what does the value mean?

The system call return value is in rax, as always. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64.

Note that sys_brk has a slightly different interface than the brk / sbrk POSIX functions; see the C library/kernel differences section of the Linux brk(2) man page. Specifically, Linux sys_brk sets the program break; the arg and return value are both pointers. See Assembly x86 brk() call use. That answer needs upvotes because it's the only good one on that question.
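To make the difference concrete, here is a small sketch that calls the raw system call through libc's generic syscall() function instead of the brk() wrapper. The helper name raw_brk is hypothetical; the behaviour described in the comments is what the brk(2) man page documents for Linux.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>        // syscall()
#include <sys/syscall.h>   // SYS_brk

// Hypothetical wrapper around the raw Linux system call (not libc's brk()).
// The return value is the program break after the call. On failure that is
// simply the unchanged old break, not -1.
uintptr_t raw_brk(uintptr_t new_break)
{
    return (uintptr_t)syscall(SYS_brk, new_break);
}
```

Calling raw_brk(0) is the usual way to query the current break, because 0 is never a valid new break and the kernel just returns the current one.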


The other interesting part of your question is:

I do not quite understand the value in the rcx register in this case

You're seeing the mechanics of how the syscall / sysret instructions are designed to allow the kernel to resume user-space execution but still be fast.

syscall doesn't do any loads or stores, it only modifies registers. Instead of using special registers to save a return address, it simply uses regular integer registers.

It's not a coincidence that RCX=RIP and R11=RFLAGS after the kernel returns to your user-space code. The only way for this not to be the case is if a ptrace system call modified the process's saved rcx or r11 value while it was inside the kernel. (ptrace is the system call gdb uses). In that case, Linux would use iret instead of sysret to return to user space, because the slower general-case iret can do that. (See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for some walk-through of Linux's system-call entry points. Mostly the entry points from 32-bit processes, not from syscall in a 64-bit process, though.)


Instead of pushing a return address onto the kernel stack (like int 0x80 does), syscall:

  • sets RCX=RIP, R11=RFLAGS (so it's impossible for the kernel to even see the original values of those regs before you executed syscall).

  • masks RFLAGS with a pre-configured mask from a config register (the IA32_FMASK MSR). This lets the kernel disable interrupts (IF) until it's done swapgs and setting rsp to point to the kernel stack. Even with cli as the first instruction at the entry point, there'd be a window of vulnerability. You also get cld for free by masking off DF so rep movs / stos go upward even if user-space had used std.

    Fun fact: AMD's first proposed syscall / swapgs design didn't mask RFLAGS, but they changed it after feedback from kernel developers on the amd64 mailing list (in ~2000, a couple years before the first silicon).

  • jumps to the configured syscall entry point (setting RIP from the IA32_LSTAR MSR, and CS / SS from fields in the IA32_STAR MSR). The old CS value isn't saved anywhere, I think.

  • It doesn't do anything else. The kernel has to use swapgs to get access to an info block where it saved the kernel stack pointer, because rsp still has its value from user-space.

So the design of syscall requires a system-call ABI that clobbers registers, and that's why the values are what they are.
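You can observe those values from user-space with a sketch like the following (Linux x86-64, GNU C inline asm; the struct and function names are invented for this example). It records RCX and R11 right after a getpid call, along with the address of the instruction that follows syscall.

```c
#include <stdint.h>
#include <sys/syscall.h>   // SYS_getpid

struct syscall_probe {
    uint64_t rcx;       // RCX right after the syscall instruction
    uint64_t r11;       // R11 right after the syscall instruction
    uint64_t next_rip;  // address of the instruction following syscall
};

struct syscall_probe probe_syscall(void)
{
    struct syscall_probe p;
    long ret;

    __asm__ volatile(
        "syscall\n"
        "1:\n\t"
        "lea 1b(%%rip), %[next]\n\t"   // address of label 1 = the return address
        "mov %%r11, %[flags]"          // copy R11 out; neither lea nor mov touches flags
        : "=a"(ret), "=c"(p.rcx), [next] "=&r"(p.next_rip), [flags] "=&r"(p.r11)
        : "a"((long)SYS_getpid)
        : "r11", "memory");

    (void)ret;
    return p;
}
```

Unless a tracer modified the saved registers, p.rcx should equal p.next_rip, and p.r11 should look like an RFLAGS value: bit 1 (always set) and bit 9 (IF, always set in user-space) are both 1.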

FreeBSD syscall clobbering more registers than Linux? Inline asm different behaviour between optimization levels

Optimization level determines where clang decides to keep its loop counter: in memory (unoptimized) or in a register, in this case r8d (optimized). R8D is a logical choice for the compiler: it's a call-clobbered reg it can use without saving at the start/end of main, and you've told it all the registers it could use without a REX prefix (like ECX) are either inputs / outputs or clobbers for the asm statement.

Note: if FreeBSD is like MacOS, system call error / no-error status is returned in CF (the carry flag), not via RAX being in the -4095..-1 range. In that case, you'd want a GCC 6 flag-output operand like "=@ccc" (err) for int err (#ifdef __GCC_ASM_FLAG_OUTPUTS__ - example) or a setc %cl in the template to materialize a boolean manually. (CL is a good choice because you can just use it as an output instead of a clobber.)
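Here is a sketch of both ways of getting CF out of an asm statement. To keep it runnable on any x86-64 system it uses an add instruction rather than a FreeBSD syscall, so it only illustrates the operand syntax, not the FreeBSD ABI. The function name is hypothetical.

```c
#include <stdint.h>

// Hypothetical helper: returns a + b and stores the carry-out (CF) in *carry.
uint64_t add_carry_out(uint64_t a, uint64_t b, int *carry)
{
#ifdef __GCC_ASM_FLAG_OUTPUTS__
    int cf;
    __asm__("add %[b], %[a]"
            : [a] "+r"(a), "=@ccc"(cf)   // cf = CF as left by the asm statement
            : [b] "r"(b));
#else
    unsigned char cf;
    __asm__("add %[b], %[a]\n\t"
            "setc %[cf]"                 // materialize CF manually
            : [a] "+r"(a), [cf] "=&r"(cf)
            : [b] "r"(b)
            : "cc");
#endif
    *carry = cf;
    return a;
}
```

With the flag-output form the compiler can branch on CF directly instead of first turning it into a 0/1 integer in a register.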


FreeBSD's syscall handling trashes R8, R9, and R10, in addition to the bare minimum clobbering that Linux does: RAX (retval) and RCX / R11. (The syscall instruction itself uses them to save RIP / RFLAGS so the kernel can find its way back to user-space, so the kernel never even sees the original values.)

Possibly also RDX, we're not sure; the comments call it "return value 2" (i.e. as part of a RDX:RAX return value?). We also don't know what future-proof ABI guarantees FreeBSD intends to maintain in future kernels.

You can't assume R8-R10 are zero after syscall because they're actually preserved instead of zeroed when tracing / single-stepping. (Because then the kernel chooses not to return via sysret, for the same reason as Linux: hardware / design bugs make that unsafe if registers might have been modified by ptrace while inside the system call. e.g. attempting to sysret with a non-canonical RIP will #GP in ring 0 (kernel mode) on Intel CPUs! That's a disaster because RSP = user stack at that point.)


The relevant kernel code is the sysret path (well spotted by @NateEldredge; I found the syscall entry point by searching for swapgs, but hadn't gotten to looking at the return path).

The function-call-preserved registers don't need to be restored by that code because calling a C function didn't destroy them in the first place. And the code does restore the function-call-clobbered "legacy" registers RDI, RSI, and RDX.

R8-R11 are the registers that are call-clobbered in the function-calling convention, and that are outside the original 8 x86 registers. So that's what makes them "special". (R11 doesn't get zeroed; syscall/sysret uses it for RFLAGS, so that's the value you'll find there after syscall)

Zeroing is slightly faster than loading them, and in the normal case (syscall instruction inside a libc wrapper function) you're about to return to a caller that's only assuming the function-calling convention, and thus will assume that R8-R11 are trashed (same for RDI, RSI, RDX, and RCX, although FreeBSD does bother to restore those for some reason.)


This zeroing only happens when not single-stepping or tracing (e.g. truss or GDB si). The syscall entry point into an amd64 kernel (Github) does save all the incoming registers, so they're available to be restored by other ways out of the kernel.



Updated asm() wrapper

// Should be fixed for FreeBSD, plus other improvements
#include <sys/types.h>    // ssize_t, size_t
#include <sys/syscall.h>  // SYS_write

ssize_t sys_write(int fd, const void *data, size_t size){
    register ssize_t res __asm__("rax");
    register int arg0 __asm__("edi") = fd;
    register const void *arg1 __asm__("rsi") = data; // you can use real types
    register size_t arg2 __asm__("rdx") = size;
    __asm__ __volatile__(
        "syscall"
        // RDX *maybe* clobbered
        : "=a" (res), "+r" (arg2)
        // RDI, RSI preserved
        : "a" (SYS_write), "r" (arg0), "r" (arg1)
        // An arg in R10, R8, or R9 definitely would be
        : "rcx", "r11", "memory", "r8", "r9", "r10" ////// The fix: r8-r10
        // see below for a version that avoids the "memory" clobber with a dummy input operand
    );
    return res;
}

Use "+r" output/input operands with any args that need register long arg3 asm("r10") or similar for r8 or r9.
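A sketch of what that looks like for a generic six-argument wrapper. The names syscall6 and raw_mmap_anon are hypothetical, and RDX is treated as possibly clobbered, following the defensive assumption above.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   // SYS_mmap
#include <sys/mman.h>      // PROT_*, MAP_*

// Hypothetical generic wrapper: args 4-6 need register-asm locals because
// there are no single-letter constraints for R10, R8 and R9.
long syscall6(long nr, long a0, long a1, long a2, long a3, long a4, long a5)
{
    register long r10 __asm__("r10") = a3;
    register long r8  __asm__("r8")  = a4;
    register long r9  __asm__("r9")  = a5;
    long ret;

    __asm__ volatile("syscall"
        // "+" = read and possibly overwritten (RDX, R10, R8, R9)
        : "=a"(ret), "+d"(a2), "+r"(r10), "+r"(r8), "+r"(r9)
        : "a"(nr), "D"(a0), "S"(a1)
        : "rcx", "r11", "memory");
    return ret;
}

// Usage sketch: an anonymous private mapping uses all six argument registers.
long raw_mmap_anon(unsigned long len)
{
    return syscall6(SYS_mmap, 0, (long)len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

A register can't appear both as an operand and in the clobber list, which is why R8-R10 are "+r" operands here rather than clobbers.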

This is inside a wrapper function so the modified values of the C variables get thrown away, forcing repeated calls to set up the args every time. That would be the "defensive" approach until another answer identifies more definitely-non-trashed registers.



I did break *0x000000000020192b then info registers when break happened. r8 is zero. Program still gets stuck in this case

I assume that r8 wasn't zero before you did that GDB continue across the syscall instruction. Yes, that test confirms that the FreeBSD kernel is trashing r8 when not single-stepping. (And behaving in a way that matches what we see in the source code.)


Note that you can tell the compiler that a write system call only reads memory (not writes) using a dummy "m" input operand instead of a "memory" clobber. That would let it hoist the store of c out of the loop. (How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)

i.e. "m"(*(const char (*)[size]) data) as an input instead of a "memory" clobber.

If you're going to write specific wrappers for each syscall you use, instead of a generic wrapper you use for every 3-operand syscall that just casts all operands to unsigned long, this is the advantage you can get from doing that.

Speaking of which, there's absolutely no point in making your syscall args all be long; making user-space sign-extend int fd into a 64-bit register is just wasted instructions. The kernel ABI will (almost certainly) ignore the high bytes of registers for narrow args, like Linux does. (Again, unless you're making a generic syscall3 wrapper that you just use with different SYS_ numbers to define write, read, and other 3-operand system calls; then you would cast everything to register-width and just use a "memory" clobber).

I made these changes for my modified version below.

Also note that for RDI, RSI, and RDX, there are specific-register letter constraints which you can use instead of register-asm locals, just like you're doing for the return value in RAX ("=a"). BTW, you don't really need a matching constraint for the call number, just use an "a" input; it's easier to read because you don't need to look at another operand to check that you're matching the right output.

// assuming RDX *is* clobbered.
// could remove the + if it isn't.
ssize_t sys_write(int fd, const void *data, size_t size)
{
    // register long arg3 __asm__("r10") = ??;
    // register-asm is useful for R8 and up

    ssize_t res;
    __asm__ __volatile__("syscall"
        // RDX
        : "=a" (res), "+d" (size)
        // EAX/RAX RDI RSI
        : "a" (SYS_write), "D" (fd), "S" (data),
          "m" (*(const char (*)[size]) data) // tells compiler this mem is an input
        : "rcx", "r11" //, "memory"
#ifndef __linux__
        , "r8", "r9", "r10" // Linux always restores these
#endif
    );
    return res;
}

Some people prefer register ... asm("") for all the operands because you get to use the full register name, and don't have to remember the totally-non-obvious "D" for RDI/EDI/DI/DIL vs. "d" for RDX/EDX/DX/DL

Why do x86-64 Linux system calls work with 6 registers set?

System calls accept up to 6 arguments, passed in registers (almost the same registers as the SysV x64 C ABI, with r10 replacing rcx, but unlike in a function call they are preserved by the callee, i.e. the kernel), and "extra" arguments are simply ignored.

Some specific answers to your questions below.

The src/internal/x86_64/syscall.s is just a "thunk" which shifts all the arguments into the right place. That is, it converts from a C-ABI function which takes the syscall number and 6 more arguments, into a "syscall ABI" function with the same 6 arguments and the syscall number in rax. It works "just fine" for any number of arguments - the additional register movement will simply be ignored by the syscall if those arguments aren't used.

Since in the C-ABI all the argument registers are considered scratch (i.e., caller-save), clobbering them is harmless if you assume this __syscall method is called from C. In fact the kernel makes stronger guarantees about clobbered registers, clobbering only rcx and r11 so assuming the C calling convention is safe but pessimistic. In particular, the code calling __syscall as implemented here will unnecessarily save any argument and scratch registers per the C ABI, despite the kernel's promise to preserve them.

The arch/x86_64/syscall_arch.h file is pretty much the same thing, but in a C header file. Here, you want all seven versions (for zero to six arguments) because modern C compilers will warn or error if you call a function with the wrong number of arguments. So there is no real option to have "one function to rule them all" as in the assembly case. This also has the advantage of doing less work for syscalls that take fewer than 6 arguments.
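To give an idea of the shape of such a header, here is a sketch of three of the seven variants. This is written in the spirit of that file, not copied from it, and the my_syscallN names are made up.

```c
#include <sys/syscall.h>   // SYS_getpid, SYS_close, SYS_write

// One function per argument count, so the compiler only sets up the
// registers that are actually needed.
static inline long my_syscall0(long nr)
{
    long ret;
    __asm__ volatile("syscall"
        : "=a"(ret)
        : "a"(nr)
        : "rcx", "r11", "memory");
    return ret;
}

static inline long my_syscall1(long nr, long a1)
{
    long ret;
    __asm__ volatile("syscall"
        : "=a"(ret)
        : "a"(nr), "D"(a1)
        : "rcx", "r11", "memory");
    return ret;
}

static inline long my_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;
    __asm__ volatile("syscall"
        : "=a"(ret)
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
        : "rcx", "r11", "memory");
    return ret;
}
```

For example, my_syscall1(SYS_close, -1) returns the raw kernel value -EBADF rather than setting errno, since no libc wrapper is involved.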

Your listed questions, answered:

  • Why can I pass more parameters than the system call takes?

Because the calling convention is mostly register-based and caller cleanup. You can always pass more arguments in this situation (including in the C ABI) and the other arguments will simply be ignored by the callee. Since the syscall mechanism is generic at the C and .asm level, there is no real way the compiler can ensure you are passing the right number of arguments - you need to pass the right syscall id and the right number of arguments. If you pass fewer, the kernel will see garbage, and if you pass more, they will be ignored.

  • Is this reasonable, documented behavior?

Yes, sure - because the whole syscall mechanism is a "generic gate" into the kernel. 99% of the time you aren't going to use that: glibc wraps the vast majority of interesting syscalls in C ABI wrappers with the correct signature so you don't have to worry about it. Those are the ways that syscall access happens safely.

  • What am I supposed to set the unused registers to?

You don't set them to anything. If you use the C prototypes in arch/x86_64/syscall_arch.h, the compiler just takes care of it for you (it doesn't set them to anything) and if you are writing your own asm, you don't set them to anything (and you should assume they are clobbered after the syscall).

  • What will the kernel do with the registers it doesn't use?

It is free to use all the registers it wants, but will adhere to the kernel calling convention which is that on x86-64 all registers other than rax, rcx and r11 are preserved (which is why you see rcx and r11 in the clobber list in the C inline asm).

  • Is the seven function approach faster by virtue of having less instructions?

Yes, but the difference is very small since reg-reg mov instructions usually have zero latency and high throughput (up to 4/cycle) on recent Intel architectures. So moving an extra 6 registers perhaps takes something like 1.5 cycles for a syscall that is usually going to take at least 50 cycles even if it does nothing. So the impact is small, but probably measurable (if you measure very carefully!).

  • What happens to the other registers in those functions?

I'm not sure what you mean exactly, but the kernel can use the other registers just like any GP registers, as long as it preserves their values (e.g., by pushing them on the stack and then popping them later).
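The "extra arguments are ignored" point from the first question can be tried with libc's generic syscall() function. A minimal sketch (the function name is invented for this example):

```c
#define _GNU_SOURCE
#include <unistd.h>        // syscall(), getpid()
#include <sys/syscall.h>   // SYS_getpid

// getpid takes no arguments. The six extra values end up in
// RDI, RSI, RDX, R10, R8 and R9, and the kernel never looks at them.
long getpid_with_junk_args(void)
{
    return syscall(SYS_getpid, 1L, 2L, 3L, 4L, 5L, 6L);
}
```

The result should be the same PID that getpid() returns.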

Clang 11 and GCC 8 O2 Breaks Inline Assembly

  1. You can't use extended asm in a naked function, only basic asm, according to the gcc manual. You don't need to inform the compiler of clobbered registers (since it won't do anything about them anyway; in a naked function you are responsible for all register management). And passing the address of entry in an extended operand is unnecessary; just do jmp entry.

    (In my tests your code doesn't compile at all, so I assume you weren't showing us your exact code - next time please do, so as to avoid wasting people's time.)

  2. Linux x86-64 syscall system calls are allowed to clobber the rcx and r11 registers, so you need to add those to the clobber lists of your system calls.

  3. You align the stack to a 16-byte boundary before jumping to entry. However, the 16-byte alignment rule is based on the assumption that you will be calling the function with call, which would push an additional 8 bytes onto the stack. As such, the called function actually expects the stack to initially be, not a multiple of 16, but 8 more or less than a multiple of 16. So you are actually aligning the stack incorrectly, and this can be a cause of all sorts of mysterious trouble.

    So either replace your jmp with call, or else subtract a further 8 bytes from rsp (or just push some 64-bit register of your choice).

  4. Style note: unsigned long is already 64 bits on Linux x86-64, so it would be more idiomatic to use that in place of unsigned long long everywhere.

  5. General hint: learn about register constraints in extended asm. You can have the compiler load your desired registers for you, instead of writing instructions in your asm to do it yourself. So your exit function could instead look like:

    void exit(unsigned long status) {
        asm volatile("syscall"
            : //no outputs
            :"a"(60), "D" (status)
            :"rcx", "r11");
    }

This in particular saves you a few instructions, since status is already in the %rdi register on function entry. With your original code, the compiler has to move it somewhere else so that you can then load it into %rdi yourself.


  6. Your open function always returns 1, which will typically not be the fd that was actually opened. So if your program is run with standard output redirected, your program will write to the redirected stdout, instead of to the tty as it seems to want to do. Indeed, this makes the open syscall completely pointless, because you never use the file you opened.

    You should arrange for open to return the value that was actually returned by the system call, which will be left in the %rax register when syscall returns. You can use an output operand to have this stored in a temporary variable (which the compiler will likely optimize out), and return that. You'll need to use a digit constraint since it is going in the same register as an input operand. I leave this as an exercise for you. It would likewise be nice if your write function actually returned the number of bytes written.
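For reference, one possible shape for that open wrapper is sketched below. It is not the asker's code: the name sys_open is made up, and the "memory" clobber is the simple conservative choice for a syscall that reads a string through a pointer.

```c
#include <fcntl.h>         // O_RDONLY etc., for callers
#include <sys/syscall.h>   // SYS_open

// Returns the raw kernel result: a new fd (>= 0) or a negative errno value.
long sys_open(const char *path, int flags, int mode)
{
    long ret;
    __asm__ volatile("syscall"
        : "=a"(ret)                       // operand 0: result comes back in RAX
        : "0"((long)SYS_open),            // digit constraint: same register as operand 0
          "D"(path), "S"(flags), "d"(mode)
        : "rcx", "r11", "memory");
    return ret;
}
```

sys_open("/dev/null", O_RDONLY, 0) should give a non-negative fd, and opening a path that doesn't exist should give -ENOENT (-2).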

For temporary registers in the asm statement, should I use clobber or dummy output?

You should normally let the compiler pick registers for you, using an early-clobber dummy output with any required constraints (see note 1). This gives it flexibility to do register allocation for the function.

Note 1: e.g. you can use +&Q to get one of RAX/RBX/RCX/RDX: registers that have an AH/BH/CH/DH. If you wanted to unpack 8-bit fields with movzbl %h[input], %[high_byte] ; movzbl %b[input], %[low_byte] ; shr $16, %[input], you'd need a register that has its 2nd 8-bit chunk aliased to a high-8 register.
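A minimal sketch of the dummy-output pattern, with a made-up function that needs one scratch register. The compiler chooses which register tmp lives in.

```c
#include <stdint.h>

// Hypothetical example: returns x ^ rotl(x, 13), using a compiler-chosen
// scratch register instead of a hard-coded one in the clobber list.
uint64_t rotl13_xor(uint64_t x)
{
    uint64_t tmp;   // dummy output: its value is never used by the C code
    __asm__("mov %[x], %[tmp]\n\t"
            "rol $13, %[tmp]\n\t"
            "xor %[tmp], %[x]"
            : [x] "+r"(x), [tmp] "=&r"(tmp)   // & = written before inputs are done being read
            :
            : "cc");
    return x;
}
```

If this inlines into a loop, the register allocator can place tmp wherever is cheapest at that point, which a fixed clobber like "rcx" wouldn't allow.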

Out of curiosity, when we consider a calling convention of amd64, some registers can be freely used inside the functions; and we could implement some functions by only using those registers inside the asm statement. Why allowing the compiler to choose the registers to be used is better than the mentioned one?

Because functions can inline, maybe into a loop that calls other functions, thus the compiler would want to give it inputs in call-preserved registers. If you were writing a stand-alone function that the compiler always has to call, all you get from inline asm instead of stand-alone is the compiler handling calling-convention differences and C++ name-mangling.

Or maybe the surrounding code uses some instructions that require fixed registers, like cl for shift counts or RDX:RAX for div.



when should I use the clobber list? ...
such as syscall instruction requires its parameter should be located in register rdi rsi rdx r10 r8 r9??

Normally you'd use input constraints instead, so only the syscall instruction itself is inside the inline asm. But syscall (the instruction itself) clobbers RCX and R11, so system calls made using it unavoidably destroy user-space's RCX and R11. There's no point using dummy outputs for these, unless you have a use for the return address (RCX) or RFLAGS (R11). So yes, clobbers are useful here.

#include <stddef.h>
#include <asm/unistd.h>

// the compiler will emit all the necessary MOV instructions
//static inline
size_t sys_write(int fd, const char *buf, size_t len) {
    size_t retval;
    asm volatile("syscall"
        : "=a"(retval)
        // inputs: RAX, EDI, RSI, RDX
        : "a"(__NR_write), "D"(fd), "S"(buf), "d"(len)
        , "m"(*(const char (*)[len]) buf) // dummy memory input: the asm statement reads this memory
        : "rcx", "r11" // clobbered by syscall
        // , "memory" // would be needed if we didn't use a dummy memory input
    );
    return retval;
}

A non-inline version of this compiles as follows (with gcc -O3 on the Godbolt compiler explorer), because the function-calling convention nearly matches the system-call convention:

sys_write(int, char const*, unsigned long):
    movl $1, %eax
    syscall
    ret

It would have been really silly to use clobbers on any of the input registers and put a mov inside the asm:

size_t dumb_sys_write(int fd, const char *buf, size_t len) {
    size_t retval;
    asm volatile(
        "mov %[fd], %%edi\n\t"
        "mov %[buf], %%rsi\n\t"
        "mov %[len], %%rdx\n\t"
        "syscall"
        : "=a"(retval)
        : "a"(__NR_write), [fd]"r"(fd), [buf]"r"(buf), [len]"r"(len)
        , "m"(*(const char (*)[len]) buf) // dummy memory input: the asm statement reads this memory
        : "rdi", "rsi", "rdx", "rcx", "r11"
        // , "memory" // would be needed if we didn't use a dummy memory input
    );

    // if(retval > -4096ULL) errno = -retval;

    return retval;
}

dumb_sys_write(int, char const*, unsigned long):
    movl %edi, %r9d
    movq %rsi, %r8
    movq %rdx, %r10
    movl $1, %eax      # compiler generated before this
    # from inline asm
    mov %r9d, %edi
    mov %r8, %rsi
    mov %r10, %rdx
    syscall
    # end of inline asm
    ret

And besides that, you're not letting the compiler take advantage of the fact that syscall doesn't clobber any of its input registers. The compiler might well still want len in a register, and using a pure input constraint lets it know that the value will still be there afterwards.
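The commented-out errno line in dumb_sys_write hints at how a raw Linux return value maps onto the libc convention. A sketch of that conversion as a separate helper (the name is hypothetical):

```c
#include <errno.h>

// Hypothetical helper: convert a raw Linux system-call return value into
// the libc convention (-1 with errno set on failure).
long syscall_ret_to_errno(unsigned long raw)
{
    if (raw > -4096UL) {        // i.e. raw is in [-4095, -1] as a signed value
        errno = (int)-raw;      // e.g. raw == -9 gives errno 9 (EBADF)
        return -1;
    }
    return (long)raw;           // success: pass the value through unchanged
}
```

The unsigned comparison is what makes a single test enough: every error code maps to a value just below 2^64, while valid results such as byte counts or mmap addresses never fall in that top 4095-value range.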


You might also use clobbers if you're using any instructions that implicitly use certain registers, but neither the input nor output of those instructions is a direct input or output of the asm statement. That would be rare, though, unless you're writing a whole loop or large block of code in inline asm.

Or maybe if you're wrapping a call instruction. (It's hard to do this safely, especially because of the red-zone, but people do try to do this). You don't get to choose which registers the code clobbers, so you just tell the compiler about it.

Acceptability of regular usage of r10 and r11

Both r10 and r11 are call-clobbered registers, aka volatile: you can overwrite them without saving/restoring in any leaf or non-leaf function. That's what C compilers do, and expect from functions their code calls, because that's what the ABI doc says: What registers are preserved through a linux x86-64 function call

Your caller will expect them to hold garbage after return. Just like arg-passing registers such as RDI or RCX. (And RDX if it's not part of a wide RDX:RAX return value.)


The x86-64 System V ABI doesn't name its calling convention "cdecl". It's just the x86-64 SysV calling convention. The string "cdecl" doesn't appear in the ABI doc.

r11 is a temporary, aka call-clobbered register. r11 is never used for passing or returning anything, so it's safe even for wrapper / trampoline / hook functions to clobber it even if they want to forward all args and return all return values, unlike any other register. For example, lazy dynamic linker code.

r10 is also call-clobbered. The ABI says "used for passing a function’s static chain pointer". In languages that use nested functions, this is an extra incoming arg to such functions so they can find the local vars of the outer scope. A pointer-to-nested-function needs a code pointer and a static chain pointer for the caller to pass when dereferencing.

It's a "chain" because there can be multiple levels of nesting with their stack frames forming a linked list, and "static" because it's based on lexical scope, not the call stack. (Thanks @Raymond Chen for explaining the terminology.)

But like all arg-passing registers to function calls (not system calls) in x86-64 System V, it's call-clobbered (like in most calling conventions generally). You only need to worry about this usage of r10 if you're hooking or wrapping nested functions, ones defined inside another function. If you're just writing a function that's called normally, it's a pure temporary.

GCC does use r10 as part of its trampoline for function pointers to GNU C nested functions, for a pointer to the stack frame of the outer scope. The trampoline of machine code on the stack is a hack, but this is indeed a static chain pointer; languages with proper support for nested functions (unlike C and C++) would probably have the caller aware of it (like a lambda / closure) and passing a value in r10 when using a pointer to a nested function.


In a normal function, RBX, RBP, and RSP are call-preserved, along with R12..R15. All others can be clobbered without saving/restoring. (That includes xmm/ymm0..15 and zmm0..31 / k0..7, and the mmx/x87 stack, and the condition codes in RFLAGS).


Note that r8..15 need a REX prefix, even with 32-bit operand-size (like xor r10d, r10d). If you have some 64-bit non-pointer integers, then sure keep them in r8..r11 because you always need a REX prefix for 64-bit operand-size any time you use those values anyway.

Smaller code-size is usually not worse, and sometimes helps with decode and uop-cache density, and L1i cache density. RAX, RCX, RDX, RSI, RDI should be your first choices for scratch regs. (And use 32-bit operand-size unless you need 64-bit. e.g. xor eax,eax is the correct way to zero RAX. Silvermont doesn't recognize xor r10,r10 as a zeroing idiom, so use xor r10d,r10d even though it doesn't save code size.)


