Why Is Rcx Not Used for Passing Parameters to System Calls, Being Replaced with R10

Why is RCX not used for passing parameters to system calls, being replaced with R10?

x86-64 system calls use the syscall instruction. This instruction saves the return address in rcx and then loads rip from the IA32_LSTAR MSR. In other words, syscall destroys rcx immediately, which is why rcx had to be replaced in the system-call ABI.

The same syscall instruction also saves rflags into r11 and then masks rflags using the IA32_FMASK MSR. This is why r11 isn't preserved by the kernel either.

So these changes reflect how the syscall mechanism works: the kernel has to declare rcx and r11 as not preserved, and it can't use them for parameter passing at all.
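
As a rough illustration of what this means for user-space code, here is a minimal sketch of a raw system-call helper using GNU C inline asm. The helper name raw_syscall4 is made up for this example, and the sketch assumes GCC or Clang on Linux; on anything other than x86-64 it simply falls back to libc's syscall(). Note the 4th argument going into r10, and rcx and r11 listed as clobbers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        /* for the syscall() fallback declaration */
#endif
#include <assert.h>
#include <sys/syscall.h>   /* SYS_* numbers */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper (name is an assumption): system call with up to 4 args. */
static long raw_syscall4(long nr, long a1, long a2, long a3, long a4)
{
    assert(nr >= 0);                         /* call numbers are non-negative */
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    register long r10 __asm__("r10") = a4;   /* 4th arg: r10, not rcx */
    __asm__ volatile ("syscall"
        : "=a"(ret)                          /* result comes back in rax */
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3), "r"(r10)
        : "rcx", "r11", "memory");           /* syscall overwrites rcx and r11 */
    return ret;
#else
    /* Portable fallback: let libc place the arguments. */
    return syscall(nr, a1, a2, a3, a4);
#endif
}
```

For instance, raw_syscall4(SYS_getpid, 0, 0, 0, 0) should return the same value as getpid(); the extra arguments are simply ignored by that call.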

Reference: Intel's Instruction Set Reference, look for SYSCALL.

What are R10-R15 registers used for in the Windows x64 calling convention?

The Windows x64 calling convention is designed to make it easy to implement variadic functions (like printf and scanf) by dumping the 4 register args into the shadow space, creating a contiguous array of all args. Args larger than 8 bytes are passed by reference, so each arg always takes exactly 1 arg-passing slot.

Given this design constraint, more register args would require a larger shadow space, which wastes more stack space for small functions that don't have a lot of args.
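
To make the variadic-function point concrete, here is a small sketch in portable C (the function name sum_longs is made up for illustration). On Windows x64, once the register args have been dumped into shadow space, stepping through the arguments like this can be a plain pointer increment over 8-byte slots; on x86-64 System V the same source needs more bookkeeping behind va_arg.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical example: add up `count` arguments, each of type long. */
static long sum_longs(int count, ...)
{
    assert(count >= 0);
    va_list ap;
    long total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, long);   /* fetch the next argument slot */
    va_end(ap);
    return total;
}
```

With six values the later ones are stack args in the Windows convention, but the callee walks them the same way as the first few.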

Yes, more register args would normally be more efficient. But if the callee wants to make another function call right away with different args, it would then have to store all its register args to the stack, so there's a limit on how many register args are useful.

You want a good mix of call-preserved and call-clobbered registers, regardless of how many are used for arg-passing. R10 and R11 are call-clobbered scratch regs. A transparent wrapper function written in asm might use them for scratch space without disturbing any of the args in RCX,RDX,R8,R9, and without needing to save/restore a call-preserved register anywhere.

R12..R15 are call-preserved registers you can use for whatever you want, as long as you save/restore them before returning.

Or if we can do that in user-defined functions

Yes, you can freely make up your own calling conventions when calling from asm to asm, subject to constraints imposed by the OS. But if you want exceptions to be able to unwind the stack through such a call (e.g. if one of the child functions calls back into some C++ that can throw), you have to follow more restrictions, such as creating unwind metadata. If not, you can do nearly anything.

See my answer "Choose your calling convention to put args where you want them" on the Code Golf Q&A "Tips for golfing in x86/x64 machine code".

You can also return in whatever register(s) you want, and return multiple values. (e.g. an asm strcmp or memcmp function can return the -/0/+ difference in the mismatch in EAX, and return the mismatch position in RDI, so the caller can use either or both.)
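
The closest thing in C to that two-register return is returning a small struct by value. The sketch below is only an illustration (the names mismatch and memcmp_pos are made up): under x86-64 System V a 16-byte struct of two integer fields like this comes back in RAX and RDX, while the Windows x64 convention would return it through a hidden pointer instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: the sign-carrying difference plus its position. */
struct mismatch {
    int    diff;   /* <0, 0, or >0, like memcmp */
    size_t pos;    /* index of first mismatch, or n if the buffers are equal */
};

static struct mismatch memcmp_pos(const void *a, const void *b, size_t n)
{
    assert(a != NULL && b != NULL);
    const unsigned char *p = a, *q = b;

    for (size_t i = 0; i < n; i++) {
        if (p[i] != q[i])
            return (struct mismatch){ p[i] - q[i], i };
    }
    return (struct mismatch){ 0, n };
}
```

The caller can use either field or both, which is the same idea as the hand-written asm version returning in two registers.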

A useful exercise in evaluating a design is to compare it to other actual or possible designs

By comparison, the x86-64 System V ABI passes the first 6 integer args in registers, and the first 8 FP args in XMM0..7. (Windows x64 passes the 5th arg on the stack, even if it's FP and the first 4 args were all integer.)

So the other major x86-64 calling convention does use more arg-passing registers. It doesn't use shadow-space; it defines a red-zone below RSP that's safe from being asynchronously clobbered. Small leaf functions can still avoid manipulating RSP to reserve space.

Fun fact: R10 and R11 are also non-arg-passing call-clobbered registers in x86-64 SysV. Fun fact #2: syscall destroys R11 (and RCX), so Linux uses R10 instead of RCX for passing arguments to system calls, but otherwise uses the same register-arg passing convention as user-space function calls.

See also "Why does Windows64 use a different calling convention from all other OSes on x86-64?" for more guesswork and info about why Microsoft made the design choices they did with their calling convention.

x86-64 System V makes it more complex to implement variadic functions (more code to index args), but they're generally rare. Most code doesn't bottleneck on sscanf throughput. Shadow space is usually worse than a red-zone. The original Windows x64 convention doesn't pass vector args (__m128) by value, so there's a 2nd 64-bit calling convention on Windows called vectorcall that allows efficient vector args. (Not usually a big deal because most functions that take vector args are inline, but SIMD math library functions would benefit.)

Having more args passed in the low 8 (rax..rdi original registers that don't need a REX prefix), and having more call-clobbered registers that don't need a REX prefix, is probably good for code-size in code that inlines enough to not make a huge number of function calls. You could say that Windows' choice of having more of the non-REX registers be call-preserved is better for code with loops containing function calls, but if you're making lots of function calls to short callees, then they'd benefit from more call-clobbered scratch registers that didn't need REX prefixes. I wonder how much thought MS put into this, or if they just mostly kept things similar to 32-bit calling conventions when choosing which of the low-8 registers would be call-preserved.

One of x86-64 System V's weaknesses is having no call-preserved XMM registers, though. So any function call requires spilling/reloading any FP vars. Having a couple, like the low 128 or 64 bits of xmm6 and xmm7, would probably have been good.

Is reserving stack space necessary for functions less than four arguments?

Your quote is from the "calling convention" part of the documentation. At the very least, you do not have to worry about this if you do not call other functions from your assembly code. If you do, then you must respect, among other things, the "red zone" and stack alignment considerations that the recommendation you quote is intended to ensure.

EDIT: this post clarifies the difference between "red zone" and "shadow space".

System Calls in windows & Native API?

If you're doing assembly programming under Windows you don't do manual syscalls. You use NTDLL and the Native API to do that for you.

The Native API is simply a wrapper around the kernel-mode side of things. All it does is perform a syscall for the correct API.

You should NEVER need to manually syscall so your entire question is redundant.

Linux syscall numbers do not change; Windows's do. That's why you need to work through an extra abstraction layer (aka NTDLL).

EDIT:

Also, even if you're working at the assembly level, you still have full access to the Win32 API, there's no reason to be using the NT API to begin with! Imports, exports, etc all work just fine in assembly programs.

EDIT2:

If you REALLY want to do manual syscalls, you're going to need to reverse NTDLL for each relevant Windows version, add version detection (via the PEB), and perform a syscall lookup for each call.

However, that would be silly. NTDLL is there for a reason.

People have already done the reverse-engineering part: see https://j00ru.vexillium.org/syscalls/nt/64/ for a table of system-call numbers for each Windows kernel. (Note that the later rows do change even between versions of Windows 10.) Again, this is a bad idea outside of personal-use-only experiments on your own machine to learn more about asm and/or Windows internals. Don't inline system calls into code that you distribute to anyone else.

Why Assembly x86_64 syscall parameters are not in alphabetical order like i386

The x86-64 System V ABI was designed to minimize instruction-count (and to some degree code-size) in SPECint as compiled by the version of gcc that was current before the first AMD64 CPUs were sold. See this answer for some history and list-archive links.

Since 5 minutes before I thought all registers were the same but they were used differently because of a convention. Now all things changed for me

x86-64 is not fully orthogonal. Some instructions implicitly use specific registers. e.g. push implicitly uses rsp as the stack pointer, and shl edx, cl is only usable with a shift count in cl (until BMI2 shlx).

More rarely used: widening mul rdi does rdx:rax = rax*rdi. The rep-string instructions implicitly use RDI, RSI, and RCX, although they're often not worth using.
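
You can see these implicit register uses from C by looking at what a compiler emits for two tiny functions. This is just a sketch (function names are made up, and unsigned __int128 is a GCC/Clang extension on 64-bit targets): without BMI2, the variable shift typically needs its count in cl, and the high-half multiply typically becomes a one-operand mul that leaves the result in rdx:rax.

```c
#include <assert.h>
#include <stdint.h>

/* Variable-count shift: typically shl with the count in cl (or shlx with BMI2). */
static uint64_t shl_var(uint64_t x, unsigned count)
{
    assert(count < 64);              /* larger counts are undefined in C */
    return x << count;
}

/* High 64 bits of a 64x64 -> 128-bit product: typically one widening mul. */
static uint64_t mul_hi(uint64_t a, uint64_t b)
{
    unsigned __int128 product = (unsigned __int128)a * b;
    return (uint64_t)(product >> 64);
}
```

Compiling with optimization and reading the asm output is a quick way to check which registers your compiler actually ends up shuffling for these.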

It turns out that choosing the arg-passing registers so that functions that passed their args to memcpy could inline it as rep movs was useful in the metric Jan Hubicka was using, thus rdi and rsi were chosen as the first two args. But leaving rcx unused until the 4th arg turned out to be better, because cl is needed for variable-count shifts. (And most functions don't happen to use their 3rd arg as a shift count.) (Probably older GCC versions inlined memcpy or memset as rep movs more aggressively; it's usually not worth it vs. SIMD for small arrays these days.)

The x86-64 System V ABI uses almost the same calling convention for functions as it does for system calls. This is not a coincidence: it means the implementation for a libc wrapper function like mmap can be:

mmap:
    mov  r10, rcx       ; syscall destroys rcx and r11; 4th arg passed in r10 for syscalls
    mov  eax, __NR_mmap
    syscall

    cmp  rax, -4096
    ja  .set_errno_and_stuff
    ret

This is a tiny advantage, but there's really no reason not to do this. It also saves a few instructions inside the kernel setting up the arg-passing registers before dispatching to the C implementation of the system call in the kernel. (See this answer for a look at some kernel side of system call handling. Mostly about the int 0x80 handler, but I think I mentioned the 64-bit syscall handler and that it dispatches to a table of functions directly from asm.)
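
The cmp rax, -4096 / ja check in the wrapper above is the part that turns a raw kernel return value into the C-level "-1 and errno" convention: Linux returns errors as small negative numbers in the range -4095..-1, and everything else (including high mmap addresses) is a valid result. A minimal C sketch of that .set_errno_and_stuff path, with a made-up function name, might look like this:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: convert a raw Linux syscall return value to libc style. */
static long syscall_ret(unsigned long raw)
{
    /* Same unsigned comparison as `cmp rax, -4096` / `ja`. */
    if (raw > -4096UL) {
        errno = (int)-raw;   /* raw is -errno, so negate it back */
        return -1;
    }
    return (long)raw;        /* success: pass the value through unchanged */
}
```

Doing the comparison as unsigned is what lets a single branch cover the whole error range without misclassifying large valid pointers.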

The syscall instruction itself destroys RCX and R11 (to save user-space RIP and RFLAGS without needing microcode to set up the kernel stack) so the conventions can't be identical unless the user-space convention avoided RCX and R11. But RCX is a handy register whose low half can be used without a REX prefix so that probably would have been worse than leaving it as a call-clobbered pure scratch like R11. Also, the user-space convention uses R10 as a "static chain" pointer for languages with first-class nested functions (not C/C++).

Having the first 4 args able to avoid a REX prefix is probably best for overall code-size, and using RBX or RBP instead of RCX would be weird. Having a couple call-preserved registers that don't need a REX prefix (EBX/EBP) is good.

See "What are the calling conventions for UNIX & Linux system calls on i386 and x86-64" for the function-call and system-call conventions.

The i386 system call convention is the clunky and inconvenient one: ebx is call-preserved, so almost every syscall wrapper needs to save/restore ebx, except for calls with no args like getpid. (And for that you don't even need to enter the kernel, just call into the vDSO: see The Definitive Guide to Linux System Calls (on x86) for more about vDSO and tons of other stuff.)

But the i386 function-calling convention passes all args on the stack, so glibc wrapper functions still need to mov every arg anyway.

Also note that the "natural" order of x86 registers is EAX, ECX, EDX, EBX, according to their numeric codes in machine code, and also the order that pusha / popa use. See Why are first four x86 GPRs named in such unintuitive order?.

Purpose of saving an incoming pthread address on the stack before syscall in MUSL's x86_64 __syscall_cp_asm wrapper?

This is to support pthread cancellation points; a signal handler can later look at the stack.

The commit log for the commit that introduced this code explains that storing a pointer at a known place on the stack before a syscall makes it possible for the "cancellation signal handler" to determine "whether the interrupted code was in a cancellable state." (The initial version of that code also saves the address of the syscall instruction, but later commits changed that.)

The first arg (which that asm function stores on the stack) comes from its C caller, __syscall_cp_c, which passes __syscall_cp_asm(&self->cancel, nr, u, v, w, x, y, z);, where self came from __pthread_self().

You're correct, overwriting the caller's stack arg with a different incoming arg is not "visible" to a C caller following the x86-64 System V ABI. (A callee owns its stack args; the caller has to assume they've been overwritten, so compiler-generated code will never read that memory location as an output.) So we needed to look for alternate explanations.

Using 2 total mov instructions to copy the incoming RDI into 8(%rsp) after reading that memory location is, I think, necessary. We can't delay the mov %rdx,%rdi until after the load because we need to free up RDX to hold R8, to free up R8 to hold the load. You could avoid touching an "extra" register by using R10 before it's used to load the other arg, but it would still take at least 2 instructions.

Or the arg order could be optimized to pass that pointer in a later arg, perhaps passing the call number last and the pthread pointer in the last register arg (minimal shuffling but avoiding need for a double dereference for that test/branch) or the first stack arg (where you want it anyway). Or match the arg order of the __syscall wrapper that takes nr first with no pthread pointer.
