X86_64 Linux Syscall Arguments

Why do x86-64 Linux system calls work with 6 registers set?

System calls accept up to 6 arguments, passed in registers (almost the same registers as the SysV x64 C ABI, with r10 replacing rcx; unlike in the C ABI, the kernel preserves them across the syscall), and "extra" arguments are simply ignored.

Some specific answers to your questions below.

The src/internal/x86_64/syscall.s is just a "thunk" which shifts all the arguments into the right place. That is, it converts from a C-ABI function which takes the syscall number and 6 more arguments, into a "syscall ABI" function with the same 6 arguments and the syscall number in rax. It works "just fine" for any number of arguments - the additional register movement will simply be ignored by the syscall if those arguments aren't used.
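To make the register shuffling concrete, here is a minimal sketch of such a thunk written as GNU C inline asm rather than a separate .s file. The name raw_syscall6 is made up for illustration and is not musl's; it assumes x86-64 Linux and GCC or Clang.

```c
#include <sys/syscall.h>   /* SYS_* syscall numbers */

/* Hypothetical helper, not the real musl code: the syscall number goes
 * in rax and the six arguments in rdi, rsi, rdx, r10, r8, r9. */
long raw_syscall6(long nr, long a1, long a2, long a3,
                  long a4, long a5, long a6)
{
    long ret;
    /* r10, r8 and r9 have no single-letter constraint, so bind them
     * with local register variables. */
    register long r10 __asm__("r10") = a4;
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;

    __asm__ volatile ("syscall"
                      : "=a"(ret)                      /* result comes back in rax */
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3),
                        "r"(r10), "r"(r8), "r"(r9)
                      : "rcx", "r11", "memory");       /* syscall itself overwrites rcx, r11 */
    return ret;
}
```

Calling it as raw_syscall6(SYS_getpid, 11, 22, 33, 44, 55, 66) should behave like a plain getpid, since the kernel never looks at the six extra values.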

Since in the C-ABI all the argument registers are considered scratch (i.e., caller-save), clobbering them is harmless if you assume this __syscall method is called from C. In fact the kernel makes stronger guarantees about clobbered registers, clobbering only rcx and r11 (plus rax for the return value), so assuming the C calling convention is safe but pessimistic. In particular, the code calling __syscall as implemented here will unnecessarily save any argument and scratch registers per the C ABI, despite the kernel's promise to preserve them.

The arch/x86_64/syscall_arch.h file is pretty much the same thing, but in a C header file. Here, you want all seven versions (for zero to six arguments) because modern C compilers will warn or error if you call a function with the wrong number of arguments. So there is no real option to have "one function to rule them all" as in the assembly case. This also has the advantage of doing less work for syscalls that take fewer than 6 arguments.
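As a rough sketch of what the per-arity versions look like, here are three of the seven in GNU C. The sketch_syscallN names are invented for this example (the real header uses its own names and would declare them static inline); only the registers that actually carry arguments are mentioned in each asm statement.

```c
#include <sys/syscall.h>   /* SYS_* syscall numbers */

/* Zero arguments: only rax (the syscall number) is set up. */
long sketch_syscall0(long nr)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr)
                      : "rcx", "r11", "memory");
    return ret;
}

/* One argument: rdi. */
long sketch_syscall1(long nr, long a1)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1)
                      : "rcx", "r11", "memory");
    return ret;
}

/* Three arguments: rdi, rsi, rdx. */
long sketch_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;
}
```

Because each wrapper has a fixed prototype, a call such as sketch_syscall1(SYS_close, fd, 0) is rejected at compile time, which is the point of having one function per argument count.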

Your listed questions, answered:

Why can I pass more parameters than the system call takes?

Because the calling convention is mostly register-based and caller cleanup. You can always pass more arguments in this situation (including in the C ABI) and the other arguments will simply be ignored by the callee. Since the syscall mechanism is generic at the C and .asm level, there is no real way the compiler can ensure you are passing the right number of arguments - you need to pass the right syscall id and the right number of arguments. If you pass fewer, the kernel will see garbage in the missing ones, and if you pass more, the extras will be ignored.
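A small sketch of the "extra arguments are ignored" case, using the variadic syscall() function from libc instead of hand-written asm. The two function names are made up for the example; getpid takes no arguments, so the six extra values are simply loaded into registers the kernel never reads.

```c
#define _GNU_SOURCE
#include <unistd.h>        /* syscall() */
#include <sys/syscall.h>   /* SYS_getpid */

/* getpid with exactly the arguments it takes: none. */
long getpid_plain(void)
{
    return syscall(SYS_getpid);
}

/* getpid with six junk arguments; the kernel ignores all of them. */
long getpid_with_extras(void)
{
    return syscall(SYS_getpid, 1L, 2L, 3L, 4L, 5L, 6L);
}
```

Both functions should return the same process id. The reverse mistake (passing too few arguments to a syscall that needs them) compiles just as happily, but the kernel then reads whatever happened to be in the missing registers.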

Is this reasonable, documented behavior?

Yes, sure - because the whole syscall mechanism is a "generic gate" into the kernel. 99% of the time you aren't going to use that: glibc wraps the vast majority of interesting syscalls in C ABI wrappers with the correct signature so you don't have to worry about it. Those wrappers are how syscall access happens safely.

What am I supposed to set the unused registers to?

You don't set them to anything. If you use the C prototypes in arch/x86_64/syscall_arch.h, the compiler just takes care of it for you (it doesn't set them to anything), and if you are writing your own asm, you don't set them to anything either. After the syscall, assume only rax, rcx and r11 have changed; the kernel preserves the rest.

What will the kernel do with the registers it doesn't use?

It is free to use all the registers it wants, but will adhere to the kernel calling convention which is that on x86-64 all registers other than rax, rcx and r11 are preserved (which is why you see rcx and r11 in the clobber list in the C inline asm).
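The preservation guarantee can be checked from user space. The sketch below (function name invented for the example, GNU C on x86-64 Linux assumed) parks a sentinel in r9, which getpid does not use as an argument, runs the syscall, and reports whether the value is still there afterwards.

```c
#include <sys/syscall.h>   /* SYS_getpid */

/* Returns 1 if the value placed in r9 is unchanged after a getpid
 * syscall, 0 otherwise. */
int r9_survives_syscall(unsigned long sentinel)
{
    register unsigned long r9 __asm__("r9") = sentinel;
    long ret;

    __asm__ volatile ("syscall"
                      : "=a"(ret), "+r"(r9)     /* "+r": tell the compiler r9 may change */
                      : "a"((long)SYS_getpid)
                      : "rcx", "r11", "memory");
    (void)ret;                                  /* the pid itself is not needed here */
    return r9 == sentinel;
}
```

The "+r" constraint stops the compiler from assuming r9 is unchanged, so the comparison really does read the register back after the kernel returns.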

Is the seven function approach faster by virtue of having less instructions?

Yes, but the difference is very small since reg-reg mov instructions usually have zero latency and high throughput (up to 4/cycle) on recent Intel architectures. So moving an extra 6 registers takes perhaps 1.5 cycles, for a syscall that is usually going to take at least 50 cycles even if it does nothing. So the impact is small, but probably measurable (if you measure very carefully!).

What happens to the other registers in those functions?

I'm not sure what you mean exactly, but the kernel can use the other registers like any GP registers, as long as it preserves their values (e.g., by pushing them on the stack and then popping them later).

Example of an x86_64 system call which reads parameters from the stack or from fixed memory locations

I don't know about Windows; maybe it does something different.

Linux only ever uses 6 registers for system-call args, not fixed or user-stack locations. If a system call needs more things, one of the args will be a pointer to a struct (like clone3). I think most other x86-64 OSes that use the x86-64 System V ABI are similar (i.e. all the non-Windows ones).

Linux with sysenter from 32-bit user-space may look at the user-space stack for something, but I think just what it needs to be able to return to user-space, not args per-se.

*BSD and MacOS with 32-bit int 0x80 read args from user stack memory, instead of registers, but for 64-bit code they use the x86-64 System V ABI the way Linux does.

The *BSD int 0x80 convention of reading from user ESP is optimized for libc system-call wrappers: it looks for the first arg at 4(%esp), leaving room for a return address at 0(%esp). So the libc wrapper for most system calls could just be int $0x80 / ret, because i386 System V uses a stack-args calling convention.

Obviously it's possible to make a system-calling convention that isn't exclusively register-based, like *BSD in 32-bit mode. It means extra checking, though, since the kernel can't trust any pointers from user-space, not even RSP. For example, mov rsp, 0xffffff...1230 / syscall could try to trick the kernel into reading args from somewhere in kernel space, with the error return value maybe telling you something about what they were. Or causing an invalid page fault if you pass a bad address (or GPF for a non-canonical address).

So it's less convenient. But of course a kernel needs to be able to sanity-check pointer args to syscalls because many like read do take pointers to user-space memory. Still, having to do that on every system call, even ones that should be simpler, is less good.
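The pointer checking mentioned above is visible from user space: a bad pointer argument makes the syscall fail with EFAULT instead of crashing the process. A minimal sketch, with a made-up helper name, that writes into a pipe so the kernel actually has to read the buffer:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes from buf into a fresh pipe and
 * return 0 on success, or the errno value on failure. */
int write_errno(const void *buf, size_t len)
{
    int fds[2];
    if (pipe(fds) != 0)
        return errno;

    ssize_t n = write(fds[1], buf, len);
    int err = (n < 0) ? errno : 0;   /* capture errno before close() can touch it */

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

Passing a valid buffer such as "ok" should give 0, while passing an address in the kernel half of the address space, for example (const void *)0xffff800000000000UL, should give EFAULT. A pipe is used rather than /dev/null because a sink that never reads the buffer has no reason to reject the pointer.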

Register args also lets hand-written asm set up args for a C function safely without needing to do any address checking. Or in modern Linux, just pass a pointer to the register-save area, with C code deciding how many and what width to load. I guess this makes Spectre and ROP attacks harder by not letting user-space enter the kernel with so many user-controlled values in registers for system calls that don't take 6x 64-bit args.

OTOH, with args all on the user stack, an asm entry point just has to pass the user stack pointer to some C function that does the checking and loading.

What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64

Further reading for any of the topics here: The Definitive Guide to Linux System Calls

I verified these using GNU Assembler (gas) on Linux.

Kernel Interface

x86-32 aka i386 Linux System Call convention:

In x86-32 parameters for Linux system call are passed using registers. %eax for syscall_number. %ebx, %ecx, %edx, %esi, %edi, %ebp are used for passing 6 parameters to system calls.

The return value is in %eax. All other registers (including EFLAGS) are preserved across the int $0x80.

I took following snippet from the Linux Assembly Tutorial but I'm doubtful about this. If any one can show an example, it would be great.

If there are more than six arguments,
%ebx must contain the memory
location where the list of arguments
is stored - but don't worry about this
because it's unlikely that you'll use
a syscall with more than six
arguments.

For an example and a little more reading, refer to http://www.int80h.org/bsdasm/#alternate-calling-convention. Another example of a Hello World for i386 Linux using int 0x80: Hello, world in assembly language with Linux system calls?

There is a faster way to make 32-bit system calls: using sysenter. The kernel maps a page of memory into every process (the vDSO), with the user-space side of the sysenter dance, which has to cooperate with the kernel for it to be able to find the return address. Arg to register mapping is the same as for int $0x80. You should normally call into the vDSO instead of using sysenter directly. (See The Definitive Guide to Linux System Calls for info on linking and calling into the vDSO, and for more info on sysenter, and everything else to do with system calls.)

x86-32 [Free|Open|Net|DragonFly]BSD UNIX System Call convention:

Parameters are passed on the stack. Push the parameters (last parameter pushed first) on to the stack. Then push an additional 32 bits of dummy data (it's not actually dummy data; refer to the following link for more info) and then execute the system call instruction int $0x80.

http://www.int80h.org/bsdasm/#default-calling-convention

x86-64 Linux System Call convention:

(Note: x86-64 Mac OS X is similar but different from Linux. TODO: check what *BSD does)

Refer to section: "A.2 AMD64 Linux Kernel Conventions" of System V Application Binary Interface AMD64 Architecture Processor Supplement. The latest versions of the i386 and x86-64 System V psABIs can be found linked from this page in the ABI maintainer's repo. (See also the x86 tag wiki for up-to-date ABI links and lots of other good stuff about x86 asm.)

Here is the snippet from this section:

User-level applications use as integer registers for passing the
sequence %rdi, %rsi, %rdx, %rcx,
%r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
A system-call is done via the syscall instruction. This clobbers %rcx and %r11 as well as the %rax return value, but other registers are preserved.
The number of the syscall has to be passed in register %rax.
System-calls are limited to six arguments, no argument is passed
directly on the stack.
Returning from the syscall, register %rax contains the result of
the system-call. A value in the range between -4095 and -1 indicates
an error, it is -errno.
Only values of class INTEGER or class MEMORY are passed to the kernel.

Remember this is from the Linux-specific appendix to the ABI, and even for Linux it's informative not normative. (But it is in fact accurate.)
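The -4095..-1 rule is what libc wrappers use to turn a raw kernel return value into the familiar "-1 and errno" result. A sketch under those assumptions, with both function names invented for the example:

```c
#include <errno.h>
#include <sys/syscall.h>   /* SYS_* syscall numbers */

/* Raw one-argument syscall: returns exactly what the kernel left in rax. */
long raw_syscall1(long nr, long a1)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1)
                      : "rcx", "r11", "memory");
    return ret;
}

/* Values in [-4095, -1] mean -errno; anything else is a normal result. */
long decode_syscall_result(long raw)
{
    if ((unsigned long)raw >= (unsigned long)-4095L) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

For instance, raw_syscall1(SYS_close, -1) should return -EBADF, and decode_syscall_result of that value returns -1 with errno set to EBADF. A value such as -4096 falls outside the error range and is passed through unchanged, which matters for syscalls like mmap that can return large addresses.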

The 32-bit int $0x80 ABI is usable in 64-bit code (but strongly discouraged); see What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? The kernel still truncates its inputs to 32 bits, so it's unsuitable for pointers, and it zeros r8-r11.

User Interface: function calling

x86-32 Function Calling convention:

In x86-32 parameters were passed on stack. Last parameter was pushed first on to the stack until all parameters are done and then call instruction was executed. This is used for calling C library (libc) functions on Linux from assembly.

Modern versions of the i386 System V ABI (used on Linux) require 16-byte alignment of %esp before a call, like the x86-64 System V ABI has always required. Callees are allowed to assume that and use SSE 16-byte loads/stores that fault on unaligned. But historically, Linux only required 4-byte stack alignment, so it took extra work to reserve naturally-aligned space even for an 8-byte double or something.

Some other modern 32-bit systems still don't require more than 4 byte stack alignment.

x86-64 System V user-space Function Calling convention:

x86-64 System V passes args in registers, which is more efficient than i386 System V's stack args convention. It avoids the latency and extra instructions of storing args to memory (cache) and then loading them back again in the callee. This works well because there are more registers available, and is better for modern high-performance CPUs where latency and out-of-order execution matter. (The i386 ABI is very old).

In this new mechanism: First the parameters are divided into classes. The class of each parameter determines the manner in which it is passed to the called function.

For complete information refer to : "3.2 Function Calling Sequence" of System V Application Binary Interface AMD64 Architecture Processor Supplement which reads, in part:

Once arguments are classified, the registers get assigned (in
left-to-right order) for passing as follows:
If the class is MEMORY, pass the argument on the stack.
If the class is INTEGER, the next available register of the
sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used

So %rdi, %rsi, %rdx, %rcx, %r8 and %r9 are the registers in order used to pass integer/pointer (i.e. INTEGER class) parameters to any libc function from assembly. %rdi is used for the first INTEGER parameter. %rsi for 2nd, %rdx for 3rd and so on. Then call instruction should be given. The stack (%rsp) must be 16B-aligned when call executes.

If there are more than 6 INTEGER parameters, the 7th INTEGER parameter and later are passed on the stack. (Caller pops, same as x86-32.)

The first 8 floating point args are passed in %xmm0-7, later on the stack. There are no call-preserved vector registers. (A function with a mix of FP and integer arguments can have more than 8 total register arguments.)

Variadic functions (like printf) always need %al = the number of FP register args.

There are rules for when to pack structs into registers (rdx:rax on return) vs. in memory. See the ABI for details, and check compiler output to make sure your code agrees with compilers about how something should be passed/returned.

Note that the Windows x64 function calling convention has multiple significant differences from x86-64 System V, like shadow space that must be reserved by the caller (instead of a red-zone), and call-preserved xmm6-xmm15. And very different rules for which arg goes in which register.

Difference in ABI between x86_64 Linux functions and syscalls

The syscall instruction is intended to provide a quicker method of entering Ring-0 in order to carry out a system call. This is meant to be an improvement over the old method, which was to raise a software interrupt (int 0x80 on Linux).

Part of the reason the instruction is faster is because it does not change memory, or even change rsp to point at a kernel stack. Unlike a software interrupt, where the CPU is forced to allow the OS to resume operation without clobbering anything, for this command the CPU is allowed to assume the software is aware that something is happening here.

In particular, syscall stores two parts of the user-space state in registers. The RIP to return to after the call is stored in rcx, and the flags are stored in R11 (because RFLAGS is masked with a kernel-supplied value before entry to the kernel). This means that both those registers are clobbered by the instruction.

Since they are clobbered, the syscall ABI uses another register instead of rcx, hence the use of r10 for the 4th argument.

r10 is a natural choice, since in the x86-64 SystemV ABI it's not used for passing function args, and functions don't need to preserve their caller's value of r10. So a syscall wrapper function can mov %rcx, %r10 without any save/restore. This wouldn't be possible with any other register, for 6-arg syscalls and the SysV ABI's function calling convention.

BTW, the 32-bit system call ABI is also accessible with sysenter, which requires cooperation between user-space and kernel-space to allow returning to user-space after a sysenter. (i.e. storing some state in user-space before running sysenter). This is higher performance than int 0x80, but awkward. Still, glibc uses it (by jumping to user-space code in the vdso pages that the kernel maps into the address space of every process).

AMD's syscall is another approach to the same idea as Intel's sysenter: to make entry/exit from the kernel less expensive by not preserving absolutely everything.

Why does x86_64 assembly have odd syscall argument order?

The normal userspace order as per the x86-64 ABI is: rdi, rsi, rdx, rcx, r8, then r9. That is not much more logical, beats me how they have come up with that.

Since the syscall instruction clobbers rcx, that had to be substituted and r10 has been chosen for that. This is at least somewhat logical :)

Why Assembly x86_64 syscall parameters are not in alphabetical order like i386

The x86-64 System V ABI was designed to minimize instruction-count (and to some degree code-size) in SPECint as compiled by the version of gcc that was current before the first AMD64 CPUs were sold. See this answer for some history and list-archive links.

Since 5 minutes before I thought all registers were the same but they were used differently because of a convention. Now all things changed for me

x86-64 is not fully orthogonal. Some instructions implicitly use specific registers. e.g. push implicitly uses rsp as the stack pointer, shl edx, cl is only usable with a shift count in cl (until BMI2 shlx).

More rarely used: widening mul rdi does rdx:rax = rax*rdi. The rep-string instructions implicitly use RDI, RSI, and RCX, although they're often not worth using.
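The implicit operands of widening mul show up directly in the constraints of an inline-asm wrapper: one input must be in rax, and the result comes back in rdx:rax whether you want it there or not. A sketch in GNU C (the type and function names are made up for the example):

```c
#include <stdint.h>

typedef struct {
    uint64_t lo;   /* low 64 bits of the product, from rax */
    uint64_t hi;   /* high 64 bits of the product, from rdx */
} wide_product;

/* One-operand mul: rdx:rax = rax * operand. Only the explicit operand
 * (%3 here) can live in a register of the compiler's choosing. */
wide_product mul_wide(uint64_t a, uint64_t b)
{
    wide_product p;
    __asm__ ("mulq %3"
             : "=a"(p.lo), "=d"(p.hi)
             : "a"(a), "r"(b)
             : "cc");               /* mul modifies the flags */
    return p;
}
```

In ordinary code you would just write (unsigned __int128)a * b and let the compiler pick the instruction; the asm version is only here to make the fixed registers visible.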

It turns out that choosing the arg-passing registers so that functions that passed their args to memcpy could inline it as rep movs was useful in the metric Jan Hubicka was using, thus rdi and rsi were chosen as the first two args. But leaving rcx unused until the 4th arg turned out to be better, because cl is needed for variable-count shifts. (And most functions don't happen to use their 3rd arg as a shift count.) (Probably older GCC versions inlined memcpy or memset as rep movs more aggressively; it's usually not worth it vs. SIMD for small arrays these days.)

The x86-64 System V ABI uses almost the same calling convention for functions as it does for system calls. This is not a coincidence: it means the implementation for a libc wrapper function like mmap can be:

mmap:
    mov  r10, rcx       ; syscall destroys rcx and r11; 4th arg passed in r10 for syscalls
    mov  eax, __NR_mmap
    syscall

    cmp  rax, -4096
    ja  .set_errno_and_stuff
    ret

This is a tiny advantage, but there's really no reason not to do this. It also saves a few instructions inside the kernel setting up the arg-passing registers before dispatching to the C implementation of the system call in the kernel. (See this answer for a look at some kernel side of system call handling. Mostly about the int 0x80 handler, but I think I mentioned the 64-bit syscall handler and that it dispatches to a table of functions directly from asm.)

The syscall instruction itself destroys RCX and R11 (to save user-space RIP and RFLAGS without needing microcode to set up the kernel stack) so the conventions can't be identical unless the user-space convention avoided RCX and R11. But RCX is a handy register whose low half can be used without a REX prefix so that probably would have been worse than leaving it as a call-clobbered pure scratch like R11. Also, the user-space convention uses R10 as a "static chain" pointer for languages with first-class nested functions (not C/C++).

Having the first 4 args able to avoid a REX prefix is probably best for overall code-size, and using RBX or RBP instead of RCX would be weird. Having a couple call-preserved registers that don't need a REX prefix (EBX/EBP) is good.

See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 for the function-call and system-call conventions.

The i386 system call convention is the clunky and inconvenient one: ebx is call-preserved, so almost every syscall wrapper needs to save/restore ebx, except for calls with no args like getpid. (And for that you don't even need to enter the kernel, just call into the vDSO: see The Definitive Guide to Linux System Calls (on x86) for more about vDSO and tons of other stuff.)

But the i386 function-calling convention passes all args on the stack, so glibc wrapper functions still need to mov every arg anyway.

Also note that the "natural" order of x86 registers is EAX, ECX, EDX, EBX, according to their numeric codes in machine code, and also the order that pusha / popa use. See Why are first four x86 GPRs named in such unintuitive order?.

Linux syscall documentation

With regards to syscalls, I found the x86-64 Linux system calls on google - it even has what registers to put the arguments in (you'll notice they're exactly as described above). Here you go: https://blog.rchapman.org/posts/Linux_System_Call_Table_for_x86_64/

As @fuz pointed out, the calling convention for syscalls (all syscalls that is) is not the same as regular function calls. Essentially, arguments get loaded into registers rdi, rsi, rdx, r10, r8 and r9 in that specific order.

Win64 and Linux-x86_64 Calling Convention Unused registers modified or not

A register's status as call-preserved or call-clobbered never depends on the number of args actually passed by the caller and/or expected by the callee, in any calling convention for any ISA I've looked at, and certainly not any of the standard ones on x86.

But yes the calling conventions for raw system calls are different from those for functions, even for presumably thin wrapper functions.

All standard user-space function calling conventions have all the arg-passing registers (and stack slots) as call-clobbered. So if your asm uses call, that's what you need to expect.

The system-calling conventions on mainstream OSes preserve all registers (except the return value). (But on x86-64, only after syscall itself overwrites RCX and R11, because that happens before the kernel gets control.) If you directly use syscall or int 0x80 or whatever, that's what you should expect.

Note that Windows does not have a stable system-call ABI across kernel versions and doesn't document the raw system calls, so in normal Windows code you're always making DLL function calls, never raw system calls. People have reverse-engineered the system calls for different Windows versions, though.

MacOS also doesn't officially have a stable/documented syscall ABI, but in practice Darwin basically does, at least for the normal POSIX open/read/write/close/exit calls that toy programs use.

What registers are preserved through a linux x86-64 function call
What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64
https://packagecloud.io/blog/the-definitive-guide-to-linux-system-calls/
Where is the x86-64 System V ABI documented?
Windows system calls

How to pass parameters to Linux system call?

You need to tell the build system that your system call requires 2 arguments and that they are of type int. This is so that the scripts that are part of the build system will generate appropriate wrappers for casting the arguments into the type you require. Instead of defining the actual handler like you did, you should use -

SYSCALL_DEFINE2(my_syscall_2, int, a, int, b) // Yes, there is a comma between the types and the argument names
{
    printk("my_syscall_2 : %d, %d\n", a, b);
    return b;
}

SYSCALL_DEFINEx is defined in linux/include/linux/syscalls.h.

You can look at an example in linux/fs/read_write.c
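On the user-space side, a syscall defined this way has no libc wrapper, so it is normally invoked through the generic syscall() function. A minimal sketch: the helper name is made up, and nr stands for whatever number the new call was given in the syscall table, which is not stated here.

```c
#define _GNU_SOURCE
#include <unistd.h>        /* syscall() */
#include <sys/syscall.h>   /* SYS_* numbers for existing syscalls */

/* Hypothetical helper: invoke syscall number nr with two int arguments.
 * For the handler above, nr would be the number assigned to my_syscall_2
 * in the syscall table of the modified kernel. */
long call_two_int_syscall(long nr, int a, int b)
{
    return syscall(nr, a, b);
}
```

The helper can be tried out on a stock kernel with an existing two-int syscall: call_two_int_syscall(SYS_kill, getpid(), 0) sends signal 0 to the current process, which only checks that the process exists, and should return 0.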