Intel X86 VS X64 System Call

Intel x86 vs x64 system call


General part

EDIT: Linux irrelevant parts removed

While not totally wrong, narrowing down to int 0x80 and syscall oversimplifies the question as with sysenter there is at least a 3rd option.

Using 0x80 and eax for syscall number, ebx, ecx, edx, esi, edi, and ebp to pass parameters is just one of many possible other choices to implement a system call, but those registers are the ones the 32-bit Linux ABI chose.

Before taking a closer look at the techniques involved, it should be stated that they all circle around the problem of escaping the privilege prison every process runs in.

Another choice to the ones presented here offered by the x86 architecture would have been the use of a call gate (see: http://en.wikipedia.org/wiki/Call_gate)

The only other possibility present on all i386 machines is using a software interrupt, which allows the ISR (Interrupt Service Routine or simply an interrupt handler) to run at a different privilege level than before.

(Fun fact: some i386 OSes have used an invalid-instruction exception to enter the kernel for system calls, because that was actually faster than an int instruction on 386 CPUs. See OsDev syscall/sysret and sysenter/sysexit instructions enabling for a summary of possible system-call mechanisms.)

Software Interrupt

What exactly happens once an interrupt is triggered depends on whether switching to the ISR requires a privilege change or not:

(Intel® 64 and IA-32 Architectures Software Developer’s Manual)

6.4.1 Call and Return Operation for Interrupt or Exception Handling Procedures

...

If the code segment for the handler procedure has the
same privilege level as the currently executing program or task, the
handler procedure uses the current stack; if the handler executes at a
more privileged level, the processor switches to the stack for the
handler’s privilege level.

....

If a stack switch does occur, the processor does the following:

  1. Temporarily saves (internally) the current contents of the SS, ESP, EFLAGS, CS, and > EIP registers.

  2. Loads the segment selector and stack pointer for the new stack (that is, the stack for the privilege level being
    called) from the TSS into the SS and ESP registers and switches to the new stack.

  3. Pushes the temporarily saved SS, ESP, EFLAGS, CS, and EIP values for the interrupted procedure’s stack onto
    the new stack.

  4. Pushes an error code on the new stack (if appropriate).

  5. Loads the segment selector for the new code segment and the new instruction pointer (from the interrupt gate
    or trap gate) into the CS and EIP registers, respectively.

  6. If the call is through an interrupt gate, clears the IF flag in the EFLAGS register.

  7. Begins execution of the handler procedure at the new privilege level.

... sigh this seems to be a lot to do and even once we're done it doesn't get too much better:

(excerpt taken from the same source as mentioned above: Intel® 64 and IA-32 Architectures Software Developer’s Manual)

When executing a return from an interrupt or exception handler from a
different privilege level than the interrupted procedure, the
processor performs these actions:

  1. Performs a privilege check.

  2. Restores the CS and EIP registers to their values prior to the interrupt or exception.

  3. Restores the EFLAGS register.

  4. Restores the SS and ESP registers to their values prior to the interrupt or exception, resulting in a stack switch back to the
    stack of the interrupted procedure.

  5. Resumes execution of the interrupted procedure.

Sysenter

Another option on the 32-bit platform not mentioned in your question at all, but nevertheless utilized by the Linux kernel is the sysenter instruction.

(Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z)

Description Executes a fast call to a level 0 system procedure or
routine. SYSENTER is a companion instruction to SYSEXIT. The
instruction is optimized to provide the maximum performance for system
calls from user code running at privilege level 3 to operating system
or executive procedures running at privilege level 0.

One disadvantage of using this solution is, that it is not present on all 32-bit machines, so the int 0x80 method still has to be provided in case the CPU doesn't know about it.

The SYSENTER and SYSEXIT instructions were introduced into the IA-32
architecture in the Pentium II processor. The availability of these
instructions on a processor is indicated with the SYSENTER/SYSEXIT
present (SEP) feature flag returned to the EDX register by the CPUID
instruction. An operating system that qualifies the SEP flag must also
qualify the processor family and model to ensure that the
SYSENTER/SYSEXIT instructions are actually present

Syscall

The last possibility, the syscall instruction, pretty much allows for the same functionality as the sysenter instruction. The existence of both is due to the fact that one (systenter) was introduced by Intel while the other (syscall) was introduced by AMD.

Linux specific

In the Linux kernel any of the three possibilities mentioned above may be chosen to realize a system call.

See also The Definitive Guide to Linux System Calls.

As already stated above, the int 0x80 method is the only one of the 3 chosen implementations, that can run on any i386 CPU so this is the only one that is always available for 32-bit user-space.

(syscall is the only one that's always available for 64-bit user-space, and the only one you should ever use in 64-bit code; x86-64 kernels can be built without CONFIG_IA32_EMULATION, and int 0x80 still invokes the 32-bit ABI which truncates pointers to 32-bit.)

To allow to switch between all 3 choices every process run is given access to a special shared object that gives access to the system call implementation chosen for the running system. This is the strange looking linux-gate.so.1 you already might have encountered as unresolved library when using ldd or the like.

(arch/x86/vdso/vdso32-setup.c)

 if (vdso32_syscall()) {                                                                               
vsyscall = &vdso32_syscall_start;
vsyscall_len = &vdso32_syscall_end - &vdso32_syscall_start;
} else if (vdso32_sysenter()){
vsyscall = &vdso32_sysenter_start;
vsyscall_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
} else {
vsyscall = &vdso32_int80_start;
vsyscall_len = &vdso32_int80_end - &vdso32_int80_start;
}

To utilize it all you have to do is load all your registers system call number in eax, parameters in ebx, ecx, edx, esi, edi as with int 0x80 system call implementation and call the main routine.

Unfortunately it is not all that easy; as to minimize the security risk of a fixed predefined address, the location at which the vdso (virtual dynamic shared object) will be visible in a process is randomized, so you will have to figure out the correct location first.

This address is individual to each process and is passed to the process once it is started.

In case you didn't know, when started in Linux, every process gets pointers to the parameters passed once it was started and pointers to a description of the environment variables it is running under passed on its stack - each of them terminated by NULL.

Additionally to these a third block of so called elf-auxiliary-vectors gets passed following the ones mentioned before. The correct location is encoded in one of these carrying the type-identifier AT_SYSINFO.

So stack layout looks like this (addresses grow downwards):

  • parameter-0
  • ...
  • parameter-m
  • NULL
  • environment-0
  • ....
  • environment-n
  • NULL
  • ...
  • auxilliary elf vector: AT_SYSINFO
  • ...
  • auxilliary elf vector: AT_NULL

Usage example

To find the correct address you will have to first skip all arguments and all environment pointers and then start scanning for AT_SYSINFO as shown in the example below:

#include <stdio.h>
#include <elf.h>

void putc_1 (char c) {
__asm__ ("movl $0x04, %%eax\n"
"movl $0x01, %%ebx\n"
"movl $0x01, %%edx\n"
"int $0x80"
:: "c" (&c)
: "eax", "ebx", "edx");
}

void putc_2 (char c, void *addr) {
__asm__ ("movl $0x04, %%eax\n"
"movl $0x01, %%ebx\n"
"movl $0x01, %%edx\n"
"call *%%esi"
:: "c" (&c), "S" (addr)
: "eax", "ebx", "edx");
}


int main (int argc, char *argv[]) {

/* using int 0x80 */
putc_1 ('1');


/* rather nasty search for jump address */
argv += argc + 1; /* skip args */
while (*argv != NULL) /* skip env */
++argv;

Elf32_auxv_t *aux = (Elf32_auxv_t*) ++argv; /* aux vector start */

while (aux->a_type != AT_SYSINFO) {
if (aux->a_type == AT_NULL)
return 1;
++aux;
}

putc_2 ('2', (void*) aux->a_un.a_val);

return 0;
}

As you will see by taking a look at the following snippet of /usr/include/asm/unistd_32.h on my system:

#define __NR_restart_syscall 0
#define __NR_exit 1
#define __NR_fork 2
#define __NR_read 3
#define __NR_write 4
#define __NR_open 5
#define __NR_close 6

The syscall I used is the one numbered 4 (write) as passed in the eax register.
Taking filedescriptor (ebx = 1), data-pointer (ecx = &c) and size (edx = 1) as its arguments, each passed in the corresponding register.

To put a long story short

Comparing a supposedly slow running int 0x80 system call on any Intel CPU with a (hopefully) much faster implementation using the (genuinely invented by AMD) syscall instruction is comparing apples to oranges.

IMHO: Most probably the sysenter instruction instead of int 0x80 should be to the test here.

x86_64 Assembly Linux System Call Confusion

You're running into one surprising difference between i386 and x86_64: they don't use the same system call mechanism. The correct code is:

movq $60, %rax
movq $2, %rdi ; not %rbx!
syscall

Interrupt 0x80 always invokes 32-bit system calls. It's used to allow 32-bit applications to run on 64-bit systems.

For the purposes of learning, you should probably try to follow the tutorial exactly, rather than translating on the fly to 64-bit -- there are a few other significant behavioral differences that you're likely to run into. Once you're familiar with i386, then you can pick up x86_64 separately.

On x64 Linux, what is the difference between syscall, int 0x80 and ret to exit a program?

If you use printf or other libc functions, it's best to ret from main or call exit. (Which are equivalent; main's caller will call the libc exit function.)

If not, if you were only making other raw system calls like write with syscall, it's also appropriate and consistent to exit that way, but either way, or call exit are 100% fine in main.

If you want to work without libc at all, e.g. put your code under _start: instead of main: and link with ld or gcc -static -nostdlib, then you can't use ret. Use mov eax, 231 (__NR_exit_group) / syscall.

main is a real & normal function like any other (called with a valid return address), but _start (the process entry point) isn't. On entry to _start, the stack holds argc and argv, so trying to ret would set RIP=argc, and then code-fetch would segfault on that unmapped address. Nasm segmentation fault on RET in _start



System call vs. ret-from-main

Exiting via a system call is like calling _exit() in C - skip atexit() and libc cleanup, notably not flushing any buffered stdout output (line buffered on a terminal, full-buffered otherwise).
This leads to symptoms such as Using printf in assembly leads to empty output when piping, but works on the terminal (or if your output doesn't end with \n, even on a terminal.)

main is a function, called (indirectly) from CRT startup code. (Assuming you link your program normally, like you would a C program.) Your hand-written main works exactly like a compiler-generate C main function would. Its caller (__libc_start_main) really does do something like int result = main(argc, argv); exit(result);,

e.g. call rax (pointer passed by _start) / mov edi, eax / call exit.

So returning from main is exactly1 like calling exit.

  • Syscall implementation of exit() for a comparison of the relevant C functions, exit vs. _exit vs. exit_group and the underlying asm system calls.

  • C question: What is the difference between exit and return? is primarily about exit() vs. return, although there is mention of calling _exit() directly, i.e. just making a system call. It's applicable because C main compiles to an asm main just like you'd write by hand.

Footnote 1: You can invent a hypothetical intentionally weird case where it's different. e.g. you used stack space in main as your stdio buffer with sub rsp, 1024 / mov rsi, rsp / ... / call setvbuf. Then returning from main would involve putting RSP above that buffer, and __libc_start_main's call to exit could overwrite some of that buffer with return addresses and locals before execution reached the fflush cleanup. This mistake is more obvious in asm than C because you need leave or mov rsp, rbp or add rsp, 1024 or something to point RSP at your return address.

In C++, return from main runs destructors for its locals (before global/static exit stuff), exit doesn't. But that just means the compiler makes asm that does more stuff before actually running the ret, so it's all manual in asm, like in C.

The other difference is of course the asm / calling-convention details: exit status in EAX (return value) or EDI (first arg), and of course to ret you have to have RSP pointing at your return address, like it was on function entry. With call exit you don't, and you can even do a conditional tailcall of exit like jne exit. Since it's a noreturn function, you don't really need RSP pointing at a valid return address. (RSP should be aligned by 16 before a call, though, or RSP%16 = 8 before a tailcall, matching the alignment after call pushes a return address. It's unlikely that exit / fflush cleanup will do any alignment-required stores/loads to the stack, but it's a good habit to get this right.)

(This whole footnote is about ret vs. call exit, not syscall, so it's a bit of a tangent from the rest of the answer. You can also run syscall without caring where the stack-pointer points.)



SYS_exit vs. SYS_exit_group raw system calls

The raw SYS_exit system call is for exiting the current thread, like pthread_exit().

(eax=60 / syscall, or eax=1 / int 0x80).

SYS_exit_group is for exiting the whole program, like _exit.

(eax=231 / syscall, or eax=252 / int 0x80).

In a single-threaded program you can use either, but conceptually exit_group makes more sense to me if you're going to use raw system calls. glibc's _exit() wrapper function actually uses the exit_group system call (since glibc 2.3). See Syscall implementation of exit() for more details.

However, nearly all the hand-written asm you'll ever see uses SYS_exit1. It's not "wrong", and SYS_exit is perfectly acceptable for a program that didn't start more threads. Especially if you're trying to save code size with xor eax,eax / inc eax (3 bytes in 32-bit mode) or push 60 / pop rax (3 bytes in 64-bit mode), while push 231/pop rax would be even larger than mov eax,231 because it doesn't fit in a signed imm8.

Note 1: (Usually actually hard-coding the number, not using __NR_... constants from asm/unistd.h or their SYS_... names from sys/syscall.h)

And historically, it's all there was. Note that in unistd_32.h, __NR_exit has call number 1, but __NR_exit_group = 252 wasn't added until years later when the kernel gained support for tasks that share virtual address space with their parent, aka threads started by clone(2). This is when SYS_exit conceptually became "exit current thread". (But one could easily and convincingly argue that in a single-threaded program, SYS_exit does still mean exit the whole program, because it only differs from exit_group if there are multiple threads.)

To be honest, I've never used eax=252 / int 0x80 in anything, only ever eax=1. It's only in 64-bit code where I often use mov eax,231 instead of mov eax,60 because neither number is "simple" or memorable the way 1 is, so might as well be a cool guy and use the "modern" exit_group way in my single-threaded toy program / experiment / microbenchmark / SO answer. :P (If I didn't enjoy tilting at windmills, I wouldn't spend so much time on assembly, especially on SO.)

And BTW, I usually use NASM for one-off experiments so it's inconvenient to use pre-defined symbolic constants for call numbers; with GCC to preprocess a .S before running GAS you can make your code self-documenting with #include <sys/syscall.h> so you can use mov $SYS_exit_group, %eax (or $__NR_exit_group), or mov eax, __NR_exit_group with .intel_syntax noprefix.



Don't use the 32-bit int 0x80 ABI in 64-bit code:

What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? explains what happens if you use the COMPAT_IA32_EMULATION int 0x80 ABI in 64-bit code.

It's totally fine for just exiting, as long as your kernel has that support compiled in, otherwise it will segfault just like any other random int number like int 0x7f. (e.g. on WSL1, or people that built custom kernels and disabled that support.)

But the only reason you'd do it that way in asm would be so you could build the same source file with nasm -felf32 or nasm -felf64. (You can't use syscall in 32-bit code, except on some AMD CPUs which have a 32-bit version of syscall. And the 32-bit ABI uses different call numbers anyway so this wouldn't let the same source be useful for both modes.)


Related:

  • Why am I allowed to exit main using ret? (CRT startup code calls main, you're not returning directly to the kernel.)
  • Nasm segmentation fault on RET in _start - you can't ret from _start
  • Using printf in assembly leads to empty output when piping, but works on the terminal stdout buffer (not) flushing with raw system call exit
  • Syscall implementation of exit() call exit vs. mov eax,60/syscall (_exit) vs. mov eax,231/syscall (exit_group).
  • Can't call C standard library function on 64-bit Linux from assembly (yasm) code - modern Linux distros config GCC in a way that call exit or call puts won't link with nasm -felf64 foo.asm && gcc foo.o.
  • Is main() really start of a C++ program? - Ciro's answer is a deep dive into how glibc + its CRT startup code actually call main (including x86-64 asm disassembly in GDB), and shows the glibc source code for __libc_start_main.
  • Linux x86 Program Start Up
    or - How the heck do we get to main()? 32-bit asm, and more detail than you'll probably want until you're a lot more comfortable with asm, but if you've ever wondered why CRT runs so much code before getting to main, that covers what's happening at a level that's a couple steps up from using GDB with starti (stop at the process entry point, e.g. in the dynamic linker's _start) and stepi until you get to your own _start or main.
  • https://stackoverflow.com/tags/x86/info lots of good links about this and everything else.

Difference in ABI between x86_64 Linux functions and syscalls

The syscall instruction is intended to provide a quicker method of entering Ring-0 in order to carry out a system call. This is meant to be an improvement over the old method, which was to raise a software interrupt (int 0x80 on Linux).

Part of the reason the instruction is faster is because it does not change memory, or even change rsp to point at a kernel stack. Unlike a software interrupt, where the CPU is forced to allow the OS to resume operation without clobbering anything, for this command the CPU is allowed to assume the software is aware that something is happening here.

In particular, syscall stores two parts of the user-space state in registers. The RIP to return to after the call is stored in rcx, and the flags are stored in R11 (because RFLAGS is masked with a kernel-supplied value before entry to the kernel). This means that both those registers are clobbered by the instruction.

Since they are clobbered, the syscall ABI uses another register instead of rcx, hence the use of r10 for the 4th argument.

r10 is a natural choice, since in the x86-64 SystemV ABI it's not used for passing function args, and functions don't need to preserve their caller's value of r10. So a syscall wrapper function can mov %rcx, %r10 without any save/restore. This wouldn't be possible with any other register, for 6-arg syscalls and the SysV ABI's function calling convention.


BTW, the 32-bit system call ABI is also accessible with sysenter, which requires cooperation between user-space and kernel-space to allow returning to user-space after a sysenter. (i.e. storing some state in user-space before running sysenter). This is higher performance than int 0x80, but awkward. Still, glibc uses it (by jumping to user-space code in the vdso pages that the kernel maps into the address space of every process).

AMD's syscall is another approach to the same idea as Intel's sysenter: to make entry/exit from the kernel less expensive by not preserving absolutely everything.

What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64

Further reading for any of the topics here: The Definitive Guide to Linux System Calls


I verified these using GNU Assembler (gas) on Linux.

Kernel Interface

x86-32 aka i386 Linux System Call convention:

In x86-32 parameters for Linux system call are passed using registers. %eax for syscall_number. %ebx, %ecx, %edx, %esi, %edi, %ebp are used for passing 6 parameters to system calls.

The return value is in %eax. All other registers (including EFLAGS) are preserved across the int $0x80.

I took following snippet from the Linux Assembly Tutorial but I'm doubtful about this. If any one can show an example, it would be great.

If there are more than six arguments,
%ebx must contain the memory
location where the list of arguments
is stored - but don't worry about this
because it's unlikely that you'll use
a syscall with more than six
arguments.

For an example and a little more reading, refer to http://www.int80h.org/bsdasm/#alternate-calling-convention. Another example of a Hello World for i386 Linux using int 0x80: Hello, world in assembly language with Linux system calls?

There is a faster way to make 32-bit system calls: using sysenter. The kernel maps a page of memory into every process (the vDSO), with the user-space side of the sysenter dance, which has to cooperate with the kernel for it to be able to find the return address. Arg to register mapping is the same as for int $0x80. You should normally call into the vDSO instead of using sysenter directly. (See The Definitive Guide to Linux System Calls for info on linking and calling into the vDSO, and for more info on sysenter, and everything else to do with system calls.)

x86-32 [Free|Open|Net|DragonFly]BSD UNIX System Call convention:

Parameters are passed on the stack. Push the parameters (last parameter pushed first) on to the stack. Then push an additional 32-bit of dummy data (Its not actually dummy data. refer to following link for more info) and then give a system call instruction int $0x80

http://www.int80h.org/bsdasm/#default-calling-convention



x86-64 Linux System Call convention:

(Note: x86-64 Mac OS X is similar but different from Linux. TODO: check what *BSD does)

Refer to section: "A.2 AMD64 Linux Kernel Conventions" of System V Application Binary Interface AMD64 Architecture Processor Supplement. The latest versions of the i386 and x86-64 System V psABIs can be found linked from this page in the ABI maintainer's repo. (See also the x86 tag wiki for up-to-date ABI links and lots of other good stuff about x86 asm.)

Here is the snippet from this section:

  1. User-level applications use as integer registers for passing the
    sequence %rdi, %rsi, %rdx, %rcx,
    %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
  2. A system-call is done via the syscall instruction. This clobbers %rcx and %r11 as well as the %rax return value, but other registers are preserved.
  3. The number of the syscall has to be passed in register %rax.
  4. System-calls are limited to six arguments, no argument is passed
    directly on the stack.
  5. Returning from the syscall, register %rax contains the result of
    the system-call. A value in the range between -4095 and -1 indicates
    an error, it is -errno.
  6. Only values of class INTEGER or class MEMORY are passed to the kernel.

Remember this is from the Linux-specific appendix to the ABI, and even for Linux it's informative not normative. (But it is in fact accurate.)

This 32-bit int $0x80 ABI is usable in 64-bit code (but highly not recommended). What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? It still truncates its inputs to 32-bit, so it's unsuitable for pointers, and it zeros r8-r11.

User Interface: function calling

x86-32 Function Calling convention:

In x86-32 parameters were passed on stack. Last parameter was pushed first on to the stack until all parameters are done and then call instruction was executed. This is used for calling C library (libc) functions on Linux from assembly.

Modern versions of the i386 System V ABI (used on Linux) require 16-byte alignment of %esp before a call, like the x86-64 System V ABI has always required. Callees are allowed to assume that and use SSE 16-byte loads/stores that fault on unaligned. But historically, Linux only required 4-byte stack alignment, so it took extra work to reserve naturally-aligned space even for an 8-byte double or something.

Some other modern 32-bit systems still don't require more than 4 byte stack alignment.



x86-64 System V user-space Function Calling convention:

x86-64 System V passes args in registers, which is more efficient than i386 System V's stack args convention. It avoids the latency and extra instructions of storing args to memory (cache) and then loading them back again in the callee. This works well because there are more registers available, and is better for modern high-performance CPUs where latency and out-of-order execution matter. (The i386 ABI is very old).

In this new mechanism: First the parameters are divided into classes. The class of each parameter determines the manner in which it is passed to the called function.

For complete information refer to : "3.2 Function Calling Sequence" of System V Application Binary Interface AMD64 Architecture Processor Supplement which reads, in part:

Once arguments are classified, the registers get assigned (in
left-to-right order) for passing as follows:

  1. If the class is MEMORY, pass the argument on the stack.
  2. If the class is INTEGER, the next available register of the
    sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used

So %rdi, %rsi, %rdx, %rcx, %r8 and %r9 are the registers in order used to pass integer/pointer (i.e. INTEGER class) parameters to any libc function from assembly. %rdi is used for the first INTEGER parameter. %rsi for 2nd, %rdx for 3rd and so on. Then call instruction should be given. The stack (%rsp) must be 16B-aligned when call executes.



Related Topics



Leave a reply



Submit