How Syscall Knows Where to Jump

How does syscall know where to jump?

syscall

The syscall instruction is really just an Intel/AMD CPU instruction. Here is the synopsis:

IF (CS.L ≠ 1) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1)
    (* Not in 64-Bit Mode or SYSCALL/SYSRET not enabled in IA32_EFER *)
    THEN #UD;
FI;
RCX ← RIP;                   (* Will contain address of next instruction *)
RIP ← IA32_LSTAR;
R11 ← RFLAGS;
RFLAGS ← RFLAGS AND NOT(IA32_FMASK);
CS.Selector ← IA32_STAR[47:32] AND FFFCH
                             (* Operating system provides CS; RPL forced to 0 *)
(* Set rest of CS to a fixed value *)
CS.Base ← 0;                 (* Flat segment *)
CS.Limit ← FFFFFH;           (* With 4-KByte granularity, implies a 4-GByte limit *)
CS.Type ← 11;                (* Execute/read code, accessed *)
CS.S ← 1;
CS.DPL ← 0;
CS.P ← 1;
CS.L ← 1;                    (* Entry is to 64-bit mode *)
CS.D ← 0;                    (* Required if CS.L = 1 *)
CS.G ← 1;                    (* 4-KByte granularity *)
CPL ← 0;
SS.Selector ← IA32_STAR[47:32] + 8;
                             (* SS just above CS *)
(* Set rest of SS to a fixed value *)
SS.Base ← 0;                 (* Flat segment *)
SS.Limit ← FFFFFH;           (* With 4-KByte granularity, implies a 4-GByte limit *)
SS.Type ← 3;                 (* Read/write data, accessed *)
SS.S ← 1;
SS.DPL ← 0;
SS.P ← 1;
SS.B ← 1;                    (* 32-bit stack segment *)
SS.G ← 1;                    (* 4-KByte granularity *)

The most important part is the pair of operations that save and replace the RIP register:

RCX ← RIP
RIP ← IA32_LSTAR

So, in other words, there must be kernel code at the address stored in IA32_LSTAR (an MSR, Model-Specific Register), and RCX holds the return address.

The CS and SS segments are also adjusted so your kernel code can then run at CPL 0 (the privileged level).

The #UD (invalid opcode) exception is raised if SYSCALL is not enabled (IA32_EFER.SCE = 0), if the CPU is not in 64-bit mode, or if the CPU doesn't implement the instruction at all.

How is RAX interpreted?

This is just an index into a table of kernel function pointers. First the kernel does a bounds-check (and returns -ENOSYS if RAX > __NR_syscall_max), then dispatches to (C syntax) sys_call_table[rax](rdi, rsi, rdx, r10, r8, r9);

; Intel-syntax translation of Linux 4.12 syscall entry point
... ; save user-space registers etc.
call [sys_call_table + rax * 8] ; dispatch to sys_execve() or whatever kernel C function

;;; execve probably won't return via this path, but most other calls will
... ; restore registers except RAX return value, and return to user-space

Modern Linux is more complicated in practice because of workarounds for x86 vulnerabilities like Meltdown and L1TF, which change the page tables so most of kernel memory isn't mapped while user-space is running. The above code is a literal translation (from AT&T syntax) of call *sys_call_table(, %rax, 8) from ENTRY(entry_SYSCALL_64) in Linux 4.12 arch/x86/entry/entry_64.S (before the Spectre/Meltdown mitigations were added). Also related: "What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?" has some more details about the kernel side of system-call dispatching.

Fast?

The instruction is said to be fast. This is because in the old days one would have to use a software interrupt such as int 0x80. Interrupts make use of the kernel stack, push many registers on the stack, and use the rather slow iret to exit the exception state and return to the address just after the interrupt. This is generally much slower.

With the syscall you may be able to avoid most of that overhead. However, in what you're asking, this is not really going to help.

Another instruction used alongside syscall is swapgs. This gives the kernel a way to access its own data and stack. You should look at the Intel/AMD documentation about those instructions for more details.

New Process?

The Linux system has what it calls a task table. Each process and each thread within a process is actually called a task.

When you create a new process, Linux creates a task. For that to work, it runs code that does things such as:

  • Make sure the executable exists
  • Set up a new task (including parsing the ELF program headers from that executable to create memory mappings in the newly-created virtual address space)
  • Allocate a stack buffer
  • Load the first few blocks of the executable (as an optimization for demand paging), allocating some physical pages for the virtual pages to map to
  • Set the start address in the task (the ELF entry point from the executable)
  • Mark the task as ready (a.k.a. running)

This is, of course, super simplified.

The start address is defined in your ELF binary. The kernel really only needs to determine that one address, save it in the task's saved RIP, and "return" to user-space. The normal demand-paging mechanism takes care of the rest: if the code is not yet loaded, it generates a #PF page-fault exception and the kernel loads the necessary code at that point. (In most cases, though, the loader will already have some part of the software loaded as an optimization to avoid that initial page fault.)

(A #PF on a page that isn't mapped would result in the kernel delivering a SIGSEGV segfault signal to your process, but a "valid" page fault is handled silently by the kernel.)

All new processes usually get loaded at the same virtual address (ignoring PIE + ASLR). This is possible because we use the MMU (Memory Management Unit). That coprocessor translates memory addresses between virtual address spaces and physical address space.

(Editor's note: the MMU isn't really a coprocessor; in modern CPUs the virtual-memory logic is tightly integrated into each core, alongside the L1 instruction/data caches. Some ancient CPUs did use an external MMU chip, though.)

Determine the Address?

So, now we understand that all processes get the same virtual address (0x400000 under Linux is the default chosen by ld). To determine the real physical address we use the MMU. How does the kernel decide on that physical address? Well, it has a memory allocation function. It's that simple.

It calls a "malloc()"-type function that searches for a memory block which is not currently used and creates (a.k.a. loads) the process at that location. If no memory block is currently available, the kernel tries to swap something out of memory. If that fails, the creation of the process fails.

In the case of process creation, it will allocate pretty large blocks of memory to start with. It is not unusual to allocate 1 MiB or 2 MiB buffers to start a new process. This makes things go a lot faster.

Also, if the process is already running and you start it again, a lot of the memory used by the already-running instance can be reused. In that case the kernel does not allocate or load those parts again; it uses the MMU to share the pages that can be made common to both instances of the process. In most cases the code part of the process can be shared since it is read-only. Some of the data can be shared too: read-only data can always be shared, and writable data can still be shared as long as it hasn't been modified yet, in which case it is marked copy-on-write.

How does a system call work

In short, here's how a system call works:

  • First, the user application program sets up the arguments for the system call.
  • After the arguments are all set up, the program executes the "system call" instruction.
  • This instruction causes an exception: an event that causes the processor to jump to a new address and start executing the code there.

  • The instructions at the new address save your user program's state, figure out what system call you want, call the function in the kernel that implements that system call, restore your user program's state, and return control back to the user program.

A visual explanation of a user application invoking the open() system call:

[figure omitted]

It should be noted that the system call interface (the layer linking applications to the system calls made available by the operating system) invokes the intended system call in the OS kernel and returns the status of the system call and any return values. The caller needs to know nothing about how the system call is implemented or what it does during execution.

Another example: A C program invoking printf() library call, which calls write() system call

[figure omitted]

For more detailed explanation read section 1.5.1 in CH-1 and Section 2.3 in CH-2 from Operating System Concepts.

Who sets the RIP register when you call the clone syscall?

Normally the way it works is that, when the computer boots, Linux sets up an MSR (Model-Specific Register) to work with the assembly instruction syscall. The syscall instruction makes the RIP register jump to the address specified in that MSR to enter kernel mode. As stated in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2B:

SYSCALL invokes an OS system-call handler at privilege level 0.
It does so by loading RIP from the IA32_LSTAR MSR

Once in kernel mode, the kernel looks at the arguments passed in the conventional registers (RAX for the system call number; RDI, RSI, RDX, R10, R8 and R9 for the arguments) to determine what the syscall is asking. Then the kernel invokes one of the sys_XXX functions whose prototypes are in linux/syscalls.h (https://elixir.bootlin.com/linux/latest/source/include/linux/syscalls.h#L217). The definition of sys_clone is in kernel/fork.c.

SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
                int __user *, parent_tidptr,
                int __user *, child_tidptr,
                unsigned long, tls)
{
        return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
}

The SYSCALL_DEFINE5 macro takes the first argument and prefixes it with sys_. This function is therefore sys_clone, and it calls _do_fork.

It means there really isn't a clone() function that glibc invokes to call into the kernel. The kernel is entered with the syscall instruction; it jumps to the address specified in the MSR and then invokes one of the handlers in the sys_call_table.

The entry point to the kernel for x86 is here: https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S. If you scroll down you'll see the line: call *sys_call_table(, %rax, 8). Basically, call one of the functions of the sys_call_table. The implementation of the sys_call_table is here: https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscall_64.c#L20.

// SPDX-License-Identifier: GPL-2.0
/* System call table for x86-64. */

#include <linux/linkage.h>
#include <linux/sys.h>
#include <linux/cache.h>
#include <linux/syscalls.h>
#include <asm/unistd.h>
#include <asm/syscall.h>

#define __SYSCALL_X32(nr, sym)
#define __SYSCALL_COMMON(nr, sym) __SYSCALL_64(nr, sym)

#define __SYSCALL_64(nr, sym) extern long __x64_##sym(const struct pt_regs *);
#include <asm/syscalls_64.h>
#undef __SYSCALL_64

#define __SYSCALL_64(nr, sym) [nr] = __x64_##sym,

asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
        /*
         * Smells like a compiler bug -- it doesn't work
         * when the & below is removed.
         */
        [0 ... __NR_syscall_max] = &__x64_sys_ni_syscall,
#include <asm/syscalls_64.h>
};

I recommend you read the following: https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-2.html. There it is stated that

As you can see, we include the asm/syscalls_64.h header at the end of the array. This header file is generated by the special script at arch/x86/entry/syscalls/syscalltbl.sh and generates our header file from the syscall table (https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl).

...

So, after this, our sys_call_table takes the following form:

asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
        [0 ... __NR_syscall_max] = &sys_ni_syscall,
        [0] = sys_read,
        [1] = sys_write,
        [2] = sys_open,
        ...
};

Once the table is generated, one of its entries is called when you use the syscall assembly instruction. For clone() it calls sys_clone(), which itself calls _do_fork(), which is defined as follows:

long _do_fork(unsigned long clone_flags,
              unsigned long stack_start,
              unsigned long stack_size,
              int __user *parent_tidptr,
              int __user *child_tidptr,
              unsigned long tls)
{
        struct task_struct *p;
        int trace = 0;
        long nr;

        /*
         * Determine whether and which event to report to ptracer. When
         * called from kernel_thread or CLONE_UNTRACED is explicitly
         * requested, no event is reported; otherwise, report if the event
         * for the type of forking is enabled.
         */
        if (!(clone_flags & CLONE_UNTRACED)) {
                if (clone_flags & CLONE_VFORK)
                        trace = PTRACE_EVENT_VFORK;
                else if ((clone_flags & CSIGNAL) != SIGCHLD)
                        trace = PTRACE_EVENT_CLONE;
                else
                        trace = PTRACE_EVENT_FORK;

                if (likely(!ptrace_event_enabled(current, trace)))
                        trace = 0;
        }

        p = copy_process(clone_flags, stack_start, stack_size,
                         child_tidptr, NULL, trace, tls);
        /*
         * Do this prior waking up the new thread - the thread pointer
         * might get invalid after that point, if the thread exits quickly.
         */
        if (!IS_ERR(p)) {
                struct completion vfork;
                struct pid *pid;

                trace_sched_process_fork(current, p);

                pid = get_task_pid(p, PIDTYPE_PID);
                nr = pid_vnr(pid);

                if (clone_flags & CLONE_PARENT_SETTID)
                        put_user(nr, parent_tidptr);

                if (clone_flags & CLONE_VFORK) {
                        p->vfork_done = &vfork;
                        init_completion(&vfork);
                        get_task_struct(p);
                }

                wake_up_new_task(p);

                /* forking complete and child started to run, tell ptracer */
                if (unlikely(trace))
                        ptrace_event_pid(trace, pid);

                if (clone_flags & CLONE_VFORK) {
                        if (!wait_for_vfork_done(p, &vfork))
                                ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
                }

                put_pid(pid);
        } else {
                nr = PTR_ERR(p);
        }
        return nr;
}

It calls wake_up_new_task() which puts the task on the runqueue and wakes it. I'm surprised it even wakes the task immediately. I would have guessed that the scheduler would have done it instead, giving the task a high priority so it runs as soon as possible. The kernel itself doesn't have to receive a function pointer because, as stated in the manpage for clone():

The raw clone() system call corresponds more closely to fork(2)
in that execution in the child continues from the point of the
call. As such, the fn and arg arguments of the clone() wrapper
function are omitted.

The child continues execution where the syscall was made. I don't fully understand the mechanism, but in the end the child continues execution in a new thread: the parent thread (which created the new child thread) returns, and the child thread jumps to the specified function instead.

I think it works with the following lines (on the link you provided):

testq   %rax,%rax
jl SYSCALL_ERROR_LABEL
jz L(thread_start) //Child jumps to thread_start

ret //Parent returns to where it was

Because rax is a 64-bit register, they use the 'q' version of the GNU-syntax test instruction. They test whether rax is zero. If it is less than zero, there was an error. If it is zero, jump to thread_start. If it is neither zero nor negative (the case of the parent thread), continue execution and return. The new thread is created with rax equal to 0, which allows differentiating between the parent and the child thread.

EDIT

As stated on the link you provided,

The parameters are passed in register and on the stack from userland:
rdi: fn
rsi: child_stack
rdx: flags
rcx: arg
r8d: TID field in parent
r9d: thread pointer

So when your program executes the following lines:

/* Insert the argument onto the new stack.  */
subq $16,%rsi
movq %rcx,8(%rsi)

/* Save the function pointer. It will be popped off in the
child in the ebx frobbing below. */
movq %rdi,0(%rsi)

it inserts the function pointer and argument onto the new stack. Then it calls the kernel, which itself doesn't have to push anything onto the stack. It just receives the new stack as an argument and then makes the child thread's RSP register point to it. I would guess this happens in the copy_process() function (called from _do_fork()) along the lines of:

retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
if (retval)
goto bad_fork_cleanup_io;

It seems to be done in the copy_thread_tls() function which itself calls copy_thread(). copy_thread() has its prototype in include/linux/sched.h and it is defined based on the architecture. I'm not sure where it is defined for x86.

Relationship between system calls API, syscall instruction and exception mechanism (interrupts)

TL;DR

The syscall instruction itself acts like a glorified jump: it's a hardware-supported way to efficiently and safely jump from unprivileged user-space into the kernel.

The syscall instruction jumps to a kernel entry-point that dispatches the call.

Before x86_64 two other mechanisms were used: the int instruction and the sysenter instruction.

They have different entry-points (still present today in 32-bit kernels, and 64-bit kernels that can run 32-bit user-space programs).

The former uses the x86 interrupt machinery and can be confused with exception dispatching (which also uses the interrupt machinery).

However, exceptions are unexpected events, while int deliberately generates a software interrupt: again, a glorified jump.


The C language doesn't concern itself with system calls, it relies on the C runtime to perform all the interactions with the environment of the future program.

The C runtime implements the above-mentioned interactions through an environment specific mechanism.

There could be various layers of software abstractions but in the end the OS APIs get called.

The term API is used to denote a contract; strictly speaking, using an API doesn't require invoking a piece of kernel code (the trend is to implement non-critical functions in userspace to limit the exploitable code). Here we are only interested in the subset of the API that requires a privilege switch.

Under Linux, the kernel exposes a set of services accessible from userspace, these entry-points are called system calls.

Under Windows, the kernel services (that are accessed with the same mechanism of the Linux analogues) are considered private in the sense that they are not required to be stable across versions.

A set of DLL/EXE exported functions are used as entry-points instead (e.g. ntoskrnl.exe, hal.dll, kernel32.dll, user32.dll) that in turn use the kernel services through a (private) system call.

Note that under Linux, most system calls have a POSIX wrapper around them, so it's possible to use these wrappers, which are ordinary C functions, to invoke a system call.

The underlying ABI is different, as is the error reporting; the wrapper translates between the two worlds.

The C runtime calls the OS APIs, in the case of Linux the system calls are used directly because they are public (in the sense that are stable across versions), while for Windows the usual DLLs, like kernel32.dll, are marked as dependencies and used.

We are reduced to the point where a user-mode program, be it part of the C runtime (Linux) or part of an API DLL (Windows), needs to invoke code in the kernel.

The x86 architecture historically offered different ways to do so, for example, a call gate.

Another way is through the int instruction, it has a few advantages:

  • It is what the BIOS and the DOS did in their times.

    In real mode, using an int instruction is suitable because a vector number (e.g. 21h) is easier to remember than a far address (e.g. 0f000h:0fff0h).
  • It saves the flags.
  • It is easy to set up (setting up ISR is relatively easy).

With the modernization of the architecture this mechanism turned out to have a big disadvantage: it is slow.
Before the introduction of the sysenter (note, sysenter not syscall) instruction there was no faster alternative (a call gate would be equally slow).

With the advent of the Pentium Pro/II[1] a new pair of instructions sysenter and sysexit were introduced to make system calls faster.

Linux started using them in version 2.5, and they are still used today on 32-bit systems, I believe.

I won't explain the whole mechanism of the sysenter instruction and the companion VDSO necessary to use it; suffice it to say that it was faster than the int mechanism. (I can't find the article where Andy Glew says that sysenter turned out to be slow on Pentium III; I don't know how it performs nowadays.)

With the advent of x86-64, the AMD response to sysenter, i.e. the syscall/sysret pair, became the de facto way to switch from user mode to kernel mode.

This is due to the fact that syscall is fast and very simple (it copies rip and rflags into rcx and r11 respectively, masks rflags and jumps to the address set in IA32_LSTAR).

64-bit versions of both Linux and Windows use syscall.

To recap, control can be given to the kernel through three mechanism:

  • Software interrupts.

    This was int 80h for 32-bit Linux (pre 2.5) and int 2eh for 32-bit Windows.
  • Via sysenter.

    Used by 32-bit versions of Linux since 2.5.
  • Via syscall.

    Used by 64-bit versions of Linux and Windows.


The C runtime is usually a static library, thus pre-compiled, that uses one of the three methods above.

The syscall instruction transfers control to a kernel entry point (see entry_64.S) directly.

It is an instruction that just does so, it is not implemented by the OS, it is used by the OS.

The term exception is overloaded in CS, C++ has exceptions, so do Java and C#.

The OS can have a language-agnostic exception-trapping mechanism (under Windows it was once called SEH, and has since been rewritten).

The CPU also has exceptions.

I believe we are talking about the last meaning.

Exceptions are dispatched through interrupts, they are a kind of interrupt.

It goes without saying that while exceptions are synchronous (they happen at specific, replayable points), they are "unwanted": they are exceptional, in the sense that programmers tend to avoid them, and when they happen it is due to either a bug, an unhandled corner case or a bad situation.

They, thus, are not used to transfer control to the kernel (they could).

Software interrupts (which are synchronous too) were used instead; the mechanism is almost exactly the same (exceptions can have a status code pushed on the kernel stack) but the semantics are different.

We never dereferenced a null pointer, accessed an unmapped page or similar to invoke a system call; we used the int instruction instead.

How to invoke a system call via syscall or sysenter in inline assembly?

First of all, you can't safely use GNU C Basic asm(""); syntax for this (without input/output/clobber constraints). You need Extended asm to tell the compiler about registers you modify. See the inline asm in the GNU C manual and the inline-assembly tag wiki for links to other guides for details on what things like "D"(1) means as part of an asm() statement.

You also need asm volatile because that's not implicit for Extended asm statements with 1 or more output operands.


I'm going to show you how to execute system calls by writing a program that prints Hello World! to standard output using the write() system call. Here's the source of the program without an implementation of the actual system call:

#include <sys/types.h>

ssize_t my_write(int fd, const void *buf, size_t size);

int main(void)
{
    const char hello[] = "Hello world!\n";
    my_write(1, hello, sizeof(hello));
    return 0;
}

You can see that I named my custom system call function as my_write in order to avoid name clashes with the "normal" write, provided by libc. The rest of this answer contains the source of my_write for i386 and amd64.

i386

System calls in i386 Linux are implemented using the 128th interrupt vector, e.g. by calling int 0x80 in your assembly code, having set the parameters accordingly beforehand, of course. It is possible to do the same via SYSENTER, but actually executing this instruction is achieved by the VDSO virtually mapped to each running process. Since SYSENTER was never meant as a direct replacement of the int 0x80 API, it's never directly executed by userland applications - instead, when an application needs to access some kernel code, it calls the virtually mapped routine in the VDSO (that's what the call *%gs:0x10 in your code is for), which contains all the code supporting the SYSENTER instruction. There's quite a lot of it because of how the instruction actually works.

If you want to read more about this, have a look at this link. It contains a fairly brief overview of the techniques applied in the kernel and the VDSO. See also The Definitive Guide to (x86) Linux System Calls - some system calls like getpid and clock_gettime are so simple the kernel can export code + data that runs in user-space so the VDSO never needs to enter the kernel, making it much faster even than sysenter could be.


It's much easier to use the slower int $0x80 to invoke the 32-bit ABI.

// i386 Linux
#include <asm/unistd.h> // compile with -m32 for 32 bit call numbers
//#define __NR_write 4
ssize_t my_write(int fd, const void *buf, size_t size)
{
    ssize_t ret;
    asm volatile
    (
        "int $0x80"
        : "=a" (ret)
        : "0"(__NR_write), "b"(fd), "c"(buf), "d"(size)
        : "memory"  // the kernel dereferences pointer args
    );
    return ret;
}

As you can see, using the int 0x80 API is relatively simple. The number of the syscall goes to the eax register, while all the parameters needed for the syscall go into respectively ebx, ecx, edx, esi, edi, and ebp. System call numbers can be obtained by reading the file /usr/include/asm/unistd_32.h.

Prototypes and descriptions of the functions are available in the 2nd section of the manual, so in this case write(2).

The kernel saves/restores all the registers (except EAX) so we can use them as input-only operands to the inline asm. See What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64

Keep in mind that the clobber list also contains "memory", which tells the compiler that the asm references memory beyond its operands (here, the buffer that buf points to). (A pointer input to inline asm does not imply that the pointed-to memory is also an input. See How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)

amd64

Things look different on the AMD64 architecture which sports a new instruction called SYSCALL. It is very different from the original SYSENTER instruction, and definitely much easier to use from userland applications - it really resembles a normal CALL, actually, and adapting the old int 0x80 to the new SYSCALL is pretty much trivial. (Except it uses RCX and R11 instead of the kernel stack to save the user-space RIP and RFLAGS so the kernel knows where to return).

In this case, the number of the system call is still passed in the register rax, but the registers used to hold the arguments now nearly match the function calling convention: rdi, rsi, rdx, r10, r8 and r9 in that order. (syscall itself destroys rcx so r10 is used instead of rcx, letting libc wrapper functions just use mov r10, rcx / syscall.)

// x86-64 Linux
#include <asm/unistd.h> // compile without -m32 for 64 bit call numbers
// #define __NR_write 1
ssize_t my_write(int fd, const void *buf, size_t size)
{
    ssize_t ret;
    asm volatile
    (
        "syscall"
        : "=a" (ret)
        //        EDI      RSI       RDX
        : "0"(__NR_write), "D"(fd), "S"(buf), "d"(size)
        : "rcx", "r11", "memory"
    );
    return ret;
}

(See it compile on Godbolt)

Do notice how practically the only thing that needed changing were the register names, and the actual instruction used for making the call. This is mostly thanks to the input/output lists provided by gcc's extended inline assembly syntax, which automagically provides appropriate move instructions needed for executing the instruction list.


