How Is the System Call in Linux Implemented

linux system call implementation

A system call is implemented mostly inside the Linux kernel, with a small amount of glue code in the C standard library. But see also vdso(7).

From the user-land point of view, a system call (they are listed in syscalls(2)) is a single machine instruction (SYSCALL on x86-64, often SYSENTER on 32-bit x86) with some calling conventions, e.g. defining which machine register holds the syscall number (e.g. __NR_stat from /usr/include/asm/unistd_64.h) and which other registers contain the arguments to the system call.
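To make the register convention concrete, here is a minimal sketch of issuing a system call by hand. The helper name raw_syscall3 is made up for illustration; the inline assembly assumes GCC or Clang on x86-64, and other architectures simply fall back to the libc syscall() function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid and the other SYS_* numbers */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: issue a system call with up to three arguments. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
    assert(nr >= 0);
#if defined(__x86_64__)
    long ret;
    /* x86-64 convention: number in rax, arguments in rdi, rsi, rdx.
     * The SYSCALL instruction clobbers rcx and r11; the result is in rax. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;   /* values in [-4095, -1] encode -errno */
#else
    /* Elsewhere, let libc pick the right instruction and registers.
     * Note that syscall() reports errors as -1 plus errno instead. */
    return syscall(nr, a1, a2, a3);
#endif
}
```

Calling raw_syscall3(SYS_getpid, 0, 0, 0) should give the same value as getpid(), since getpid takes no arguments and ignores the extra registers.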

Use strace(1) to see which system calls a given program or process makes.

The C standard library has a tiny wrapper function (which invokes the kernel, following the ABI, and deals with error reporting & errno).
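The error-reporting part of such a wrapper can be sketched as follows. This is not glibc's or musl's actual code, only an illustration of the usual Linux convention that the kernel returns a value between -4095 and -1 on failure; the function name wrap_result is made up.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: turn a raw kernel return value into the
 * libc convention of "-1 and errno set" on failure. */
static long wrap_result(long raw)
{
    /* On Linux the top 4095 values of the unsigned range mean -errno. */
    if ((unsigned long)raw >= (unsigned long)-4095L) {
        errno = (int)-raw;
        return -1;
    }
    return raw;   /* success: pass the kernel's result through unchanged */
}
```

For example, wrap_result(-ENOENT) returns -1 and leaves errno set to ENOENT, while wrap_result(42) returns 42 and does not touch errno.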

For stat(2), the C wrapping function is e.g. in stat/stat.c for musl-libc.

Inside the kernel code, most of the work happens in fs/stat.c (e.g. after line 207).


how to implement my own system call in Linux kernel 4.x?

You can edit your glibc to add a wrapper around your syscall. Existing wrappers are declared in the syscalls.list files under glibc/sysdeps/unix (look for the one for your platform):
https://github.com/lattera/glibc/blob/master/sysdeps/unix/syscalls.list
https://github.com/lattera/glibc/blob/master/sysdeps/unix/sysv/linux/x86_64/syscalls.list

# File name  Caller  Syscall name  Args     Strong name    Weak names

accept       -       accept        Ci:iBN   __libc_accept  accept
access       -       access        i:si     __access       access
close        -       close         Ci:i     __libc_close   __close close
open         -       open          Ci:siv   __libc_open    __open open
read         -       read          Ci:ibn   __libc_read    __read read
uname        -       uname         i:p      __uname        uname
write        -       write         Ci:ibn   __libc_write   __write write

To decode this format, read the comments in the script that processes this file, sysdeps/unix/make-syscalls.sh, as recommended in https://blog.packagecloud.io/eng/2016/04/05/the-definitive-guide-to-linux-system-calls/

# This script is used to process the syscall data encoded in the various
# syscalls.list files to produce thin assembly syscall wrappers around the
# appropriate OS syscall. See syscall-template.s for more details on the
# actual wrapper.
#
# Syscall Signature Prefixes:
#
# E: errno and return value are not set by the call
# V: errno is not set, but errno or zero (success) is returned from the call
#
# Syscall Signature Key Letters:
#
# a: unchecked address (e.g., 1st arg to mmap)
# b: non-NULL buffer (e.g., 2nd arg to read; return value from mmap)
# B: optionally-NULL buffer (e.g., 4th arg to getsockopt)
# f: buffer of 2 ints (e.g., 4th arg to socketpair)
# F: 3rd arg to fcntl
# i: scalar (any signedness & size: int, long, long long, enum, whatever)
# I: 3rd arg to ioctl
# n: scalar buffer length (e.g., 3rd arg to read)
# N: pointer to value/return scalar buffer length (e.g., 6th arg to recvfrom)
# p: non-NULL pointer to typed object (e.g., any non-void* arg)
# P: optionally-NULL pointer to typed object (e.g., 2nd argument to gettimeofday)
# s: non-NULL string (e.g., 1st arg to open)
# S: optionally-NULL string (e.g., 1st arg to acct)
# v: vararg scalar (e.g., optional 3rd arg to open)
# V: byte-per-page vector (3rd arg to mincore)
# W: wait status, optionally-NULL pointer to int (e.g., 2nd arg of wait4)
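As a quick way to read the Args column, the sketch below splits a signature such as Ci:ibn at the colon. It assumes, as the examples above suggest, that every argument is exactly one key letter and that the return type is the single letter just before the colon; the function names are made up, and this is not how make-syscalls.sh itself parses the file.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: number of arguments in a signature like "Ci:ibn",
 * or -1 if the string has no ':' separator. */
static int sig_arg_count(const char *sig)
{
    assert(sig != NULL);
    const char *colon = strchr(sig, ':');
    if (colon == NULL)
        return -1;
    return (int)strlen(colon + 1);   /* one key letter per argument */
}

/* Hypothetical helper: key letter of the return type, i.e. the character
 * right before ':' (any prefix letters come before it). */
static char sig_return_key(const char *sig)
{
    assert(sig != NULL);
    const char *colon = strchr(sig, ':');
    return (colon != NULL && colon > sig) ? colon[-1] : '\0';
}
```

With the entries listed earlier, sig_arg_count("Ci:ibn") is 3 for read (scalar fd, non-NULL buffer, buffer length) and sig_arg_count("i:si") is 2 for access.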

More information about glibc's syscall wrappers is available on the official site: https://sourceware.org/glibc/wiki/SyscallWrappers

There are three types of OS kernel system call wrappers that are used by glibc: assembly, macro, and bespoke.

Assembly syscalls
Simple kernel system calls in glibc are translated from a list of names into an assembly wrapper that is then compiled. ... The list of syscalls that use wrappers is kept in the syscalls.list files: ... ./sysdeps/unix/sysv/linux/x86_64/syscalls.list

Don't forget to define the __NR_ number for your syscall in the Linux headers.

Instructions are available on kernel.org and in the Documentation/adding-syscalls.* files inside the Linux kernel sources (Documentation/process/adding-syscalls.rst in recent trees):
https://www.kernel.org/doc/html/v4.10/process/adding-syscalls.html
https://github.com/torvalds/linux/blob/master/Documentation/process/adding-syscalls.rst

The method is different for other operating systems such as FreeBSD: https://wiki.freebsd.org/AddingSyscalls

Where is SYSCALL() implemented in Linux?

syscall() is a wrapper, part of the standard library, that loads the registers and executes the syscall instruction on 64-bit x86, or int 80h or sysenter on 32-bit x86.

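example (x86-64, Intel syntax):

syscall:
endbr64
mov rax,rdi                 ; syscall number: first C argument, moved into rax
mov rdi,rsi                 ; shift the remaining C arguments down by one slot
mov rsi,rdx
mov rdx,rcx
mov r10,r8                  ; 4th syscall argument goes in r10, not rcx
mov r8,r9
mov r9,QWORD PTR [rsp+0x8]  ; 6th syscall argument was passed on the stack
syscall                     ; enter the kernel; the result comes back in rax
                            ; (the real function then checks rax for an error and returns)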

So the answer is that the syscall function is in glibc.
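From C, that function is used like this. A minimal sketch: the helper name write_via_syscall is made up, while SYS_write comes from <sys/syscall.h> and syscall() from <unistd.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: same effect as write(fd, buf, len), but the
 * system call number is passed explicitly to the generic wrapper. */
static long write_via_syscall(int fd, const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    /* Returns the byte count, or -1 with errno set on failure. */
    return syscall(SYS_write, fd, buf, len);
}
```

write_via_syscall(1, "hello\n", 6) writes to standard output just as write(1, "hello\n", 6) would.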

Inside the kernel, an assembly file contains the entry point for the syscall or sysenter instruction, or the int 80h interrupt handler (depending on the system implementation). It does some stack magic, performs some checks, and then calls the function that handles the particular system call. The addresses of those functions are kept in a special table of function pointers. This part can hardly be called a "library".
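The table-of-function-pointers idea can be modelled in plain user-space C. This is a toy, not kernel code: every name in it (toy_sys_add, toy_sys_call_table, toy_dispatch) is invented for illustration, and the real kernel table and entry code look different.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Every toy "system call" takes three register-sized arguments. */
typedef long (*toy_sys_fn)(long, long, long);

static long toy_sys_add(long a, long b, long c) { (void)c; return a + b; }
static long toy_sys_mul(long a, long b, long c) { (void)c; return a * b; }

/* The index into this table plays the role of the system call number. */
static const toy_sys_fn toy_sys_call_table[] = {
    toy_sys_add,   /* number 0 */
    toy_sys_mul,   /* number 1 */
};

/* Check the number, then jump through the table, as the entry code does. */
static long toy_dispatch(unsigned long nr, long a, long b, long c)
{
    size_t n = sizeof(toy_sys_call_table) / sizeof(toy_sys_call_table[0]);
    if (nr >= n)
        return -ENOSYS;   /* unknown number: "function not implemented" */
    assert(toy_sys_call_table[nr] != NULL);
    return toy_sys_call_table[nr](a, b, c);
}
```

Here toy_dispatch(0, 2, 3, 0) yields 5, and an out-of-range number yields -ENOSYS, which mirrors what a caller sees when asking a kernel for a system call it does not have.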

Simple System Call Implementation example?

This depends on which architecture you want to add a system call for, or if you want to add the system call for all architectures. I will explain one way to add a system call for ARM.

  1. Pick a name for your syscall. For example, mysyscall.
  2. Choose a syscall number. In arch/arm/include/asm/unistd.h, take note of how each syscall has a specific number (__NR_SYSCALL_BASE+<number>) assigned to it. Choose an unused number for your syscall. Let us choose syscall number 223. Then add:

    #define __NR_mysyscall (__NR_SYSCALL_BASE+223)

    where the index 223 would be in that header file. This assigns the number 223 to your syscall on ARM architectures.

  3. Modify architecture-specific syscall table. In linux/arch/arm/kernel/calls.S, change the line that corresponds to syscall 223 to:

    CALL(sys_mysyscall)

  4. Add your function prototype. For a non-architecture-specific syscall, edit include/linux/syscalls.h and add your syscall's prototype:

    asmlinkage long sys_mysyscall(struct dummy_struct __user *buf);

    If you want to add it specifically for ARM, do the same, but in this file: arch/arm/kernel/sys_arm.c.

  5. Implement your syscall somewhere. Create a file wherever you please, for example in the kernel/ directory. You need at least:

    #include <linux/syscalls.h>
    ...
    SYSCALL_DEFINE1(mysyscall, struct dummy_struct __user *, buf)
    {
        /* Implement your syscall */
        return 0;   /* 0 on success, a negative errno value on failure */
    }

Note the macro, SYSCALL_DEFINE1. The number at the end should correspond to how many input parameters your syscall has. In this case, our system call only has 1 parameter, so you use SYSCALL_DEFINE1. If it had two parameters, you would use SYSCALL_DEFINE2, etc.

Don't forget to add the object (.o) file to the Makefile in the directory where you put it.


  6. Compile your new kernel and test. You haven't modified your C libraries, so you cannot invoke your syscall with mysyscall(). You need to use the syscall() function, which takes a system call number as its first argument:

    struct dummy_struct *buf = calloc(1, sizeof(*buf));
    long res = syscall(223, buf);

Do note that this was for ARM. The process will be very similar for other architectures.
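When testing, it helps to distinguish "the syscall failed" from "this kernel does not have the syscall". A minimal sketch, with a made-up helper name call_by_number; pass whatever number you chose for your syscall:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: invoke syscall `nr` with one pointer argument.
 * Returns 0 and stores the result in *out on success, or the errno
 * value on failure. */
static int call_by_number(long nr, void *arg, long *out)
{
    assert(out != NULL);
    errno = 0;
    long res = syscall(nr, arg);
    if (res == -1 && errno != 0)
        return errno;   /* ENOSYS: the running kernel lacks this syscall */
    *out = res;
    return 0;
}
```

If call_by_number() returns ENOSYS for your new number, you are most likely still running the old kernel or the syscall table entry was not picked up.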

Edit: Don't forget to add your syscall file to the Makefile in kernel/.

How does a system call work

In short, here's how a system call works:

  • First, the user application program sets up the arguments for the system call.
  • After the arguments are all set up, the program executes the "system call" instruction.
  • This instruction causes an exception: an event that causes the processor to jump to a new address and start executing the code there.

  • The instructions at the new address save your user program's state, figure out what system call you want, call the function in the kernel that implements that system call, restore your user program's state, and return control to the user program.

[Figure: a user application invoking the open() system call]

Note that the system call interface, which serves as the link to the system calls made available by the operating system, invokes the intended system call in the OS kernel and returns the status of the system call and any return values. The caller need not know anything about how the system call is implemented or what it does during execution.

[Figure: another example, a C program invoking the printf() library call, which calls the write() system call]

For a more detailed explanation, read Section 1.5.1 in Chapter 1 and Section 2.3 in Chapter 2 of Operating System Concepts.

How does a syscall actually happen on linux?

Assuming we're talking about x86:

  1. The ID of the system call is deposited into the EAX register
  2. Any arguments required by the system call are deposited into registers in a fixed order: EBX, ECX, EDX, ESI, EDI, and EBP for the first through sixth argument. Larger data is passed by placing a pointer to it in one of these registers.
  3. An INT 0x80 interrupt is invoked.
  4. The Linux kernel services the system call identified by the ID in the EAX register and leaves the result in EAX (a small negative value signals an error).
  5. The calling code makes use of any results.

I may be a bit rusty at this, it's been a few years...

When implementing a system call, how do you expose the system call number to userland?

Well, I have a partial answer. Partial because it is Debian specific.

If you use the make deb-pkg target in the kernel sources, .deb packages are created in the parent directory. Once you install these, your headers are installed into the system.

After doing this for my kernel described above:

$ grep -r krun /usr/include
/usr/include/asm/unistd_64.h:#define __NR_krun_read_msrs 317
/usr/include/asm/unistd_64.h:#define __NR_krun_reset_msrs 318
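Once the headers are installed, user code can pick the number up from <sys/syscall.h>, which pulls in the __NR_* definitions. A minimal sketch, assuming the macro name shown in the grep output above; the function name is made up, and the #ifdef keeps the code compiling against headers from a kernel without the syscall.

```c
#include <assert.h>
#include <sys/syscall.h>   /* includes the kernel's __NR_* definitions */

/* Hypothetical helper: the number of the custom syscall if the installed
 * headers define it, or -1 if they come from an unpatched kernel. */
static long krun_read_msrs_number(void)
{
#ifdef __NR_krun_read_msrs
    return __NR_krun_read_msrs;
#else
    return -1;
#endif
}
```

On a system with the patched headers installed this would return 317, going by the grep output; on stock headers it returns -1, and the caller can report that the syscall is unavailable.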

