How to Access the System Call from User-Space


First, understand the role of the Linux kernel: applications interact with the kernel only through system calls.

In effect, an application runs on the "virtual machine" provided by the kernel: it runs in user space and can only execute (at the lowest machine level) the machine instructions permitted in user CPU mode, augmented by the instruction (e.g. SYSENTER or INT 0x80 ...) used to make system calls. So, from the user-level application's point of view, a syscall is an atomic pseudo machine instruction.

The Linux Assembly Howto explains how a syscall can be done at the assembly (i.e. machine instruction) level.

The GNU libc provides C functions corresponding to the syscalls. For example, the open function is a thin piece of glue (i.e. a wrapper) around the syscall numbered __NR_open (it makes the syscall, then updates errno on failure). Applications usually call such C functions in libc instead of making the syscall directly.
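As a minimal sketch of what "making the syscall then updating errno" looks like from C, the code below goes through the generic syscall(2) entry point instead of a dedicated wrapper. The helper names raw_getpid() and errno_of_bad_close() are hypothetical and exist only for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fetch the PID through the generic syscall(2)
 * entry point rather than the dedicated getpid() wrapper. */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* Hypothetical helper: close an invalid descriptor through syscall(2)
 * and return the errno value that libc derived from the kernel's
 * failure result. */
static int errno_of_bad_close(void)
{
    errno = 0;
    long rc = syscall(SYS_close, -1);
    /* On failure, libc reports -1 and stores the error code in errno. */
    assert(rc == -1);
    return errno;
}
```

Calling raw_getpid() should give the same value as getpid(), and errno_of_bad_close() should report EBADF, since -1 is never a valid descriptor.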

You could use some other libc. For instance, the musl libc is somewhat "simpler" and its code is perhaps easier to read. It also wraps the raw syscalls into corresponding C functions.

If you add your own syscall, you should also implement a similar C wrapper function (in your own library), together with a header file that declares it.

See also intro(2) and syscall(2) and syscalls(2) man pages, and the role of VDSO in syscalls.

Notice that syscalls are not C functions. They don't use the call stack (they could even be invoked without any stack). A syscall is basically a number like __NR_open from <asm/unistd.h>, a machine instruction such as SYSENTER, and conventions about which registers hold the arguments before the syscall and which ones hold the result[s] after it (including the failure result, used to set errno in the C library wrapper). The conventions for syscalls are not the calling conventions for C functions in the ABI spec (e.g. x86-64 psABI). So you need a C wrapper.
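The following sketch shows one way such a wrapper could look on x86-64 Linux, using GCC/Clang inline assembly. The helper names raw_syscall1() and to_libc_result() are hypothetical. The register assignments (number in RAX, first argument in RDI, result in RAX, RCX and R11 clobbered by the syscall instruction) are the x86-64 Linux convention; on other architectures the sketch falls back to syscall(2) and re-creates the raw "negative errno" result.

```c
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a system call with one argument and return
 * the kernel's raw result (a negative errno value on failure). */
static long raw_syscall1(long nr, long arg0)
{
#if defined(__x86_64__)
    long ret;
    /* RAX = syscall number, RDI = first argument, result comes back in RAX.
     * The syscall instruction itself overwrites RCX and R11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(arg0)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Assumption for non-x86-64 builds: emulate the raw convention
     * on top of libc's syscall(2). */
    long ret = syscall(nr, arg0);
    return ret == -1 ? -(long)errno : ret;
#endif
}

/* Hypothetical helper: translate a raw kernel result into the C library
 * convention (-1 plus errno on failure, the value itself on success). */
static long to_libc_result(long raw)
{
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

For example, raw_syscall1(SYS_close, -1) yields -EBADF directly, while to_libc_result() turns that into the familiar -1 with errno set to EBADF.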

How to call a system call in kernel space?

Foreword

By definition, a system call is a service offered by the system to user space applications. Code running inside the kernel should not call a service destined for user space, so doing this is inadvisable.

First try with a kernel space buffer

The write() system call is defined in fs/read_write.c. It calls ksys_write() which calls vfs_write():

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
ssize_t ret;

if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
if (!(file->f_mode & FMODE_CAN_WRITE))
return -EINVAL;
if (unlikely(!access_ok(buf, count)))
return -EFAULT;

ret = rw_verify_area(WRITE, file, pos, count);
if (!ret) {
if (count > MAX_RW_COUNT)
count = MAX_RW_COUNT;
file_start_write(file);
ret = __vfs_write(file, buf, count, pos);
if (ret > 0) {
fsnotify_modify(file);
add_wchar(current, ret);
}
inc_syscw(current);
file_end_write(file);
}

return ret;
}
[...]
ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;

if (f.file) {
loff_t pos, *ppos = file_ppos(f.file);
if (ppos) {
pos = *ppos;
ppos = &pos;
}
ret = vfs_write(f.file, buf, count, ppos);
if (ret >= 0 && ppos)
f.file->f_pos = pos;
fdput_pos(f);
}

return ret;
}

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
return ksys_write(fd, buf, count);
}

The file descriptor passed as first parameter is not a problem. The value passed from user space is used to retrieve the file structure of the output file (in ksys_write()). But the second parameter must reference a user space memory area. In vfs_write(), a check is done on the second parameter:

    if (unlikely(!access_ok(buf, count)))
        return -EFAULT;

access_ok() checks whether the buffer lies in user space. Hence, if you pass an address referencing kernel space, the code returned by write() will be -EFAULT (-14).
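The same rejection can be observed from user space, which may help when experimenting without a kernel module. The sketch below writes into a pipe from a given address and reports the resulting errno. The helper name write_errno() is hypothetical, and the assumption is that the highest possible address (UINTPTR_MAX) is never a valid user space buffer, which holds on x86-64 Linux.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes starting at addr into a fresh pipe
 * and return the resulting errno value (0 when the write succeeds). */
static int write_errno(const void *addr, size_t len)
{
    int fds[2];
    if (pipe(fds) != 0)
        return errno;

    errno = 0;
    ssize_t n = write(fds[1], addr, len);
    /* Save errno before close() has a chance to change it. */
    int err = (n < 0) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

With a normal buffer, such as write_errno("ok", 2), the call succeeds. With an address outside user space, such as write_errno((const void *)UINTPTR_MAX, 16), the kernel refuses the buffer and the wrapper reports EFAULT.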

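The example below is a simple module calling the write() system call with a kernel space buffer. Since Linux 4.17 on x86_64, the entries of the system call table take a single struct pt_regs * argument holding the user registers, which is why the module fills a struct pt_regs. Note that the module finds the table with kallsyms_lookup_name(), which is no longer exported to modules since Linux 5.7, so it only builds against older kernels. On x86_64, the convention for the parameters of the system calls is:

RDI = arg#0
RSI = arg#1
RDX = arg#2
R10 = arg#3
R8 = arg#4
R9 = arg#5
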
#include <linux/version.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <asm/ptrace.h>
#include <linux/socket.h>
#include <linux/kallsyms.h>


MODULE_LICENSE("GPL");

typedef int (* syscall_wrapper)(struct pt_regs *);

unsigned long sys_call_table_addr;

#define DEV_NAME "[DEVICE2]"


#define DEV_STR DEV_NAME "String from driver"

static char buf[1024];


static int __init device2_init(void) {

syscall_wrapper write_syscall;
int rc;
struct pt_regs param;

printk(KERN_INFO DEV_NAME "module has been loaded\n");

sys_call_table_addr = kallsyms_lookup_name("sys_call_table");

printk(KERN_INFO DEV_NAME "sys_call_table@%lx\n", sys_call_table_addr);

write_syscall = ((syscall_wrapper *)sys_call_table_addr)[__NR_write];

/*
Call to write() system call with a kernel space buffer
*/
snprintf(buf, sizeof(buf), "%s\n", DEV_STR);
param.di = 1;
param.si = (unsigned long)buf;
param.dx = strlen(buf);
rc = (* write_syscall)(&param);

printk(KERN_INFO DEV_NAME "write() with a kernel space buffer = %d\n", rc);

return 0;
}

static void __exit device2_exit(void) {
printk(KERN_INFO DEV_NAME "module has been unloaded\n");
}

module_init(device2_init);
module_exit(device2_exit);

At module insertion time, we can verify that the system call returns -EFAULT:

$ sudo insmod ./device2.ko
$ dmesg
[15716.262977] [DEVICE2]module has been loaded
[15716.270566] [DEVICE2]sys_call_table@ffffffff926013a0
[15716.270568] [DEVICE2]write() with a kernel space buffer = -14

But with a system call like dup(), which involves a file descriptor but no user space buffer, the same approach works. Let's change the previous code to:

static int __init device2_init(void) {

syscall_wrapper write_syscall;
syscall_wrapper dup_syscall;
syscall_wrapper close_syscall;
int rc;
struct pt_regs param;

printk(KERN_INFO DEV_NAME "module has been loaded\n");

sys_call_table_addr = kallsyms_lookup_name("sys_call_table");

printk(KERN_INFO DEV_NAME "sys_call_table@%lx\n", sys_call_table_addr);

write_syscall = ((syscall_wrapper *)sys_call_table_addr)[__NR_write];
dup_syscall = ((syscall_wrapper *)sys_call_table_addr)[__NR_dup];
close_syscall = ((syscall_wrapper *)sys_call_table_addr)[__NR_close];

/*
Call to write() system call with a kernel space buffer
*/
snprintf(buf, sizeof(buf), "%s\n", DEV_STR);
param.di = 1;
param.si = (unsigned long)buf;
param.dx = strlen(buf);
rc = (* write_syscall)(&param);

printk(KERN_INFO DEV_NAME "write() with a kernel space buffer = %d\n", rc);

/*
Call to dup() system call
*/
param.di = 1;
rc = (* dup_syscall)(&param);

printk(KERN_INFO DEV_NAME "dup() = %d\n", rc);

/*
Call to close() system call
*/
param.di = 0;
rc = (* close_syscall)(&param);

printk(KERN_INFO DEV_NAME "close() = %d\n", rc);

/*
Call to dup() system call ==> Must return 0 as it is available
*/
param.di = 1;
rc = (* dup_syscall)(&param);

printk(KERN_INFO DEV_NAME "dup() = %d\n", rc);

return 0;
}

The result of dup() is OK:

$ sudo insmod ./device2.ko
$ dmesg
[17444.098469] [DEVICE2]module has been loaded
[17444.106935] [DEVICE2]sys_call_table@ffffffff926013a0
[17444.106937] [DEVICE2]write() with a kernel space buffer = -14
[17444.106939] [DEVICE2]dup() = 4
[17444.106940] [DEVICE2]close() = 0
[17444.106940] [DEVICE2]dup() = 0

The first call to dup() returns 4 because the current process is insmod. The latter opened the module file and got file descriptor 3. Hence, the first available file descriptor is 4. The second call to dup() returns 0 because we closed the file descriptor 0.
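The "lowest available descriptor" rule behind these numbers can also be checked from an ordinary user space program. The sketch below is an illustration only: the helper name dup_reuses_lowest_free_fd() is hypothetical, and it assumes a single-threaded process so that no other thread opens or closes descriptors in between.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: free one descriptor, then dup() another one.
 * Returns 1 if dup() handed back the number that was just freed,
 * 0 if it did not, and -1 if the pipe could not be created. */
static int dup_reuses_lowest_free_fd(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    /* pipe() takes the two lowest free descriptors, so fds[0] is the
     * lowest one and every smaller number is already in use. */
    int freed = fds[0];
    close(freed);

    /* dup() must return the lowest free descriptor, i.e. the one
     * that was closed a moment ago. */
    int copy = dup(fds[1]);
    int reused = (copy == freed);

    if (copy >= 0)
        close(copy);
    close(fds[1]);
    return reused;
}
```

This mirrors what the module does with descriptor 0: once a descriptor is closed, the next dup() picks it up again.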

Second try with a user space buffer

To use a user space buffer, let's add some file operations to the kernel module (open(), release() and write()). In the write() entry point we echo back what is passed from user space into stderr (file descriptor 2) using the user space buffer passed to the write() entry point:

#include <linux/version.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <asm/ptrace.h>
#include <linux/socket.h>
#include <linux/kallsyms.h>
#include <linux/cdev.h>


MODULE_LICENSE("GPL");

typedef int (* syscall_wrapper)(struct pt_regs *);

static unsigned long sys_call_table_addr;

#define DEV_NAME "[DEVICE2]"

static syscall_wrapper write_syscall;

static ssize_t device2_write(struct file *filp, const char __user *buff, size_t len, loff_t *off)
{
struct pt_regs param;
int rc;

printk(KERN_INFO DEV_NAME "write %p, %zu\n", buff, len);

/*
Call to write() system call to echo the write to stderr
*/
param.di = 2;
param.si = (unsigned long)buff;
param.dx = len;
rc = (* write_syscall)(&param);

printk(KERN_INFO DEV_NAME "write() = %d\n", rc);

return len; // <-------------- To stop the write
}

static int device2_open(struct inode *inode, struct file *file)
{
printk(KERN_INFO DEV_NAME "open\n");
return 0;
}

static int device2_release(struct inode *inode, struct file *file)
{
printk(KERN_INFO DEV_NAME "released\n");
return 0;
}

static const struct file_operations fops =
{
.owner= THIS_MODULE,
.write=device2_write,
.open= device2_open,
.release= device2_release

};

struct cdev *device_cdev;
dev_t deviceNumbers;

static int __init device2_init(void) {

int rc;

printk(KERN_INFO DEV_NAME "module has been loaded\n");

// This returns the major number chosen dynamically in deviceNumbers
rc = alloc_chrdev_region(&deviceNumbers, 0, 1, DEV_NAME);

if (rc < 0) {
printk(KERN_ALERT DEV_NAME "Error registering: %d\n", rc);
return -1;
}

device_cdev = cdev_alloc();

cdev_init(device_cdev, &fops);

cdev_add(device_cdev, deviceNumbers, 1);

printk(KERN_INFO DEV_NAME "initialized (major number is %d)\n", MAJOR(deviceNumbers));

sys_call_table_addr = kallsyms_lookup_name("sys_call_table");

printk(KERN_INFO DEV_NAME "sys_call_table@%lx\n", sys_call_table_addr);

write_syscall = ((syscall_wrapper *)sys_call_table_addr)[__NR_write];

printk(KERN_INFO DEV_NAME "write_syscall@%p\n", write_syscall);

return 0;
}

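static void __exit device2_exit(void) {
/* Remove the character device and give back its device number,
   otherwise the stale registration outlives the module */
cdev_del(device_cdev);
unregister_chrdev_region(deviceNumbers, 1);
printk(KERN_INFO DEV_NAME "module has been unloaded\n");
}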

module_init(device2_init);
module_exit(device2_exit);

The loading of the module:

$ sudo insmod device2.ko
$ dmesg
[ 2255.183196] [DEVICE2]module has been loaded
[ 2255.183202] [DEVICE2]initialized (major number is 508)
[ 2255.193255] [DEVICE2]sys_call_table@ffffffffbcc013a0
[ 2255.193256] [DEVICE2]write_syscall@0000000030394929

Make the device entry in the file system to be able to write into it:

$ sudo mknod /dev/device2 c 508 0
$ sudo chmod 666 /dev/device2
$ sudo ls -l /dev/device2
crw-rw-rw- 1 root root 508, 0 janv. 24 16:55 /dev/device2

The writing into the device triggers the expected echo on stderr:

$ echo "qwerty for test purposes" > /dev/device2
qwerty for test purposes
$ echo "another string" > /dev/device2
another string
$ dmesg
[ 2255.183196] [DEVICE2]module has been loaded
[ 2255.183202] [DEVICE2]initialized (major number is 508)
[ 2255.193255] [DEVICE2]sys_call_table@ffffffffbcc013a0
[ 2255.193256] [DEVICE2]write_syscall@0000000030394929
[ 2441.674250] [DEVICE2]open
[ 2441.674268] [DEVICE2]write 0000000032fb5249, 25
[ 2441.674281] [DEVICE2]write() = 25
[ 2441.674286] [DEVICE2]released
[ 2475.538140] [DEVICE2]open
[ 2475.538159] [DEVICE2]write 0000000032fb5249, 15
[ 2475.538171] [DEVICE2]write() = 15
[ 2475.538175] [DEVICE2]released

Accessing a system call directly from user program

The manpage for _syscall(2) states:

Starting around kernel 2.6.18, the _syscall macros were removed from header files supplied to user space. Use syscall(2) instead. (Some architectures, notably ia64, never provided the _syscall macros; on those architectures, syscall(2) was always required.)

Thus, an approach based on the _syscall macros can't work on modern kernels. (You can see this by running the preprocessor on your code: it won't resolve the _syscall0 macro.) Use the syscall() function instead.

Here is an example for the usage, cited from syscall(2):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

int
main(int argc, char *argv[])
{
pid_t tid;
tid = syscall(SYS_gettid);
}

As you asked for a direct way to call the Linux kernel without any userspace wrappers, I'll show you examples for the 80386 and the amd64 architecture.

First, you have to get the system call number from a system call table. For getpid, the system call number is 39 on amd64 and 20 on 80386. Next, we write a function that calls the system for us. On the 80386 processor you use interrupt 128 to call the system; on amd64 we use the special syscall instruction. The system call number goes into register eax, and the result is returned in the same register. To keep the program simple, we write it in assembly. You can later use strace to verify that it works correctly.

This is the code for 80386. It should return the lowest byte of its pid as exit status.

        .global _start
_start: mov $20,%eax    #system call number 20: getpid
        int $128        #call the system
        mov %eax,%ebx   #move pid into register ebx
        mov $1,%eax     #system call number 1: exit, argument in ebx
        int $128        #exit

Assemble with:

as -m32 -o 80386.o 80386.s
ld -m elf_i386 -o 80386 80386.o

This is the same code for amd64:

        .global _start
_start: mov $39,%eax    #system call 39: getpid
        syscall         #call the system
        mov %eax,%edi   #move pid into register edi
        mov $60,%eax    #system call 60: exit
        syscall         #call the system

Assemble with:

as -o amd64.o amd64.s
ld -o amd64 amd64.o

How to make system call from another system call in kernel space

There is no generic way of doing this.

If you are in kernel space, you should directly invoke the kernel functions that implement the system call's functionality instead of using syscall-type instructions, or use other means of extracting the desired information or effecting the desired action.

For the specific case of getpid(), you can simply use current->pid.

The kernel name current is always a pointer to the current task_struct, which is defined via <linux/sched.h> (search for struct task_struct). Code that accesses members of that usually gets inlined, i.e. there's not even a function call (and much less a system call) required to get these when your code is running as part of the kernel.

How to correctly extract a string from a user space pointer in kernel space?

What is most likely happening here is that SMAP (Supervisor Mode Access Prevention) is preventing the kernel from accessing a raw user space pointer, causing a panic.

The correct way to access a string from user space is to copy its content using strncpy_from_user() first. Also, be careful and make sure to correctly terminate the string.

static asmlinkage long our_execl(const char __user * filename,
const char __user * const __user * argv,
const char __user * const __user * envp) {
char buf[256];
buf[255] = '\0';

long res = strncpy_from_user(buf, filename, 255);
if (res > 0)
printk("%s\n", buf);

return original_execl(filename, argv, envp);
}

In this case, since we are specifically talking about a file name, you can use the getname() and putname() functions, which work using a struct filename.

static asmlinkage long our_execl(const char __user * filename,
const char __user * const __user * argv,
const char __user * const __user * envp) {

struct filename *fname = getname(filename);
if (!IS_ERR(fname)) {
printk("%s\n", fname->name);
putname(fname);
}

return original_execl(filename, argv, envp);
}

Get userspace RBP register from kernel syscall

You can use the task_pt_regs() macro to get the current task's user registers (saved at the moment of syscall entry):

#include <asm/processor.h>

SYSCALL_DEFINE1(foo, int, d)
{
const struct pt_regs *user_regs = task_pt_regs(current);
unsigned long rbp = user_regs->bp;

/* Do whatever you need... */

return 0;
}

How is data copied from user space to kernel space and vice versa during I/O tasks?

  1. Since fread calls read underneath, how many read function calls will be invoked respectively?

Because fread() is mostly just slapping a buffer (in user-space, likely in a shared library) in front of read(), the "best case number of read() system calls" will depend on the size of the buffer.

For example; with an 8 KiB buffer; if you read 6 bytes with a single fread(), or if you read 6 individual bytes with 6 fread() calls; then read() will probably be called once (to get up to 8 KiB of data into the buffer).

However; read() may return less data than was requested (and this is very common for some cases - e.g. stdin if the user doesn't type fast enough). This means that fread() might use read() to try to fill its buffer, but read() might only read a few bytes; so fread() needs to call read() again later when it needs more data in its buffer. For a worst case (where read() only happens to return 1 byte each time) reading 6 bytes with a single fread() may cause read() to be called 6 times.
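To make the buffering effect visible without tracing real system calls, the sketch below replaces read() with a counting callback, using glibc's fopencookie() extension. Everything here is an illustration under assumptions: struct source, counting_read() and low_level_reads() are hypothetical names, the callback stands in for read(), and max_chunk simulates a source that returns short reads.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* In-memory data source that counts how often stdio asks it for data. */
struct source {
    const char *data;
    size_t len, pos;
    size_t max_chunk;   /* upper bound on bytes handed out per call */
    int calls;          /* number of low-level read requests seen */
};

/* Stand-in for read(): hands out at most max_chunk bytes per call. */
static ssize_t counting_read(void *cookie, char *buf, size_t size)
{
    struct source *src = cookie;
    size_t n = src->len - src->pos;
    if (n > size)
        n = size;
    if (n > src->max_chunk)
        n = src->max_chunk;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    src->calls++;
    return (ssize_t)n;
}

/* Fetch `want` bytes (at most 26) with one-byte fread() calls and return
 * how many low-level reads stdio issued, or -1 on failure. */
static int low_level_reads(size_t max_chunk, size_t want)
{
    static const char payload[] = "abcdefghijklmnopqrstuvwxyz";
    struct source src = { payload, sizeof payload - 1, 0, max_chunk, 0 };
    cookie_io_functions_t io = { .read = counting_read };
    char out[32];
    size_t got = 0;

    FILE *fp = fopencookie(&src, "r", io);
    if (fp == NULL || want > sizeof payload - 1)
        return -1;
    while (got < want && fread(out + got, 1, 1, fp) == 1)
        got++;
    fclose(fp);
    return got == want ? src.calls : -1;
}
```

When the source can deliver everything at once, as in low_level_reads(26, 6), stdio needs far fewer low-level reads than fread() calls (on glibc one would expect a single one). When the source drips one byte per call, as in low_level_reads(1, 6), at least six low-level reads are unavoidable.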


  2. Is data transfer, whether one single byte or 1 MB between user space buffer and kernel space buffer all done by the kernel and no user/kernel mode switch involved during transferring?

Often, read() (in the C standard library) calls some kind of "sys_read()" function provided by the kernel. In this case there's a switch to kernel when "sys_read()" is called, then the kernel does whatever it needs to to obtain and transfer the data, then there's one switch back from kernel to user-space.

However; nothing says that's how a kernel must work. E.g. a kernel could only provide a "sys_mmap()" (and not provide any "sys_read()") and the read() (in the C standard library) could use "sys_mmap()". For another example; with an exo-kernel, file systems might be implemented as shared libraries (with "file system cache" in shared memory) so a read() done by the C library (of a file's data that is in the "file system cache") may not involve the kernel at all.


  3. How many disk accesses are performed respectively? Won't the kernel buffer come into play during scenario two?

There are too many possibilities to give a single answer. E.g.:

a) If you're reading from a pipe (where the data is in a buffer in the kernel and was previously written by a different process) then there will be no disk accesses (because the data was never on any disk to begin with).

b) If you're reading from a file and the OS cached the file's data already; then there may be no disk accesses.

c) If you're reading from a file and the OS cached the file's data already; but the file system needs to update meta-data (e.g. an "accessed time" field in the file's directory entry) then there may be multiple disk accesses that have nothing to do with the file's data.

d) If you're reading from a file and the OS hasn't cached the file's data; then at least one disk access will be necessary. It doesn't matter if it's caused by fread() attempting to read a whole buffer, read() trying to read all 6 bytes at once, or the OS fetching a whole disk block because of the first "read() of one byte" in a series of six separate "read() of one byte" requests. If the OS does no caching at all, then six separate "read() of one byte" requests will be at least 6 separate disk accesses.

e) file system code may need to access some parts of the disk to determine where the file's data actually is before it can read the file's data; and the requested file data may be split between multiple blocks/sectors on the disk; so reading 2 or more bytes from a file (regardless of whether it was caused by fread() or "read() of 2 or more bytes") could cause several disk accesses.

f) with a RAID 5/6 array involving 2 or more physical disks (where reading a "logical block" involves reading the block from one disk and also reading the parity info from a different disk), the number of disk accesses can be doubled.


  4. The read function ssize_t read(int fd, void *buf, size_t count) also has buffer and count parameters, can these replace the role of user space buffer?

Yes; but if you're using it to replace the role of a user space buffer then you're mostly just implementing your own duplicate of fread().

It's more common to use fread() when you want to treat the data as a stream of bytes, and read() (or maybe mmap()) when you do not want to treat the data as a stream of bytes.

For a random example; maybe you're working with a BMP file; so you read the "guaranteed to be 14 bytes by the file format's spec" header; then check/decode/process the header; then (after determining where it is in the file, how big it is and what format it's in) you might lseek() to the pixel data and read all of it into an array (then maybe spawn 8 threads to process the pixel data in the array).
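A minimal sketch of that BMP access pattern with read() and lseek() might look as follows. The helper names read_exact(), le32() and bmp_load_pixels() are hypothetical; the only format facts relied on are that the file header is 14 bytes, starts with "BM", and stores the pixel data offset as a little-endian 32-bit value at byte 10. Real code would also decode the info header that follows to learn the image size and pixel format.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Read exactly len bytes, retrying after short reads and EINTR.
 * Returns 0 on success, -1 on error or premature end of file. */
static int read_exact(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Decode a little-endian 32-bit value. */
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Read the 14-byte BMP file header, seek to the pixel data and read
 * count bytes of it into pixels. Returns the pixel data offset taken
 * from the header, or -1 on any failure. */
static long bmp_load_pixels(int fd, unsigned char *pixels, size_t count)
{
    unsigned char hdr[14];

    if (read_exact(fd, hdr, sizeof hdr) != 0)
        return -1;
    if (hdr[0] != 'B' || hdr[1] != 'M')
        return -1;

    uint32_t offset = le32(hdr + 10);
    if (lseek(fd, (off_t)offset, SEEK_SET) == (off_t)-1)
        return -1;
    if (read_exact(fd, pixels, count) != 0)
        return -1;
    return (long)offset;
}
```

Each read() here asks for exactly the bytes the program wants next, so there is no stream buffer in between; the caller decides where in the file the next transfer starts.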


