Where Is the System Call Table in Linux Kernel

Where is syscall_table located in the x86_64 kernel?

Kernel versions 3.3 onward differ from the 2.x layout that the guide uses. Look for the syscalls directory under arch/x86/. The table is this file:

less /kernel-src/arch/x86/syscalls/syscall_64.tbl

Here kernel-src is the directory where your kernel sources reside. In newer trees the same file sits under arch/x86/entry/syscalls/ instead. It is also worth reading this answer on SO and comparing it with your resource.
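Because the directory differs between kernel versions, a small lookup helper can save some guessing. The following is only a sketch: the function name find_syscall_table is made up for illustration, and it probes just the two relative paths mentioned on this page.

```python
from pathlib import Path

# Candidate locations relative to the kernel source root, newer layout first.
# Both paths come from this page; no other location is probed.
CANDIDATES = (
    "arch/x86/entry/syscalls/syscall_64.tbl",
    "arch/x86/syscalls/syscall_64.tbl",
)


def find_syscall_table(kernel_src):
    """Return the first existing syscall_64.tbl under kernel_src, or None.

    find_syscall_table is a hypothetical helper, not part of any kernel tool.
    """
    root = Path(kernel_src)
    for rel in CANDIDATES:
        path = root / rel
        if path.is_file():
            return path
    return None
```

Calling find_syscall_table("/kernel-src") returns a Path when one of the two files exists and None otherwise.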

Where are System Call IDs defined for x86 arch in Linux Kernel 5.0.1?

Arch Linux ships unistd_32.h and unistd_64.h in /usr/include/asm/. Just look at those headers unless you're modifying the kernel to add new system calls.

<asm/unistd.h> checks macros to figure out whether it's being included in 32-bit or 64-bit code (and checks for x32), and uses #include to pull in the right set of definitions for the target.

On my up-to-date x86-64 Arch system:

$ pacman -Fo /usr/include/asm/unistd*
usr/include/asm/unistd_32.h is owned by core/linux-api-headers 4.7-1
usr/include/asm/unistd_64.h is owned by core/linux-api-headers 4.7-1
usr/include/asm/unistd.h is owned by core/linux-api-headers 4.7-1
usr/include/asm/unistd_x32.h is owned by core/linux-api-headers 4.7-1
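If you would rather look the numbers up programmatically than read the header by eye, something like the following sketch may help. It assumes the header uses lines of the form "#define __NR_name number" with a plain decimal value; lines whose value is an expression are skipped. The SAMPLE text is a hand-written excerpt in the style of unistd_64.h, not a copy of the real file, and parse_unistd is a made-up name.

```python
import re

# Matches lines such as "#define __NR_read 0" (plain decimal values only).
_NR_DEFINE = re.compile(r"^#define\s+__NR_(\w+)\s+(\d+)\s*$")


def parse_unistd(text):
    """Map syscall names to numbers from the text of a unistd_*.h header."""
    numbers = {}
    for line in text.splitlines():
        match = _NR_DEFINE.match(line.strip())
        if match:
            numbers[match.group(1)] = int(match.group(2))
    return numbers


# Hand-written excerpt for illustration (assumption: real headers look similar).
SAMPLE = """\
#ifndef _ASM_X86_UNISTD_64_H
#define _ASM_X86_UNISTD_64_H 1

#define __NR_read 0
#define __NR_write 1
#define __NR_open 2

#endif /* _ASM_X86_UNISTD_64_H */
"""

print(parse_unistd(SAMPLE))
```

To run it against an installed header, pass the file contents instead of SAMPLE, for example open("/usr/include/asm/unistd_64.h").read().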

In the kernel source itself, starting with version 3.3, the unistd_32.h for use by user-space is built from other files.

https://github.com/torvalds/linux/search?q=unistd_32.h&unscoped_q=unistd_32.h finds this in arch/x86/entry/syscalls/Makefile

$(uapi)/unistd_32.h: $(syscall32) $(syshdr)
	$(call if_changed,syshdr)

The syscall tables are defined in: arch/x86/entry/syscalls/syscall_32.tbl and .../syscall_64.tbl

https://github.com/torvalds/linux/tree/6f0d349d922ba44e4348a17a78ea51b7135965b1/arch/x86/entry/syscalls

The contents of syscall_32.tbl look like:

# some comments
0    i386    restart_syscall    sys_restart_syscall    __ia32_sys_restart_syscall
1    i386    exit               sys_exit               __ia32_sys_exit
2    i386    fork               sys_fork               __ia32_sys_fork
3    i386    read               sys_read               __ia32_sys_read
...
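The table is whitespace-separated, so it is easy to read with a few lines of script. The sketch below is not the kernel's own generator; it assumes the five columns are number, ABI, name, entry point, and compat entry point, and that the last two may be absent. SyscallEntry and parse_tbl are names invented for this example.

```python
from typing import NamedTuple, Optional


class SyscallEntry(NamedTuple):
    number: int
    abi: str
    name: str
    entry_point: Optional[str]          # may be missing for unimplemented calls
    compat_entry_point: Optional[str]   # may be missing


def parse_tbl(text):
    """Parse syscall_*.tbl text into a list of SyscallEntry tuples."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        fields = line.split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue                          # blank, comment-only, or "..."
        entry = fields[3] if len(fields) > 3 else None
        compat = fields[4] if len(fields) > 4 else None
        entries.append(
            SyscallEntry(int(fields[0]), fields[1], fields[2], entry, compat)
        )
    return entries


# The sample rows are the ones quoted above.
SAMPLE = """\
# some comments
0 i386 restart_syscall sys_restart_syscall __ia32_sys_restart_syscall
1 i386 exit sys_exit __ia32_sys_exit
2 i386 fork sys_fork __ia32_sys_fork
3 i386 read sys_read __ia32_sys_read
...
"""

for e in parse_tbl(SAMPLE):
    print(e.number, e.name, e.entry_point)
```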

/ptregs in syscall table

These are special system calls that require a full register dump laid out on the stack (as a struct pt_regs). This only applies to the 64-bit x86 architecture because it has more registers than 32-bit x86.

The system call handler (arch/x86/entry/entry_64.S:entry_SYSCALL_64) saves most of the registers on the stack on system call entry. This is done partly to support ptrace() and partly to pass the arguments to the actual system call handlers written in C (this is why they have the asmlinkage specifier, which makes the function take its arguments from the stack). System calls have at most 6 arguments (rdi, rsi, rdx, r10, r8, r9), and some registers are used for SYSCALL bookkeeping (rax, rcx, r11). There is no need to save rbp, rbx, r12, r13, r14, r15 (they are callee-saved), so they are not saved on entry for performance reasons. After the system call handling completes, the registers are restored from this backup before returning to userspace.

However, some system calls (such as execve(), fork(), and sigreturn()) need to have all registers on the stack (including rbp, rbx, r12–r15), in the struct pt_regs. This is because these system calls can cause userspace to resume execution from a different place, so they need accurate register values saved. They are marked with /ptregs in syscall_64.tbl so that the following magic happens.

Normally the system call handler table (sys_call_table) contains pointers to C functions. But for those special system calls the handlers are small assembly thunks which first save the extra registers and then jump to the C code (this is what the slow-path does). The /ptregs suffix in the table instructs the script to insert these stubs instead of C functions into the handler table.
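To make the suffix handling concrete, here is a simplified model of what such a script has to do with each entry point. It is a sketch only, not the kernel's actual generator: the function name handler_symbol and the "ptregs_" stub prefix are assumptions made for illustration.

```python
def handler_symbol(entry_point):
    """Return the symbol to place in the handler table for one table entry.

    "sys_read"          -> the C function itself
    "sys_execve/ptregs" -> an assembly stub that saves the extra registers first

    The "ptregs_" prefix for stub names is an assumption of this sketch.
    """
    name, sep, qualifier = entry_point.partition("/")
    if not sep:
        return name                     # ordinary call: point at the C function
    if qualifier == "ptregs":
        return "ptregs_" + name         # special call: point at the stub
    raise ValueError("unknown qualifier: " + qualifier)


for entry in ("sys_read", "sys_execve/ptregs"):
    print(entry, "->", handler_symbol(entry))
```

The point of the model is that the C function is unchanged in both cases; only the pointer stored in the table differs.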

How does a 32-bit system call table entry point map to SYSCALL_DEFINE on x86_64

On a 64-bit kernel, SYSCALL_DEFINE0 defines the compat (32-bit) and other ABI (e.g. x32 on x86_64) syscall entry points as aliases for the real 64-bit function. It does not define (and has no way to define; that's not how the preprocessor works) multiple functions built from a single body appearing after the ) of the macro invocation. So __func__ expands to the name of the actual function that has __func__ written in it, not the name of the alias.

For SYSCALL_DEFINEx with x>0, it's more complicated since arguments have to be converted, and I believe wrappers are involved.

You can find all the magic in arch/x86/include/asm/syscall_wrapper.h (under the top-level kernel tree).

If you really want/need there to be separate functions, I believe there's a way to skip the magic and do it. But it makes your code harder to maintain since it may break when the mechanisms behind the magic break. It's likely preferable to probe whether the calling (current) userspace process is 32-bit or 64-bit and act differently according to that.


