Linux Assembly: How to Call Syscall

Hello, world in assembly language with Linux system calls?

"How does $ work in NASM, exactly?" explains how $ - msg gets NASM to calculate the string length as an assemble-time constant for you, instead of hard-coding it.

I originally wrote the rest of this for SO Docs (topic ID: 1164, example ID: 19078), rewriting a basic less-well-commented example by @runner. This looks like a better place to put it than as part of my answer to another question where I had previously moved it after the SO docs experiment ended.

Making a system call is done by putting arguments into registers, then running int 0x80 (32-bit mode) or syscall (64-bit mode). See "What are the calling conventions for UNIX & Linux system calls on i386 and x86-64" and "The Definitive Guide to Linux System Calls".

Think of int 0x80 as a way to "call" into the kernel, across the user/kernel privilege boundary. The kernel does stuff according to the values that were in registers when int 0x80 executed, then eventually returns. The return value is in EAX.

When execution reaches the kernel's entry point, it looks at EAX and dispatches to the right system call based on the call number in EAX. Values from other registers are passed as function args to the kernel's handler for that system call. (e.g. eax=4 / int 0x80 will get the kernel to call its sys_write kernel function, implementing the POSIX write system call.)

And see also What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? - that answer includes a look at the asm in the kernel entry point that is "called" by int 0x80. (Also applies to 32-bit user-space, not just 64-bit where you shouldn't use int 0x80).

If you don't already know low-level Unix systems programming, you might want to just write functions in asm that take args and return a value (or update arrays via a pointer arg) and call them from C or C++ programs. Then you can just worry about learning how to handle registers and memory, without also learning the POSIX system-call API and the ABI for using it. That also makes it very easy to compare your code with compiler output for a C implementation. Compilers usually do a pretty good job at making efficient code, but are rarely perfect.

libc provides wrapper functions for system calls, so compiler-generated code would call write rather than invoking it directly with int 0x80 (or if you care about performance, sysenter). (In x86-64 code, use syscall for the 64-bit ABI.) See also syscalls(2).
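To make the wrapper-versus-raw distinction concrete from C, here is a minimal sketch that reaches the write system call through glibc's generic syscall() function instead of the dedicated write() wrapper. The helper names raw_write and demo_raw_write are made up for this illustration, and the pipe is only there so the result can be checked without touching the terminal.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall(), pipe(), read(), close() */

/* Hypothetical helper: ask the kernel for write by call number,
   bypassing the write() wrapper (but still going through libc's syscall()). */
long raw_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}

/* Sends "hi\n" into a pipe with raw_write and reads it back.
   Returns the number of bytes that made the round trip, or -1 on failure. */
long demo_raw_write(void)
{
    int fds[2];
    char back[8];

    if (pipe(fds) != 0)
        return -1;
    long sent = raw_write(fds[1], "hi\n", 3);
    long got = read(fds[0], back, sizeof back);
    close(fds[0]);
    close(fds[1]);
    return (sent == 3 && got == 3 && memcmp(back, "hi\n", 3) == 0) ? got : -1;
}
```

Running a caller of raw_write under strace should show the same write(...) line that the libc wrapper would produce, since both end up making the same system call.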

System calls are documented in section 2 manual pages, like write(2). See the NOTES section for differences between the libc wrapper function and the underlying Linux system call. Note that the wrapper for sys_exit is _exit(2), not the exit(3) ISO C function that flushes stdio buffers and other cleanup first. There's also an exit_group system call that ends all threads. exit(3) actually uses that, because there's no downside in a single-threaded process.

This code makes 2 system calls:

sys_write(1, "Hello, World!\n", sizeof(...));
sys_exit(0);

I commented it heavily (to the point where it's starting to obscure the actual code without color syntax highlighting). This is an attempt to point things out to total beginners, not how you should comment your code normally.

section .text             ; Executable code goes in the .text section
global _start             ; The linker looks for this symbol to set the process entry point, so execution starts here
;;;a name followed by a colon defines a symbol.  The global _start directive modifies it so it's a global symbol, not just one that we can CALL or JMP to from inside the asm.
;;; note that _start isn't really a "function".  You can't return from it, and the kernel passes argc, argv, and env differently than main() would expect.
 _start:
    ;;; write(1, msg, len);
    ; Start by moving the arguments into registers, where the kernel will look for them
    mov     edx,len       ; 3rd arg goes in edx: buffer length
    mov     ecx,msg       ; 2nd arg goes in ecx: pointer to the buffer
    ;Set output to stdout (goes to your terminal, or wherever you redirect or pipe)
    mov     ebx,1         ; 1st arg goes in ebx: Unix file descriptor. 1 = stdout, which is normally connected to the terminal.

    mov     eax,4         ; system call number (from SYS_write / __NR_write from unistd_32.h).
    int     0x80          ; generate an interrupt, activating the kernel's system-call handling code.  64-bit code uses a different instruction, different registers, and different call numbers.
    ;; eax = return value, all other registers unchanged.

    ;;;Second, exit the process.  There's nothing to return to, so we can't use a ret instruction (like we could if this was main() or any function with a caller)
    ;;; If we don't exit, execution continues into whatever bytes are next in the memory page,
    ;;; typically leading to a segmentation fault because the padding 00 00 decodes to  add [eax],al.

    ;;; _exit(0);
    xor     ebx,ebx       ; first arg = exit status = 0.  (will be truncated to 8 bits).  Zeroing registers is a special case on x86, and mov ebx,0 would be less efficient.
                      ;; leaving out the zeroing of ebx would mean we exit(1), i.e. with an error status, since ebx still holds 1 from earlier.
    mov     eax,1         ; put __NR_exit into eax
    int     0x80          ; enter the kernel again to run the exit system call.  This one doesn't return.

section     .rodata       ; Section for read-only constants

             ;; msg is a label, and in this context doesn't need to be msg:.  It could be on a separate line.
             ;; db = Data Bytes: assemble some literal bytes into the output file.
msg     db  'Hello, world!',0xa     ; ASCII string constant plus a newline (0xa = 10)

             ;;  No terminating zero byte is needed, because we're using write(), which takes a buffer + length instead of an implicit-length string.
             ;; To make this a C string that we could pass to puts or strlen, we'd need a terminating 0 byte. (e.g. "...", 0xa, 0)

len     equ $ - msg       ; Define an assemble-time constant (not stored by itself in the output file, but will appear as an immediate operand in insns that use it)
                          ; Calculate len = string length.  subtract the address of the start
                          ; of the string from the current position ($)
  ;; equivalently, we could have put a str_end: label after the string and done   len equ str_end - msg

Notice that we don't store the string length in data memory anywhere. It's an assemble-time constant, so it's more efficient to have it as an immediate operand than a load. We could also have pushed the string data onto the stack with four push imm32 instructions (the 14 bytes round up to 16), but bloating the code-size too much isn't a good thing.
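For readers coming from C, the closest counterpart to len equ $ - msg is sizeof applied to an array. The sketch below is only an analogy, not part of the original example; MSG_LEN, msg_len and msg_text are names invented here.

```c
#include <stddef.h>

/* An array, not a pointer, so sizeof sees every byte of the string,
   including the terminating '\0' that the C literal adds. */
static const char msg[] = "Hello, world!\n";

/* Compile-time constant: subtract 1 to drop the '\0', which write() doesn't need. */
enum { MSG_LEN = sizeof msg - 1 };

/* Accessors so other code can use the string and its length. */
size_t msg_len(void) { return MSG_LEN; }
const char *msg_text(void) { return msg; }
```

As with the NASM equ, the length never occupies storage of its own; the compiler can fold it into an immediate wherever it is used.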

On Linux, you can save the assembly source above as Hello.asm and build a 32-bit executable from it with these commands:

nasm -felf32 Hello.asm                  # assemble as 32-bit code.  Add -Worphan-labels -g -Fdwarf  for debug symbols and warnings
gcc -static -nostdlib -m32 Hello.o -o Hello     # link without CRT startup code or libc, making a static binary

See this answer for more details on building assembly into 32 or 64-bit static or dynamically linked Linux executables, for NASM/YASM syntax or GNU AT&T syntax with GNU as directives. (Key point: make sure to use -m32 or equivalent when building 32-bit code on a 64-bit host, or you will have confusing problems at run-time.)

You can trace its execution with strace to see the system calls it makes:

$ strace ./Hello 
execve("./Hello", ["./Hello"], [/* 72 vars */]) = 0
[ Process PID=4019 runs in 32 bit mode. ]
write(1, "Hello, world!\n", 14Hello, world!
)         = 14
_exit(0)                                = ?
+++ exited with 0 +++

Compare this with the trace for a dynamically linked process (like gcc makes from hello.c, or from running strace /bin/ls) to get an idea just how much stuff happens under the hood for dynamic linking and C library startup.

The trace on stderr and the regular output on stdout are both going to the terminal here, so they interfere in the line with the write system call. Redirect or trace to a file if you care. Notice how this lets us easily see the syscall return values without having to add code to print them, and is actually even easier than using a regular debugger (like gdb) to single-step and look at eax for this. See the bottom of the x86 tag wiki for gdb asm tips. (The rest of the tag wiki is full of links to good resources.)

The x86-64 version of this program would be extremely similar, passing the same args to the same system calls, just in different registers and with syscall instead of int 0x80. See the bottom of What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for a working example of writing a string and exiting in 64-bit code.

related: A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux. The smallest binary file you can run that just makes an exit() system call. That is about minimizing the binary size, not the source size or even just the number of instructions that actually run.

reference of syscall in asm

System calls are defined at the kernel level (OS specific) for each CPU architecture. The code you provided is x86_64 assembly, so that is your target CPU architecture. Based on your example you are using a Linux kernel. A detailed list of native system calls for x86_64 on Linux can be found here: https://filippo.io/linux-syscall-table/

You can actually edit this table in your kernel source (and rebuild the kernel) to create your own system calls, but be very careful when doing so! Kernel-level programming can be quite dangerous. The system call table on Linux lives in the arch/x86/syscalls directory of your kernel source tree (arch/x86/entry/syscalls in newer kernel versions).

cat /kernel-src/arch/x86/syscalls/syscall_64.tbl

As mentioned by @PeterCordes you can also find system call numbers on your machine in asm/unistd.h, which in the case of my machine was found in /usr/include/x86_64-linux-gnu/asm/unistd_64.h. If you are interested you should be able to find x86 calls in the same directory.
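A quick way to check these numbers against your own headers is to let the compiler read them. The sketch below assumes glibc's <sys/syscall.h>, which defines SYS_* names from the kernel's __NR_* values; the function name is invented here, and the value 60 for exit on x86-64 comes from unistd_64.h rather than from the text above.

```c
#include <sys/syscall.h>   /* SYS_write, SYS_exit (pulls in asm/unistd.h) */

/* Returns 1 if the call numbers quoted for i386 and x86-64 match the
   installed headers for the current target, 0 otherwise. Other targets
   are outside the scope of this sketch and are reported as matching. */
int numbers_match_headers(void)
{
#if defined(__x86_64__) && !defined(__ILP32__)
    return SYS_write == 1 && SYS_exit == 60;    /* unistd_64.h */
#elif defined(__i386__)
    return SYS_write == 4 && SYS_exit == 1;     /* unistd_32.h */
#else
    return 1;
#endif
}
```

Compiling the same file with and without -m32 selects the 32-bit or 64-bit table, which is an easy way to see that the two ABIs number their calls differently.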

How to invoke a system call via syscall or sysenter in inline assembly?

First of all, you can't safely use GNU C Basic asm(""); syntax for this (without input/output/clobber constraints). You need Extended asm to tell the compiler about registers you modify. See the inline asm in the GNU C manual and the inline-assembly tag wiki for links to other guides for details on what things like "D"(1) means as part of an asm() statement.

You also need asm volatile because that's not implicit for Extended asm statements with 1 or more output operands.

I'm going to show you how to execute system calls by writing a program that writes Hello World! to standard output by using the write() system call. Here's the source of the program without an implementation of the actual system call:

#include <sys/types.h>

ssize_t my_write(int fd, const void *buf, size_t size);

int main(void)
{
    const char hello[] = "Hello world!\n";
    my_write(1, hello, sizeof(hello) - 1);   // -1: don't write the terminating '\0'
    return 0;
}

You can see that I named my custom system call function as my_write in order to avoid name clashes with the "normal" write, provided by libc. The rest of this answer contains the source of my_write for i386 and amd64.

i386

System calls in i386 Linux are implemented using interrupt vector 128 (0x80), i.e. by executing int 0x80 in your assembly code, having set the parameters accordingly beforehand, of course. It is possible to do the same via SYSENTER, but actually executing this instruction is achieved by the VDSO virtually mapped to each running process. Since SYSENTER was never meant as a direct replacement of the int 0x80 API, it's never directly executed by userland applications - instead, when an application needs to access some kernel code, it calls the virtually mapped routine in the VDSO (that's what the call *%gs:0x10 in your code is for), which contains all the code supporting the SYSENTER instruction. There's quite a lot of it because of how the instruction actually works.

If you want to read more about this, have a look at this link. It contains a fairly brief overview of the techniques applied in the kernel and the VDSO. See also The Definitive Guide to (x86) Linux System Calls - some system calls like getpid and clock_gettime are so simple the kernel can export code + data that runs in user-space so the VDSO never needs to enter the kernel, making it much faster even than sysenter could be.

It's much easier to use the slower int $0x80 to invoke the 32-bit ABI.

// i386 Linux
#include <asm/unistd.h>      // compile with -m32 for 32 bit call numbers
//#define __NR_write 4
ssize_t my_write(int fd, const void *buf, size_t size)
{
    ssize_t ret;
    asm volatile
    (
        "int $0x80"
        : "=a" (ret)
        : "0"(__NR_write), "b"(fd), "c"(buf), "d"(size)
        : "memory"    // the kernel dereferences pointer args
    );
    return ret;
}

As you can see, using the int 0x80 API is relatively simple. The number of the syscall goes to the eax register, while all the parameters needed for the syscall go into respectively ebx, ecx, edx, esi, edi, and ebp. System call numbers can be obtained by reading the file /usr/include/asm/unistd_32.h.

Prototypes and descriptions of the functions are available in the 2nd section of the manual, so in this case write(2).

The kernel saves/restores all the registers (except EAX) so we can use them as input-only operands to the inline asm. See What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64

Keep in mind that the clobber list also contains "memory", which tells the compiler that the asm statement may read or write memory that isn't named by any operand (here, the bytes that buf points to). (A pointer input to inline asm does not imply that the pointed-to memory is also an input. See How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)

amd64

Things look different on the AMD64 architecture which sports a new instruction called SYSCALL. It is very different from the original SYSENTER instruction, and definitely much easier to use from userland applications - it really resembles a normal CALL, actually, and adapting the old int 0x80 to the new SYSCALL is pretty much trivial. (Except it uses RCX and R11 instead of the kernel stack to save the user-space RIP and RFLAGS so the kernel knows where to return).

In this case, the number of the system call is still passed in the register rax, but the registers used to hold the arguments now nearly match the function calling convention: rdi, rsi, rdx, r10, r8 and r9 in that order. (syscall itself destroys rcx so r10 is used instead of rcx, letting libc wrapper functions just use mov r10, rcx / syscall.)

// x86-64 Linux
#include <asm/unistd.h>      // compile without -m32 for 64 bit call numbers
// #define __NR_write 1
ssize_t my_write(int fd, const void *buf, size_t size)
{
    ssize_t ret;
    asm volatile
    (
        "syscall"
        : "=a" (ret)
        //                 EDI      RSI       RDX
        : "0"(__NR_write), "D"(fd), "S"(buf), "d"(size)
        : "rcx", "r11", "memory"
    );
    return ret;
}

(See it compile on Godbolt)

Do notice how practically the only thing that needed changing were the register names, and the actual instruction used for making the call. This is mostly thanks to the input/output lists provided by gcc's extended inline assembly syntax, which automagically provides appropriate move instructions needed for executing the instruction list.

The "0"(callnum) matching constraint could be written as "a" because operand 0 (the "=a"(ret) output) only has one register to pick from; we know it will pick EAX. Use whichever you find more clear.
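The write example only needs three arguments, so r10 never shows up in it. As a hedged sketch of a four-argument call, the code below issues pread64 with the offset in r10. GCC has no single-letter constraint for r10, so a local register variable is used instead. The names my_pread and demo_pread are invented for this illustration, and targets other than x86-64 simply fall back to the libc pread() wrapper.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: pread64(fd, buf, count, offset) as a raw system call. */
ssize_t my_pread(int fd, void *buf, size_t count, off_t offset)
{
#if defined(__x86_64__) && !defined(__ILP32__)
    long ret;
    /* Pin the 4th argument to r10 (the kernel reads it there, not from rcx). */
    register long arg4 __asm__("r10") = (long)offset;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)
        : "0"((long)SYS_pread64), "D"((long)fd), "S"(buf), "d"(count), "r"(arg4)
        : "rcx", "r11", "memory");
    return ret;                              /* raw value: -errno on failure */
#else
    return pread(fd, buf, count, offset);    /* libc wrapper: -1 and errno on failure */
#endif
}

/* Writes "abcdef" to a temporary file and reads 3 bytes at offset 2.
   Returns 0 if they come back as "cde", nonzero otherwise. */
int demo_pread(void)
{
    char buf[3];
    FILE *f = tmpfile();

    if (!f)
        return -1;
    fputs("abcdef", f);
    fflush(f);
    ssize_t n = my_pread(fileno(f), buf, sizeof buf, 2);
    int ok = (n == 3 && memcmp(buf, "cde", 3) == 0);
    fclose(f);
    return ok ? 0 : 1;
}
```

If the offset ended up in rcx instead of r10, the read would start at whatever r10 happened to hold, so checking for "cde" exercises the fourth argument.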

Note that non-Linux OSes, like MacOS, use different call numbers, and even different arg-passing conventions for 32-bit.

Where can I get a list of syscall functions for x86 Assembly in Linux

strace has tables where these are listed. You can find the x86_64 calls here.

Assembly and System Calls

Here's a trick to make progress quickly with these aspects of assembly: ask a C compiler to show you how it does it! Write a C program that does what you want to do and type gcc -S.

Example:

Manzana:ppc pascal$ cat t.c
#define NULL ((void*)0)
char *args[] = { "foo", NULL } ;
char *env[] = { "PATH=/bin", NULL } ;


int execve(const char *filename, char *const argv[], char *const envp[]);

int main()
{

  execve("/bin/bash", args, env);

}

then:

Manzana:ppc pascal$ gcc -S -fno-PIC t.c  # added no-PIC for readability of generated code
Manzana:ppc pascal$ cat t.s
.globl _args
    .cstring
LC0:
    .ascii "foo\0"
    .data
    .align 2
_args:
    .long   LC0
    .long   0
.globl _env
    .cstring
LC1:
    .ascii "PATH=/bin\0"
    .data
    .align 2
_env:
    .long   LC1
    .long   0
    .cstring
LC2:
    .ascii "/bin/bash\0"
    .text
.globl _main
_main:
    pushl   %ebp
    movl    %esp, %ebp
    subl    $24, %esp
    movl    $_env, 8(%esp)
    movl    $_args, 4(%esp)
    movl    $LC2, (%esp)
    call    _execve
    leave
    ret
    .subsections_via_symbols

Understanding Linux x86_64 Syscall Implementation in NASM

The 7th arg is passed on the stack in x86-64 System V.

This is taking the first C arg and putting it in RAX, then copying the next 6 C args to the 6 arg-passing registers of the kernel's system-call calling convention. (Like functions, but with R10 instead of RCX).

The only reason the craptastic glibc syscall() function exists / is written that way is because there's no way to tell C compilers about a custom calling convention where an arg is also passed in RAX. That wrapper makes it look just like any other C function with a prototype.

It's fine for messing around with new system calls, but as you noted it's inefficient. If you wanted something better in C, use inline asm macros for your ISA, e.g. https://github.com/linux-on-ibm-z/linux-syscall-support/blob/master/linux_syscall_support.h. Inline asm is hard, and historically some syscall1 / syscall2 (per number of args) macros have been missing things like a "memory" clobber to tell the compiler that pointed-to memory could also be an input or output. That github project is safe and has code for various ISAs. (Some missed optimizations, like could use a dummy input operand instead of a full "memory" clobber... But that's irrelevant to asm)

Of course, you can do much better if you're writing in asm:

Just use the syscall instruction directly with args in the right registers (RDI, RSI, RDX, R10, R8, R9) instead of call _syscall with the function-calling convention. Calling the wrapper is strictly worse than just inlining the syscall instruction: with syscall you know that registers are unmodified except for RAX (return value) and RCX/R11 (syscall itself uses them to save RIP and RFLAGS before kernel code runs). And it would take just as much code to get args into registers for a function call as it would for syscall.

If you do want a wrapper function at all (e.g. to cmp rax, -4095 / jae handle_syscall_error afterwards and maybe set errno), use the same calling convention for it as the kernel expects, so the first instruction can be syscall, not all that stupid shuffling of args over by 1.
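The cmp rax, -4095 / jae check can be written in C as an unsigned comparison. This is a minimal sketch of the kind of fix-up a wrapper might do, not glibc's actual code; the function name is made up.

```c
#include <errno.h>

/* Hypothetical helper: turn a raw kernel return value into the libc
   convention. Values in -4095..-1 are error codes; as unsigned numbers
   they are exactly the values >= (unsigned long)-4095. */
long syscall_ret_to_libc(long ret)
{
    if ((unsigned long)ret >= (unsigned long)-4095L) {
        errno = (int)-ret;   /* e.g. -14 becomes errno = EFAULT */
        return -1;
    }
    return ret;              /* success: pass the value through unchanged */
}
```

The unsigned compare is what lets one branch cover the whole error range while still passing through large "negative-looking" successes such as high mmap addresses.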

Functions in asm (that you only need to call from asm) can use whatever calling convention is convenient. It's a good idea to use a good standard one most of the time, but any "obviously special" function can certainly use a special convention.

linux x86_64 nasm assembly syscalls

Look at the Linux man pages (section 2). http://man7.org/linux/man-pages/dir_section_2.html

It doesn't matter what assembler (or C compiler) you use to create x86-64 machine code, the system calls you can make are the same. (Put a call number in RAX and run the syscall instruction; inside the kernel it uses that number to index a table of function pointers. Or returns -ENOSYS if it's out of range.)

Debug your program with strace ./my_program to trace the system calls it makes. This decodes the args and return values into meaningful stuff on a per-call basis, so you can easily see if you passed a bad pointer making the syscall return -EFAULT for example. (System calls don't raise SIGSEGV / segfault, they just return an error.)
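To see the "error, not segfault" behaviour for yourself, here is a small sketch that passes an obviously bad pointer to write. It assumes address 1 is unmapped, which is the normal situation on Linux, and it writes into a pipe rather than stdout so that the kernel really has to copy from the buffer. The function name is invented here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Calls write with a bogus buffer address and returns the resulting errno
   (0 if the call unexpectedly succeeded, -1 if the pipe couldn't be made). */
int errno_from_bad_write(void)
{
    int fds[2];

    if (pipe(fds) != 0)
        return -1;
    errno = 0;
    /* The kernel fails to copy 1 byte from address 1 and reports an error;
       user space never dereferences the pointer, so no SIGSEGV is raised. */
    long ret = syscall(SYS_write, fds[1], (const void *)1, 1);
    int saved = (ret == -1) ? errno : 0;
    close(fds[0]);
    close(fds[1]);
    return saved;
}
```

Under strace the failing call is expected to show up as a write line ending in = -1 EFAULT (Bad address).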

/usr/include/asm/unistd_64.h has the actual numbers. (Included by <asm/unistd.h> when compiling for 64-bit). The man pages will document the args in terms of C syntax. Given the C prototype, you can work out the asm ABI according to the x86-64 System V ABI. (Same as the function-call ABI except with R10 instead of RCX for the 4th arg, if present.) What are the calling conventions for UNIX & Linux system calls on i386 and x86-64

syscall(2) is a glibc wrapper function for system calls, and the syscall man page also documents its asm ABI for various Linux platforms (x86-64, SPARC, ARM, etc.), including registers for the call number and ret val, and the instruction for entering the kernel. Note that the function name being the same as the x86-64 syscall instruction is just a coincidence.

Nobody bothers to make exhaustive documentation for every system call for every different flavour of asm syntax - the information is all there in the man pages plus the calling convention doc; the NOTES section of the Linux man pages documents differences between the C library wrapper API vs. the underlying asm system call.

See also https://blog.packagecloud.io/eng/2016/04/05/the-definitive-guide-to-linux-system-calls/ for more including VDSO stuff for efficient getpid / clock_gettime without even entering the kernel.

However, some people do compile tables of system call name and Linux x86-64 call number and arg registers. I've never found that useful (the syscall calling convention is so close to the function calling convention that it's easy to remember), but https://blog.rchapman.org/posts/Linux_System_Call_Table_for_x86_64/ is there if you want it.

Notable differences between the POSIX function and the raw Linux system call exist for a couple of calls: for example brk / sbrk, and also getpriority, where the "nice" level return values are biased so they're not in the -4095..-1 range of error codes. But most system calls have an ABI that exactly matches the C library wrapper prototype, in which case the NOTES section doesn't mention anything.
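The getpriority bias is easy to observe from C by comparing the raw call with the libc wrapper. This sketch assumes the usual Linux behaviour described in getpriority(2), where the kernel returns 20 minus the nice value (so 1..40) and the wrapper undoes that; both function names are invented here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/resource.h>   /* getpriority(), PRIO_PROCESS */
#include <sys/syscall.h>    /* SYS_getpriority */
#include <unistd.h>         /* syscall() */

/* Raw kernel result for the calling process: biased, so never negative. */
long raw_getpriority(void)
{
    return syscall(SYS_getpriority, PRIO_PROCESS, 0);
}

/* libc view: the real nice value, which may legitimately be negative. */
int libc_nice_level(void)
{
    errno = 0;   /* -1 is a valid nice value, so errno is the only error signal */
    return getpriority(PRIO_PROCESS, 0);
}
```

For a process at the default nice level of 0, the raw call is expected to return 20 while the wrapper returns 0.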