System Calls: Difference Between sys_exit(), SYS_exit and exit()

System calls: difference between sys_exit(), SYS_exit and exit()

I'll use exit() as in your example although this applies to all system calls.

The functions of the form sys_exit() are the actual entry points to the kernel routine that implements the function you think of as exit(). These symbols are not even available to user-mode programmers. That is, unless you are hacking the kernel, you cannot link to these functions because their symbols are not available outside the kernel. If I wrote libmsw.a which had a file scope function like

static int msw_func(void) { return 0; }

defined in it, you would have no success trying to link to it because it is not exported in the libmsw symbol table; that is:

cc your_program.c libmsw.a

would yield an error like:

ld: cannot resolve symbol msw_func

because it isn't exported; the same applies for sys_exit() as contained in the kernel.

In order for a user program to get to kernel routines, the syscall(2) interface needs to be used to effect a switch from user mode to kernel mode. When that mode switch (sometimes called a trap) occurs, a small integer is used to look up the proper kernel routine in a kernel table that maps integers to kernel functions. An entry in the table has the form

{SYS_exit, sys_exit},

where SYS_exit is a preprocessor macro which is

#define SYS_exit (1)

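and has been 1 on 32-bit x86 since before you were born because there hasn't been a reason to change it. (The number is per-architecture; on x86-64 it is 60.) It sits at the very start of the table of system calls, and because the table is indexed by call number, the lookup is a simple array index.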

As you note in your question, the proper way for a regular user-mode program to access sys_exit is through the thin wrapper in glibc (or similar core library). The only reason you'd ever need to mess with SYS_exit or sys_exit is if you were writing kernel code.
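To make the lookup concrete, here is a toy model of that table in Python. It is only a sketch: the real table is kernel C code, and `trap`, the list layout, and the `-ENOSYS` string are illustrative assumptions, not kernel names or behaviour copied from source.

```python
# Toy model only: the real table lives inside the kernel and is written in C.
SYS_exit = 1  # call number from the text (32-bit x86)


def sys_exit(status):
    """Stand-in for the kernel routine; it just reports what it was asked to do."""
    return f"sys_exit({status})"


# Indexed by call number; slot 0 is left empty in this toy model.
sys_call_table = [None, sys_exit]


def trap(number, *args):
    """Model of the mode switch: find the handler by number and call it."""
    if not 0 <= number < len(sys_call_table) or sys_call_table[number] is None:
        return "-ENOSYS"  # hypothetical marker for "no such system call"
    return sys_call_table[number](*args)


print(trap(SYS_exit, 0))
```

The user-mode side only ever supplies the integer; the function symbol stays private to the table's owner, which is the point made above about sys_exit() not being linkable.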

What is the difference between calling ret vs calling the sys_exit number assembly gcc

The code that calls main looks like this:

int status = main(argc, argv, envp);
exit(status);

If main returns, exit(status) is executed. exit is a C library function which flushes all stdio streams, invokes atexit() handlers and finally calls _exit(status), which is the C wrapper for the SYS_exit system call. If you use the C runtime (e.g. by having your program start at main or by using any libc functions), I strongly recommend that you never invoke SYS_exit directly, so the C runtime has a chance to deinitialize the program correctly. The best idea is usually to call exit() or to return from main unless you know exactly what you are doing.
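The same split between "run cleanup, then exit" and "exit immediately" can be observed from Python. This is an analogue rather than the C runtime itself: sys.exit() stands in for exit() and os._exit() for _exit(), and the handler is registered with Python's own atexit module. The helper name `run_child` and the `CHILD` template are made up for this sketch.

```python
import subprocess
import sys

# Source of the child program; {exit_call} is filled in by run_child().
CHILD = """
import atexit, os, sys
atexit.register(lambda: sys.stderr.write("handler ran\\n"))
{exit_call}
"""


def run_child(exit_call):
    """Run a fresh interpreter that registers a handler, then exits via exit_call."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD.format(exit_call=exit_call)],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    print(run_child("sys.exit(3)"))   # cleanup path: the handler gets to run
    print(run_child("os._exit(3)"))   # immediate exit: the handler is skipped
```

Both children report exit status 3 to the parent; only the first one gives the registered handler a chance to run.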

On x64 Linux, what is the difference between syscall, int 0x80 and ret to exit a program?

If you use printf or other libc functions, it's best to ret from main or call exit. (Which are equivalent; main's caller will call the libc exit function.)

If not, i.e. if you were only making other raw system calls like write with syscall, it's also appropriate and consistent to exit that way, but either ret or call exit is 100% fine in main.

If you want to work without libc at all, e.g. put your code under _start: instead of main: and link with ld or gcc -static -nostdlib, then you can't use ret. Use mov eax, 231 (__NR_exit_group) / syscall.

main is a real & normal function like any other (called with a valid return address), but _start (the process entry point) isn't. On entry to _start, the stack holds argc and argv, so trying to ret would set RIP=argc, and then code-fetch would segfault on that unmapped address. (See "Nasm segmentation fault on RET in _start".)

System call vs. ret-from-main

Exiting via a system call is like calling _exit() in C - skip atexit() and libc cleanup, notably not flushing any buffered stdout output (line buffered on a terminal, full-buffered otherwise).
This leads to symptoms such as Using printf in assembly leads to empty output when piping, but works on the terminal (or if your output doesn't end with \n, even on a terminal.)

main is a function, called (indirectly) from CRT startup code. (Assuming you link your program normally, like you would a C program.) Your hand-written main works exactly like a compiler-generated C main function would. Its caller (__libc_start_main) really does do something like int result = main(argc, argv); exit(result); which in asm is e.g. call rax (pointer passed by _start) / mov edi, eax / call exit.

So returning from main is exactly¹ like calling exit.

See Syscall implementation of exit() for a comparison of the relevant C functions, exit vs. _exit vs. exit_group and the underlying asm system calls.
C question: What is the difference between exit and return? is primarily about exit() vs. return, although there is mention of calling _exit() directly, i.e. just making a system call. It's applicable because C main compiles to an asm main just like you'd write by hand.

Footnote 1: You can invent a hypothetical intentionally weird case where it's different. e.g. you used stack space in main as your stdio buffer with sub rsp, 1024 / mov rsi, rsp / ... / call setvbuf. Then returning from main would involve putting RSP above that buffer, and __libc_start_main's call to exit could overwrite some of that buffer with return addresses and locals before execution reached the fflush cleanup. This mistake is more obvious in asm than C because you need leave or mov rsp, rbp or add rsp, 1024 or something to point RSP at your return address.

In C++, return from main runs destructors for its locals (before global/static exit stuff), exit doesn't. But that just means the compiler makes asm that does more stuff before actually running the ret, so it's all manual in asm, like in C.

The other difference is of course the asm / calling-convention details: exit status in EAX (return value) or EDI (first arg), and of course to ret you have to have RSP pointing at your return address, like it was on function entry. With call exit you don't, and you can even do a conditional tailcall of exit like jne exit. Since it's a noreturn function, you don't really need RSP pointing at a valid return address. (RSP should be aligned by 16 before a call, though, or RSP%16 = 8 before a tailcall, matching the alignment after call pushes a return address. It's unlikely that exit / fflush cleanup will do any alignment-required stores/loads to the stack, but it's a good habit to get this right.)

(This whole footnote is about ret vs. call exit, not syscall, so it's a bit of a tangent from the rest of the answer. You can also run syscall without caring where the stack-pointer points.)

`SYS_exit` vs. `SYS_exit_group` raw system calls

The raw SYS_exit system call is for exiting the current thread, like pthread_exit().

(eax=60 / syscall, or eax=1 / int 0x80).

SYS_exit_group is for exiting the whole program, like _exit.

(eax=231 / syscall, or eax=252 / int 0x80).

In a single-threaded program you can use either, but conceptually exit_group makes more sense to me if you're going to use raw system calls. glibc's _exit() wrapper function actually uses the exit_group system call (since glibc 2.3). See Syscall implementation of exit() for more details.
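As a rough illustration of the raw exit_group call, the sketch below has a child Python process invoke it through libc's syscall(2) wrapper via ctypes and lets the parent read the resulting status. It assumes 64-bit Linux; 231 is the x86-64 number from the text, while the aarch64 value of 94 is an addition that does not come from the original answer. `exit_via_raw_syscall`, `EXIT_GROUP_NR` and `CHILD` are names made up for the example.

```python
import platform
import subprocess
import sys

# Call numbers are per-architecture. 231 is from the text (x86-64);
# 94 for aarch64 is an assumption added for this sketch.
EXIT_GROUP_NR = {"x86_64": 231, "aarch64": 94}

CHILD = """
import ctypes
libc = ctypes.CDLL(None)  # symbols of the C library already loaded in the process
# Raw exit_group: no atexit handlers, no buffer flushing, all threads exit.
libc.syscall(ctypes.c_long({nr}), ctypes.c_long({status}))
"""


def exit_via_raw_syscall(status):
    """Return the child's exit status, or None on an unsupported platform."""
    nr = EXIT_GROUP_NR.get(platform.machine())
    if not sys.platform.startswith("linux") or nr is None:
        return None
    child = CHILD.format(nr=nr, status=status)
    return subprocess.run([sys.executable, "-c", child]).returncode


if __name__ == "__main__":
    print(exit_via_raw_syscall(7))
```

Hard-coding the number like this mirrors what most hand-written asm does; in C or preprocessed asm the SYS_exit_group constant from sys/syscall.h is the self-documenting choice.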

However, nearly all the hand-written asm you'll ever see uses SYS_exit¹. It's not "wrong", and SYS_exit is perfectly acceptable for a program that didn't start more threads. Especially if you're trying to save code size with xor eax,eax / inc eax (3 bytes in 32-bit mode) or push 60 / pop rax (3 bytes in 64-bit mode), while push 231/pop rax would be even larger than mov eax,231 because it doesn't fit in a signed imm8.

Note 1: (Usually actually hard-coding the number, not using __NR_... constants from asm/unistd.h or their SYS_... names from sys/syscall.h)

And historically, it's all there was. Note that in unistd_32.h, __NR_exit has call number 1, but __NR_exit_group = 252 wasn't added until years later when the kernel gained support for tasks that share virtual address space with their parent, aka threads started by clone(2). This is when SYS_exit conceptually became "exit current thread". (But one could easily and convincingly argue that in a single-threaded program, SYS_exit does still mean exit the whole program, because it only differs from exit_group if there are multiple threads.)

To be honest, I've never used eax=252 / int 0x80 in anything, only ever eax=1. It's only in 64-bit code where I often use mov eax,231 instead of mov eax,60 because neither number is "simple" or memorable the way 1 is, so might as well be a cool guy and use the "modern" exit_group way in my single-threaded toy program / experiment / microbenchmark / SO answer. :P (If I didn't enjoy tilting at windmills, I wouldn't spend so much time on assembly, especially on SO.)

And BTW, I usually use NASM for one-off experiments so it's inconvenient to use pre-defined symbolic constants for call numbers; with GCC to preprocess a .S before running GAS you can make your code self-documenting with #include <sys/syscall.h> so you can use mov $SYS_exit_group, %eax (or $__NR_exit_group), or mov eax, __NR_exit_group with .intel_syntax noprefix.

Don't use the 32-bit `int 0x80` ABI in 64-bit code:

What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? explains what happens if you use the COMPAT_IA32_EMULATION int 0x80 ABI in 64-bit code.

It's totally fine for just exiting, as long as your kernel has that support compiled in, otherwise it will segfault just like any other random int number like int 0x7f. (e.g. on WSL1, or people that built custom kernels and disabled that support.)

But the only reason you'd do it that way in asm would be so you could build the same source file with nasm -felf32 or nasm -felf64. (You can't use syscall in 32-bit code, except on some AMD CPUs which have a 32-bit version of syscall. And the 32-bit ABI uses different call numbers anyway so this wouldn't let the same source be useful for both modes.)

Why am I allowed to exit main using ret? - CRT startup code calls main; you're not returning directly to the kernel.
Nasm segmentation fault on RET in _start - you can't ret from _start.
Using printf in assembly leads to empty output when piping, but works on the terminal - stdout buffer (not) flushing with raw system call exit.
Syscall implementation of exit() - call exit vs. mov eax,60/syscall (_exit) vs. mov eax,231/syscall (exit_group).
Can't call C standard library function on 64-bit Linux from assembly (yasm) code - modern Linux distros config GCC in a way that call exit or call puts won't link with nasm -felf64 foo.asm && gcc foo.o.
Is main() really start of a C++ program? - Ciro's answer is a deep dive into how glibc + its CRT startup code actually call main (including x86-64 asm disassembly in GDB), and shows the glibc source code for __libc_start_main.
Linux x86 Program Start Up, or - How the heck do we get to main()? - 32-bit asm, and more detail than you'll probably want until you're a lot more comfortable with asm, but if you've ever wondered why CRT runs so much code before getting to main, that covers what's happening at a level that's a couple steps up from using GDB with starti (stop at the process entry point, e.g. in the dynamic linker's _start) and stepi until you get to your own _start or main.
https://stackoverflow.com/tags/x86/info - lots of good links about this and everything else.

Syscall implementation of exit()

The Linux and glibc man pages document all of this (See especially the "C library/kernel differences" in the NOTES section).

_exit(2): In glibc 2.3 and later, this wrapper function actually uses the Linux SYS_exit_group system call to exit all threads. Before glibc 2.3, it was a wrapper for SYS_exit to exit just the current thread.
exit_group(2): glibc wrapper for SYS_exit_group, which exits all threads.
exit(3): The ISO C89 function which flushes buffers and then exits the whole process. (It always uses exit_group() because there's no benefit to checking if the process was single-threaded and deciding to use SYS_exit vs. SYS_exit_group). As @Matteo points out, recent ISO C / POSIX standards are thread-aware and one or both probably require this behaviour.
But apparently exit(3) itself is not thread-safe (in the C library cleanup parts), so I guess don't call it from multiple threads at once.
syscall / int 0x80 with SYS_exit: terminates just the current thread, leaving others running. AFAIK, modern glibc has no thin wrapper function for this Linux system call, but I think pthread_exit() uses it if this isn't the last thread. (Otherwise exit(3) -> exit_group(2).)

Only exit(), not _exit() or exit_group(), flushes stdout, leading to "printf doesn't print anything" problems in newbie asm programs if writing to a pipe (which makes stdout full-buffered instead of line-buffered), or if you forgot the \n in the format string. For example, How come _exit(0) (exiting by syscall) prevents me from receiving any stdout content?. If you use any buffered I/O functions, or atexit, or anything like that, it's usually a good idea to call the libc exit(3) function instead of the system call directly. But of course you can call fflush before SYS_exit_group.
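The lost-output symptom and the "flush first" workaround can be reproduced from Python, whose os._exit() wraps _exit(). Note the assumption: Python buffers stdout in its own I/O layer rather than in C stdio, so this mirrors the behaviour described above without exercising printf itself. `piped_output`, `LOST` and `KEPT` are names invented for the sketch.

```python
import os
import subprocess
import sys


def piped_output(child_source):
    """Run child_source in a fresh interpreter with stdout connected to a pipe."""
    # Drop PYTHONUNBUFFERED so the child uses its default (buffered) stdout.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    proc = subprocess.run(
        [sys.executable, "-c", child_source],
        stdout=subprocess.PIPE,
        env=env,
        text=True,
    )
    return proc.stdout


# Exits without flushing: the text is still sitting in the user-space buffer.
LOST = "import os; print('hello'); os._exit(0)"
# Flushes explicitly before the immediate exit, like fflush before exit_group.
KEPT = "import os, sys; print('hello'); sys.stdout.flush(); os._exit(0)"

if __name__ == "__main__":
    print(repr(piped_output(LOST)))
    print(repr(piped_output(KEPT)))
```

The parent sees nothing from the first child because the bytes never left the child's buffer before the process ended.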

(Also related: On x64 Linux, what is the difference between syscall, int 0x80 and ret to exit a program? - ret from main is equivalent to calling exit(3))

Of course, it's not the compiler that chose anything; it's libc. When you include headers and write read(fd, buf, 123) or exit(1), the C compiler just sees an ordinary function call.

Some C libraries (e.g. musl, but not glibc) may use inline asm to inline a syscall instruction into your binary, but still the headers are part of the C library, not the compiler.

Difference between exit() and sys.exit() in Python

exit is a helper for the interactive shell - sys.exit is intended for use in programs.

The site module (which is imported automatically during startup, except if the -S command-line option is given) adds several constants to the built-in namespace (e.g. exit). They are useful for the interactive interpreter shell and should not be used in programs.

Technically, they do mostly the same: raising SystemExit. sys.exit does so in sysmodule.c:

static PyObject *
sys_exit(PyObject *self, PyObject *args)
{
    PyObject *exit_code = 0;
    if (!PyArg_UnpackTuple(args, "exit", 0, 1, &exit_code))
        return NULL;
    /* Raise SystemExit so callers may catch it or clean up. */
    PyErr_SetObject(PyExc_SystemExit, exit_code);
    return NULL;
}

While exit is defined in site.py and _sitebuiltins.py, respectively.

class Quitter(object):
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return 'Use %s() or %s to exit' % (self.name, eof)
    def __call__(self, code=None):
        # Shells like IDLE catch the SystemExit, but listen when their
        # stdin wrapper is closed.
        try:
            sys.stdin.close()
        except:
            pass
        raise SystemExit(code)
__builtin__.quit = Quitter('quit')
__builtin__.exit = Quitter('exit')

Note that there is a third exit option, namely os._exit, which exits without calling cleanup handlers, flushing stdio buffers, etc. (and which should normally only be used in the child process after a fork()).
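Because sys.exit only raises SystemExit, a caller can intercept it like any other exception, which is not possible with os._exit. A minimal sketch (the helper name `exit_code_of` is hypothetical):

```python
import sys


def exit_code_of(func, *args):
    """Call func and report the SystemExit code it raises, or None if it returns."""
    try:
        func(*args)
    except SystemExit as exc:
        return exc.code
    return None


print(exit_code_of(sys.exit, 3))  # prints 3; the interpreter keeps running
```

This is why cleanup code in finally blocks and context managers still runs after sys.exit, and why frameworks and shells such as IDLE can catch the exit request.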

In a Linux system call, are system call parameters preserved in registers after the syscall finished (at the sys_exit tracepoint)?

Is it guaranteed to be able to read all the syscall parameters at sys_exit tracepoint?

Yes... and no, we need to distinguish parameters from registers. Linux syscalls should preserve all general purpose userspace registers, except the register used for the return value (and on some architectures also a second register to indicate if an error occurred). However, this does not mean that the input parameters of the syscall cannot change between entry and exit: if a register holds the value of a pointer to some data, while the register itself does not change, the data it points to could very well change.

Looking at the code for the static tracepoint sys_exit, you can see that only the syscall number (id) and its return value (ret) are traced. See note at the bottom of my answer for more.

Why not read all parameters at sys_exit? Is this because some parameters may be not available at sys_exit?

Yes, I would say that ensuring the correctness of the traced parameters is the main reason why tracing only at the exit would be a bad idea. Even if you get the values of the register, you cannot know the real parameters at syscall exit. Even if a syscall per se is guaranteed to save and restore the state of user registers, the syscall itself can alter the data that is being passed as argument. For example, the recvmsg syscall takes a pointer to a struct msghdr in memory which is used both as an input and an output parameter; the poll syscall does the same with a pointer to struct pollfd. Furthermore, another thread or program could have very well modified the memory of the program while it was making a syscall, therefore altering the data.
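A small user-space illustration of the pointer-argument point: the buffer handed to readv(2) is the same object before and after the call, but its contents are rewritten by the syscall. A tracer that looked only at exit would see the filled buffer, never the original contents. The helper `read_into` is made up for this sketch and assumes a Unix-like system where os.readv is available.

```python
import os


def read_into(buf, payload):
    """Send payload through a pipe and let readv(2) fill buf in place."""
    r, w = os.pipe()
    try:
        os.write(w, payload)
        return os.readv(r, [buf])  # returns the number of bytes read
    finally:
        os.close(r)
        os.close(w)


buf = bytearray(4)        # the "pointer" argument: same object before and after
before = bytes(buf)       # snapshot of what a tracer would see at syscall entry
n = read_into(buf, b"data")
after = bytes(buf)        # what a tracer would see at syscall exit
print(before, after, n)
```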

Under specific circumstances a syscall can also take a very long time before returning (think for example of a sleep, or a blocking read on your terminal, an accept on a listening socket, etc). If you only trace at the exit, you will have very incorrect timing information, and most importantly you will have to wait a lot before any meaningful information can be captured, even though that information is already available at the entry point.

Note on sys_exit tracepoint

Although you could technically extract the values of the saved registers of the current task, I am not entirely sure about the semantics of doing so while in the sys_exit tracepoint. I searched for some documentation on this specific case, but had no luck, and kernel code is well... complex.

The chain of calls to reach the exit hook should be:

Arch specific entry point (e.g. entry_INT80_32 for x86 int 0x80)
- Arch specific entry handler (e.g. do_int80_syscall_32() for x86 int 0x80)
  - syscall_exit_to_user_mode()
    - syscall_exit_to_user_mode_prepare()
      - syscall_exit_work()
        - trace_sys_exit()

If a deadly signal is delivered to a process during a syscall, the process itself will never reach the exit of the syscall (i.e. no value is ever returned to user space), but the tracepoint will still be hit. When a signal delivery of this kind happens, a special internal return value is used, like -ERESTARTSYS. This value is not an actual syscall return value (it is not returned to user space); it is only meant to be used by the kernel. So it looks like the sys_exit tracepoint is being hit with the special -ERESTARTSYS if a deadly signal is received by the process. This does not happen, for example, in the case of SIGSTOP + SIGCONT. Take this with a grain of salt though, since I was not able to find proper documentation for this.

What is difference between sys.exit(0) and os._exit(0)

According to the documentation:

os._exit():
Exit the process with status n, without calling cleanup handlers, flushing stdio buffers, etc.

Note: The standard way to exit is sys.exit(n). _exit() should normally only be used in the child process after a fork().

What is the correct constant for the exit system call?

The correct header file to get the system call numbers is sys/syscall.h. The constants are called SYS_### where ### is the name of the system call you are interested in. The __NR_### macros are implementation details and should not be used. As a rule of thumb, if an identifier begins with an underscore it should not be used, if it begins with two it should definitely not be used. The arguments go into rdi, rsi, rdx, r10, r8, and r9. Here is a sample program for Linux:

#include <sys/syscall.h>

    .globl _start
_start:
    mov $SYS_exit,%eax
    xor %edi,%edi
    syscall

These conventions are mostly portable to other UNIX-like operating systems.

/ptregs in syscall table

These are special system calls which require a full register dump laid out on the stack (as a struct pt_regs). This is a thing only for the 64-bit x86 architecture because it has more registers (compared to 32-bit).

The system call handler (arch/x86/entry/entry_64.S:entry_SYSCALL_64) saves most of the registers on the stack on system call entry. This is done partially to support ptrace() and partially to pass the arguments to actual system call handlers written in C (this is why they have the asmlinkage spec; it makes the function get its arguments from the stack). System calls have at most 6 arguments (rdi, rsi, rdx, r10, r8, r9), and some registers are used for SYSCALL bookkeeping (rax, rcx, r11). You do not need to save rbp, rbx, r12, r13, r14, r15 (as they are callee-saved), so they are not saved on entry for performance reasons. After the system call handling completes, the registers are restored from this backup before returning to userspace.

However, some system calls (like execve(), fork(), sigreturn(), etc.) need to have all registers on the stack (including rbp, rbx, r12–r15), in the struct pt_regs. This is because these system calls can cause the userspace to restart execution from a different place, so they need accurate register values saved. They are marked with /ptregs in syscall_64.tbl so that the following magic happens.

Normally the system call handler table (sys_call_table) contains pointers to C functions. But for those special system calls the handlers are small assembly thunks which first save the extra registers and then jump to the C code (this is what the slow-path does). The /ptregs suffix in the table instructs the script to insert these stubs instead of C functions into the handler table.
