On x64 Linux, what is the difference between syscall, int 0x80 and ret to exit a program?
If you use printf or other libc functions, it's best to ret from main or call exit. (Which are equivalent; main's caller will call the libc exit function.)
If not, if you were only making other raw system calls like write with syscall, it's also appropriate and consistent to exit that way, but either way, ret or call exit are 100% fine in main.
If you want to work without libc at all, e.g. put your code under _start: instead of main: and link with ld or gcc -static -nostdlib, then you can't use ret. Use mov eax, 231 (__NR_exit_group) / syscall.
main is a real & normal function like any other (called with a valid return address), but _start (the process entry point) isn't. On entry to _start, the stack holds argc and argv, so trying to ret would set RIP=argc, and then code-fetch would segfault on that unmapped address. (See: Nasm segmentation fault on RET in _start.)
System call vs. ret-from-main
Exiting via a system call is like calling _exit() in C - it skips atexit() and libc cleanup, notably not flushing any buffered stdout output (line buffered on a terminal, full-buffered otherwise).
This leads to symptoms such as Using printf in assembly leads to empty output when piping, but works on the terminal (or if your output doesn't end with \n, even on a terminal).
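The buffering symptom can be reproduced from C without any asm. The following is a minimal sketch, not taken from the original answers: bytes_seen_by_parent is a hypothetical helper name, and the child deliberately prints a message with no trailing newline so it stays in the stdio buffer whether stdout is line-buffered or full-buffered.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child whose stdout is a pipe, have it printf
 * a 5-byte message (no newline, no explicit flush), then leave either via
 * the libc exit(3) or via _exit(2). Returns the number of bytes the parent
 * managed to read from the pipe, or -1 on error. */
long bytes_seen_by_parent(int use_libc_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    fflush(NULL);               /* keep pending parent output out of the child */
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {             /* child */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        printf("hello");        /* stays in the stdio buffer for now */
        if (use_libc_exit)
            exit(0);            /* libc cleanup runs: buffer is flushed */
        _exit(0);               /* straight to the kernel: buffer is lost */
    }

    close(fds[1]);              /* parent: read until the child closes the pipe */
    char buf[64];
    long total = 0;
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        total += n;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return total;
}
```

Calling bytes_seen_by_parent(1) should report the 5 bytes of "hello", while bytes_seen_by_parent(0) should report nothing, which is the same thing that happens when hand-written asm calls printf and then exits with a raw system call.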
main is a function, called (indirectly) from CRT startup code. (Assuming you link your program normally, like you would a C program.) Your hand-written main works exactly like a compiler-generated C main function would. Its caller (__libc_start_main) really does do something like int result = main(argc, argv); exit(result);, e.g. call rax (pointer passed by _start) / mov edi, eax / call exit.
So returning from main is exactly like calling exit (see footnote 1).
See:
- Syscall implementation of exit() for a comparison of the relevant C functions, exit vs. _exit vs. exit_group and the underlying asm system calls.
- C question: What is the difference between exit and return? is primarily about exit() vs. return, although there is mention of calling _exit() directly, i.e. just making a system call. It's applicable because C main compiles to an asm main just like you'd write by hand.
Footnote 1: You can invent a hypothetical intentionally weird case where it's different. e.g. you used stack space in main as your stdio buffer with sub rsp, 1024 / mov rsi, rsp / ... / call setvbuf. Then returning from main would involve putting RSP above that buffer, and __libc_start_main's call to exit could overwrite some of that buffer with return addresses and locals before execution reached the fflush cleanup. This mistake is more obvious in asm than C because you need leave or mov rsp, rbp or add rsp, 1024 or something to point RSP at your return address.
In C++, return from main runs destructors for its locals (before global/static exit stuff), exit doesn't. But that just means the compiler makes asm that does more stuff before actually running the ret, so it's all manual in asm, like in C.
The other difference is of course the asm / calling-convention details: exit status in EAX (return value) or EDI (first arg), and of course to ret you have to have RSP pointing at your return address, like it was on function entry. With call exit you don't, and you can even do a conditional tailcall of exit like jne exit. Since it's a noreturn function, you don't really need RSP pointing at a valid return address. (RSP should be aligned by 16 before a call, though, or RSP%16 = 8 before a tailcall, matching the alignment after call pushes a return address. It's unlikely that exit / fflush cleanup will do any alignment-required stores/loads to the stack, but it's a good habit to get this right.)
(This whole footnote is about ret vs. call exit, not syscall, so it's a bit of a tangent from the rest of the answer. You can also run syscall without caring where the stack-pointer points.)
SYS_exit vs. SYS_exit_group raw system calls
The raw SYS_exit system call is for exiting the current thread, like pthread_exit(). (eax=60 / syscall, or eax=1 / int 0x80).
SYS_exit_group is for exiting the whole program, like _exit. (eax=231 / syscall, or eax=252 / int 0x80).
In a single-threaded program you can use either, but conceptually exit_group makes more sense to me if you're going to use raw system calls. glibc's _exit() wrapper function actually uses the exit_group system call (since glibc 2.3). See Syscall implementation of exit() for more details.
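To see that either raw call works in a single-threaded process, here is a small C sketch using the generic syscall(2) wrapper instead of hand-written asm. raw_exit_status is a hypothetical helper name; the SYS_exit and SYS_exit_group constants come from sys/syscall.h, so no call numbers are hard-coded. It does not demonstrate the multi-threaded difference, only that both calls deliver the status to the parent.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a (single-threaded) child that leaves via a raw
 * system call, bypassing all libc cleanup, and return the exit status the
 * parent observes, or -1 on error. nr is expected to be SYS_exit or
 * SYS_exit_group. */
int raw_exit_status(long nr, int status)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        syscall(nr, status);    /* does not return for either exit call */
        _exit(127);             /* only reached if nr was not an exit call */
    }

    int wstatus;
    if (waitpid(pid, &wstatus, 0) != pid || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

Both raw_exit_status(SYS_exit, 42) and raw_exit_status(SYS_exit_group, 42) should return 42 here, because the child has only one thread.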
However, nearly all the hand-written asm you'll ever see uses SYS_exit (see Note 1). It's not "wrong", and SYS_exit is perfectly acceptable for a program that didn't start more threads. Especially if you're trying to save code size with xor eax,eax / inc eax (3 bytes in 32-bit mode) or push 60 / pop rax (3 bytes in 64-bit mode), while push 231 / pop rax would be even larger than mov eax,231 because it doesn't fit in a signed imm8.
Note 1: (Usually actually hard-coding the number, not using __NR_... constants from asm/unistd.h or their SYS_... names from sys/syscall.h)
And historically, it's all there was. Note that in unistd_32.h, __NR_exit has call number 1, but __NR_exit_group = 252 wasn't added until years later when the kernel gained support for tasks that share virtual address space with their parent, aka threads started by clone(2). This is when SYS_exit conceptually became "exit current thread". (But one could easily and convincingly argue that in a single-threaded program, SYS_exit does still mean exit the whole program, because it only differs from exit_group if there are multiple threads.)
To be honest, I've never used eax=252 / int 0x80 in anything, only ever eax=1. It's only in 64-bit code where I often use mov eax,231 instead of mov eax,60 because neither number is "simple" or memorable the way 1 is, so might as well be a cool guy and use the "modern" exit_group way in my single-threaded toy program / experiment / microbenchmark / SO answer. :P (If I didn't enjoy tilting at windmills, I wouldn't spend so much time on assembly, especially on SO.)
And BTW, I usually use NASM for one-off experiments so it's inconvenient to use pre-defined symbolic constants for call numbers; with GCC to preprocess a .S before running GAS you can make your code self-documenting with #include <sys/syscall.h> so you can use mov $SYS_exit_group, %eax (or $__NR_exit_group), or mov eax, __NR_exit_group with .intel_syntax noprefix.
Don't use the 32-bit int 0x80 ABI in 64-bit code:
What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? explains what happens if you use the COMPAT_IA32_EMULATION int 0x80 ABI in 64-bit code.
It's totally fine for just exiting, as long as your kernel has that support compiled in, otherwise it will segfault just like any other random int number like int 0x7f. (e.g. on WSL1, or people that built custom kernels and disabled that support.)
But the only reason you'd do it that way in asm would be so you could build the same source file with nasm -felf32 or nasm -felf64. (You can't use syscall in 32-bit code, except on some AMD CPUs which have a 32-bit version of syscall. And the 32-bit ABI uses different call numbers anyway so this wouldn't let the same source be useful for both modes.)
Related:
- Why am I allowed to exit main using ret? (CRT startup code calls main, you're not returning directly to the kernel.)
- Nasm segmentation fault on RET in _start - you can't ret from _start
- Using printf in assembly leads to empty output when piping, but works on the terminal - stdout buffer (not) flushing with raw system call exit
- Syscall implementation of exit() - call exit vs. mov eax,60 / syscall (_exit) vs. mov eax,231 / syscall (exit_group).
- Can't call C standard library function on 64-bit Linux from assembly (yasm) code - modern Linux distros config GCC in a way that call exit or call puts won't link with nasm -felf64 foo.asm && gcc foo.o.
- Is main() really start of a C++ program? - Ciro's answer is a deep dive into how glibc + its CRT startup code actually call main (including x86-64 asm disassembly in GDB), and shows the glibc source code for __libc_start_main.
- Linux x86 Program Start Up or - How the heck do we get to main()? - 32-bit asm, and more detail than you'll probably want until you're a lot more comfortable with asm, but if you've ever wondered why CRT runs so much code before getting to main, that covers what's happening at a level that's a couple steps up from using GDB with starti (stop at the process entry point, e.g. in the dynamic linker's _start) and stepi until you get to your own _start or main.
- https://stackoverflow.com/tags/x86/info - lots of good links about this and everything else.
Why am I allowed to exit main using ret?
C main is called (indirectly) from CRT startup code, not directly from the kernel.
After main returns, that code calls atexit functions to do stuff like flushing stdio buffers, then passes main's return value to a raw _exit system call. Or exit_group which exits all threads.
You make several wrong assumptions, all I think based on a misunderstanding of how kernels work.
- The kernel runs at a different privilege level from user-space (ring 0 vs. ring 3 on x86). Even if user-space knew the right address to jump to, it can't jump into kernel code. (And even if it could, it wouldn't be running with kernel privilege level.) ret isn't magic, it's basically just pop %rip and doesn't let you jump anywhere you couldn't jump to with other instructions. Also doesn't change privilege level (see footnote 1).
- Kernel addresses aren't mapped / accessible when user-space code is running; those page-table entries are marked as supervisor-only. (Or they're not mapped at all in kernels that mitigate the Meltdown vulnerability, so entering the kernel goes through a "wrapper" block of code that changes CR3.) Virtual memory is how the kernel protects itself from user-space. User-space can't modify page tables directly, only by asking the kernel to do it via mmap and mprotect system calls. (And user-space can't execute privileged instructions like mov cr3, rax to install new page tables. That's the purpose of having ring 0 (kernel mode) vs. ring 3 (user mode).)
- The kernel stack is separate from the user-space stack for a process. (In the kernel, there's also a small kernel stack for each task (aka thread) that's used during system calls / interrupts while that user-space thread is running. At least that's how Linux does it, IDK about others.)
- The kernel doesn't literally call user-space code; the user-space stack doesn't hold any return address back into the kernel. A kernel->user transition involves swapping stack pointers, as well as changing privilege levels, e.g. with an instruction like iret (interrupt-return).
- Plus, leaving a kernel code address anywhere user-space can see it would defeat kernel ASLR.
Footnote 1: (The compiler-generated ret will always be a normal near ret, not a retf that could return through a call gate or something to a privileged cs value. x86 handles privilege levels via the low 2 bits of CS but nevermind that. MacOS / Linux don't set up call gates that user-space can use to call into the kernel; that's done with syscall or int 0x80 instructions.)
In a fresh process (after an execve system call replaced the previous process with this PID with a new one), execution begins at the process entry point (usually labeled _start), not at the C main function directly.
C implementations come with CRT (C RunTime) startup code that has (among other things) a hand-written asm implementation of _start which (indirectly) calls main, passing args to main according to the calling convention.
_start itself is not a function. On process entry, RSP points at argc, and above that on the user-space stack is argv[0], argv[1], etc. (i.e. the char *argv[] array is right there by value, and above that the envp array.) _start loads argc into a register and puts pointers to the argv and envp into registers. (The x86-64 System V ABI that MacOS and Linux both use documents all this, including the process-startup environment and the calling convention.)
If you try to ret from _start, you're just going to pop argc into RIP, and then code-fetch from absolute address 1 or 2 (or other small number) will segfault. For example, Nasm segmentation fault on RET in _start shows an attempt to ret from the process entry point (linked without CRT startup code). It has a hand-written _start that just falls through into main.
When you run gcc main.c, the gcc front-end runs multiple other programs (use gcc -v to show details). This is how the CRT startup code gets linked into your process:
- gcc preprocesses (CPP) and compiles+assembles main.c to main.o (or a temporary file). On MacOS, the gcc command is actually clang which has a built-in assembler, but real gcc really does compile to asm and then run as on that. (The C preprocessor is built-in to the compiler, though.)
- gcc runs something like ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 -pie /usr/lib/Scrt1.o /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/crtbeginS.o main.o -lc -lgcc /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/crtendS.o. That's actually simplified a lot, with some of the CRT files left out, and paths canonicalized to remove ../../lib parts. Also, it doesn't run ld directly, it runs collect2 which is a wrapper for ld. But anyway, that statically links in those .o CRT files that contain _start and some other stuff, and dynamically links libc (-lc) and libgcc (for GCC helper functions like implementing __int128 multiply and divide with 64-bit registers, in case your program uses those).
.intel_syntax
.text:
.global _rbp
_rbp:
mov rax, rbp
ret;
This is not allowed, ...
The only reason that doesn't assemble is because you tried to declare .text: as a label, instead of using the .text directive. If you remove the trailing : it does assemble with clang (which treats .intel_syntax the same as .intel_syntax noprefix).
For GCC / GAS to assemble it, you'd also need the noprefix to tell it that register names aren't prefixed by %. (Yes you can have Intel op dst, src order but still with %rsp register names. No you shouldn't do this!) And of course GNU/Linux doesn't use leading underscores.
Not that it would always do what you want if you called it, though! If you compiled main without optimization (so -fno-omit-frame-pointer was in effect), then yes you'd get a pointer to the stack slot below the return address.
And you definitely use the value incorrectly. (*p)-4; loads the saved RBP value (*p) and then offsets by four 8-byte void-pointers. (Because that's how C pointer math works; *p has type void* because p has type void **).
I think you're trying to get your own return address and re-run the call instruction (in main's caller) that reached main, eventually leading to a stack overflow from pushing more return addresses. In GNU C, use void * __builtin_return_address (0) to get your own return address.
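As a minimal GNU C sketch of that builtin (my_return_address is a hypothetical helper name, and the noinline attribute is there so the function really is reached by a call instruction rather than being inlined into its caller):

```c
/* GNU C only (GCC / clang): __builtin_return_address is a compiler
 * extension, not ISO C. */
__attribute__((noinline))
void *my_return_address(void)
{
    /* Level 0 means "the address this function will return to", i.e. the
     * instruction right after the call in the caller. */
    return __builtin_return_address(0);
}
```

The returned pointer is the address of the instruction following the call, so it says nothing by itself about how long the call instruction was; you still need to disassemble the caller for that.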
x86 call rel32 instructions are 5 bytes, but the call that called main was probably an indirect call, using a pointer in a register. So it might be a 2-byte call *%rax or a 3-byte call *%r12, you don't know unless you disassemble your caller. (I'd suggest single-stepping by instructions (GDB / LLDB stepi) off the end of main using a debugger in disassembly mode. If it has any symbol info for main's caller, you'll be able to scroll backward and see what the previous instruction was.)
If not, you might have to try and see what looks sane; x86 machine code can't be unambiguously decoded backwards because it's variable-length. You can't tell the difference between a byte within an instruction (like an immediate or ModRM) vs. the start of an instruction. It all depends on where you start disassembling from. If you try a few byte offsets, usually only one will produce anything that looks sane.
asm("movq %rax, 0"); //Exit code is 11, so now it should be 0
This is a store of RAX to absolute address 0, in AT&T syntax. This of course segfaults. Exit code 11 is from SIGSEGV, which is signal 11. (Use kill -l to see signal numbers.)
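The link between the crash and the number 11 can be checked from a parent process. This is a sketch under the assumption that SIGSEGV is not blocked or ignored in the calling process; signal_that_killed_child is a hypothetical helper name, and it uses raise() rather than an actual bad store so the C code itself stays well-defined.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that dies from SIGSEGV and return the
 * terminating signal number as seen by the parent, or -1 if the child did
 * not die from a signal. */
int signal_that_killed_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        signal(SIGSEGV, SIG_DFL);   /* make sure the default action applies */
        raise(SIGSEGV);             /* same signal a bad store would cause */
        _exit(0);                   /* not reached with the default action */
    }

    int wstatus;
    if (waitpid(pid, &wstatus, 0) != pid || !WIFSIGNALED(wstatus))
        return -1;
    return WTERMSIG(wstatus);       /* 11 on Linux for SIGSEGV */
}
```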
Perhaps you wanted mov $0, %eax. Although that's still pointless here, you're about to call through your function pointer. In debug mode, the compiler might load it into RAX and step on your value.
Also, writing a register in an asm statement is never safe when you don't tell the compiler which registers you're modifying (using constraints).
printf("Main: %p\n", main);
printf("&Main: %p\n", &main); //WTF
main and &main are the same thing because main is a function. That's just how C syntax works for function names. main isn't an object that can have its address taken. (See: & operator optional in function pointer assignment.)
It's similar for arrays: the bare name of an array can be assigned to a pointer or passed to functions as a pointer arg. But &array is also the same pointer, same as &array[0]. This is true only for arrays like int array[10], not for pointers like int *ptr; in the latter case the pointer object itself has storage space and can have its own address taken.
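A short sketch of both rules in plain C. The names answer, function_name_equals_address_of and array_name_equals_first_element are hypothetical and only exist for this illustration.

```c
/* Any ordinary function will do; this one stands in for main. */
int answer(void) { return 42; }

/* Returns 1 if the bare function name and &name give the same pointer. */
int function_name_equals_address_of(void)
{
    int (*plain)(void) = answer;      /* function name decays to a pointer */
    int (*with_amp)(void) = &answer;  /* explicit address-of: same pointer */
    return plain == with_amp;
}

/* Returns 1 if array, &array[0] and &array all refer to the same address. */
int array_name_equals_first_element(void)
{
    int array[10] = {0};
    int *decayed = array;             /* array name decays to &array[0] */
    int *first = &array[0];
    void *whole = &array;             /* type int (*)[10], same address */
    return decayed == first && (void *)decayed == whole;
}
```

Note that &array has a different type (pointer to the whole array) even though the address is the same, which is why the comparison goes through void *.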
Syscall implementation of exit()
The Linux and glibc man pages document all of this (See especially the "C library/kernel differences" in the NOTES section).
- _exit(2): In glibc 2.3 and later, this wrapper function actually uses the Linux SYS_exit_group system call to exit all threads. Before glibc 2.3, it was a wrapper for SYS_exit to exit just the current thread.
- exit_group(2): glibc wrapper for SYS_exit_group, which exits all threads.
- exit(3): The ISO C89 function which flushes buffers and then exits the whole process. (It always uses exit_group() because there's no benefit to checking if the process was single-threaded and deciding to use SYS_exit vs. SYS_exit_group). As @Matteo points out, recent ISO C / POSIX standards are thread-aware and one or both probably require this behaviour. But apparently exit(3) itself is not thread-safe (in the C library cleanup parts), so I guess don't call it from multiple threads at once.
- syscall / int 0x80 with SYS_exit: terminates just the current thread, leaving others running. AFAIK, modern glibc has no thin wrapper function for this Linux system call, but I think pthread_exit() uses it if this isn't the last thread. (Otherwise exit(3) -> exit_group(2).)
Only exit(), not _exit() or exit_group(), flushes stdout, leading to "printf doesn't print anything" problems in newbie asm programs if writing to a pipe (which makes stdout full-buffered instead of line-buffered), or if you forgot the \n in the format string. For example, How come _exit(0) (exiting by syscall) prevents me from receiving any stdout content?. If you use any buffered I/O functions, or atexit, or anything like that, it's usually a good idea to call the libc exit(3) function instead of the system call directly. But of course you can call fflush before SYS_exit_group.
(Also related: On x64 Linux, what is the difference between syscall, int 0x80 and ret to exit a program? - ret from main is equivalent to calling exit(3))
It's not of course the compiler that chose anything, it's libc. When you include headers and write read(fd, buf, 123) or exit(1), the C compiler just sees an ordinary function call.
Some C libraries (e.g. musl, but not glibc) may use inline asm to inline a syscall instruction into your binary, but still the headers are part of the C library, not the compiler.
What is the difference between calling ret vs calling the sys_exit number assembly gcc
The code that calls main looks like this:
int status = main(argc, argv, envp);
exit(status);
If main returns, exit(status) is executed. exit is a C library function which flushes all stdio streams, invokes atexit() handlers and finally calls _exit(status), which is the C wrapper for the exit system call (SYS_exit_group in glibc 2.3 and later, SYS_exit before that). If you use the C runtime (e.g. by having your program start at main or by using any libc functions), I strongly recommend you to never call SYS_exit directly so the C runtime has a chance to correctly deinitialize the program. The best idea is usually to call exit() or to return from main unless you know exactly what you are doing.
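The atexit() part of that cleanup can be observed directly. In this sketch the names report_fd, on_exit_handler and atexit_handler_ran are hypothetical; the handler writes one byte to a pipe so the parent can tell whether it ran in the child.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;          /* hypothetical channel back to the parent */

static void on_exit_handler(void)
{
    if (report_fd >= 0) {
        ssize_t ignored = write(report_fd, "x", 1);
        (void)ignored;
    }
}

/* Hypothetical helper: returns 1 if the atexit handler ran in the child,
 * 0 if it did not, -1 on error. */
int atexit_handler_ran(int use_libc_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    fflush(NULL);                   /* keep pending parent output out of the child */
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                 /* child */
        close(fds[0]);
        report_fd = fds[1];
        atexit(on_exit_handler);
        if (use_libc_exit)
            exit(0);                /* runs handlers, then exits */
        _exit(0);                   /* skips handlers entirely */
    }

    close(fds[1]);                  /* parent */
    char c;
    ssize_t n = read(fds[0], &c, 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1;
}
```

atexit_handler_ran(1) should return 1 and atexit_handler_ran(0) should return 0, matching the difference between returning from main (or calling exit) and making the system call directly.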
What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?
TL:DR: int 0x80 works when used correctly, as long as any pointers fit in 32 bits (stack pointers don't fit). But beware that strace decodes it wrong unless you have a very recent strace + kernel.
int 0x80 zeros r8-r11 for reasons, and preserves everything else. Use it exactly like you would in 32-bit code, with the 32-bit call numbers. (Or better, don't use it!)
Not all systems even support int 0x80: The Windows Subsystem for Linux version 1 (WSL1) is strictly 64-bit only: int 0x80 doesn't work at all. It's also possible to build Linux kernels without IA-32 emulation. (No support for 32-bit executables, no support for 32-bit system calls.) See this re: making sure your WSL is actually WSL2 (which uses an actual Linux kernel in a VM.)
The details: what's saved/restored, which parts of which regs the kernel uses
int 0x80 uses eax (not the full rax) as the system-call number, dispatching to the same table of function-pointers that 32-bit user-space int 0x80 uses. (These pointers are to sys_whatever implementations or wrappers for the native 64-bit implementation inside the kernel. System calls are really function calls across the user/kernel boundary.)
Only the low 32 bits of arg registers are passed. The upper halves of rbx-rbp are preserved, but ignored by int 0x80 system calls. Note that passing a bad pointer to a system call doesn't result in SIGSEGV; instead the system call returns -EFAULT. If you don't check error return values (with a debugger or tracing tool), it will appear to silently fail.
All registers (except rax of course, which holds the return value) are saved/restored (including RFLAGS, and the upper 32 of integer regs), except that r8-r11 are zeroed. r12-r15 are call-preserved in the x86-64 SysV ABI's function calling convention, so the registers that get zeroed by int 0x80 in 64-bit are the call-clobbered subset of the "new" registers that AMD64 added.
This behaviour has been preserved over some internal changes to how register-saving was implemented inside the kernel, and comments in the kernel mention that it's usable from 64-bit, so this ABI is probably stable. (I.e. you can count on r8-r11 being zeroed, and everything else being preserved.)
The return value is sign-extended to fill 64-bit rax. (Linux declares 32-bit sys_ functions as returning signed long.) This means that pointer return values (like from void *mmap()) need to be zero-extended before use in 64-bit addressing modes.
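The sign-extension issue is just integer arithmetic, so it can be sketched in C. The helper names are hypothetical, and 0xf7ff0000 is an assumed example of a 32-bit address with its top bit set, not a value taken from the original answer.

```c
#include <stdint.h>

/* Model of what the kernel does with a 32-bit return value: sign-extend it
 * to fill rax. (Converting an out-of-range value to int32_t is
 * implementation-defined in ISO C; GCC and clang wrap modulo 2^32.) */
int64_t as_returned_in_rax(uint32_t eax)
{
    return (int64_t)(int32_t)eax;
}

/* What 64-bit user code should do with a pointer result: keep the low
 * 32 bits and zero-extend (in asm, e.g. mov eax, eax). */
uint64_t zero_extend_low32(int64_t rax)
{
    return (uint64_t)(uint32_t)rax;
}
```

For the assumed address 0xf7ff0000, as_returned_in_rax gives 0xfffffffff7ff0000, which is not a usable user-space pointer until zero_extend_low32 turns it back into 0x00000000f7ff0000. Small negative error codes such as -EFAULT (-14 on Linux) come through sign-extension unchanged, which is why the kernel does it that way.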
Unlike sysenter, it preserves the original value of cs, so it returns to user-space in the same mode that it was called in. (Using sysenter results in the kernel setting cs to $__USER32_CS, which selects a descriptor for a 32-bit code segment.)
Older strace decodes int 0x80 incorrectly for 64-bit processes. It decodes as if the process had used syscall instead of int 0x80. This can be very confusing. e.g. strace prints write(0, NULL, 12 <unfinished ... exit status 1> for eax=1 / int $0x80, which is actually _exit(ebx), not write(rdi, rsi, rdx).
I don't know the exact version where the PTRACE_GET_SYSCALL_INFO feature was added, but Linux kernel 5.5 / strace 5.5 handle it. It misleadingly says the process "runs in 32-bit mode" but does decode correctly. (Example).
int 0x80 works as long as all arguments (including pointers) fit in the low 32 of a register. This is the case for static code and data in the default code model ("small") in the x86-64 SysV ABI. (Section 3.5.1: all symbols are known to be located in the virtual addresses in the range 0x00000000 to 0x7effffff, so you can do stuff like mov edi, hello (AT&T mov $hello, %edi) to get a pointer into a register with a 5 byte instruction).
But this is not the case for position-independent executables, which many Linux distros now configure gcc to make by default (and they enable ASLR for executables). For example, I compiled a hello.c on Arch Linux, and set a breakpoint at the start of main. The string constant passed to puts was at 0x555555554724, so a 32-bit ABI write system call would not work. (GDB disables ASLR by default, so you always see the same address from run to run, if you run from within GDB.)
Linux puts the stack near the "gap" between the upper and lower ranges of canonical addresses, i.e. with the top of the stack at 2^47-1. (Or somewhere random, with ASLR enabled). So rsp on entry to _start in a typical statically-linked executable is something like 0x7fffffffe550, depending on size of env vars and args. Truncating this pointer to esp does not point to any valid memory, so system calls with pointer inputs will typically return -EFAULT if you try to pass a truncated stack pointer. (And your program will crash if you truncate rsp to esp and then do anything with the stack, e.g. if you built 32-bit asm source as a 64-bit executable.)
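The truncation described above is easy to check with the two addresses quoted in the text. This is only an arithmetic sketch; truncate_to_32 and survives_int80 are hypothetical helper names, and 0x400000 in the usage note is an assumed example of a low address in a non-PIE executable.

```c
#include <stdint.h>

/* The int 0x80 ABI only looks at the low 32 bits of each argument register. */
uint32_t truncate_to_32(uint64_t pointer)
{
    return (uint32_t)pointer;
}

/* Returns 1 if the pointer is unchanged by truncation, i.e. it could be
 * passed through the 32-bit ABI without pointing somewhere else. */
int survives_int80(uint64_t pointer)
{
    return truncate_to_32(pointer) == pointer;
}
```

With the values from the text, truncate_to_32(0x7fffffffe550) is 0xffffe550 and truncate_to_32(0x555555554724) is 0x55554724, so neither the stack pointer nor the PIE string address survives, while an assumed low address such as 0x400000 does.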
How it works in the kernel:
In the Linux source code, arch/x86/entry/entry_64_compat.S defines ENTRY(entry_INT80_compat). Both 32 and 64-bit processes use the same entry point when they execute int 0x80.
entry_64.S defines native entry points for a 64-bit kernel, which includes interrupt / fault handlers and syscall native system calls from long mode (aka 64-bit mode) processes.
entry_64_compat.S defines system-call entry-points from compat mode into a 64-bit kernel, plus the special case of int 0x80 in a 64-bit process. (sysenter in a 64-bit process may go to that entry point as well, but it pushes $__USER32_CS, so it will always return in 32-bit mode.) There's a 32-bit version of the syscall instruction, supported on AMD CPUs, and Linux supports it too for fast 32-bit system calls from 32-bit processes.
I guess a possible use-case for int 0x80 in 64-bit mode is if you wanted to use a custom code-segment descriptor that you installed with modify_ldt. int 0x80 pushes segment registers itself for use with iret, and Linux always returns from int 0x80 system calls via iret. The 64-bit syscall entry point sets pt_regs->cs and ->ss to constants, __USER_CS and __USER_DS. (It's normal that SS and DS use the same segment descriptors. Permission differences are done with paging, not segmentation.)
entry_32.S defines entry points into a 32-bit kernel, and is not involved at all.