Using Interrupt 0x80 on 64-Bit Linux

Evidently you are writing a 64-bit program and using the "int 0x80" instruction. However, "int 0x80" only works reliably in 32-bit programs.

In a 64-bit process the stack is at an address that does not fit in 32 bits. "int 0x80"-style system calls only see 32-bit register values, so they cannot be given a pointer into this memory area.

To solve this problem there are two possibilities:

  • Compile as a 32-bit application (use 32-bit registers like EAX instead of 64-bit registers like RAX). When linked without any shared libraries, 32-bit programs will work perfectly on 64-bit Linux.
  • Use "syscall"-style system calls instead of "int 0x80"-style system calls. The use of these differs a lot from "int 0x80"-style ones!

32-bit code:

mov eax,4    ; In "int 0x80" style 4 means: write
mov ebx,1 ; ... and the first arg. is stored in ebx
mov ecx,esp ; ... and the second arg. is stored in ecx
mov edx,1 ; ... and the third arg. is stored in edx
int 0x80

64-bit code:

mov rax,1    ; In "syscall" style 1 means: write
mov rdi,1 ; ... and the first arg. is stored in rdi (not rbx)
mov rsi,rsp ; ... and the second arg. is stored in rsi (not rcx)
mov rdx,1 ; ... and the third arg. is stored in rdx
syscall
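
As a side-by-side summary of the two listings above, here is a small Python sketch of the two conventions. It only builds a table of register assignments; it makes no system call. The dictionary layout and the helper name describe_call are invented for this illustration, and the fourth to sixth argument registers (not used in the listings above) are filled in from the documented Linux conventions.

```python
# Illustrative summary of the two system-call conventions on x86-64 Linux.
# The data layout and describe_call() are made up for this sketch.
ABI = {
    "int 0x80": {  # 32-bit ABI, call numbers from unistd_32.h
        "number_reg": "eax",
        "arg_regs": ["ebx", "ecx", "edx", "esi", "edi", "ebp"],
        "numbers": {"write": 4, "exit": 1},
    },
    "syscall": {   # native 64-bit ABI, call numbers from unistd_64.h
        "number_reg": "rax",
        "arg_regs": ["rdi", "rsi", "rdx", "r10", "r8", "r9"],
        "numbers": {"write": 1, "exit": 60},
    },
}


def describe_call(style, name, *args):
    """Return the register assignments needed for one system call."""
    abi = ABI[style]
    regs = {abi["number_reg"]: abi["numbers"][name]}
    regs.update(zip(abi["arg_regs"], args))
    return regs


# write(1, <stack pointer>, 1) in both styles
print(describe_call("int 0x80", "write", 1, "esp", 1))
print(describe_call("syscall", "write", 1, "rsp", 1))
```

Note that both the call number and the argument registers differ, which is why 32-bit source cannot simply be reassembled as 64-bit code.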

--- Edit ---

Background information:

"int 0x80" is intended for 32-bit programs. When called from a 64-bit program, it behaves the same way as if it had been called from a 32-bit program (using the 32-bit calling convention).

This also means that the parameters for "int 0x80" will be passed in 32-bit registers and the upper 32 bits of the 64-bit registers are ignored.

(I just tested that on Ubuntu 16.10, 64 bit.)

This however means that you can only access memory below 2^32 (or even below 2^31) when using "int 0x80" because you cannot pass an address above 2^32 in a 32-bit register.

If the data to be written is located at an address below 2^31 you may use "int 0x80" to write the data. If it is located above 2^32 you can't. The stack (RSP) is very likely located above 2^32 so you cannot write data on the stack using "int 0x80".

Because it is very likely that your program will use memory above 2^32 I have written: "int 0x80 does not work with 64-bit programs."

What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?

TL;DR: int 0x80 works when used correctly, as long as any pointers fit in 32 bits (stack pointers don't fit). But beware that strace decodes it wrong unless you have a very recent strace + kernel.

int 0x80 zeros r8-r11 (for historical reasons, discussed below), and preserves everything else. Use it exactly like you would in 32-bit code, with the 32-bit call numbers. (Or better, don't use it!)

Not all systems even support int 0x80: the Windows Subsystem for Linux version 1 (WSL1) is strictly 64-bit only, so int 0x80 doesn't work at all there. It's also possible to build Linux kernels without IA-32 emulation (no support for 32-bit executables, no support for 32-bit system calls). If you're on WSL, make sure it is actually WSL2 (which uses an actual Linux kernel in a VM).



The details: what's saved/restored, which parts of which regs the kernel uses

int 0x80 uses eax (not the full rax) as the system-call number, dispatching to the same table of function-pointers that 32-bit user-space int 0x80 uses. (These pointers are to sys_whatever implementations or wrappers for the native 64-bit implementation inside the kernel. System calls are really function calls across the user/kernel boundary.)

Only the low 32 bits of arg registers are passed. The upper halves of rbx-rbp are preserved, but ignored by int 0x80 system calls. Note that passing a bad pointer to a system call doesn't result in SIGSEGV; instead the system call returns -EFAULT. If you don't check error return values (with a debugger or tracing tool), it will appear to silently fail.

All registers (except eax of course) are saved/restored (including RFLAGS, and the upper 32 bits of integer regs), except that r8-r11 are zeroed. r12-r15 are call-preserved in the x86-64 SysV ABI's function calling convention, so the registers that get zeroed by int 0x80 in 64-bit are the call-clobbered subset of the "new" registers that AMD64 added.

This behaviour has been preserved over some internal changes to how register-saving was implemented inside the kernel, and comments in the kernel mention that it's usable from 64-bit, so this ABI is probably stable. (I.e. you can count on r8-r11 being zeroed, and everything else being preserved.)
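
To make the rule concrete, here is a tiny Python model of the user-visible register state after int 0x80 returns to 64-bit code. It is only a restatement of the description above, not kernel code; the function name after_int80 and the register fill pattern are chosen for this sketch.

```python
MASK64 = (1 << 64) - 1

REG_NAMES = ["rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp",
             "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def after_int80(regs, return_value):
    """Model the registers as seen by user space after int 0x80 returns."""
    out = dict(regs)                      # everything is preserved by default
    for name in ("r8", "r9", "r10", "r11"):
        out[name] = 0                     # the call-clobbered "new" regs are zeroed
    out["rax"] = return_value & MASK64    # rax holds the return value
    return out


before = {name: 0x0123456789ABCDEF for name in REG_NAMES}
after = after_int80(before, 14)
changed = sorted(name for name in REG_NAMES if before[name] != after[name])
print(changed)  # ['r10', 'r11', 'r8', 'r9', 'rax']
```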

The return value is sign-extended to fill 64-bit rax. (Linux declares 32-bit sys_ functions as returning signed long.) This means that pointer return values (like from void *mmap()) need to be zero-extended before use in 64-bit addressing modes.
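
The arithmetic can be sketched in a few lines of Python. The helper names are invented for this illustration, and the mmap-style address 0xf7ff0000 is a hypothetical value picked only because its bit 31 is set.

```python
def sign_extend_32_to_64(value32):
    """What rax holds after int 0x80 returns a 32-bit result."""
    value32 &= 0xFFFFFFFF
    if value32 & 0x80000000:               # bit 31 set: fill the upper half with ones
        return value32 | 0xFFFFFFFF00000000
    return value32


def zero_extend_32(value64):
    """What e.g. `mov eax, eax` does: keep the low 32 bits, clear the rest."""
    return value64 & 0xFFFFFFFF


# An error code such as -EFAULT (-14) stays negative when read as a 64-bit value.
print(hex(sign_extend_32_to_64(-14)))          # 0xfffffffffffffff2

# A hypothetical pointer return value with bit 31 set is NOT usable as-is...
rax = sign_extend_32_to_64(0xF7FF0000)
print(hex(rax))                                # 0xfffffffff7ff0000
# ...until it is zero-extended back to the real 32-bit address.
print(hex(zero_extend_32(rax)))                # 0xf7ff0000
```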

Unlike sysenter, it preserves the original value of cs, so it returns to user-space in the same mode that it was called in. (Using sysenter results in the kernel setting cs to $__USER32_CS, which selects a descriptor for a 32-bit code segment.)


Older strace decodes int 0x80 incorrectly for 64-bit processes. It decodes as if the process had used syscall instead of int 0x80. This can be very confusing. e.g. strace prints write(0, NULL, 12 <unfinished ... exit status 1> for eax=1 / int $0x80, which is actually _exit(ebx), not write(rdi, rsi, rdx).
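
The confusion comes from the two ABIs numbering their calls differently. A minimal Python sketch of the lookup, with only the entries mentioned in this answer filled in (the table and function names are invented for illustration):

```python
# Partial call-number tables; only the entries discussed here.
NR_32 = {1: "exit", 4: "write"}     # unistd_32.h, used by int 0x80
NR_64 = {1: "write", 60: "exit"}    # unistd_64.h, used by syscall


def decode(nr, entry):
    """Name the call that the kernel actually runs for this entry method."""
    table = NR_32 if entry == "int 0x80" else NR_64
    return table.get(nr, "unknown")


print(decode(1, "int 0x80"))  # exit   (what the kernel does)
print(decode(1, "syscall"))   # write  (what an older strace reports)
```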

I don't know the exact version where the PTRACE_GET_SYSCALL_INFO feature was added, but Linux kernel 5.5 / strace 5.5 handle it. It misleadingly says the process "runs in 32-bit mode" but does decode correctly.


int 0x80 works as long as all arguments (including pointers) fit in the low 32 of a register. This is the case for static code and data in the default code model ("small") in the x86-64 SysV ABI. (Section 3.5.1: all symbols are known to be located in the virtual addresses in the range 0x00000000 to 0x7effffff, so you can do stuff like mov edi, hello (AT&T mov $hello, %edi) to get a pointer into a register with a 5 byte instruction).

But this is not the case for position-independent executables, which many Linux distros now configure gcc to make by default (and they enable ASLR for executables). For example, I compiled a hello.c on Arch Linux, and set a breakpoint at the start of main. The string constant passed to puts was at 0x555555554724, so a 32-bit ABI write system call would not work. (GDB disables ASLR by default, so you always see the same address from run to run, if you run from within GDB.)

Linux puts the stack near the "gap" between the upper and lower ranges of canonical addresses, i.e. with the top of the stack at 2^47-1. (Or somewhere random, with ASLR enabled). So rsp on entry to _start in a typical statically-linked executable is something like 0x7fffffffe550, depending on size of env vars and args. Truncating this pointer to esp does not point to any valid memory, so system calls with pointer inputs will typically return -EFAULT if you try to pass a truncated stack pointer. (And your program will crash if you truncate rsp to esp and then do anything with the stack, e.g. if you built 32-bit asm source as a 64-bit executable.)
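
A quick way to see which pointers survive is to mask them to 32 bits, which is all int 0x80 ever looks at. The Python sketch below uses the addresses quoted in this answer: the PIE string constant and the stack pointer from the text above, plus the RIP value 0x400112 from the test program further down as a stand-in for an address in a non-PIE static executable. The function name is invented for the example.

```python
def as_seen_by_int80(addr):
    """The kernel only receives the low 32 bits of each argument register."""
    return addr & 0xFFFFFFFF


examples = {
    "non-PIE static code/data": 0x400112,
    "PIE string constant": 0x555555554724,
    "stack pointer (rsp)": 0x7FFFFFFFE550,
}

for label, addr in examples.items():
    seen = as_seen_by_int80(addr)
    verdict = "ok" if seen == addr else "truncated"
    print(f"{label:26} {addr:#016x} -> {seen:#010x}  {verdict}")
```

Only the first address is unchanged; the other two turn into unrelated 32-bit values, which is why the kernel typically answers with -EFAULT.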



How it works in the kernel:

In the Linux source code, arch/x86/entry/entry_64_compat.S defines ENTRY(entry_INT80_compat). Both 32 and 64-bit processes use the same entry point when they execute int 0x80.

entry_64.S defines native entry points for a 64-bit kernel, which includes interrupt / fault handlers and syscall native system calls from long mode (aka 64-bit mode) processes.

entry_64_compat.S defines system-call entry-points from compat mode into a 64-bit kernel, plus the special case of int 0x80 in a 64-bit process. (sysenter in a 64-bit process may go to that entry point as well, but it pushes $__USER32_CS, so it will always return in 32-bit mode.) There's a 32-bit version of the syscall instruction, supported on AMD CPUs, and Linux supports it too for fast 32-bit system calls from 32-bit processes.

I guess a possible use-case for int 0x80 in 64-bit mode is if you wanted to use a custom code-segment descriptor that you installed with modify_ldt. int 0x80 pushes segment registers itself for use with iret, and Linux always returns from int 0x80 system calls via iret. The 64-bit syscall entry point sets pt_regs->cs and ->ss to constants, __USER_CS and __USER_DS. (It's normal that SS and DS use the same segment descriptors. Permission differences are done with paging, not segmentation.)

entry_32.S defines entry points into a 32-bit kernel, and is not involved at all.

The int 0x80 entry point in Linux 4.12's entry_64_compat.S:

/*
* 32-bit legacy system call entry.
*
* 32-bit x86 Linux system calls traditionally used the INT $0x80
* instruction. INT $0x80 lands here.
*
* This entry point can be used by 32-bit and 64-bit programs to perform
* 32-bit system calls. Instances of INT $0x80 can be found inline in
* various programs and libraries. It is also used by the vDSO's
* __kernel_vsyscall fallback for hardware that doesn't support a faster
* entry method. Restarted 32-bit system calls also fall back to INT
* $0x80 regardless of what instruction was originally used to do the
* system call.
*
* This is considered a slow path. It is not used by most libc
* implementations on modern hardware except during process startup.
...
*/
ENTRY(entry_INT80_compat)
... (see the github URL for the full source)

The code zero-extends eax into rax, then pushes all the registers onto the kernel stack to form a struct pt_regs. This is where it will restore from when the system call returns. It's in a standard layout for saved user-space registers (for any entry point), so ptrace from another process (like gdb or strace) will read and/or write that memory if it uses ptrace while this process is inside a system call. (ptrace modification of registers is one thing that makes return paths complicated for the other entry points. See comments.)

But it pushes $0 instead of r8/r9/r10/r11. (sysenter and AMD syscall32 entry points store zeros for r8-r15.)

I think this zeroing of r8-r11 is to match historical behaviour. Before the "Set up full pt_regs for all compat syscalls" commit, the entry point only saved the C call-clobbered registers. It dispatched directly from asm with call *ia32_sys_call_table(, %rax, 8), and those functions follow the calling convention, so they preserve rbx, rbp, rsp, and r12-r15. Zeroing r8-r11 instead of leaving them undefined was to avoid info leaks from a 64-bit kernel to 32-bit user-space (which could far jmp to a 64-bit code segment to read anything the kernel left there).

The current implementation (Linux 4.12) dispatches 32-bit-ABI system calls from C, reloading the saved ebx, ecx, etc. from pt_regs. (64-bit native system calls dispatch directly from asm, with only a mov %r10, %rcx needed to account for the small difference in calling convention between functions and syscall. Unfortunately it can't always use sysret, because CPU bugs make it unsafe with non-canonical addresses. It does try to, so the fast-path is pretty damn fast, although syscall itself still takes tens of cycles.)

Anyway, in current Linux, 32-bit syscalls (including int 0x80 from 64-bit) eventually end up in do_syscall_32_irqs_on(struct pt_regs *regs). It dispatches to a function pointer in ia32_sys_call_table, with 6 zero-extended args. This maybe avoids needing a wrapper around the 64-bit native syscall function in more cases to preserve that behaviour, so more of the ia32 table entries can be the native system call implementation directly.

Linux 4.12 arch/x86/entry/common.c

    if (likely(nr < IA32_NR_syscalls)) {
        /*
         * It's possible that a 32-bit syscall implementation
         * takes a 64-bit parameter but nonetheless assumes that
         * the high bits are zero. Make sure we zero-extend all
         * of the args.
         */
        regs->ax = ia32_sys_call_table[nr](
            (unsigned int)regs->bx, (unsigned int)regs->cx,
            (unsigned int)regs->dx, (unsigned int)regs->si,
            (unsigned int)regs->di, (unsigned int)regs->bp);
    }

    syscall_return_slowpath(regs);
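
The same dispatch logic can be modelled in a few lines of Python, which may help to see why the upper register halves are ignored. This is a toy model under stated assumptions, not a transcription of the kernel: the one-entry table and the fake_write handler are made up, the .rodata address 0x400200 is hypothetical, and returning -ENOSYS (-38) for an out-of-range number is how I understand the default to behave.

```python
U32 = 0xFFFFFFFF
ENOSYS = 38


def fake_write(fd, buf, count, _a4, _a5, _a6):
    """Hypothetical stand-in for a handler; just reports what it received."""
    return ("write", fd, hex(buf), count)


# Hypothetical stand-in for ia32_sys_call_table: call number -> handler.
ia32_table = {4: fake_write}


def do_int80(regs):
    """Toy model: index the table with eax, pass six zero-extended args."""
    nr = regs["ax"] & U32                      # only eax is used as the index
    handler = ia32_table.get(nr)
    if handler is None:
        return -ENOSYS
    args = [regs[name] & U32 for name in ("bx", "cx", "dx", "si", "di", "bp")]
    return handler(*args)


# Registers with garbage in the upper halves, as in the test program below.
garbage = 0xFFFFFFFF00000000
regs = {"ax": garbage + 4, "bx": garbage + 1, "cx": garbage + 0x400200,
        "dx": garbage + 14, "si": garbage, "di": garbage, "bp": garbage}
print(do_int80(regs))   # ('write', 1, '0x400200', 14)
```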

In older versions of Linux that dispatch 32-bit system calls from asm (like 64-bit still did until 4.15, see footnote 1), the int80 entry point itself puts args in the right registers with mov and xchg instructions, using 32-bit registers. It even uses mov %edx,%edx to zero-extend EDX into RDX (because arg3 happens to use the same register in both conventions). This code is duplicated in the sysenter and syscall32 entry points.

Footnote 1: Linux 4.15 (I think) introduced Spectre / Meltdown mitigations, and a major revamp of the entry points that made them a trampoline for the Meltdown case. It also sanitized the incoming registers to avoid user-space values other than actual args being in registers during the call (when some Spectre gadget might run), by storing them, zeroing everything, then calling to a C wrapper that reloads just the right widths of args from the struct saved on entry.

I'm planning to leave this answer describing the much simpler mechanism because the conceptually useful part here is that the kernel side of a syscall involves using EAX or RAX as an index into a table of function pointers, with the other incoming register values copied to the places where the calling convention wants args to go. I.e. syscall is just a way to make a call into the kernel, to its dispatch code.



Simple example / test program:

I wrote a simple Hello World (in NASM syntax) which sets all registers to have non-zero upper halves, then makes two write() system calls with int 0x80, one with a pointer to a string in .rodata (succeeds), the second with a pointer to the stack (fails with -EFAULT).

Then it uses the native 64-bit syscall ABI to write() the chars from the stack (64-bit pointer), and again to exit.

So all of these examples are using the ABIs correctly, except for the 2nd int 0x80 which tries to pass a 64-bit pointer and has it truncated.

If you built it as a position-independent executable, the first one would fail too. (You'd have to use a RIP-relative lea instead of mov to get the address of hello: into a register.)

I used gdb, but use whatever debugger you prefer. Use one that highlights changed registers since the last single-step. gdbgui works well for debugging asm source, but is not great for disassembly. Still, it does have a register pane that works well for integer regs at least, and it worked great on this example.

See the inline ;;; comments describing how registers are changed by system calls.

global _start
_start:
mov rax, 0x123456789abcdef
mov rbx, rax
mov rcx, rax
mov rdx, rax
mov rsi, rax
mov rdi, rax
mov rbp, rax
mov r8, rax
mov r9, rax
mov r10, rax
mov r11, rax
mov r12, rax
mov r13, rax
mov r14, rax
mov r15, rax

;; 32-bit ABI
mov rax, 0xffffffff00000004 ; high garbage + __NR_write (unistd_32.h)
mov rbx, 0xffffffff00000001 ; high garbage + fd=1
mov rcx, 0xffffffff00000000 + .hello
mov rdx, 0xffffffff00000000 + .hellolen
;std
after_setup: ; set a breakpoint here
int 0x80 ; write(1, hello, hellolen); 32-bit ABI
;; succeeds, writing to stdout
;;; changes to registers: r8-r11 = 0. rax=14 = return value

; ebx still = 1 = STDOUT_FILENO
push 'bye' + (0xa<<(3*8))
mov rcx, rsp ; rcx = 64-bit pointer that won't work if truncated
mov edx, 4
mov eax, 4 ; __NR_write (unistd_32.h)
int 0x80 ; write(ebx=1, ecx=truncated pointer, edx=4); 32-bit
;; fails, nothing printed
;;; changes to registers: rax=-14 = -EFAULT (from /usr/include/asm-generic/errno-base.h)

mov r10, rax ; save return value as exit status
mov r8, r15
mov r9, r15
mov r11, r15 ; make these regs non-zero again

;; 64-bit ABI
mov eax, 1 ; __NR_write (unistd_64.h)
mov edi, 1
mov rsi, rsp
mov edx, 4
syscall ; write(edi=1, rsi='bye\n' on the stack, rdx=4); 64-bit
;; succeeds: writes to stdout and returns 4 in rax
;;; changes to registers: rax=4 = length return value
;;; rcx = 0x400112 = RIP. r11 = 0x302 = eflags with an extra bit set.
;;; (This is not a coincidence, it's how sysret works. But don't depend on it, since iret could leave something else)

mov edi, r10d
;xor edi,edi
mov eax, 60 ; __NR_exit (unistd_64.h)
syscall ; _exit(edi = r10d = saved result of the second int 0x80 = -EFAULT); 64-bit
;; succeeds, exit status = low byte of -14 = 0xf2 = 242

section .rodata
_start.hello: db "Hello World!", 0xa, 0
_start.hellolen equ $ - _start.hello

Build it into a 64-bit static binary with

yasm -felf64 -Worphan-labels -gdwarf2 abi32-from-64.asm
ld -o abi32-from-64 abi32-from-64.o

Run gdb ./abi32-from-64. In gdb, run set disassembly-flavor intel and layout reg if you don't have that in your ~/.gdbinit already. (GAS .intel_syntax is like MASM, not NASM, but they're close enough that it's easy to read if you like NASM syntax.)

(gdb)  set disassembly-flavor intel
(gdb) layout reg
(gdb) b after_setup
(gdb) r
(gdb) si # step instruction
press return to repeat the last command, keep stepping

Press control-L when gdb's TUI mode gets messed up. This happens easily, even when programs don't print to stdout themselves.

On x64 Linux, what is the difference between syscall, int 0x80 and ret to exit a program?

If you use printf or other libc functions, it's best to ret from main or call exit. (Which are equivalent; main's caller will call the libc exit function.)

Otherwise, if you were only making raw system calls like write with syscall, it's also appropriate and consistent to exit that way. Even then, ret or call exit are 100% fine in main.

If you want to work without libc at all, e.g. put your code under _start: instead of main: and link with ld or gcc -static -nostdlib, then you can't use ret. Use mov eax, 231 (__NR_exit_group) / syscall.

main is a real & normal function like any other (called with a valid return address), but _start (the process entry point) isn't. On entry to _start, the stack holds argc and argv, so trying to ret would set RIP=argc, and then code-fetch would segfault on that unmapped address. See "Nasm segmentation fault on RET in _start".



System call vs. ret-from-main

Exiting via a system call is like calling _exit() in C: it skips atexit() and libc cleanup, notably not flushing any buffered stdout output (line-buffered on a terminal, fully buffered otherwise).
This leads to symptoms such as those in "Using printf in assembly leads to empty output when piping, but works on the terminal" (or, if your output doesn't end with \n, even on a terminal).
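
The same effect can be reproduced without writing any asm. The Python sketch below is an analogy, not the libc case itself: it relies on Python's own output buffering rather than libc stdio, and on os._exit() exiting immediately without flushing, while sys.exit() goes through normal interpreter shutdown. The helper name run_child is invented for the example.

```python
import os
import subprocess
import sys


def run_child(snippet):
    """Run a Python one-liner with stdout connected to a pipe; return its output."""
    # Drop PYTHONUNBUFFERED so the child uses its default (buffered) pipe output.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    result = subprocess.run([sys.executable, "-c", snippet],
                            stdout=subprocess.PIPE, env=env, check=False)
    return result.stdout


# Raw exit: the buffered "hello" is never written to the pipe.
lost = run_child("import os; print('hello'); os._exit(0)")

# Normal exit: buffers are flushed during shutdown.
kept = run_child("import sys; print('hello'); sys.exit(0)")

print(repr(lost), repr(kept))
```

When stdout is a pipe the output is block-buffered, so the first child's text is still sitting in a user-space buffer when the process exits, exactly like printf followed by a raw exit system call.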

main is a function, called (indirectly) from CRT startup code. (Assuming you link your program normally, like you would a C program.) Your hand-written main works exactly like a compiler-generated C main function would. Its caller (__libc_start_main) really does do something like int result = main(argc, argv); exit(result);, e.g. call rax (pointer passed by _start) / mov edi, eax / call exit.

So returning from main is exactly (see footnote 1) like calling exit.

  • Syscall implementation of exit() for a comparison of the relevant C functions, exit vs. _exit vs. exit_group and the underlying asm system calls.

  • C question: What is the difference between exit and return? is primarily about exit() vs. return, although there is mention of calling _exit() directly, i.e. just making a system call. It's applicable because C main compiles to an asm main just like you'd write by hand.

Footnote 1: You can invent a hypothetical intentionally weird case where it's different. e.g. you used stack space in main as your stdio buffer with sub rsp, 1024 / mov rsi, rsp / ... / call setvbuf. Then returning from main would involve putting RSP above that buffer, and __libc_start_main's call to exit could overwrite some of that buffer with return addresses and locals before execution reached the fflush cleanup. This mistake is more obvious in asm than C because you need leave or mov rsp, rbp or add rsp, 1024 or something to point RSP at your return address.

In C++, return from main runs destructors for its locals (before global/static exit stuff), exit doesn't. But that just means the compiler makes asm that does more stuff before actually running the ret, so it's all manual in asm, like in C.

The other difference is of course the asm / calling-convention details: exit status in EAX (return value) or EDI (first arg), and of course to ret you have to have RSP pointing at your return address, like it was on function entry. With call exit you don't, and you can even do a conditional tailcall of exit like jne exit. Since it's a noreturn function, you don't really need RSP pointing at a valid return address. (RSP should be aligned by 16 before a call, though, or RSP%16 = 8 before a tailcall, matching the alignment after call pushes a return address. It's unlikely that exit / fflush cleanup will do any alignment-required stores/loads to the stack, but it's a good habit to get this right.)

(This whole footnote is about ret vs. call exit, not syscall, so it's a bit of a tangent from the rest of the answer. You can also run syscall without caring where the stack-pointer points.)



SYS_exit vs. SYS_exit_group raw system calls

The raw SYS_exit system call is for exiting the current thread, like pthread_exit().

(eax=60 / syscall, or eax=1 / int 0x80).

SYS_exit_group is for exiting the whole program, like _exit.

(eax=231 / syscall, or eax=252 / int 0x80).

In a single-threaded program you can use either, but conceptually exit_group makes more sense to me if you're going to use raw system calls. glibc's _exit() wrapper function actually uses the exit_group system call (since glibc 2.3). See Syscall implementation of exit() for more details.

However, nearly all the hand-written asm you'll ever see uses SYS_exit (see note 1). It's not "wrong", and SYS_exit is perfectly acceptable for a program that didn't start more threads. Especially if you're trying to save code size with xor eax,eax / inc eax (3 bytes in 32-bit mode) or push 60 / pop rax (3 bytes in 64-bit mode), while push 231/pop rax would be even larger than mov eax,231 because it doesn't fit in a signed imm8.

Note 1: (Usually actually hard-coding the number, not using __NR_... constants from asm/unistd.h or their SYS_... names from sys/syscall.h)

And historically, it's all there was. Note that in unistd_32.h, __NR_exit has call number 1, but __NR_exit_group = 252 wasn't added until years later when the kernel gained support for tasks that share virtual address space with their parent, aka threads started by clone(2). This is when SYS_exit conceptually became "exit current thread". (But one could easily and convincingly argue that in a single-threaded program, SYS_exit does still mean exit the whole program, because it only differs from exit_group if there are multiple threads.)

To be honest, I've never used eax=252 / int 0x80 in anything, only ever eax=1. It's only in 64-bit code where I often use mov eax,231 instead of mov eax,60 because neither number is "simple" or memorable the way 1 is, so might as well be a cool guy and use the "modern" exit_group way in my single-threaded toy program / experiment / microbenchmark / SO answer. :P (If I didn't enjoy tilting at windmills, I wouldn't spend so much time on assembly, especially on SO.)

And BTW, I usually use NASM for one-off experiments so it's inconvenient to use pre-defined symbolic constants for call numbers; with GCC to preprocess a .S before running GAS you can make your code self-documenting with #include <sys/syscall.h> so you can use mov $SYS_exit_group, %eax (or $__NR_exit_group), or mov eax, __NR_exit_group with .intel_syntax noprefix.
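
The byte counts quoted earlier for the various ways of loading the exit call number can be checked from the instruction encodings. The Python sketch below hard-codes the standard opcodes (31 C0 for xor eax,eax; 40 for the 32-bit-mode inc eax; 6A ib / 68 id for push; 58 for pop rax; B8 id for mov eax,imm32). The helper names are invented for this example, and it is not a general assembler.

```python
def fits_imm8(value):
    """True if the value can be encoded as a sign-extended 8-bit immediate."""
    return -128 <= value <= 127


def push_imm(value):
    """push imm8 (6A ib) when possible, otherwise push imm32 (68 id)."""
    if fits_imm8(value):
        return bytes([0x6A, value & 0xFF])
    return bytes([0x68]) + (value & 0xFFFFFFFF).to_bytes(4, "little")


def mov_eax_imm32(value):
    """mov eax, imm32 (B8 id)."""
    return bytes([0xB8]) + (value & 0xFFFFFFFF).to_bytes(4, "little")


POP_RAX = bytes([0x58])
XOR_EAX_EAX = bytes([0x31, 0xC0])
INC_EAX_32BIT = bytes([0x40])        # only means inc eax in 32-bit mode

sizes = {
    "xor eax,eax / inc eax": len(XOR_EAX_EAX + INC_EAX_32BIT),
    "push 60 / pop rax": len(push_imm(60) + POP_RAX),
    "mov eax,231": len(mov_eax_imm32(231)),
    "push 231 / pop rax": len(push_imm(231) + POP_RAX),
}
for text, size in sizes.items():
    print(f"{text:24} {size} bytes")
```

231 is above 127, so it needs the imm32 form of push, which makes the push/pop pair one byte longer than a plain mov.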



Don't use the 32-bit int 0x80 ABI in 64-bit code:

"What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?" explains what happens if you use the int 0x80 ABI (provided by the kernel's CONFIG_IA32_EMULATION option) in 64-bit code.

It's totally fine for just exiting, as long as your kernel has that support compiled in, otherwise it will segfault just like any other random int number like int 0x7f. (e.g. on WSL1, or people that built custom kernels and disabled that support.)

But the only reason you'd do it that way in asm would be so you could build the same source file with nasm -felf32 or nasm -felf64. (You can't use syscall in 32-bit code, except on some AMD CPUs which have a 32-bit version of syscall. And the 32-bit ABI uses different call numbers anyway so this wouldn't let the same source be useful for both modes.)


Related:

  • Why am I allowed to exit main using ret? (CRT startup code calls main, you're not returning directly to the kernel.)
  • Nasm segmentation fault on RET in _start - you can't ret from _start
  • Using printf in assembly leads to empty output when piping, but works on the terminal - stdout buffer (not) flushing with raw system call exit
  • Syscall implementation of exit() - call exit vs. mov eax,60/syscall (_exit) vs. mov eax,231/syscall (exit_group).
  • Can't call C standard library function on 64-bit Linux from assembly (yasm) code - modern Linux distros config GCC in a way that call exit or call puts won't link with nasm -felf64 foo.asm && gcc foo.o.
  • Is main() really start of a C++ program? - Ciro's answer is a deep dive into how glibc + its CRT startup code actually call main (including x86-64 asm disassembly in GDB), and shows the glibc source code for __libc_start_main.
  • Linux x86 Program Start Up, or: How the heck do we get to main()? - 32-bit asm, and more detail than you'll probably want until you're a lot more comfortable with asm, but if you've ever wondered why CRT runs so much code before getting to main, that covers what's happening at a level that's a couple steps up from using GDB with starti (stop at the process entry point, e.g. in the dynamic linker's _start) and stepi until you get to your own _start or main.
  • https://stackoverflow.com/tags/x86/info lots of good links about this and everything else.

What does int 0x80 mean in assembly code?

It passes control to interrupt vector 0x80.

See http://en.wikipedia.org/wiki/Interrupt_vector

On Linux, it was used to invoke the kernel's system-call handler (system_call). Of course, on another OS this could mean something totally different.

Does int 0x80 overwrite register values?

int 0x80 just causes a software interrupt. In your case it's being used to make a system call. Whether or not any registers are affected will depend on the particular system call you're invoking and the system call calling convention of your platform. Read your documentation for the details.

Specifically, from the System V Application Binary Interface x86-64™ Architecture Processor Supplement, Appendix A, x86-64 Linux Kernel Conventions:

The interface between the C library and the Linux kernel is the same as for the user-level applications...

For user-level applications, r8 is a scratch register, which means it's caller-saved. If you want it to be preserved over the system call, you'll need to do it yourself.


