Why Can't I sys_write from a Pointer to Stack Memory, Using int 0x80

Why can't I sys_write from a pointer to stack memory, using int 0x80?

amd64 uses a different method for system calls than int 0x80, although int 0x80 may still work if the kernel was built with 32-bit (IA-32) emulation. Whereas on x86 one would do:

mov eax, SYSCALL_NUMBER
mov ebx, param1
mov ecx, param2
mov edx, param3
int 0x80

on amd64 one would instead do this:

mov rax, SYSCALL_NUMBER_64 ; different from the x86 equivalent, usually
mov rdi, param1
mov rsi, param2
mov rdx, param3
syscall

For what you want to do, consider the following example:

        bits 64
        global _start

section .text

_start:
        push            0x0a424242
        mov             rdx, 04h
        lea             rsi, [rsp]
        call            write
        call            exit
exit:
        mov             rax, 60     ; exit()
        xor             rdi, rdi    ; exit status 0
        syscall

write:
        mov             rax, 1      ; write()
        mov             rdi, 1      ; stdout
        syscall
        ret
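For comparison, here is a minimal C sketch of the same idea, assuming Linux and a libc that provides the syscall() wrapper. The name write_from_stack is made up for illustration. The wrapper goes through the native 64-bit ABI, so a buffer on the stack is a perfectly good source:

```c
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall(), STDOUT_FILENO */

/* Hypothetical helper: write 4 bytes that live on the stack to stdout
 * through the native system-call ABI. Returns the byte count, or -1
 * with errno set (the libc wrapper's convention, not the raw -errno
 * that the kernel leaves in rax). */
long write_from_stack(void)
{
    char buf[4] = { 'B', 'B', 'B', '\n' };
    return syscall(SYS_write, STDOUT_FILENO, buf, sizeof buf);
}
```

Calling write_from_stack() from main should print BBB and a newline, just like the asm version.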

What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?

TL;DR: int 0x80 works when used correctly, as long as any pointers fit in 32 bits (stack pointers don't fit). But beware that strace decodes it wrong unless you have a very recent strace + kernel.

int 0x80 zeros r8-r11 for reasons, and preserves everything else. Use it exactly like you would in 32-bit code, with the 32-bit call numbers. (Or better, don't use it!)

Not all systems even support int 0x80: The Windows Subsystem for Linux version 1 (WSL1) is strictly 64-bit only: int 0x80 doesn't work at all. It's also possible to build Linux kernels without IA-32 emulation. (No support for 32-bit executables, no support for 32-bit system calls). See this re: making sure your WSL is actually WSL2 (which uses an actual Linux kernel in a VM.)

The details: what's saved/restored, which parts of which regs the kernel uses

int 0x80 uses eax (not the full rax) as the system-call number, dispatching to the same table of function-pointers that 32-bit user-space int 0x80 uses. (These pointers are to sys_whatever implementations or wrappers for the native 64-bit implementation inside the kernel. System calls are really function calls across the user/kernel boundary.)

Only the low 32 bits of arg registers are passed. The upper halves of rbx-rbp are preserved, but ignored by int 0x80 system calls. Note that passing a bad pointer to a system call doesn't result in SIGSEGV; instead the system call returns -EFAULT. If you don't check error return values (with a debugger or tracing tool), it will appear to silently fail.

All registers (except eax of course) are saved/restored (including RFLAGS, and the upper 32 of integer regs), except that r8-r11 are zeroed. r12-r15 are call-preserved in the x86-64 SysV ABI's function calling convention, so the registers that get zeroed by int 0x80 in 64-bit are the call-clobbered subset of the "new" registers that AMD64 added.

This behaviour has been preserved over some internal changes to how register-saving was implemented inside the kernel, and comments in the kernel mention that it's usable from 64-bit, so this ABI is probably stable. (I.e. you can count on r8-r11 being zeroed, and everything else being preserved.)

The return value is sign-extended to fill 64-bit rax. (Linux declares 32-bit sys_ functions as returning signed long.) This means that pointer return values (like from void *mmap()) need to be zero-extended before use in 64-bit addressing modes.
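The return-value handling can be modelled in a few lines of C. This is only a sketch of the arithmetic, not kernel code; the function names are invented, and the [-4095, -1] error range is the usual Linux -errno convention rather than something stated above.

```c
#include <stdint.h>

/* What a 64-bit caller sees in rax after int 0x80: the 32-bit result,
 * sign-extended to 64 bits. (The unsigned-to-signed cast wraps as
 * two's complement on gcc and clang.) */
int64_t rax_after_int80(uint32_t eax_result)
{
    return (int64_t)(int32_t)eax_result;
}

/* Assumed convention: errors come back as -errno in [-4095, -1]. */
int is_syscall_error(int64_t rax)
{
    return rax < 0 && rax >= -4095;
}

/* Recover a pointer-like result (e.g. from mmap) by dropping the
 * sign-extended upper half, i.e. zero-extending the low 32 bits. */
uint64_t pointer_from_rax(int64_t rax)
{
    return (uint64_t)(uint32_t)rax;
}
```

For example, a raw result of 0xfffffff2 shows up as -14 (-EFAULT), while a high 32-bit address such as 0xf7ff0000 looks negative in rax but is not in the error range, and pointer_from_rax() gives the usable address back.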

Unlike sysenter, int 0x80 preserves the original value of cs, so it returns to user-space in the same mode that it was called in. (Using sysenter results in the kernel setting cs to $__USER32_CS, which selects a descriptor for a 32-bit code segment.)

Older strace decodes int 0x80 incorrectly for 64-bit processes. It decodes as if the process had used syscall instead of int 0x80. This can be very confusing. e.g. strace prints write(0, NULL, 12 <unfinished ... exit status 1> for eax=1 / int $0x80, which is actually _exit(ebx), not write(rdi, rsi, rdx).

I don't know the exact version where the PTRACE_GET_SYSCALL_INFO feature was added, but Linux kernel 5.5 / strace 5.5 handle it. It misleadingly says the process "runs in 32-bit mode" but does decode correctly. (Example).
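To see why the mis-decoding is so confusing, here is a small C sketch with a toy lookup table. It only contains the call numbers mentioned in this answer (write and exit), and decode_nr is a hypothetical helper, not anything strace actually uses.

```c
#include <stddef.h>

struct nr_name { int nr; const char *name; };

/* Only the call numbers mentioned in this answer; not a complete table. */
static const struct nr_name abi32[] = { { 1, "exit" },  { 4, "write" } };
static const struct nr_name abi64[] = { { 1, "write" }, { 60, "exit" } };

/* Hypothetical decoder: look up a call number in the table for one ABI.
 * is_int80 != 0 selects the 32-bit (int 0x80) numbering. */
const char *decode_nr(int nr, int is_int80)
{
    const struct nr_name *t = is_int80 ? abi32 : abi64;
    size_t n = is_int80 ? sizeof abi32 / sizeof abi32[0]
                        : sizeof abi64 / sizeof abi64[0];
    for (size_t i = 0; i < n; i++)
        if (t[i].nr == nr)
            return t[i].name;
    return "unknown";
}
```

With eax=1, decode_nr(1, 1) gives "exit" (what the kernel really does for int 0x80), while decode_nr(1, 0) gives "write" (what an older strace prints).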

int 0x80 works as long as all arguments (including pointers) fit in the low 32 of a register. This is the case for static code and data in the default code model ("small") in the x86-64 SysV ABI. (Section 3.5.1: all symbols are known to be located in the virtual addresses in the range 0x00000000 to 0x7effffff, so you can do stuff like mov edi, hello (AT&T mov $hello, %edi) to get a pointer into a register with a 5 byte instruction).

But this is not the case for position-independent executables, which many Linux distros now configure gcc to make by default (and they enable ASLR for executables). For example, I compiled a hello.c on Arch Linux, and set a breakpoint at the start of main. The string constant passed to puts was at 0x555555554724, so a 32-bit ABI write system call would not work. (GDB disables ASLR by default, so you always see the same address from run to run, if you run from within GDB.)

Linux puts the stack near the "gap" between the upper and lower ranges of canonical addresses, i.e. with the top of the stack at 2^47-1. (Or somewhere random, with ASLR enabled). So rsp on entry to _start in a typical statically-linked executable is something like 0x7fffffffe550, depending on size of env vars and args. Truncating this pointer to esp does not point to any valid memory, so system calls with pointer inputs will typically return -EFAULT if you try to pass a truncated stack pointer. (And your program will crash if you truncate rsp to esp and then do anything with the stack, e.g. if you built 32-bit asm source as a 64-bit executable.)
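The truncation itself is easy to reproduce in C. This is just a sketch of the arithmetic using the example addresses from this answer; the helper names are invented.

```c
#include <stdint.h>

/* What the int 0x80 ABI actually receives: the low 32 bits only. */
uint32_t truncate_to_32(uint64_t addr)
{
    return (uint32_t)addr;
}

/* A pointer survives the 32-bit ABI only if its upper 32 bits are zero. */
int survives_int80(uint64_t addr)
{
    return addr <= UINT32_MAX;
}
```

The stack address 0x7fffffffe550 truncates to 0xffffe550, which is a different (and typically unmapped) address, and the PIE string address 0x555555554724 fails the check too. An address like 0x400000 from a non-PIE static executable passes. To check a real pointer, pass (uintptr_t)&some_local to survives_int80().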

How it works in the kernel:

In the Linux source code, arch/x86/entry/entry_64_compat.S defines ENTRY(entry_INT80_compat). Both 32 and 64-bit processes use the same entry point when they execute int 0x80.

entry_64.S defines native entry points for a 64-bit kernel, which includes interrupt / fault handlers and syscall native system calls from long mode (aka 64-bit mode) processes.

entry_64_compat.S defines system-call entry-points from compat mode into a 64-bit kernel, plus the special case of int 0x80 in a 64-bit process. (sysenter in a 64-bit process may go to that entry point as well, but it pushes $__USER32_CS, so it will always return in 32-bit mode.) There's a 32-bit version of the syscall instruction, supported on AMD CPUs, and Linux supports it too for fast 32-bit system calls from 32-bit processes.

I guess a possible use-case for int 0x80 in 64-bit mode is if you wanted to use a custom code-segment descriptor that you installed with modify_ldt. int 0x80 pushes segment registers itself for use with iret, and Linux always returns from int 0x80 system calls via iret. The 64-bit syscall entry point sets pt_regs->cs and ->ss to constants, __USER_CS and __USER_DS. (It's normal that SS and DS use the same segment descriptors. Permission differences are done with paging, not segmentation.)

entry_32.S defines entry points into a 32-bit kernel, and is not involved at all.

The int 0x80 entry point in Linux 4.12's entry_64_compat.S:

/*
 * 32-bit legacy system call entry.
 *
 * 32-bit x86 Linux system calls traditionally used the INT $0x80
 * instruction.  INT $0x80 lands here.
 *
 * This entry point can be used by 32-bit and 64-bit programs to perform
 * 32-bit system calls.  Instances of INT $0x80 can be found inline in
 * various programs and libraries.  It is also used by the vDSO's
 * __kernel_vsyscall fallback for hardware that doesn't support a faster
 * entry method.  Restarted 32-bit system calls also fall back to INT
 * $0x80 regardless of what instruction was originally used to do the
 * system call.
 *
 * This is considered a slow path.  It is not used by most libc
 * implementations on modern hardware except during process startup.
 ...
 */
 ENTRY(entry_INT80_compat)
 ...  (see the github URL for the full source)

The code zero-extends eax into rax, then pushes all the registers onto the kernel stack to form a struct pt_regs. This is where it will restore from when the system call returns. It's in a standard layout for saved user-space registers (for any entry point), so ptrace from other processes (like gdb or strace) will read and/or write that memory if they use ptrace while this process is inside a system call. (ptrace modification of registers is one thing that makes return paths complicated for the other entry points. See comments.)

But it pushes $0 instead of r8/r9/r10/r11. (sysenter and AMD syscall32 entry points store zeros for r8-r15.)

I think this zeroing of r8-r11 is to match historical behaviour. Before the Set up full pt_regs for all compat syscalls commit, the entry point only saved the C call-clobbered registers. It dispatched directly from asm with call *ia32_sys_call_table(, %rax, 8), and those functions follow the calling convention, so they preserve rbx, rbp, rsp, and r12-r15. Zeroing r8-r11 instead of leaving them undefined was to avoid info leaks from a 64-bit kernel to 32-bit user-space (which could far jmp to a 64-bit code segment to read anything the kernel left there).

The current implementation (Linux 4.12) dispatches 32-bit-ABI system calls from C, reloading the saved ebx, ecx, etc. from pt_regs. (64-bit native system calls dispatch directly from asm, with only a mov %r10, %rcx needed to account for the small difference in calling convention between functions and syscall. Unfortunately it can't always use sysret, because CPU bugs make it unsafe with non-canonical addresses. It does try to, so the fast-path is pretty damn fast, although syscall itself still takes tens of cycles.)

Anyway, in current Linux, 32-bit syscalls (including int 0x80 from 64-bit) eventually end up in do_syscall_32_irqs_on(struct pt_regs *regs). It dispatches to a function pointer in ia32_sys_call_table, with 6 zero-extended args. This maybe avoids needing a wrapper around the 64-bit native syscall function in more cases to preserve that behaviour, so more of the ia32 table entries can be the native system call implementation directly.

Linux 4.12 arch/x86/entry/common.c

if (likely(nr < IA32_NR_syscalls)) {
  /*
   * It's possible that a 32-bit syscall implementation
   * takes a 64-bit parameter but nonetheless assumes that
   * the high bits are zero.  Make sure we zero-extend all
   * of the args.
   */
  regs->ax = ia32_sys_call_table[nr](
      (unsigned int)regs->bx, (unsigned int)regs->cx,
      (unsigned int)regs->dx, (unsigned int)regs->si,
      (unsigned int)regs->di, (unsigned int)regs->bp);
}

syscall_return_slowpath(regs);

In older versions of Linux that dispatch 32-bit system calls from asm (like 64-bit still did until 4.15¹), the int80 entry point itself puts args in the right registers with mov and xchg instructions, using 32-bit registers. It even uses mov %edx,%edx to zero-extend EDX into RDX (because arg3 happens to use the same register in both conventions). code here. This code is duplicated in the sysenter and syscall32 entry points.

Footnote 1: Linux 4.15 (I think) introduced Spectre / Meltdown mitigations, and a major revamp of the entry points that made them a trampoline for the Meltdown case. It also sanitized the incoming registers to avoid user-space values other than actual args being in registers during the call (when some Spectre gadget might run), by storing them, zeroing everything, then calling to a C wrapper that reloads just the right widths of args from the struct saved on entry.

I'm planning to leave this answer describing the much simpler mechanism because the conceptually useful part here is that the kernel side of a syscall involves using EAX or RAX as an index into a table of function pointers, with other incoming register values copied to the places where the calling convention wants args to go. I.e. syscall is just a way to make a call into the kernel, to its dispatch code.
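That table-of-function-pointers idea can be sketched as a user-space toy in C. Everything here (toy_regs, toy_table, toy_dispatch_32 and the two toy "system calls") is made up for illustration and is far simpler than the kernel code quoted above; it takes 3 args instead of 6 and returns -ENOSYS for an unknown number, which is the usual Linux convention.

```c
#include <errno.h>
#include <stdint.h>

/* Toy stand-in for struct pt_regs: just the registers this model uses. */
struct toy_regs { uint64_t ax, bx, cx, dx; };

typedef int64_t (*toy_sys_fn)(uint64_t, uint64_t, uint64_t);

static int64_t toy_sum(uint64_t a, uint64_t b, uint64_t c)
{
    return (int64_t)(a + b + c);
}

/* Returns the upper half of its first arg: nonzero only if the
 * dispatcher forgot to zero-extend from 32 bits. */
static int64_t toy_high_half(uint64_t a, uint64_t b, uint64_t c)
{
    (void)b; (void)c;
    return (int64_t)(a >> 32);
}

static const toy_sys_fn toy_table[] = { toy_sum, toy_high_half };

/* Model of the 32-bit dispatch: only eax selects the call, and only
 * the low 32 bits of each arg register are passed along. */
void toy_dispatch_32(struct toy_regs *regs)
{
    uint32_t nr = (uint32_t)regs->ax;
    if (nr < sizeof toy_table / sizeof toy_table[0])
        regs->ax = (uint64_t)toy_table[nr]((uint32_t)regs->bx,
                                           (uint32_t)regs->cx,
                                           (uint32_t)regs->dx);
    else
        regs->ax = (uint64_t)(int64_t)-ENOSYS;
}
```

Setting ax = 0xffffffff00000000 and bx = 0xffffffff00000001 with cx = 2, dx = 3 still dispatches to entry 0 and returns 6: the high garbage is ignored, the same way the test program below sets up its registers.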

Simple example / test program:

I wrote a simple Hello World (in NASM syntax) which sets all registers to have non-zero upper halves, then makes two write() system calls with int 0x80, one with a pointer to a string in .rodata (succeeds), the second with a pointer to the stack (fails with -EFAULT).

Then it uses the native 64-bit syscall ABI to write() the chars from the stack (64-bit pointer), and again to exit.

So all of these examples are using the ABIs correctly, except for the 2nd int 0x80 which tries to pass a 64-bit pointer and has it truncated.

If you built it as a position-independent executable, the first one would fail too. (You'd have to use a RIP-relative lea instead of mov to get the address of hello: into a register.)

I used gdb, but use whatever debugger you prefer. Use one that highlights changed registers since the last single-step. gdbgui works well for debugging asm source, but is not great for disassembly. Still, it does have a register pane that works well for integer regs at least, and it worked great on this example.

See the inline ;;; comments describing how registers are changed by system calls.

global _start
_start:
    mov  rax, 0x123456789abcdef
    mov  rbx, rax
    mov  rcx, rax
    mov  rdx, rax
    mov  rsi, rax
    mov  rdi, rax
    mov  rbp, rax
    mov  r8, rax
    mov  r9, rax
    mov  r10, rax
    mov  r11, rax
    mov  r12, rax
    mov  r13, rax
    mov  r14, rax
    mov  r15, rax

    ;; 32-bit ABI
    mov  rax, 0xffffffff00000004          ; high garbage + __NR_write (unistd_32.h)
    mov  rbx, 0xffffffff00000001          ; high garbage + fd=1
    mov  rcx, 0xffffffff00000000 + .hello
    mov  rdx, 0xffffffff00000000 + .hellolen
    ;std
after_setup:       ; set a breakpoint here
    int  0x80                   ; write(1, hello, hellolen);   32-bit ABI
    ;; succeeds, writing to stdout
;;; changes to registers:   r8-r11 = 0.  rax=14 = return value

    ; ebx still = 1 = STDOUT_FILENO
    push 'bye' + (0xa<<(3*8))
    mov  rcx, rsp               ; rcx = 64-bit pointer that won't work if truncated
    mov  edx, 4
    mov  eax, 4                 ; __NR_write (unistd_32.h)
    int  0x80                   ; write(ebx=1, ecx=truncated pointer,  edx=4);  32-bit
    ;; fails, nothing printed
;;; changes to registers: rax=-14 = -EFAULT  (from /usr/include/asm-generic/errno-base.h)

    mov  r10, rax               ; save return value as exit status
    mov  r8, r15
    mov  r9, r15
    mov  r11, r15               ; make these regs non-zero again

    ;; 64-bit ABI
    mov  eax, 1                 ; __NR_write (unistd_64.h)
    mov  edi, 1
    mov  rsi, rsp
    mov  edx, 4
    syscall                     ; write(edi=1, rsi='bye\n' on the stack,  rdx=4);  64-bit
    ;; succeeds: writes to stdout and returns 4 in rax
;;; changes to registers: rax=4 = length return value
;;; rcx = 0x400112 = RIP.   r11 = 0x302 = eflags with an extra bit set.
;;; (This is not a coincidence, it's how sysret works.  But don't depend on it, since iret could leave something else)

    mov  edi, r10d
    ;xor  edi,edi
    mov  eax, 60                ; __NR_exit (unistd_64.h)
    syscall                     ; _exit(edi = second int 0x80 result);  64-bit
    ;; succeeds, exit status = low byte of second int 0x80 result (-14) = 242

section .rodata
_start.hello:    db "Hello World!", 0xa, 0
_start.hellolen  equ   $ - _start.hello

Build it into a 64-bit static binary with

yasm -felf64 -Worphan-labels -gdwarf2 abi32-from-64.asm
ld -o abi32-from-64 abi32-from-64.o

Run gdb ./abi32-from-64. In gdb, run set disassembly-flavor intel and layout reg if you don't have that in your ~/.gdbinit already. (GAS .intel_syntax is like MASM, not NASM, but they're close enough that it's easy to read if you like NASM syntax.)

(gdb)  set disassembly-flavor intel
(gdb)  layout reg
(gdb)  b  after_setup
(gdb)  r
(gdb)  si                     # step instruction
    press return to repeat the last command, keep stepping

Press control-L when gdb's TUI mode gets messed up. This happens easily, even when programs don't print to stdout themselves.

Example of an x86_64 system call which reads parameters from the stack or from fixed memory locations

I don't know about Windows; maybe it does something different.

Linux only ever uses 6 registers for system-call args, not fixed or user-stack locations. If a system call needs more things, one of the args will be a pointer to a struct (like clone3). I think most other x86-64 OSes that use the x86-64 System V ABI are similar. (i.e. all non-Windows ones.)

Linux with sysenter from 32-bit user-space may look at the user-space stack for something, but I think just what it needs to be able to return to user-space, not args per-se.

*BSD and MacOS with 32-bit int 0x80 read args from user stack memory, instead of registers, but for 64-bit code they use the x86-64 System V ABI the way Linux does.

The *BSD int 0x80 convention of reading from user ESP is optimized for libc system-call wrappers: it looks for the first arg at 4(%esp), leaving room for a return address at 0(%esp). So the libc wrapper for most system calls could just be int $0x80 / ret, because i386 System V uses a stack-args calling convention.

Obviously it's possible to make a system-calling convention that isn't exclusively register-based, like *BSD in 32-bit mode. It means extra checking, though, since the kernel can't trust any pointers from user-space, not even RSP. For example, mov rsp, 0xffffff...1230 / syscall could try to trick the kernel into reading args from somewhere in kernel space, with the error return value maybe telling you something about what they were. Or causing an invalid page fault if you pass a bad address (or GPF for a non-canonical address).

So it's less convenient. But of course a kernel needs to be able to sanity-check pointer args to syscalls because many like read do take pointers to user-space memory. Still, having to do that on every system call, even ones that should be simpler, is less good.

Register args also lets hand-written asm set up args for a C function safely without needing to do any address checking. Or in modern Linux, just pass a pointer to the register-save area, with C code deciding how many and what width to load. I guess this makes Spectre and ROP attacks harder by not letting user-space enter the kernel with so many user-controlled values in registers for system calls that don't take 6x 64-bit args.

OTOH, with args all on the user stack, an asm entry point just has to pass the user stack pointer to some C function that does the checking and loading.
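As a rough illustration of that last point, here is a toy C model of a kernel-side helper that fetches stack args the way the *BSD int 0x80 convention is described above (first arg at 4(%esp)). User memory is modelled as a plain array of 32-bit words, and the helper name, the word-aligned restriction, and the -1 error value are all simplifications invented for this sketch; a real kernel would return something like -EFAULT.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model: user memory is an array of 32-bit words, and byte address A
 * maps to word A/4. Fetch argument n, skipping the return-address slot
 * at 0(%esp). Returns 0 on success, -1 if the slot is not inside the
 * modelled user memory (the check the kernel can't skip). */
int fetch_stack_arg(const uint32_t *user_mem, size_t user_words,
                    uint32_t esp, unsigned n, uint32_t *out)
{
    /* 64-bit arithmetic so a huge esp can't wrap around to a small address */
    uint64_t addr = (uint64_t)esp + 4 + 4 * (uint64_t)n;
    if (addr % 4 != 0 || addr / 4 >= user_words)
        return -1;
    *out = user_mem[addr / 4];
    return 0;
}
```

With a 4-word "stack" holding a return address followed by the args 1, 0x1000 and 12, fetching args 0 to 2 succeeds, while arg 3 or an esp near the top of the 32-bit range is rejected instead of being read from wherever it happens to point.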

Linux Kernel systemcall call with an int 0x80

For 64-bit systems the Linux system call ABI is completely different from the i*86 one unless there's a layer of compatibility.
This may help:
http://callumscode.com/blog/3

I also found the syscall source in the eglibc, it looks different indeed:
http://www.eglibc.org/cgi-bin/viewvc.cgi/trunk/libc/sysdeps/unix/sysv/linux/x86_64/syscall.S?view=markup

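So it looks like int $0x80 is not the native system-call interface on x86_64 Linux kernels (it only goes through the 32-bit compatibility layer, if there is one); 64-bit code should use syscall instead.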

Using interrupt 0x80 on 64-bit Linux

Apparently you are writing a 64-bit program that uses the "int 0x80" instruction. However, "int 0x80" only works reliably in 32-bit programs.

The address of the stack is in a range that cannot be accessed by 32-bit programs. Therefore it is quite probable that "int 0x80"-style system calls do not allow accessing this memory area.

To solve this problem there are two possibilities:

Compile as 32-bit application (use 32-bit registers like EAX instead of 64-bit registers like RAX). When you link without using any shared libraries 32-bit programs will work perfectly on 64-bit Linux.
Use "syscall"-style system calls instead of "int 0x80"-style system calls. The use of these differs a lot from "int 0x80"-style ones!

32-bit code:

mov eax,4    ; In "int 0x80" style 4 means: write
mov ebx,1    ; ... and the first arg. is stored in ebx
mov ecx,esp  ; ... and the second arg. is stored in ecx
mov edx,1    ; ... and the third arg. is stored in edx
int 0x80

64-bit code:

mov rax,1    ; In "syscall" style 1 means: write
mov rdi,1    ; ... and the first arg. is stored in rdi (not rbx)
mov rsi,rsp  ; ... and the second arg. is stored in rsi (not rcx)
mov rdx,1    ; ... and the third arg. is stored in rdx
syscall

--- Edit ---

Background information:

"int 0x80" is intended for 32-bit programs. When called from a 64-bit program it behaves the same way it would if it had been called from a 32-bit program (using the 32-bit calling convention).

This also means that the parameters for "int 0x80" will be passed in 32-bit registers and the upper 32 bits of the 64-bit registers are ignored.

(I just tested that on Ubuntu 16.10, 64 bit.)

This however means that you can only access memory below 2^32 (or even below 2^31) when using "int 0x80" because you cannot pass an address above 2^32 in a 32-bit register.

If the data to be written is located at an address below 2^31 you may use "int 0x80" to write the data. If it is located above 2^32 you can't. The stack (RSP) is very likely located above 2^32 so you cannot write data on the stack using "int 0x80".

Because it is very likely that your program will use memory above 2^32 I have written: "int 0x80 does not work with 64-bit programs."

What is better int 0x80 or syscall in 32-bit code on Linux?

syscall is the default way of entering kernel mode on x86-64. This instruction is not available in 32 bit modes of operation on Intel processors.
sysenter is an instruction most frequently used to invoke system calls in 32 bit modes of operation. It is similar to syscall, a bit more difficult to use though, but that is the kernel's concern.
int 0x80 is a legacy way to invoke a system call and should be avoided.

The preferred way to invoke a system call is to use the vDSO, a region of memory mapped into each process's address space that allows system calls to be made more efficiently (for example, by not entering kernel mode at all in some cases). The vDSO also takes care of handling the syscall or sysenter instructions, which is more involved than the legacy int 0x80 way.


NASM programming - `int0x80` versus `int 0x80`

NASM is giving me this warning:

warning: label alone on a line without a colon might be in error

Apparently the typo gets treated as a label and you can reference the new int0x80 label in your program as usual:

segment .text
    global _start
    _start:
        mov eax, 1 ; 1 is the system identifier for sys_exit
        mov ebx, 0 ; exit code
        int0x80 ; interrupt to invoke the system call

        jmp int0x80 ; jump to typo indefinitely

NASM supports labels without colon, I often use that for data declarations:

error_msg   db "Ooops", 0
flag        db 0x80
nullpointer dd 0
