How Does The 64 Bit Linux Kernel Kick Off a 32 Bit Process from an Elf

If the execveat system call is used to start a new process, execution first enters the SYSCALL_DEFINEx(execveat..) function in fs/exec.c in the kernel source.
This one then calls these functions:

do_execveat(..)
- do_execveat_common(..)
  - exec_binprm(..)
    - search_binary_handler(..)

The search_binary_handler iterates over the various binary handlers. In a 64 bit Linux kernel, there will be one handler for 64 bit ELFs and one for 32 bit ELFs. Both handlers are ultimately built from the same source fs/binfmt_elf.c. However, the 32 bit handler is built via fs/compat_binfmt_elf.c which redefines a number of macros before including the source file binfmt_elf.c itself.

Inside binfmt_elf.c, elf_check_arch is called. This is a macro defined in arch/x86/include/asm/elf.h, and it is defined differently for the 64 bit handler vs the 32 bit handler. For 64 bit, it compares the ELF machine type with EM_X86_64 (62, defined in include/uapi/linux/elf-em.h). For 32 bit, it compares with EM_386 (3) or EM_486 (6) (defined in the same file). If the comparison fails, the binary handler gives up, so we end up with only one of the handlers taking care of the ELF parsing and execution, depending on whether the ELF is 64 bit or 32 bit.
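
To make the machine-type comparison concrete, here is a minimal user-space sketch in C. It is not the kernel's code: the function name pick_elf_handler and the HANDLER_* values are made up for illustration, and it ignores x32 and every other check the real handlers perform. It only assumes that e_machine is the 16-bit little-endian field at byte offset 18 of the ELF header, which holds for both ELF32 and ELF64.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values from include/uapi/linux/elf-em.h, as quoted in the text above. */
#define EM_386     3
#define EM_486     6
#define EM_X86_64  62

/* Hypothetical result codes for this sketch (not kernel identifiers). */
enum handler { HANDLER_NONE = 0, HANDLER_ELF32 = 32, HANDLER_ELF64 = 64 };

/*
 * Report which handler would accept the image, based only on the ELF magic
 * and e_machine. e_machine sits at byte offset 18 (little-endian, 2 bytes)
 * in both the 32 bit and the 64 bit header layout.
 */
enum handler pick_elf_handler(const unsigned char *hdr, size_t len)
{
    if (len < 20 || memcmp(hdr, "\177ELF", 4) != 0)
        return HANDLER_NONE;

    uint16_t machine = (uint16_t)(hdr[18] | (hdr[19] << 8));

    if (machine == EM_X86_64)
        return HANDLER_ELF64;          /* the native handler takes it */
    if (machine == EM_386 || machine == EM_486)
        return HANDLER_ELF32;          /* the compat handler takes it */
    return HANDLER_NONE;               /* both handlers give up */
}
```

Feeding it the first 20 bytes of a binary is enough; a header with e_machine = 3 selects the 32 bit path and one with e_machine = 62 selects the 64 bit path.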

All differences in parsing 32 bit ELFs vs 64 bit ELFs in 64 bit Linux should therefore be found in the file fs/compat_binfmt_elf.c.

The main clue seems to be compat_start_thread. start_thread is redefined to compat_start_thread. This function definition is found in arch/x86/kernel/process_64.c. compat_start_thread then calls start_thread_common with these arguments:

start_thread_common(regs, new_ip, new_sp,
             test_thread_flag(TIF_X32)
             ? __USER_CS : __USER32_CS,
             __USER_DS, __USER_DS);

while the normal start_thread function calls start_thread_common with these arguments:

start_thread_common(regs, new_ip, new_sp,
             __USER_CS, __USER_DS, 0);

Here we already see the architecture-dependent code setting CS differently for 64 bit ELFs vs 32 bit ELFs.

Then we have the definitions for __USER_CS and __USER32_CS in arch/x86/include/asm/segment.h:

#define __USER_CS           (GDT_ENTRY_DEFAULT_USER_CS*8 + 3)
#define __USER32_CS         (GDT_ENTRY_DEFAULT_USER32_CS*8 + 3)

and:

#define GDT_ENTRY_DEFAULT_USER_CS   6
#define GDT_ENTRY_DEFAULT_USER32_CS 4

So __USER_CS is 6*8 + 3 = 51 = 0x33

And __USER32_CS is 4*8 + 3 = 35 = 0x23

These numbers match what is used for CS in these examples:

For going from 64 bit mode to 32 bit in the middle of a process
For going from 32 bit mode to 64 bit in the middle of a process

Since the CPU is not running in real mode, the segment register is not filled with the segment itself, but a 16-bit selector:

From Wikipedia (Protected mode):

In protected mode, the segment_part is replaced by a 16-bit selector, in which the 13 upper bits (bit 3 to bit 15) contain the index of an entry inside a descriptor table. The next bit (bit 2) specifies whether the operation is used with the GDT or the LDT. The lowest two bits (bit 1 and bit 0) of the selector are combined to define the privilege of the request, where the values of 0 and 3 represent the highest and the lowest privilege, respectively.

With the CS value 0x23, bits 1 and 0 are 3, meaning "lowest privilege". Bit 2 is 0, meaning GDT, and bits 3 to 15 are 4, meaning we get index 4 from the global descriptor table (GDT).
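
The bit layout quoted above, and the index*8 + 3 arithmetic from segment.h, can be checked with a few helper functions. This is only a sketch; the function names are made up for illustration and do not come from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build a selector from a GDT index and a privilege level (TI bit = 0). */
uint16_t make_selector(unsigned index, unsigned rpl)
{
    return (uint16_t)(index * 8 + rpl);
}

/* Requested privilege level: bits 0 and 1 of the selector. */
unsigned selector_rpl(uint16_t sel)
{
    return sel & 0x3;
}

/* Table indicator: bit 2. false means GDT, true means LDT. */
bool selector_is_ldt(uint16_t sel)
{
    return ((sel >> 2) & 0x1) != 0;
}

/* Index into the descriptor table: bits 3 to 15. */
unsigned selector_index(uint16_t sel)
{
    return sel >> 3;
}
```

With these, make_selector(6, 3) gives 0x33 and make_selector(4, 3) gives 0x23, and decoding 0x23 yields index 4, GDT, privilege level 3, matching the values worked out by hand.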

This is how far I have been able to dig so far.

Is it possible to use both 64 bit and 32 bit instructions in the same executable in 64 bit Linux?

Switching between long mode and compatibility mode is done by changing CS. User mode code cannot modify the descriptor table, but it can perform a far jump or far call to a code segment that is already present in the descriptor table. I think that in Linux (for example) the required compatibility mode descriptor is present.

Here is sample code for Linux (Ubuntu). Build with

$ gcc -no-pie switch_mode.c switch_cs.s

switch_mode.c:

#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>

extern bool switch_cs(int cs, bool (*f)());
extern bool check_mode();

int main(int argc, char **argv)
{
    int cs = 0x23;
    if (argc > 1)
        cs = strtoull(argv[1], 0, 16);
    printf("switch to CS=%02x\n", cs);

    bool r = switch_cs(cs, check_mode);

    if (r)
        printf("cs=%02x: 64-bit mode\n", cs);
    else
        printf("cs=%02x: 32-bit mode\n", cs);

    return 0;
}

switch_cs.s:

        .intel_syntax noprefix
        .code64
        .text
        .globl switch_cs
switch_cs:
        push    rbx
        push    rbp
        mov     rbp, rsp
        sub     rsp, 0x18

        mov     rbx, rsp
        movq    [rbx], offset .L1
        mov     [rbx+4], edi

        // Before the lcall, switch to a stack below 4GB.
        // This assumes that the data segment is below 4GB.
        mov     rsp, offset stack+0xf0
        lcall   [rbx]

        // restore rsp to the original stack
        leave
        pop     rbx
        ret

        .code32
.L1:
        call    esi
        lret

        .code64
        .globl check_mode
// returns false for 32-bit mode; true for 64-bit mode
check_mode:
        xor     eax, eax
        // In 32-bit mode, the REX.W prefix byte (0x48) decodes as a separate
        // instruction, so this is executed as: dec eax; test eax, eax
        test    rax, rax
        setz    al
        ret

        .data
        .align  16
stack:  .space 0x100

Detect if a 32bit process is running in a 64bit environment under Linux

You could test for the presence of /lib64/ld-linux-x86-64.so.2. Theoretically this doesn't always work because it's possible for a Linux system to put the dynamic linker somewhere else, but this particular path is by far the most common, plus the path to the dynamic linker is hardcoded into ELF binaries, so this works at least as well as actually bundling a 64-bit library with your software (provided there's a matching libc, anyway).
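
A minimal C sketch of that check follows. The helper names are made up for illustration, and the path is the one mentioned above, so the same caveat applies: a system that keeps its dynamic linker elsewhere will be misdetected.

```c
#include <stdbool.h>
#include <unistd.h>

/* Path from the text; other systems may place the loader elsewhere. */
#define LOADER_X86_64 "/lib64/ld-linux-x86-64.so.2"

/* True if the path exists, regardless of file type or permissions. */
bool path_exists(const char *path)
{
    return access(path, F_OK) == 0;
}

/* Heuristic only: presence of the 64-bit dynamic linker. */
bool looks_like_64bit_userland(void)
{
    return path_exists(LOADER_X86_64);
}
```

A 32-bit program can call looks_like_64bit_userland() at startup and fall back to 32-bit-only behaviour when it returns false.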

Switch from 32bit mode to 64 bit (long mode) on 64bit linux

Contrary to the other answers, I assert that in principle the short answer is YES. This is likely not supported officially in any way, but it appears to work. At the end of this answer I present a demo.

On Linux-x86_64, a 32 bit (and X32 too, according to GDB sources) process gets CS register equal to 0x23 — a selector of 32-bit ring 3 code segment defined in GDT (its base is 0). And 64 bit processes get another selector: 0x33 — a selector of long mode (i.e. 64 bit) ring 3 code segment (bases for ES, CS, SS, DS are treated unconditionally as zeros in 64 bit mode). Thus if we do far jump, far call or something similar with target segment selector of 0x33, we'll load the corresponding descriptor to the shadow part of CS and will end up in a 64 bit segment.

The demo at the bottom of this answer uses jmp far instruction to jump to 64 bit code. Note that I've chosen a special constant to load into rax, so that for 32 bit code that instruction looks like

dec eax
mov eax, 0xfafafafa
ud2
cli ; these two are unnecessary, but leaving them here for fun :)
hlt

This must fail if we execute it with a 32 bit descriptor in the shadow part of CS (it will raise SIGILL on the ud2 instruction).

Now here's the demo (compile it with fasm).

format ELF executable
segment readable executable

SYS_EXIT_32BIT=1
SYS_EXIT_64BIT=60
SYS_WRITE=4
STDERR=2

entry $
    mov ax,cs
    cmp ax,0x23 ; 32 bit process on 64 bit kernel has this selector in CS
    jne kernelIs32Bit
    jmp 0x33:start64 ; switch to 64-bit segment
start64:
use64
    mov rax, qword 0xf4fa0b0ffafafafa ; would crash inside this if executed as 32 bit code
    xor rdi,rdi
    mov eax, SYS_EXIT_64BIT
    syscall
    ud2

use32
kernelIs32Bit:
    mov edx, msgLen
    mov ecx, msg
    mov ebx, STDERR
    mov eax, SYS_WRITE
    int 0x80
    dec ebx
    mov eax, SYS_EXIT_32BIT
    int 0x80
msg:
    db "Kernel appears to be 32 bit, can't jump to long mode segment",10
msgLen = $-msg

How come a 32 bit kernel can run a 64 bit binary?

The CPU can be switched from 64 bit execution mode to 32 bit when it traps into kernel context, and a 32 bit kernel can still be constructed to understand the structures passed in from 64 bit user-space apps.

The MacOS X kernel does not directly dereference pointers from the user app anyway, as it resides in its own separate address space. A user-space pointer in an ioctl call, for example, must first be resolved to its physical address and then a new virtual address created in the kernel address space. It doesn't really matter whether that pointer in the ioctl was 64 bits or 32 bits; the kernel does not dereference it directly in either case.

So mixing a 32 bit kernel and 64 bit binaries can work, and vice-versa. The thing you cannot do is mix 32 bit libraries with a 64 bit application, as pointers passed between them would be truncated. MacOS X supplies more of its frameworks in both 32 and 64 bit versions in each release.

Inline 64bit Assembly in 32bit GCC C Program

No, this isn't possible. You can't run 64-bit assembly from a 32-bit binary, as the processor will not be in long mode while running your program.

Copying 64-bit code to an executable page will result in that code being interpreted incorrectly as 32-bit code, which will have unpredictable and undesirable results.

Why does ptrace show a 32-bit execve system call having EAX = 59, the 64-bit call number? How do 32-bit system calls work on x86-64?

execve is special; it's the only one that has special interaction with PTRACE_TRACEME. The way strace works, other system calls do show the 32-bit call number. (And modern strace needs special help to know whether that's a 32-bit call number for int 0x80 / sysenter, or a 64-bit call number, since 64-bit processes can still invoke int 0x80, although they normally shouldn't. This support was only added in 2019, with PTRACE_GET_SYSCALL_INFO)

You're right, when the kernel is actually invoked, EAX holds 11, __NR_execve from unistd_32.h. It's set by mov $0xb,%eax before glibc's execve wrapper jumps to the VDSO page to enter the kernel via whatever efficient method is supported on this hardware (normally sysenter.)

But execution doesn't actually stop until it reaches some code in the main execve implementation that checks for PTRACE_TRACEME and raises SIGTRAP.

Apparently sometime before that happens, it calls void set_personality_64bit(void) in arch/x86/kernel/process_64.c, which includes

    /* Pretend that this comes from a 64bit execve */
    task_pt_regs(current)->orig_ax = __NR_execve;

I found that by searching for __NR_execve in a kernel source browser, and looking at the most likely file in arch/x86. I didn't keep cross-referencing to find where that's called from; the fact that it exists (and the assumption of a sane non-obfuscated design) points very strongly to this being the answer to your mystery.

Running 32 bit assembly code on a 64 bit Linux & 64 bit Processor : Explain the anomaly

Remember that everything by default on a 64-bit OS tends to assume 64-bit. You need to make sure that you are (a) using the 32-bit versions of your #includes where appropriate (b) linking with 32-bit libraries and (c) building a 32-bit executable. It would probably help if you showed the contents of your makefile if you have one, or else the commands that you are using to build this example.

FWIW I changed your code slightly (_start -> main):

#include <asm/unistd.h>
#include <syscall.h>
#define STDOUT 1

    .data
hellostr:
    .ascii "hello wolrd\n" ;
helloend:

    .text
    .globl main

main:
    movl $(SYS_write) , %eax  //ssize_t write(int fd, const void *buf, size_t count);
    movl $(STDOUT) , %ebx
    movl $hellostr , %ecx
    movl $(helloend-hellostr) , %edx
    int $0x80

    movl $(SYS_exit), %eax //void _exit(int status);
    xorl %ebx, %ebx
    int $0x80

    ret

and built it like this:

$ gcc -Wall test.S -m32 -o test

verified that we have a 32-bit executable:

$ file test
test: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.4, dynamically linked (uses shared libs), not stripped

and it appears to run OK:

$ ./test
hello wolrd

Why does my x64 process base address not start from 0x400000?

I learned from this link Why is address 0x400000 chosen as a start of text segment in x86_64

That address is used for executables (ELF type ET_EXEC).

I only found my bash process starts from a very high base address (0x55971cea6000). Any one knows why?

Because your bash is a (newer) position-independent executable (ELF type ET_DYN). It behaves much like a shared library, and is relocated to a random address at runtime.

The 0x55971cea6000 address you found will vary from one execution to another. In contrast, ET_EXEC executables can only run correctly when loaded at their "linked at" address (typically 0x400000).

How does the dynamic linker choose the start address for a 64-bit process?

The dynamic linker doesn't choose the start address of the executable -- the kernel does (by the time the dynamic linker starts running, the executable has already been mmaped into memory).

The kernel looks at the .e_type in the ELF header and .p_vaddr field of the first program header and goes from there. IFF .e_type == ET_EXEC, then the kernel maps executable segments at their .p_vaddr addresses. For ET_DYN, if ASLR is in effect, the kernel performs mmaps at a random address.
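
The e_type distinction can be sketched in user space as follows. This is an illustration rather than kernel code: the enum and function names are made up, and it assumes a little-endian ELF header in which e_type is the 16-bit field at byte offset 16 (true for both ELF32 and ELF64 on x86).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* e_type values from the ELF specification. */
#define ET_EXEC 2
#define ET_DYN  3

/* Hypothetical labels for this sketch. */
enum load_policy {
    LOAD_UNKNOWN,      /* not an ELF image, or some other e_type */
    LOAD_FIXED,        /* ET_EXEC: segments go at their p_vaddr */
    LOAD_RELOCATABLE   /* ET_DYN: base address chosen at load time */
};

/* Classify an ELF image by e_type (bytes 16-17 of the header). */
enum load_policy elf_load_policy(const unsigned char *hdr, size_t len)
{
    if (len < 18 || memcmp(hdr, "\177ELF", 4) != 0)
        return LOAD_UNKNOWN;

    uint16_t type = (uint16_t)(hdr[16] | (hdr[17] << 8));

    if (type == ET_EXEC)
        return LOAD_FIXED;
    if (type == ET_DYN)
        return LOAD_RELOCATABLE;
    return LOAD_UNKNOWN;
}
```

A position-independent bash would be classified as LOAD_RELOCATABLE here, while a binary linked with -no-pie would come out as LOAD_FIXED.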

How to limit the address space of 32bit application on 64bit Linux to 3GB?

OK, I found the answer to this question elsewhere.

The solution is to change the "personality" of your program to PER_LINUX32_3GB, using the Linux system call sys_personality.

But there is a problem. After switching to PER_LINUX32_3GB, the Linux kernel will not allocate space in the upper 1GB, but space that is already allocated there, for example the application stack, remains there.

The solution is to "restart" your program through sys_execve system call.

Here is the code where I packed everything in one:

proc ___SwitchLinuxTo3GB
begin
        cmp     esp, $c0000000
        jb      .finish                 ; the system is native 32bit

; check the current personality.

        mov     eax, sys_personality
        mov     ebx, -1
        int     $80

; and exit if it is what intended

        test    eax, ADDR_LIMIT_3GB
        jnz     .finish                         ; everything is OK.

; set the needed personality

        mov     eax, sys_personality
        mov     ebx, PER_LINUX32_3GB
        int     $80

; and restart the process

        mov     eax, [esp+4]          ; argument count
        mov     ebx, [esp+8]          ; the filename of the executable.
        lea     ecx, [esp+8]          ; the arguments list.
        lea     edx, [ecx+4*eax+4]    ; the environment list.

        mov     eax, sys_execve
        int     $80

        ; if something went wrong, execution comes here and stops!
        int3

.finish:
        return
endp
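
For comparison, the "check the current personality" step can be written in C with the glibc wrapper. This is only a sketch of the query, not a replacement for the routine above; the helper names are made up, and it relies on the documented behaviour that personality(0xffffffff) returns the current persona without changing it.

```c
#include <stdbool.h>
#include <sys/personality.h>

/* True if the given persona value has the 3GB address limit flag set. */
bool persona_has_3gb_limit(unsigned long persona)
{
    return (persona & ADDR_LIMIT_3GB) != 0;
}

/* Query the running process; 0xffffffff means "read, do not modify". */
bool current_process_has_3gb_limit(void)
{
    int persona = personality(0xffffffff);

    if (persona == -1)
        return false;              /* query failed; assume no limit */
    return persona_has_3gb_limit((unsigned long)persona);
}
```

If current_process_has_3gb_limit() returns false, a program would call personality(PER_LINUX32_3GB) and then re-execute itself, as the assembly version does.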
