How to Use Both 64 Bit and 32 Bit Instructions in the Same Executable in 64 Bit Linux

Is it possible to use both 64 bit and 32 bit instructions in the same executable in 64 bit Linux?

Switching between long mode and compatibility mode is done by changing CS. User mode code cannot modify the descriptor table, but it can perform a far jump or far call to a code segment that is already present in the descriptor table. I think that in Linux (for example) the required compatibility mode descriptor is present.

Here is sample code for Linux (Ubuntu). Build with

$ gcc -no-pie switch_mode.c switch_cs.s

switch_mode.c:

#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>

extern bool switch_cs(int cs, bool (*f)());
extern bool check_mode();

int main(int argc, char **argv)
{
    int cs = 0x23;
    if (argc > 1)
        cs = strtoull(argv[1], 0, 16);
    printf("switch to CS=%02x\n", cs);

    bool r = switch_cs(cs, check_mode);

    if (r)
        printf("cs=%02x: 64-bit mode\n", cs);
    else
        printf("cs=%02x: 32-bit mode\n", cs);

    return 0;
}

switch_cs.s:

        .intel_syntax noprefix
        .code64
        .text
        .globl switch_cs
switch_cs:
        push    rbx
        push    rbp
        mov     rbp, rsp
        sub     rsp, 0x18

        mov     rbx, rsp
        movq    [rbx], offset .L1
        mov     [rbx+4], edi

        // Before the lcall, switch to a stack below 4GB.
        // This assumes that the data segment is below 4GB.
        mov     rsp, offset stack+0xf0
        lcall   [rbx]

        // restore rsp to the original stack
        leave
        pop     rbx
        ret

        .code32
.L1:
        call    esi
        lret


        .code64
        .globl check_mode
// returns false for 32-bit mode; true for 64-bit mode
check_mode:
        xor     eax, eax
        // In 32-bit mode, this instruction is executed as
        // inc eax; test eax, eax
        test    rax, rax
        setz    al
        ret

        .data
        .align 16
stack:  .space 0x100

How does the 64 bit linux kernel kick off a 32 bit process from an ELF

If the execveat system call is used to start a new process, we first enter the SYSCALL_DEFINEx(execveat..) function in fs/exec.c in the kernel source.
This function then calls the following chain:

  • do_execveat(..)

    • do_execveat_common(..)

      • exec_binprm(..)

        • search_binary_handler(..)

The search_binary_handler iterates over the various binary handlers. In a 64 bit Linux kernel, there will be one handler for 64 bit ELFs and one for 32 bit ELFs. Both handlers are ultimately built from the same source fs/binfmt_elf.c. However, the 32 bit handler is built via fs/compat_binfmt_elf.c which redefines a number of macros before including the source file binfmt_elf.c itself.

Inside binfmt_elf.c, elf_check_arch is called. This is a macro defined in arch/x86/include/asm/elf.h and defined differently in the 64 bit handler vs the 32 bit handler. For 64 bit, it compares with EM_X86_64 (62 - defined in include/uapi/linux/elf-em.h). For 32 bit, it compares with EM_386 (3) or EM_486 (6) (defined in the same file). If the comparison fails, the binary handler gives up, so we end up with only one of the handlers taking care of the ELF parsing and execution - depending on whether the ELF is 64 bit or 32 bit.

All differences on parsing 32 bit ELFs vs 64 bit ELFs in 64 bit Linux should therefore be found in the file fs/compat_binfmt_elf.c.

The main clue seems to be compat_start_thread: start_thread is redefined to compat_start_thread, whose definition is found in arch/x86/kernel/process_64.c. compat_start_thread then calls start_thread_common with these arguments:

    start_thread_common(regs, new_ip, new_sp,
                        test_thread_flag(TIF_X32)
                        ? __USER_CS : __USER32_CS,
                        __USER_DS, __USER_DS);

while the normal start_thread function calls start_thread_common with these arguments:

    start_thread_common(regs, new_ip, new_sp,
                        __USER_CS, __USER_DS, 0);

Here we already see the architecture dependent code doing something with CS differently for 64 bit ELFs vs 32 bit ELFs.

Then we have the definitions for __USER_CS and __USER32_CS in arch/x86/include/asm/segment.h:

#define __USER_CS           (GDT_ENTRY_DEFAULT_USER_CS*8 + 3)
#define __USER32_CS (GDT_ENTRY_DEFAULT_USER32_CS*8 + 3)

and:

#define GDT_ENTRY_DEFAULT_USER_CS   6
#define GDT_ENTRY_DEFAULT_USER32_CS 4

So __USER_CS is 6*8 + 3 = 51 = 0x33

And __USER32_CS is 4*8 + 3 = 35 = 0x23

These numbers match what is used for CS in these examples:

  • For going from 64 bit mode to 32 bit in the middle of a process
  • For going from 32 bit mode to 64 bit in the middle of a process

Since the CPU is not running in real mode, the segment register does not hold a segment base directly, but a 16-bit selector:

From Wikipedia (Protected mode):

In protected mode, the segment_part is replaced by a 16-bit selector, in which the 13 upper bits (bit 3 to bit 15) contain the index of an entry inside a descriptor table. The next bit (bit 2) specifies whether the operation is used with the GDT or the LDT. The lowest two bits (bit 1 and bit 0) of the selector are combined to define the privilege of the request, where the values of 0 and 3 represent the highest and the lowest privilege, respectively.

With the CS value 0x23, bits 1 and 0 are 3, meaning "lowest privilege". Bit 2 is 0, meaning GDT, and bits 3 to 15 are 4, meaning we get index 4 from the global descriptor table (GDT).

This is how far I have been able to dig so far.

Am I guaranteed to not encounter non-64-bit instructions if there are no compatibility mode switches in x86-64?

Every sequence of bytes of machine code either decodes as instructions or raises a #UD illegal-instruction exception. With the CPU in 64-bit mode, that means they're decoded as 64-bit mode instructions if they don't fault. See also Is x86 32-bit assembly code valid x86 64-bit assembly code? (no, not in general).

If it's a normal program emitted by a compiler, it's unlikely there are any illegal instructions in its machine code, unless someone used inline asm, or you used your program to disassemble a non-code section. Or it's an obfuscated program that puts partial instructions ahead of the actual jump target, so simple disassemblers get confused and choose instruction boundaries different from how it will actually run. x86 machine code is a byte stream that is not self-synchronizing.

TL:DR: in a normal program, yes, every sequence of bytes you encounter when disassembling is valid 64-bit-mode instructions.


The 66 and 67 prefix bytes do not switch modes; they merely override the operand size (66) or address size (67) for that one instruction. e.g. in 66 40 90, the 40 byte is still a REX prefix in 64-bit mode (applied to the NOP that follows). So it's just a nop (xchg ax,ax), not decoding as it would in 32-bit mode, as inc ax / xchg eax,eax.

Try assembling and then disassembling db 0x66, 0x40, 0x90 with nasm -felf32 then with nasm -felf64 to see how that same sequence decodes in 64-bit mode, not like it would in 32-bit mode.

Many instruction encodings are the same in both 32 and 64-bit mode, since they share the same default operand-size (for non-stack instructions). e.g. b8 39 30 00 00 mov eax,0x3039 is the code for mov eax, 12345 in either 32 or 64-bit mode.

(When you say "64-bit instruction", I hope you don't mean 64-bit operand-size, because that's not the case. All operand-sizes from 8 to 64-bit are encodeable in 64-bit mode for most instructions.)


And yes, it's safe to assume that user-space programs don't switch modes by doing a far jmp. Unless you're on Windows, where the WOW64 DLLs do that for some reason instead of directly calling into the kernel. (On Linux, 32-bit user-space uses sysenter or another direct system-call mechanism.)

Inline 64bit Assembly in 32bit GCC C Program

No, this isn't possible. You can't run 64-bit assembly from a 32-bit binary, as the processor will not be in long mode while running your program.

Copying 64-bit code to an executable page will result in that code being interpreted incorrectly as 32-bit code, which will have unpredictable and undesirable results.

Run 64 bit assembly code on a 32 bit operating system

The thing you most need to get right here is the processor mode transitions. You need to do some basic work to transition from 32-bit mode into 64-bit mode (also called long mode). The biggest issue is making sure you set up the descriptor table correctly. Some more info is here:
http://www.codeproject.com/Articles/45788/The-Real-Protected-Long-mode-assembly-tutorial-for

Hope this helps.

Can ptrace tell if an x86 system call used the 64-bit or 32-bit ABI?

Interesting, I hadn't realized that there wasn't an obvious smarter way that strace could use to correctly decode int 0x80 from 64-bit processes. (This is being worked on, see this answer for links to a proposed kernel patch to add PTRACE_GET_SYSCALL_INFO to the ptrace API. strace 4.26 already supports it on patched kernels.)

Update: strace now supports per-syscall detection. I don't know which mainline kernel version added the feature; I tested on Arch Linux with kernel version 5.5 and strace version 5.5.

e.g. this NASM source assembled into a static executable:

mov eax, 4
int 0x80
mov eax, 60
syscall

gives this trace: nasm -felf64 foo.asm && ld -o foo foo.o && strace ./foo

execve("./foo", ["./foo"], 0x7ffcdc233180 /* 51 vars */) = 0
strace: [ Process PID=1262249 runs in 32 bit mode. ]
write(0, NULL, 0) = 0
strace: [ Process PID=1262249 runs in 64 bit mode. ]
exit(0) = ?
+++ exited with 0 +++

strace prints a message every time a system call uses a different ABI bitness than previously. Note that the message about runs in 32 bit mode is completely wrong; it's merely using the 32-bit ABI from 64-bit mode. "Mode" has a specific technical meaning for x86-64, and this is not it.


With older kernels

As a workaround, I think you could disassemble the code at RIP and check whether it was the syscall instruction (0F 05) or not, because ptrace does let you read the target process's memory.

But for a security use-case like disallowing some system calls, this would be vulnerable to a race condition: another thread in the syscall process could rewrite the syscall bytes to int 0x80 after they execute, but before you can peek at them with ptrace.


You only need to do that if the process is running in 64-bit mode, otherwise only the 32-bit ABI is available. If it's not, you don't need to check. (The vdso page can potentially use 32-bit mode syscall on AMD CPUs that support it but not sysenter. Not checking in the first place for 32-bit processes avoids this corner case.) I think you're saying you have a reliable way to detect that at least.

(I haven't used the ptrace API directly, just the tools like strace that use it. So I hope this answer makes sense.)

Need Interpretation of 64bit Assembly Instruction As Opposed to 32bit

Probably easiest to just build 32-bit executables so you can follow the book more closely, with gcc -m32. Don't try to port a tutorial to another OS or ISA while you're learning from it, that rarely goes well. (e.g. different calling conventions, not just different sizes.)

And in GDB, use set disassembly-flavor intel to get GAS .intel_syntax noprefix disassembly like your book shows, instead of the default AT&T syntax. For objdump, use objdump -drwC -Mintel. See also How to remove "noise" from GCC/clang assembly output? for more about looking at GCC output.

(See https://stackoverflow.com/tags/att/info vs. https://stackoverflow.com/tags/intel_syntax/info).

Both instructions are a dword store of an immediate 0, to an offset of -4 relative to where the frame pointer is pointing. (This is how it implements i=0 because you compiled with optimization disabled.)

How can I instruct the MSVC compiler to use a 64bit/32bit division instead of the slower 128bit/64bit division?

No current compilers (gcc/clang/ICC/MSVC) will do this optimization from portable ISO C source, even if you let them prove that b < a so the quotient will fit in 32 bits. (For example with GNU C if(b>=a) __builtin_unreachable(); on Godbolt). This is a missed optimization; until that's fixed, you have to work around it with intrinsics or inline asm.

(Or use a GPU or SIMD instead; if you have the same divisor for many elements see https://libdivide.com/ for SIMD to compute a multiplicative inverse once and apply it repeatedly.)


_udiv64 is available starting in Visual Studio 2019 RTM.

In C mode (-TC) it's apparently always defined. In C++ mode, you need to #include <immintrin.h> (or intrin.h), as per the Microsoft docs.

https://godbolt.org/z/vVZ25L (Or on Godbolt.ms because recent MSVC on the main Godbolt site is not working1.)

#include <stdint.h>
#include <immintrin.h> // defines the prototype

// pre-condition: a > b else 64/32-bit division overflows
uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
    uint32_t remainder;
    uint64_t d = ((uint64_t) b) << 32;
    return _udiv64(d, a, &remainder);
}

int main() {
    uint32_t c = ScaledDiv(5, 4);
    return c;
}

_udiv64 will produce a 64b/32b div instruction. The two shifts, left and then right, are a missed optimization.

;; MSVC 19.20 -O2 -TC
a$ = 8
b$ = 16
ScaledDiv PROC ; COMDAT
mov edx, edx
shl rdx, 32 ; 00000020H
mov rax, rdx
shr rdx, 32 ; 00000020H
div ecx
ret 0
ScaledDiv ENDP

main PROC ; COMDAT
xor eax, eax
mov edx, 4
mov ecx, 5
div ecx
ret 0
main ENDP

So we can see that MSVC doesn't do constant-propagation through _udiv64, even though in this case it doesn't overflow and it could have compiled main to just mov eax, 0ccccccccH / ret.


UPDATE #2: https://godbolt.org/z/n3Dyp- adds a solution using the Intel C++ Compiler, but this is less efficient and defeats constant-propagation because it's inline asm.

#include <stdio.h>
#include <stdint.h>

__declspec(regcall, naked) uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
    __asm mov edx, eax
    __asm xor eax, eax
    __asm div ecx
    __asm ret
    // implicit return of EAX is supported by MSVC, and hopefully ICC
    // even when inlining + optimizing
}

int main()
{
    uint32_t a = 3, b = 4, c = ScaledDiv(a, b);
    printf( "(%u << 32) / %u = %u\n", a, b, c);
    uint32_t d = ((uint64_t)a << 32) / b;
    printf( "(%u << 32) / %u = %u\n", a, b, d);
    return c != d;
}

Footnote 1: Matt Godbolt's main site's non-WINE MSVC compilers are temporarily(?) gone. Microsoft runs https://www.godbolt.ms/ to host the recent MSVC compilers on real Windows, and normally the main Godbolt.org site relayed to that for MSVC.

It seems godbolt.ms will generate short links, but not expand them again! Full links are better anyway for their resistance to link-rot.


