What's the Purpose of the Ud2 Opcode in the Linux Kernel

What's the purpose of the UD2 opcode in the Linux kernel?

It's the BUG() macro from include/asm-i386/bug.h.

/*
 * Tell the user there is some problem.
 * The offending file and line are encoded after the "officially
 * undefined" opcode for parsing in the trap handler.
 */

#ifdef CONFIG_DEBUG_BUGVERBOSE
#define BUG()                           \
 __asm__ __volatile__(  "ud2\n"         \
                        "\t.word %c0\n" \
                        "\t.long %c1\n" \
                         : : "i" (__LINE__), "i" (__FILE__))

For example, the ud2 at 0C05AF refers to the file whose name is stored at 0xC0274A86, and to line number 117 (0x75).
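
To make the layout concrete, here is a minimal sketch of how a tool could read the data that follows the ud2. It assumes the verbose BUG() layout quoted above (two opcode bytes, a 16-bit line number, a 32-bit file-name pointer, all little-endian). The struct and function names are hypothetical, not kernel APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type, not kernel code: the decoded trailer of a verbose BUG(). */
struct bug_record {
    uint16_t line;      /* from ".word %c0" (__LINE__) */
    uint32_t file_addr; /* from ".long %c1" (address of the __FILE__ string) */
};

/*
 * Decode the 8 bytes starting at a ud2 emitted by the macro above:
 *   0f 0b        ud2
 *   ll ll        16-bit line number, little-endian
 *   aa aa aa aa  32-bit address of the file name, little-endian
 * Returns 0 on success, -1 if the buffer is too short or does not start with ud2.
 */
int decode_bug_record(const uint8_t *p, size_t len, struct bug_record *out)
{
    if (len < 8 || p[0] != 0x0f || p[1] != 0x0b)
        return -1;
    out->line = (uint16_t)(p[2] | (p[3] << 8));
    out->file_addr = (uint32_t)p[4] | ((uint32_t)p[5] << 8) |
                     ((uint32_t)p[6] << 16) | ((uint32_t)p[7] << 24);
    return 0;
}
```

Under that assumed layout, the bytes 0f 0b 75 00 86 4a 27 c0 decode to line 117 and file-name address 0xC0274A86, which matches the example above.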

Why does Clang generate ud2 opcode on OSX?

I'm not 100% sure about clang, but gcc sometimes inserts ud2 to mark code areas which exhibit undefined behavior and thus are not supposed to be executed. It does give a warning in such cases, however.

So I suspect there are some warnings from the compiler which you are ignoring or suppressing. Try adding -Wall -Werror to the command line.

How is PTRACE_SINGLESTEP implemented?

Yes, there's an architectural single-step flag on x86. Returning from kernel to user-space gives the kernel a chance to set both RIP and RFLAGS at the same time, so it can set the single-step flag for user-space without having it trigger on a kernel instruction.

For some reason, the Trap Flag has its own Wikipedia article! See also Wikipedia's EFLAGS article.

See the x86 tag wiki for links to Intel's architecture manuals which document all of this.
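
As a small illustration of what "setting the single-step flag" means at the bit level: the Trap Flag is bit 8 (mask 0x100) of EFLAGS/RFLAGS. The sketch below only manipulates a saved flags value, the way a kernel might before returning to user-space. The macro and function names are made up for this example.

```c
#include <stdint.h>

/* Trap Flag: bit 8 of EFLAGS/RFLAGS. The macro name is local to this sketch. */
#define SKETCH_FLAGS_TF (UINT64_C(1) << 8)

/* Return the saved flags image with single-stepping requested. */
uint64_t flags_with_single_step(uint64_t rflags)
{
    return rflags | SKETCH_FLAGS_TF;
}

/* Return the saved flags image with single-stepping turned off. */
uint64_t flags_without_single_step(uint64_t rflags)
{
    return rflags & ~SKETCH_FLAGS_TF;
}

/* Non-zero if the saved flags image has the Trap Flag set. */
int flags_single_step_enabled(uint64_t rflags)
{
    return (rflags & SKETCH_FLAGS_TF) != 0;
}
```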

"Perhaps instead you could execute the instruction followed by the 'ud2' opcode to trigger a signal"

Then you'd need code to determine x86 instruction lengths, to know where to set a software breakpoint. And you wouldn't use ud2, you'd use int3 which exists for this purpose.

x86 also has debug registers (dr0..7) which can set hardware breakpoints without modifying the code, or can monitor for access or write to a given data address. (GDB hbreak uses those, as do GDB watchpoints on constant addresses.)

But for jump/call/ret and other instructions that might have a special effect on RIP, you'd need to decode and emulate them to figure out the destination, so you can put an int3 there. A memory-indirect jump using an addressing mode like jmp qword [fs:rax] would require the debugger to know the FS segment base to even know what address it will load a pointer from. (I assume you can get this with ptrace as easily as actual register values, unlike inside the program being debugged, where rdfsbase is a fairly new extension.) So it's possible, as long as your debugger has stopped all other threads so you can't have a TOCTOU race with another thread modifying the jump target pointer between reading it and continuing execution.
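
The easiest part of that decoding job can be sketched in a few lines. The hypothetical helper below handles only the simplest PC-relative branches (jmp rel8, jcc rel8, call rel32, jmp rel32) in 32-bit mode with no prefixes. A real debugger would need a full instruction decoder plus register and memory state for indirect branches.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the destination of a PC-relative x86 branch located at address addr.
 * Handled encodings (32-bit mode, no prefixes):
 *   eb rel8, 70..7f rel8 (jcc), e8 rel32 (call), e9 rel32 (jmp)
 * For a conditional branch the fall-through address is also a possible next
 * instruction, so a debugger would have to cover both or evaluate the flags.
 * Returns 0 and stores the target on success, -1 if the encoding is not handled.
 */
int relative_branch_target(const uint8_t *insn, size_t len,
                           uint32_t addr, uint32_t *target)
{
    if (len >= 2 && (insn[0] == 0xeb || (insn[0] & 0xf0) == 0x70)) {
        int8_t rel = (int8_t)insn[1];               /* sign-extended 8-bit offset */
        *target = addr + 2 + (uint32_t)(int32_t)rel;
        return 0;
    }
    if (len >= 5 && (insn[0] == 0xe8 || insn[0] == 0xe9)) {
        uint32_t rel = (uint32_t)insn[1] | ((uint32_t)insn[2] << 8) |
                       ((uint32_t)insn[3] << 16) | ((uint32_t)insn[4] << 24);
        *target = addr + 5 + rel;                   /* wraps modulo 2^32 */
        return 0;
    }
    return -1;
}
```

The offset is relative to the end of the instruction, which is why the instruction length (2 or 5 here) is added before the displacement.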

Fun fact: not all ISAs have hardware support for PTRACE_SINGLESTEP.

Case in point, the Linux kernel used to emulate it for ARM, but that required an ARM disassembler in the kernel to place a breakpoint at the next instruction, even if a branch target. It was removed in ~2011; now ptrace(PTRACE_SINGLESTEP) returns -ENOSYS on ARM.

They just ripped out all that complexity instead of trying to make it SMP-safe and support every new instruction like Thumb-2 and so on. (http://lists.infradead.org/pipermail/linux-arm-kernel/2011-February/041324.html)

So debuggers have to manually use breakpoints on such ISAs instead of having the kernel do it for them. If that means other threads notice a debug-break opcode in memory temporarily, that's not the kernel's problem. (Normally debuggers like GDB do stop all threads while you're single-stepping.)

And it means debuggers will have to decode branch instructions to figure out where to put the breakpoint, including register-indirect and/or predicated branches.

What is Code in Linux Kernel crash messages?

Code is a hexdump of x86 machine code (presumably 32-bit mode from a legacy 32-bit kernel since it only dumped 32-bit register contents).

The byte marked with <> is where EIP is pointing, so it's the faulting instruction inside ex_handler_fprestore.

Feed it to a disassembler, e.g. https://defuse.ca/online-x86-assembler.htm#disassembly2, or Linux's crashdump decoding script https://elixir.bootlin.com/linux/latest/source/scripts/decodecode
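
If you want to script this yourself, the first step is turning the Code: text into raw bytes and remembering which one was marked. The sketch below assumes the input is just the hex part of the line (two hex digits per byte, separated by whitespace, with the byte at EIP wrapped in <>). The function name is hypothetical, and this is not what scripts/decodecode does internally.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse the hex bytes of an oops "Code:" line (the text after "Code: ").
 * Stores at most max bytes in out and sets *fault_idx to the index of the
 * byte wrapped in <> (or -1 if none was marked).
 * Returns the number of bytes parsed, or -1 on malformed or oversized input.
 */
int parse_code_line(const char *s, uint8_t *out, size_t max, int *fault_idx)
{
    size_t n = 0;

    *fault_idx = -1;
    while (*s) {
        if (isspace((unsigned char)*s)) {
            s++;
            continue;
        }
        int marked = (*s == '<');
        if (marked)
            s++;
        if (!isxdigit((unsigned char)s[0]) || !isxdigit((unsigned char)s[1]))
            return -1;
        if (n == max)
            return -1;
        char hex[3] = { s[0], s[1], '\0' };
        out[n] = (uint8_t)strtoul(hex, NULL, 16);
        s += 2;
        if (marked) {
            if (*s != '>')
                return -1;
            s++;
            *fault_idx = (int)n;
        }
        n++;
    }
    return (int)n;
}
```

For a made-up input such as "00 55 89 e5 <0f> 0b 58", this yields 7 bytes with the marked byte at index 4. The resulting buffer can then be handed to a disassembler, starting at different offsets if the first attempt does not land on an instruction boundary at the marked byte.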

Remember that x86 machine code uses a variable-length encoding that can't be unambiguously decoded backwards. But this is compiler-generated code, so at least we can assume there aren't supposed to be overlapping instructions or static data mixed with code (because there's no benefit to that on x86). If we find the start of a function in compiler-generated code, the rest of the instructions will all be "sane".

The 00 byte looks like part of a previous instruction or padding between functions. Decoding from there would give us add BYTE PTR [ebp-0x77],dl, which is plausible, but the in eax,0x57 after that isn't, for a non-driver function.

Much more likely is that the 0x89 byte is the opcode of a MOV instruction.

If we drop the 00 byte and start from 55 (which is push ebp), we get a normal function body including the stack-frame setup prologue you'd expect if compiled with -Os or -fno-omit-frame-pointer.

In general, you can drop bytes one at a time until you get a sane-looking decoding that at least has an instruction-boundary on the faulting instruction. (But some experience is required for "sane-looking"; disassembly may have gotten in sync by chance after starting wrong. That's not rare for x86 machine code.)

# skipped the 00 byte which would desync decoding
0:  55                      push   ebp
1:  89 e5                   mov    ebp,esp
3:  57                      push   edi
4:  8b 48 04                mov    ecx,DWORD PTR [eax+0x4]      # EAX = 1st function arg, ECX = tmp
7:  8d 44 08 04             lea    eax,[eax+ecx*1+0x4]
b:  89 42 30                mov    DWORD PTR [edx+0x30],eax     # EDX = 2nd function arg
e:  80 3d e7 fb a0 c1 00    cmp    BYTE PTR ds:0xc1a0fbe7,0x0
15: 75 16                   jne    0x2d
17: c6 05 e7 fb a0 c1 01    mov    BYTE PTR ds:0xc1a0fbe7,0x1
1e: 50                      push   eax
1f: 68 b4 38 87 c1          push   0xc18738b4
24: e8 98 ba 00 00          call   0xbac1
29: 0f 0b                   ud2                     ### <=== EIP points here

# stuff right after the ud2 probably isn't real code on this path (but note 0x2d is the target of the jne at 0x15)
2b: 58                      pop    eax
2c: 5a                      pop    edx
2d: 90                      nop
2e: 8d 74 26 00             lea    esi,[esi+eiz*1+0x0]
32: eb                      .byte 0xeb

So this function really ends with a call to a noreturn function with stack args. (32-bit x86 Linux kernels are built with -mregparm=3, so the first 3 args are passed in EAX, EDX, ECX, in that order; the callee here therefore either is not regparm or takes more than 3 args. You can see that this function itself uses EAX and EDX as incoming args: it reads them before writing them.)

But it's not a jmp tailcall for some reason; maybe for exception backtracing it wants this function's stack frame on the stack. (Which might explain the push ebp / mov ebp,esp even if this kernel was built with -fomit-frame-pointer as part of -O2.)

You'd have to look at the C source for ex_handler_fprestore to guess why that might be.

ud2 is an illegal instruction. The compiler (or inline asm?) put it there so it would fault if the function returned. It's a clear sign that this path of execution is supposed to be unreachable, or is marked to intentionally trap as an assert() type of mechanism. (In Linux, look for BUG_ON()).
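
A userspace analogue can show the idea. The sketch below is not the kernel's BUG_ON(). It uses the GCC/Clang builtin __builtin_trap(), which on x86 is typically compiled to ud2, and runs the check in a child process so the parent can observe that the child was killed by a signal. The macro and function names are hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical userspace analogue of BUG_ON(); not the kernel's definition. */
#define MY_BUG_ON(cond) do { if (cond) __builtin_trap(); } while (0)

/* A function that must never see a negative length. */
static int checked_length(int len)
{
    MY_BUG_ON(len < 0);
    return len;
}

/*
 * Run checked_length(len) in a child process.
 * Returns 1 if the child was killed by a signal (the trap fired),
 * 0 if it exited normally, -1 if fork() or waitpid() failed.
 */
int child_trapped(int len)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        checked_length(len);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? 1 : 0;
}
```

In userspace the trap ends up as a fatal signal delivered to the process; in the kernel, the invalid-opcode exception handler decides what to do instead.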

What's the trace path of select() function in kernel source?

select is implemented in fs/select.c, with a compat copy in fs/compat.c (compat_core_sys_select).

The kernel uses poll for waiting on FDs, and that mechanism is used to emulate select.

glibc calls the select system call, whose entry points are defined in:
arch/x86/syscalls/syscall_32.tbl:142 i386 _newselect sys_select compat_sys_select
arch/x86/syscalls/syscall_64.tbl:23 common select sys_select

fs/compat.c:asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp, compat_ulong_t __user *outp, compat_ulong_t __user *exp, struct compat_timeval __user *tvp)

This is the actual implementation.

There is also an older system call number for select that has not been used for ages. The difference is in the number of arguments the select call takes.
Its source is in:
arch/x86/syscalls/syscall_32.tbl:82 i386 select sys_old_select compat_sys_old_select
fs/compat.c:asmlinkage long compat_sys_old_select(struct compat_sel_arg_struct __user *arg)

You can find out more about how the VFS works in Documentation/filesystems/vfs.txt.
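
For orientation, this is roughly what the userspace end of that path looks like: a minimal sketch of a select() call waiting for one descriptor to become readable. The wrapper name is hypothetical, and which system call number the C library actually issues for select() depends on the libc version and architecture.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms milliseconds for fd to become readable.
 * fd must be below FD_SETSIZE, otherwise FD_SET is undefined.
 * Returns select()'s result: 1 if fd is readable, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, long timeout_ms)
{
    fd_set readfds;
    struct timeval tv;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* The first argument is the highest-numbered descriptor plus one. */
    return select(fd + 1, &readfds, NULL, NULL, &tv);
}
```

Calling it on the read end of an empty pipe times out and returns 0; after a byte is written to the other end it returns 1.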

Linux kernel BUG() call does not hang the kernel

BUG() itself is not supposed to hang the box, so the behaviour of your system is OK.

On x86, BUG() eventually executes the ud2 machine instruction, which leads to an "invalid opcode" exception. It is up to the kernel how to handle that: whether to output a message and continue working, or to stop. Different kernels may react in different ways here.
