How to Simulate an IRET on Linux x86_64

What's the difference between IRET, IRETD, and IRETQ?

From this link:

IRET returns from an interrupt (hardware or software) by means of
popping IP (or EIP), CS, and the flags off the stack and then
continuing execution from the new CS:IP.

IRETW pops IP, CS and the flags as 2 bytes each, taking 6 bytes off
the stack in total. IRETD pops EIP as 4 bytes, pops a further 4 bytes
of which the top two are discarded and the bottom two go into CS, and
pops the flags as 4 bytes as well, taking 12 bytes off the stack.

IRET is a shorthand for either IRETW or IRETD, depending on the
default BITS setting at the time.

IRETQ is similar: in 64-bit mode it pops RIP, CS, RFLAGS, RSP, and SS as 8 bytes each, taking 40 bytes off the stack in total.
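For reference, this is the stack frame that IRETQ consumes in 64-bit mode (per the Intel SDM; the code below builds exactly this layout):

;64-bit IRETQ stack frame, lowest address first:
;[rsp+ 0] RIP
;[rsp+ 8] CS      (selector zero-extended to 64 bits)
;[rsp+16] RFLAGS
;[rsp+24] RSP
;[rsp+32] SS      (selector zero-extended to 64 bits)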

IRET error: general protection fault (fffc)

The bug is that RSP has already been changed by the first push, so the value pushed for RSP is not the original stack pointer. (Note also that iret assembles as iretd; iretq must be used in 64-bit mode, as the comment below points out.)

asm volatile(
"mov %%ss, %%ax \n\t"
"push %%rax \n\t"           /* ss */
"push %%rsp \n\t"           /* rsp ### BUG: rsp was already changed by the previous push */
"pushfq \n\t"               /* rflags */
"mov %%cs, %%ax \n\t"
"push %%rax \n\t"           /* cs */
"mov $._restart_code, %%rax \n\t"
"push %%rax \n\t"           /* rip */
"iret \n\t"                 /* ### also wrong: iretq must be used in 64-bit mode */
"._restart_code:"
"nop" :);

So, save RSP before all the push instructions. The corrected code is:

asm volatile(
"mov %%rsp, %%rbx \n\t"     /* save the original rsp first */
"mov %%ss, %%ax \n\t"
"push %%rax \n\t"           /* ss */
"push %%rbx \n\t"           /* original rsp */
"pushfq \n\t"               /* rflags */
"mov %%cs, %%ax \n\t"
"push %%rax \n\t"           /* cs */
"mov $._restart_code, %%rax \n\t"
"push %%rax \n\t"           /* rip */
"iretq \n\t"                /* iretq, not iret, in 64-bit mode */
"._restart_code:"
"nop" : : : "rax", "rbx", "memory");  /* tell the compiler which registers we clobber */


Why does iret from a page fault handler generate interrupt 13 (general protection fault) and error code 0x18?

If the exception is from IRET itself, then most likely IRET is failing to restore one of the saved segment registers because the saved value (8 or 0x18, by the way?) is somehow wrong. It can be wrong because you never (re)initialized the register in protected mode, or your handler set it to a bad value before doing IRET, or something happened to the GDT...

EDIT: From the picture it's apparent that the page fault handler didn't remove the exception error code (the value 4 at the address in ESP) before executing IRET. So IRET interpreted 4 as the new value for EIP, 0x1000018 as the new value for CS, and 0x23 as the new value for EFLAGS, whereas it should have used 0x1000018, 0x23 and 0x3206 for those three registers. Obviously, a data segment selector (which 0x1000018 is interpreted as after truncation to 0x0018) cannot be loaded into CS, and this causes #GP(0x18).
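A minimal sketch of the fix, assuming a 32-bit handler for an exception that pushes an error code (the handler and helper names are hypothetical):

pf_handler:
pusha                       ;save the general registers
;... service the fault (e.g. read CR2, map the page) ...
popa
add esp, 4                  ;discard the error code the CPU pushed
iret                        ;now IRET pops EIP, CS and EFLAGS from the right slots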

How and when is host CPU state saved in the VMCS host-state area?

The CPU never saves the host state.

The VMM (a.k.a. the hypervisor) controls when vmlaunch/vmresume execute and can thus set the host-state area accordingly beforehand.

When a VM-entry fails due to an invalid VMCS, the execution falls through to the next instruction after vmlaunch/vmresume.

When the VM-entry fails due to an invalid guest state, execution resumes from the RIP set in the host-state area (just as if a VM-exit had occurred).

If the CPU were to set the host-state area, the two cases would be indistinguishable.

This is also why the CPU checks the host state area before entering VMX non-root mode (i.e. launching a VM).
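A minimal sketch of how a VMM fills the host-state area right before launching; the field encodings are from the Intel SDM Vol. 3, Appendix B, and vmexit_handler is a hypothetical label:

HOST_RSP EQU 6c14h          ;VMCS-field encoding of the host RSP field
HOST_RIP EQU 6c16h          ;VMCS-field encoding of the host RIP field

mov rax, HOST_RSP
vmwrite rax, rsp            ;stack the CPU will load on VM-exit
mov rax, HOST_RIP
lea rbx, [rel vmexit_handler]
vmwrite rax, rbx            ;where execution resumes on VM-exit
vmlaunch
;falls through to here only if VM-entry itself failed (e.g. invalid VMCS)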

x86/x64: modifying TSS fields

The processor uses the TSS to store the current context and load the next-to-be-scheduled context during task switching.

Changing the TSS structure won't affect any context until the CPU switches to that TSS.

The CPU performs a task switch in the cases listed in the Intel SDM:

Software or the processor can dispatch a task for execution in one of the following ways:

• An explicit call to a task with the CALL instruction.

• An explicit jump to a task with the JMP instruction.

• An implicit call (by the processor) to an interrupt-handler task.

• An implicit call to an exception-handler task.

• A return (initiated with an IRET instruction) when the NT flag in the EFLAGS register is set.

You can read about the TSS in Chapter 7 of the Intel SDM, Volume 3.
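For example, the first two cases are plain far transfers whose selector refers to an available TSS descriptor (TSS2_SEL is a hypothetical selector for a second TSS in the GDT):

TSS2_SEL EQU 38h            ;hypothetical selector of a second, available TSS

jmp TSS2_SEL:0              ;far jump to a TSS selector performs a hardware
                            ;task switch; the offset is ignored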


ltr doesn't perform a switch. From the Intel SDM, Volume 2:

After the segment selector is loaded in the task register, the processor uses the segment selector to locate the segment descriptor for the TSS in the global descriptor table (GDT). It then loads the segment limit and base address for the TSS from the segment descriptor into the task register. The task pointed to by the task register is marked busy, but a switch to the task does not occur.


EDIT: I've actually tested whether the CPU caches the static values from the TSS.

The test consisted of a boot program (attached) that:

  • Creates a GDT with two code segments with DPL 0 and 3, two data segments with DPL 0 and 3, a TSS, and a call gate with DPL 3 to the code segment with DPL 0.
  • Switches to protected mode, sets the value of ESP0 in the TSS to v1, and loads tr.
  • Returns to the code segment with DPL 3, changes the value of ESP0 to v2, and calls the call gate.
  • Checks whether ESP is v1-10h or v2-10h, and prints 1 or 2 respectively (or 0 if for some reason neither matches).

On my Haswell and on Bochs the result is 2, meaning that the CPU read the TSS from the memory (hierarchy) when needed.

Though a test on one model cannot be generalised to the whole ISA, it is unlikely that other implementations behave differently.


BITS 16

xor ax, ax ;Most EFI CSMs need the first instruction to be this

;But I like my offset to be close to 0, not 7c00h

jmp 7c0h : WORD __START__

__START__:

cli

;Set up the segments to 7c0h

mov ax, cs
mov ss, ax
xor sp, sp
mov ds, ax

;Switch to PM

lgdt [GDT]

mov eax, cr0
or ax, 1
mov cr0, eax

;Set CS

jmp CS_DPL0 : WORD __PM__ + 7c00h

__PM__:

BITS 32

;Set segments

mov ax, DS_DPL0
mov ss, ax
mov ds, ax
mov es, ax

mov esp, ESP_VALUE0

;Make a minimal TSS BEFORE loading TR

mov eax, DS_DPL0
mov DWORD [TSS_BASE + TSS_SS0], eax
mov DWORD [TSS_BASE + TSS_ESP0], ESP_VALUE1

;Load TSS in TR

mov ax, TSS_SEL
ltr ax

;Go to CPL = 3

push DWORD DS_DPL3 | RPL_3
push DWORD ESP_VALUE0
push DWORD CS_DPL3 | RPL_3
push DWORD __PMCPL3__ + 7c00h
retf

__PMCPL3__:

;UPDATE ESP IN TSS

mov ax, DS_DPL3 | RPL_3
mov ds, ax

mov DWORD [TSS_BASE + TSS_ESP0], ESP_VALUE2

;SWITCH STACK

call CALL_GATE : 0

jmp $

__PMCG__:

mov eax, esp

mov bx, 0900h | '1'
cmp eax, ESP_VALUE1 - 10h
je __write

mov bl, '2'
cmp eax, ESP_VALUE2 - 10h
je __write

mov bl, '0'

__write:

mov WORD [0b8000h + 80*5*2], bx

cli
hlt

GDT dw 37h ;GDT limit: 7 descriptors * 8 bytes - 1
dd GDT + 7c00h ;GDT symbol is relative to 0 for the assembler
;We translate it to linear

dw 0 ;Pad to 8 bytes so this GDTR image doubles as the null descriptor

;Index 1 (Selector 08h)
;TSS starting at 8000h and with length = 64KiB

dw 0ffffh
dw TSS_BASE
dd 0000e900h

;Index 2 (Selector 10h)
;Code segment with DPL=3

dd 0000ffffh, 00cffa00h

;Index 3 (Selector 18h)
;Data segment with DPL=3

dd 0000ffffh, 00cff200h

;Index 4 (Selector 20h)
;Code segment with DPL=0

dd 0000ffffh, 00cf9a00h

;Index 5 (Selector 28h)
;Data segment with DPL=0

dd 0000ffffh, 00cf9200h

;Index 6 (Selector 30h)
;Call gate with DPL = 3 for SEL=20

dw __PMCG__ + 7c00h
dw CS_DPL0
dd 0000ec00h

;Fake partition table entry

TIMES 446-($-$$) db 0

db 80h, 0,0,0, 07h

TIMES 510-($-$$) db 0
dw 0aa55h

TSS_BASE EQU 8000h
TSS_ESP0 EQU 4
TSS_SS0 EQU 8

ESP_VALUE0 EQU 7c00h
ESP_VALUE1 EQU 6000h
ESP_VALUE2 EQU 7000h

CS_DPL0 EQU 20h
CS_DPL3 EQU 10h
DS_DPL0 EQU 28h
DS_DPL3 EQU 18h
TSS_SEL EQU 08h
CALL_GATE EQU 30h

RPL_3 EQU 03h

Do x86 instructions require their own encoding as well as all of their arguments to be present in memory at the same time?

Yes: the pages containing the machine code and all memory operands must be present at the same time.

Shouldn't the CPU access the memory pages sequentially, i.e. first read the instruction and then access the memory operand?

Yes that's logically what happens, but a page-fault exception interrupts that 2-step process and discards any progress. The CPU doesn't have any way to remember what instruction it was in the middle of when a page-fault occurred.

When a page-fault handler returns after handling a valid page fault, RIP = the address of the faulting instruction, so the CPU retries executing it from scratch.

It would be legal for the OS to modify the machine code of the faulting instruction and expect it to execute a different instruction after iret from the page-fault handler (or any other exception or interrupt handler). So AFAIK it's architecturally required that the CPU redoes code-fetch from CS:RIP in the case you're talking about. (Assuming it even does return to the faulting CS:RIP instead of scheduling another process while waiting for disk on hard page fault, or delivering a SIGSEGV to a signal handler on an invalid page fault.)

It's probably also architecturally required for hypervisor entry/exit. And even if it's not explicitly forbidden on paper, it's not how CPUs work.

@torek comments that some (CISC) microprocessors partially decode instructions and dump micro-register state on a page fault, but x86 is not like that.


A few instructions are interruptible and can make partial progress, like rep movs (memcpy in a can) and other string instructions, or gather loads/scatter stores. But the only mechanism is updating architectural registers like RCX / RSI / RDI for string ops, or the destination and mask registers for gathers (e.g. manual for AVX2 vpgatherdd). Not keeping the opcode / decode results in some hidden internal register and restarting it after iret from a page fault handler. These are instructions that do multiple separate data accesses.
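A sketch of what that looks like in practice; the point is that all of the resume state is architectural, not hidden:

mov rcx, 1000h              ;byte count for the copy
rep movsb                   ;copy RCX bytes from [rsi] to [rdi]; if an interrupt
                            ;or page fault arrives mid-copy, RCX/RSI/RDI are left
                            ;describing the remaining work, and after iret the
                            ;instruction simply continues from those registers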

Also keep in mind that x86 (like most ISAs) guarantees that instructions are atomic with respect to interrupts / exceptions: they either fully happen, or don't happen at all, before an interrupt (see: Interrupting an assembly instruction while it is operating). So for example add [mem], reg would be required to discard the load if the store part faulted, even without a lock prefix.


The worst case number of guest user-space pages that must be present to make forward progress might be 6 (plus separate guest-kernel page-table subtrees for each one):

  • movsq or movsw 2-byte instruction spanning a page boundary, so both pages are needed for it to decode.
  • qword source operand [rsi] also a page-split
  • qword destination operand [rdi] also a page-split

If any of these 6 pages fault, we're back to square one.

rep movsd is also a 2-byte instruction, and making progress on one step of it would have the same requirement. Similar cases like push [mem] or pop [mem] could be constructed with a misaligned stack.
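A contrived sketch of how one movsq could be placed so that all 6 pages are needed at once (the placement is the only point here):

align 4096
times 4095 nop              ;pad so the next instruction starts 1 byte before a
movsq                       ;page boundary: its 2 bytes (48 A5) span 2 code pages.
                            ;With RSI and RDI each pointing at, e.g., page_base
                            ;+ 0ffch, the source and destination qwords straddle
                            ;page boundaries too: 6 pages for one instruction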

One of the reasons (or side benefits) for/of making gather loads / scatter stores "interruptible" (updating the mask vector with their progress) is to avoid increasing this minimum footprint to execute a single instruction. Also to improve efficiency of handling multiple faults during one gather or scatter.


@Brandon points out in comments that a guest will need its page tables in memory, and the user-space page splits can also be 1GiB splits so the two sides are in different sub-trees of the top level PML4. HW page walk will need to touch all of these guest page-table pages to make progress. A situation this pathological is unlikely to happen by chance.

The TLB (and page-walker internals) are allowed to cache some of the page-table data, and aren't required to restart page-walk from scratch unless the OS did invlpg or set a new CR3 top-level page directory. Neither of these are necessary when changing a page from not-present to present; x86 on paper guarantees that it's not needed (so "negative caching" of not-present PTEs isn't allowed, at least not visible to software). So the CPU might not VMexit even if some of the guest-physical page-table pages are not actually present.
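For example (a sketch; the address is hypothetical): changing an existing translation requires an explicit flush, while a not-present -> present transition does not:

mov rax, 7fff0000h          ;hypothetical linear address whose PTE was changed
invlpg [rax]                ;flush that stale translation from the TLB
                            ;(not-present -> present needs no invlpg, because
                            ;negative caching of not-present PTEs is forbidden)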

PMU performance counters can be enabled and configured such that executing the instruction also requires a perf event to write a sample into a PEBS buffer. With a counter's mask configured to count only user-space instructions, not kernel, it could well be that the CPU keeps trying to overflow the counter and store a sample in the buffer every time you return to user space, producing a page fault each time.


