What Register State Is Saved on a Context Switch in Linux

What register state is saved on a context switch in Linux?

Since no one seems to have answered this, let me venture.

Take a look at the math_state_restore and __unlazy_fpu functions.

You can find them here:

http://www.cs.fsu.edu/~baker/devices/lxr/http/ident?i=math_state_restore
http://www.cs.fsu.edu/~baker/devices/lxr/http/ident?i=__unlazy_fpu

x86-family processors have separate instructions for saving (fnsave) and restoring (frstor) FPU state, so it looks like the OS is responsible for saving and restoring it.

I presume that unless the user-mode process has used the FPU, the Linux context switch will not save the FPU state for you.

So you need to do it yourself (in your driver) to be sure. You can use kernel_fpu_begin/kernel_fpu_end to do it in your driver, but it is generally not a good idea.

http://www.cs.fsu.edu/~baker/devices/lxr/http/ident?i=kernel_fpu_begin
http://www.cs.fsu.edu/~baker/devices/lxr/http/ident?i=kernel_fpu_end

Why is it not a good idea? From Linus himself: http://lkml.indiana.edu/hypermail/linux/kernel/0405.3/1620.html

Quoted:

You can do it "safely" on x86 using

    kernel_fpu_begin();
    ...
    kernel_fpu_end();

and make sure that all the FP stuff is in between those two things, and that you don't do anything that might fault or sleep.

The kernel_fpu_xxx() macros make sure that preemption is turned off etc, so the above should always be safe.

Even then, of course, using FP in the kernel assumes that you actually have an FPU, of course. The in-kernel FP emulation package is not supposed to work with kernel FP instructions.

Oh, and since the kernel doesn't link with libc, you can't use anything even remotely fancy. It all has to be stuff that gcc can do in-line, without any function calls.

In other words: the rule is that you really shouldn't use FP in the kernel. There are ways to do it, but they tend to be for some real special cases, notably for doing MMX/XMM work. Ie the only "proper" FPU user is actually the RAID checksumming MMX stuff.

Linus

In any case, do you really want to rely on Intel's floating point unit? http://en.wikipedia.org/wiki/Pentium_FDIV_bug (just kidding :-)).
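To make the "only save it if it was used" idea from this answer concrete, here is a minimal user-space toy model in C. It is not kernel code: the types fpu_regs, fpu_task and fpu_cpu and the toy_ functions are all hypothetical names made up for illustration, loosely mirroring the roles of __unlazy_fpu (save on switch-out if the task touched the FPU) and math_state_restore (reload on first FP use after a switch). It assumes a "task switched" flag that makes the next FP access trap.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FP_REGS 8

/* Hypothetical toy types; none of these exist in the kernel. */
struct fpu_regs { double st[TOY_FP_REGS]; };

struct fpu_task {
    bool fpu_live;            /* task has touched the FPU since it was switched in */
    struct fpu_regs saved;    /* per-task save area */
};

struct fpu_cpu {
    struct fpu_regs fpu;      /* the live FPU registers */
    bool ts;                  /* "task switched": next FP access must trap */
    int saves;                /* how many times state was written out */
    int restores;             /* how many times state was reloaded */
};

/* Switch-out path: save FPU state only if the outgoing task used it. */
static void toy_unlazy_fpu(struct fpu_cpu *c, struct fpu_task *prev)
{
    if (prev->fpu_live) {
        prev->saved = c->fpu;         /* stands in for fnsave */
        prev->fpu_live = false;
        c->saves++;
    }
    c->ts = true;                     /* force a trap on the next FP access */
}

/* Trap path: reload the current task's saved state on first FP use. */
static void toy_math_state_restore(struct fpu_cpu *c, struct fpu_task *cur)
{
    c->fpu = cur->saved;              /* stands in for frstor */
    cur->fpu_live = true;
    c->ts = false;
    c->restores++;
}

/* A task writing an FP register; traps first if the flag is set. */
static void toy_fp_store(struct fpu_cpu *c, struct fpu_task *cur, int reg, double v)
{
    assert(reg >= 0 && reg < TOY_FP_REGS);
    if (c->ts)
        toy_math_state_restore(c, cur);
    c->fpu.st[reg] = v;
}

/* A task reading an FP register; traps first if the flag is set. */
static double toy_fp_load(struct fpu_cpu *c, struct fpu_task *cur, int reg)
{
    assert(reg >= 0 && reg < TOY_FP_REGS);
    if (c->ts)
        toy_math_state_restore(c, cur);
    return c->fpu.st[reg];
}
```

In this model a task that never executes an FP instruction never causes a save or a restore, which is the behaviour presumed above. Whether a given kernel version actually works this way is something to check against its source.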

How can a CPU save its register state in a context switch?

Here is an example of how it can be done:

1. Save one register to the stack.

2. Load that register with the address of the PCB.

3. Save all the state in the PCB, including retrieving the register value saved on the stack.
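The three steps above can be sketched as a small C simulation. This is only a toy model under stated assumptions: pcb_cpu, pcb_block and pcb_save_context are hypothetical names, the "address" of the PCB is modelled as an index into a table, and a real implementation would be written in assembly for a specific CPU.

```c
#include <assert.h>
#include <stdint.h>

#define PCB_NREGS 4
#define PCB_STACK_WORDS 16

/* Hypothetical PCB layout: one slot per general-purpose register. */
struct pcb_block { uint32_t r[PCB_NREGS]; };

/* Hypothetical CPU: a few registers plus a small stack. */
struct pcb_cpu {
    uint32_t r[PCB_NREGS];
    uint32_t stack[PCB_STACK_WORDS];
    int sp;                              /* index of the next free stack slot */
};

static void pcb_save_context(struct pcb_cpu *c, struct pcb_block *table,
                             uint32_t pcb_index)
{
    /* 1. Save one register (r0) to the stack to free it up. */
    assert(c->sp < PCB_STACK_WORDS);
    c->stack[c->sp++] = c->r[0];

    /* 2. Load that register with the "address" of the PCB. */
    c->r[0] = pcb_index;

    /* 3. Save the remaining state through r0, then retrieve the
     *    original r0 value from the stack and store it as well. */
    struct pcb_block *pcb = &table[c->r[0]];
    for (int i = 1; i < PCB_NREGS; i++)
        pcb->r[i] = c->r[i];
    pcb->r[0] = c->stack[--c->sp];
}
```

The point of step 1 is that storing to the PCB needs a register to hold its address, and every register already holds state that must not be lost.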

How is current context saved during context switching?

Hardware saves the user-space program counter on the kernel stack as part of how exceptions and interrupts work on x86. (For the syscall entry point, the user-space RIP is instead left in RCX, and software has to store it manually into the PCB.)

The rest of the user-space context is saved on that task's kernel stack by software after entering the kernel. The context switch swaps kernel context, including the kernel stack pointer, so that it points at the new task's stack. The eventual return to user space therefore restores the new task's user-space state.

What memory state does the kernel have to save between context switches?

On a processor with a memory management unit (MMU), the main thing that happens during a task switch with regard to memory is telling the MMU to use a different virtual address space. This assumes the destination task has a different address space, i.e. it is a different process rather than a different thread in the same process. For example, on 32-bit x86, control register 3 (CR3) contains the physical address of the page directory. During a task switch to a different process, the appropriate value for the destination process is loaded into CR3, so all further virtual memory accesses use the new address space.
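As a rough illustration of that rule, here is a toy C model. The names toy_mm, toy_task, toy_mmu and toy_switch_mm are hypothetical, and the cr3 field is just a variable standing in for the real control register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address-space descriptor: where the page directory lives. */
struct toy_mm { uint32_t pgd_phys; };

/* Threads of the same process point at the same toy_mm. */
struct toy_task { struct toy_mm *mm; };

/* Stand-in for the MMU: the current CR3 value and a reload counter. */
struct toy_mmu { uint32_t cr3; int cr3_loads; };

static void toy_switch_mm(struct toy_mmu *mmu,
                          const struct toy_task *prev,
                          const struct toy_task *next)
{
    assert(next->mm != NULL);
    if (prev->mm == next->mm)
        return;                       /* same address space: nothing to do */
    mmu->cr3 = next->mm->pgd_phys;    /* stands in for loading CR3 */
    mmu->cr3_loads++;
}
```

Note that the model never writes the outgoing CR3 value anywhere: each process's page-directory address is already recorded in its own descriptor.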

Memory itself is typically not saved in any manner, as the MMU is used to protect pages so that one process cannot in general access another process's memory. I say "in general" because there are a few cases where this is allowed, but they are not relevant to this question.

In a memory-constrained system, pages of memory can be written to secondary storage (e.g. a hard drive) when more virtual memory is needed and there is no free physical memory left. This is usually not done during the context switch (although it could be) and is instead done on demand. Either way, storing memory to secondary storage is not an essential part of the context switch.

So, in conclusion, with regard to the following statement:

The kernel records the current state of the CPU and memory, which will be essential to resuming the process that was just interrupted.

If by "memory" the statement is referring to the MMU state that controls the virtual-to-physical address translation, then yes, during a context switch this could be saved. I say "could be" because the CR3 value for a process usually does not change, so there is no reason to save it: it is already known. If the statement is referring to the actual memory being used by the task, then it is wrong.

And finally, with regard to the following statement:

The kernel prepares the memory for this new process, and then prepares the CPU.

Again, if "prepares the memory for the new process" means setting up the MMU to use the new virtual-to-physical address translation, then yes, that does happen during a context switch. If it means loading memory from somewhere for the new process, this does not need to happen, thanks to the MMU.

What is saved in a context switch?

This is a rather complex question, since the answer depends on many things:

The CPU in question.
- It can vary significantly even within the same family, for example through the additional registers added for SSE/MMX operations.
The operating system, since it controls the handlers that run on a context switch and decides whether the CPU's hardware assistance for context switching (if any) is used.
- For example, Windows does not use the Intel hardware that can do much of the context-switch storage for you, since that hardware does not store the floating-point registers.
Any optimizations enabled by a program that is aware of its own requirements and can inform the OS of them.
- Perhaps to indicate that it isn't using the FP registers, so the OS need not bother with them.
- In architectures with sizeable register files, like most RISC designs, there is considerable benefit in knowing that only a smaller subset of these registers is needed.

At a minimum, the in-use general-purpose registers and the program counter will need to be saved (assuming the common design of most current CISC/RISC-style general-purpose CPUs).

Note that doing only the minimal amount of work on a context switch is a topic of some academic interest.

Linux has more information publicly available on this, though my references may be a little out of date.

There is a ‘task_struct’ which contains a large number of fields relating to the task state as well as the process that the task is for.

One of these is the ‘thread_struct’:

    /* CPU-specific state of this task */
    struct thread_struct thread;

It holds information about cached TLS descriptors, debugging registers, fault info, floating point, virtual-8086 mode and IO permissions.

Each architecture defines its own thread_struct, which identifies the registers and other values saved on a switch.
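To give a feel for the shape of these structures, here is a heavily simplified sketch in C. Every type and field name below (sketch_thread_struct, sketch_task_struct, sketch_init_task and so on) is hypothetical and chosen only to echo the categories listed above. The real definitions are much larger and differ between architectures and kernel versions, so treat this as an illustration rather than a copy of the kernel's layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an x86 thread_struct. */
struct sketch_thread_struct {
    uint64_t tls[3];          /* cached TLS descriptors */
    uint64_t sp;              /* saved kernel stack pointer */
    uint64_t debugreg[8];     /* debugging registers */
    uint64_t fault_addr;      /* fault info: faulting address */
    uint64_t error_code;      /* fault info: error code */
    uint8_t  fpu[512];        /* floating-point save area (FXSAVE-sized) */
};

/* Hypothetical, simplified stand-in for task_struct. */
struct sketch_task_struct {
    int  pid;
    long state;
    /* CPU-specific state of this task */
    struct sketch_thread_struct thread;
};

/* Prepare a new task: clear all CPU state and point sp at its kernel stack. */
static void sketch_init_task(struct sketch_task_struct *t, int pid,
                             uint64_t kstack_top)
{
    assert(t != NULL);
    memset(t, 0, sizeof *t);
    t->pid = pid;
    t->thread.sp = kstack_top;
}
```

The design point is that the architecture-independent task structure embeds one opaque, architecture-specific member, so generic scheduler code never needs to know which registers exist.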

This is further complicated by the presence of rename registers, which allow multiple in-flight instructions (via superscalar or pipelined designs). The restore phase of a context switch will likely rely on the CPU's pipeline being restored in an initially empty state, such that instructions which had not yet been retired have no effect and can be ignored. This makes the design of the CPU that much harder.

The difference between a process switch and a thread switch is that the process switch (which always implies a thread switch in all mainstream operating systems) also needs to update memory translation information, IO-related information and permission-related structures.

These will mainly be pointers to richer data structures, so they will not add significant cost relative to the thread context switch.

What happens during a context switch between two processes in Linux?

The specifics of a context switch depend on the underlying hardware. However, context switches are basically the same across different systems.

Your mistake is in the statement "i understand that all the current cpu registers are pushed into stack before loading process p2". The registers are stored in an area of memory that is usually called the PROCESS CONTEXT BLOCK (or PCB), whose structure is defined by the processor. Most processors have instructions for loading and saving the process context (i.e., its registers) into this structure. In the case of Intel, this can require multiple instructions saving to multiple blocks because of all the different register sets (e.g. FPU, MMX).

The outgoing process does not have to be written to disk. It may be paged out if the system needs more memory, but it could just as well stay entirely in memory, ready to execute.

A context switch is simply the exchange of one process's saved register values for another's.

Context switch internals

At a high level, there are two separate mechanisms to understand. The first is the kernel entry/exit mechanism: this switches a single running thread from running usermode code to running kernel code in the context of that thread, and back again. The second is the context switch mechanism itself, which switches in kernel mode from running in the context of one thread to another.

So, when Thread A calls sched_yield() and is replaced by Thread B, what happens is:

Thread A enters the kernel, changing from user mode to kernel mode;
Thread A in the kernel context-switches to Thread B in the kernel;
Thread B exits the kernel, changing from kernel mode back to user mode.

Each user thread has both a user-mode stack and a kernel-mode stack. When a thread enters the kernel, the current values of the user-mode stack pointer (SS:ESP), flags (EFLAGS) and instruction pointer (CS:EIP) are saved to the thread's kernel-mode stack, and the CPU switches to the kernel-mode stack - with the int $80 syscall mechanism, this is done by the CPU itself. The remaining register values are then also saved to the kernel stack by the kernel's entry code.

When a thread returns from the kernel to user mode, the register values are popped from the kernel-mode stack, then the flags, user-mode stack pointer and instruction pointer are restored from the saved values on the kernel-mode stack.
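The entry and exit paths just described can be modelled in a few lines of C. This is a toy simulation, not kernel code: entry_regs, entry_kstack and the functions below are hypothetical names, and only four general-purpose registers are modelled. The hardware part assumes the int-style entry from user mode, where the CPU itself pushes SS, ESP, EFLAGS, CS and EIP in that order.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_STACK_WORDS 16

/* Hypothetical user-mode register snapshot. */
struct entry_regs {
    uint32_t eax, ebx, ecx, edx;      /* saved by software */
    uint32_t eip, cs, eflags, esp, ss; /* saved by the CPU on entry */
};

/* Hypothetical per-thread kernel stack. */
struct entry_kstack { uint32_t slot[ENTRY_STACK_WORDS]; int depth; };

static void entry_push(struct entry_kstack *k, uint32_t v)
{
    assert(k->depth < ENTRY_STACK_WORDS);
    k->slot[k->depth++] = v;
}

static uint32_t entry_pop(struct entry_kstack *k)
{
    assert(k->depth > 0);
    return k->slot[--k->depth];
}

/* User mode -> kernel mode. */
static void entry_kernel(struct entry_kstack *k, const struct entry_regs *u)
{
    /* Pushed by the CPU as part of the interrupt/trap entry. */
    entry_push(k, u->ss);
    entry_push(k, u->esp);
    entry_push(k, u->eflags);
    entry_push(k, u->cs);
    entry_push(k, u->eip);
    /* Pushed by the kernel's entry code. */
    entry_push(k, u->eax);
    entry_push(k, u->ebx);
    entry_push(k, u->ecx);
    entry_push(k, u->edx);
}

/* Kernel mode -> user mode: undo everything in reverse order. */
static void exit_kernel(struct entry_kstack *k, struct entry_regs *u)
{
    /* Popped by the kernel's exit code. */
    u->edx = entry_pop(k);
    u->ecx = entry_pop(k);
    u->ebx = entry_pop(k);
    u->eax = entry_pop(k);
    /* Restored by the CPU on iret. */
    u->eip    = entry_pop(k);
    u->cs     = entry_pop(k);
    u->eflags = entry_pop(k);
    u->esp    = entry_pop(k);
    u->ss     = entry_pop(k);
}
```

Because the whole user-mode state sits on the thread's own kernel stack between these two calls, anything that later swaps kernel stacks implicitly swaps which user-mode state will be restored.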

When a thread context-switches, it calls into the scheduler (the scheduler does not run as a separate thread - it always runs in the context of the current thread). The scheduler code selects a thread to run next and calls the switch_to() function. This function essentially just switches the kernel stacks - it saves the current value of the stack pointer into the TCB for the current thread (called struct task_struct in Linux), and loads a previously saved stack pointer from the TCB for the next thread. At this point it also saves and restores some other thread state that isn't usually used by the kernel - things like floating point/SSE registers. If the threads being switched don't share the same virtual memory space (i.e. they're in different processes), the page tables are also switched.

So you can see that the core user-mode state of a thread isn't saved and restored at context-switch time - it's saved and restored to the thread's kernel stack when you enter and leave the kernel. The context-switch code doesn't have to worry about clobbering the user-mode register values - those are already safely saved away in the kernel stack by that point.
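A user-space analogue of "switching stacks is the context switch" can be tried with the POSIX ucontext functions (getcontext, makecontext, swapcontext) as provided by glibc on Linux. This is not how the kernel's switch_to() is implemented, and swapcontext also saves things such as the signal mask that switch_to() does not deal with. The sketch only demonstrates the idea that saving one context and loading another resumes the other flow of control exactly where it left off. The names uctx_a, uctx_b, thread_b and run_demo are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

static ucontext_t uctx_a, uctx_b;      /* saved contexts (registers + stack pointer) */
static char stack_b[64 * 1024];        /* B runs on its own stack */
static char trace[8];                  /* records the order of execution */
static int trace_len;

static void log_step(char c)
{
    assert(trace_len < (int)sizeof trace - 1);
    trace[trace_len++] = c;
    trace[trace_len] = '\0';
}

/* Second flow of control; runs on stack_b. */
static void thread_b(void)
{
    log_step('b');
    swapcontext(&uctx_b, &uctx_a);     /* save B, resume A */
    log_step('d');
    /* Returning ends B; uc_link makes execution continue in uctx_a. */
}

/* Runs the demo and returns the recorded trace (NULL on setup failure). */
static const char *run_demo(void)
{
    trace_len = 0;
    trace[0] = '\0';

    if (getcontext(&uctx_b) != 0)
        return NULL;
    uctx_b.uc_stack.ss_sp = stack_b;
    uctx_b.uc_stack.ss_size = sizeof stack_b;
    uctx_b.uc_link = &uctx_a;          /* where to go when thread_b returns */
    makecontext(&uctx_b, thread_b, 0);

    log_step('a');
    swapcontext(&uctx_a, &uctx_b);     /* save A, start B */
    log_step('c');
    swapcontext(&uctx_a, &uctx_b);     /* save A, resume B after its swapcontext */
    log_step('e');
    return trace;
}
```

Calling run_demo() interleaves the two flows: each swapcontext call returns only when some other context switches back, just as a thread that calls into the scheduler appears to "return" from it much later.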
