Why Doesn't Linux Use the Hardware Context Switch via the Tss

Why doesn't Linux use the hardware context switch via the TSS?

The x86 TSS is very slow for hardware multitasking and offers almost no benefits when compared to software task switching. (In fact, I think doing it manually beats the TSS a lot of times)

The TSS is known also for being annoying and tedious to work with and it is not portable, even to x86-64. Linux aims at working on multiple architectures so they probably opted to use software task switching because it can be written in a machine independent way. Also, Software task switching provides a lot more power over what can be done and is generally easier to setup than the TSS is.

I believe Windows 3.1 used the TSS, but at least the NT >5 kernel does not. I do not know of any Unix-like OS that uses the TSS.

Do note that the TSS is mandatory. The thing that OSs do though is create a single TSS entry(per processor) and everytime they need to switch tasks, they just change out this single TSS. And also the only fields used in the TSS by software task switching is ESP0 and SS0. This is used to get to ring 0 from ring 3 code for interrupts. Without a TSS, there would be no known Ring 0 stack which would of course lead to a GPF and eventually triple fault.

How does a software-based context-switch with TSS work?

First of all, the TSS is a historical wart. Once in a time (a.k.a: early 1980's), people at Intel tought that hardware context-switching, instead of software context-switching, was a great idea. They were greatly wrong. Hardware context-switching has several noticeable disadvantages, and, as it was never implemented appropiately, had miserable performance. No sane OS even implemented it due to all of that, plus the fact that it's even less portable than segmentation. See the obscure corner of OSDevers for details.

Now, with respect to the Task State Segment. If any ever OS implemented hardware context-switching, it's purpose is to represent a "task". It's possible to represent both threads and processes as "tasks", but more often than not, in the few code we have using hardware context-switching, it represents a simple process. The TSS would hold stuff such as the task's general purpose register contents, the control registers (CR0, CR2, CR3, and CR4; there's no CR1), CPU flags and instruction pointer, etc...

However, in the real world, where software performs all context switches, we are left with a 104-byte long structure which is (almost) useless. However, as we're talking about Intel, it was never deprecated/removed, and OSes have to deal with it.

The problem is actually pretty simple. Suppose you're running your typical foo() function in your typical user-mode process. Suddenly, you, the user, press the Windows/Meta/Super/however-you-call-it key in order to launch your mail client. As a result, an interrupt request (IRQ) is sent from the keyboard into the interrupt controller (either a 8259A PIC or a IOAPIC). Then, the interrupt controller arranges things in order to trigger a CPU interrupt. The CPU enters into privilege level 0, The registers are pushed, along with the interrupt number, and kernel-mode code is invoked to handle the situation. Wait! Pushing stuff? Where? On the stack, of course! But, where is the stack pointer taken from in order to define a "stack"?

If you happened to use the user-mode stack pointer, bad things will happen, and a giant security exploit would be available. What would happen if the stack pointer pointed into an invalid address? It could happen. After all, strictly speaking, the stack pointer is just another general purpose register, and assembly programmers are known to use it that way for hardcoreness' sake.

An attempt to push stuff there would generate a CPU exception, nice! And, as double faults (exceptions that occur while attempting to handle interrupts) would yet again attempt to push over the invalid pointer, the worst nightmare of an operating system becomes true: a triple fault. Have you ever seen your computer suddenly reboot without any prior advice? That is a triple fault (or a power failure). The OS has no change to handle a triple fault, it just reboots everything.

Great, the system has rebooted. But, something worse could have happened. Had an attacker purposefully written the address of a critical kernel variable (!), and put the values that him would like written there in the right order, let the greatest privilege elevation exploit reign as getting superuser privileges becomes easier than ever! GDB, the kernel's configuration (found in /proc/config.gz, and the GCC version the kernel was compiled with are more than enough to do this.

Now, back to the TSS, it happens that the aforementioned structure contains the values of the stack pointer and the stack segment register that are loaded upon a interrupt while in privilege level 3 (user-mode). The kernel sets this to point to a safe stack in kernel-land. As a result, there's a "kernel stack" per thread in the system, and a TSS per each logical CPU in the system. Upon thread switching, the kernel just changes these two variables in the right TSS. And no, there can't be a single kernel stack per Logical CPU, because the kernel itself may be preempted (most of the time).

I hope this has led some light on you!

When kernel stack's esp is stored to TSS for interrupt return iret?

Once the kernel returns to user-space for a given task, it's done with that task's kernel stack until the next interrupt / exception. There's no useful data on it, so the TSS can hold a fixed SS:[ER]SP value that points to the top of the virtual page[s] allocated as the kernel stack for the current task.

Kernel state doesn't live on the kernel stack between entries into the kernel; it's kept elsewhere in a process control block. (Context switches between asks actually happen in the kernel, switching kernel stacks to the formerly-sleeping task's kernel stack, so eventually returning to user-space means returning up the call-chain of whatever that task was doing in the kernel first).

BTW, unless the kernel pushes a new CS:EIP / EFLAGS / SS:ESP for iret to pop, the stuff it pops will be the stuff pushed by hardware at the address specified in the TSS. So even if there was some desire to re-enter the kernel with the stack as you left it, that would normally be at the TSS location anyway. But this is irrelevant because Linux doesn't keep stuff on a task's kernel stack while user-space is running, except for a pointer to per-task stuff at the bottom of the region where the kernel can find it with [ER]SP & -16384.

(I think this is right; I've looked at a few bits of Linux kernel code but haven't really gotten my hands dirty experimenting with things. I think this is how Linux works, and a consistent viable design.)

How does Linux handles the I/O Permission Bitmap in the TSS structure?

Yes. From the same section of the book, it says:

The tss_struct structure describes the format of the TSS. As already
mentioned in Chapter 2, the init_tss array stores one TSS for each CPU
on the system. At each process switch, the kernel updates some fields
of the TSS so that the corresponding CPU’s control unit may safely
retrieve the information it needs. Thus, the TSS reflects the
privilege of the current process on the CPU, but there is no need to
maintain TSSs for processes when they’re not running.

In later versions of the kernel, init_tss was renamed to cpu_tss. The TSS structure of each processor is initialized in cpu_init, which is executed once per processor when booting the system.

When switching from one task to another, __switch_to_xtra is called, which calls switch_to_bitmap, which simply copies the IO bitmap of the next task into the TSS structure of the processor on which it is scheduled to run next.

Related: How do Intel CPUs that use the ring bus topology decode and handle port I/O operations.

saving general purpose registers in switch_to() in linux 2.6

Linux versions before 2.2.0 use hardware task switching, where the TSS saves/restores registers for you. That's what the "ljmp %0\n\t" is doing. (ljmp is AT&T syntax for a far jmp, presumably to a task gate). I'm not really familiar with hardware TSS stuff because it's not very relevant; it's still used in modern kernels for getting RSP pointing to the kernel stack for interrupt handlers, but not for context switching between tasks.

Hardware task switching is slow, so later kernels avoid it. Linux 2.2 does save/restore the call-preserved registers manually, with push/pop before/after swapping stacks. EAX, EDX, and ECX are declared as dummy outputs ("=a" (eax), "=d" (edx), "=c" (ecx)) so the compiler knows that the old values of those registers are no longer available.

This is a sensible choice because switch_to is probably used inside a non-inline function. The caller will make a function call that eventually returns (after running another task for a while) with the call-preserved registers restored, and the call-clobbered registers clobbered, just like a regular function call. (So compiler code-gen for the function that uses the switch_to macro doesn't need to emit save/restore code outside of the inline asm). If you think about writing a whole context switch function in asm (not inline asm), you'd get this clobbering of volatile registers for free because callers expect that.

So how do later kernels avoid saving/restoring those registers in inline asm?

Linux 2.4 uses "=b" (last) as an output operand, so the compiler has to save/restore EBX in a function that uses this asm. The asm still saves/restores ESI, EDI, and EBP (as well as ESP). The text of the article notes this:

The 2.4 kernel context switch brings a few minor changes: EBX is no longer pushed/popped, but it is now included in the output of the inline assembly. We have a new input argument.

I don't see where they tell the compiler about EAX, ECX, and EDX not surviving, so that's odd. It might be a bug that they get away with by making the function noinline or something?

Linux 2.6 on i386 uses more output operands that get the compiler to handle the save/restore.

But Linux 2.6 for x86-64 introduces the trick that hands off the save/restore to the compiler easily: #define __EXTRA_CLOBBER ,"rcx","rbx","rdx","r8","r9","r10", "r11","r12","r13","r14","r15"

Notice the clobbers declaration: : "memory", "cc" __EXTRA_CLOBBER

This tells the compiler that the inline asm destroys all those registers, so the compiler will emit instructions to save/restore these registers at the start/end of whatever function switch_to ultimately inlines into.

Telling the compiler that all the registers are destroyed after a context switch solves the same problem as manually saving/restoring them with inline asm. The compiler will still make a function that obeys the calling convention.

The context-switch swaps to the new task's stack, so the compiler-generated save/restore code is always running with the appropriate stack pointer. Notice that the explicit push/pop instructions inside the inline asm int Linux 2.2 and 2.4 are before / after everything else.

Does a system call involve a context switch or not?

Your understanding is wrong because the kernel isn't a separate process. The kernel is sitting in RAM in shared memory areas. Typically, it sits in the top half of the virtual address space.

When the kernel is invoked with a system call, it is not necessarily using an interrupt. On x86-64, it is invoked directly using a specific processor instruction (syscall). This instruction makes the processor jump to the address stored in a special register.

Syscalls don't necessarily involve a full context switch. They must involve a user mode to kernel mode context switch. Most often, kernels have a kernel stack per process. This stack is mostly unused and empty when no system call is active as it then makes no sense to have anything stored in it.

The registers also need to be saved since the kernel can use them. I don't know for other processors but x86-64 does have the TSS allowing for automated user mode to kernel mode stack switch. The registers still need to be saved manually.

In the end, there is actually a necessary partial context switch when entering the kernel through a system call but it doesn't involve switching the whole process. Since the temporary storage for swapped registers and the kernel stack are already reserved, it involves much less overhead as the kernel doesn't need to touch the page tables. Swapping page tables often involves cache managing and some cache flushing to make it consistent.

Does page fault necessarily cause context switch in Linux?

Yes. If this happens because your program in user-mode leads to a "page_fault" the CPU in which it is running receives the interrupt of "page_fault" and the context of the current execution must be saved in the system's stack space (typically is the firmware to do this) so that the control is passed to the handler of "page_fault" ("ENTRY(page_fault)" defined in /kernel/entry.S).

Task management on x86

Edited to add your actual answer:


Protected Mode Software Architecture

Tom Shanley

Addison-Wesley Professional (March 16, 1996)

ISBN-10: 020155447X

ISBN-13: 978-0201554472

googlebook, amazon

My answer

Have you looked at "Understanding the Linux Kernel," 3rd Edition? It's available via Safari, and it's probably a good place to start for the OS side of things -- I don't think it gives you nitty-
gritty details, but it's an excellent guide that would probably put the linux kernel source and architecture-specific stuff into context. The following chapters give you the narrative you're asking for from the kernel side ("relationship between the hardware and the OS when an interrupt or context-switch occurs"):

  • Chapter 3: Processes
  • Chapter 4: Interrupts and Exceptions
  • Chapter 7: Process Scheduling



Understanding the Linux Kernel, 3rd Ed.

Daniel P. Bovet; Marco Cesati

Publisher: O'Reilly Media, Inc.

Pub. Date: November 17, 2005

Print ISBN-13: 978-0-596-00565-8

Print ISBN-10: 0-596-00565-2

Safari, Amazon

My recommendation is a book like this, with the linux source code and the intel manuals and a full fridge of beer, and you'll be off and running.

A brief snippet from Chapter 3: Processes, to whet your appetite:

3.3.2. Task State Segment


The 80×86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:
  • When an 80×86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS (see the sections "Hardware Handling of Interrupts and Exceptions" in Chapter 4 and "Issuing a System Call via the sysenter Instruction" in Chapter 10).
  • When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.
More precisely, when a process executes an in or out I/O instruction in User Mode, the control unit performs the following operations:
  1. It checks the 2-bit IOPL field in the eflags register. If it is set to 3, the control unit executes the I/O instructions. Otherwise, it performs the next check.
  2. It accesses the tr register to determine the current TSS, and thus the proper I/O Permission Bitmap.
  3. It checks the bit of the I/O Permission Bitmap corresponding to the I/O port specified in the I/O instruction. If it is cleared, the instruction is executed; otherwise, the control unit raises a "General protection " exception.
The tss_struct structure describes the format of the TSS. As already mentioned in Chapter 2, the init_tss array stores one TSS for each CPU on the system. At each process switch, the kernel updates some fields of the TSS so that the corresponding CPU's control unit may safely retrieve the information it needs. Thus, the TSS reflects the privilege of the current process on the CPU, but there is no need to maintain TSSs for processes when they're not running.

Another potential reference in the same vein is this one, which does have a lot more x86-specific stuff, and you might benefit a bit from the contrast w/ PowerPC.
Linux® Kernel Primer, The: A Top-Down Approach for x86 and PowerPC Architectures

Claudia Salzberg Rodriguez; Gordon Fischer; Steven Smolski

Publisher: Prentice Hall

Pub. Date: September 19, 2005

Print ISBN-10: 0-13-118163-7

Print ISBN-13: 978-0-13-118163-2

Safari, Amazon

Finally, Robert Love's Linux Kernel Development, 3rd Edition, has a pretty thorough description of context switching, though it may be redundant with the above. It's a pretty fantastic resource.

Why are linux system calls different across architectures

You're looking at this from the wrong direction. mkdirat can do everything that mkdir can do and then some, so the question isn't why riscv64 doesn't have mkdir, but rather why x86 does have it. The answer to that is backwards userspace compatibility. Since Linux never breaks that, and mkdir existed first, it will exist there forever. But riscv64 never had it, so there are no userspace programs for it to break by not having it.

As for rmdir, the replacement for that isn't rmdirat, but rather unlinkat with AT_REMOVEDIR. The same argument then applies to it.



Related Topics



Leave a reply



Submit