How Does Linux Handle the I/O Permission Bitmap in the TSS Structure

How does Linux handle the I/O Permission Bitmap in the TSS structure?

Yes. From the same section of the book, it says:

The tss_struct structure describes the format of the TSS. As already
mentioned in Chapter 2, the init_tss array stores one TSS for each CPU
on the system. At each process switch, the kernel updates some fields
of the TSS so that the corresponding CPU’s control unit may safely
retrieve the information it needs. Thus, the TSS reflects the
privilege of the current process on the CPU, but there is no need to
maintain TSSs for processes when they’re not running.

In later versions of the kernel, init_tss was renamed to cpu_tss. The TSS structure of each processor is initialized in cpu_init, which is executed once per processor when booting the system.

When switching from one task to another, __switch_to_xtra is called, which calls switch_to_bitmap, which simply copies the IO bitmap of the next task into the TSS structure of the processor on which it is scheduled to run next.
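The copy step described above can be sketched in C roughly as follows. This is only a simplified model and not the kernel's actual code: `struct task`, `struct cpu_tss` and `switch_io_bitmap` are hypothetical names, and the real implementation differs between kernel versions and avoids copying when it can.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IO_BITMAP_BYTES 8192   /* 65536 ports / 8 bits per byte */

/* Hypothetical per-task state: only the part relevant to port I/O. */
struct task {
    int     has_io_bitmap;                /* nonzero if the task was granted ports */
    uint8_t io_bitmap[IO_BITMAP_BYTES];   /* 0 bit = port allowed, 1 bit = denied  */
};

/* Hypothetical per-CPU TSS tail: the bitmap plus the trailing all-ones byte. */
struct cpu_tss {
    uint8_t io_bitmap[IO_BITMAP_BYTES + 1];
};

/* Install the next task's I/O permissions into this CPU's TSS. */
static void switch_io_bitmap(struct cpu_tss *tss, const struct task *next)
{
    assert(tss != NULL && next != NULL);
    if (next->has_io_bitmap)
        memcpy(tss->io_bitmap, next->io_bitmap, IO_BITMAP_BYTES);
    else
        memset(tss->io_bitmap, 0xff, IO_BITMAP_BYTES);   /* deny every port */
    tss->io_bitmap[IO_BITMAP_BYTES] = 0xff;              /* terminator byte */
}
```

The point is only that the bitmap lives with the task and the TSS belongs to the CPU, so the TSS copy has to be refreshed whenever a task with different port permissions is scheduled in.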

Related: How do Intel CPUs that use the ring bus topology decode and handle port I/O operations.

Creating a proper Task State Segment (TSS) structure with and without an IO Bitmap?

This is a very fair question. Although at first glance the TSS with or without an IO Port Bitmap (IOPB) seems rather trivial, it has been the subject of intense discussion, debate, incorrect documentation, ambiguous documentation, and information from the CPU designers that at times muddied the waters. A very good read on this subject can be found in the OS/2 Museum. Despite the name, the information isn't limited to OS/2. One takeaway from that article sums it up:

It is obviously not trivial to use the IOPB correctly. In addition, an incorrectly set up IOPB is unlikely to cause obvious problems, but may disallow access to desired ports or (much worse, security-wise) allow access to undesired ports.

The sordid history of the TSS and IOPB as it pertained to security holes and bugs in 386BSD, NetBSD, OpenBSD makes for an interesting read and should be an indicator that the questions you pose are reasonable if you wish to avoid introducing bugs.

Answers to Questions

If you want no IOPB, you can simply fill the IOPB offset field with the length of your entire TSS structure (do not subtract 1). Your TSS Structure should have no trailing 0xff byte in it. The TSS limit in the TSS descriptor (as you already are aware) will be one less than that value. The Intel manuals say that there is no IOPB if the value in the IOPB offset value is greater than the TSS limit. If the value in the IOPB offset field is always 1 greater than the limit this condition is satisfied. This is how modern Microsoft Windows handles it.
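As a rough C illustration of the "no IOPB" case (a sketch assuming GCC or Clang for the packed attribute; `struct tss32` and `tss_init_no_iopb` are names made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit TSS: 25 dwords followed by the trap word and the IOPB offset. */
struct tss32 {
    uint32_t back_link, esp0, ss0, esp1, ss1, esp2, ss2;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs, ldt;
    uint16_t trap;
    uint16_t iomap_base;
} __attribute__((packed));

static_assert(sizeof(struct tss32) == 104, "32-bit TSS must be 104 bytes");

/* Marks the TSS as having no IOPB and returns the limit that belongs
 * in the TSS descriptor. */
static uint16_t tss_init_no_iopb(struct tss32 *tss)
{
    tss->iomap_base = (uint16_t)sizeof *tss;   /* 104: the full structure length */
    return (uint16_t)(sizeof *tss - 1);        /* 103: offset > limit, so no IOPB */
}
```

With these values the IOPB offset is always exactly one greater than the limit, which is the condition the Intel manuals give for "no IOPB present".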

If using an IOPB, set an additional byte at the end to 0xff per the Intel documentation. Setting an extra byte to 0xff prevents any multi port access (INW/OUTW/INL/OUTL) starting in or ending in the last 8 ports. This avoids the situation where a multi port read/write straddles the end of the IOPB and touches ports that fall outside the range of the IOPB. It also denies a multi port access that starts on a port preceding the last 8 ports and crosses into the following 8 ports. If any port of a multi port access has a permission bit set to 1, the entire port access is denied (per the Intel documentation).

It is unclear what the x represents in the context of the diagram, but if those bits were set to 0 they would appear as permissible ports which isn't what you want. Again, stick with the Intel documentation and set an extra trailing byte to 0xff (all bits set to deny access).

From the Intel386 DX Microprocessor Data Sheet:

Each bit in the I/O Permission Bitmap corresponds to a single byte-wide I/O port, as illustrated in Figure 4-15a. If a bit is 0, I/O to the corresponding byte-wide port can occur without generating an exception. Otherwise the I/O instruction causes an exception 13 fault. Since every byte-wide I/O port must be
protectable, all bits corresponding to a word-wide or dword-wide port must be 0 for the word-wide or dword-wide I/O to be permitted. If all the referenced
bits are 0, the I/O will be allowed. If any referenced bits are 1, the attempted I/O will cause an exception 13 fault.

and

IMPORTANT IMPLEMENTATION NOTE: Beyond the last byte of I/O mapping information in the I/O Permission Bitmap must be a byte containing all 1’s. The byte of all 1’s must be within the limit of the Intel386 DX TSS segment (see Figure 4-15a).
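To see why the trailing byte matters, here is a small C model of the check. It is a sketch under stated assumptions, not a description of any particular CPU: it assumes the processor always fetches two bytes of the bitmap per access (as the Intel manuals describe) and denies the access if either byte lies beyond the TSS limit. The function name `io_access_allowed` and its parameters are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* tss:        raw bytes of the TSS, starting at its base
 * tss_limit:  limit from the TSS descriptor (last valid byte offset)
 * iomap_base: value of the IOPB offset field
 * width:      1, 2 or 4 consecutive byte-wide ports */
static bool io_access_allowed(const uint8_t *tss, uint32_t tss_limit,
                              uint16_t iomap_base, uint16_t port,
                              unsigned width)
{
    assert(width == 1 || width == 2 || width == 4);

    uint32_t first = (uint32_t)iomap_base + port / 8;  /* byte holding the first bit */
    if (first + 1 > tss_limit)      /* both fetched bytes must be inside the TSS */
        return false;

    uint16_t bits = (uint16_t)(tss[first] | (tss[first + 1] << 8));
    uint16_t mask = (uint16_t)(((1u << width) - 1u) << (port % 8));
    return (bits & mask) == 0;      /* every referenced bit must be 0 */
}
```

With a 2-byte bitmap (ports 0 to 15) followed by the 0xff byte, a byte access to port 15 is allowed but a word access to port 15 is denied, because its second bit falls in the 0xff byte. Without the extra byte inside the limit, the two-byte fetch for ports 8 to 15 would run past the limit and those ports would be denied even though their bits are 0.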

In NASM assembly you could create a structure that looks like:

tss_entry:
.back_link: dd 0
.esp0:      dd 0              ; Kernel stack pointer used on ring transitions
.ss0:       dd 0              ; Kernel stack segment used on ring transitions
.esp1:      dd 0
.ss1:       dd 0
.esp2:      dd 0
.ss2:       dd 0
.cr3:       dd 0
.eip:       dd 0
.eflags:    dd 0
.eax:       dd 0
.ecx:       dd 0
.edx:       dd 0
.ebx:       dd 0
.esp:       dd 0
.ebp:       dd 0
.esi:       dd 0
.edi:       dd 0
.es:        dd 0
.cs:        dd 0
.ss:        dd 0
.ds:        dd 0
.fs:        dd 0
.gs:        dd 0
.ldt:       dd 0
.trap:      dw 0
.iomap_base:dw .iomap-tss_entry ; IOPB offset; equals TSS_SIZE when there is no IOPB
;.cetssp:    dd 0              ; Need this if CET is enabled

; Insert any kernel defined task instance data here
; ...

; If using VME (Virtual-8086 Mode Extensions) there needs to be an additional 32 bytes
; available immediately preceding iomap. If using VME uncomment next 2 lines
;.vmeintmap:                     ; If VME enabled uncomment this line and the next
;TIMES 32    db 0                ;     32*8 bits = 256 bits (one bit for each interrupt)

.iomap:
TIMES TSS_IO_BITMAP_SIZE db 0x0
                                ; IO bitmap (IOPB) size 8192 (8*8192=65536) representing
                                ; all ports. An IO bitmap size of 0 would fault all IO
                                ; port access if IOPL < CPL (CPL=3 with v8086)
%if TSS_IO_BITMAP_SIZE > 0
.iomap_pad: db 0xff             ; Padding byte that has to be filled with 0xff
                                ; To deal with issues on some CPUs when using an IOPB
%endif
TSS_SIZE EQU $-tss_entry

Special Note:

If you are using a high level language and creating a TSS structure, ensure you use a packed structure (e.g. GCC's __attribute__((packed)) or MSVC's #pragma pack). Review your compiler documentation for more details. Failure to heed this advice could cause extra bytes to be added to the end of your TSS structure, which causes problems if you have an IOPB. If an IOPB is present in the TSS and extra padding bytes are added, those bytes become part of the IO bitmap and may grant/deny permissions you didn't intend. This was one of the failures that produced bugs in BSD kernels.

The rules for the 64-bit TSS are the same when it comes to creating a TSS with or without an IOPB. A 64-bit TSS is still used in long mode (both 64-bit and compatibility sub-modes) and is loaded into the Task Register via the LTR instruction, the same way as in legacy protected mode.
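The padding problem is easy to demonstrate in C. The sketch below assumes GCC or Clang and uses two hypothetical structures that differ only in the packed attribute; the 25 fixed dwords are collapsed into an array to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IOPB_BYTES 8192   /* one bit for each of the 65536 ports */

struct tss_iopb_unpacked {
    uint32_t fixed[25];            /* back_link .. ldt */
    uint16_t trap;
    uint16_t iomap_base;
    uint8_t  iomap[IOPB_BYTES];
    uint8_t  iomap_pad;            /* must be 0xff */
};

struct tss_iopb_packed {
    uint32_t fixed[25];
    uint16_t trap;
    uint16_t iomap_base;
    uint8_t  iomap[IOPB_BYTES];
    uint8_t  iomap_pad;            /* must be 0xff */
} __attribute__((packed));

/* 104 + 8192 + 1 bytes, with nothing after the 0xff byte. */
static_assert(sizeof(struct tss_iopb_packed) == 104 + IOPB_BYTES + 1,
              "packed TSS must end right after the 0xff byte");
static_assert(offsetof(struct tss_iopb_packed, iomap) == 104,
              "bitmap starts right after the fixed part");

/* Number of compiler-added bytes that would follow iomap_pad if the
 * unpacked structure's sizeof were used to compute the TSS limit. */
static size_t stray_tail_bytes(void)
{
    return sizeof(struct tss_iopb_unpacked) - sizeof(struct tss_iopb_packed);
}
```

On common ABIs the unpacked structure is rounded up to a multiple of 4 bytes, so a limit computed from its sizeof would extend past the 0xff byte and pull uninitialized tail padding into the bitmap.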

Why doesn't Linux use the hardware context switch via the TSS?

The x86 TSS is very slow for hardware multitasking and offers almost no benefits when compared to software task switching. (In fact, I think doing it manually beats the TSS a lot of times)

The TSS is also known for being annoying and tedious to work with, and it is not portable, even to x86-64. Linux aims to work on multiple architectures, so they probably opted for software task switching because it can be written in a machine-independent way. Also, software task switching gives a lot more control over what can be done and is generally easier to set up than the TSS.

I believe Windows 3.1 used the TSS, but at least the NT >5 kernel does not. I do not know of any Unix-like OS that uses the TSS.

Do note that the TSS is mandatory. What OSes do, though, is create a single TSS (per processor), and every time they need to switch tasks they just update that single TSS. The only fields of the TSS used with software task switching are ESP0 and SS0. These are used to get to ring 0 from ring 3 code for interrupts. Without a TSS there would be no known ring 0 stack, which would of course lead to a general protection fault and eventually a triple fault.

using IN instruction crashes the x86 program

A program can read data from an I/O port only if it runs in real mode or in ring 0, which is reserved for the kernel and device drivers (unless the OS grants access through IOPL or the I/O permission bitmap).
In native DOS you can read whichever I/O port you like, and some well-known ports can be read/written even when the real-mode program runs in an emulator (NTVDM, DOSBox).

But as you have chosen Windows protected-mode executable, this won't work.

Invoke WinAPI function Sleep(dwMilliseconds) instead.

Task management on x86

Edited to add your actual answer:

Protected Mode Software Architecture

Tom Shanley

Addison-Wesley Professional (March 16, 1996)

ISBN-10: 020155447X

ISBN-13: 978-0201554472

googlebook, amazon

My answer

Have you looked at "Understanding the Linux Kernel," 3rd Edition? It's available via Safari, and it's probably a good place to start for the OS side of things -- I don't think it gives you nitty-gritty details, but it's an excellent guide that would probably put the Linux kernel source and architecture-specific stuff into context. The following chapters give you the narrative you're asking for from the kernel side ("relationship between the hardware and the OS when an interrupt or context-switch occurs"):

Chapter 3: Processes
Chapter 4: Interrupts and Exceptions
Chapter 7: Process Scheduling

Understanding the Linux Kernel, 3rd Ed.

Daniel P. Bovet; Marco Cesati

Publisher: O'Reilly Media, Inc.

Pub. Date: November 17, 2005

Print ISBN-13: 978-0-596-00565-8

Print ISBN-10: 0-596-00565-2

Safari, Amazon

My recommendation is a book like this, with the Linux source code and the Intel manuals and a full fridge of beer, and you'll be off and running.

A brief snippet from Chapter 3: Processes, to whet your appetite:

3.3.2. Task State Segment

The 80×86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:

• When an 80×86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS (see the sections "Hardware Handling of Interrupts and Exceptions" in Chapter 4 and "Issuing a System Call via the sysenter Instruction" in Chapter 10).

• When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.

More precisely, when a process executes an in or out I/O instruction in User Mode, the control unit performs the following operations:

1. It checks the 2-bit IOPL field in the eflags register. If it is set to 3, the control unit executes the I/O instructions. Otherwise, it performs the next check.
2. It accesses the tr register to determine the current TSS, and thus the proper I/O Permission Bitmap.
3. It checks the bit of the I/O Permission Bitmap corresponding to the I/O port specified in the I/O instruction. If it is cleared, the instruction is executed; otherwise, the control unit raises a "General protection" exception.

The tss_struct structure describes the format of the TSS. As already mentioned in Chapter 2, the init_tss array stores one TSS for each CPU on the system. At each process switch, the kernel updates some fields of the TSS so that the corresponding CPU's control unit may safely retrieve the information it needs. Thus, the TSS reflects the privilege of the current process on the CPU, but there is no need to maintain TSSs for processes when they're not running.
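The sequence of checks the book describes can be written down as a small C function. This is just an illustration of the decision order for a single byte-wide port; `check_port_io` and the enum are made-up names, and the bitmap is passed in directly instead of being located through tr.

```c
#include <assert.h>
#include <stdint.h>

enum io_result { IO_EXECUTE, IO_GENERAL_PROTECTION };

/* cpl:       current privilege level (3 for User Mode)
 * iopl:      2-bit IOPL field from eflags
 * io_bitmap: the I/O Permission Bitmap of the current TSS */
static enum io_result check_port_io(unsigned cpl, unsigned iopl,
                                    const uint8_t *io_bitmap, uint16_t port)
{
    assert(cpl <= 3 && iopl <= 3);

    /* Step 1: IOPL check. For User Mode (CPL 3) this passes only if IOPL is 3. */
    if (cpl <= iopl)
        return IO_EXECUTE;

    /* Step 2: the CPU finds the bitmap through tr; here the caller supplies it. */

    /* Step 3: a cleared bit permits the access, a set bit raises #GP. */
    if ((io_bitmap[port / 8] >> (port % 8)) & 1u)
        return IO_GENERAL_PROTECTION;
    return IO_EXECUTE;
}
```

So a process with IOPL 3 never reaches the bitmap at all, while a process with IOPL 0 is governed entirely by the bit for the port it touches.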

Another potential reference in the same vein is this one, which does have a lot more x86-specific stuff, and you might benefit a bit from the contrast w/ PowerPC.
Linux® Kernel Primer, The: A Top-Down Approach for x86 and PowerPC Architectures

Claudia Salzberg Rodriguez; Gordon Fischer; Steven Smolski

Publisher: Prentice Hall

Pub. Date: September 19, 2005

Print ISBN-10: 0-13-118163-7

Print ISBN-13: 978-0-13-118163-2

Safari, Amazon

Finally, Robert Love's Linux Kernel Development, 3rd Edition, has a pretty thorough description of context switching, though it may be redundant with the above. It's a pretty fantastic resource.

How does a software-based context-switch with TSS work?

First of all, the TSS is a historical wart. Once upon a time (a.k.a. the early 1980s), people at Intel thought that hardware context-switching, instead of software context-switching, was a great idea. They were greatly wrong. Hardware context-switching has several noticeable disadvantages and, as it was never implemented appropriately, had miserable performance. No sane OS ever implemented it because of all that, plus the fact that it's even less portable than segmentation. See the obscure corner of OSDevers for details.

Now, with respect to the Task State Segment: if any OS ever implemented hardware context-switching, the TSS's purpose would be to represent a "task". It's possible to represent both threads and processes as "tasks", but more often than not, in the little code we have that uses hardware context-switching, it represents a simple process. The TSS holds stuff such as the task's general purpose register contents, the segment registers, CR3 (the only control register it stores), the CPU flags and instruction pointer, etc.

However, in the real world, where software performs all context switches, we are left with a 104-byte long structure which is (almost) useless. However, as we're talking about Intel, it was never deprecated/removed, and OSes have to deal with it.

The problem is actually pretty simple. Suppose you're running your typical foo() function in your typical user-mode process. Suddenly, you, the user, press the Windows/Meta/Super/however-you-call-it key in order to launch your mail client. As a result, an interrupt request (IRQ) is sent from the keyboard to the interrupt controller (either an 8259A PIC or an I/O APIC). Then, the interrupt controller arranges things in order to trigger a CPU interrupt. The CPU enters privilege level 0, the registers are pushed, along with the interrupt number, and kernel-mode code is invoked to handle the situation. Wait! Pushing stuff? Where? On the stack, of course! But where is the stack pointer taken from in order to define a "stack"?

If you happened to use the user-mode stack pointer, bad things will happen, and a giant security exploit would be available. What would happen if the stack pointer pointed into an invalid address? It could happen. After all, strictly speaking, the stack pointer is just another general purpose register, and assembly programmers are known to use it that way for hardcoreness' sake.

An attempt to push stuff there would generate a CPU exception, nice! And, as double faults (exceptions that occur while attempting to handle interrupts) would yet again attempt to push over the invalid pointer, the worst nightmare of an operating system comes true: a triple fault. Have you ever seen your computer suddenly reboot without any prior warning? That is a triple fault (or a power failure). The OS has no chance to handle a triple fault; the machine just reboots.

Great, the system has rebooted. But something worse could have happened. Had an attacker purposefully loaded the stack pointer with the address of a critical kernel variable (!), and arranged for the values they would like written there to be pushed in the right order, a privilege-escalation exploit follows and getting superuser privileges becomes easier than ever! GDB, the kernel's configuration (found in /proc/config.gz), and the GCC version the kernel was compiled with are more than enough to do this.

Now, back to the TSS. It happens that the aforementioned structure contains the values of the stack pointer and the stack segment register that are loaded upon an interrupt while in privilege level 3 (user-mode). The kernel sets these to point to a safe stack in kernel-land. As a result, there's a "kernel stack" per thread in the system, and a TSS per logical CPU in the system. Upon thread switching, the kernel just changes these two fields in the right TSS. And no, there can't be a single kernel stack per logical CPU, because the kernel itself may be preempted (most of the time).
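In C, the per-switch update amounts to two stores. The following is only a sketch with hypothetical names (`struct tss32_head`, `struct thread`, `tss_set_kernel_stack`); it models just the first three dwords of the 32-bit TSS, where ESP0 sits at offset 4 and SS0 at offset 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First three dwords of a 32-bit TSS. */
struct tss32_head {
    uint32_t back_link;
    uint32_t esp0;     /* stack pointer loaded on a ring 3 -> ring 0 transition */
    uint32_t ss0;      /* stack segment loaded on the same transition */
};

static_assert(offsetof(struct tss32_head, esp0) == 4, "ESP0 lives at offset 4");
static_assert(offsetof(struct tss32_head, ss0) == 8, "SS0 lives at offset 8");

/* Hypothetical thread descriptor: each thread owns its own kernel stack. */
struct thread {
    uint32_t kernel_stack_top;
};

/* Called by the scheduler on the CPU that is about to run `next`. */
static void tss_set_kernel_stack(struct tss32_head *cpu_tss,
                                 const struct thread *next,
                                 uint16_t kernel_ss)
{
    cpu_tss->esp0 = next->kernel_stack_top;
    cpu_tss->ss0  = kernel_ss;
}
```

Nothing else in the TSS needs to change for an ordinary software context switch, which is why one TSS per logical CPU is enough.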

I hope this has shed some light on the matter!

x86/x64: modifying TSS fields

The processor uses the TSS to store the current context and load the next-to-be-scheduled context during task switching.

Changing the TSS structure won't affect any context until the CPU switches to such TSS.

The Intel manual lists the cases in which the CPU performs a task switch:

Software or the processor can dispatch a task for execution in one of the following ways:

• An explicit call to a task with the CALL instruction.

• An explicit jump to a task with the JMP instruction.

• An implicit call (by the processor) to an interrupt-handler task.

• An implicit call to an exception-handler task.

• A return (initiated with an IRET instruction) when the NT flag in the EFLAGS register is set.

You can read about the TSS in Chapter 7 of volume 3 of the Intel manual.

The ltr instruction doesn't perform a switch. From volume 2 of the Intel manual:

After the segment selector is loaded in the task register, the processor uses the segment selector to locate the segment descriptor for the TSS in the global descriptor table (GDT). It then loads the segment limit and base address for the TSS from the segment descriptor into the task register. The task pointed to by the task register is marked busy, but a switch to the task does not occur.

EDIT: I've actually tested if the CPU cached the static values from the TSS.

The test consisted of a boot program (attached) that:

1. Creates a GDT with two Code segments with DPL 0 and 3, two Data segments with DPL 0 and 3, a TSS and a Call gate with DPL 3 to the code segment with DPL 0.
2. Switches to protected mode, sets the value of ESP0 in the TSS to v1 and loads the tr.
3. Returns to the code segment with DPL 3, changes the value of ESP0 to v2 and calls the Call gate.
4. Checks whether ESP is v1-10h or v2-10h and prints 1 or 2 respectively (or 0 if for some reason neither matches). The 10h accounts for the SS, ESP, CS and EIP dwords pushed by the inter-privilege call.

On my Haswell and on Bochs the result is 2, meaning that the CPU read the TSS from the memory (hierarchy) when needed.

Though a test on a model cannot be generalised to the ISA, it is unlikely that this is not the case.

BITS 16

xor ax, ax          ;Most EFI CPS need the first instruction to be this

;But I like to have my offset to be close to 0, not 7c00h

jmp 7c0h : WORD __START__

__START__:

  cli

  ;Set up the segments to 7c0h

  mov ax, cs
  mov ss, ax
  xor sp, sp
  mov ds, ax

  ;Switch to PM

  lgdt [GDT]

  mov eax, cr0
  or ax, 1
  mov cr0, eax

  ;Set CS

  jmp CS_DPL0 : WORD __PM__ + 7c00h

__PM__:

  BITS 32

  ;Set segments

  mov ax, DS_DPL0
  mov ss, ax
  mov ds, ax
  mov es, ax

  mov esp, ESP_VALUE0

  ;Make a minimal TSS BEFORE loading TR

  mov eax, DS_DPL0
  mov DWORD [TSS_BASE + TSS_SS0], eax
  mov DWORD [TSS_BASE + TSS_ESP0], ESP_VALUE1

  ;Load TSS in TR

  mov ax, TSS_SEL
  ltr ax

  ;Go to CPL = 3

  push DWORD DS_DPL3 | RPL_3
  push DWORD ESP_VALUE0
  push DWORD CS_DPL3 | RPL_3
  push DWORD __PMCPL3__ + 7c00h
  retf

__PMCPL3__:

  ;UPDATE ESP IN TSS

  mov ax, DS_DPL3 | RPL_3
  mov ds, ax

  mov DWORD [TSS_BASE + TSS_ESP0], ESP_VALUE2

  ;SWITCH STACK

  call CALL_GATE : 0

  jmp $

__PMCG__:

  mov eax, esp

  mov bx, 0900h | '1'
  cmp eax, ESP_VALUE1 - 10h
  je __write

  mov bl, '2'
  cmp eax, ESP_VALUE2 - 10h
  je __write

  mov bl, '0'

__write:

  mov WORD [0b8000h + 80*5*2], bx

  cli
  hlt

GDT dw 37h
    dd GDT + 7c00h      ;GDT symbol is relative to 0 for the assembler
                ;We translate it to linear

    dw 0

    ;Index 1 (Selector 08h)
    ;TSS starting at 8000h and with length = 64KiB

    dw 0ffffh
    dw TSS_BASE
    dd 0000e900h

    ;Index 2 (Selector 10h)
    ;Code segment with DPL=3

    dd 0000ffffh, 00cffa00h

    ;Index 3 (Selector 18h)
    ;Data segment with DPL=3

    dd 0000ffffh, 00cff200h

    ;Index 4 (Selector 20h)
    ;Code segment with DPL=0

    dd 0000ffffh, 00cf9a00h

    ;Index 5 (Selector 28h)
    ;Data segment with DPL=0

    dd 0000ffffh, 00cf9200h

    ;Index 6 (Selector 30h)
    ;Call gate with DPL = 3 for SEL=20

    dw __PMCG__ + 7c00h
    dw CS_DPL0
    dd 0000ec00h

  ;Fake partition table entry

  TIMES 446-($-$$) db 0

  db 80h, 0,0,0, 07h

  TIMES 510-($-$$) db 0
  dw 0aa55h

  TSS_BASE  EQU     8000h
  TSS_ESP0  EQU     4
  TSS_SS0   EQU     8

  ESP_VALUE0    EQU 7c00h
  ESP_VALUE1    EQU 6000h
  ESP_VALUE2    EQU 7000h

  CS_DPL0   EQU 20h
  CS_DPL3   EQU 10h
  DS_DPL0   EQU 28h
  DS_DPL3   EQU 18h
  TSS_SEL   EQU 08h
  CALL_GATE EQU 30h

  RPL_3     EQU 03h
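The hand-encoded descriptors in the GDT above are easy to misread, so here is a small C decoder that can be used to double-check them. It is a sketch for segment and TSS descriptors only (call gates use a different layout), and `struct raw_descriptor`, `struct descriptor_info` and `decode_descriptor` are names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One GDT entry as the two dwords written in the listing. */
struct raw_descriptor {
    uint32_t low;
    uint32_t high;
};

static_assert(sizeof(struct raw_descriptor) == 8, "GDT entries are 8 bytes");

struct descriptor_info {
    uint32_t base;
    uint32_t limit;          /* raw 20-bit limit, before granularity scaling */
    unsigned type;           /* low 4 bits of the access byte */
    unsigned dpl;            /* descriptor privilege level */
    bool     is_system;      /* S bit clear: TSS, LDT, gates */
    bool     present;
    bool     page_granular;  /* G bit */
};

static struct descriptor_info decode_descriptor(struct raw_descriptor r)
{
    struct descriptor_info d;
    unsigned access = (r.high >> 8) & 0xffu;

    d.base  = (r.low >> 16) | ((r.high & 0xffu) << 16) | (r.high & 0xff000000u);
    d.limit = (r.low & 0xffffu) | (r.high & 0x000f0000u);
    d.type          = access & 0x0fu;
    d.is_system     = ((access >> 4) & 1u) == 0;
    d.dpl           = (access >> 5) & 3u;
    d.present       = ((access >> 7) & 1u) != 0;
    d.page_granular = ((r.high >> 23) & 1u) != 0;
    return d;
}
```

For example, decoding `0000ffffh, 00cffa00h` gives a present, page-granular code segment with DPL 3, and `0000ffffh, 00cff200h` (selector 18h) decodes to a data segment with DPL 3, which matches the DS_DPL3 constant.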
