Does the Linux Scheduler Need to Be Context Switched?

Does the Linux scheduler need to be context switched?

That's a very good question, and the answer to it would be "yes" except for the fact that the hardware is aware of the concept of an OS and task scheduler.

In the hardware, you'll find registers that are restricted to "supervisor" mode. Without going into too much detail about the internal CPU architecture, there's a copy of the basic program execution registers for "user mode" and one for "supervisor mode," the latter of which can only be accessed by the OS itself (selected via a flag in a control register that the kernel sets, which says whether the kernel or a user mode application is currently running).

So the "context switch" you speak of is the process of swapping/resetting the user mode registers (instruction register, stack pointer register, etc.) etc. but the system registers don't need to be swapped out because they're stored apart from the user ones.

For instance, on the Motorola 68000 the user mode stack pointer (USP) and the supervisor mode stack pointer (SSP) are two separate registers that are both accessed as A7 - which one A7 refers to depends on the current mode. So the kernel itself (which contains the task scheduler) uses the supervisor mode stack and other supervisor mode registers to run itself, with the supervisor mode flag set to 1 while it's running; it then performs a context switch on the user mode state to swap between apps and sets the supervisor mode flag back to 0.
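As a rough mental model (the struct and field names below are purely illustrative, not a real hardware definition), you can picture that 68000-style register banking like this:

    #include <stdint.h>

    /* Illustrative sketch: most registers are shared between modes, but the
     * stack pointer exists once per privilege mode, and a bit in the status
     * register selects which copy "A7" refers to. */
    struct cpu_register_file {
        uint32_t d[8];   /* data registers D0-D7, shared between modes            */
        uint32_t a[7];   /* address registers A0-A6, shared between modes         */
        uint32_t usp;    /* user stack pointer: what A7 means in user mode        */
        uint32_t ssp;    /* supervisor stack pointer: what A7 means in the kernel */
        uint32_t pc;     /* program counter                                       */
        uint16_t sr;     /* status register; its S bit is the "supervisor" flag   */
    };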

But before OSes and task schedulers existed, if you wanted to build a multitasking system you'd have had to use the basic approach you outlined in your question: use a hardware interrupt to call the task scheduler every x cycles, swap out the app for the task scheduler, then swap in the new app. In most cases, though, the timer interrupt handler would be your actual task scheduler itself, and it would have been heavily optimized to make it less of a full context switch and more of a simple interrupt handler routine.
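A minimal sketch of that older, timer-driven style of multitasking (everything here is hypothetical; the save/load helpers stand in for whatever interrupt glue a real machine would provide):

    #include <stdint.h>

    #define NUM_TASKS 4

    /* Hypothetical per-task save area for the user-mode registers. */
    struct user_context {
        uint32_t pc;
        uint32_t sp;
        uint32_t regs[8];
    };

    static struct user_context tasks[NUM_TASKS];
    static int current;

    /* Imaginary low-level glue: copy the registers the interrupt pushed on
     * entry into / out of a save area. */
    void save_user_context(struct user_context *c);
    void load_user_context(const struct user_context *c);

    /* Called from the hardware timer every N ticks.  This *is* the whole
     * "task scheduler": an interrupt handler that rotates through the apps. */
    void timer_tick(void)
    {
        save_user_context(&tasks[current]);   /* park the interrupted app     */
        current = (current + 1) % NUM_TASKS;  /* simple round robin           */
        load_user_context(&tasks[current]);   /* the return-from-interrupt    */
    }                                         /* will now resume the next app */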

Is a context switch needed for the short-term scheduler to run?

Let's start by assuming a task has a state that is one of the following (sketched as a small code snippet after the list):

  • "currently running". If there are 8 CPUs then a maximum of 8 tasks can be currently running on a CPU at the same time.

  • "ready to run". If there are 20 tasks and 8 CPUs, then there may be 12 tasks that are ready to run on a CPU.

  • "blocked". This is waiting for IO (disk, network, keyboard, ...), waiting to acquire a mutex, waiting for time to pass (e.g. sleep()), etc. Note that this includes things the task isn't aware of (e.g. fetching data from swap space because the task tried to access data that isn't actually in memory).
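To make those three states concrete, here is a small sketch (all names are hypothetical, chosen for this answer rather than taken from Linux):

    enum task_state {
        TASK_RUNNING,   /* "currently running" on some CPU                    */
        TASK_READY,     /* "ready to run", waiting for a CPU to become free   */
        TASK_BLOCKED,   /* waiting for IO, a mutex, a timer, a swap-in, ...   */
    };

    struct task {
        int             id;
        int             priority;
        enum task_state state;
        void           *saved_cpu_state;   /* filled in when it loses the CPU    */
        struct task    *next_ready;        /* link in a per-priority ready queue */
    };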

Sometimes a task will do something (call a kernel function like read(), sleep(), pthread_mutex_lock(), etc.; or access data that isn't in memory) that causes it to switch from the "currently running" state to the "blocked" state. When this happens, some other part of the kernel (e.g. the virtual file system layer, virtual memory management, ...) tells the scheduler that the currently running task has blocked (and needs to be put into the "blocked" state). The scheduler then has to find something else for the CPU to do: either find another task for the CPU to run (switching that task from "ready to run" to "currently running"), or put the CPU into a power saving state (because there are no tasks for the CPU to run).
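A hedged sketch of that "task blocks" path, reusing the hypothetical struct task from above (none of these function names are real Linux APIs):

    /* Assumed to exist elsewhere in this sketch. */
    struct task *current_task(void);
    struct task *pick_next_ready_task(void);   /* highest-priority ready task, or NULL */
    void switch_to_task(struct task *next);
    void enter_cpu_power_saving(void);

    /* Called by other kernel subsystems (VFS, virtual memory, drivers, ...)
     * when the currently running task can make no further progress. */
    void block_current_task(int reason)
    {
        struct task *prev = current_task();

        (void)reason;                      /* a real kernel would record why it blocked */
        prev->state = TASK_BLOCKED;        /* "currently running" -> "blocked"          */

        struct task *next = pick_next_ready_task();
        if (next)
            switch_to_task(next);          /* hand this CPU to another task             */
        else
            enter_cpu_power_saving();      /* nothing runnable: idle the CPU            */
    }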

Sometimes something that a task was waiting for occurs (e.g. the user presses a key, a mutex is released, data arrives from swap space). When this happens, some other part of the kernel (e.g. the virtual file system layer, virtual memory management, ...) tells the scheduler that the task needs to leave the "blocked" state. The scheduler then has to decide whether the task goes from "blocked" to "ready to run" (and the tasks that were using CPUs keep using them), or from "blocked" to "currently running" (which either causes a currently running task to be preempted and go from "currently running" to "ready to run", or causes a previously idle CPU to be taken out of a power saving state). Note that in a well designed OS this decision depends on things like task priorities (e.g. if a high priority task unblocks it preempts a low priority task, but if a low priority task unblocks it doesn't preempt a high priority task).
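And the matching "task unblocks" path, again as a hypothetical sketch of the decision described above (building on the same struct task):

    /* Assumed helpers for this sketch. */
    struct task *lowest_priority_running_task(void);  /* NULL if some CPU is idle */
    void wake_idle_cpu_and_run(struct task *t);
    void preempt_cpu_with(struct task *victim, struct task *t);
    void enqueue_ready(struct task *t);

    /* Called when whatever the task was waiting for has happened
     * (key press, mutex released, data arrived from swap space, ...). */
    void unblock_task(struct task *t)
    {
        struct task *victim = lowest_priority_running_task();

        if (victim == NULL) {
            t->state = TASK_RUNNING;
            wake_idle_cpu_and_run(t);          /* an idle CPU leaves power saving   */
        } else if (t->priority > victim->priority) {
            victim->state = TASK_READY;        /* high priority preempts low        */
            t->state = TASK_RUNNING;
            preempt_cpu_with(victim, t);
        } else {
            t->state = TASK_READY;             /* low priority just waits its turn  */
            enqueue_ready(t);
        }
    }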

On modern systems these 2 things (tasks entering and leaving the "blocked" state) are responsible for most task switches.

Other things that can cause task switches are:

  • a task terminates itself or crashes. This is mostly the same as a task blocking (some other part of the kernel informs the scheduler and the scheduler has to find something else for the CPU to do).

  • a new task is created. This is mostly the same as a task unblocking (some other part of the kernel informs the scheduler and the scheduler decides if the new task will preempt a currently running task or cause a CPU to be taken out of a power saving state).

  • the scheduler is frequently switching between 2 or more tasks to create the illusion that they're all running at the same time (time multiplexing). On a well designed modern system this only ever happens when there are more tasks at the same priority than there are available CPUs and those tasks don't block often enough, which is extremely rare. In some cases (e.g. the "earliest deadline first" scheduling algorithm in a real-time system) this might be impossible.

My understanding is that the short-term scheduler is a module in the kernel (a process in itself, I guess?)

The scheduler is typically implemented as a set of functions that other parts of the kernel call - e.g. maybe a block_current_task(reason) function (where the scheduler might have to decide which other task to switch to) and an unblock_task(taskID) function (where, if the scheduler decides the unblocked task should preempt a currently running task, it already knows which task it wants to switch to). These functions may call an even lower level function to do the actual context switch (e.g. a switch_to_task(taskID)), where that lower level function may (see the sketch after this list):

  • do time accounting (work out how much time has passed since last time, and use that to update statistics so that people can know things like how much CPU time each task has consumed, how much time a CPU has been idle, etc).

  • if there was a previously running task (if the CPU wasn't previously idle), change the previously running task's state from "currently running" to something else ("ready to run" or "blocked").

  • if there was a previously running task, save the previously running task's "CPU state" (register contents, etc) somewhere (e.g. in some kind of structure).

  • change the state of the next task to "currently running" (regardless of what the next task's state was previously).

  • load the next task's "CPU state" (register contents, etc) from somewhere.
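Putting those steps together, here is a hedged sketch of what such a switch_to_task() might look like (every name below is illustrative and builds on the earlier struct task sketch; this is not Linux's actual code):

    #include <stdint.h>

    /* Assumed helpers for this sketch. */
    struct task *current_task(void);
    uint64_t read_monotonic_time(void);
    void account_cpu_time(struct task *t, uint64_t delta);   /* t == NULL means "idle time"   */
    void save_cpu_state(void **where);
    void load_cpu_state(void **where);                       /* does not return to the caller */
    void set_current_task(struct task *t);

    static uint64_t last_switch_time;

    void switch_to_task(struct task *next)
    {
        uint64_t now = read_monotonic_time();
        struct task *prev = current_task();             /* NULL if this CPU was idle         */

        account_cpu_time(prev, now - last_switch_time); /* time accounting / statistics      */
        last_switch_time = now;

        if (prev) {
            /* In this sketch the caller (block_current_task(), unblock_task(), ...)
             * has already moved prev to "ready to run" or "blocked".                 */
            save_cpu_state(&prev->saved_cpu_state);     /* register contents, etc.           */
        }

        next->state = TASK_RUNNING;                     /* regardless of its previous state  */
        set_current_task(next);
        load_cpu_state(&next->saved_cpu_state);         /* resumes execution of 'next'       */
    }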

How can the short-term scheduler process run without a context switch happening for it?

The scheduler is just a group of functions in the kernel (and not a process).

What context does the scheduler code run in?

schedule() always runs in process context. When it is triggered by a timer interrupt, schedule() is called on the return path from the kernel back to the interrupted process.
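A heavily simplified sketch of that flow (these helper names are illustrative, not the real Linux entry code):

    /* Assumed helpers for this sketch. */
    void update_accounting_for_current_task(void);
    int  timeslice_expired(void);
    void set_need_resched(void);
    int  need_resched(void);
    void schedule(void);

    /* The timer handler itself does not switch tasks; it only requests one. */
    void timer_interrupt_handler(void)
    {
        update_accounting_for_current_task();
        if (timeslice_expired())
            set_need_resched();        /* "please reschedule on the way out" */
    }

    /* On the path back from the kernel to the interrupted process... */
    void return_to_user_mode(void)
    {
        if (need_resched())
            schedule();                /* runs in the interrupted process's context */
        /* ...then restore the user-mode registers and return to user mode. */
    }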

Context switch internals

At a high level, there are two separate mechanisms to understand. The first is the kernel entry/exit mechanism: this switches a single running thread from running usermode code to running kernel code in the context of that thread, and back again. The second is the context switch mechanism itself, which switches in kernel mode from running in the context of one thread to another.

So, when Thread A calls sched_yield() and is replaced by Thread B, what happens is (a rough sketch of the kernel side follows the list):

  1. Thread A enters the kernel, changing from user mode to kernel mode;
  2. Thread A in the kernel context-switches to Thread B in the kernel;
  3. Thread B exits the kernel, changing from kernel mode back to user mode.
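A rough sketch of the kernel side of step 2, using the hypothetical task types sketched earlier (do_yield() is an illustrative stand-in for the sched_yield() syscall handler, not the real implementation):

    /* Assumed helpers for this sketch. */
    struct task *current_task(void);
    void requeue_at_tail(struct task *t);    /* back of its own priority queue */
    void schedule(void);

    /* By the time this runs, step 1 (the kernel entry mechanism) has already
     * saved Thread A's user-mode registers on its kernel stack. */
    long do_yield(void)
    {
        struct task *prev = current_task();

        prev->state = TASK_READY;    /* still runnable, just giving up the CPU */
        requeue_at_tail(prev);
        schedule();                  /* step 2: may context-switch to Thread B */

        return 0;                    /* resumes here when Thread A is scheduled
                                      * again; step 3 (kernel exit) follows     */
    }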

Each user thread has both a user-mode stack and a kernel-mode stack. When a thread enters the kernel, the current values of the user-mode stack pointer (SS:ESP) and instruction pointer (CS:EIP) are saved to the thread's kernel-mode stack, and the CPU switches to the kernel-mode stack - with the int $0x80 syscall mechanism, this is done by the CPU itself. The remaining register values and flags are then also saved to the kernel stack.

When a thread returns from the kernel to user-mode, the register values and flags are popped from the kernel-mode stack, then the user-mode stack and instruction pointer values are restored from the saved values on the kernel-mode stack.
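A sketch of what that saved state might look like on 32-bit x86 (loosely modelled on Linux's struct pt_regs; the exact field names and layout here are illustrative):

    #include <stdint.h>

    struct saved_user_frame {
        /* Pushed by the kernel entry code: */
        uint32_t ebx, ecx, edx, esi, edi, ebp, eax;
        uint32_t ds, es, fs, gs;
        /* Pushed by the CPU itself when taking int $0x80 / an interrupt
         * from user mode: */
        uint32_t eip;      /* user-mode instruction pointer (with CS below) */
        uint32_t cs;
        uint32_t eflags;
        uint32_t esp;      /* user-mode stack pointer (with SS below)       */
        uint32_t ss;
    };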

When a thread context-switches, it calls into the scheduler (the scheduler does not run as a separate thread - it always runs in the context of the current thread). The scheduler code selects a process to run next, and calls the switch_to() function. This function essentially just switches the kernel stacks - it saves the current value of the stack pointer into the TCB for the current thread (called struct task_struct in Linux), and loads a previously-saved stack pointer from the TCB for the next thread. At this point it also saves and restores some other thread state that isn't usually used by the kernel - things like floating point/SSE registers. If the threads being switched don't share the same virtual memory space (ie. they're in different processes), the page tables are also switched.
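In sketch form (illustrative names; the real switch_to() is architecture-specific assembly):

    /* Hypothetical thread control block; Linux's real one is struct task_struct. */
    struct address_space;                   /* stands in for the page tables       */

    struct tcb {
        void                 *kernel_sp;    /* saved kernel-mode stack pointer     */
        struct address_space *mm;           /* virtual memory space of the process */
        /* FPU/SSE state, etc. would also be kept here                             */
    };

    /* Assumed helpers for this sketch. */
    void load_page_tables(struct address_space *mm);   /* e.g. write CR3 on x86    */
    void save_fpu_state(struct tcb *t);
    void restore_fpu_state(struct tcb *t);
    void arch_swap_kernel_stack(void **prev_sp, void **next_sp);

    void context_switch(struct tcb *prev, struct tcb *next)
    {
        if (prev->mm != next->mm)           /* different process?                  */
            load_page_tables(next->mm);

        save_fpu_state(prev);
        restore_fpu_state(next);

        /* The stack swap itself: store the current kernel stack pointer into
         * prev, load next's saved one.  From here on we are running on next's
         * kernel stack, and "returning" unwinds next's saved call chain.      */
        arch_swap_kernel_stack(&prev->kernel_sp, &next->kernel_sp);
    }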

So you can see that the core user-mode state of a thread isn't saved and restored at context-switch time - it's saved and restored to the thread's kernel stack when you enter and leave the kernel. The context-switch code doesn't have to worry about clobbering the user-mode register values - those are already safely saved away in the kernel stack by that point.


