How Does the OS Choose the Next Process to Be Run on the CPU

How does the OS choose the next process to be run on the CPU?

If process X has spent its quantum, the scheduler is called. If there is another process with a higher or equal dynamic priority, a context switch occurs. If there is no other such process, and the current process is still ready to run (not blocked), it is resumed without a context switch. There are two queues, active and expired, at every priority level, so if we have several processes of the same priority, they will run one after another.

See "Process Scheduling in Linux" (Critical Blue), Volker Seeker, University of Edinburgh, 05.12.2013 (Linux 3.1): https://criticalblue.com/news/wp-content/uploads/2013/12/linux_scheduler_notes_final.pdf

The process scheduler .. has the following tasks:

  • share CPU equally among all currently running processes
  • pick appropriate process to run next if required, considering scheduling class/policy and process priorities
  • balance processes between multiple cores in SMP systems

    struct task_struct * (*pick_next_task) (struct rq *rq);
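
The pick_next_task hook is one of several function pointers in struct sched_class; each scheduling class (fair, real-time, idle, ...) supplies its own implementation, and the classes are kept in a priority-ordered list. Below is a trimmed-down sketch, loosely based on the kernel ~3.x definition, with most hooks omitted:

    /* Abridged sketch of struct sched_class (kernel ~3.x); most hooks omitted. */
    struct sched_class {
        const struct sched_class *next;  /* classes form a priority-ordered list */

        void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
        void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);

        struct task_struct * (*pick_next_task) (struct rq *rq);
        void (*put_prev_task) (struct rq *rq, struct task_struct *p);

        void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
    };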


5.1 The Scheduler Entry Point .. Its main goal is to find the next task to be run and assign it to the local variable next.

    next = pick_next_task(rq);
    ...
    if (likely(prev != next)) {
        rq->nr_switches++;

    /*
     * Pick up the highest-prio task:
     */
    static inline struct task_struct *
    pick_next_task(struct rq *rq)
    {
        const struct sched_class *class;
        struct task_struct *p;

        /*
         * Optimization: we know that if all tasks are in
         * the fair class we can call that function directly:
         */
        if (likely(rq->nr_running == rq->cfs.nr_running)) {
            p = fair_sched_class.pick_next_task(rq);
            if (likely(p))
                return p;
        }

        for_each_class(class) {
            p = class->pick_next_task(rq);
            if (p)
                return p;
        }

        BUG(); /* the idle class will always have a runnable task */
    }

pick_next_task() is also implemented in sched.c. It iterates through our list of scheduling classes to find the class with the highest priority that has a runnable task (see Scheduling Classes above). Once such a class is found, its scheduling-class hook is called. Since most tasks are handled by the sched_fair class, a shortcut to this class is implemented at the beginning of the function.

Now schedule() checks if pick_next_task() found a new task or if it picked the same task again that was running before. If the latter is the case, no task switch is performed and the current task just keeps running. If a new task is found, which is the more likely case, the actual task switch is executed by calling context_switch(). Internally, context_switch() switches to the new task's memory map and swaps register state and stack.
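
A heavily simplified sketch of what context_switch() boils down to, loosely based on the 3.x sources (the lazy-TLB and kernel-thread cases, barriers and accounting are omitted):

    /* Greatly simplified sketch of context_switch(); many details omitted. */
    static inline void
    context_switch(struct rq *rq, struct task_struct *prev,
                   struct task_struct *next)
    {
        /* switch to the new task's memory map ... */
        switch_mm(prev->active_mm, next->mm, next);

        /* ... then swap register state and stack; execution continues in next */
        switch_to(prev, next, prev);
    }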

See also "Linux Scheduler", COMS W4118, Columbia University CS: https://www.cs.columbia.edu/~smb/classes/s06-4118/l13.pdf

Have a separate run queue for each processor.
Each processor only selects processes from its own queue to run. ..
Periodically, the queues are rebalanced

Basic Scheduling Algorithm: Find the highest-priority queue with a runnable
process; Find the first process on that queue; Calculate its quantum size; Let it run; When its time is up, put it on the expired list; Repeat.

The Run Queue (a rough C sketch follows the list):

  • 140 separate queues, one for each priority level
  • Actually, that number can be changed at a given site
  • Actually, two sets, active and expired
  • Priorities 0-99 for real-time processes
  • Priorities 100-139 for normal processes; value set via nice() system call
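
The layout described above (the pre-CFS O(1) scheduler) can be sketched roughly like this; names loosely follow the 2.6 sources, and the code is simplified rather than meant to compile stand-alone:

    #define MAX_PRIO 140    /* 0-99 real-time, 100-139 normal */

    /* One priority array: a bitmap plus one FIFO queue per priority level. */
    struct prio_array {
        unsigned int     nr_active;                        /* runnable tasks in this array */
        unsigned long    bitmap[BITS_TO_LONGS(MAX_PRIO)];  /* bit set => queue non-empty   */
        struct list_head queue[MAX_PRIO];
    };

    /* Each CPU has its own run queue with an active and an expired array. */
    struct runqueue {
        struct prio_array *active;     /* tasks that still have quantum left     */
        struct prio_array *expired;    /* tasks that have used up their quantum  */
        struct prio_array  arrays[2];
    };

    /* "Find the highest-priority queue with a runnable process": the index of
     * the first set bit is the highest priority (0 is the highest). */
    static struct task_struct *pick_first(struct prio_array *array)
    {
        int idx = sched_find_first_bit(array->bitmap);
        return list_entry(array->queue[idx].next, struct task_struct, run_list);
    }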

Using Quanta:

  • At every time tick, decrease the quantum of the current running process
  • If the time goes to zero, the process is done
  • If the process is non-interactive, put it aside on the expired list
  • If the process is interactive, put it at the end of the current priority queue
  • If there’s nothing else at that priority, it will run again immediately
  • Of course, by running so much its bonus will go down, and so will its priority and its interactive status (a sketch of this per-tick handling follows below)
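
The per-tick bookkeeping in the list above might be sketched as follows; every type and helper name here is hypothetical, this is not the literal kernel code:

    /* Hypothetical sketch of the per-tick quantum handling described above. */
    struct task {
        int time_slice;     /* remaining ticks in the current quantum       */
        int interactive;    /* interactivity status derived from the bonus  */
    };

    int  recompute_timeslice(struct task *p);  /* new quantum from static priority  */
    void requeue_active(struct task *p);       /* end of the current priority queue */
    void move_to_expired(struct task *p);      /* put aside on the expired array    */
    void request_reschedule(void);             /* pick a new task on the way back   */

    void on_timer_tick(struct task *p)
    {
        if (--p->time_slice > 0)
            return;                            /* quantum not used up yet */

        p->time_slice = recompute_timeslice(p);

        if (p->interactive)
            requeue_active(p);                 /* may run again immediately if alone  */
        else
            move_to_expired(p);                /* waits until the arrays are swapped  */

        request_reschedule();
    }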

Avoiding Indefinite Overtaking:

  • There are two sets of 140 queues, active and expired
  • The system only runs processes from active queues, and puts them on expired queues when they use up their quanta
  • When a priority level of the active queue is empty, the scheduler looks for the next-highest priority queue
  • After running all of the active queues, the active and expired queues are swapped
  • There are pointers to the current arrays; at the end of a cycle, the pointers are switched (sketched below)
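
Reusing the hypothetical struct runqueue from the run-queue sketch above, the swap itself is just a pointer exchange, roughly:

    /* When the active array has run dry, swap the active and expired pointers.
     * O(1): no tasks are moved, only two pointers are exchanged. */
    if (rq->active->nr_active == 0) {
        struct prio_array *tmp = rq->active;

        rq->active  = rq->expired;
        rq->expired = tmp;
    }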

What does the kernel do while another process is running?

Yeah.. you should stop thinking of the OS kernel as a process and think of it instead as just code and data - a state machine that processes/threads call into in order to obtain specific services at one end (e.g. I/O requests), and that drivers call into at the other end to provide service solutions (e.g. I/O completion).

The kernel does not need any threads of execution in itself. It only runs when entered from syscalls (interrupt-like calls from running user threads/processes) or drivers (hardware interrupts from disk/NIC/keyboard/mouse etc. hardware). Sometimes, such calls will change the set of threads running on the available cores (e.g. if a thread waiting for a network buffer becomes ready because the NIC driver has completed the action, the OS will probably try to assign it to a core 'immediately', preempting some other thread if required).

If there are no syscalls, and no hardware interrupts, the kernel does nothing because it is not entered - there is nothing for it to do.

How does the scheduler of an operating system regain control from a process?

I would like to know: if we have a single-core CPU, and let's say that for a long time there are only CPU-intensive processes (no I/O requests), how does the scheduler regain control?

There are multiple possibilities:

a) It's a cooperative scheduler and gets control when the currently running task voluntarily or accidentally gives the scheduler control via a kernel API function (which might be like yield() but could be anything that causes the currently running task to block - e.g. read()) or an exception (e.g. trying to access data that the kernel sent to swap space, causing a page fault where the page fault handler blocks the task until the data it needs is fetched from swap space). This can include the task crashing.

b) It's a preemptive scheduler that uses hardware (e.g. a timer) to ensure that the kernel will gain control (and pass control to the scheduler). Note that it might or might not be a timer (e.g. it could be a counter that counts the number of instructions executed, which has advantages for modern systems where CPU speed varies due to power management).

c) It's a "less cooperative/semi-preemptive" scheduler that opportunistically checks whether a task switch should be done any time anything causes the kernel to gain control, but doesn't explicitly use any hardware to ensure that the kernel will gain control (e.g. so that things that seem unrelated to scheduling, like freeing memory, can cause a task switch).

d) It's a combination of the last two options - a preemptive scheduler that uses hardware to ensure that the kernel will gain control, and that (whenever the kernel has control for any reason) opportunistically checks whether a task switch can be done a little early to avoid a relatively expensive IRQ that would've occurred soon.

I have read some stuff about timer interrupts; I would like to know how the operating system is able to set this timer.

"The operating system" is a huge amount of stuff (e.g. includes things like data files for a help system and graphics for icons and ...). Typically there is a kernel which is able to do anything it likes with no restrictions; including accessing timer hardware directly.

The exact details of how a kernel would set a timer depend on which kind of timer it is. Note that there may be different types of timer to choose from (e.g. an 80x86 PC might have a PIT chip, an RTC chip, an HPET, and a local APIC timer built into each CPU; some are configured via I/O ports, some via memory-mapped registers, and one may be configured via special registers/MSRs built into the CPU; and each type of timer has different frequencies, precision, accuracy, capabilities, etc.).
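
As a concrete example, programming the legacy PIT on an 80x86 PC for a periodic tick comes down to a few I/O port writes. A sketch, assuming a kernel-provided outb(value, port) helper and permission to access I/O ports:

    #define PIT_INPUT_HZ 1193182u    /* the PIT's fixed input clock */

    /* Program PIT channel 0 for a periodic interrupt at roughly `hz` Hz. */
    static void pit_set_periodic(unsigned int hz)
    {
        unsigned int divisor = PIT_INPUT_HZ / hz;

        outb(0x36, 0x43);                   /* command: channel 0, lo/hi byte, mode 3 */
        outb(divisor & 0xff, 0x40);         /* divisor, low byte  */
        outb((divisor >> 8) & 0xff, 0x40);  /* divisor, high byte */
    }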

How does the operating system handle its responsibilities while a process is executing?

The question is mixing up who's in control of the memory and who's in control of the CPU. The wording “running” is imprecise: on a single CPU, a single task is running at any given time in the sense that the processor is executing its instructions; but many tasks are executing in the sense that their state is stored in memory and their execution can resume at any time.

While a process is executing on the CPU, the kernel is not executing. Its state is saved in memory. The execution of the kernel can resume:

  • if the process code makes a jump into kernel code — this is called a system call.
  • if an interrupt occurs.

If the operating system provides preemptive multitasking, it will schedule an interrupt to happen after an interval of time (called a time slice). On a non-preemptive operating system, the process will run forever if it doesn't yield the CPU. See What mechanisms prevent a process from taking over the processor forever? for an explanation of how preemption works.

Tasks such as process management and device management are triggered by some event. If the event is a request by the process, the request will take the form of a system call, which executes kernel code. If the event is triggered from hardware, it will take the form of an interrupt, which executes kernel code.

(Note: in this answer, I use “CPU” and “processor” synonymously, to mean a single execution thread: a single core, or whatever the hardware architecture is.)

How does a program execute? Where does the Operating System come into play?

For starters, a modern CPU has (at least) two modes, a mode in which it's running the core of the Operating System itself ("kernel mode") and a mode in which it's running programs ("user mode"). When in user mode, the CPU can't do a whole lot of things.

For instance, a mouse click is typically noticed in the kernel, not user mode. However, the OS dispatches the event to user mode and from there to the correct program. The other way around also requires cooperation: a program can't draw to the screen freely, but needs to go through the OS and kernel mode to draw on its part.

Similarly, the act of starting a program is typically a cooperation. The shell part of the OS is a user-mode program too. It gets your mouse click, and determines that it's a mouse click intended to start a process. The shell then tells the kernel-mode part of the OS to start a new process for that program.

When the kernel mode needs to start a new process, it first allocates memory for bookkeeping, and then proceeds to load the program. This involves retrieving the instructions from the binary, but also hooking up the program to the OS. This usually requires finding the entry point (classically int main(int argc, char** argv)) of the binary, and all points where the program wants to call the OS.

Different Operating Systems use different ways to hook up programs with the OS. As a result, the loading process differs, and the file formats for binaries can differ too. It's not absolute; the ELF format for binaries is used by a number of Operating Systems, and Microsoft uses its PE format on all its current Operating Systems. In both cases, the format describes the precise layout of the binary, so the OS can decide whether the program can be hooked up to the OS. For instance, if it's a Win32 binary, it will be in the PE format; therefore Linux won't load it, Windows 2000 will, as will Windows 7-64. A Win64 binary, on the other hand, is in PE format too, but Windows 2000 will reject it.
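
For illustration, just deciding that a file is an ELF binary and finding its entry point only takes reading the header; here is a small user-space sketch (the kernel's real loader does far more - program headers, mapping, dynamic linking, ...):

    #include <elf.h>
    #include <stdio.h>
    #include <string.h>

    /* Return the ELF64 entry point of `path`, or 0 if it isn't a readable ELF file. */
    unsigned long elf_entry_point(const char *path)
    {
        Elf64_Ehdr hdr;
        unsigned long entry = 0;
        FILE *f = fopen(path, "rb");

        if (!f)
            return 0;
        if (fread(&hdr, sizeof hdr, 1, f) == 1 &&
            memcmp(hdr.e_ident, ELFMAG, SELFMAG) == 0)
            entry = hdr.e_entry;   /* virtual address of the first instruction to run */
        fclose(f);
        return entry;
    }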

Is a context switch needed for the short-term scheduler to run?

Let's start by assuming a task has a state that is one of the following (a minimal representation is sketched after the list):

  • "currently running". If there are 8 CPUs then a maximum of 8 tasks can be currently running on a CPU at the same time.

  • "ready to run". If there are 20 tasks and 8 CPUs, then there may be 12 tasks that are ready to run on a CPU.

  • "blocked". This is waiting for IO (disk, network, keyboard, ...), waiting to acquire a mutex, waiting for time to pass (e.g. sleep()), etc. Note that this includes things the task isn't aware of (e.g. fetching data from swap space because the task tried to access data that isn't actually in memory).
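
Such states might be represented minimally like this (purely illustrative):

    /* Illustrative task states matching the list above. */
    enum task_state {
        CURRENTLY_RUNNING,   /* owns a CPU right now                    */
        READY_TO_RUN,        /* runnable, waiting for a CPU             */
        BLOCKED              /* waiting for I/O, a mutex, a timeout, ...*/
    };

    struct task {
        enum task_state state;
        /* saved CPU state, priority, accounting, ... */
    };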

Sometimes a task will do something (call a kernel function like read(), sleep(), pthread_mutex_lock(), etc; or access data that isn't in memory) that causes the task to switch from the "currently running" state to the "blocked" state. When this happens some other part of the kernel (e.g. the virtual file system layer, virtual memory management, ...) will tell the scheduler that the currently running task has blocked (and needs to be put into the "blocked" state); and the scheduler will have to find something else for the CPU to do, which will be either finding another task for the CPU to run (and switching the other task from "ready to run" to "currently running") or putting the CPU into a power saving state (because there's no tasks for the CPU to run).

Sometimes something that a task was waiting for occurs (e.g. the user presses a key, a mutex is released, data arrives from swap space, etc). When this happens some other part of the kernel (e.g. the virtual file system layer, virtual memory management, ...) will tell the scheduler that the task needs to leave the "blocked" state. When this happens the scheduler has to decide if the task will go from "blocked" to "ready to run" (and tasks that were using CPUs will continue using CPUs), or if the task will go from "blocked" to "currently running" (which will either cause a currently running task to be preempted and go from "currently running" to "ready to run", or will cause a previously idle CPU to be taken out of a power saving state). Note that in a well designed OS this decision will depend on things like task priorities (e.g. if a high-priority task unblocks it preempts a low-priority task, but if a low-priority task unblocks then it doesn't preempt a high-priority task).

On modern systems these 2 things (tasks entering and leaving the "blocked" state) are responsible for most task switches.

Other things that can cause task switches are:

  • a task terminates itself or crashes. This is mostly the same as a task blocking (some other part of the kernel informs the scheduler and the scheduler has to find something else for the CPU to do).

  • a new task is created. This is mostly the same as a task unblocking (some other part of the kernel informs the scheduler and the scheduler decides if the new task will preempt a currently running task or cause a CPU to be taken out of a power saving state).

  • the scheduler is frequently switching between 2 or more tasks to create the illusion that they're all running at the same time (time multiplexing). On a well designed modern system this only ever happens when there are more tasks at the same priority than there are available CPUs and those tasks don't block often enough; which is extremely rare. In some cases (e.g. the "earliest deadline first" scheduling algorithm in a real-time system) this might be impossible.

My understanding is that the short-term scheduler is a module in the kernel (a process in itself, I guess?)

The scheduler is typically implemented as a set of functions that other parts of the kernel call - e.g. maybe a block_current_task(reason) function (where the scheduler might have to decide which other task to switch to), and an unblock_task(taskID) function (where, if the scheduler decides the unblocked task should preempt a currently running task, it already knows which task it wants to switch to). These functions may call an even lower-level function to do an actual context switch (e.g. a switch_to_task(taskID)), where that lower-level function may (a rough sketch follows the list):

  • do time accounting (work out how much time has passed since last time, and use that to update statistics so that people can know things like how much CPU time each task has consumed, how much time a CPU has been idle, etc).

  • if there was a previously running task (if the CPU wasn't previously idle), change the previously running task's state from "currently running" to something else ("ready to run" or "blocked").

  • if there was a previously running task, save the previously running task's "CPU state" (register contents, etc) somewhere (e.g. in a some kind of structure).

  • change the state of the next task to "currently running" (regardless of what the next task's state was previously).

  • load the next task's "CPU state" (register contents, etc) from somewhere.
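
Putting those bullet points together, such a low-level switch_to_task() might look roughly like the sketch below; every name in it (struct task, save_cpu_state(), load_cpu_state(), ...) is hypothetical, not a real kernel API:

    /* Illustrative sketch of the low-level task switch described above. */
    enum task_state { CURRENTLY_RUNNING, READY_TO_RUN, BLOCKED };

    struct cpu_state { unsigned long regs[32]; /* register contents, stack pointer, ... */ };

    struct task {
        enum task_state  state;
        struct cpu_state regs;
        unsigned long    cpu_time_used;   /* for time accounting */
    };

    /* Hypothetical helpers assumed to exist elsewhere in this sketch. */
    void update_time_accounting(struct task *prev);
    void save_cpu_state(struct cpu_state *out);
    void load_cpu_state(const struct cpu_state *in);

    void switch_to_task(struct task *prev, struct task *next, enum task_state prev_state)
    {
        update_time_accounting(prev);       /* attribute elapsed time (idle time if prev is NULL) */

        if (prev) {                         /* the CPU was not idle before        */
            prev->state = prev_state;       /* READY_TO_RUN or BLOCKED            */
            save_cpu_state(&prev->regs);    /* stash registers, stack pointer, ...*/
        }

        next->state = CURRENTLY_RUNNING;    /* regardless of its previous state   */
        load_cpu_state(&next->regs);        /* restore registers; execution resumes in next's context */
    }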

How can the short-term scheduler process run without a context switch happening for it?

The scheduler is just a group of functions in the kernel (and not a process).

Is a schedulable unit of CPU time slice a process or a thread?

Clarification: my understanding of "a schedulable unit of CPU time slice" is "a unit that can be scheduled during a given CPU time slice" (since if a "schedulable unit" were a time, the question would not make much sense to me).

Based on this, to put it shortly, "a schedulable unit of CPU time slice" for a given logical core can be seen as a software thread (more specifically, its execution context, composed of registers and process information).


Operating system schedulers operate on tasks. Tasks can be threads, processes, or other more unusual structures (e.g. dataflows).

Modern mainstream operating systems mainly schedule threads on processing units (typically hardware threads, also called logical cores). You can get more information about how the Windows scheduler works in the Microsoft documentation. The documentation explicitly states:

A thread is the entity within a process that can be scheduled for execution

On Linux, the default scheduler, CFS, operates on tasks (i.e. the task_struct data structure). A task can be a thread, a group of threads, or a process. This was done to make the scheduler more generic, and also because this scheduler was designed long ago, when processors had only one core and people focused on processes rather than threads. The multi-core era has since caused applications to use a lot of threads in order to use the available cores. As a result, nowadays, it is generally threads that are actually scheduled, AFAIK. This is explained in the famous research paper The Linux Scheduler: a Decade of Wasted Cores (which also explains a bit how CFS operates regarding the target processor).

Note that the term "process" can sometimes refer to a thread, since threads are sometimes called "lightweight processes" and basic processes are sometimes called "heavy processes". "Process" can even be a generic term for both heavy and lightweight processes (i.e. threads and actual processes). This is very confusing terminology and a misuse of language (like the term "processors", sometimes used for cores). In practice, this is often not a problem in a specific context where threads and processes can be used interchangeably (when it is ambiguous, people should use a generic term like "tasks").

As for "a schedulable unit of CPU time slice", this is a bit more complex. A simple and naive answer is: a thread (it is definitely not processes alone). That being said, a thread is a software-defined concept (like a process). It is basically a stack, a few registers, and a parent process (with possibly some meta-information and a TLS space). CPUs do not operate directly on such data structures. A CPU has no concept of a thread stack, for example (it is just a section of the virtual process memory like any other). It just needs an execution context, which is composed of registers and a process configuration (in protected mode). For the sake of simplicity, we can say that CPUs execute threads. Mainstream modern x86 processors are very complex, and each core is often able to run multiple threads at the same time. This is called simultaneous multithreading (aka Hyper-Threading on Intel processors). x86 physical cores are typically composed of two hardware threads (i.e. logical cores) that can each execute a software thread.


