Linux Kernel: What Process Does Schedule() Run In

Linux kernel: What process does schedule() run in?

schedule() is always running in process context. The special part about it is that it can change which process context is current - but it does always have a process context. Prior to the call to context_switch() it runs in the context of the process to be swapped out, and after it runs in the process swapped in.

The Linux kernel does not have a dedicated "swapper" task (there is an idle task, which is always runnable in case nothing else is eligible to run).

Is the scheduler built into the kernel a program or a process?

You have 2 similar questions (The opinion that the scheduler built into the kernel is the program and the opinion that it is the process and I want to know how to implement the cpu scheduling process in Linux operating system) so I'll answer for both of these here.

The answer is that it doesn't work that way at all. The scheduler is not called by user mode processes by using system calls. The scheduler isn't a system call. There are timers that are programmed to throw interrupts after some time has elapsed. Timers are accessed using registers that are memory in RAM often called memory mapped IO (MMIO). You write to some position in RAM specified by the ACPI tables (https://wiki.osdev.org/ACPI) and it will allow to control the chips in the CPU or external PCI devices (PCI is everything nowadays).

When the timer reaches 0, it will trigger an interrupt. Interrupts are thrown by hardware (the CPU). The CPU thus includes special mechanism to let the OS determine the position at which it will jump on interrupt (https://wiki.osdev.org/Interrupt_Descriptor_Table). Interrupts are used by the CPU to notify the OS that an event happened. Without interrupts, the OS would have to reserve at least one core of the processor for a special kernel process that would constantly poll the registers of peripherals and other things. It would be impossible to implement. Also, if user mode processes did the scheduler system call by themselves, the kernel would be slave to user mode because it wouldn't be able to tell if a process is finished and processes could be selfish over CPU time.

I didn't look at the source code but I think the scheduler is also often called on some IO completion (also on interrupt but not always on timer interrupt). I am quite sure that the scheduler must not be preempted. That is interrupts (and other things) will be disabled while the schedule() function runs.

I don't think you can call the scheduler a process (not even a kernel thread). The scheduler can be called by kernel threads that are created by interrupts due to bottom half processing. In bottom half processing, the top "half" of the interrupt handler runs fast and efficiently while the bottom "half" is added to the queue of processes and runs when the scheduler decides it should be scheduled. This has the effect of creating some kernel threads. The scheduler can thus be called from kernel threads but not always from bottom half of interrupts. There has to be a mechanism to call the scheduler without the scheduler having to schedule the task itself. Otherwise, the kernel will stop functioning.

What context does the scheduler code run in?

schedule() always runs in process context. In the second case, when it is initiated by a timer interrupt, it is in the return path back from the kernel to the interrupted process where schedule() is called.

Working of scheduler

The scheduler is a program, yes, but very rarely is it a process. Rather the schedule is part of the Kernel, or the program that abstracts processes from the hardware (including the processor usage).

In a preemptive scheduler, Since the scheduler is part of the kernel, it actually exists in the address space of every single process. When a process's alloted time is up, the scheduler takes control of program execution and then does the necessary work to move to the next process. When the schedule does this, however, it does not remove itself from the new process's address space so that when the new process's time is up, it can saftely perform the work needed to move on.

While there have been kernels whose functions were often offloaded into other processes (CMU Mach), there will always be a part of the kernel that retains functionality for changing processes, and this will never be exclusively in its own process.

For more information on how scheduling works, I find the following articles helpful:

http://wiki.osdev.org/Context_Switching

http://wiki.osdev.org/Scheduling_Algorithms

http://wiki.osdev.org/Processes_and_Threads

How does OS chooses the next process to be run in CPU?

If the process X spent its quanta, the scheduler will be called. If there is other process of higher or same dynamic priority, then context switch will occur. If there is no other process, and current process still ready to run (no blocked), it will be resumed without context switch. There are two queues, active and expired at every priority; so if we have several processes of same priority, they will run one after another.

Check https://criticalblue.com/news/wp-content/uploads/2013/12/linux_scheduler_notes_final.pdf Process Scheduling in Linux - Critical Blue. Volker Seeker – University of Edinburgh 05.12.2013 (linux 3.1)

process scheduler .. has the following tasks:

• share CPU equally among all currently running processes

• pick appropriate process to run next if required, considering scheduling class/policy and process priorities

• balance processes between multiple cores in SMP systems

struct task_struct * (*pick_next_task) (struct rq *rq);

5.1 The Scheduler Entry Point .. Its main goal is to find the next task to be run and assign it to the local variable next.
  next = pick_next_task(rq);
     ...
  if (likely(prev != next)) {
        rq->nr_switches++;

 /*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq)
{
    const struct sched_class *class;
    struct task_struct *p;
    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    if (likely(rq->nr_running == rq->cfs.nr_running)) {
        p = fair_sched_class.pick_next_task(rq);
        if (likely(p))
            return p;
    }
    for_each_class(class) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
    }
    BUG(); /* the idle class will always have a runnable task */
}
pick_next_task() is also implemented in sched.c. It iterates through our list of scheduling classes to find the class with the highest priority that has a runnable task (see Scheduling Classes above). If the class is found, the scheduling class hook is called. Since most tasks are handled by the sched_fair class, a short cut to this class is implemented in the beginning of the function.

Now schedule() checks if pick_next_task() found a new task or if it picked the same task again that was running before. If the latter is the case, no task switch is performed and the current task just keeps running. If a new task is found, which is the more likely case, the actual task switch is executed by calling context_switch(). Internally, context_switch() switches to the new task's memory map and swaps register state and stack.

And also https://www.cs.columbia.edu/~smb/classes/s06-4118/l13.pdf Linux Scheduler, CS CU, COMS W4118,

Have a separate run queue for each processor.
Each processor only selects processes from its own queue to run. ..
Periodically, the queues are rebalanced

Basic Scheduling Algorithm: Find the highest-priority queue with a runnable
process; Find the first process on that queue; Calculate its quantum size; Let it run; When its time is up, put it on the expired list; Repeat.

The Run Queue:

140 separate queues, one for each priority level

Actually, that number can be changed at a given site

Actually, two sets, active and expired

Priorities 0-99 for real-time processes

Priorities 100-139 for normal processes; value set via nice() system call

Using Quanta (13 / 40):

At every time tick, decrease the quantum of the current running process

If the time goes to zero, the process is done

If the process is non-interactive, put it aside on the expired list

If the process is interactive, put it at the end of the current priority queue

If there’s nothing else at that priority, it will run again immediately

Of course, by running so much is bonus will go down, and so will its priority and its interactive status

Avoiding Indefinite Overtaking:

There are two sets of 140 queues, active and expired

The system only runs processes from active queues, and puts them on expired queues when they use up their quanta

When a priority level of the active queue is empty, the scheduler looks for the next-highest priority queue

After running all of the active queues, the active and expired queues are swapped

There are pointers to the current arrays; at the end of a cycle, the pointers are switched

Linux Kernel: What Process Does Schedule() Run In