Linux Kernel Changing Default CPU Scheduler

This post is a little dated, but I hope it can still help...
I had a similar problem and implemented a hack in the Linux kernel to make RR the default CPU scheduling policy. The hack essentially changes the sched_fork function, as pointed out in previous comments. Here is the code I wrote to implement that: https://aelseb.wordpress.com/2016/01/06/change-linux-cpu-default-scheduler/
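
For comparison, the unpatched way to get RR for a single process is to request it from userspace with sched_setscheduler(2); anything it forks or execs then starts out with the same policy. A minimal sketch, with an arbitrarily chosen RT priority; it needs root or CAP_SYS_NICE:

/* rr_self.c - put the calling process under SCHED_RR, then exec ps to show it.
   Build: gcc rr_self.c -o rr_self; run as root or with CAP_SYS_NICE. */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 10 };   /* arbitrary RT priority */

    if (sched_setscheduler(0, SCHED_RR, &sp) == -1) {   /* 0 = this process */
        perror("sched_setscheduler");
        return 1;
    }
    /* Anything exec'ed or forked from here starts out as SCHED_RR too. */
    execlp("ps", "ps", "-c", (char *)NULL);
    perror("execlp");
    return 1;
}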

Linux default scheduler alternatives

By all means, please tell me this is just an experiment :) I really can't see anything good coming out of trying such heresy.

That said, here's my try at it. First of all, we need to get a sandbox. I've used User Mode Linux (UML). For the kernel, I grabbed a random 4.10.0-rc1 repo, but any version would do. For the rootfs, the UML page provides a bunch of them here (pitfall: not all of them work fine).

Steps to build the kernel are quite short:

export ARCH=um
make x86_64_defconfig
make

If you got the Slackware rootfs, you can now run as:

 ./vmlinux ubda=./Slamd64-12.1-root_fs

OK, cool. So we have a safe place to break some kernels. Let's get to it. As you probably know, init (pid=1) is the first process and the ancestor of all other processes. The RT scheduling class is inherited across fork (unless asked otherwise with a flag; see SCHED_RESET_ON_FORK in man 7 sched). This means we can change the scheduling class of the init process and have its children inherit that class by default.

This is easily doable with something like this:

diff --git a/init/main.c b/init/main.c
index b0c9d6facef9..015f72b318ef 100644
--- a/init/main.c
+++ b/init/main.c
@@ -951,8 +951,11 @@ static inline void mark_readonly(void)
 
 static int __ref kernel_init(void *unused)
 {
+	struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
 	int ret;
 
+	/* Sit tight and fasten your seat belt! */
+	sched_setscheduler_nocheck(current, SCHED_FIFO, &param);
 	kernel_init_freeable();
 	/* need to finish all async __init code before freeing the memory */
 	async_synchronize_full();

And it works!

root@darkstar:~# sleep 60 &
[1] 549
root@darkstar:~# ps -c
  PID CLS PRI TTY          TIME CMD
  536 FF  139 tty0     00:00:00 bash
  549 FF  139 tty0     00:00:00 sleep
  550 FF  139 tty0     00:00:00 ps

(By the way, SCHED_DEADLINE cannot be inherited, as noted in man 7 sched).
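
The inheritance rule (and the SCHED_RESET_ON_FORK escape hatch) can also be observed from plain userspace, without a patched kernel. A small sketch; it needs root or CAP_SYS_NICE, and the flag's value is defined as a fallback in case older glibc headers lack it:

/* inherit.c - a fork()ed child inherits SCHED_FIFO from its parent,
   unless SCHED_RESET_ON_FORK is OR'ed into the policy. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SCHED_RESET_ON_FORK
#define SCHED_RESET_ON_FORK 0x40000000   /* value from the kernel UAPI headers */
#endif

int main(void)
{
    struct sched_param sp = { .sched_priority = 1 };

    /* Drop the flag from the next line and the child stays SCHED_FIFO (1). */
    if (sched_setscheduler(0, SCHED_FIFO | SCHED_RESET_ON_FORK, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }

    if (fork() == 0) {
        /* With the flag set, the child is reset to SCHED_OTHER (0). */
        printf("child policy:  %d\n", sched_getscheduler(0));
        _exit(0);
    }
    wait(NULL);
    printf("parent policy: %d\n", sched_getscheduler(0));
    return 0;
}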

Do kernel components like the scheduler execute on their own dedicated CPU/Core or do they share?

They share. A typical system spends so little of its time running kernel code that dedicating an entire core to it would be an absurd waste, and the scheduler itself is a tiny fraction even of that. And in cases where it does need to run a lot of kernel code, that's exactly when you want that work shared among as many cores as possible.

I'm not sure about Windows specifically, but a common OS design is that every core executes the scheduler when it's time to decide which task that core should execute next.

Force Linux to schedule processes on CPU cores that share CPU cache

Newer Linux may do this for you: Cluster-Aware Scheduling Lands In Linux 5.16 - there's support for scheduling decisions to be influenced by the fact that some cores share resources.

If you manually pick a CCX, you could give the parent and child the same affinity mask, one that allows them to be scheduled on any of the cores in that CCX.

An affinity mask can have multiple bits set.


I don't know of a way to let the kernel decide which CCX, but then schedule both tasks to cores within it. If the parent checks which core it's currently running on, it could set a mask to include all cores in the CCX containing it, assuming you have a way to detect how core #s are grouped, and a function to apply that.
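
A rough sketch of that idea. It assumes, purely for illustration, that core numbers are grouped 8 per CCX in consecutive order (0-7, 8-15, ...); on real hardware you would derive the grouping from the cache topology under /sys/devices/system/cpu/ instead of hard-coding it:

/* ccx_affinity.c - restrict this process (and any children it forks later)
   to the cores sharing an L3 with the core it is currently running on.
   ASSUMPTION: cores are numbered consecutively, 8 per CCX. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

#define CORES_PER_CCX 8   /* hypothetical; read the real topology from sysfs */

int main(void)
{
    int cpu = sched_getcpu();                    /* core we happen to be on */
    int first = (cpu / CORES_PER_CCX) * CORES_PER_CCX;
    cpu_set_t mask;

    CPU_ZERO(&mask);
    for (int c = first; c < first + CORES_PER_CCX; c++)
        CPU_SET(c, &mask);                       /* several bits set: any core in the CCX */

    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {   /* 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to cores %d-%d; fork()ed children inherit this mask\n",
           first, first + CORES_PER_CCX - 1);
    return 0;
}

The affinity mask is inherited across fork(), so the parent can set it once before spawning the child it shares memory with.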

You'd want to be careful not to end up leaving some CCXs totally unused if you start multiple processes that each do this, though. Maybe every second, do whatever top or htop do to check per-core utilization, and rebalance if one CCX is overloaded while another sits idle (i.e. change the affinity mask of both processes to the cores of a different CCX). Or put this functionality outside the processes being scheduled, so there's one "master control program" that looks at (and possibly modifies) affinity masks for a set of tasks that it should control. (Not all tasks on the system; that would be a waste of work.)

Or if it's looking at everything, it doesn't need to do so much checking of current load average, just count what's scheduled where. (And assume that tasks it doesn't know about can pick any free cores on any CCX, like daemons or the occasional compile job. Or at least compete fairly if all cores are busy with jobs it's managing.)
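
Since sched_getaffinity() and sched_setaffinity() both take a pid, such a controller can sit outside the managed processes entirely. A hedged sketch of just the moving part (the 8-cores-per-CCX grouping is again an assumption, and the load-checking logic is left out):

/* rebalance.c - external controller sketch: move a given pid onto a chosen
   (hypothetical) 8-core CCX.  Usage: ./rebalance <pid> <target_ccx> */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#define CORES_PER_CCX 8   /* assumption; derive from the real topology */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <target_ccx>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);
    int ccx = atoi(argv[2]);
    cpu_set_t mask;

    if (sched_getaffinity(pid, sizeof(mask), &mask) == 0)    /* current mask */
        printf("pid %d may currently run on %d cores\n", (int)pid, CPU_COUNT(&mask));

    CPU_ZERO(&mask);
    for (int c = ccx * CORES_PER_CCX; c < (ccx + 1) * CORES_PER_CCX; c++)
        CPU_SET(c, &mask);

    if (sched_setaffinity(pid, sizeof(mask), &mask) == -1) { /* move the task */
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}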


Obviously this is not helpful for most parent/child processes, only ones that do a lot of communication via shared memory (or maybe pipes, since kernel pipe buffers are effectively shared memory).

It is true that Zen CPUs have varying inter-core latency within / across CCXs, as well as just cache hit effects from sharing L3. https://www.anandtech.com/show/16529/amd-epyc-milan-review/4 did some microbenchmarking on Zen 3 vs. 2-socket Xeon Platinum vs. 2-socket ARM Ampere.

Does the Linux scheduler need to be context switched?

That's a very good question, and the answer to it would be "yes" except for the fact that the hardware is aware of the concept of an OS and task scheduler.

In the hardware, you'll find registers that are restricted to "supervisor" mode. Without going into too much detail about the internal CPU architecture, there is a separate copy of some of the basic execution state (most notably the stack pointer) for "user mode" and "supervisor mode", and the latter can only be accessed by the OS itself (via a flag in a control register that indicates whether the kernel or a user-mode application is currently running).

So the "context switch" you speak of is the process of swapping/resetting the user mode registers (instruction register, stack pointer register, etc.) etc. but the system registers don't need to be swapped out because they're stored apart from the user ones.

For instance, on the Motorola 68000 the user-mode stack pointer (USP) and the supervisor-mode stack pointer (SSP) are two separate copies of register A7. So the kernel itself (which contains the task scheduler) uses the supervisor-mode stack and other supervisor-mode registers to run, setting the supervisor-mode flag to 1 while it's running, then performs a context switch on the user-mode registers to swap between apps and sets the supervisor-mode flag back to 0.

But without that kind of hardware support, if you wanted to build a multitasking system you'd have had to use the basic scheme you outlined in your question: use a hardware timer interrupt to call the task scheduler every x cycles, swap out the running app for the task scheduler, then swap in the next app. In most cases, though, the timer interrupt handler would itself be the task scheduler, and it would have been heavily optimized to make it less of a full context switch and more of a simple interrupt handler routine.
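
As a userspace caricature of that "interrupt every x cycles" scheme (not of how Linux itself does it): a POSIX interval timer delivers SIGALRM, the handler only sets a flag (in the spirit of the kernel's need_resched), and a dispatcher loop switches between two fake tasks whenever the flag is set:

/* toy_tick.c - tick-driven round-robin between two fake tasks.
   A SIGALRM every 100 ms stands in for the timer interrupt. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t need_resched = 0;
static volatile unsigned long work[2];        /* progress of the two "tasks" */

static void tick(int sig)                     /* the "timer interrupt handler" */
{
    (void)sig;
    need_resched = 1;                         /* only note that a switch is due */
}

static void run_task(int i)                   /* a small slice of fake work */
{
    for (int n = 0; n < 1000; n++)
        work[i]++;
}

int main(void)
{
    struct sigaction sa;
    struct itimerval it = {
        .it_interval = { .tv_sec = 0, .tv_usec = 100000 },   /* 100 ms period */
        .it_value    = { .tv_sec = 0, .tv_usec = 100000 },
    };
    int current = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = tick;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);
    setitimer(ITIMER_REAL, &it, NULL);

    for (int switches = 0; switches < 20; ) {     /* the "dispatcher" loop */
        run_task(current);                        /* let the current task run */
        if (need_resched) {                       /* a tick arrived */
            need_resched = 0;
            printf("tick: %d -> %d (work: %lu / %lu)\n",
                   current, 1 - current, work[0], work[1]);
            current = 1 - current;
            switches++;
        }
    }
    return 0;
}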

Is the scheduler built into the kernel a program or a process?

You have two similar questions (whether the scheduler built into the kernel is a program or a process, and how the CPU scheduling process is implemented in the Linux operating system), so I'll answer both here.

The answer is that it doesn't work that way at all. The scheduler is not called by user-mode processes through system calls; the scheduler isn't a system call. There are hardware timers that are programmed to raise an interrupt after some amount of time has elapsed. The timers are accessed through registers mapped into the physical address space, a scheme called memory-mapped I/O (MMIO): you write to an address described by the ACPI tables (https://wiki.osdev.org/ACPI), and the write is routed to the timer hardware in the CPU/chipset or to an external PCI device (PCI is everything nowadays).

When the timer reaches 0, it raises an interrupt. Interrupts are delivered by the hardware (the CPU), which provides a mechanism that lets the OS specify where execution should jump for each kind of interrupt (https://wiki.osdev.org/Interrupt_Descriptor_Table). Interrupts are how the hardware notifies the OS that an event has happened. Without them, the OS would have to dedicate at least one core to a special kernel process constantly polling the registers of peripherals and other devices, which would be hugely wasteful. Also, if scheduling happened only when user-mode processes invoked it themselves via a system call, the kernel would be at the mercy of user mode: it couldn't take the CPU back from a process that never yields, and processes could hog CPU time.

I didn't look at the source code, but I think the scheduler is also often called on I/O completion (again from an interrupt, just not necessarily the timer interrupt). I am quite sure the scheduler itself must not be preempted; that is, preemption (and, around the critical parts, interrupts) is disabled while the schedule() function runs.

I don't think you can call the scheduler a process (or even a kernel thread). Rather, the scheduler can be called from kernel threads, for instance those that handle the bottom halves of interrupts. In bottom-half processing, the top "half" of the interrupt handler runs fast and does the minimum necessary work, while the bottom "half" is deferred and runs later as schedulable kernel work, when the scheduler decides it should. So the scheduler can be invoked from kernel threads as well as from interrupt paths, but there has to be a mechanism for calling the scheduler without the scheduler having to schedule itself as a task; otherwise the kernel would stop functioning.


