How Linux Scheduler Schedules Processes on Multi-Core Processors

How Tasks are scheduled in a multi-core processor

Here's a timing diagram of the code that each thread needs to execute. This imagines a maximal case where each task immmediately spawns a new thread. The green sections are infinitesimally short pieces of code (think, "not-to-scale") but are basically just scheduling operations. And the red sections are similarly short process EXIT and thread END scheduling operations. (I've omitted penalties associated with thread creation. And notice that worker threads do not END, they just go idle, and they stay in a thread pool.

Basic Timing Diagram

Now the first thing you'll notice is that, because of the way tasks work, the second task can be executed on the same thread that scheduled it, because no more tasks are scheduled, and the thread is only going to await that task. This has nothing to do with thread scheduling, and everything to do with how tasks efficiently manage their pool of worker threads. This is application-level code, not os-level code that accomplishes this. The diagram below requires 1 fewer threads thanks to tasks.

Timing Diagram with smarter tasks

Now we can look at what the scheduler needs to do. We are still dealing with only logical processors. (The details of which core will execute which thread are complicated so let's leave that out for the moment.) Here we see that we can naively execute all each of these threads on their own processor.

Greedy usage of processors

It will likely be more efficient to execute the worker thread on one of the previous processors. They are idle when worker thread 1 needs to execute, so it makes more sense to reuse one of the previously allocated processors. Here task 1 code in worker thread 1 is shown executing on processor 2 (could also have been assigned to processor 1 because it is also free, but stay tuned for the next diagram and you'll see why I put it on processor 2).

Schedule thread to reuse a processor

And finally, we can construct the last version that takes us to the most efficient scheduling. This hinges on optimizing the case where you create a thread and then immediately join a thread. Different operating systems try to optimize this case so that the newly created thread can run on the same processor. It means that creating the thread doesn't immediately schedule the new thread on a free processor and burn the cost of a context switch back to the thread that scheduled it. Instead, the new thread is scheduled when we block in our Join operation, or when the next clock interrupt occurs. If we can quickly get to our Join call before an interrupt triggers the scheduler (we're talking < 10 ms on a typical operating systems for such things to be triggered by the clock chip) then the scheduling will happen more efficiently like this (below), where thread 2 can be scheduled to run on the same processor without a context switch. (Interestingly, Linux and Windows optimize this case differently.)

Final timing diagram

You'll notice (above) that this can now all execute on only two logical processors.

Whether it is more efficient to run these on separate cores or different logical processors of the same core is a nuance of the operating system again that depends highly on virtual memory usage and also the hardware specs of the processor and its caches. Different operating systems will do different things here, too. And the details matter greatly. Non-uniform memory architecture would affect the decision too.

In the real world, the operating system may use heuristics to determine the best priority and placement for threads and processes. The real world answer is so much different and more nuanced than this "computer science" answer I've given and depends on the specific details.

Additional Reading/Viewing:

Windows and Linux: A Tale of Two Kernels - Tech-Ed 2004 (Older but excellent info)
Processes, Threads, and Jobs in the Windows Operating System
Scheduling: Introduction; and Multiprocessor Scheduling (Advanced)
Capacity Aware Scheduling

Scheduling on multiple cores with each list in each processor vs one list that all processes share

Generally there is one, giant problem when it comes to multi-core CPU systems - cache coherency.

What does cache coherency mean?

Access to main memory is hard. Depending on the memory frequency, it can take between a few thousand to a few million cycles to access some data in RAM - that's a whole lot of time the CPU is doing no useful work. It'd be significantly better if we minimized this time as much as possible, but the hardware required to do this is expensive, and typically must be in very close proximity to the CPU itself (we're talking within a few millimeters of the core).

This is where the cache comes in. The cache keeps a small subset of main memory in close proximity to the core, allowing accesses to this memory to be several orders of magnitude faster than main memory. For reading this is a simple process - if the memory is in the cache, read from cache, otherwise read from main memory.

Writing is a bit more tricky. Writing to the cache is fast, but now main memory still holds the original value. We can update that memory, but that takes a while, sometimes even longer than reading depending on the memory type and board layout. How do we minimize this as well?

The most common way to do so is with a write-back cache, which, when written to, will flush the data contained in the cache back to main memory at some later point when the CPU is idle or otherwise not doing something. Depending on the CPU architecture, this could be done during idle conditions, or interleaved with CPU instructions, or on a timer (this is up to the designer/fabricator of the CPU).

Why is this a problem?

In a single core system, there is only one path for reads and writes to take - they must go through the cache on their way to main memory, meaning the programs running on the CPU only see what they expect - if they read a value, modified it, then read it back, it would be changed.

In a multi-core system, however, there are multiple paths for data to take when going back to main memory, depending on the CPU that issued the read or write. this presents a problem with write-back caching, since that "later time" introduces a gap in which one CPU might read memory that hasn't yet been updated.

Imagine a dual core system. A job starts on CPU 0 and reads a memory block. Since the memory block isn't in CPU 0's cache, it's read from main memory. Later, the job writes to that memory. Since the cache is write-back, that write will be made to CPU 0's cache and flushed back to main memory later. If CPU 1 then attempts to read that same memory, CPU 1 will attempt to read from main memory again, since it isn't in the cache of CPU 1. But the modification from CPU 0 hasn't left CPU 0's cache yet, so the data you get back is not valid - your modification hasn't gone through yet. Your program could now break in subtle, unpredictable, and potentially devastating ways.

Because of this, cache synchronization is done to alleviate this. Application IDs, address monitoring, and other hardware mechanisms exist to synchronize the caches between multiple CPUs. All of these methods have one common problem - they all force the CPU to take time doing bookkeeping rather than actual, useful computations.

The best method of avoiding this is actually keeping processes on one processor as much as possible. If the process doesn't migrate between CPUs, you don't need to keep the caches synchronized, as the other CPUs won't be accessing that memory at the same time (unless the memory is shared between multiple processes, but we'll not go into that here).

Now we come to the issue of how to design our scheduler, and the three main problems there - avoiding process migration, maximizing CPU utilization, and scalability.

Single Queue Multiprocessor scheduling (SQMS)

Single Queue Multiprocessor schedulers are what you suggested - one queue containing available processes, and each core accesses the queue to get the next job to run. This is fairly simple to implement, but has a couple of major drawbacks - it can cause a whole lot of process migration, and does not scale well to larger systems with more cores.

Imagine a system with four cores and five jobs, each of which takes about the same amount of time to run, and each of which is rescheduled when completed. On the first run through, CPU 0 takes job A, CPU 1 takes B, CPU 2 takes C, and CPU 3 takes D, while E is left on the queue. Let's then say CPU 0 finishes job A, puts it on the back of the shared queue, and looks for another job to do. E is currently at the front of the queue, to CPU 0 takes E, and goes on. Now, CPU 1 finishes job B, puts B on the back of the queue, and looks for the next job. It now sees A, and starts running A. But since A was on CPU 0 before, CPU 1 now needs to sync its cache with CPU 0, resulting in lost time for both CPU 0 and CPU 1. In addition, if two CPUs both finish their operations at the same time, they both need to write to the shared list, which has to be done sequentially or the list will get corrupted (just like in multi-threading). This requires that one of the two CPUs wait for the other to finish their writes, and sync their cache back to main memory, since the list is in shared memory! This problem gets worse and worse the more CPUs you add, resulting in major problems with large servers (where there can be 16 or even 32 CPU cores), and being completely unusable on supercomputers (some of which have upwards of 1000 cores).

Multi-queue Multiprocessor Scheduling (MQMS)

Multi-queue multiprocessor schedulers have a single queue per CPU core, ensuring that all local core scheduling can be done without having to take a shared lock or synchronize the cache. This allows for systems with hundreds of cores to operate without interfering with one another at every scheduling interval, which can happen hundreds of times a second.

The main issue with MQMS comes from CPU Utilization, where one or more CPU cores is doing the majority of the work, and scheduling fairness, where one of the processes on the computer is being scheduled more often than any other process with the same priority.

CPU Utilization is the biggest issue - no CPU should ever be idle if a job is scheduled. However, if all CPUs are busy, so we schedule a job to a random CPU, and a different CPU ends up becoming idle, it should "steal" the scheduled job from the original CPU to ensure every CPU is doing real work. Doing so, however, requires that we lock both CPU cores and potentially sync the cache, which may degrade any speedup we could get by stealing the scheduled job.

In conclusion

Both methods exist in the wild - Linux actually has three different mainstream scheduler algorithms, one of which is an SQMS. The choice of scheduler really depends on the way the scheduler is implemented, the hardware you plan to run it on, and the types of jobs you intend to run. If you know you only have two or four cores to run jobs, SQMS is likely perfectly adequate. If you're running a supercomputer where overhead is a major concern, then an MQMS might be the way to go. For a desktop user - just trust the distro, whether that's a Linux OS, Mac, or Windows. Generally, the programmers for the operating system you've got have done their homework on exactly what scheduler will be the best option for the typical use case of their system.

This whitepaper describes the differences between the two types of scheduling algorithms in place.

How do Operating Systems schedule multiple threads on multiple CPU cores simultaneously?

On a symmetric multi processor architecture all CPUs have equal access to all memory. A thread's object code and data is accessible to all cores and processors, so it is easy to "move" a thread / process from core to core. The kernel simply needs to implement a scheduling scheme to ensure that everything that needs to run gets run as best as possible. A thread / process that was interrupted on one core can be resumed on another with very little penalty.

Exactly what that scheduling scheme is varies. There could be a single scheduler task running on a single core that controls what's running on all the other cores. Alternatively there could be a mini-scheduler per core that looks after scheduling on just that core, co-operating with its peers to spread threads around. This is, I think (corrections welcome), what Linux does.

Force Linux to schedule processes on CPU cores that share CPU cache

Newer Linux may do this for you: Cluster-Aware Scheduling Lands In Linux 5.16 - there's support for scheduling decisions to be influenced by the fact that some cores share resources.

If you manually pick a CCX, you could give them each the same affinity mask that allows them to schedule on any of the cores in that CCX.

An affinity mask can have multiple bits set.

I don't know of a way to let the kernel decide which CCX, but then schedule both tasks to cores within it. If the parent checks which core it's currently running on, it could set a mask to include all cores in the CCX containing it, assuming you have a way to detect how core #s are grouped, and a function to apply that.

You'd want to be careful that you don't end up leaving some CCXs totally unused if you start multiple processes that each do this, though. Maybe every second, do whatever top or htop do to check per-core utilization, and if so rebalance? (i.e. change the affinity mask of both processes to the cores of a different CCX). Or maybe put this functionality outside the processes being scheduled, so there's one "master control program" that looks at (and possibly modifies) affinity masks for a set of tasks that it should control. (Not all tasks on the system; that would be a waste of work.)

Or if it's looking at everything, it doesn't need to do so much checking of current load average, just count what's scheduled where. (And assume that tasks it doesn't know about can pick any free cores on any CCX, like daemons or the occasional compile job. Or at least compete fairly if all cores are busy with jobs it's managing.)

Obviously this is not helpful for most parent/child processes, only ones that do a lot of communication via shared memory (or maybe pipes, since kernel pipe buffers are effectively shared memory).

It is true that Zen CPUs have varying inter-core latency within / across CCXs, as well as just cache hit effects from sharing L3. https://www.anandtech.com/show/16529/amd-epyc-milan-review/4 did some microbenchmarking on Zen 3 vs. 2-socket Xeon Platinum vs. 2-socket ARM Ampere.

What is the difference in scheduling threads?

Right, there's no difference. The kernel just schedules tasks; each user task refers to a page table (whether that's shared with any other task or not).

Each logical CPU core has its own page-table pointer (e.g. x86 CR3).

And yes, cache coherency is maintained by hardware. The Linux kernel's hand-rolled atomics (using volatile, and inline asm for RMWs and barriers) depend on that.

In Linux scheduler, how do different processes containing multiple threads get fair time quota?

The point you are missing here is, how scheduler looks at threads or tasks. Well, the Linux kernel scheduler will treat them as individual scheduling entity, therefore will be counted and scheduled differently.

Now let's see what CFS documentation says - it has a simplistic approach of giving out even slice of CPU time to each runnable process, therefore, if there are 4 runnable process/threads they'll get 25% of cpu time each. But on real hardware it's not possible and to fix the issue vruntime was introduced (take more on this from here

Now come back to your example, if process A creates 100 threads and B creates 1 thread then the # of running processes or threads becomes 103 (assuming all are runnable state) then CFS will evenly share the cpu using formula 1/103 (cpu/number of running tasks). And the context switching is same for all the scheduling entities, threads only shares task's internal mm_struct and when they run they have their own sets of registers, task status to load up to start with. Hope this will help to understand better.