Why Does Linux's Scheduler Put Two Threads Onto the Same Physical Core on Processors with Hyperthreading?

When would two threads be executed on a single physical CPU core on a multi-core machine?

Two cases:

1) The other physical cores are busy doing other stuff, so only one core gets used by this process. The two threads run in alternation on that core.

2) The physical core supports executing more than one thread concurrently using hyperthreading or something similar. The other physical cores are busy doing other stuff, so the best the scheduler can do is run both threads in a single physical core.
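
If you want to observe this yourself, here is a minimal sketch (Linux-specific, using glibc's sched_getcpu(); the iteration counts are arbitrary) that lets you watch where the scheduler places two busy threads:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Each worker burns some CPU, then reports which logical CPU it is
       currently running on.  Which logical CPUs share a physical core is
       listed in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (int i = 0; i < 5; i++) {
            volatile unsigned long x = 0;
            for (unsigned long j = 0; j < 200000000UL; j++)  /* busy work */
                x += j;
            printf("thread %ld is on logical CPU %d\n", id, sched_getcpu());
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Build with something like gcc -O2 -pthread and compare the reported CPU numbers against the core topology in sysfs; with other cores idle, the two threads usually end up on different physical cores.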

Is synchronization faster on the same physical CPU core?

The answer is dependent on the platform (especially the underlying architecture). That being said, on the (mainstream) x86-64 architecture, threads sharing the same core communicate faster than threads on different cores or on different sockets. One main reason is that the two threads will often share the same L1 cache (and if not, the L2 cache). Thus, one thread can directly read what the other just wrote. Moreover, the threads can often run in parallel thanks to simultaneous multithreading (called Hyper-Threading on Intel CPUs), which reduces communication latency (no scheduling quantum to wait for).
Meanwhile, threads on different cores have to communicate through a (slow) interconnect or share data using the L3 cache, which is significantly slower than the L1/L2.
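
One rough way to see the latency difference is a cache-line ping-pong between two threads pinned to chosen logical CPUs. This is only an illustrative sketch (Linux + pthreads, CPU numbers given on the command line, coarse wall-clock timing), not a rigorous benchmark:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ITERS 1000000
    static _Atomic int flag = 0;        /* both threads bounce this cache line */

    static void pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *pinger(void *arg)
    {
        pin_to_cpu(*(int *)arg);
        for (int i = 0; i < ITERS; i++) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
                ;                       /* wait for our turn */
            atomic_store_explicit(&flag, 1, memory_order_release);
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s cpuA cpuB\n", argv[0]);
            return 1;
        }
        int cpu_b = atoi(argv[2]);
        pin_to_cpu(atoi(argv[1]));

        pthread_t t;
        struct timespec t0, t1;
        pthread_create(&t, NULL, pinger, &cpu_b);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
                ;                       /* wait for the other thread */
            atomic_store_explicit(&flag, 0, memory_order_release);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_join(t, NULL);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("average round trip: %.1f ns\n", ns / ITERS);
        return 0;
    }

Run it once with two logical CPUs that are hyperthread siblings and once with CPUs on different physical cores (the sysfs snippet below shows how to find the siblings) and compare the round-trip times.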

When your workload is bound by communication (latency or throughput), it is often better to put the threads close to each other (i.e. on the same core). When the number of threads per core exceeds the number of hardware threads, performance decreases due to preemptive multitasking. When the workload is compute-bound, it is better to put the threads on separate cores. Note that on modern x86 processors, threads running on the same core even share the core's computing resources (ALUs) at the instruction level.
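
To place threads deliberately you first need to know which logical CPUs are hyperthread siblings of each other. A small sketch that reads the Linux sysfs topology (assuming the thread_siblings_list files are present, as they are on typical modern kernels):

    #include <stdio.h>
    #include <unistd.h>

    /* Print, for each online logical CPU, the set of logical CPUs that
       share its physical core (its hyperthread siblings). */
    int main(void)
    {
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        for (long cpu = 0; cpu < ncpu; cpu++) {
            char path[128], buf[64];
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%ld/topology/thread_siblings_list",
                     cpu);
            FILE *f = fopen(path, "r");
            if (!f)
                continue;
            if (fgets(buf, sizeof(buf), f))
                printf("cpu%ld shares a physical core with: %s", cpu, buf);
            fclose(f);
        }
        return 0;
    }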

Can a hyper-threaded processor core execute two threads at the exact same time?

See my answer on a softwareengineering.SE question for some details about how modern CPUs find and exploit instruction-level parallelism (ILP) by running multiple instructions at once (including a block diagram of Intel Haswell's pipeline, and links to more CPU microarchitecture details). Also see Modern Microprocessors: A 90-Minute Guide!

You have a CPU with lots of execution units and a front-end that can keep them mostly supplied with work to do, but only under good conditions. Stalls like cache misses or branch mispredicts, or just limited parallelism (e.g. a loop that does one long chain of FP additions, bottlenecking on FP latency at one (scalar or SIMD) add per 4 or 5 clocks instead of one or two per clock) will result in throughput of much less than 4 instructions per cycle, and leave execution units idle.
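
As a sketch of that FP-latency bottleneck (just the loop kernels, not a full benchmark; compile without -ffast-math so the compiler keeps the dependency chain as written): a single-accumulator sum is one serial chain of additions limited by add latency, while splitting the sum over several independent accumulators lets the out-of-order core overlap the additions and approach add throughput.

    #include <stddef.h>

    /* One long dependency chain: each add must wait for the previous one,
       so the loop runs at FP-add latency, not throughput. */
    double sum_one_chain(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Four independent chains: the out-of-order core can keep several
       adds in flight at once, so the loop approaches add throughput. */
    double sum_four_chains(const double *a, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }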

The point of HT (and Simultaneous Multithreading (SMT) in general) is to keep those hungry execution units fed with work to do, even when running code with low ILP or lots of stalls (cache misses / branch mispredicts).

SMT only adds a bit of extra logic to the pipeline so it can keep track of two separate architectural contexts at the same time. So it costs a lot less die area and power than having twice or 4x as many full cores. (Knights Landing Xeon Phi runs 4 threads per core, mainstream Intel CPUs run 2. Some non-x86 chips run 8 threads per core, aimed at database-server type workloads.) But of course having to divide the out-of-order execution resources between logical threads means the throughput gain is usually well below 2x or 4x, and for some workloads it is negative.


Common misconceptions

Hyperthreading is not just optimized context switching. Simpler designs that switch to the other thread on a cache miss are possible, but HT is more advanced than that.

With two threads active, the front-end alternates between threads every cycle (in the fetch, decode, and issue/rename stages), but the out-of-order back-end can actually execute uops from both logical cores in the same cycle.

In pipeline stages that normally alternate, any time one thread is stalled, the other thread gets all the cycles in that stage. HT is much better than just fixed alternating, because one thread can get lots of work done while the other is recovering from a branch mispredict or waiting for a cache miss.

Note that up to 10 cache misses can be outstanding at once from L1D cache in Intel CPUs (this is the number of LFBs, Line Fill Buffers), and memory requests are pipelined. But if the address for the next load depends on an earlier load (e.g. pointer chasing through a tree or linked list), the CPU doesn't know where to load from and can't keep multiple requests in flight. So it is actually useful for both threads to be waiting on cache misses in parallel.
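
As an illustrative sketch of that memory-level-parallelism point: in a linked-list walk each load address depends on the previous load, so only one miss can be in flight at a time, while an array sum has independent addresses and lets the core (and the hardware prefetcher) keep many misses in flight.

    #include <stddef.h>

    struct node { struct node *next; long payload; };

    /* Pointer chasing: the next address is unknown until the current load
       completes, so cache misses are serialized (latency-bound). */
    long walk_list(const struct node *p)
    {
        long sum = 0;
        while (p) {
            sum += p->payload;
            p = p->next;
        }
        return sum;
    }

    /* Independent addresses: many loads (and cache misses) can be in
       flight at once (bandwidth-bound). */
    long sum_array(const long *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }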

Some resources are statically partitioned when two threads are active, some are competitively shared. See this pdf of slides for some details. (For more details about how to actually optimize asm for Intel and AMD CPUs, see Agner Fog's microarchitecture PDF.)


When one logical core "sleeps" (i.e. the kernel runs a HLT instruction, or MWAIT to enter a deeper sleep), the physical core transitions to single-thread mode and lets the still-active logical core have all the resources (including the full ReOrder Buffer size and other statically-partitioned resources), so its ability to find and exploit ILP in the single thread still running increases more than when the other thread is simply stalled on a cache miss.


BTW, some workloads actually run slower with HT. If your working set barely fits in the L2 or L1D cache, then running two threads on the same core will lead to a lot more cache misses. For very well-tuned high-throughput code that can already keep the execution units saturated (like an optimized matrix multiply in high-performance computing), it can make sense to disable HT. Always benchmark.
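
One way to benchmark the "one thread per physical core" case without actually disabling HT in firmware is to restrict the process's CPU affinity to a single logical CPU of each core. A hedged sketch (Linux-specific, relies on the sysfs thread_siblings_list files; the wrapper-program usage in the comment is just a suggestion):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Restrict the current process to the lowest-numbered hyperthread
       sibling of each physical core, then (optionally) exec a benchmark,
       e.g.:  ./one_per_core ./my_benchmark args...
       The benchmark's threads can then never run on two siblings of the
       same core at once. */
    int main(int argc, char **argv)
    {
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        cpu_set_t set;
        CPU_ZERO(&set);

        for (long cpu = 0; cpu < ncpu; cpu++) {
            char path[128];
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%ld/topology/thread_siblings_list",
                     cpu);
            FILE *f = fopen(path, "r");
            if (!f)
                continue;
            long first = -1;
            if (fscanf(f, "%ld", &first) == 1 && first == cpu)
                CPU_SET(cpu, &set);     /* keep only the first sibling */
            fclose(f);
        }

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        if (argc > 1) {
            execvp(argv[1], &argv[1]);
            perror("execvp");
            return 1;
        }
        printf("affinity restricted to one logical CPU per physical core\n");
        return 0;
    }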

On Skylake, I've found that video encoding (with x265 -preset slower, 1080p) is about 15% faster with 8 threads instead of 4 on my quad-core i7-6700k. I didn't actually disable HT for the 4-thread test, but Linux's scheduler is good at not bouncing threads around and at running threads on separate physical cores when there are enough to go around. A 15% speedup is pretty good, considering that x265 has a lot of hand-written asm and sustains very high instructions per cycle even when it has a whole core to itself. (Slower presets like the one I used tend to be more CPU-bound than memory-bound.)


