Why is execution time of a process shorter when another process shares the same HT core?

Both are compiled with gcc without special options, i.e. with the default of -O0: no-optimization debug mode, which keeps variables in memory instead of registers.

Unlike a normal program, the version with int i,j loop counters bottlenecks completely on store-forwarding latency, not front-end throughput or back-end execution resources or any shared resource.

This is why you never want to do real benchmarking with -O0 debug-mode: the bottlenecks are different than with normal optimization (-O2 at least, preferably -O3 -march=native).
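For context, here's a hypothetical reconstruction of the kind of loop being timed (the OP's exact source isn't reproduced here, but this matches the -O0 asm shown further down):

// Hypothetical reconstruction (not the OP's exact code): nested counting
// loops with plain int locals.  At -O0, i and j live in memory, so every
// j++ is a load/add/store, and the loop condition reloads j right away.
int main(void) {
    int i, j;
    for (i = 0; i < 100000; i++)
        for (j = 0; j < 100000; j++)
            ;   // empty body: the counter updates themselves are the "work"
    return 0;
}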


On Intel Sandybridge-family (including @uneven_mark's Kaby Lake CPU), store-forwarding latency is lower if the reload doesn't try to run right away after the store, but instead runs a couple cycles later. Both Adding a redundant assignment speeds up code when compiled without optimization and Loop with function call faster than an empty loop demonstrate this effect in un-optimized compiler output.

Having another hyperthread competing for front-end bandwidth apparently makes this happen some of the time.

Or maybe the static partitioning of the store buffer speeds up store-forwarding? Might be interesting to try a minimally-invasive loop running on the other core, like this:

// compile this with optimization enabled
// and run it on the HT sibling of the debug-mode nested loop
#include <immintrin.h>

int main(void) {
    while (1) {
        _mm_pause(); _mm_pause();
        _mm_pause(); _mm_pause();
    }
}

pause blocks for about 100 cycles on Skylake, up from about 5 on earlier CPUs.

So if the benefit to store-forwarding comes from uops of the other thread having to issue/execute, this loop will do less of that, and the run-time will be closer to what it is when the nested loop has a physical core to itself in single-thread mode.

But if the benefit is just from partitioning the ROB and store buffer (which could plausibly speed up the time for a load to probe it for stores), we'd still see the full benefit.

Update: @uneven_mark tested on Kaby Lake and found that this reduced the "speedup" to ~2%, down from ~8%. So apparently the infinite loop's competition for front-end / back-end resources was an important part of how it stopped the other loop from reloading too soon.

Perhaps using up BOB (branch-order-buffer) slots was the main mechanism in stopping the other thread's branch uops from issuing into the out-of-order back-end. Modern x86 CPUs snapshot the RAT and other back-end state to allow fast recovery when they detect branch mispredicts, allowing rollback to the mispredicted branch without waiting for it to reach retirement.

This avoids waiting for the branch to reach retirement, and lets out-of-order execution of independent work from before the branch continue during recovery. But it means fewer branches can be in flight. At least fewer conditional/indirect branches? IDK if a direct jmp would use a BOB entry; its validity is established during decode. So maybe this guess doesn't hold water.


The while(1){} loop has no local vars in the loop so it doesn't bottleneck on store-forwarding. It's just a top: jmp top loop that can run at 1 cycle per iteration. That's a single-uop instruction on Intel.

i5-8250U is a Kaby Lake, and (unlike Coffee Lake) still has its loop buffer (LSD) disabled by microcode like Skylake. So it can't unroll itself in the LSD/IDQ (queue feeding the issue/rename stage) and has to fetch the jmp uop separately from the uop cache every cycle. But the IDQ does buffer that, only needing an issue/rename cycle every 4 cycles to issue a group of 4 jmp uops for that logical core.

But anyway, on SKL / KBL these two threads together more than saturate uop cache fetch bandwidth and do compete with each other that way. On a CPU with the LSD (loopback buffer) enabled (e.g. Haswell / Broadwell, or Coffee Lake and later), they wouldn't. Sandybridge/Ivybridge don't unroll tiny loops to use more of their LSD so you'd have the same effect there. I'm not sure if that's significant. Testing on Haswell or Coffee Lake would be interesting.

(An unconditional jmp always ends a uop-cache line, and it's not a trace cache anyway so one uop-cache fetch can't give you more than one jmp uop.)


Update from @uneven_mark: "I have to correct my confirmation from above: I compiled all programs as C++ (g++), which gave the roughly 2% difference. If I compile everything as C, I get about 8%, which is closer to the OP's roughly 10%."

That's interesting: gcc -O0 and g++ -O0 do compile the loops differently. This is a quirk of GCC's C vs. C++ front-ends feeding GCC's back-end different GIMPLE/RTL, or something like that, with -O0 not making the back-end fix the inefficiency. It's not anything fundamental about C vs. C++, or something you could expect from other compilers.

The C version transforms the loop into an idiomatic do{}while() style, with a cmp/jle at the bottom of the loop, right after a memory-destination add. (The left pane on this Godbolt compiler explorer link.) See Why are loops always compiled into "do...while" style (tail jump)?

But the C++ version uses an if(break) style of looping with the condition at the top, then the memory-destination add. Funny that separating the memory-destination add from the cmp reload by only one jmp instruction makes that big a difference.

# inner loop, gcc9.2 -O0.   (Actually g++ -xc but same difference)
        jmp  .L3
.L4:                                  # do {
        add  DWORD PTR [rbp-8], 1     #   j++
.L3:                                  # loop entry point for first iteration
        cmp  DWORD PTR [rbp-8], 99999
        jle  .L4                      # } while(j<=99999)

Apparently having the add and cmp back to back makes this version suffer more from slower store-forwarding on Skylake / Kaby Lake / Coffee Lake.

vs. this one which isn't affected as much:

# inner loop, g++9.2 -O0
.L4:                                  # do {
        cmp  DWORD PTR [rbp-8], 99999
        jg   .L3                      #   if(j>99999) break
        add  DWORD PTR [rbp-8], 1     #   j++
        jmp  .L4                      # } while(1)
.L3:

cmp [mem], imm / jcc might still micro and/or macro-fuse, but I forget which. IDK if that's relevant, but if the loop is more uops it can't issue as fast. Still, with the execution bottleneck of 1 iteration per 5 or 6 cycles (memory-destination add latency), the front-end is easily going to stay ahead of the back-end even if it has to compete with another hyperthread.

Can different processes run RDTSC at the same time?

Each physical core has its own TSC; the microcode doesn't have to go off-core, so there's no shared resource that they compete for. Going off-core at all would make it much slower and would have made the implementation more complex. Having a counter physically inside each core is a simpler implementation: it just counts ticks of a reference-clock signal that's distributed to all cores.

With HyperThreading, the logical cores sharing a physical core always compete for execution resources. From Agner Fog's instruction tables, we know that RDTSC on Skylake is 20 uops for the front-end, and has one per 25 cycle throughput. At less than 1 uop per clock while executing nothing but RDTSC instructions, competing for the front-end is probably not a problem.

Probably most of those uops can run on any execution port, so it's quite possible that both logical threads can run rdtsc with that throughput.

But maybe there's a not-fully-pipelined execution unit that they'd compete for.

You can test it by putting times 20 rdtsc inside a loop that runs a few 10s of millions of iterations, and running that microbenchmark on a core by itself, and then running it twice pinned to the logical cores of one physical core.
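As a rough sketch, a C approximation of such a test could look like this (assuming GCC/clang and <x86intrin.h>; the actual measurement below uses a hand-written NASM loop, which avoids the compiler-generated loop overhead this version has):

// Approximate C version of the rdtsc-throughput microbenchmark.
// Pin it with taskset and time it with perf stat, as described below.
#include <x86intrin.h>
#include <stdint.h>

int main(void) {
    uint64_t sink = 0;
    for (long i = 0; i < 10000000; i++) {
        for (int k = 0; k < 20; k++)   // ~20 back-to-back rdtsc per outer iteration,
            sink += __rdtsc();         // like `times 20 rdtsc` in the NASM version
    }
    return (int)sink;                  // use the result so nothing is optimized away
}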

I got curious and did that myself on Linux with perf on a Skylake i7-6700k, with taskset -c 3 and taskset -c 7 (the way Linux enumerates the cores on this CPU, those numbers are the logical cores of the 4th physical core. You can check /proc/cpuinfo to find out on your system.)

To avoid interleaving the output lines if they both finish nearly simultaneously, I used bash process substitution with cat <(cmd1) <(cmd2) to run them both simultaneously and get the output printed in a fixed order. The commands were

taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,instructions:u,branches:u,branch-misses:u,uops_issued.any:u,uops_executed.thread:u,cpu_clk_thread_unhalted.one_thread_active:u -r2 ./testloop

(and the same pinned to core 7), counting core clock cycles rather than reference cycles so I don't have to be paranoid about turbo / idle clock frequencies.

testloop is a static executable with a hand-written asm loop containing times 20 rdtsc (NASM repeat operator) and dec ebp/jnz, with the top of the loop aligned by 64 in case that ever matters. Before the loop, mov ebp, 10000000 initializes the counter. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details on how I do microbenchmarks this way. Or Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths another example of a simple NASM program with a loop using times to repeat instructions.)

 Performance counter stats for './testloop' (2 runs):

1,278.19 msec task-clock:u # 1.000 CPUs utilized ( +- 0.19% )
4 context-switches # 0.004 K/sec ( +- 11.11% )
0 cpu-migrations # 0.000 K/sec
2 page-faults # 0.002 K/sec
5,243,270,118 cycles:u # 4.102 GHz ( +- 0.01% ) (71.37%)
219,949,542 instructions:u # 0.04 insn per cycle ( +- 0.01% ) (85.68%)
10,000,692 branches:u # 7.824 M/sec ( +- 0.03% ) (85.68%)
32 branch-misses:u # 0.00% of all branches ( +- 93.65% ) (85.68%)
4,010,798,914 uops_issued.any:u # 3137.885 M/sec ( +- 0.01% ) (85.68%)
4,010,969,168 uops_executed.thread:u # 3138.018 M/sec ( +- 0.00% ) (85.78%)
0 cpu_clk_thread_unhalted.one_thread_active:u # 0.000 K/sec (57.17%)

1.27854 +- 0.00256 seconds time elapsed ( +- 0.20% )

Performance counter stats for './testloop' (2 runs):

1,278.26 msec task-clock:u # 1.000 CPUs utilized ( +- 0.18% )
6 context-switches # 0.004 K/sec ( +- 9.09% )
0 cpu-migrations # 0.000 K/sec
2 page-faults # 0.002 K/sec ( +- 20.00% )
5,245,894,686 cycles:u # 4.104 GHz ( +- 0.02% ) (71.27%)
220,011,812 instructions:u # 0.04 insn per cycle ( +- 0.02% ) (85.68%)
9,998,783 branches:u # 7.822 M/sec ( +- 0.01% ) (85.68%)
23 branch-misses:u # 0.00% of all branches ( +- 91.30% ) (85.69%)
4,010,860,476 uops_issued.any:u # 3137.746 M/sec ( +- 0.01% ) (85.68%)
4,012,085,938 uops_executed.thread:u # 3138.704 M/sec ( +- 0.02% ) (85.79%)
4,174 cpu_clk_thread_unhalted.one_thread_active:u # 0.003 M/sec ( +- 9.91% ) (57.15%)

1.27876 +- 0.00265 seconds time elapsed ( +- 0.21% )

vs. running alone:

 Performance counter stats for './testloop' (2 runs):

1,223.55 msec task-clock:u # 1.000 CPUs utilized ( +- 0.52% )
4 context-switches # 0.004 K/sec ( +- 11.11% )
0 cpu-migrations # 0.000 K/sec
2 page-faults # 0.002 K/sec
5,003,825,966 cycles:u # 4.090 GHz ( +- 0.00% ) (71.31%)
219,905,884 instructions:u # 0.04 insn per cycle ( +- 0.04% ) (85.66%)
10,001,852 branches:u # 8.174 M/sec ( +- 0.04% ) (85.66%)
17 branch-misses:u # 0.00% of all branches ( +- 52.94% ) (85.78%)
4,012,165,560 uops_issued.any:u # 3279.113 M/sec ( +- 0.03% ) (85.78%)
4,010,429,819 uops_executed.thread:u # 3277.694 M/sec ( +- 0.01% ) (85.78%)
28,452,608 cpu_clk_thread_unhalted.one_thread_active:u # 23.254 M/sec ( +- 0.20% ) (57.01%)

1.22396 +- 0.00660 seconds time elapsed ( +- 0.54% )

(The counter for cpu_clk_thread_unhalted.one_thread_active:u only counts at some slow rate; the system was fairly idle during this test so it should have had the core to itself the whole time. i.e. that ~23.2 M counts / sec does represent single-thread mode.)

vs. the 0 and near-0 counts for running together show that I succeeded in having these tasks run simultaneously on the same core, with hyperthreading, for basically the whole time (~1.2 seconds repeated twice, or 2.4 seconds).

So 5.0038G cycles / 10M iters / 20 rdtsc/iter = 25.019 cycles per RDTSC single-threaded, pretty much what Agner Fog measured.

Averaging across both processes for the HT test, that's about 5.244G cycles / 10M iter / 20 rdtsc/iter = 26.22 cycles on average.

So running RDTSC on both logical cores simultaneously on Skylake scales nearly linearly: each thread only slows from ~25 to ~26.2 cycles per rdtsc, with very minimal competition for throughput resources. Whatever RDTSC bottlenecks on, it's not something that both threads compete for or slow each other down with.

Having the other logical core busy running high-throughput code (code that could sustain 4 uops per clock if it had a core to itself) would probably hurt an RDTSC thread more than another thread that's also just running RDTSC does. Maybe we could even figure out if there's one specific port that RDTSC needs more than others; e.g. port 1 is easy to saturate because it's the only port that can run integer multiply instructions.

CPU time on multicore / hyperthreaded CPUs

What's the relationship between numbers A, Bi, Ci, Di?

Expect D1=D2=D3=D4=A*1, except if you have L2 cache issues (conflicts, misses, ...), in which case the factor will be slightly greater than 1.

Expect B1=B2=B3=B4=...=B8=A*1.3. The factor 1.3 may vary between 1.1 and 2 depending on your application (certain processor subparts are duplicated for hyperthreading, others are not). It was computed from similar statistics, given here using the notation of the question: D = 23 seconds and A = 18 seconds, reported on a private forum. The unthreaded process did integer computations without input/output; the exact application was checking Adem coefficients in the motivic Steenrod algebra (I don't know what that is; the settings were (2n+e, n) with n = 20).

In the case of seven processes (the Cs), if you assign each process to a core (e.g. with /usr/bin/htop on Linux), then one of the processes (say C5) will have the same execution time as A, and the others (in my example C1, C2, C3, C4, C6, C7) will have the same values as the Ds. If you do not pin the processes to cores, and they run long enough for the OS to balance them between the cores, their times will converge towards the mean of the Cs.

Are times Bi different between them? What about Ci, Di?

That depends on your OS scheduler and its configuration. And the CPU percentage shown by /bin/top on Linux is misleading: it will show nearly 100% for A, the Bs, the Cs and the Ds.

To assess performance, don't forget /usr/bin/nettop (and variants nethogs, nmon, iftop, iptraf), iotop (and variants iostat, latencytop), collectl (+colmux), and sar (+sag, +sadf).

What will be used for data exchange between threads executing on one core with HT?

I think you'll get a round-trip to L1. (Not the same thing as store->load forwarding within a single thread, which is even faster than that.)

Intel's optimization manual says that store and load buffers are statically partitioned between threads, which tells us a lot about how this will work. I haven't tested most of this, so please let me know if my predictions aren't matching up with experiment.

Update: See this Q&A for some experimental testing of throughput and latency.


A store has to retire in the writing thread, and then commit to L1 from the store buffer/queue some time after that. At that point it will be visible to the other thread, and a load to that address from either thread should hit in L1. Before that, the other thread should get an L1 hit with the old data, and the storing thread should get the stored data via store->load forwarding.

Store data enters the store buffer when the store uop executes, but it can't commit to L1 until it's known to be non-speculative, i.e. until it retires. But the store buffer also de-couples retirement (from the ROB, the ReOrder Buffer in the out-of-order core) from commitment to L1, which is great for stores that miss in cache. The out-of-order core can keep working until the store buffer fills up.


Two threads running on the same core with hyperthreading can see StoreLoad re-ordering if they don't use memory fences, because store-forwarding doesn't happen between threads. Jeff Preshing's Memory Reordering Caught in the Act code could be used to test for it in practice, using CPU affinity to run the threads on different logical CPUs of the same physical core.
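A hedged sketch of such a test, in the spirit of Preshing's program rather than a copy of it: pin the two threads to logical CPUs 3 and 7 (the same numbering as the taskset example earlier; adjust for your topology), use plain stores and loads with no fences, and count how often both threads' loads miss the other's store:

// Sketch (assumptions: Linux, GCC/clang, compile with -O2 -pthread).
// Two threads pinned to HT siblings; each stores to one variable then
// loads the other.  Seeing r1 == 0 && r2 == 0 means both loads ran before
// the other thread's store became visible, i.e. StoreLoad reordering.
#define _GNU_SOURCE
#include <pthread.h>
#include <semaphore.h>
#include <sched.h>
#include <stdio.h>

static volatile int X, Y, r1, r2;
static sem_t begin1, begin2, done;

static void pin_to(int cpu) {          // CPU numbers 3 and 7 below are assumptions
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

static void *t1(void *arg) {
    (void)arg;
    pin_to(3);
    for (;;) {
        sem_wait(&begin1);
        X = 1;
        __asm__ __volatile__("" ::: "memory");   // compiler barrier only: no mfence
        r1 = Y;
        sem_post(&done);
    }
}

static void *t2(void *arg) {
    (void)arg;
    pin_to(7);
    for (;;) {
        sem_wait(&begin2);
        Y = 1;
        __asm__ __volatile__("" ::: "memory");
        r2 = X;
        sem_post(&done);
    }
}

int main(void) {
    sem_init(&begin1, 0, 0); sem_init(&begin2, 0, 0); sem_init(&done, 0, 0);
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    int detected = 0;
    for (int i = 0; i < 1000000; i++) {
        X = Y = 0;
        sem_post(&begin1); sem_post(&begin2);
        sem_wait(&done); sem_wait(&done);
        if (r1 == 0 && r2 == 0)
            detected++;
    }
    printf("%d StoreLoad reorderings observed\n", detected);
    return 0;
}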

An atomic read-modify-write operation has to make its store globally visible (commit to L1) as part of its execution, otherwise it wouldn't be atomic. As long as the data doesn't cross a boundary between cache lines, it can just lock that cache line. (AFAIK this is how CPUs typically implement atomic RMW operations like lock add [mem], 1 or lock cmpxchg [mem], rax.)

Either way, once it's done the data will be hot in the core's L1 cache, where either thread can get a cache hit from loading it.

I suspect that two hyperthreads doing atomic increments to a shared counter (or any other locked operation, like xchg [mem], eax) would achieve about the same throughput as a single thread. This is much higher than for two threads running on separate physical cores, where the cache line has to bounce between the L1 caches of the two cores (via L3).
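For reference, the kind of atomic increment meant here, written in C11 (a sketch; on x86-64, GCC and clang typically compile a relaxed fetch-add whose result is ignored down to a single lock add instruction):

#include <stdatomic.h>

_Atomic long counter;

void hammer(long iters) {
    for (long i = 0; i < iters; i++)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
        // each increment is a locked RMW: the line stays hot in this core's L1d
}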

movNT (Non-Temporal) weakly-ordered stores bypass the cache and put their data into a line-fill buffer. They also evict the line from L1 if it was hot in cache to start with. They probably have to retire before the data goes into a fill buffer, so a load from the other thread probably won't see it at all until it enters a fill buffer. Then it's probably the same as a movnt store followed by a load inside a single thread (i.e. a round trip to DRAM, a few hundred cycles of latency). Don't use NT stores for a small piece of data you expect another thread to read right away.
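If you do end up mixing NT stores with inter-thread communication, the required ordering looks something like this sketch (payload/ready are hypothetical names); even done correctly, the reader pays roughly a memory-latency round trip rather than an L1d or store-forwarding hit:

#include <immintrin.h>
#include <stdatomic.h>

int payload;               // hypothetical data written with an NT store
_Atomic int ready;         // hypothetical flag the reader polls

void publish(int value) {
    _mm_stream_si32(&payload, value);   // NT store: bypasses cache, goes via a fill buffer
    _mm_sfence();                       // NT stores are weakly ordered: fence before publishing
    atomic_store_explicit(&ready, 1, memory_order_release);
}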


L1 hits are possible because of the way Intel CPUs share the L1 cache. Intel uses virtually indexed, physically tagged (VIPT) L1 caches in most (all?) of their designs. (e.g. the Sandybridge family.) But since the index bits (which select a set of 8 tags) are below the page-offset, it behaves exactly like a PIPT cache (think of it as translation of the low 12 bits being a no-op), but with the speed advantage of a VIPT cache: it can fetch the tags from a set in parallel with the TLB lookup to translate the upper bits. See the "L1 also uses speed tricks that wouldn't work if it was larger" paragraph in this answer.

Since L1d cache behaves like PIPT, and the same physical address really means the same memory, it doesn't matter whether it's 2 threads of the same process with the same virtual address for a cache line, or whether it's two separate processes mapping a block of shared memory to different addresses in each process. This is why L1d can be (and is) competitively shared by both hyperthreads without risk of false-positive cache hits. Unlike the dTLB, which needs to tag its entries with a core ID.

A previous version of this answer had a paragraph here based on the incorrect idea that Skylake had reduced L1 associativity. It's Skylake's L2 that's 4-way, vs. 8-way in Broadwell and earlier. Still, the discussion on a more recent answer might be of interest.


Intel's x86 manual vol3, chapter 11.5.6 documents that Netburst (P4) has an option to not work this way. The default is "Adaptive mode", which lets logical processors within a core share data.

There is a "shared mode":

In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if the logical processors use identical CR3 registers and paging modes.

In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear address in the cache can point to different physical locations. The mechanism for resolving aliasing can lead to thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the preferred configuration for processors based on the Intel NetBurst microarchitecture that support Intel Hyper-Threading Technology.

It doesn't say anything about this for hyperthreading in Nehalem / SnB uarches, so I assume they didn't carry "slow mode" support over when they re-introduced HT in a different uarch, since they knew they'd gotten "fast mode" to work correctly in NetBurst. I kinda wonder if this mode bit only existed in case they discovered a bug and had to disable it with microcode updates.

The rest of this answer only addresses the normal setting for P4, which I'm pretty sure is also the way Nehalem and SnB-family CPUs work.


It would be possible in theory to build an OOO SMT CPU core that made stores from one thread visible to the other as soon as they retired, but before they leave the store buffer and commit to L1d (i.e. before they become globally visible). This is not how Intel's designs work, since they statically partition the store queue instead of competitively sharing it.

Even if the threads shared one store-buffer, store forwarding between threads for stores that haven't retired yet couldn't be allowed because they're still speculative at that point. That would tie the two threads together for branch mispredicts and other rollbacks.

Using a shared store queue for multiple hardware threads would take extra logic to always forward to loads from the same thread, but only forward retired stores to loads from the other thread(s). Besides transistor count, this would probably have a significant power cost. You couldn't just omit store-forwarding entirely for non-retired stores, because that would break single-threaded code.

Some POWER CPUs may actually do this; it seems like the most likely explanation for not all threads agreeing on a single global order for stores. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads?

As @BeeOnRope points out, this wouldn't work for an x86 CPU, only for an ISA that doesn't guarantee a Total Store Order, because this would let the SMT sibling(s) see your store before it becomes globally visible to other cores.

TSO could maybe be preserved by treating data from sibling store-buffers as speculative, or not able to happen before any cache-miss loads (because lines that stay hot in your L1D cache can't contain new stores from other cores). IDK, I haven't thought this through fully. It seems way overcomplicated and probably not able to do useful forwarding while maintaining TSO, even beyond the complications of having a shared store-buffer or probing sibling store-buffers.

Is a schedulable unit of CPU time slice process or thread?

Clarification: my understanding of "a schedulable unit of CPU time slice" is "a unit that can be scheduled during a given CPU time slice" (if the "schedulable unit" were itself a span of time, the question would not make much sense to me).

Based on this, in short, a "schedulable unit of CPU time slice" for a given logical core can be seen as a software thread (more specifically, its execution context, composed of registers and process information).


Operating-system schedulers operate on tasks. Tasks can be threads, processes, or other, more unusual structures (e.g. dataflows).

Modern mainstream operating systems mainly schedule threads onto processing units (typically hardware threads, also called logical cores). You can get more information about how the Windows scheduler works in the Microsoft documentation. The documentation explicitly states:

A thread is the entity within a process that can be scheduled for execution

On Linux, the default scheduler, CFS, operates on tasks (i.e. the task_struct data structure). A task can be a thread, a group of threads, or a process. It was done this way to make the scheduler more generic, and also because this scheduler was designed long ago, when processors had only one core and people focused on processes rather than threads. The multi-core era has since caused applications to use many threads in order to use the available cores, so nowadays it is generally threads that are actually scheduled, AFAIK. This is explained in the well-known research paper The Linux Scheduler: a Decade of Wasted Cores (which also explains a bit about how CFS picks a target processor for a task).
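One quick way to see that the scheduled unit on Linux is the per-thread task: each thread of a process has its own kernel TID (which is what appears under /proc/<pid>/task/), while the PID is shared. A minimal sketch, assuming Linux (gettid is called through syscall() for portability with older glibc):

// Each pthread is a separate kernel task: same PID, different TID.
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *worker(void *arg) {
    (void)arg;
    printf("pid=%ld tid=%ld\n", (long)getpid(), (long)syscall(SYS_gettid));
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    worker(NULL);   // the main thread's TID equals the PID
    return 0;
}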

Note that the term "process" can sometimes refer to a thread, since threads are sometimes called "lightweight processes" and plain processes are sometimes called "heavy processes"; "process" can even be a generic term for both. This is confusing terminology and a misuse of language (like "processors" sometimes being used for cores). In practice it is often not a problem in a specific context where threads and processes can be used interchangeably, though in such cases people should really use a generic term like "tasks".

As for "a schedulable unit of CPU time slice" this is a bit more complex. A simple and naive answer is: a thread (it is definitively not processes alone). That being said, a thread is a software-defined concept (like processes). It is basically a stack, few registers, and a parent process (with possibly some meta-information and a TLS space). CPUs does not operate directly on such data structure. CPU does not have a concept of thread stack for example (it is just a section of the virtual process memory like any other). They just need an execution context which is composed of registers and a process configuration (in protected mode). For sake of simplicity, we can say that they execute threads. Mainstream modern x86 processors are very complex, and each core is often able to run multiple threads at the same time. This is called simultaneous multithreading (aka. Hyper-Threading for Intel processors). x86 physical cores are typically composed of two logical threads (ie. hardware threads) that can each execute a software threads.

How can I resolve data dependency in pointer arrays?

You've discovered one of the effects that causes bottlenecks in histograms. A workaround for that problem is to keep multiple arrays of counters and rotate through them, so repeated runs of the same index are distributed over 2 or 4 different counters in memory.

(Then loop over the arrays of counters to sum them down into one final set of counts. This part can benefit from SIMD.)
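A hedged sketch of that workaround (bucket count, stripe count, and names are my own): several interleaved counter arrays, plus the reduction loop that sums them at the end (the part that can use SIMD):

#define NBUCKETS 256
#define NSTRIPES 4                 // how many duplicate counter arrays to rotate through

static unsigned counts[NSTRIPES][NBUCKETS];

void histogram(const unsigned char *data, long n) {
    for (long i = 0; i < n; i++)
        counts[i % NSTRIPES][data[i]]++;   // repeats of the same value hit different counters
}

void reduce(unsigned *out) {               // sum the stripes down to one set of counts
    for (int b = 0; b < NBUCKETS; b++) {
        unsigned sum = 0;
        for (int s = 0; s < NSTRIPES; s++)
            sum += counts[s][b];
        out[b] = sum;
    }
}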


Case 1 is fast because the modern CPU knows we read/write the same memory location, thus buffering the operation

No, it's not the CPU, it's a compile-time optimization.

++*pointer[0] is fast because the compiler can hoist the store/reload out of the loop and actually just increment a register. (If you don't use the result, it might optimize away even that.)

Assumption of no data-race UB lets the compiler assume that nothing else is modifying pointer[0] so it's definitely the same object being incremented every time. And the as-if rule lets it keep *pointer[0] in a register instead of actually doing a memory-destination increment.

So that means 1 cycle latency for the increment, and of course it can combine multiple increments into one and do *pointer[0] += n if it fully unrolls and optimizes away the loop.
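For example, with a loop like this (a hypothetical reconstruction of that case, not the OP's exact code), an optimizing compiler is allowed to keep the pointed-to value in a register and collapse the whole loop:

void inc_first(int **pointer, long n) {
    for (long i = 0; i < n; i++)
        ++*pointer[0];
    // With optimization enabled this can become the equivalent of
    //     if (n > 0) *pointer[0] += n;
    // because nothing in the loop can legally modify pointer[0] out from
    // under us (no data races, and an int store can't alias an int*).
}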


when we write to a memory location through pointer a, and then try to read it through pointer b, we have to wait for the write to finish. This stops superscalar execution.

Yes, the data dependency through that memory location is the problem. Without knowing at compile time that the pointers all point to the same place, the compiler will make asm that does actually increment the pointed-to memory location.

"wait for the write to finish" isn't strictly accurate, though. The CPU has a store buffer to decouple store execution from cache misses, and out-of-order speculative exec from stores actually committing to L1d and being visible to other cores. A reload of recently-stored data doesn't have to wait for it to commit to cache; store forwarding from the store-buffer to a reload is a thing once the CPU detects it.

On modern Intel CPUs, store-forwarding latency is about 5 cycles, so a memory-destination add has 6-cycle latency. (1 for the add, 5 for the store/reload if it's on the critical path.)

And yes, out-of-order execution lets two of these 6-cycle-latency dependency chains run in parallel. And the loop overhead is hidden under that latency, again by OoO exec.
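As a small illustration (hypothetical pointers a and b to two different ints, which the compiler can't prove don't alias, so it keeps the memory-destination increments):

void inc_both(int *a, int *b, long n) {
    for (long i = 0; i < n; i++) {
        ++*a;    // ~6-cycle store/reload dependency chain through *a
        ++*b;    // independent chain through *b (when a != b): OoO exec overlaps them,
    }            // so the loop runs at roughly 6 cycles per iteration, not 12
}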

Related:

  • Store-to-Load Forwarding and Memory Disambiguation in x86 Processors
    on stuffedcow.net
  • Store forwarding Address vs Data: What the difference between STD and STA in the Intel Optimization guide?
  • How does store to load forwarding happens in case of unaligned memory access?

