How to Estimate the Thread Context Switching Overhead

How to estimate the thread context switching overhead?

While you said you don't want to write a test application, I did this for a previous test on an ARM9 Linux platform to find out what the overhead is. It was just two threads that would boost::thread::yield() (or, you know) and increment some variable, and after a minute or so (without other running processes, at least none that do something), the app printed how many context switches it could do per second. Of course this is not really exact, but the point is that both threads yielded the CPU to each other, and it was so fast that it just didn't make sense any more to think about the overhead.
So, simply go ahead and just write a simple test instead of thinking too much about a problem that may be non-existent.

Other than that, you might try like 1800 suggested with performance counters.

Oh, and I remember an application running on Windows CE 4.X, where we also have four threads with intensive switching at times, and never ran into performance issues. We also tried to implement the core threading thing without threads at all, and saw no performance improvement (the GUI just responded much slower, but everything else was the same). Maybe you can try the same, by either reducing the number of context switches or by removing threads completely (just for testing).

How to measure the context switching overhead of a very large program?

Are you sure most of those 200 threads are actually waiting to run at the same time, not waiting for data from a system call? I guess you can tell from perf stat that context-switches are actually pretty high, but part of the question is whether they're high for the threads doing the critical work.

The cost of a context-switch is reflected in cache misses once a thread is running again. (And stopping OoO exec from finding as much ILP right at the interrupt boundary). This cost is more significant than the cost of the kernel code that saves/restores registers. So even if there was a way to measure how much time the CPUs spent in kernel context-switch code (possible with perf record sampling profiler as long as your perf_event_paranoid setting allows recording kernel addresses), that wouldn't be an accurate reflection of the true cost.

Even making a system call has a similar (but lower and more frequent) performance cost from serializing OoO exec, as well as disturbing caches (and TLB). There's a useful characterization of this on real modern CPUs (from 2010) in a paper by Livio & Stumm, especially the graph on the first page of IPC (instructions per cycle) dropping after a system call returns, and taking time to recover: FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. (Conference presentation: https://www.usenix.org/conference/osdi10/flexsc-flexible-system-call-scheduling-exception-less-system-calls)

You might estimate context-switch cost by running the program on a system with enough cores not to need to context-switch much at all (e.g. a big many-core Xeon or Epyc), vs. on fewer cores but with the same CPUs / caches / inter-core latency and so on. So, on the same system with taskset --cpu-list 0-8 ./program to limit how many cores it can use.

Look at the total user-space CPU-seconds used: the amount higher is the extra amount of CPU time needed because of slowdowns from context switched. The wall-clock time will of course be higher when the same work has to compete for fewer cores, but perf stat includes a "task-clock" output which tells you a total time in CPU-milliseconds that threads of your process spent on CPUs. That would be constant for the same amount of work, with perfect scaling to more threads, and/or to the same threads competing for more / fewer cores.

But that would tell you about context-switch overhead on that big system with big caches and higher latency between cores than on a small desktop.

Measuring context switch time for threads

First of all, using CLOCK_THREAD_CPUTIME_ID is probably very wrong; this clock will give the time spent in that thread, in user mode. However the context switch does not happen in user mode, You'd want to use another clock. Also, on multiprocessing systems the clocks can give different values from processor to another! Thus I suggest you use CLOCK_REALTIME or CLOCK_MONOTONIC instead. However be warned that even if you read either of these twice in rapid succession, the timestamps usually will tens of nanoseconds apart already.

As for context switches - tthere are many kinds of context switches. The fastest approach is to switch from one thread to another entirely in software. This just means that you push the old registers on stack, set task switched flag so that SSE/FP registers will be lazily saved, save stack pointer, load new stack pointer and return from that function - since the other thread had done the same, the return from that function happens in another thread.

This thread to thread switch is quite fast, its overhead is about the same as for any system call. Switching from one process to another is much slower: this is because the user-space page tables must be flushed and switched by setting the CR0 register; this causes misses in TLB, which maps virtual addresses to physical ones.

However the <1 ns context switch/system call overhead does not really seem plausible - it is very probable that there is either hyperthreading or 2 CPU cores here, so I suggest that you set the CPU affinity on that process so that Linux only ever runs it on say the first CPU core:

#include <sched.h>

cpu_set_t  mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);
result = sched_setaffinity(0, sizeof(mask), &mask);

Then you should be pretty sure that the time you're measuring comes from a real context switch. Also, to measure the time for switching floating point / SSE stacks (this happens lazily), you should have some floating point variables and do calculations on them prior to context switch, then add say .1 to some volatile floating point variable after the context switch to see if it has an effect on the switching time.

How to Estimate the Thread Context Switching Overhead