Determine TSC Frequency on Linux

How stable is TSC (TimeStamp Counter) from user space for Intel x86-64 CPUs in 2020?

It's as stable as the clock crystal on your motherboard, but it's locked to a reference frequency (which depends on the CPU model), not the current CPU core clock frequency. That change happened about 15 years ago (the constant_tsc CPU feature), making the TSC usable for wall-clock timing instead of cycle counting.

For example, the Linux VDSO user-space implementation of clock_gettime uses rdtsc and a scale factor to extrapolate from the less-frequently-updated timestamp maintained by the kernel's timer interrupt. (The VDSO is a set of pages of code and data owned by the kernel, mapped read-only into user-space processes.)
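If you want to check that your system actually takes that rdtsc fast path, one Linux-specific way (a minimal sketch) is to read the kernel's current clocksource from sysfs:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // "tsc" here means the kernel clock (and the VDSO fast path) is rdtsc-based;
    // other common values are "hpet" and "acpi_pm"
    std::ifstream f("/sys/devices/system/clocksource/clocksource0/current_clocksource");
    std::string src;
    f >> src;
    std::cout << src << '\n';
}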

What are the best practices for using the TSC in user space nowadays?

If you want to count core clock cycles, use rdpmc (with a HW perf counter programmed appropriately, and set up so user-space is allowed to read it). Or use perf or some other way of accessing HW perf counters.
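For example, here's a minimal sketch of counting core clock cycles around a region of code with Linux perf_event_open (reading via the fd rather than raw rdpmc), modeled on the perf_event_open(2) man page, with error handling trimmed:

#include <linux/perf_event.h>   // struct perf_event_attr, PERF_COUNT_HW_CPU_CYCLES
#include <sys/syscall.h>        // SYS_perf_event_open
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

// glibc has no wrapper for perf_event_open; call it via syscall(2)
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;   // core clock cycles, not TSC ticks
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);   // this thread, any CPU
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    volatile uint64_t sink = 0;
    for (int i = 0; i < 1000000; i++) sink = sink + i;   // code under test
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles;
    if (read(fd, &cycles, sizeof(cycles)) == sizeof(cycles))
        printf("core cycles: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}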

But other than that, you can use rdtsc directly or indirectly via wrapper libraries.

Depending on your overhead requirements, and how much effort you're willing to put into finding out the TSC frequency so you can relate TSC counts to seconds, you might just use it via std::chrono or libc clock_gettime, which don't need to actually enter the kernel thanks to the VDSO.
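A minimal sketch of that zero-effort route: on Linux, std::chrono::steady_clock bottoms out in clock_gettime(CLOCK_MONOTONIC), which the VDSO typically services with rdtsc plus a scale factor, so no system call happens:

#include <chrono>
#include <stdio.h>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    // ... code under test ...
    auto t1 = std::chrono::steady_clock::now();
    // rdtsc ran under the hood, but we get time in seconds-based units
    // without having to know the TSC frequency ourselves
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
    printf("elapsed: %lld ns\n", (long long)ns.count());
}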

How to get the CPU cycle count in x86_64 from C++? - my answer there has more details about the TSC, including how it worked on older CPUs, and the fact that out-of-order execution means you need lfence before/after rdtsc if you want to wait for earlier code to finish executing before it reads the internal TSC.
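A sketch of that fencing pattern with GCC/Clang intrinsics (assuming lfence serializes instruction execution, which is the case on Intel, and on AMD with Spectre mitigations enabled):

#include <x86intrin.h>   // __rdtsc, _mm_lfence
#include <stdint.h>

static inline uint64_t rdtsc_serialized(void) {
    _mm_lfence();             // wait for earlier instructions to finish
    uint64_t t = __rdtsc();   // then sample the TSC
    _mm_lfence();             // keep later instructions from starting early
    return t;
}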

Measuring chunks of code shorter than a few hundred instructions introduces the complication that throughput and latency are different things, it's not meaningful to measure performance with just a single number. Out-of-order exec means that the surrounding code matters.

and they are gonna remove it from the user space.

x86 has basically never removed anything, and definitely not from user-space. Backwards compat with existing binaries is x86's main claim to fame and reason for continued existence.

rdtsc is documented in Intel's and AMD's x86 manuals, e.g. Intel's vol.2 entry for it. There is a CPU feature that lets the kernel disable RDTSC for user-space (TSD = TimeStamp Disable), but it's not normally used on Linux. (Note the #GP(0) exception condition: "If the TSD flag in register CR4 is set and the CPL is greater than 0". Current Privilege Level 0 = kernel, higher = user-space.)
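Linux does expose per-process control over this via prctl(2). A sketch that asks whether rdtsc is currently allowed for the calling process:

#include <sys/prctl.h>   // PR_GET_TSC, PR_TSC_ENABLE (pulls in linux/prctl.h)
#include <stdio.h>

int main(void) {
    int tsc_state = 0;
    // PR_GET_TSC reports whether this process may execute rdtsc;
    // PR_SET_TSC with PR_TSC_SIGSEGV would make it fault instead
    if (prctl(PR_GET_TSC, &tsc_state) == 0)
        printf("rdtsc %s for this process\n",
               tsc_state == PR_TSC_ENABLE ? "is enabled" : "traps (SIGSEGV)");
    return 0;
}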

IDK if there are any plans to use TSD by default; I'd assume not because it's a useful and efficient timesource. Even if so, on a dev machine where you want to do profiling / microbenchmarking you'd be able to toggle that feature. (Although usually I just put stuff in a large-enough repeat loop in a static executable and run it under perf stat to get total time and HW perf counters.)

Accuracy of rdtsc for benchmarking and time stamp counter frequency

Invariant TSC means, according to Intel,

The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states.

But what rate is that? Well,

That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted. The maximum resolved frequency may differ from the maximum qualified frequency of the processor, see Section 18.14.5 for more detail. On certain processors, the TSC frequency may not be the same as the frequency in the brand string.

Looks to me as though they wanted it to be the frequency from the brand string, but then somehow didn't always get it right.
What is that frequency, though?

The TSC, IA32_MPERF, and IA32_FIXED_CTR2 operate at the same, maximum-resolved frequency of the platform, which is equal to the product of scalable bus frequency and maximum resolved bus ratio.

For processors based on Intel Core microarchitecture, the scalable bus frequency is encoded in the bit field MSR_FSB_FREQ[2:0] at (0CDH), see Appendix B, "Model-Specific Registers (MSRs)". The maximum resolved bus ratio can be read from the following bit field:

If XE operation is disabled, the maximum resolved bus ratio can be read in MSR_PLATFORM_ID[12:8]. It corresponds to the maximum qualified frequency.

If XE operation is enabled, the maximum resolved bus ratio is given in MSR_PERF_STAT[44:40], it corresponds to the maximum XE operation frequency configured by BIOS.
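If you really want to chase those registers, they can be read from user space on Linux through the msr driver (modprobe msr; needs root). A sketch, assuming a Core-microarchitecture-era CPU where MSR_PLATFORM_ID (0x17) exists with that layout:

#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

// read one MSR on CPU 0 via the Linux msr driver; the file offset is the MSR number
static int read_msr(uint32_t reg, uint64_t *val) {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) return -1;
    ssize_t n = pread(fd, val, sizeof(*val), reg);
    close(fd);
    return n == (ssize_t)sizeof(*val) ? 0 : -1;
}

int main(void) {
    uint64_t platform_id;
    if (read_msr(0x17, &platform_id) == 0)   // MSR_PLATFORM_ID
        printf("max resolved bus ratio: %u\n",
               (unsigned)((platform_id >> 8) & 0x1F));   // bits [12:8]
    else
        perror("rdmsr");
    return 0;
}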

That's probably not very helpful though. TL;DR: finding the TSC rate programmatically is too much effort. You can of course easily find it on your own system: just get an inaccurate guess based on a timed loop and take the "nearest number that makes sense". It's probably the number from the brand string anyway. It has been on all systems I've tested it on, but I haven't tested that many. And if it isn't, then it'll be some significantly different rate, so you will definitely know.
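A sketch of that timed-loop guess, calibrating rdtsc against CLOCK_MONOTONIC_RAW (which isn't NTP-slewed) over a couple hundred milliseconds:

#include <x86intrin.h>   // __rdtsc
#include <time.h>        // clock_gettime, nanosleep
#include <stdint.h>
#include <stdio.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + ts.tv_nsec;
}

int main(void) {
    uint64_t ns0 = now_ns(), tsc0 = __rdtsc();
    struct timespec pause = {0, 200 * 1000 * 1000};   // ~200 ms
    nanosleep(&pause, NULL);
    uint64_t ns1 = now_ns(), tsc1 = __rdtsc();
    double ghz = (double)(tsc1 - tsc0) / (double)(ns1 - ns0);
    // round this to the nearest plausible value, e.g. the brand-string frequency
    printf("estimated TSC frequency: ~%.4f GHz\n", ghz);
    return 0;
}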

In addition, does this mean the time obtained by using the TSC ticks and CPU frequency isn't the actual time used by the code piece?

Yes. However, not all hope is lost: the time obtained by using TSC ticks and the TSC rate (if you somehow know it) will give the actual time... almost. A lot of FUD about unreliability usually gets spouted here. Yes, RDTSC is not serializing (but you can add serializing instructions). RDTSCP is serializing, but in some ways not quite enough (it can't execute too early, but it can execute too late). But it's not like you can't use them: you can either accept a small error, or read the paper linked below.
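The usual mitigation for "can execute too late" is an lfence right after it. A sketch for the end of a timed region:

#include <x86intrin.h>   // __rdtscp, _mm_lfence
#include <stdint.h>

static inline uint64_t rdtscp_region_end(void) {
    unsigned aux;                  // receives IA32_TSC_AUX (core/node id)
    uint64_t t = __rdtscp(&aux);   // waits for earlier instructions to finish
    _mm_lfence();                  // stops later ones from slipping before it
    return t;
}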

But can it be assumed to be synchronized among cores on newer CPUs?

Yes, no, maybe - it will be synchronized, unless the TSC is written to. Who knows, someone might do it. Out of your control. It also won't be synchronized across different sockets.

Finally, I don't really buy the FUD about RDTSC(P) in the context of benchmarking. You can serialize it as much as you need, TSC is invariant, and you know the rate because it's your system. There isn't really any alternative either, it's basically the source of high resolution time measurement that in the end everything else ends up using anyway. Even without special precautions (but with filtering of your data) the accuracy and precision are fine for most benchmarks, and if you need more then read How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures, they write a kernel module so they can get rid of two other sources of benchmark error that are subject to much FUD, preemptions and interrupts.

Can constant non-invariant tsc change frequency across cpu states?

Starting with Nehalem and Saltwell, all Intel processors support invariant TSC, which means that the TSC is incremented at a constant rate across P-, C-, and T-states (but not necessarily across S-states).
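As an aside, you can check for invariant TSC yourself via CPUID leaf 0x80000007, EDX bit 8 (on Linux it also shows up as the nonstop_tsc flag in /proc/cpuinfo). A sketch using GCC/Clang's <cpuid.h>:

#include <cpuid.h>   // __get_cpuid (GCC/Clang)
#include <stdio.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    // leaf 0x80000007 (Advanced Power Management), EDX bit 8 = invariant TSC
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        printf("invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
    else
        puts("CPUID leaf 0x80000007 not supported");
    return 0;
}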

Starting with Pentium 4 Family 0F Model 03, all Intel processors support constant TSC, which means that the TSC is incremented at a constant rate across P- and T-states. The TSC continues to increment in the HLT state (called Auto Halt or C1/Auto Halt). TSC doesn't increment in any other sleep state. This category of processors includes Bonnell.

Older processors don't support constant TSC. The TSC continues to increment in the HLT state, but not in deeper sleep states. On some of these processors, TSC is buggy.

The TSC value may be reinitialized (to some BIOS-dependent value) when waking up from an S-state.

Here is a summary. "Y" means that TSC continues to increment at the same rate across the specified type of states. "N" means that TSC either continues to increment at a different rate or stops incrementing. On a few processors, TSC is incremented in the S3 state and lower (this is called always-on TSC). "N/A" means that TSC is not supported.

                                  |   T   |   P   |C = HLT|C Other|S <= S3|S Other|
----------------------------------+-------+-------+-------+-------+-------+-------+
Nehalem+                          |   Y   |   Y   |   Y   |   Y   |   N   |   N   |
Silvermont Merrifield+Moorefield, |   Y   |   Y   |   Y   |   Y   |   Y   |   N   |
Saltwell Penwell+Cloverview       |       |       |       |       |       |       |
Other Saltwell+                   |   Y   |   Y   |   Y   |   Y   |   N   |   N   |
KNL+                              |   Y   |   Y   |   Y   |   Y   |   N   |   N   |
P4 90nm+                          |   Y   |   Y   |   Y   |   N   |   N   |   N   |
Enhanced Pentium M+               |   Y   |   Y   |   Y   |   N   |   N   |   N   |
Bonnell                           |   Y   |   Y   |   Y   |   N   |   N   |   N   |
Quark X1000                       |   Y   |   N   |   Y   |   N   |   N   |   N   |
KNC                               |   Y   |   N   |   Y   |   N   |   N   |   N   |
P5+                               |   Y   |   N   |   Y   |   N   |   N   |   N   |
Before P5                         |  N/A  |  N/A  |  N/A  |  N/A  |  N/A  |  N/A  |
Other Quark                       |  N/A  |  N/A  |  N/A  |  N/A  |  N/A  |  N/A  |

How to get the CPU cycle count in x86_64 from C++?

The __rdtsc() intrinsic is supported by both MSVC and GCC (since GCC 4.5).

But the include that's needed is different:

#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
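Combined with that include block, usage is just a pair of reads (a sketch; subtract, then scale by the TSC frequency to get seconds):

#include <stdint.h>

uint64_t time_region(void) {
    uint64_t start = __rdtsc();
    // ... code under test ...
    uint64_t stop = __rdtsc();
    return stop - start;   // TSC reference ticks, not core clock cycles
}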

Here's the original answer, for compilers older than GCC 4.5.

Pulled directly out of one of my projects:

#include <stdint.h>

// Windows
#ifdef _WIN32

#include <intrin.h>
uint64_t rdtsc(){
    return __rdtsc();
}

// Linux/GCC
#else

uint64_t rdtsc(){
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

#endif

This GNU C Extended asm tells the compiler:

  • volatile: the outputs aren't a pure function of the inputs (so it has to re-run every time, not reuse an old result).
  • "=a"(lo) and "=d"(hi) : the output operands are fixed registers: EAX and EDX. (x86 machine constraints). The x86 rdtsc instruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with "=r" wouldn't work: there's no way to ask the CPU for the result to go anywhere else.
  • ((uint64_t)hi << 32) | lo - zero-extend both 32-bit halves to 64-bit (because lo and hi are unsigned), and logically shift + OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values still just stay in a pair of 32-bit registers. In 64-bit code you typically get an actual shift + OR asm instructions, unless the high half optimizes away.

(editor's note: this could probably be more efficient if you used unsigned long instead of unsigned int. Then the compiler would know that lo was already zero-extended into RAX. It wouldn't know that the upper half was zero, so | and + are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)

See https://gcc.gnu.org/wiki/DontUseInlineAsm - avoid inline asm if you can. But hopefully this section is useful if you need to understand old code that uses inline asm, so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info


