Is Rdtsc Timer Inaccurate in Linux

Accuracy of rdtsc for benchmarking and time stamp counter frequency

Invariant TSC means, according to Intel,

The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states.

But what rate is that? Well,

That rate may be set by the
maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at
which the processor is booted. The maximum resolved frequency may differ from the maximum qualified
frequency of the processor, see Section 18.14.5 for more detail. On certain processors, the TSC frequency may
not be the same as the frequency in the brand string.

Looks to me as though they wanted it to be the frequency from the brand string, but then somehow didn't always get it right..
What is that frequency though?

The TSC, IA32_MPERF, and IA32_FIXED_CTR2 operate at the same, maximum-resolved frequency of the platform, which is equal to the product of scalable bus frequency and maximum resolved bus ratio.

For processors based on Intel Core microarchitecture, the scalable bus frequency is encoded in the bit field MSR_FSB_FREQ[2:0] at (0CDH), see Appendix B, "Model-Specific Registers (MSRs)". The maximum resolved bus ratio can be read from the following bit field:

If XE operation is disabled, the maximum resolved bus ratio can be read in MSR_PLATFORM_ID[12:8]. It corresponds to the maximum qualified frequency.

If XE operation is enabled, the maximum resolved bus ratio is given in MSR_PERF_STAT[44:40], it corresponds to the maximum XE operation frequency configured by BIOS.

That's probably not very helpful though. TL;DR, finding the TSC rate programmatically is too much effort. You can of course easily find it on your own system, just get an inaccurate guess based on a timed loop and take the "nearest number that makes sense". It's probably the number from the brand string anyway. It has been on all systems I've tested it on, but I haven't tested that many. And if it isn't, then it'll be some significantly differing rate, so you will definitely know.

In addition, does this mean the time obtained by using the TSC ticks and CPU frequency isn't the actual time used by the code piece?

Yes however not all hope is lost, the time obtained by using TSC ticks and the TSC rate (if you somehow know it) will give the actual time .. almost? Here usually a lot of FUD about unreliability is spouted. Yes, RDTSC is not serializing (but you can add serializing instructions). RDTSCP is serializing, but in some ways not quite enough (it can't execute too early, but it can execute too late). But it's not like you can't use them, you can either accept a small error, or read the paper I linked below.

But can it be assumed to be synchronized among cores on newer CPUs?

Yes, no, maybe - it will be synchronized, unless the TSC is written to. Who knows, someone might do it. Out of your control. It also won't be synchronized across different sockets.

Finally, I don't really buy the FUD about RDTSC(P) in the context of benchmarking. You can serialize it as much as you need, TSC is invariant, and you know the rate because it's your system. There isn't really any alternative either, it's basically the source of high resolution time measurement that in the end everything else ends up using anyway. Even without special precautions (but with filtering of your data) the accuracy and precision are fine for most benchmarks, and if you need more then read How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures, they write a kernel module so they can get rid of two other sources of benchmark error that are subject to much FUD, preemptions and interrupts.

rdtsc accuracy across CPU cores

X86_FEATURE_CONSTANT_TSC + X86_FEATURE_NONSTOP_TSC bits in cpuid (edx=x80000007, bit #8; check unsynchronized_tsc function of linux kernel for more checks)

Intel's Designer's vol3b, section 16.11.1 Invariant TSC it says the following

"16.11.1 Invariant TSC
The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8].
The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource."

So, if TSC can be used for wallclock, they are guaranteed to be in sync.

How to ensure that RDTSC is accurate?

Very old CPU's have a RDTSC that is accurate.

The problem

However newer CPU's have a problem.

Engineers decided that RDTSC would be great for telling time.

However if a CPU throttles the frequency RDTSC is useless for telling time.

The aforementioned braindead engineers then decided to 'fix' this problem by having the TSC always run at the same frequency, even if the CPU slows down.

This has the 'advantage' that TSC can be used for telling elapsed (wall clock) time. However it makes the TSC ~~useless~~ less useful for profiling.

How to tell if your CPU is not broken

You can tell if your CPU is fine by reading the TSC_invariant bit in the CPUID.

Set EAX to 80000007H and read bit 8 of EDX.

If it is 0 then your CPU is fine.

If it's 1 then your CPU is broken and you need to make sure you profile whilst running the CPU at full throttle.

function IsTimerBroken: boolean;
{$ifdef CPUX86}
asm
  //Make sure RDTSC measure CPU cycles, not wall clock time.
  push ebx
  mov eax,$80000007  //Has TSC Invariant support?
  cpuid
  pop ebx
  xor eax,eax        //Assume no
  and edx,$10        //test TSC_invariant bit
  setnz al           //if set, return true, your PC is broken.
end;
{$endif}
  //Make sure RDTSC measure CPU cycles, not wall clock time.
{$ifdef CPUX64}
asm
  mov r8,rbx
  mov eax,$80000007  //TSC Invariant support?
  cpuid
  mov rbx,r8
  xor eax,eax
  and edx,$10 //test bit 8
  setnz al
end;
{$endif}

How to fix out of order execution issues

See: http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

Use the following code:

function RDTSC: int64;
{$IFDEF CPUX64}
asm
  {$IFDEF AllowOutOfOrder}
  rdtsc
  {$ELSE}
  rdtscp        // On x64 we can use the serializing version of RDTSC
  push rbx      // Serialize the code after, to avoid OoO sneaking in
  push rax      // later instructions before the RDTSCP runs.
  push rdx      // See: http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
  xor eax,eax
  cpuid
  pop rdx
  pop rax
  pop rbx
  {$ENDIF}
  shl rdx,32
  or rax,rdx
  {$ELSE}
{$IFDEF CPUX86}
asm
  {$IFNDEF AllowOutOfOrder}
  xor eax,eax
  push ebx
  cpuid         // On x86 we can't assume the existance of RDTSP
  pop ebx       // so use CPUID to serialize
  {$ENDIF}
  rdtsc
  {$ELSE}
error!
{$ENDIF}
{$ENDIF}
end;

How to run RDTSC on a broken CPU

The trick is to force the CPU to run at 100%.

This is usually done by running the sample code many many times.

I usually use 1.000.000 to start with.

I then time those 1 million runs 10x and take the lowest time of those attempts.

Comparisons with theoretical timings show that this gives very accurate results.

Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?

As long as your thread stays on the same CPU core, the RDTSC instruction will keep returning an increasing number until it wraps around. For a 2GHz CPU, this happens after 292 years, so it is not a real issue. You probably won't see it happen. If you expect to live that long, make sure your computer reboots, say, every 50 years.

The problem with RDTSC is that you have no guarantee that it starts at the same point in time on all cores of an elderly multicore CPU and no guarantee that it starts at the same point in time time on all CPUs on an elderly multi-CPU board.

Modern systems usually do not have such problems, but the problem can also be worked around on older systems by setting a thread's affinity so it only runs on one CPU. This is not good for application performance, so one should not generally do it, but for measuring ticks, it's just fine.

(Another "problem" is that many people use RDTSC for measuring time, which is not what it does, but you wrote that you want CPU cycles, so that is fine. If you do use RDTSC to measure time, you may have surprises when power saving or hyperboost or whatever the multitude of frequency-changing techniques are called kicks in. For actual time, the clock_gettime syscall is surprisingly good under Linux.)

I would just write rdtsc inside the asm statement, which works just fine for me and is more readable than some obscure hex code. Assuming it's the correct hex code (and since it neither crashes and returns an ever-increasing number, it seems so), your code is good.

If you want to measure the number of ticks a piece of code takes, you want a tick difference, you just need to subtract two values of the ever-increasing counter. Something like uint64_t t0 = rdtsc(); ... uint64_t t1 = rdtsc() - t0;
Note that for if very accurate measurements isolated from surrounding code are necessary, you need to serialize, that is stall the pipeline, prior to calling rdtsc (or use rdtscp which is only supported on newer processors). The one serializing instruction that can be used at every privilegue level is cpuid.

In reply to the further question in the comment:

The TSC starts at zero when you turn on the computer (and the BIOS resets all counters on all CPUs to the same value, though some BIOSes a few years ago did not do so reliably).

Thus, from your program's point of view, the counter started "some unknown time in the past", and it always increases with every clock tick the CPU sees. Therefore if you execute the instruction returning that counter now and any time later in a different process, it will return a greater value (unless the CPU was suspended or turned off in between). Different runs of the same program get bigger numbers, because the counter keeps growing. Always.

Now, clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is a different matter. This is the CPU time that the OS has given to the process. It starts at zero when your process starts. A new process starts at zero, too. Thus, two processes running after each other will get very similar or identical numbers, not ever growing ones.

clock_gettime(CLOCK_MONOTONIC_RAW) is closer to how RDTSC works (and on some older systems is implemented with it). It returns a value that ever increases. Nowadays, this is typically a HPET. However, this is really time, and not ticks. If your computer goes into low power state (e.g. running at 1/2 normal frequency), it will still advance at the same pace.

Calculate system time using rdtsc

Don't do that -using yourself directly the RDTSC machine instruction- (because your OS scheduler could reschedule other threads or processes at arbitrary moments, or slow down the clock). Use a function provided by your library or OS.

My main objective is to avoid the need to perform system call every time I want to know the system time

On Linux, read time(7) then use clock_gettime(2) which is really quick (and does not involve any slow system call) thanks to vdso(7).

On a C++11 compliant implementation, simply use the standard <chrono> header. And standard C has clock(3) (giving microsecond precision). Both would use on Linux good enough time measurement functions (so indirectly vdso)

Last time I measured clock_gettime it often took less than 4 nanoseconds per call.

Timing a process in C using clock(), time(), clock_gettimes() and the rdtsc() intrinsic returning confusing values

If someone else comes across this I came across this paper and am using this to time my code https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

Is clock_gettime() adequate for submicrosecond timing?

No. You'll have to use platform-specific code to do it. On x86 and x86-64, you can use 'rdtsc' to read the Time Stamp Counter.

Just port the rdtsc assembly you're using.

__inline__ uint64_t rdtsc(void) {
  uint32_t lo, hi;
  __asm__ __volatile__ (      // serialize
  "xorl %%eax,%%eax \n        cpuid"
  ::: "%rax", "%rbx", "%rcx", "%rdx");
  /* We cannot use "=A", since this would use %rax on x86_64 and return only the lower 32bits of the TSC */
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return (uint64_t)hi << 32 | lo;
}