Getting CPU Cycles Using RDTSC - Why Does the Value of RDTSC Always Increase?

Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?

As long as your thread stays on the same CPU core, the RDTSC instruction will keep returning an increasing number until it wraps around. For a 2GHz CPU, this happens after about 292 years (2^64 ticks / 2×10^9 ticks per second ≈ 9.2×10^9 seconds), so it is not a real issue. You probably won't see it happen. If you expect to live that long, make sure your computer reboots, say, every 50 years.

The problem with RDTSC is that you have no guarantee that it starts at the same point in time on all cores of an elderly multicore CPU and no guarantee that it starts at the same point in time on all CPUs on an elderly multi-CPU board.

Modern systems usually do not have such problems, but the problem can also be worked around on older systems by setting a thread's affinity so it only runs on one CPU. This is not good for application performance, so one should not generally do it, but for measuring ticks, it's just fine.
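
If you do want to pin a thread on Linux, here is a minimal sketch using glibc's sched_setaffinity (the pin_to_cpu helper name and the choice of CPU 0 are ours, for illustration):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 means "the calling thread"
    return sched_setaffinity(0, sizeof(set), &set);
}

int main(void) {
    if (pin_to_cpu(0) != 0)
        perror("sched_setaffinity");
    // ... RDTSC-based measurements now stay on one core ...
}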

(Another "problem" is that many people use RDTSC for measuring time, which is not what it does, but you wrote that you want CPU cycles, so that is fine. If you do use RDTSC to measure time, you may have surprises when power saving or hyperboost or whatever the multitude of frequency-changing techniques are called kicks in. For actual time, the clock_gettime syscall is surprisingly good under Linux.)

I would just write rdtsc inside the asm statement, which works just fine for me and is more readable than some obscure hex code. Assuming it's the correct hex code (and since it doesn't crash and does return an ever-increasing number, it seems so), your code is good.

If you want to measure the number of ticks a piece of code takes, you want a tick difference; just subtract two values of the ever-increasing counter. Something like uint64_t t0 = rdtsc(); ... uint64_t t1 = rdtsc() - t0;
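
As a hedged sketch of that pattern in GNU C, writing rdtsc inside the asm statement as suggested above (the rdtsc() wrapper name is ours):

#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));  // EDX:EAX = TSC
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    uint64_t t0 = rdtsc();
    // ... code to measure ...
    uint64_t ticks = rdtsc() - t0;
    (void)ticks;
}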

Note that if very accurate measurements isolated from surrounding code are necessary, you need to serialize, that is, stall the pipeline, prior to calling rdtsc (or use rdtscp, which is only supported on newer processors). The one serializing instruction that can be used at every privilege level is cpuid.
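
A minimal sketch of the rdtscp variant, assuming a compiler that provides the __rdtscp intrinsic (the wrapper name is ours; the cpuid option is shown further down this page):

#include <stdint.h>
#include <x86intrin.h>

static inline uint64_t rdtscp_ticks(unsigned *aux) {
    // rdtscp waits for all earlier instructions to complete before reading the
    // counter and writes a processor ID into *aux, but it does not stop later
    // instructions from starting early.
    return __rdtscp(aux);
}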

In reply to the further question in the comment:

The TSC starts at zero when you turn on the computer (and the BIOS resets all counters on all CPUs to the same value, though some BIOSes a few years ago did not do so reliably).

Thus, from your program's point of view, the counter started "some unknown time in the past", and it always increases with every clock tick the CPU sees. Therefore if you execute the instruction returning that counter now and any time later in a different process, it will return a greater value (unless the CPU was suspended or turned off in between). Different runs of the same program get bigger numbers, because the counter keeps growing. Always.

Now, clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is a different matter. This is the CPU time that the OS has given to the process. It starts at zero when your process starts. A new process starts at zero, too. Thus, two processes running after each other will get very similar or identical numbers, not ever growing ones.

clock_gettime(CLOCK_MONOTONIC_RAW) is closer to how RDTSC works (and on many systems it is actually backed by the TSC, with a HPET or another platform timer as the fallback). It returns a value that always increases. However, this is really time, not ticks: if your computer goes into a low-power state (e.g. running at 1/2 the normal frequency), it will still advance at the same pace.
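
A hedged illustration of that difference: right at process start, CLOCK_PROCESS_CPUTIME_ID reads near zero, while CLOCK_MONOTONIC_RAW reports a large, ever-growing value (roughly time since boot):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec cpu, raw;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu);  // near zero for a new process
    clock_gettime(CLOCK_MONOTONIC_RAW, &raw);       // keeps growing across runs
    printf("process CPU time: %ld.%09ld s\n", (long)cpu.tv_sec, cpu.tv_nsec);
    printf("monotonic raw:    %ld.%09ld s\n", (long)raw.tv_sec, raw.tv_nsec);
}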

How to count clock cycles with RDTSC in GCC x86?



You don't need and shouldn't use inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. https://gcc.gnu.org/wiki/DontUseInlineAsm

Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics. (Intel's intrinsics guide says #include <immintrin.h> for this, but with gcc and clang the non-SIMD intrinsics are mostly in x86intrin.h.)

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
unsigned long long readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    return __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
}

Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer.

For more about using lfence to improve repeatability of rdtsc, see @HadiBrais' answer on clflush to invalidate cache line via C function.

See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset.)


rdtsc counts reference cycles, not CPU core clock cycles

It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (except for system clock adjustments, so it's basically steady_clock). It ticks at the CPU's reference frequency, which is equal or close to the advertised sticker frequency.

If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. Or better, use a library that gives you access to hardware performance counters, or a trick like perf stat for part of program if your timed region is long enough that you can attach a perf stat -p PID. You usually will still want to avoid CPU frequency shifts during your microbenchmark, though.
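
A minimal warm-up sketch, assuming a few hundred million reference ticks (roughly 0.1 to 0.25 s on typical CPUs) is enough for the clock to ramp up; the threshold is a guess you should tune for your machine:

#include <stdint.h>
#include <x86intrin.h>

static void warm_up(void) {
    uint64_t start = __rdtsc();
    volatile uint64_t sink = 0;
    while (__rdtsc() - start < 500000000ULL)   // spin for ~0.5e9 reference ticks
        sink += 1;
    (void)sink;
}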

  • std::chrono::clock, hardware clock and cycle count
  • Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
  • Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC

It's also not guaranteed that the TSCs of all cores are in sync. So if your thread migrates to another CPU core between __rdtsc(), there can be an extra skew. (Most OSes attempt to sync the TSCs of all cores, though.) If you're using rdtsc directly, you probably want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux.


How good is the asm from using the intrinsic?

It's at least as good as anything you could do with inline asm.

A non-inline version of it compiles with MSVC for x86-64 like this:

unsigned __int64 readTSC(void) PROC                             ; readTSC
    rdtsc
    shl   rdx, 32                          ; 00000020H
    or    rax, rdx
    ret   0
                                           ; return in RAX

For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.

In a test caller that uses it twice and subtracts to time an interval:

uint64_t time_something() {
    uint64_t start = readTSC();
    // even when empty, back-to-back __rdtsc() don't optimize away
    return readTSC() - start;
}

All 4 compilers make pretty similar code. This is GCC's 32-bit output:

# gcc8.2 -O3 -m32
time_something():
    push  ebx                 # save a call-preserved reg: 32-bit only has 3 scratch regs
    rdtsc
    mov   ecx, eax
    mov   ebx, edx            # start in ebx:ecx
    # timed region (empty)

    rdtsc
    sub   eax, ecx
    sbb   edx, ebx            # edx:eax -= ebx:ecx

    pop   ebx
    ret                       # return value in edx:eax

This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.

# MSVC 19  2017  -Ox
unsigned __int64 time_something(void) PROC                      ; time_something
    rdtsc
    shl   rdx, 32             ; high <<= 32
    or    rax, rdx
    mov   rcx, rax            ; missed optimization: lea rcx, [rdx+rax]
                              ; rcx = start
    ;; timed region (empty)

    rdtsc
    shl   rdx, 32
    or    rax, rdx            ; rax = end

    sub   rax, rcx            ; end -= start
    ret   0
unsigned __int64 time_something(void) ENDP                      ; time_something

All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.

But writing it in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.

RDTSCP in NASM always returns the same value (timing a single instruction)

Your first code (leading to the title question) is buggy because it overwrites the rdtsc and rdtscp results with the cpuid results in EAX,EBX,ECX and EDX.

Use lfence instead of cpuid; on Intel since forever, and on AMD with Spectre mitigation enabled, lfence will serialize the instruction stream and thus do what you want with rdtsc.


Remember that RDTSC counts reference cycles, not core clock cycles. See Get CPU cycle count? for that and more about RDTSC.

You don't have cpuid or lfence inside your measurement interval. But you do have rdtscp itself in the measurement interval. Back-to-back rdtscp is not fast, and 64 reference cycles sounds totally reasonable if you ran without warming up the CPU. Idle clock speed is usually a lot lower than the reference frequency; the reference frequency is equal or close to the "sticker" frequency, i.e. the max non-turbo sustained frequency, on Intel CPUs, e.g. 4008 MHz on a "4GHz" Skylake CPU.


This is not how you time a single instruction

What matters is latency before another instruction can use the result, not latency until it fully retires from the out-of-order back-end. RDTSC can be useful for timing relative variations in how long one load or one store instruction takes, but the overhead means you won't get a good absolute time.

You can try to subtract measurement overhead, though. e.g. clflush to invalidate cache line via C function. And see also the followups: Using time stamp counter and clock_gettime for cache miss and Memory latency measurement with time stamp counter.
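
One hedged sketch of subtracting overhead: time an empty region many times, keep the minimum, and treat that minimum as the cost of the timing wrapper itself (the lfence placement and trial count here are illustrative, not canonical):

#include <stdint.h>
#include <x86intrin.h>

static uint64_t tsc_overhead(void) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < 100000; i++) {
        _mm_lfence();
        uint64_t t0 = __rdtsc();
        _mm_lfence();
        // empty timed region
        _mm_lfence();
        uint64_t t1 = __rdtsc();
        _mm_lfence();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;   // subtract this from later measurements
}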


This is what I usually use to profile latency or throughput (and uops, fused and unfused domain) of an instruction or short block. Adjust how you use it to bottleneck on latency like here, or not if you want to just test throughput, e.g. with a %rep block with enough different registers to hide latency, or break dependency chains with a pxor xmm3, xmm3 after a short block and let out-of-order exec work its magic. (As long as you don't bottleneck on the front-end.)

You might want to use NASM's smartalign package, or use YASM, to avoid a wall of single-byte NOP instructions for the ALIGN directive. NASM defaults to really stupid NOPs even in 64-bit mode where long-NOP is always supported.

global _start
_start:
    mov   ecx, 1000000000
; linux static executables start with XMM0..15 already zeroed
    align 32                  ; just for good measure to avoid uop-cache effects
.loop:
    ;; LOOP BODY, put whatever you want to time in here
    times 4 addsd xmm4, xmm3

    dec   ecx
    jnz   .loop

    mov   eax, 231
    xor   edi, edi
    syscall                   ; x86-64 Linux sys_exit_group(0)

Run this with something like this one-liner that links it into a static executable and profiles it with perf stat, which you can up-arrow and re-run every time you change the source:

(I actually put the nasm+ld + optional disassemble into a shell script called asm-link, to save typing when I'm not profiling. Disassembling makes sure that what's in your loop is what you meant to profile, especially if you have some %if stuff in your code. And also so it's on your terminal right before the profile, if you want to scroll back while testing theories in your head.)

t=testloop; nasm -felf64 -g "$t.asm" && ld "$t.o" -o "$t" &&  objdump -drwC -Mintel "$t" &&
taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread -r4 ./"$t"

Result from i7-6700k at 3.9GHz (current perf has a unit-scaling display bug for the secondary column. It's fixed upstream but Arch Linux hasn't updated yet.):

 Performance counter stats for './testloop' (4 runs):

          4,106.09 msec task-clock            #     1.000 CPUs utilized          ( +- 0.01% )
                17      context-switches      #     4.080 M/sec                  ( +- 5.65% )
                 0      cpu-migrations        #     0.000 K/sec
                 2      page-faults           #     0.487 M/sec
    16,012,778,144      cycles                # 3900323.504 GHz                  ( +- 0.01% )
     1,001,537,894      branches              # 243950284.862 M/sec              ( +- 0.00% )
     6,008,071,198      instructions          #     0.38  insn per cycle         ( +- 0.00% )
     5,013,366,769      uops_issued.any       # 1221134275.667 M/sec             ( +- 0.01% )
     5,013,217,655      uops_executed.thread  # 1221097955.182 M/sec             ( +- 0.01% )

          4.106283 +- 0.000536 seconds time elapsed  ( +- 0.01% )

On my i7-6700k (Skylake), addsd has 4 cycle latency, 0.5c throughput. (i.e. 2 per clock, if latency wasn't the bottleneck). See https://agner.org/optimize/, https://uops.info/, and http://instlatx64.atw.hu/.

16 cycles per branch = 16 cycles per chain of 4 addsd = 4 cycle latency for addsd, reproducing Agner Fog's measurement of 4 cycles to better than 1 part in 100 even for this test that includes a tiny bit of startup overhead, and interrupt overhead.

Take your pick of different counters to record. Adding a :u, like instructions:u, to a perf event will count only user-space instructions, excluding any that ran during interrupt handlers. I usually don't do that, so I can see that overhead as part of the explanation for wall-clock time. But if you do, cycles:u can match very closely with instructions:u.

-r4 runs it 4 times and averages, which can be useful to see if there's a lot of run-to-run variation instead of just getting one average from a higher value in ECX.

Adjust your initial ECX value to make the total time about 0.1 to 1 second, that's usually plenty, especially if your CPU ramps up to max turbo very quickly (e.g. Skylake with hardware P-states and a fairly aggressive energy_performance_preference). Or max non-turbo with turbo disabled.

But this counts in core clock cycles, not reference cycles, so it still gives the same result regardless of CPU frequency changes. (+- some noise from stopping the clock during the transition.)

Calculating CPU frequency in C with RDTSC always returns 0

Okay, since the other answer wasn't helpful, I'll try to explain in more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like:

rdtsc
push 1
call sleep
rdtsc

Modern CPUs do not necessarily execute instructions in their original order though. Despite your original order, the CPU is (mostly) free to execute that just like:

rdtsc
rdtsc
push 1
call sleep

In this case, it's clear why the difference between the two rdtscs would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never rearrange to execute out of order. The most common instruction to use for that is CPUID. The other answer I linked should (if memory serves) start roughly from there, about the steps necessary to use CPUID correctly/effectively for this task.
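
A hedged sketch of that fix in C: serialize with CPUID (via GCC/clang's __get_cpuid) before each rdtsc read, and time a roughly one-second sleep; the serialized_rdtsc helper name and the use of sleep(1) are ours. Note that this estimates the TSC's reference frequency, not the current core clock.

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <cpuid.h>
#include <x86intrin.h>

static inline uint64_t serialized_rdtsc(void) {
    unsigned a, b, c, d;
    __get_cpuid(0, &a, &b, &c, &d);   // CPUID: the CPU won't reorder across it
    return __rdtsc();
}

int main(void) {
    uint64_t t0 = serialized_rdtsc();
    sleep(1);                          // roughly one second of wall time
    uint64_t t1 = serialized_rdtsc();
    printf("~%llu TSC ticks per second\n", (unsigned long long)(t1 - t0));
}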

Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.

Edit: as to why the code would work: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).

What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).

Does RDTSCP increment monotonically across multi-cores?

On most systems yes, if you create synchronization between threads to make sure that one actually does run after the other (see footnote 1 below). Otherwise all bets are off; starting one thread before another does not ensure that its code executes first.

Footnote 1: e.g. having one spin-wait until it sees an atomic store done by the other. Or use a mutex and run rdtscp in a critical section, along with a variable to record whether the other thread was already there.
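
A minimal sketch of that footnote, assuming pthreads and C11 atomics (the reader_a / reader_b names and the spin-wait on a published TSC value are ours; compile with -pthread):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>
#include <x86intrin.h>

static _Atomic uint64_t first_tsc = 0;   // 0 means "not published yet"

static void *reader_a(void *arg) {
    (void)arg;
    unsigned aux;
    uint64_t t = __rdtscp(&aux);                       // read TSC on core A
    atomic_store_explicit(&first_tsc, t, memory_order_release);
    return NULL;
}

static void *reader_b(void *arg) {
    (void)arg;
    uint64_t t_a;
    while ((t_a = atomic_load_explicit(&first_tsc, memory_order_acquire)) == 0)
        ;                                              // spin until A has stored
    unsigned aux;
    uint64_t t_b = __rdtscp(&aux);                     // now read TSC on core B
    printf("B - A = %lld ticks\n", (long long)(t_b - t_a));
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&b, NULL, reader_b, NULL);
    pthread_create(&a, NULL, reader_a, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
}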


On anything non-ancient (like Core2 and newer at least), TSC ticks at constant frequency (the "reference") frequency. See this answer for links and details about the constant_tsc / nonstop_tsc CPU features, and the possibility of TSC not being synced.

Most modern systems in practice do have the TSC synced between cores I think, thanks to motherboard vendors making sure that even on multi-socket systems the RESET signal is distributed to all cores at the same time. And firmware and OS software taking care not to screw it up. It's much easier on a single-socket system like a normal desktop with a multicore CPU where all the "extra" cores are on the same chip.

But this is not guaranteed, and part of why rdtscp exists (with a processor ID output) is this possibility (which I think might have been more common on older systems when RDTSCP was new).

There are even CPU features VMs can use to offset and scale the TSC transparently (with HW support), to migrate VMs between physical machines while preserving monotonicity and frequency of the TSC. Using these features indiscriminately can of course produce desynced TSCs or even ones that run at different frequencies on different cores.


TSC is a 64-bit counter that usually counts at the CPU's rated sticker frequency. This can be over ~4.3 GHz (more than 2^32 ticks per second) on some CPUs, which leaves the high half incrementing about once per second on fast CPUs. The TSC can in theory wrap if the computer has been "up" for over 2^32 seconds (well over a century), or if the TSC has been manually set to have a big offset.

rdtsc accuracy across CPU cores

Check for the X86_FEATURE_CONSTANT_TSC + X86_FEATURE_NONSTOP_TSC bits in cpuid (CPUID leaf 0x80000007, EDX bit 8; check the unsynchronized_tsc function of the Linux kernel for more checks).
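
A minimal sketch of that check with GCC/clang's __get_cpuid (the bit position matches the Intel manual text quoted below):

#include <stdio.h>
#include <cpuid.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    // CPUID.80000007H:EDX[8] reports invariant TSC support
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
        puts("invariant TSC supported");
    else
        puts("invariant TSC not reported");
}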

Intel's Software Developer's Manual, vol. 3B, section 16.11.1 Invariant TSC, says the following:

"16.11.1 Invariant TSC

The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8].

The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource."

So, if TSC can be used for wallclock, they are guaranteed to be in sync.


