Calculate System Time Using Rdtsc

Calculate system time using rdtsc

Don't do that (using the RDTSC machine instruction directly yourself), because your OS scheduler could reschedule other threads or processes at arbitrary moments, or slow down the clock. Use a function provided by your library or OS.

My main objective is to avoid the need to perform a system call every time I want to know the system time.

On Linux, read time(7) then use clock_gettime(2) which is really quick (and does not involve any slow system call) thanks to vdso(7).

On a C++11-compliant implementation, simply use the standard <chrono> header. And standard C has clock(3) (giving microsecond precision). On Linux, both use good-enough time-measurement functions (so, indirectly, the vDSO).

Last time I measured clock_gettime it often took less than 4 nanoseconds per call.
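For illustration, a minimal sketch (mine, not part of the answer above) of reading the monotonic clock; on Linux, glibc routes this through the vDSO, so no actual kernel entry is made:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;

    /* On Linux this is serviced by the vDSO: no system call overhead. */
    if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0)
        printf("monotonic: %lld.%09ld s\n",
               (long long)ts.tv_sec, ts.tv_nsec);
    return 0;
}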

Using the Time Stamp Counter to get the time stamp

A simple answer to the stated question, "how do I convert the TSC frequency to microseconds or milliseconds?" is: you do not. What the TSC (Time Stamp Counter) clock frequency actually is varies depending on the hardware, and may even vary at runtime on some. To measure real time, use clock_gettime(CLOCK_REALTIME) or clock_gettime(CLOCK_MONOTONIC) on Linux.

As Peter Cordes mentioned in a comment (Aug 2018), on most current x86-64 architectures the Time Stamp Counter (accessed by the RDTSC instruction and the __rdtsc() function declared in <x86intrin.h>) counts reference clock cycles, not CPU clock cycles. His answer to a similar question in C++ is valid for C also in Linux on x86-64, because the compiler provides the underlying built-in when compiling C or C++, and the rest of the answer deals with the hardware details. I recommend reading that one, too.

The rest of this answer assumes the underlying issue is microbenchmarking code, to find out how two implementations of some function compare to each other.


On x86 (Intel 32-bit) and x86-64 (AMD64, Intel and AMD 64-bit) architectures, you can use __rdtsc() from <x86intrin.h> to find out the number of TSC clock cycles elapsed. This can be used to measure and compare the number of cycles used by different implementations of some function, typically a large number of times.

Do note that there are hardware differences as to how the TSC clock is related to the CPU clock. The abovementioned more recent answer goes into some detail on that. For practical purposes in Linux, it is sufficient to use cpufreq-set to disable frequency scaling (to ensure the relationship between the CPU and TSC frequencies does not change during microbenchmarking), and optionally taskset to restrict the microbenchmark to specific CPU core(s). That ensures that the results gathered in that microbenchmark can be compared to each other.

(As Peter Cordes commented, we also want to add _mm_lfence() from <emmintrin.h> (included by <immintrin.h>). This ensures that the CPU does not internally reorder the RDTSC operation compared to the function to be benchmarked. You can use -DNO_LFENCE at compile time to omit those, if you want.)

Let's say you have functions void foo(void); and void bar(void); that you wish to compare:

#include <stdlib.h>
#include <x86intrin.h>
#include <stdio.h>

#ifdef NO_LFENCE
#define lfence()
#else
#include <emmintrin.h>
#define lfence() _mm_lfence()
#endif

static int cmp_ull(const void *aptr, const void *bptr)
{
    const unsigned long long a = *(const unsigned long long *)aptr;
    const unsigned long long b = *(const unsigned long long *)bptr;
    return (a < b) ? -1 :
           (a > b) ? +1 : 0;
}

unsigned long long *measure_cycles(size_t count, void (*func)())
{
    unsigned long long *elapsed, started, finished;
    size_t i;

    elapsed = malloc((count + 2) * sizeof elapsed[0]);
    if (!elapsed)
        return NULL;

    /* Call func() count times, measuring the TSC cycles for each call. */
    for (i = 0; i < count; i++) {
        /* First, let's ensure our CPU executes everything thus far. */
        lfence();
        /* Start timing. */
        started = __rdtsc();
        /* Ensure timing starts before we call the function. */
        lfence();
        /* Call the function. */
        func();
        /* Ensure everything has been executed thus far. */
        lfence();
        /* Stop timing. */
        finished = __rdtsc();
        /* Ensure we have the counter value before proceeding. */
        lfence();

        elapsed[i] = finished - started;
    }

    /* The very first call is likely the cold-cache case,
       so in case that measurement might contain useful
       information, we put it at the end of the array.
       We also terminate the array with a zero. */
    elapsed[count] = elapsed[0];
    elapsed[count + 1] = 0;

    /* Sort the cycle counts. */
    qsort(elapsed, count, sizeof elapsed[0], cmp_ull);

    /* This function returns all cycle counts, in sorted order,
       although the median, elapsed[count/2], is the one
       I personally use. */
    return elapsed;
}

extern void foo(void), bar(void);   /* the implementations to compare */

void benchmark(const size_t count)
{
    unsigned long long *foo_cycles, *bar_cycles;

    if (count < 1)
        return;

    printf("Measuring run time in Time Stamp Counter cycles:\n");
    fflush(stdout);

    foo_cycles = measure_cycles(count, foo);
    bar_cycles = measure_cycles(count, bar);
    if (!foo_cycles || !bar_cycles) {
        free(bar_cycles);
        free(foo_cycles);
        return;
    }

    printf("foo(): %llu cycles (median of %zu calls)\n", foo_cycles[count/2], count);
    printf("bar(): %llu cycles (median of %zu calls)\n", bar_cycles[count/2], count);

    free(bar_cycles);
    free(foo_cycles);
}
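To make the example self-contained, here is a hypothetical pair of trivial implementations and a main() (my addition, not part of the original answer):

/* Hypothetical stand-ins for the two implementations being compared. */
void foo(void) { volatile int x = 0; x += 1; }
void bar(void) { volatile int x = 0; x -= 1; }

int main(void)
{
    benchmark(100000);  /* report the median of 100000 calls each */
    return 0;
}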

Note that the above results are very specific to the compiler and compiler options used, and of course to the hardware it is run on. The median number of cycles can be interpreted as "the typical number of TSC cycles taken", because the measurement is not completely reliable (it may be affected by events outside the process; for example, by context switches, or by migration to another core on some CPUs). For the same reason, I don't trust the minimum, maximum, or average values.

However, the two implementations' (foo() and bar()) cycle counts above can be compared to find out how their performance compares to each other, in a microbenchmark. Just remember that microbenchmark results may not extend to real work tasks, because of how complex tasks' resource-use interactions are. One function might be superior in all microbenchmarks, but poorer than the others in the real world, because it is only efficient when it has lots of CPU cache to use, for example.


In Linux in general, you can use the CLOCK_REALTIME clock to measure real time (wall clock time) used, in the very same manner as above. CLOCK_MONOTONIC is even better, because it is not affected by direct changes to the realtime clock the administrator might make (say, if they noticed the system clock is ahead or behind); only drift adjustments due to NTP etc. are applied. Daylight saving time, or changes thereof, does not affect the measurements using either clock. Again, the median of a number of measurements is the result I seek, because events outside the measured code itself can affect the result.

For example:

#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#ifdef NO_LFENCE
#define lfence()
#else
#include <emmintrin.h>
#define lfence() _mm_lfence()
#endif

static int cmp_double(const void *aptr, const void *bptr)
{
    const double a = *(const double *)aptr;
    const double b = *(const double *)bptr;
    return (a < b) ? -1 :
           (a > b) ? +1 : 0;
}

double median_seconds(const size_t count, void (*func)())
{
    struct timespec started, stopped;
    double *seconds, median;
    size_t i;

    seconds = malloc(count * sizeof seconds[0]);
    if (!seconds)
        return -1.0;

    for (i = 0; i < count; i++) {
        lfence();
        clock_gettime(CLOCK_MONOTONIC, &started);
        lfence();
        func();
        lfence();
        clock_gettime(CLOCK_MONOTONIC, &stopped);
        lfence();
        seconds[i] = (double)(stopped.tv_sec - started.tv_sec)
                   + (double)(stopped.tv_nsec - started.tv_nsec) / 1000000000.0;
    }

    qsort(seconds, count, sizeof seconds[0], cmp_double);
    median = seconds[count / 2];
    free(seconds);
    return median;
}

static double monotonic_precision(void)
{
    struct timespec t;

    /* Report the resolution of the clock actually used above. */
    if (clock_getres(CLOCK_MONOTONIC, &t) == 0)
        return (double)t.tv_sec
             + (double)t.tv_nsec / 1000000000.0;

    return 0.0;
}

extern void foo(void), bar(void);   /* the implementations to compare */

void benchmark(const size_t count)
{
    double median_foo, median_bar;

    if (count < 1)
        return;

    printf("Median wall clock times over %zu calls:\n", count);
    fflush(stdout);

    median_foo = median_seconds(count, foo);
    median_bar = median_seconds(count, bar);
    if (median_foo < 0.0 || median_bar < 0.0)
        return;

    printf("foo(): %.3f ns\n", median_foo * 1000000000.0);
    printf("bar(): %.3f ns\n", median_bar * 1000000000.0);

    printf("(Measurement unit is approximately %.3f ns)\n",
           1000000000.0 * monotonic_precision());
    fflush(stdout);
}


In general, I personally prefer to compile the benchmarked function in a separate unit (a separate object file), and also to benchmark a do-nothing function to estimate the function-call overhead. Note that this tends to overestimate the overhead, because part of the function-call cost is latency rather than occupied time, and some operations are possible during those latencies in the actual functions.
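As a sketch of that approach (the nothing() name is my own; measure_cycles() is the helper from the first example), the do-nothing function goes into its own translation unit so the compiler cannot inline or discard the call:

/* overhead.c -- compile separately, e.g. gcc -O2 -c overhead.c */
void nothing(void)
{
    /* Intentionally empty: calling this measures only the
       function-call and timing overhead. */
}

The median of measure_cycles(count, nothing) then gives a (likely over-)estimate of the per-call overhead, which can be subtracted from the medians measured for foo() and bar().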

It is important to remember that the above measurements should only be used as indications, because in a real world application, things like cache locality (especially on current machines, with multi-level caching, and lots of memory) hugely affect the time used by different implementations.

For example, you might compare the speeds of a quicksort and a radix sort. Depending on the size of the keys, the radix sort requires rather large extra arrays (and uses a lot of cache). If the real application the sort routine is used in does not simultaneously use a lot of other memory (and thus the sorted data is basically what is cached), then a radix sort will be faster if there is enough data (and the implementation is sane). However, if the application is multithreaded, and the other threads shuffle (copy or transfer) a lot of memory around, then the radix sort using a lot of cache will evict other data also cached; even though the radix sort function itself does not show any serious slowdown, it may slow down the other threads and therefore the overall program, because the other threads have to wait for their data to be re-cached.

This means that the only "benchmarks" you should trust are wall-clock measurements made on the actual hardware, running actual work tasks with actual work data. Everything else is subject to many conditions, and is more or less suspect: an indication, yes, but not very reliable.

rdtsc timing for measuring a function

You use the plain rdtsc instruction, which may not work correctly on out-of-order CPUs, like Xeons and Cores. You should add a serializing instruction, or switch to the rdtscp instruction:

http://en.wikipedia.org/wiki/Time_Stamp_Counter

Starting with the Pentium Pro, Intel processors have supported out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. This can cause RDTSC to be executed later than expected, producing a misleading cycle count.[3] This problem can be solved by executing a serializing instruction, such as CPUID, to force every preceding instruction to complete before allowing the program to continue, or by using the RDTSCP instruction, which is a serializing variant of the RDTSC instruction.

Intel has a recent manual on using rdtsc/rdtscp: How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures (ia-32-ia-64-benchmark-code-execution-paper.pdf, 324264-001, 2010). They recommend cpuid+rdtsc for the start timer and rdtscp for the end timer:

The solution to the problem presented in Section 0 is to add a CPUID instruction just after the RDTSCP and the two mov instructions (to store in memory the value of edx and eax). The implementation is as follows:

asm volatile ("CPUID\n\t"
"RDTSC\n\t"
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
"%rax", "%rbx", "%rcx", "%rdx");
/***********************************/
/*call the function to measure here*/
/***********************************/
asm volatile("RDTSCP\n\t"
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t"
"CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1)::
"%rax", "%rbx", "%rcx", "%rdx");

start = ( ((uint64_t)cycles_high << 32) | cycles_low );
end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 );

In the code above, the first CPUID call implements a barrier to avoid out-of-order
execution of the instructions above and below the RDTSC instruction.
Nevertheless, this call does not affect the measurement since it comes before the
RDTSC (i.e., before the timestamp register is read).
The first RDTSC then reads the timestamp register and the value is stored in
memory.
Then the code that we want to measure is executed. If the code is a call to a
function, it is recommended to declare such function as “inline” so that from an
assembly perspective there is no overhead in calling the function itself.
The RDTSCP instruction reads the timestamp register for the second time and
guarantees that the execution of all the code we wanted to measure is completed.

Your example is not quite correct; you try to measure the empty function bar(), but it is so short that you are really measuring the rdtsc overhead in method 1 (for () { rdtsc; bar(); rdtsc; }). According to Agner Fog's instruction tables for Haswell (http://www.agner.org/optimize/instruction_tables.pdf, page 191, the long table "Intel Haswell List of instruction timings and μop breakdown", at the very end of it),
RDTSC has 15 uops (no fusion possible) and a latency of 24 ticks; on the older Sandy Bridge microarchitecture, RDTSCP has 23 uops and 36 ticks of latency, versus 21 uops and 28 ticks for rdtsc. So you can't use plain rdtsc (or rdtscp) to directly measure such short code.
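To get a feel for the magnitude of that overhead on your own machine, a quick sketch (mine, not from the answer) timing two back-to-back reads:

#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    unsigned long long a = __rdtsc();
    unsigned long long b = __rdtsc();   /* nothing measured in between */
    printf("back-to-back rdtsc delta: %llu reference ticks\n", b - a);
    return 0;
}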

Measuring time difference using RDTSC - results too large

In your updated version that doesn't clobber the start time (the bug @R. pointed out):

sub %eax, %edi is calculating start - end. This is a negative number, i.e. a huge unsigned number just below 2^32. If you're going to use %u, get used to interpreting its output back to a bit-pattern when debugging.

You want end - start.

And BTW, use lfence; it's significantly more efficient than cpuid. It's guaranteed to serialize instruction execution on Intel (without flushing the store buffer like a full serializing instruction). It's also safe on AMD CPUs with Spectre mitigation enabled.
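Put together in C, using the compiler built-ins rather than hand-written asm, the corrected pattern might look like this sketch (the time_region() name is mine):

#include <x86intrin.h>   /* __rdtsc() */
#include <emmintrin.h>   /* _mm_lfence() */

unsigned long long time_region(void (*work)(void))
{
    _mm_lfence();
    unsigned long long start = __rdtsc();
    _mm_lfence();
    work();
    _mm_lfence();
    unsigned long long end = __rdtsc();
    _mm_lfence();
    return end - start;   /* end - start, not start - end */
}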

See also http://akaros.cs.berkeley.edu/lxr/akaros/kern/arch/x86/rdtsc_test.c for some different ways to serialize RDTSC and/or RDTSCP.


See also Get CPU cycle count? for more about RDTSC, especially that it doesn't count core clock cycles, only reference cycles. So idle/turbo will affect your results.

Also, the cost of one instruction isn't one-dimensional. It's not particularly useful to time a single instruction with RDTSC like that. See RDTSCP in NASM always returns the same value for more about how to measure throughput/latency/uops for a single instruction.

RDTSC can be useful for timing a whole loop or longer sequence of instructions, larger than the OoO execution window of your CPU.

How to count clock cycles with RDTSC in GCC x86?



You don't need and shouldn't use inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. https://gcc.gnu.org/wiki/DontUseInlineAsm

Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics. (Intel's intrinsics guide says #include <immintrin.h> for this, but with gcc and clang the non-SIMD intrinsics are mostly in x86intrin.h.)

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
unsigned long long readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    return __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
}

Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer.

For more about using lfence to improve repeatability of rdtsc, see @HadiBrais' answer on clflush to invalidate cache line via C function.

See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset.)


rdtsc counts reference cycles, not CPU core clock cycles

It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (except for system clock adjustments, so it's basically steady_clock). It ticks at the CPU's rated frequency, i.e. the advertised sticker frequency.

If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. Or better, use a library that gives you access to hardware performance counters, or a trick like perf stat for part of a program, if your timed region is long enough that you can attach perf stat -p PID. You usually will still want to avoid CPU frequency shifts during your microbenchmark, though.

  • std::chrono::clock, hardware clock and cycle count
  • Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
  • Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
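Because of that wall-clock correlation, you can estimate the TSC frequency at runtime by comparing it against CLOCK_MONOTONIC. A rough sketch (my own; it assumes an invariant TSC and uses an arbitrary 100 ms calibration window):

#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

/* Estimate the TSC frequency by counting TSC ticks over ~100 ms. */
static double estimate_tsc_hz(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = __rdtsc();

    do {   /* spin for roughly 100 ms */
        clock_gettime(CLOCK_MONOTONIC, &t1);
    } while ((t1.tv_sec - t0.tv_sec) * 1000000000LL
             + (t1.tv_nsec - t0.tv_nsec) < 100000000LL);

    uint64_t c1 = __rdtsc();
    double seconds = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)(c1 - c0) / seconds;
}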

It's also not guaranteed that the TSCs of all cores are in sync. So if your thread migrates to another CPU core between two __rdtsc() executions, there can be extra skew. (Most OSes attempt to sync the TSCs of all cores, though.) If you're using rdtsc directly, you probably want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux.
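If you prefer pinning from inside the program rather than with taskset, Linux provides sched_setaffinity(2); a minimal sketch:

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to CPU core 0 (Linux-specific). */
static int pin_to_core0(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* pid 0 = this thread */
}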


How good is the asm from using the intrinsic?

It's at least as good as anything you could do with inline asm.

A non-inline version of it compiles with MSVC for x86-64 like this:

unsigned __int64 readTSC(void) PROC                   ; readTSC
        rdtsc
        shl     rdx, 32                               ; 00000020H
        or      rax, rdx
        ret     0
        ; return in RAX

For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.

In a test caller that uses it twice and subtracts to time an interval:

uint64_t time_something() {
    uint64_t start = readTSC();
    // even when empty, back-to-back __rdtsc() don't optimize away
    return readTSC() - start;
}

All 4 compilers make pretty similar code. This is GCC's 32-bit output:

# gcc8.2 -O3 -m32
time_something():
        push    ebx            # save a call-preserved reg: 32-bit only has 3 scratch regs
        rdtsc
        mov     ecx, eax
        mov     ebx, edx       # start in ebx:ecx
        # timed region (empty)

        rdtsc
        sub     eax, ecx
        sbb     edx, ebx       # edx:eax -= ebx:ecx

        pop     ebx
        ret                    # return value in edx:eax

This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.

# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC            ; time_something
        rdtsc
        shl     rdx, 32        ; high <<= 32
        or      rax, rdx
        mov     rcx, rax       ; missed optimization: lea rcx, [rdx+rax]
                               ; rcx = start
        ;; timed region (empty)

        rdtsc
        shl     rdx, 32
        or      rax, rdx       ; rax = end

        sub     rax, rcx       ; end -= start
        ret     0
unsigned __int64 time_something(void) ENDP            ; time_something

All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.

But writing it in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or/mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.

Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?

As long as your thread stays on the same CPU core, the RDTSC instruction will keep returning an increasing number until it wraps around. For a 2 GHz CPU, this happens after about 292 years (2^64 ticks / 2×10^9 ticks per second ≈ 9.2×10^9 seconds), so it is not a real issue. You probably won't see it happen. If you expect to live that long, make sure your computer reboots, say, every 50 years.

The problem with RDTSC is that you have no guarantee that it starts at the same point in time on all cores of an elderly multicore CPU, and no guarantee that it starts at the same point in time on all CPUs on an elderly multi-CPU board.

Modern systems usually do not have such problems, but the problem can also be worked around on older systems by setting a thread's affinity so it only runs on one CPU. This is not good for application performance, so one should not generally do it, but for measuring ticks, it's just fine.

(Another "problem" is that many people use RDTSC for measuring time, which is not what it does, but you wrote that you want CPU cycles, so that is fine. If you do use RDTSC to measure time, you may have surprises when power saving or hyperboost or whatever the multitude of frequency-changing techniques are called kicks in. For actual time, the clock_gettime syscall is surprisingly good under Linux.)

I would just write rdtsc inside the asm statement, which works just fine for me and is more readable than some obscure hex code. Assuming it's the correct hex code (and since it doesn't crash and does return an ever-increasing number, it seems so), your code is good.

If you want to measure the number of ticks a piece of code takes, you want a tick difference: just subtract two values of the ever-increasing counter. Something like uint64_t t0 = rdtsc(); ... uint64_t t1 = rdtsc() - t0;
Note that if very accurate measurements isolated from surrounding code are necessary, you need to serialize, that is, stall the pipeline, prior to calling rdtsc (or use rdtscp, which is only supported on newer processors). The one serializing instruction that can be used at every privilege level is cpuid.
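Such an asm-statement wrapper might look like this sketch (GCC/Clang extended-asm syntax):

#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;

    /* rdtsc puts the low 32 bits in EAX and the high 32 bits in EDX. */
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}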

In reply to the further question in the comment:

The TSC starts at zero when you turn on the computer (and the BIOS resets all counters on all CPUs to the same value, though some BIOSes a few years ago did not do so reliably).

Thus, from your program's point of view, the counter started "some unknown time in the past", and it always increases with every clock tick the CPU sees. Therefore if you execute the instruction returning that counter now and any time later in a different process, it will return a greater value (unless the CPU was suspended or turned off in between). Different runs of the same program get bigger numbers, because the counter keeps growing. Always.

Now, clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is a different matter. This is the CPU time that the OS has given to the process. It starts at zero when your process starts. A new process starts at zero, too. Thus, two processes running after each other will get very similar or identical numbers, not ever growing ones.

clock_gettime(CLOCK_MONOTONIC_RAW) is closer to how RDTSC works (and on some older systems is implemented with it). It returns an ever-increasing value. Nowadays, this is typically an HPET. However, this is really time, not ticks. If your computer goes into a low power state (e.g. running at 1/2 the normal frequency), it will still advance at the same pace.
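A small sketch (my own) printing both clocks side by side, to illustrate the difference described above:

#define _GNU_SOURCE   /* CLOCK_MONOTONIC_RAW is Linux-specific */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec cpu, raw;

    /* CPU time consumed by this process: starts near zero every run. */
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu);
    /* Raw monotonic time: ever-increasing since some point in the past. */
    clock_gettime(CLOCK_MONOTONIC_RAW, &raw);

    printf("process CPU time: %lld.%09ld s\n", (long long)cpu.tv_sec, cpu.tv_nsec);
    printf("monotonic raw:    %lld.%09ld s\n", (long long)raw.tv_sec, raw.tv_nsec);
    return 0;
}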


