How to get the CPU cycle count in x86_64 from C++?
Since GCC 4.5, the __rdtsc()
intrinsic is supported by both MSVC and GCC.
But the header you need to include differs:
#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
Here's the original answer, from before GCC 4.5 made the intrinsic available.
Pulled directly out of one of my projects:
#include <stdint.h>
// Windows
#ifdef _WIN32
#include <intrin.h>
uint64_t rdtsc(){
return __rdtsc();
}
// Linux/GCC
#else
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
#endif
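As a quick sanity check, a wrapper like the one above can be used to time an interval in cycles. This is a minimal sketch (the busy loop is placeholder work, and the function names are illustrative):

```cpp
#include <cstdint>
#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// Same wrapper as above, written with the intrinsic for brevity.
static inline uint64_t rdtsc() { return __rdtsc(); }

// Time a placeholder busy loop and return the elapsed cycle count.
uint64_t time_busy_loop() {
    uint64_t start = rdtsc();
    volatile unsigned sink = 0;   // volatile so the loop isn't optimized away
    for (int i = 0; i < 100000; ++i) sink += i;
    return rdtsc() - start;
}
```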
This GNU C Extended asm tells the compiler:
- volatile: the outputs aren't a pure function of the inputs, so the asm has to re-run every time, not reuse an old result.
- "=a" (lo) and "=d" (hi): the output operands are fixed registers, EAX and EDX (x86 machine constraints). The x86 rdtsc instruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with "=r" wouldn't work: there's no way to ask the CPU for the result to go anywhere else.
- ((uint64_t)hi << 32) | lo: zero-extend both 32-bit halves to 64 bits (because lo and hi are unsigned), then shift and OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values still just stay in a pair of 32-bit registers. In 64-bit code you typically get actual shift and OR instructions, unless the high half optimizes away.
(Editor's note: this could probably be more efficient if you used unsigned long instead of unsigned int. Then the compiler would know that lo was already zero-extended into RAX. It wouldn't know that the upper half was zero, so | and + are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)
https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. But hopefully this section is useful if you need to understand old code that uses inline asm so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info
CPU Cycle count based profiling in C/C++ Linux x86_64
I personally think the rdtsc instruction is great and usable for a variety of tasks. I do not think that using cpuid is necessary to prepare for rdtsc. Here is how I reason around rdtsc:
- Since I use the Watcom compiler I've implemented rdtsc using "#pragma aux" which means that the C compiler will generate the instruction inline, expect the result in edx:eax and also inform its optimizer that the contents of eax and edx have been modified. This is a huge improvement from traditional _asm implementations where the optimizer would stay away from optimizing in _asm's vicinity. I've also implemented a divide_U8_by_U4 using "#pragma aux" so that I won't need to call a lib function when I convert clock_cycles to us or ms.
- Every execution of rdtsc will result in some overhead (A LOT more if it is encapsulated as in the author's example) which must be taken more into account the shorter the sequence to measure is. Generally I don't time sequences shorter than 1/30 of the internal clock frequency, which usually works out to 1/10^8 seconds (3 GHz internal clock). I use such measurements as indications, not fact. Knowing this I can leave out cpuid. The more times I measure, the closer to fact I will get.
- To measure reliably I would use the 1/100 - 1/300 range, i.e. 0.03 - 0.1 us. In this range the additional accuracy of using cpuid is practically insignificant. I use this range for short-sequence timing. This is my "non-standard" unit since it is dependent on the CPU's internal clock frequency. For example, on a 1 GHz machine I would not use 0.03 us because that would put me outside the 1/100 limit and my readings would become indications. Here I would use 0.1 us as the shortest time-measurement unit. 1/300 would not be used since it would be too close to 1 us (see below) to make any significant difference.
- For even longer processing sequences I divide the difference between two rdtsc reading with say 3000 (for 3 GHz) and will convert the elapsed clock cycles to us. Actually I use (diff+1500)/3000 where 1500 is half of 3000. For I/O waits I use milliseconds => (diff+1500000)/3000000. These are my "standard" units. I very seldom use seconds.
- Sometimes I get unexpectedly slow results and then I must ask myself: is this due to an interrupt or to the code? I measure a few more times to see if it was, indeed, an interrupt. In that case ... well interrupts happen all the time in the real world. If my sequence is short then there is a good possibility that the next measurement won't be interrupted. If the sequence is longer interrupts will occur more often and there isn't much I can do about it.
- Measuring long elapsed times very accurately (hour and longer ETs in us or lower) will increase the risk of getting a division exception in divide_U8_by_U4, so I think through when to use us and when to use ms.
- I also have code for basic statistics. Using this I log min and max values and I can calculate mean and standard deviation. This code is non-trivial so its own ET must be subtracted from the measured ETs.
- If the compiler is doing extensive optimizations and your readings are stored in local variables the compiler may determine ("correctly") that the code can be omitted. One way to avoid this is to store the results in public (non-static, non-stack-based) variables.
- Programs running in real-world conditions should be measured in real-world conditions, there's no way around that.
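The (diff+1500)/3000 conversion and the basic statistics described in the points above can be sketched roughly like this. The 3 GHz clock is the answer's example figure, and all names here are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// 3000 cycles per microsecond for the 3 GHz example clock above.
static const uint64_t CYCLES_PER_US = 3000;

// (diff + 1500) / 3000: adding half the divisor rounds to nearest.
uint64_t cycles_to_us(uint64_t diff) {
    return (diff + CYCLES_PER_US / 2) / CYCLES_PER_US;
}

// Minimal running statistics: log min/max, compute mean and standard deviation.
struct CycleStats {
    uint64_t n = 0;
    uint64_t min = UINT64_MAX, max = 0;
    double sum = 0.0, sum_sq = 0.0;

    void add(uint64_t cycles) {
        ++n;
        min = std::min(min, cycles);
        max = std::max(max, cycles);
        sum += double(cycles);
        sum_sq += double(cycles) * double(cycles);
    }
    double mean() const { return n ? sum / double(n) : 0.0; }
    double stddev() const {  // sample standard deviation
        if (n < 2) return 0.0;
        double m = mean();
        return std::sqrt((sum_sq - double(n) * m * m) / double(n - 1));
    }
};
```

As the answer notes, the statistics code has its own elapsed time, which must be subtracted from the measured values.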
As to the question of the time stamp counter being accurate, I would say that assuming the TSCs on different cores are synchronized (which is the norm), there is the problem of CPU throttling during periods of low activity to reduce energy consumption. It is always possible to inhibit the functionality when testing. If you're executing an instruction at 1 GHz or at 10 MHz on the same processor, the elapsed cycle count will be the same, even though the former completed in 1% of the time compared to the latter.
How to get the cpu cycles and executed time of one function in my source code or system library
Consider the code from SO: Get CPU cycle count?
static inline uint64_t get_cycles()
{
    unsigned int lo, hi;
    // Note: the original used "=A"(t), but that constraint only splits the
    // result across EDX:EAX in 32-bit code; in 64-bit mode it can pick a
    // single register, so read the two halves explicitly instead.
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
and implement a class like the following:
class ScopedTimer
{
public:
ScopedTimer ()
{
m_start = get_cycles ();
}
~ScopedTimer ()
{
auto diff = get_cycles() - m_start;
std::cout << "Takes " << diff << " cycles" << std::endl;
}
private:
uint64_t m_start;
};
Finally you can simply use that class in your code with:
void job () {
ScopedTimer timer;
// do some job
// leaving the scope will automatically print the message in the destructor.
}
I have some similar code that automatically counts statistics in different categories.
The main difference is that in the destructor you accumulate the cycles into a statistics class (or similar) rather than printing them.
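A sketch of that accumulating variant might look like this (the class and member names are mine, not from the original code):

```cpp
#include <cstdint>
#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// Like ScopedTimer, but the destructor adds the elapsed cycles to a
// caller-provided total instead of printing them.
class AccumulatingTimer {
public:
    explicit AccumulatingTimer(uint64_t& total)
        : m_total(total), m_start(__rdtsc()) {}
    ~AccumulatingTimer() { m_total += __rdtsc() - m_start; }
private:
    uint64_t& m_total;
    uint64_t m_start;
};
```

Each timed scope then contributes to a per-category total that you can report at the end of the run.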
What is the most reliable way to measure the number of cycles of my program in C?
A lot here depends on how large an amount of time you're trying to measure.
RDTSC can be (almost) 100% reliable when used correctly. It is, however, of use primarily for measuring truly microscopic pieces of code. If you want to measure two sequences of, say, a few dozen or so instructions apiece, there's probably nothing else that can do the job nearly as well.
Using it correctly is somewhat challenging though. Generally speaking, to get good measurements you want to do at least the following:
- Set the code to only run on one specific core.
- Set the code to execute at maximum priority so nothing preempts it.
- Use CPUID liberally to ensure serialization where needed.
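On Linux, the first two points can be sketched as follows. The core number and scheduling policy are illustrative choices, and raising priority typically requires root or CAP_SYS_NICE:

```cpp
#include <sched.h>
#include <cstdio>

// Pin the calling thread to one core and try to raise its priority
// before a timing run. Returns false if even the pinning fails.
bool prepare_for_timing(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);   // only run on this core
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return false;

    sched_param sp{};
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        std::perror("sched_setscheduler");  // usually needs privileges; non-fatal here
    return true;
}
```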
If, on the other hand, you're trying to measure something that takes anywhere from, say, 100 ms on up, RDTSC
is pointless. It's like trying to measure the distance between cities with a micrometer. For this, it's generally best to assure that the code in question takes (at least) the better part of a second or so. clock
isn't particularly precise, but for a length of time on this general order, the fact that it might only be accurate to, say, 10 ms or so, is more or less irrelevant.
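For those longer intervals, plain clock() is enough. A minimal sketch (the busy loop is just placeholder work big enough that clock()'s coarse granularity doesn't matter):

```cpp
#include <ctime>

// Measure CPU time for a chunk of work lasting a good fraction of a second,
// so clock()'s ~10 ms granularity is irrelevant.
double time_work_seconds() {
    std::clock_t start = std::clock();
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 50000000UL; ++i) sink += i;
    return double(std::clock() - start) / CLOCKS_PER_SEC;
}
```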
How to count clock cycles with RDTSC in GCC x86? [duplicate]
Update: reposted and updated this answer on a more canonical question. I'll probably delete this at some point once we sort out which question to use as the duplicate target for closing all the similar rdtsc
questions.
You don't need and shouldn't use inline asm for this. There's no benefit; compilers have built-ins for rdtsc
and rdtscp
, and (at least these days) all define a __rdtsc
intrinsic if you include the right headers. https://gcc.gnu.org/wiki/DontUseInlineAsm
Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics. (Intel's intrinsics guide says #include <immintrin.h>
for this, but with gcc and clang the non-SIMD intrinsics are mostly in x86intrin.h
.)
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
unsigned long long readTSC() {
// _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
return __rdtsc();
// _mm_lfence(); // optionally block later instructions until rdtsc retires
}
Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer.
For more about using lfence
to improve repeatability of rdtsc
, see @HadiBrais' answer on clflush to invalidate cache line via C function.
See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset.)
rdtsc
counts reference cycles, not CPU core clock cycles
It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc
is exactly correlated with wall-clock time (except for system clock adjustments, so it's basically steady_clock
). It ticks at the CPU's rated frequency, i.e. the advertised sticker frequency.
If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. Or better, use a library that gives you access to hardware performance counters, or a trick like perf stat for part of program if your timed region is long enough that you can attach a perf stat -p PID
. You usually will still want to avoid CPU frequency shifts during your microbenchmark, though.
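A warm-up can be as simple as busy-spinning before the timed region (the duration here is an arbitrary choice, not from the original answer):

```cpp
#include <chrono>

// Busy-spin for roughly `ms` milliseconds so the CPU ramps up to its
// maximum clock speed before the real measurement starts.
void warm_up(int ms) {
    auto end = std::chrono::steady_clock::now()
             + std::chrono::milliseconds(ms);
    volatile unsigned x = 0;  // volatile keeps the spin loop alive
    while (std::chrono::steady_clock::now() < end) ++x;
}
```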
- std::chrono::clock, hardware clock and cycle count
- Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
- Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
It's also not guaranteed that the TSCs of all cores are in sync. So if your thread migrates to another CPU core between __rdtsc()
, there can be an extra skew. (Most OSes attempt to sync the TSCs of all cores, though.) If you're using rdtsc
directly, you probably want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram
on Linux.
How good is the asm from using the intrinsic?
It's at least as good as anything you could do with inline asm.
A non-inline version of it compiles with MSVC for x86-64 like this:
unsigned __int64 readTSC(void) PROC ; readTSC
rdtsc
shl rdx, 32 ; 00000020H
or rax, rdx
ret 0
; return in RAX
For 32-bit calling conventions that return 64-bit integers in edx:eax
, it's just rdtsc
/ret
. Not that it matters, you always want this to inline.
In a test caller that uses it twice and subtracts to time an interval:
uint64_t time_something() {
uint64_t start = readTSC();
// even when empty, back-to-back __rdtsc() don't optimize away
return readTSC() - start;
}
All 4 compilers make pretty similar code. This is GCC's 32-bit output:
# gcc8.2 -O3 -m32
time_something():
push ebx # save a call-preserved reg: 32-bit only has 3 scratch regs
rdtsc
mov ecx, eax
mov ebx, edx # start in ebx:ecx
# timed region (empty)
rdtsc
sub eax, ecx
sbb edx, ebx # edx:eax -= ebx:ecx
pop ebx
ret # return value in edx:eax
This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.
# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC ; time_something
rdtsc
shl rdx, 32 ; high <<= 32
or rax, rdx
mov rcx, rax ; missed optimization: lea rcx, [rdx+rax]
; rcx = start
;; timed region (empty)
rdtsc
shl rdx, 32
or rax, rdx ; rax = end
sub rax, rcx ; end -= start
ret 0
unsigned __int64 time_something(void) ENDP ; time_something
All 4 compilers use or
+mov
instead of lea
to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.
But writing it in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.