Difference between rdtscp, rdtsc : memory and cpuid / rdtsc

Difference between rdtscp, rdtsc : memory and cpuid / rdtsc?

As mentioned in a comment, there's a difference between a compiler barrier and a processor barrier. volatile and the "memory" clobber in the asm statement act as compiler barriers, but the processor is still free to reorder instructions.

Processor barriers are special instructions that must be issued explicitly, e.g. rdtscp, cpuid, and the memory fence instructions (mfence, lfence, ...).

As an aside, while using cpuid as a barrier before rdtsc is common, it can also be very bad from a performance perspective, since virtual machine platforms often trap and emulate the cpuid instruction in order to impose a common set of CPU features across multiple machines in a cluster (to ensure that live migration works). Thus it's better to use one of the memory fence instructions.

The Linux kernel uses mfence;rdtsc on AMD platforms and lfence;rdtsc on Intel. If you don't want to bother with distinguishing between these, mfence;rdtsc works on both although it's slightly slower as mfence is a stronger barrier than lfence.

Edit 2019-11-25: As of Linux kernel 5.4, lfence is used to serialize rdtsc on both Intel and AMD. See this commit "x86: Remove X86_FEATURE_MFENCE_RDTSC": https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be261ffce6f13229dad50f59c5e491f933d3167f

RDTSCP versus RDTSC + CPUID

A full discussion of the overhead you're seeing from the cpuid instruction is available at this stackoverflow thread. When using rdtsc, you need cpuid to ensure that no earlier instructions are still in the execution pipeline. The rdtscp instruction intrinsically waits for all previous instructions to execute before reading the counter. (The referenced SO thread also discusses these salient points, but I addressed them here because they're part of your question as well.)

You only "need" to use cpuid+rdtsc if your processor does not support rdtscp. Otherwise, rdtscp is what you want, and will accurately give you the information you are after.

Both instructions provide you with a 64-bit, monotonically increasing counter that represents the number of cycles on the processor. If this is your pattern:

uint64_t s, e;
s = rdtscp();
do_interrupt();
e = rdtscp();

atomic_add(e - s, &acc);
atomic_add(1, &counter);

You may still have an off-by-one in your average measurement depending on where your read happens. For instance:

   T1                              T2
t0 atomic_add(e - s, &acc);
t1 a = atomic_read(&acc);
t2 c = atomic_read(&counter);
t3 atomic_add(1, &counter);
t4 avg = a / c;

It's unclear whether "[a]t the end" references a time that could race in this fashion. If so, you may want to calculate a running average or a moving average in-line with your delta.

Side-points:

  1. If you do use cpuid+rdtsc, you need to subtract out the cost of the cpuid instruction, which may be difficult to ascertain if you're in a VM (depending on how the VM implements this instruction). This is really why you should stick with rdtscp.
  2. Executing rdtscp inside a loop is usually a bad idea. I somewhat frequently see microbenchmarks that do things like


for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
    s = rdtscp();
    loop_body();
    e = rdtscp();
    acc += e - s;
}

printf("%"PRIu64"\n", (acc / SOME_LARGEISH_NUMBER / CLOCK_SPEED));

While this will give you a decent idea of the overall performance in cycles of whatever is in loop_body(), it defeats processor optimizations such as pipelining. In a microbenchmark, the processor will do a pretty good job of branch prediction in the loop, so the loop overhead itself is not a problem to measure. Timing each iteration as shown above is also bad because you end up with two pipeline stalls per loop iteration. Thus:

s = rdtscp();
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
    loop_body();
}
e = rdtscp();
printf("%"PRIu64"\n", ((e - s) / SOME_LARGEISH_NUMBER / CLOCK_SPEED));

This will be more efficient and probably more accurate in terms of what you'll see in real life than what the previous benchmark would tell you.

Is there any difference in between (rdtsc + lfence + rdtsc) and (rdtsc + rdtscp) in measuring execution time?

TL;DR

rdtscp and lfence/rdtsc have exactly the same upstream serialization properties on Intel processors. On AMD processors with a dispatch-serializing lfence, both sequences also have the same upstream serialization properties. With respect to later instructions, the rdtsc in the lfence/rdtsc sequence may be dispatched for execution simultaneously with later instructions. This behavior may not be desirable if you want to precisely time those later instructions as well. It's generally not a problem, though, because the reservation station scheduler prioritizes older uops for dispatch as long as there are no structural hazards. After lfence retires, the rdtsc uops would be the oldest in the RS with probably no structural hazards, so they will be dispatched immediately (possibly together with some later uops). You could also put an lfence after rdtsc.

The Intel manual V2 says the following about rdtscp (emphasis mine):

The RDTSCP instruction is not a serializing instruction, but it does
wait until all previous instructions have executed and all previous
loads are globally visible. But it does not wait for previous stores
to be globally visible, and subsequent instructions may begin execution before the read operation is performed.

The "read operation" part here refers to reading the time-stamp counter. This suggests that rdtscp internally works like lfence followed by rdtsc + reading IA32_TSC_AUX. That is, lfence is performed first then the two reads from the registers are executed (possibly at the same time).

On most Intel and AMD processors that support these instructions, lfence/rdtsc has a slightly larger number of uops than rdtscp. The number of lfence uops mentioned in Agner's tables is for the case where lfence instructions are executed back-to-back, which makes it appear that lfence decodes into fewer uops (1 or 2) than a single lfence actually does (5 or 6 uops). Usually, lfence is used without other back-to-back lfences; that's why lfence/rdtsc contains more uops than rdtscp. Agner's tables also show that on some processors, rdtsc and rdtscp have the same number of uops, which I'm not sure is correct; it makes more sense for rdtscp to have one or more extra uops compared to rdtsc. That said, the latency may matter more than the difference in uop count, because latency is what directly impacts the measurement overhead.

In terms of portability, rdtsc is older than rdtscp; rdtsc was first supported on the Pentium processors while the first processors that support rdtscp were released in 2005-2006 (See: What is the gcc cpu-type that includes support for RDTSCP?). But most Intel and AMD processors that are in use today support rdtscp. Another dimension for comparing between the two sequences is that rdtscp pollutes one more register (i.e., ECX) than rdtsc.

In summary, if you don't care about reading the IA32_TSC_AUX MSR, there is no particularly big reason why you should choose one over the other. I would use rdtscp and fall back to lfence/rdtsc (or lfence/rdtsc/lfence) on processors that don't support it. If you want maximum timing precision, use the method discussed in Memory latency measurement with time stamp counter.


As Andreas Abel pointed out, you still need an lfence after the last rdtsc(p) as it is not ordered w.r.t. subsequent instructions:

lfence                      lfence
rdtsc    -- ALLOWED -->     B
B                           rdtsc

rdtscp   -- ALLOWED -->     B
B                           rdtscp

This is also addressed in the manuals.


Regarding the use of rdtscp, it seems correct to me to think of it as a compact lfence + rdtsc.

The manuals use different terminology for the two instructions (e.g. "completed locally" vs "globally visible" for loads) but the behavior described seems to be the same.

I'm assuming so in the rest of this answer.

However rdtscp is a single instruction, while lfence + rdtsc are two, making the lfence part of the profiled code.

Granted that lfence should be lightweight in terms of backend execution resources (it is just a marker), it still occupies front-end resources (two uops?) and a slot in the ROB.

rdtscp is decoded into a greater number of uops due to its ability to read IA32_TSC_AUX, so while it saves front-end (part of) resources, it occupies the backend more.

If the read of the TSC is done first (or concurrently) with the read of the processor ID, then these extra uops are only relevant for the subsequent code.

This could be a reason why it is used at the end but not at the start of the benchmark (where the extra uops would affect the code).
This is enough to bias/complicate some micro-architectural benchmarks.

You cannot avoid the lfence after an rdtsc(p) but you can avoid the one before with rdtscp.

This seems unnecessary for the first rdtsc as the preceding lfence is not profiled anyway.


Another reason to use rdtscp at the end is that it was (according to Intel) meant to detect migration to a different CPU (that's why it also atomically loads IA32_TSC_AUX), so at the end of the profiled code you may want to check that the code has not been rescheduled to another CPU.

User mode software can use RDTSCP to detect if CPU migration has occurred between successive reads of the TSC.

This, of course, requires having read IA32_TSC_AUX beforehand (to have something to compare to), so one should execute a rdpid or rdtscp before the profiled code.

If one can afford to not use ecx, the first rdtsc can be a rdtscp too (but see above), otherwise (rather than storing the processor id while in the profiled code), rdpid can be used first (thus, having a rdtsc + rdtscp pair around the profiled code).

This is open to the ABA problem, so I don't think Intel has a strong point here (unless we restrict ourselves to code short enough to be rescheduled at most once).

EDIT
As PeterCordes pointed out, from the point of view of the elapsed time measure, having a migration A->B->A is not an issue as the reference clock is the same.


More information on why rdtsc(p) is not fully serializing: Why isn't RDTSC a serializing instruction?

Why is CPUID + RDTSC unreliable?

I think they're finding that CPUID inside the measurement interval causes extra variability in the total time. Their proposed fix in 3.2 Improvements Using RDTSCP Instruction highlights the fact that there's no CPUID inside the timed interval when they use CPUID / RDTSC to start, and RDTSCP/CPUID to stop.

Perhaps they could have ensured EAX=0 or EAX=1 before executing CPUID, to choose which CPUID leaf of data to read (http://www.sandpile.org/x86/cpuid.htm#level_0000_0000h), in case the time CPUID takes depends on which query you make. Other than that, I'm not sure why its timing would vary.

Or better, use lfence instead of cpuid to serialize OoO exec without being a full serializing operation.


Note that the inline asm in Intel's whitepaper sucks: there's no need for those mov instructions if you use proper output constraints like "=a"(low), "=d"(high). See How to get the CPU cycle count in x86_64 from C++? for better ways.

cpuid + rdtsc and out-of-order execution

Since RDTSC does not depend on any input (it takes no arguments), in principle the out-of-order pipeline will run it as soon as it can. The reason you add a serializing instruction before it is precisely to prevent RDTSC from executing earlier than the instructions you want to time.

There is an answer from John McCalpin here, you might find it useful. He explains the OOO reordering for the RDTSCP instruction (which behaves differently from RDTSC) which you may prefer to use instead.

cpuid before rdtsc

It's to prevent out-of-order execution. From a link that has now disappeared from the web (but which was fortuitously copied here before it disappeared), this text is from an article entitled "Performance monitoring" by one John Eckerdal:

The Pentium Pro and Pentium II processors support out-of-order execution: instructions may be executed in a different order than you programmed them. This can be a source of errors if not taken care of.

To prevent this, the programmer must serialize the instruction queue. This can be done by inserting a serializing instruction, such as the CPUID instruction, before the RDTSC instruction.

Why isn't RDTSC a serializing instruction?

If you are trying to use rdtsc to see if a branch mispredicts, the non-serializing version is what you want.

//math here
rdtsc
branch if zero to done
//do some work that always takes 1 cycle
done: rdtsc

If the branch is predicted correctly, the delta will be small (maybe even negative?). If the branch is mispredicted, the delta will be large.

With the serializing version, the branch condition will be resolved because the first rdtsc waits for the math to finish.


