Why does the call latency on clock_gettime(CLOCK_REALTIME, ..) vary so much?

The very first time clock_gettime is called, a page fault occurs on the page that contains the function's instructions. On my system this is a soft page fault, and it takes a few thousand cycles to handle (up to 10,000 cycles). My CPU runs at 3.4 GHz. Your CPU appears to run at a much lower frequency, so handling the page fault would take more wall-clock time on your system. The point, though, is that the first call to clock_gettime is much more expensive than later calls, which is what you are observing.
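A minimal sketch of the usual workaround, assuming Linux and glibc: make one throwaway call before any timed region so the one-time fault cost is paid up front.

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;
    /* Throwaway call: takes the one-time soft page fault on the
       clock_gettime (and vDSO) code pages outside any timed region. */
    clock_gettime(CLOCK_REALTIME, &ts);
    /* Later calls no longer pay that cost. */
    clock_gettime(CLOCK_REALTIME, &ts);
    printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}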

The second major effect your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now and printf, but these functions call other functions, and they all compete for the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL performance counter). However, when the sleep time is larger than zero, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core, and that thread touches different instructions and data, evicting yours. This explains why clock_gettime takes more time to execute after you sleep.
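A sketch that reproduces the effect (a hypothetical harness, not the asker's exact code): time clock_gettime with rdtscp once per iteration, sleeping in between. With a nonzero sleep the measurements stay high; with no sleep they drop after the first iteration.

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Read the time stamp counter; rdtscp waits for prior instructions
   to complete before reading the counter. */
static inline uint64_t rdtscp_now(void) {
    uint32_t lo, hi, aux;
    __asm__ __volatile__ ("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
    (void)aux;  /* processor ID, unused here */
    return (uint64_t)hi << 32 | lo;
}

int main(void) {
    struct timespec ts;
    for (int i = 0; i < 5; i++) {
        uint64_t t0 = rdtscp_now();
        clock_gettime(CLOCK_REALTIME, &ts);
        uint64_t t1 = rdtscp_now();
        printf("%llu cycles\n", (unsigned long long)(t1 - t0));
        sleep(1);  /* change to sleep(0): later measurements shrink a lot */
    }
    return 0;
}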

Now, regarding the second and later measurements, here are the numbers from the question:

42467
1114
1077
496
455

I have observed on my system that the second measurement is not necessarily larger than later measurements, and I believe the same holds on your system; indeed, that seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now and printf execute thousands of dynamic instructions, and they also access the L1 data cache. The variability you see between the second and later measurements is reproducible, so it is inherent in the functions themselves. Note that the execution time of the rdtscp instruction itself may vary by about 4 cycles.

In practice, clock_gettime is useful when the intervals being measured are on the order of a million cycles or more. At finer granularities, its own overhead and variability can be misleading.

Is clock_gettime() adequate for submicrosecond timing?

No. You'll have to use platform-specific code to do it. On x86 and x86-64, you can use 'rdtsc' to read the Time Stamp Counter.

Just port the rdtsc assembly you're using.

#include <stdint.h>

__inline__ uint64_t rdtsc(void) {
    uint32_t lo, hi;
    /* Serialize: the cpuid barrier forces earlier instructions to
       retire before the counter is read. */
    __asm__ __volatile__ ("xorl %%eax,%%eax \n cpuid"
                          ::: "%rax", "%rbx", "%rcx", "%rdx");
    /* We cannot use "=A", since this would use %rax on x86_64 and
       return only the lower 32 bits of the TSC. */
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}
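Used for interval measurement, the pattern looks like this (a sketch assuming the rdtsc() above is in the same file; note that on modern x86 CPUs with constant_tsc the counter ticks at a fixed reference rate regardless of the current core frequency, so converting ticks to seconds requires a separate calibration step):

#include <stdint.h>
#include <stdio.h>

/* rdtsc() definition from above goes here. */

int main(void) {
    uint64_t start = rdtsc();
    /* ... code under test ... */
    uint64_t end = rdtsc();
    printf("elapsed: %llu TSC ticks\n", (unsigned long long)(end - start));
    return 0;
}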

clock_gettime takes longer to execute when the program is run from a terminal

Just add more iterations to give the CPU time to ramp up to its maximum clock speed. Your "slow" times are with the CPU at its low-power idle clock speed.

QtCreator apparently uses enough CPU time to make this happen before your program runs, or else you're compiling and then running, and the compilation serves as a warm-up (vs. bash's fork/execve being much lighter weight).

See "Idiomatic way of performance evaluation?" for more about doing warm-up runs when benchmarking, and also "Why does this delay-loop start to run faster after several iterations with no sleep?"

On my i7-6700k (Skylake) running Linux, increasing the loop iteration count to 1000 is sufficient to get the final iterations running at full clock speed, even after the first couple of iterations spent handling page faults and warming up the iTLB, uop cache, data caches, and so on.
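The timed loop is essentially this (my reconstruction, not necessarily the asker's exact source): two back-to-back clock_gettime calls per iteration, printing the delta.

#include <stdio.h>
#include <time.h>

int main(void) {
    for (int i = 0; i < 1000; i++) {  /* 1000 iterations: enough to ramp up */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("It took %ld ns\n",
               (long)((t1.tv_sec - t0.tv_sec) * 1000000000L
                      + (t1.tv_nsec - t0.tv_nsec)));
    }
    return 0;
}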

$ ./a.out      
It took 244 ns
It took 150 ns
It took 73 ns
It took 76 ns
It took 75 ns
It took 71 ns
It took 72 ns
It took 72 ns
It took 69 ns
It took 75 ns
...
It took 74 ns
It took 68 ns
It took 69 ns
It took 72 ns
It took 72 ns # 382 "slow" iterations in this test run (copy/paste into wc to check)
It took 15 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 16 ns
It took 16 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 14 ns
It took 16 ns
...

On my system, energy_performance_preference is set to balance_performance, so the hardware P-state governor isn't as aggressive as with performance. Use grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference to check, and use sudo to change it:

sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo balance_performance > "$i";done'

Even running it under perf stat ./a.out is enough to ramp up to max clock speed very quickly, though; it really doesn't take much. But bash's command parsing after you press return is very cheap, with not much CPU work done before it calls execve and reaches main in your new process.

The printf with line-buffered output is what takes most of the CPU time in your program, BTW. That's why it takes so few iterations to ramp up to speed. E.g. if you run perf stat --all-user -r10 ./a.out, you'll see the user-space core clock cycles per second average only about 0.4 GHz, with the rest of the time spent in the kernel in write system calls.

Difference between CLOCK_REALTIME and CLOCK_MONOTONIC?

CLOCK_REALTIME represents the machine's best guess as to the current wall-clock, time-of-day time. As Ignacio and MarkR say, this means that CLOCK_REALTIME can jump forwards and backwards as the system time-of-day clock is changed, including by NTP.

CLOCK_MONOTONIC represents the absolute elapsed wall-clock time since some arbitrary, fixed point in the past. It isn't affected by changes in the system time-of-day clock.

If you want to compute the elapsed time between two events observed on the one machine without an intervening reboot, CLOCK_MONOTONIC is the best option.

Note that on Linux, CLOCK_MONOTONIC does not measure time spent in suspend, although by the POSIX definition it should. You can use the Linux-specific CLOCK_BOOTTIME for a monotonic clock that keeps running during suspend.
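The usual interval-measurement pattern with CLOCK_MONOTONIC, as a minimal sketch:

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... the work being timed ... */
    clock_gettime(CLOCK_MONOTONIC, &end);
    /* Monotonic clock: the difference is immune to NTP steps and
       manual time-of-day changes. */
    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.9f s\n", elapsed);
    return 0;
}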

faster equivalent of gettimeofday

Have you actually benchmarked, and found gettimeofday to be unacceptably slow?

At the rate of 100 messages a second, you have 10ms of CPU time per message. If you have multiple cores, assuming it can be fully parallelized, you can easily increase that by 4-6x - that's 40-60ms per message! The cost of gettimeofday is unlikely to be anywhere near 10ms - I'd suspect it to be more like 1-10 microseconds (on my system, microbenchmarking it gives about 1 microsecond per call - try it for yourself). Your optimization efforts would be better spent elsewhere.
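If you want to try it yourself, a rough microbenchmark along these lines works (a sketch; the iteration count is arbitrary):

#include <stdio.h>
#include <sys/time.h>

int main(void) {
    enum { N = 1000000 };
    struct timeval start, end, tv;
    gettimeofday(&start, NULL);
    for (int i = 0; i < N; i++)
        gettimeofday(&tv, NULL);    /* the call being measured */
    gettimeofday(&end, NULL);
    double total_us = (end.tv_sec - start.tv_sec) * 1e6
                    + (end.tv_usec - start.tv_usec);
    printf("%.3f us per gettimeofday call\n", total_us / N);
    return 0;
}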

While using the TSC is a reasonable idea, modern Linux already has a userspace TSC-based gettimeofday - where possible, the vdso will pull in an implementation of gettimeofday that applies an offset (read from a shared kernel-user memory segment) to rdtsc's value, thus computing the time of day without entering the kernel. However, some CPU models don't have a TSC synchronized between different cores or different packages, and so this can end up being disabled. If you want high performance timing, you might first want to consider finding a CPU model that does have a synchronized TSC.

That said, if you're willing to sacrifice a significant amount of resolution (your timing will only be accurate to the last tick, meaning it could be off by tens of milliseconds), you can use CLOCK_MONOTONIC_COARSE or CLOCK_REALTIME_COARSE with clock_gettime. These are implemented in the vdso as well, and are guaranteed not to call into the kernel (for recent kernels and glibc).
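A sketch of checking what resolution the coarse clock actually gives you (clock_getres reports the tick granularity) before deciding it's acceptable:

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec res, now;
    /* CLOCK_MONOTONIC_COARSE is Linux-specific; its resolution is the
       kernel tick (often 1-10 ms), which clock_getres reports. */
    clock_getres(CLOCK_MONOTONIC_COARSE, &res);
    clock_gettime(CLOCK_MONOTONIC_COARSE, &now);
    printf("resolution: %ld ns, now: %ld.%09ld\n",
           res.tv_nsec, (long)now.tv_sec, now.tv_nsec);
    return 0;
}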


