How to Get the clock_gettime(2) Clock in Shell

How to check whether the system supports a monotonic clock?

Per the letter of POSIX, you may in fact need a runtime test even if the constant CLOCK_MONOTONIC is defined. The official way to handle this is with the _POSIX_MONOTONIC_CLOCK "feature-test macro", but those macros have really complicated semantics: quoting http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/unistd.h.html ,

If a symbolic constant is not defined or is defined with the value -1, the option is not supported for compilation. If it is defined with a value greater than zero, the option shall always be supported when the application is executed. If it is defined with the value zero, the option shall be supported for compilation and might or might not be supported at runtime.

Translating that three-way distinction into code would give you something like this:

#if !defined _POSIX_MONOTONIC_CLOCK || _POSIX_MONOTONIC_CLOCK < 0
    clock_gettime(CLOCK_REALTIME, &spec);
#elif _POSIX_MONOTONIC_CLOCK > 0
    clock_gettime(CLOCK_MONOTONIC, &spec);
#else
    if (clock_gettime(CLOCK_MONOTONIC, &spec))
        clock_gettime(CLOCK_REALTIME, &spec);
#endif

But it's simpler and more readable if you just always do the runtime test when CLOCK_MONOTONIC itself is defined:

#ifdef CLOCK_MONOTONIC
    if (clock_gettime(CLOCK_MONOTONIC, &spec))
#endif
        clock_gettime(CLOCK_REALTIME, &spec);

This increases the size of your code by some trivial amount on current-generation OSes that do support CLOCK_MONOTONIC, but the readability benefits are worth it in my opinion.

There is also a pretty strong argument for using CLOCK_MONOTONIC unconditionally: you're more likely to run into an OS that doesn't support clock_gettime at all (e.g. Mac OS X still doesn't have it, as far as I know) than an OS that has clock_gettime but not CLOCK_MONOTONIC.
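
A minimal sketch of that approach, assuming a hypothetical HAVE_CLOCK_GETTIME macro supplied by your build system (and monotonic_now is just an illustrative name): use CLOCK_MONOTONIC whenever clock_gettime exists at all, and fall back to gettimeofday() only on platforms that lack clock_gettime entirely.

#include <sys/time.h>
#include <time.h>

/* HAVE_CLOCK_GETTIME is a hypothetical build-system macro, not something
 * POSIX defines; it stands in for "this platform has clock_gettime". */
static void monotonic_now(struct timespec *spec)
{
#ifdef HAVE_CLOCK_GETTIME
    clock_gettime(CLOCK_MONOTONIC, spec);
#else
    struct timeval tv;
    gettimeofday(&tv, NULL);            /* wall-clock fallback; not monotonic */
    spec->tv_sec  = tv.tv_sec;
    spec->tv_nsec = tv.tv_usec * 1000;
#endif
}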

Difference between CLOCK_REALTIME and CLOCK_MONOTONIC?

CLOCK_REALTIME represents the machine's best guess as to the current wall-clock, time-of-day time. As Ignacio and MarkR say, this means that CLOCK_REALTIME can jump forwards and backwards as the system time-of-day clock is changed, including by NTP.

CLOCK_MONOTONIC represents the absolute elapsed wall-clock time since some arbitrary, fixed point in the past. It isn't affected by changes in the system time-of-day clock.

If you want to compute the elapsed time between two events observed on a single machine without an intervening reboot, CLOCK_MONOTONIC is the best option.
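
For example, a typical elapsed-time measurement looks like this (a small sketch, not taken from the original answer; older glibc may need -lrt):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... the work being timed ... */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.9f s\n", elapsed);
    return 0;
}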

Note that on Linux, CLOCK_MONOTONIC does not measure time spent in suspend, although by the POSIX definition it should. You can use the Linux-specific CLOCK_BOOTTIME for a monotonic clock that keeps running during suspend.

Is clock_gettime() adequate for submicrosecond timing?

No. You'll have to use platform-specific code to do it. On x86 and x86-64, you can use 'rdtsc' to read the Time Stamp Counter.

Just port the rdtsc assembly you're using.

#include <stdint.h>

__inline__ uint64_t rdtsc(void) {
    uint32_t lo, hi;
    /* Serialize with cpuid so the TSC read can't be reordered before
       earlier instructions. */
    __asm__ __volatile__ (
        "xorl %%eax,%%eax \n cpuid"
        ::: "%rax", "%rbx", "%rcx", "%rdx");
    /* We cannot use "=A", since this would use %rax on x86_64 and return
       only the lower 32 bits of the TSC. */
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}
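
A usage sketch (not from the original answer): subtracting two readings gives elapsed TSC ticks, which you still have to convert to seconds yourself using the CPU-specific TSC frequency, shown here as a made-up placeholder.

#include <stdio.h>
#include <stdint.h>

/* TSC_HZ is a hypothetical placeholder; on a real system you'd measure the
 * TSC frequency or obtain it from the OS, and the TSC is only a stable time
 * base on CPUs with the constant_tsc / nonstop_tsc features. */
#define TSC_HZ 3.4e9

int main(void)
{
    uint64_t t0 = rdtsc();              /* rdtsc() as defined above */
    /* ... code under test ... */
    uint64_t t1 = rdtsc();
    printf("%llu ticks = %.9f s\n",
           (unsigned long long)(t1 - t0),
           (double)(t1 - t0) / TSC_HZ);
    return 0;
}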

Timing a process in C using clock(), time(), clock_gettime() and the rdtsc() intrinsic returns confusing values

In case someone else comes across this: I found this paper and am using it to time my code: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
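
For reference, the start/stop pattern that paper recommends looks roughly like the sketch below (reconstructed from memory, not copied from the paper): a serializing CPUID before RDTSC at the start, and RDTSCP followed by CPUID at the end. Details such as error handling, repeated runs, and CPU pinning are covered in the paper itself.

#include <stdint.h>

/* Rough sketch of the white paper's measurement fences for x86-64 with
 * GCC/Clang inline asm. */
static inline uint64_t bench_start(void)
{
    uint32_t hi, lo;
    __asm__ __volatile__ ("cpuid\n\t"           /* serialize before reading the TSC */
                          "rdtsc\n\t"
                          "mov %%edx, %0\n\t"
                          "mov %%eax, %1\n\t"
                          : "=r" (hi), "=r" (lo)
                          :: "%rax", "%rbx", "%rcx", "%rdx");
    return (uint64_t)hi << 32 | lo;
}

static inline uint64_t bench_end(void)
{
    uint32_t hi, lo;
    __asm__ __volatile__ ("rdtscp\n\t"          /* waits for earlier instructions to finish */
                          "mov %%edx, %0\n\t"
                          "mov %%eax, %1\n\t"
                          "cpuid\n\t"           /* keep later instructions from reordering upward */
                          : "=r" (hi), "=r" (lo)
                          :: "%rax", "%rbx", "%rcx", "%rdx");
    return (uint64_t)hi << 32 | lo;
}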

clock_gettime takes longer to execute when program run from terminal

Just add more iterations to give the CPU time to ramp up to max clock speed. Your "slow" times are with the CPU at a low-power idle clock speed.

QtCreator apparently uses enough CPU time to make this happen before your program runs, or else you're compiling + running and the compilation process serves as a warm-up. (vs. bash's fork/execve being lighter weight.)

See Idiomatic way of performance evaluation? for more about doing warm-up runs when benchmarking, and also Why does this delay-loop start to run faster after several iterations with no sleep?

On my i7-6700k (Skylake) running Linux, increasing the loop iteration count to 1000 is sufficient to get the final iterations running at full clock speed, even after the first couple iterations handling page faults, warming up the iTLB, uop cache, data caches, and so on.
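
The program from the question isn't reproduced here, but judging from the output it is presumably a loop along these lines (a sketch under that assumption), where each iteration measures the cost of a clock_gettime() call and prints it:

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* 1000 iterations is enough for the last ones to run at full clock speed. */
    for (int i = 0; i < 1000; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        clock_gettime(CLOCK_MONOTONIC, &b);
        long ns = (b.tv_sec - a.tv_sec) * 1000000000L
                + (b.tv_nsec - a.tv_nsec);
        printf("It took %ld ns\n", ns);
    }
    return 0;
}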

$ ./a.out      
It took 244 ns
It took 150 ns
It took 73 ns
It took 76 ns
It took 75 ns
It took 71 ns
It took 72 ns
It took 72 ns
It took 69 ns
It took 75 ns
...
It took 74 ns
It took 68 ns
It took 69 ns
It took 72 ns
It took 72 ns # 382 "slow" iterations in this test run (copy/paste into wc to check)
It took 15 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 16 ns
It took 16 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 14 ns
It took 16 ns
...

On my system, energy_performance_preference is set to balance_performance, so the hardware P-state governor isn't as aggressive as with performance. Use grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference to check; use sudo to change it:

sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo balance_performance > "$i";done'

Even running it under perf stat ./a.out is enough to ramp up to max clock speed very quickly, though; it really doesn't take much. But bash's command parsing after you press return is very cheap: not much CPU work is done before it calls execve and reaches main in your new process.

The printf with line-buffered output is what takes most of the CPU time in your program, BTW. That's why it takes so few iterations to ramp up to speed. For example, if you run perf stat --all-user -r10 ./a.out, you'll see the user-space core clock cycles per second average only about 0.4 GHz, with the rest of the time spent in the kernel in write system calls.


