Is clock_gettime() Adequate for Submicrosecond Timing?

Is clock_gettime() adequate for submicrosecond timing?

No. You'll have to use platform-specific code to do it. On x86 and x86-64, you can use 'rdtsc' to read the Time Stamp Counter.

Just port the rdtsc assembly you're using.

#include <stdint.h>

static __inline__ uint64_t rdtsc(void) {
  uint32_t lo, hi;
  /* cpuid is a serializing instruction: earlier instructions complete
     before the TSC is read. The clobber list below assumes x86-64. */
  __asm__ __volatile__ (
  "xorl %%eax,%%eax \n        cpuid"
  ::: "%rax", "%rbx", "%rcx", "%rdx");
  /* We cannot use "=A", since this would use %rax on x86_64 and return only the lower 32 bits of the TSC */
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return (uint64_t)hi << 32 | lo;
}

faster equivalent of gettimeofday

Have you actually benchmarked, and found gettimeofday to be unacceptably slow?

At the rate of 100 messages a second, you have 10ms of CPU time per message. If you have multiple cores, assuming it can be fully parallelized, you can easily increase that by 4-6x - that's 40-60ms per message! The cost of gettimeofday is unlikely to be anywhere near 10ms - I'd suspect it to be more like 1-10 microseconds (on my system, microbenchmarking it gives about 1 microsecond per call - try it for yourself). Your optimization efforts would be better spent elsewhere.
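To try that microbenchmark yourself, a minimal sketch could look like the following. The helper name gettimeofday_cost_ns is hypothetical, and the number it reports depends on the machine, the kernel, and the active clocksource.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: average cost of one gettimeofday() call in
   nanoseconds, measured over `iterations` back-to-back calls. */
static double gettimeofday_cost_ns(long iterations) {
    struct timespec start, end;
    struct timeval tv;
    volatile long sink = 0;  /* keeps the loop from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        gettimeofday(&tv, NULL);
        sink += tv.tv_usec;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling gettimeofday_cost_ns(1000000) and printing the result gives a rough per-call cost. The loop overhead is included, so treat the figure as an upper bound.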

While using the TSC is a reasonable idea, modern Linux already has a userspace TSC-based gettimeofday - where possible, the vdso will pull in an implementation of gettimeofday that applies an offset (read from a shared kernel-user memory segment) to rdtsc's value, thus computing the time of day without entering the kernel. However, some CPU models don't have a TSC synchronized between different cores or different packages, and so this can end up being disabled. If you want high performance timing, you might first want to consider finding a CPU model that does have a synchronized TSC.
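One way to see whether the kernel is actually using the TSC is to read the active clocksource from sysfs. The sketch below assumes a Linux system that exposes the usual clocksource0 sysfs entry; the helper name current_clocksource is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Copies the active kernel clocksource name (for example "tsc" or "hpet")
   into buf. Returns 0 on success, -1 if the sysfs file cannot be read
   (non-Linux systems, restricted containers). */
static int current_clocksource(char *buf, size_t len) {
    const char *path =
        "/sys/devices/system/clocksource/clocksource0/current_clocksource";
    if (len == 0)
        return -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';  /* strip the trailing newline */
    return 0;
}
```

If the name that comes back is not "tsc", the fast userspace path described above is probably not in use on that machine.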

That said, if you're willing to sacrifice a significant amount of resolution (your timing will only be accurate to the last tick, meaning it could be off by up to one tick, typically 1 to 10 milliseconds depending on the kernel's HZ setting), you could use CLOCK_MONOTONIC_COARSE or CLOCK_REALTIME_COARSE with clock_gettime. This is also implemented in the vdso, so it should not call into the kernel (for recent kernels and glibc).
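To see how much resolution a given clock offers on your own system, you can ask clock_getres(). This is a small sketch; the helper name clock_resolution_ns is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Returns the resolution reported for clock `id` in nanoseconds,
   or -1 if the clock is not available. */
static long long clock_resolution_ns(clockid_t id) {
    struct timespec res;
    if (clock_getres(id, &res) != 0)
        return -1;
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}
```

Comparing clock_resolution_ns(CLOCK_MONOTONIC) with clock_resolution_ns(CLOCK_MONOTONIC_COARSE) on Linux should show the difference: the coarse clock is expected to report roughly one tick, while the regular clock usually reports a much finer value.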

Use of _COARSE variants in clock_gettime() still calls sys_clock_gettime() system call

Looks like it should still be a system call, if this patch is anything to go by:
http://lwn.net/Articles/342018/

It just doesn't call specific functions to fetch the EXACT time from some hardware registers, which, at least on some hardware, is quite slow.

But there are lots of factors:

What hardware is it? clock_gettime() should be a virtual system call [vsyscall] on x86 and x86-64.

And finally, if you call it "as the first parameter" in a lot of function calls, then the overhead you're seeing is likely just how long the call itself takes.

I doubt there is any way to get the current time without at least a virtual system call, since you do need some information from the kernel to get the current time. Where is it supposed to find the current time, if it doesn't make some sort of call to kernel code?

A virtual system call works by mapping a little bit of "kernel code" into user-space, which has read-only access to certain pieces of kernel data, in particular time information such as the current time. (On Linux x86-64 the vDSO covers calls such as clock_gettime(), gettimeofday(), time() and getcpu(); process IDs and CPU usage stats are not served this way, as far as I know.) This allows the call to be done completely in user-space, and thus is much faster than a "real" system call that transitions into kernel mode and back out again.
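On Linux with glibc you can confirm that this code is mapped into your process by asking the auxiliary vector for its address. This is only a sketch; vdso_base is a hypothetical helper name, and the call is Linux-specific.

```c
#include <sys/auxv.h>

/* Returns the address at which the kernel mapped the vDSO into this
   process, or 0 if the auxiliary vector reports none. Linux/glibc only. */
static unsigned long vdso_base(void) {
    return getauxval(AT_SYSINFO_EHDR);
}
```

A non-zero result means the vDSO is present; a result of 0 suggests it has been disabled, in which case time calls fall back to real system calls.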

Using CPU counters versus gettimeofday?

CPU counters and wall clocks are different tools for different purposes.

When to use a wall clock:

When you want to measure time in a standard time unit (such as seconds). If you want to measure how long X task takes, use a wall clock.

Examples:

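clock() (note: on POSIX systems this reports CPU time consumed by the process rather than elapsed wall time)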
gettimeofday()
clock_gettime(2)
etc...

When to use RDTSC:

If you're looking to measure the relative times of two different tasks to as high precision as possible, then RDTSC may be suitable.

RDTSC measures the number of pseudo-cycles that have elapsed since the CPU started up. Often (but not always), the tick rate is equal to the CPU clock speed of your processor. But there's no easy way to determine the exact number of "ticks per second" without actually measuring it against a wall clock.
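Measuring the tick rate against a wall clock can be sketched as follows. This assumes GCC or Clang on x86 (for the __rdtsc() intrinsic from x86intrin.h) and a TSC that runs at a constant rate; the helper name tsc_ticks_per_second is hypothetical, and the result is only an estimate.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* __rdtsc() on GCC and Clang */
#endif

/* Estimates TSC ticks per second by counting ticks across a sleep of
   `millis` milliseconds that is timed with CLOCK_MONOTONIC.
   Returns 0.0 on non-x86 targets. */
static double tsc_ticks_per_second(long millis) {
#if defined(__x86_64__) || defined(__i386__)
    struct timespec t0, t1;
    struct timespec pause = { millis / 1000, (millis % 1000) * 1000000L };

    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = __rdtsc();
    nanosleep(&pause, NULL);
    uint64_t c1 = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)(c1 - c0) / seconds;
#else
    (void)millis;
    return 0.0;
#endif
}
```

A longer sleep gives a steadier estimate. If the thread migrates between cores whose TSCs are not synchronized, the result can be badly off.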

However, RDTSC is about as low overhead as it can get for a time function. So it is well suited for micro-optimizations when you're comparing one implementation against another to determine which is faster (as opposed to how much absolute time each one takes).

Other things to note:

Most benchmarking can be done sufficiently well with wall clocks. So the use of RDTSC is pretty limited. Stick with standardized functions when possible.
High precision wall clocks are typically implemented on top of RDTSC. So if you're trying to use RDTSC to get a high-precision measurement of wall time, you'll just be reinventing the wheel.

As a side note, I use RDTSC both for seeding RNGs and as an anti-cheating measure for my overclocker benchmarks.

Measure elapsed time in C?

The C Standard does not define a portable way to do this. The time() library function has a resolution of 1 second, which is inappropriate for your purpose. As mentioned by @Puck, C11 did introduce timespec_get() to retrieve a more precise time value, but this function is not widely supported and may not provide the expected accuracy.

Other functions are available on selected systems:

The POSIX standard defines gettimeofday() and clock_gettime() which can return precise real time with the argument CLOCK_REALTIME.
macOS (OS X) has a more precise alternative: clock_gettime_nsec_np, which returns a 64-bit value in nanosecond increments.
Microsoft documents this for Windows.
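Where timespec_get() is available, measuring an elapsed interval could look like the sketch below. The helper names wall_now and timespec_diff_seconds are hypothetical, and the accuracy you actually get depends on the platform.

```c
#include <time.h>

/* Reads the current calendar time with C11 timespec_get().
   Returns 0 on success, -1 on failure. */
static int wall_now(struct timespec *ts) {
    return timespec_get(ts, TIME_UTC) == TIME_UTC ? 0 : -1;
}

/* Difference b - a in seconds, as a double. */
static double timespec_diff_seconds(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec)
         + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}
```

Call wall_now() before and after the code being measured and pass the two values to timespec_diff_seconds(). The same subtraction works for values filled in by clock_gettime(), since both use struct timespec.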

Note however that performing precise and reliable sub-microsecond benchmarks is a difficult game to say the least.

CPU TSC fetch operation especially in multicore-multi-processor environment

On newer CPUs (i7 Nehalem+ IIRC) the TSC is synchronized across all cores and runs at a constant rate.
So for a single processor, or more than one processor on a single package or mainboard(!), you can rely on a synchronized TSC.

From the Intel System Manual, section 16.12.1:

The time stamp counter in newer processors may support an enhancement,
referred to as invariant TSC. Processors support for invariant TSC is
indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a
constant rate in all ACPI P-, C-. and T-states. This is the
architectural behavior moving forward.

On older processors you cannot rely on either constant rate or synchronization.
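The CPUID bit named in the quote can be tested directly. The sketch below assumes GCC or Clang on x86, using the __get_cpuid() helper from cpuid.h; has_invariant_tsc is a hypothetical name.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper: __get_cpuid() */
#endif

/* Returns 1 if CPUID.80000007H:EDX[8] (invariant TSC) is set,
   0 if it is clear or the leaf is unsupported, -1 on non-x86 targets. */
static int has_invariant_tsc(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> 8) & 1u);
#else
    return -1;
#endif
}
```

A result of 1 only says the TSC rate is invariant on that processor. It does not by itself prove that the counters are synchronized across sockets.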

Edit: At least on multiple processors in a single package or mainboard the invariant TSC is synchronized. The TSC is reset to zero at a /RESET and then ticks onward at a constant rate on each processor, without drift. The /RESET signal is guaranteed to arrive at each processor at the same time.
