What Is Microbenchmarking

What is microbenchmarking?

It means exactly what it says on the tin: measuring the performance of something "small", such as a system call into the kernel of an operating system.

The danger is that people may use whatever results they obtain from microbenchmarking to dictate optimizations. And as we all know:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." -- Donald Knuth

There can be many factors that skew the results of microbenchmarks. Compiler optimization is one of them. Another is measurement overhead: if the operation being measured takes so little time that the act of measuring it takes longer than the operation itself, the results will be skewed as well.
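As a rough sketch of that second point: the GetTime() helper used in the examples below could be a thin wrapper over a monotonic clock (assuming a POSIX system with clock_gettime; use QueryPerformanceCounter or similar elsewhere), and timing back-to-back calls to it shows how expensive the measurement itself is. Anything cheaper than that per-call cost can't be timed directly this way.

#include <stdio.h>
#include <time.h>

// Assumed helper for the examples below: wall-clock seconds from a monotonic clock.
static double GetTime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const int n = 1000000;
    double start = GetTime();
    for (int i = 0; i < n; ++i)
        GetTime();                           // measure the cost of measuring
    double elapsed = GetTime() - start;
    printf("Average cost of one GetTime() call: %g ns\n", elapsed / n * 1e9);
    return 0;
}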

For example, someone might take a microbenchmark of the overhead of for loops:

void TestForLoop()
{
    double start = GetTime();

    for (int i = 0; i < 1000000000; ++i)
    {
    }

    double elapsed = GetTime() - start;
    double elapsedPerIteration = elapsed / 1000000000;
    printf("Time elapsed for each iteration: %g seconds\n", elapsedPerIteration);
}

Obviously, compilers can see that the loop does absolutely nothing and generate no code for it at all, so the values of elapsed and elapsedPerIteration are pretty much useless.

Even if the loop does something:

void TestForLoop()
{
    int sum = 0;
    double start = GetTime();

    for (int i = 0; i < 1000000000; ++i)
    {
        ++sum;
    }

    double elapsed = GetTime() - start;
    double elapsedPerIteration = elapsed / 1000000000;
    printf("Time elapsed for each iteration: %g seconds\n", elapsedPerIteration);
}

The compiler may see that the variable sum isn't going to be used for anything and optimize it away, and optimize away the for loop as well. But wait! What if we do this:

void TestForLoop()
{
    int sum = 0;
    double start = GetTime();

    for (int i = 0; i < 1000000000; ++i)
    {
        ++sum;
    }

    double elapsed = GetTime() - start;
    double elapsedPerIteration = elapsed / 1000000000;
    printf("Time elapsed for each iteration: %g seconds\n", elapsedPerIteration);
    printf("Sum: %d\n", sum); // Added
}

The compiler may be smart enough to realize that sum will always end up as the same constant value, compute it at compile time, and optimize the loop away as well. Many people would be surprised by the optimizing capabilities of today's compilers.
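One rough way to keep this particular example from being deleted (a sketch under the same GetTime() assumption as above, not a general fix) is to make the accumulator volatile, so every increment counts as an observable side effect:

void TestForLoopVolatile()
{
    volatile int sum = 0;    // volatile: each ++sum must actually be performed
    double start = GetTime();

    for (int i = 0; i < 1000000000; ++i)
    {
        ++sum;
    }

    double elapsed = GetTime() - start;
    double elapsedPerIteration = elapsed / 1000000000;
    printf("Time elapsed for each iteration: %g seconds (sum = %d)\n",
           elapsedPerIteration, sum);
}

Note that this changes what is being measured: volatile forces a real load and store on every iteration, so the loop overhead is now mixed with memory traffic. That kind of trade-off is exactly why microbenchmarking is tricky.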

But what about things that compilers can't optimize away?

void TestFileOpenPerformance()
{
    FILE* file = NULL;
    double start = GetTime();

    for (int i = 0; i < 1000000000; ++i)
    {
        file = fopen("testfile.dat", "r"); // fopen needs a mode argument
        if (file)
            fclose(file);
    }

    double elapsed = GetTime() - start;
    double elapsedPerIteration = elapsed / 1000000000;
    printf("Time elapsed for each file open: %g seconds\n", elapsedPerIteration);
}

Even this is not a useful test! The operating system caches files that are opened frequently: after the first open, the file's data and metadata sit in memory, so subsequent opens are far cheaper than a "cold" open would be in a real program. Pretty much all operating systems do this kind of caching. Something similar happens when you launch applications: the operating system may notice which applications you open most often and preload their code into memory when you boot the computer.
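To see the caching effect directly, here is a small sketch (again assuming the GetTime() helper from above) that times each of the first few opens individually; on most systems the first open is noticeably slower than the later ones:

void ShowFileCacheEffect()
{
    for (int i = 1; i <= 5; ++i)
    {
        double start = GetTime();
        FILE* file = fopen("testfile.dat", "r");
        if (file)
            fclose(file);
        double elapsed = GetTime() - start;
        printf("Open #%d took %g seconds\n", i, elapsed);
    }
}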

In fact, there are countless variables that come into play: locality of reference (e.g. arrays vs. linked lists), effects of caches and memory bandwidth, compiler inlining, compiler implementation, compiler switches, number of processor cores, optimizations at the processor level, operating system schedulers, operating system background processes, etc.

So microbenchmark results often aren't a useful metric. Microbenchmarking definitely does not replace whole-program benchmarks with well-defined test cases, or profiling. Write readable code first, then profile to see what, if anything, needs to be optimized.

I would like to emphasize that microbenchmarks are not evil per se, but one has to use them carefully (and that's true for lots of other things related to computers).

Difference between Benchmarking and Profiling

A benchmark measures the time for some whole operation, e.g. I/O operations per second under some workload. The result is typically a single number, in seconds or operations per second, or a data set with results for different parameters that you can graph.

You might use a benchmark to compare the same software on different hardware, or different versions of some other software that your benchmark interacts with, e.g. benchmarking max connections per second with different Apache settings.


Profiling is not aimed at comparing different things: it's about understanding the behaviour of a program. A profile result might be a table of time taken per function, or even per instruction with a sampling profiler. You can tell it's a profile not a benchmark because it makes no sense to say "that function took the least time so we'll keep that one and stop using the rest".

Read the Wikipedia article to learn more about it: https://en.wikipedia.org/wiki/Profiling_(computer_programming)

You use a profile to figure out where to optimize. A 10% speedup in a function where your program spends 99% of its time is more valuable than a 100% speedup in any other function. Even better is when you can improve your high-level design so the expensive function is called less, as well as just making it faster.
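To put rough numbers on that claim: if 99% of the run time is spent in one function and you make that function 10% faster, the total drops to 0.99 * 0.9 + 0.01 = 0.901 of the original, roughly a 10% overall saving. Doubling the speed of a function that accounts for the remaining 1% gives at best 0.99 + 0.005 = 0.995, a saving of only 0.5%.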


Microbenchmarking is a specific form of benchmarking. It means you're testing one super-specific thing to measure just that in isolation, not the overall performance of anything that's really useful.

Example microbenchmark results:

  • Intel Haswell's L1 cache load-use latency is 4 cycles.

  • This version of memcpy achieves 80% of the throughput of the other version.

  • mov eax, ecx has 0-cycle latency on Haswell, but mov ecx, ecx has 1-cycle latency (mov-elimination only works between different registers on Intel). See that link for the full asm source of a static executable, and performance-counter results from running it with a couple of different loop bodies to demonstrate mov-elimination.

    Using CPU performance counters to measure how a micro-benchmark runs is a good way to do experiments to find out how CPUs work internally. See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent for more examples of that. In that case, you're profiling your micro-benchmark to learn what's making it run at that speed. (Often you're more interested in the perf counters like uops_executed than you are in the actual time or clock cycle count, e.g. to test for micro-fusion / un-lamination without needing to make a loop where that actually affects cycles per iteration.)

Example non-micro benchmark results:

  • compressing this 100MB collection of files took 23 seconds with 7-zip (with specific options and hardware).
  • compiling a Linux kernel took 99 seconds on some hardware / software combination.

See also https://en.wikipedia.org/wiki/Benchmark_(computing)#Types_of_benchmarks.

Micro-benchmarking is a special case of benchmarking. If you do it right, it tells you which operations are expensive and which are cheap, which helps you while trying to optimize. If you do it wrong, you probably didn't even measure what you set out to measure at all. e.g. you wrote some C to test for loops vs. while loops, but the compiler made different code for different reasons, and your results are meaningless. (Different ways to express the same logic almost never matter with modern optimizing compilers; don't waste time on this.) Micro-benchmarking is hard.

The other way to tell it's a micro-benchmark is that you usually need to look at the compiler's asm output to make sure it's testing what you wanted it to test. (e.g. that it didn't optimize across iterations of your repeat-10M-times loop by hoisting something expensive out of the loop that's supposed to repeat the whole operation enough times to give duration that can be accurately measured.)

Micro-benchmarks can distort things, because they test your function with caches hot and branch predictors primed, and they don't run any other code between invocations of the code under test. This can make huge loop unrolling look good, when as part of a real program it would lead to more cache misses. Similarly, it makes big lookup-tables look good, because the whole lookup table ends up in cache. The full program usually dirties enough cache between calls to the function that the lookup table doesn't always hit in cache, so it would have been cheaper just to compute something. (Most programs are memory-bound. Re-computing something not too complex is often as fast as looking it up.)

How do I write a correct micro-benchmark in Java?

Tips about writing micro benchmarks from the creators of Java HotSpot:

Rule 0: Read a reputable paper on JVMs and micro-benchmarking. A good one is Brian Goetz, 2005. Do not expect too much from micro-benchmarks; they measure only a limited range of JVM performance characteristics.

Rule 1: Always include a warmup phase which runs your test kernel all the way through, enough to trigger all initializations and compilations before timing phase(s). (Fewer iterations is OK on the warmup phase. The rule of thumb is several tens of thousands of inner loop iterations.)

Rule 2: Always run with -XX:+PrintCompilation, -verbose:gc, etc., so you can verify that the compiler and other parts of the JVM are not doing unexpected work during your timing phase.

Rule 2.1: Print messages at the beginning and end of timing and warmup phases, so you can verify that there is no output from Rule 2 during the timing phase.

Rule 3: Be aware of the difference between -client and -server, and OSR and regular compilations. The -XX:+PrintCompilation flag reports OSR compilations with an at-sign to denote the non-initial entry point, for example: Trouble$1::run @ 2 (41 bytes). Prefer server to client, and regular to OSR, if you are after best performance.

Rule 4: Be aware of initialization effects. Do not print for the first time during your timing phase, since printing loads and initializes classes. Do not load new classes outside of the warmup phase (or final reporting phase), unless you are testing class loading specifically (and in that case load only the test classes). Rule 2 is your first line of defense against such effects.

Rule 5: Be aware of deoptimization and recompilation effects. Do not take any code path for the first time in the timing phase, because the compiler may junk and recompile the code, based on an earlier optimistic assumption that the path was not going to be used at all. Rule 2 is your first line of defense against such effects.

Rule 6: Use appropriate tools to read the compiler's mind, and expect to be surprised by the code it produces. Inspect the code yourself before forming theories about what makes something faster or slower.

Rule 7: Reduce noise in your measurements. Run your benchmark on a quiet machine, and run it several times, discarding outliers. Use -Xbatch to serialize the compiler with the application, and consider setting -XX:CICompilerCount=1 to prevent the compiler from running in parallel with itself. Try your best to reduce GC overhead: set -Xmx (large enough) equal to -Xms, and use -XX:+UseEpsilonGC if it is available.

Rule 8: Use a library for your benchmark, as it is probably more efficient and was already debugged for this sole purpose, such as JMH, Caliper, or Bill and Paul's Excellent UCSD Benchmarks for Java.

Is it possible to use a micro-benchmark framework to only time some statements?

I guess calls to a DB are usually expensive enough to eliminate most of the problems with microbenchmarking, so your approach was probably fine. If you're measuring it in production, repeating the measurement many times, and don't care about a few nanoseconds of overhead, stick with System.nanoTime.

You're doing something very different from microbenchmarking in the usual sense. You're not trying to optimize a tiny piece of code, and you don't want to eliminate external influences.

Microbenchmarking a part of a method makes no sense to me, as a method gets optimized as a whole (and possibly also inlined). It's a different level.

I don't think any framework would help; all it could do in your case is automate the work, which you don't seem to need. Note that System.nanoTime itself may take several hundred cycles (which is probably fine in your case).

Java 8 framework for microbenchmarking

Use JMH:

JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM.

Java Microbenchmarking Harness vs System.getNanotime()

JMH maintainer here.

Let me ask the leading question: why would one use a library if you can code most of it yourself? The answer is actually simple: of course you can write everything yourself given infinite time, but in practice we have to reuse code to fit the work into a reasonable amount of time.

Now, it might seem that taking two timestamps around the code is enough to measure its performance. However, you have to control what exactly you are measuring: whether you are still in the transitional warmup phase, whether your code actually executes or you are measuring a hollow shell left after optimization, how statistically significant your effect is, and so on. Good benchmark frameworks try to help with that.

You can get a glimpse of the issues you will have to face by looking at the JMH Samples, our benchmarking talks, or the relevant SO answers. Oh, and using nanoTime is harder than you think.

Idiomatic way of performance evaluation?

Generally: for repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; it's easy to distort results unless you understand the implications of doing that. For very short things, throughput and latency are different, so measure both separately, by making each iteration use the result of the previous one or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program; e.g. loop unrolling and lookup tables often look good because there's no pressure on the I-cache or D-cache from anything else.)

Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
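A minimal sketch of that approach (assuming a POSIX clock_gettime timer; the operation under test is left as a placeholder comment):

#include <stdio.h>
#include <time.h>

#define SAMPLES 1000

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    static double elapsed[SAMPLES];

    for (int i = 0; i < SAMPLES; ++i)
    {
        double start = now_sec();
        /* ... the operation under test goes here ... */
        elapsed[i] = now_sec() - start;
    }

    for (int i = 0; i < SAMPLES; ++i)        // heavy-weight printing only after the timed loop
        printf("iteration %d: %g s\n", i, elapsed[i]);
    return 0;
}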

This question is way too broad to say anything more specific.

Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs, and then runs it for a specified interval, counting how many iterations it completes.

Beware common microbenchmark pitfalls:

  • Failure to warm up code and data caches: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset, an example of a wrong conclusion based on this mistake.)

  • Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.

  • Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).

    related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.

  • Most integer and FP arithmetic asm instructions (except divide and square root, which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. The exceptions are subnormal (aka denormal) floating point, which is very slow, and, in some cases (e.g. legacy x87 but not SSE2), producing NaN or Inf, which can also be slow.

  • On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has latency from inputs to outputs, and it has throughput, which is higher if it's run repeatedly with independent inputs. e.g. an add instruction on a Skylake CPU has 4/clock throughput but 1 cycle latency, so dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer ones, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (where the address of the next load is easy to calculate) is often much faster than walking a linked list (where the address of the next load isn't available until the previous load completes).

    Obviously performance can differ between CPUs; in the big picture it's usually rare for version A to be faster on Intel and version B to be faster on AMD, but that can easily happen at a small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.

  • Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.

  • You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc, so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.) A minimal sketch of this trick follows this list, just before the litmus tests.

    For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.

  • If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.

  • Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.

Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
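Here is a minimal sketch of the volatile-sink trick mentioned in the list above (the array length and the dummy computation are just illustrative placeholders):

#include <stdio.h>

#define LEN 4096

int main(int argc, char **argv)
{
    (void)argv;
    static double out[LEN];

    // The input depends on argc, so the compiler can't constant-propagate it
    // the way it could with a compile-time constant.
    for (int i = 0; i < LEN; ++i)
        out[i] = (argc + i) * 1.000001;      // stand-in for the work under test

    // Volatile sink: argc's value is unknown at compile time, so in practice
    // the whole array has to be computed, but we only ever read one element.
    volatile double sink = out[argc % LEN];
    (void)sink;
    return 0;
}

For repeat loops you also need to stop the compiler from hoisting the work out of the loop (e.g. with inline asm, or with helpers like Google Benchmark's benchmark::DoNotOptimize), which is what the litmus tests below help you catch.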

Litmus tests:

  • If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)

  • Increasing the iteration count of a repeat loop should linearly increase the total time and not affect the calculated time-per-call. If it doesn't, then you have non-negligible measurement overhead, or your code was optimized away (e.g. hoisted out of the loop so it runs only once instead of N times). See the sketch after these litmus tests.

  • Vary other test parameters as a sanity check.
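A rough sketch of that linearity check (work() here is just a placeholder for whatever you're measuring, with a volatile sink so it doesn't get optimized away; the timer again assumes POSIX clock_gettime):

#include <stdio.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static volatile double sink;                 // keeps work() from being optimized away

static void work(long x)                     // stand-in for the operation under test
{
    sink = x * 1.000001;
}

int main(void)
{
    for (long n = 1000000; n <= 8000000; n *= 2)
    {
        double start = now_sec();
        for (long i = 0; i < n; ++i)
            work(i);
        double elapsed = now_sec() - start;
        // Total time should scale linearly with n; time per call should stay flat.
        printf("n = %ld  total = %g s  per call = %g ns\n",
               n, elapsed, elapsed / n * 1e9);
    }
    return 0;
}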


For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.


