How to Write a Correct Micro-Benchmark in Java

How do I write a correct micro-benchmark in Java?

Tips about writing micro benchmarks from the creators of Java HotSpot:

Rule 0: Read a reputable paper on JVMs and micro-benchmarking. A good one is Brian Goetz's 2005 article. Do not expect too much from micro-benchmarks; they measure only a limited range of JVM performance characteristics.

Rule 1: Always include a warmup phase which runs your test kernel all the way through, enough to trigger all initializations and compilations before timing phase(s). (Fewer iterations is OK on the warmup phase. The rule of thumb is several tens of thousands of inner loop iterations.)

Rule 2: Always run with -XX:+PrintCompilation, -verbose:gc, etc., so you can verify that the compiler and other parts of the JVM are not doing unexpected work during your timing phase.

Rule 2.1: Print messages at the beginning and end of timing and warmup phases, so you can verify that there is no output from Rule 2 during the timing phase.

Rule 3: Be aware of the difference between -client and -server, and OSR and regular compilations. The -XX:+PrintCompilation flag reports OSR compilations with an at-sign to denote the non-initial entry point, for example: Trouble$1::run @ 2 (41 bytes). Prefer server to client, and regular to OSR, if you are after best performance.

Rule 4: Be aware of initialization effects. Do not print for the first time during your timing phase, since printing loads and initializes classes. Do not load new classes outside of the warmup phase (or final reporting phase), unless you are testing class loading specifically (and in that case load only the test classes). Rule 2 is your first line of defense against such effects.

Rule 5: Be aware of deoptimization and recompilation effects. Do not take any code path for the first time in the timing phase, because the compiler may junk and recompile the code, based on an earlier optimistic assumption that the path was not going to be used at all. Rule 2 is your first line of defense against such effects.

Rule 6: Use appropriate tools to read the compiler's mind, and expect to be surprised by the code it produces. Inspect the code yourself before forming theories about what makes something faster or slower.

Rule 7: Reduce noise in your measurements. Run your benchmark on a quiet machine, and run it several times, discarding outliers. Use -Xbatch to serialize the compiler with the application, and consider setting -XX:CICompilerCount=1 to prevent the compiler from running in parallel with itself. Try your best to reduce GC overhead: set -Xmx (large enough) equal to -Xms, and use -XX:+UseEpsilonGC if it is available.

Rule 8: Use a benchmarking library such as JMH, Caliper, or Bill and Paul's Excellent UCSD Benchmarks for Java; it is probably more efficient and has already been debugged for this sole purpose.
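As a rough sketch of how several of these rules come together under JMH (Rule 8), the following is one possible skeleton; the class name, field values, iteration counts and JVM flags are placeholders, not a definitive recipe:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)       // Rule 1: explicit warmup phase
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 1, jvmArgsAppend = {                                    // Rules 2 and 7: diagnostics, less noise
        "-XX:+PrintCompilation", "-verbose:gc", "-Xbatch", "-Xms1g", "-Xmx1g"})
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class RulesSketchBenchmark {

    int seed = 42;  // non-final field, so the JIT cannot constant-fold the kernel

    @Benchmark
    public void kernel(Blackhole bh) {
        // The Blackhole keeps the result alive, so the kernel is not removed as dead code.
        bh.consume(Integer.toBinaryString(seed));
    }
}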

Max and min time benchmark in Java with JMH

If you run your @Benchmark with @BenchmarkMode(Mode.AverageTime), as a bonus, you will get an output like:

Result "StopAllBenchmarks.measureRight":
2.387 ±(99.9%) 0.307 ns/op [Average]
(min, avg, max) = (2.339, 2.387, 2.526), stdev = 0.080
CI (99.9%): [2.079, 2.694] (assumes normal distribution)

Notice the min, avg, max values.
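If you drive JMH from the Java API instead of reading the text report, the same minimum, average and maximum can be pulled from the run result. A minimal sketch, assuming a benchmark named StopAllBenchmarks.measureRight exists on the classpath:

import java.util.Collection;

import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.util.Statistics;

public class MinMaxReport {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include("StopAllBenchmarks.measureRight")   // benchmark to run (placeholder)
                .build();
        Collection<RunResult> results = new Runner(opt).run();
        for (RunResult r : results) {
            // The primary result carries the aggregate statistics over all measured iterations.
            Statistics stats = r.getPrimaryResult().getStatistics();
            System.out.printf("min=%.3f avg=%.3f max=%.3f %s%n",
                    stats.getMin(), stats.getMean(), stats.getMax(),
                    r.getPrimaryResult().getScoreUnit());
        }
    }
}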

JMH - How to correctly benchmark Thread Pools?

I solved this issue myself with the help of the other answerers. In the last edit (and in all the earlier edits) the problem was in my Gradle configuration: I was running the benchmark on all of my system threads. I use this Gradle plugin to run JMH, and before defining my benchmarks in the Gradle build script I had set the threads = 4 value, so you saw those strange benchmark results because JMH was trying to benchmark on all available threads while the thread pool was also doing work on all available threads. I removed this configuration, set the @State(Scope.Thread) and @Threads(1) annotations on the benchmark class, and edited the runInThreadPool() method a bit, to:

public static void runInThreadPool(int amountOfTasks, Blackhole bh, ExecutorService threadPool)
        throws InterruptedException, ExecutionException {
    Future<?>[] futures = new Future[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
        futures[i] = threadPool.submit(PrioritySchedulerSamples::doWork, (ThreadFactory) runnable -> {
            Thread thread = new Thread(runnable);
            thread.setPriority(10);
            return thread;
        });
    }
    for (Future<?> future : futures) {
        bh.consume(future.get());
    }

    threadPool.shutdownNow();
    threadPool.awaitTermination(10, TimeUnit.SECONDS);
}

So each thread in this thread pool runs at maximal priority.
And the benchmark results after all these changes:

Benchmark                                  (amountOfTasks)  Mode  Cnt         Score         Error  Units
PrioritySchedulerSamples.fixedThreadPool              2048  avgt    3   8021054,516 ± 2874987,327  ns/op
PrioritySchedulerSamples.noThreading                  2048  avgt    3  17583295,617 ± 5499026,016  ns/op

These results seem to be correct (especially for my system).

I also made a list of common problems in microbenchmarking thread pools and, basically, all other concurrent Java components:

  1. Make sure your microbenchmark itself executes in one thread: use the @Threads(1) and @State(Scope.Thread) annotations. (Use, for example, the htop command to find out how many threads are consuming the most CPU, and which ones.)
  2. Make sure you execute the task completely in your microbenchmark, and wait for all threads to finish it. (Maybe your microbenchmark doesn't wait for the tasks to complete?)
  3. Don't use Thread.sleep() to imitate real work; instead, use the Blackhole.consumeCPU(long tokens) method that JMH provides exactly for this purpose (see the sketch after this list).
  4. Make sure you know the component that you benchmark. (Obvious, but before this post I didn't know Java thread pools very well.)
  5. Make sure you know the compiler optimization effects described in the JMH samples; basically, know JMH very well.
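To illustrate point 3, here is a minimal sketch (class name and token count are arbitrary) of imitating CPU-bound work with Blackhole.consumeCPU instead of Thread.sleep():

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@Threads(1)
public class ConsumeCpuSketch {

    @Benchmark
    public void simulatedWork() {
        // Burns a roughly fixed amount of CPU; unlike Thread.sleep(), the thread stays busy,
        // so the thread pool under test is actually exercised.
        Blackhole.consumeCPU(1_000);
    }
}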

Plots in JMH [Java Micro-Benchmarking Harness]

JMH does not support plotting. You can write the performance results out to a file (e.g. with -rf csv or -rf json) and use whatever plotting tool you are familiar with. Or you can extract the performance data from the RunResult instances you get from the Java API and parse/render it with any embedded library.
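For example, here is a sketch of both routes, assuming a benchmark class named MyBench; the CSV file can then be fed to gnuplot, R, a spreadsheet, and so on:

import java.util.Collection;

import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.results.format.ResultFormatType;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class PlotExport {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include("MyBench")                        // benchmark class to run (placeholder)
                .resultFormat(ResultFormatType.CSV)        // same effect as -rf csv
                .result("jmh-result.csv")                  // file to plot from
                .build();
        Collection<RunResult> results = new Runner(opt).run();
        for (RunResult r : results) {
            // Or pull the numbers straight from the API and hand them to a plotting library.
            System.out.println(r.getParams().getBenchmark() + " -> "
                    + r.getPrimaryResult().getScore() + " "
                    + r.getPrimaryResult().getScoreUnit());
        }
    }
}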

If you use settings that observe the execution time of single calls (or batches of calls) and you want to see a plot of the individual durations (something like the last plot here), you can combine these settings:

@Measurement(batchSize = 1000, iterations = 500)
@BenchmarkMode({Mode.SingleShotTime})

And a bit of scripting to get the desired data, since with settings like these the resulting CSV from JMH contains only summary data:

mvn package && java -jar target/benchmarks.jar -foe true -rf csv | tee output.txt
N=5 # set here number of benchmarks plus 2
grep Iteration -A 3 output.txt | grep -v Warmup | sed 's/$/,/' | xargs -l$N | sed 's/,$//'

It will output something like:

Iteration 1: 93.915 ±(99.9%) 2066.879 s/op, readerA: 28.487 s/op, readerB: 28.525 s/op, writer: 224.735 s/op, --
Iteration 2: 100.483 ±(99.9%) 1265.993 s/op, readerA: 59.927 s/op, readerB: 60.912 s/op, writer: 180.610 s/op, --
Iteration 3: 76.458 ±(99.9%) 760.395 s/op, readerA: 52.513 s/op, readerB: 52.276 s/op, writer: 124.586 s/op, --
Iteration 4: 84.046 ±(99.9%) 1189.029 s/op, readerA: 46.112 s/op, readerB: 46.724 s/op, writer: 159.303 s/op, --

java micro benchmark to find average from list

Is this what you are looking for?

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@Warmup(iterations = 1, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 1, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
public class MyBenchmark {

    ClassUnderBenchmark classUnderBenchmark = new ClassUnderBenchmark();

    @State(Scope.Benchmark)
    public static class MyTestState {

        int counter = 0;
        List<String> list = Arrays.asList("aaaaa", "bbbb", "ccc");
        String currentString;

        @Setup(Level.Invocation)
        public void init() throws IOException {
            this.currentString = list.get(counter++);
            if (counter == 3) {
                counter = 0;
            }
        }
    }

    @Benchmark
    @Threads(1)
    @BenchmarkMode(Mode.SampleTime)
    public void test(MyBenchmark.MyTestState myTestState) {
        classUnderBenchmark.toUpper(myTestState.currentString);
    }

    public static class ClassUnderBenchmark {

        Random r = new Random();

        public String toUpper(String name) {
            try {
                Thread.sleep(r.nextInt(100));
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            return name.toUpperCase();
        }
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName())
                .jvmArgs("-XX:+UseG1GC", "-XX:MaxGCPauseMillis=50")
                .build();
        new Runner(opt).run();
    }
}

Please see the javadoc (org.openjdk.jmh.annotations.Mode):

/**
 * <p>Sample time: samples the time for each operation.</p>
 *
 * <p>Runs by continuously calling {@link Benchmark} methods,
 * and randomly samples the time needed for the call. This mode automatically adjusts the sampling
 * frequency, but may omit some pauses which missed the sampling measurement. This mode is time-based, and it will
 * run until the iteration time expires.</p>
 */
SampleTime("sample", "Sampling time"),

This test will give you the output:

Result "test":

N = 91
mean = 0,056 ±(99.9%) 0,010 s/op

Histogram, s/op:
[0,000, 0,010) = 6
[0,010, 0,020) = 9
[0,020, 0,030) = 3
[0,030, 0,040) = 11
[0,040, 0,050) = 8
[0,050, 0,060) = 11
[0,060, 0,070) = 9
[0,070, 0,080) = 9
[0,080, 0,090) = 14

Percentiles, s/op:
p(0,0000) = 0,003 s/op
p(50,0000) = 0,059 s/op
p(90,0000) = 0,092 s/op
p(95,0000) = 0,095 s/op
p(99,0000) = 0,100 s/op
p(99,9000) = 0,100 s/op
p(99,9900) = 0,100 s/op
p(99,9990) = 0,100 s/op
p(99,9999) = 0,100 s/op
p(100,0000) = 0,100 s/op


Benchmark          Mode  Cnt  Score   Error  Units
MyBenchmark.test  sample   91  0,056 ± 0,010   s/op

How to benchmark '&' vs '%' cost correctly, using JMH

This is covered in detail in this blog post: http://psy-lob-saw.blogspot.co.za/2014/11/the-mythical-modulo-mask.html

Your benchmark is broken (it compares what seem like unrelated quantities) because you are comparing (non_final_field & constant) with (constant % non_final_field). Replace them with (non_final_field1 % non_final_field2) and (non_final_field1 & (non_final_field2 - 1)), where non_final_field2 is a power of two.
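A minimal sketch of the corrected shape (field names and values are placeholders): both operands are non-final instance fields, so the JIT cannot fold either expression into a constant.

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class ModVsMaskSketch {

    int value = 12345;      // non_final_field1
    int powerOfTwo = 1024;  // non_final_field2, must be a power of two

    @Benchmark
    public int modulo() {
        return value % powerOfTwo;          // (non_final_field1 % non_final_field2)
    }

    @Benchmark
    public int mask() {
        return value & (powerOfTwo - 1);    // (non_final_field1 & (non_final_field2 - 1))
    }
}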

In the context of HashMap, the value is used to read from an array, and the blog post covers the implications of that side as well.
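A tiny check of the underlying identity (values arbitrary): for a power-of-two table length and a non-negative hash, the remainder and the mask pick the same slot, which is why the cheaper masked form can index the bucket array directly.

public class MaskEqualsModDemo {
    public static void main(String[] args) {
        int length = 16;                              // power-of-two table size (placeholder)
        int hash = "someKey".hashCode() & 0x7fffffff; // force a non-negative hash
        int byMod  = hash % length;
        int byMask = hash & (length - 1);
        System.out.println(byMod == byMask);          // prints true
    }
}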

What is microbenchmarking?

It means exactly what it says on the tin: measuring the performance of something "small", such as a system call into the kernel of an operating system.

The danger is that people may use whatever results they obtain from microbenchmarking to dictate optimizations. And as we all know:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." -- Donald Knuth

There can be many factors that skew the results of microbenchmarks. Compiler optimizations are one of them. If the operation being measured takes so little time that whatever you use to measure it takes longer than the operation itself, your microbenchmarks will be skewed as well.

For example, someone might take a microbenchmark of the overhead of for loops:

void TestForLoop()
{
    time start = GetTime();

    for(int i = 0; i < 1000000000; ++i)
    {
    }

    time elapsed = GetTime() - start;
    time elapsedPerIteration = elapsed / 1000000000;
    printf("Time elapsed for each iteration: %d\n", elapsedPerIteration);
}

Obviously compilers can see that the loop does absolutely nothing and not generate any code for the loop at all. So the values of elapsed and elapsedPerIteration are pretty much useless.

Even if the loop does something:

void TestForLoop()
{
    int sum = 0;
    time start = GetTime();

    for(int i = 0; i < 1000000000; ++i)
    {
        ++sum;
    }

    time elapsed = GetTime() - start;
    time elapsedPerIteration = elapsed / 1000000000;
    printf("Time elapsed for each iteration: %d\n", elapsedPerIteration);
}

The compiler may see that the variable sum isn't going to be used for anything and optimize it away, and optimize away the for loop as well. But wait! What if we do this:

void TestForLoop()
{
    int sum = 0;
    time start = GetTime();

    for(int i = 0; i < 1000000000; ++i)
    {
        ++sum;
    }

    time elapsed = GetTime() - start;
    time elapsedPerIteration = elapsed / 1000000000;
    printf("Time elapsed for each iteration: %d\n", elapsedPerIteration);
    printf("Sum: %d\n", sum); // Added
}

The compiler might be smart enough to realize that sum will always be a constant value, and optimize all that away as well. Many would be surprised at the optimizing capabilities of compilers these days.

But what about things that compilers can't optimize away?

void TestFileOpenPerformance()
{
    FILE* file = NULL;
    time start = GetTime();

    for(int i = 0; i < 1000000000; ++i)
    {
        file = fopen("testfile.dat", "r");
        fclose(file);
    }

    time elapsed = GetTime() - start;
    time elapsedPerIteration = elapsed / 1000000000;
    printf("Time elapsed for each file open: %d\n", elapsedPerIteration);
}

Even this is not a useful test! The operating system may see that the file is being opened very frequently, so it may preload it in memory to improve performance. Pretty much all operating systems do this. The same thing happens when you open applications: operating systems may figure out the top ~5 applications you open the most and preload their code into memory when you boot the computer!

In fact, there are countless variables that come into play: locality of reference (e.g. arrays vs. linked lists), effects of caches and memory bandwidth, compiler inlining, compiler implementation, compiler switches, number of processor cores, optimizations at the processor level, operating system schedulers, operating system background processes, etc.

So microbenchmarking isn't exactly a useful metric in a lot of cases. It definitely does not replace whole-program benchmarks with well-defined test cases (profiling). Write readable code first, then profile to see what needs to be done, if any.

I would like to emphasize that microbenchmarks are not evil per se, but one has to use them carefully (as is true for lots of other things related to computers).

Run Micro-benchmark in application servers [JMH]

The @Benchmark Javadoc says:

 * <p>{@link Benchmark} demarcates the benchmark payload, and JMH treats it specifically
 * as the wrapper which contains the benchmark code. In order to run the benchmark reliably,
 * JMH enforces a few stringent properties for these wrapper methods, including, but not
 * limited to:</p>

@Benchmark is the annotation demarcating the piece of code JMH should treat as the benchmark body. It is not a magic annotation that measures any given method elsewhere in the program. And you are not supposed to call @Benchmark methods on your own.

What you want is not a benchmark; it's a tracing/monitoring/profiling solution that can instrument and account for all the stages a request goes through. JMH is not such a solution; you should look elsewhere.


