Multi-Threading Benchmarking Issues

Benchmark problems for testing concurrency

Since the benchmarks game moved to a quad-core machine in September 2008, many programs in different programming languages have been rewritten to exploit all four cores; the first 10 mandelbrot programs, for example.

Benchmarking in a multi-threaded environment

I've created a simple JMH benchmark to test the various cases:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@Fork(1)
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 10)
@Warmup(iterations = 10)
@BenchmarkMode(Mode.AverageTime)
public class HashCodeBenchmark {

    private final Object object = new Object();

    @Benchmark
    @Threads(1)
    public void singleThread(Blackhole blackhole) {
        blackhole.consume(object.hashCode());
    }

    @Benchmark
    @Threads(2)
    public void twoThreads(Blackhole blackhole) {
        blackhole.consume(object.hashCode());
    }

    @Benchmark
    @Threads(4)
    public void fourThreads(Blackhole blackhole) {
        blackhole.consume(object.hashCode());
    }

    @Benchmark
    @Threads(8)
    public void eightThreads(Blackhole blackhole) {
        blackhole.consume(object.hashCode());
    }
}

And the results are as follows:

Benchmark                        Mode  Cnt  Score   Error  Units
HashCodeBenchmark.eightThreads   avgt   10  5.710 ± 0.087  ns/op
HashCodeBenchmark.fourThreads    avgt   10  3.603 ± 0.169  ns/op
HashCodeBenchmark.singleThread   avgt   10  3.063 ± 0.011  ns/op
HashCodeBenchmark.twoThreads     avgt   10  3.067 ± 0.034  ns/op

So we can see that as long as there are no more threads than cores, the time per hashCode() call remains the same.

PS: As @Tom Cools commented, you are measuring allocation speed, not hashCode() speed, in your test.

Multi-threading benchmark

Consider that your program will finish when the last thread has finished checking its range of numbers. Perhaps some threads are faster than others?

How long does is_prime() take to reject an even number? It will find a factor on the first iteration. Determining the primality of an odd number takes at least two iterations, and up to sqrt(a) iterations if a is prime. is_prime() is therefore very much slower when given a large prime than an even number!
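The question's is_prime() isn't shown here, but the cost behaviour described above matches plain trial division. A hedged sketch of that shape (the class name is illustrative):

```java
public class NaivePrime {
    // Assumed shape of the question's is_prime(): trial division by
    // every candidate from 2 up to sqrt(a).
    static boolean isPrime(long a) {
        if (a < 2) return false;
        for (long d = 2; d * d <= a; d++) {
            // An even a fails on the very first iteration (d == 2);
            // a large prime survives every iteration up to sqrt(a).
            if (a % d == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrime(100000002L)); // false: rejected immediately
        System.out.println(isPrime(2147483647L)); // true: ~46000 iterations for 2^31 - 1
    }
}
```

The work per call thus varies by four orders of magnitude depending on the input, which is exactly why equal-sized ranges do not mean equal run times.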

In your two thread case, one thread will check the primality of 100000000, 100000002, 100000004, etc. while the other thread will check 100000001, 100000003, 100000005, etc. One thread checks all the even numbers while the other checks all the odd numbers (including all those slow primes!).

Have your threads print ("Thread at %ld done", l1) when they finish, and I think you will find that some threads are very much faster than others, due to the way you are dividing the domain between them. An even number of threads gives all the even values to the same thread(s), resulting in a particularly poor partitioning, which is why your even thread counts are slower than the odd ones.
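The interleaved (stride) decomposition makes the parity lock-in easy to see: thread t checks start + t, start + t + T, start + t + 2T, and so on, so with an even thread count T each thread only ever sees numbers of one parity. A small illustrative sketch (the start value mirrors the example above; the question's actual code isn't shown):

```java
public class StridePartition {
    public static void main(String[] args) {
        long start = 100000000L;
        int numThreads = 2; // an even thread count

        for (int t = 0; t < numThreads; t++) {
            System.out.printf("thread %d checks:", t);
            // Thread t takes start + t, then strides by numThreads;
            // with an even stride, the parity of n never changes.
            for (long n = start + t; n < start + 8; n += numThreads) {
                System.out.print(" " + n);
            }
            System.out.println();
        }
    }
}
```

With numThreads = 2, thread 0 gets only even numbers and thread 1 gets every odd number, including all the slow primes.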

This would make a great XKCD-esque comic. "We need to check all these numbers to find primes! By hand!" "Ok, I'll check the evens, you do the odds."

Your real problem here is that a fixed domain decomposition like the one you have used requires each partition to take the same amount of time to be optimal.

The way to solve this is to dynamically do the partitioning. A pattern commonly used involves a pool of worker threads that request work in chunks. If the chunk is small compared to the total work a thread will do, then all threads will finish their work in a similar amount of time.

For your problem, you could have mutex-protected global data: start_number, stop_number, and total_twins. Each thread saves start_number before incrementing the global value by chunk_size, then searches the range [saved_start_number, saved_start_number + chunk_size), adding the number of twins found to the global total_twins when done. The worker threads keep doing this until start_number >= stop_number, using the mutex to protect every access to the globals.

The chunk size has to be tuned: too small, and the cost of claiming a chunk plus contention on the mutex dominates; too large, and idle workers sit with no chunks left to claim while another thread is still working on its last one. With an atomic increment to claim chunks, the chunk size could perhaps be as small as a single value, but if claiming a chunk required a network request in a distributed computing system, it would need to be much larger. That's the overview of how it works.
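A minimal Java sketch of this scheme, using an AtomicLong in place of the mutex-protected start_number and a second one for total_twins (the range, chunk size, and twin-counting loop are illustrative, not the question's code):

```java
import java.util.concurrent.atomic.AtomicLong;

public class ChunkedTwinCount {
    static final long STOP = 100000L;
    static final long CHUNK = 1000L; // tune: too small = contention, too large = idle tails
    static final AtomicLong nextStart = new AtomicLong(3);
    static final AtomicLong totalTwins = new AtomicLong(0);

    static boolean isPrime(long a) {
        if (a < 2) return false;
        for (long d = 2; d * d <= a; d++)
            if (a % d == 0) return false;
        return true;
    }

    static void worker() {
        long start;
        // Claim chunks until the range is exhausted; the atomic getAndAdd
        // plays the role of "save start_number, then bump it by chunk_size".
        while ((start = nextStart.getAndAdd(CHUNK)) < STOP) {
            long end = Math.min(start + CHUNK, STOP);
            long twins = 0;
            for (long n = start; n < end; n++)
                if (isPrime(n) && isPrime(n + 2)) twins++;
            totalTwins.addAndGet(twins); // one global update per chunk, not per number
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++)
            (threads[i] = new Thread(ChunkedTwinCount::worker)).start();
        for (Thread t : threads) t.join();
        System.out.println("twin pairs: " + totalTwins.get());
    }
}
```

Because a chunk of 1000 numbers mixes cheap evens with expensive primes, every worker's chunks cost roughly the same, and all threads finish within one chunk of each other.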

BTW, your is_prime() test is naive and exceedingly slow. If a number is not divisible by 2, can it be divisible by 4? One can do much better!
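One easy improvement along those lines is to test 2 once and then try only odd divisors, halving the work while staying O(sqrt(a)). A sketch (the class name is illustrative):

```java
public class FasterPrime {
    // Trial division that skips even divisors: if a is not divisible
    // by 2, it cannot be divisible by 4, 6, 8, ... either.
    static boolean isPrime(long a) {
        if (a < 2) return false;
        if (a % 2 == 0) return a == 2;
        for (long d = 3; d * d <= a; d += 2)
            if (a % d == 0) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrime(2));          // true
        System.out.println(isPrime(100000002L)); // false
        System.out.println(isPrime(99999989L));  // true
    }
}
```

Going further, a wheel over 2 and 3 (divisors of the form 6k ± 1) cuts the work by another third, and a sieve amortizes the cost across the whole range.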

JMH - How to correctly benchmark Thread Pools?

I solved this issue myself, with the help of the other answerers. In the last edit (and in all the other edits) the problem was in my Gradle configuration. I use this Gradle plugin to run JMH, and in my buildscript I had set threads = 4 for all of my benchmarks, so JMH was benchmarking on all available threads while the thread pool was also doing work on all available threads; that is why you saw these strange results. I removed that configuration, added the @State(Scope.Thread) and @Threads(1) annotations to the benchmark class, and edited the runInThreadPool() method a bit:

public static void runInThreadPool(int amountOfTasks, Blackhole bh, ExecutorService threadPool)
        throws InterruptedException, ExecutionException {
    Future<?>[] futures = new Future[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
        futures[i] = threadPool.submit(PrioritySchedulerSamples::doWork);
    }
    // Wait for every task to complete, and feed the results to the
    // Blackhole so the JIT cannot optimize the work away.
    for (Future<?> future : futures) {
        bh.consume(future.get());
    }

    threadPool.shutdownNow();
    threadPool.awaitTermination(10, TimeUnit.SECONDS);
}

Note that in the original version the maximum-priority ThreadFactory was passed as the second argument of submit(), where it only serves as the Future's result and never creates a single thread. To make each pool thread actually run at maximal priority, the factory has to be supplied when the pool itself is constructed.
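Setting the priority is typically done through a ThreadFactory given to the pool's constructor, so it is applied once, when each worker thread is created. A minimal sketch, assuming a fixed pool of four threads (the class name and printed task are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

public class MaxPriorityPool {
    public static void main(String[] args) throws InterruptedException {
        // Every thread the pool creates goes through this factory,
        // so the priority is set at construction time.
        ThreadFactory maxPriorityFactory = runnable -> {
            Thread thread = new Thread(runnable);
            thread.setPriority(Thread.MAX_PRIORITY);
            return thread;
        };
        ExecutorService pool = Executors.newFixedThreadPool(4, maxPriorityFactory);

        pool.submit(() -> System.out.println(
                "priority = " + Thread.currentThread().getPriority()));

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```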
And the benchmark results after all these changes:

Benchmark                                 (amountOfTasks)  Mode  Cnt         Score         Error  Units
PrioritySchedulerSamples.fixedThreadPool             2048  avgt    3   8021054.516 ± 2874987.327  ns/op
PrioritySchedulerSamples.noThreading                 2048  avgt    3  17583295.617 ± 5499026.016  ns/op

These results seem to be correct (especially for my system).

I also made a list of common problems in microbenchmarking thread pools, and basically all of the concurrent Java components:

  1. Make sure your microbenchmark executes in one thread: use the @Threads(1) and @State(Scope.Thread) annotations. (Use, for example, the htop command to find out how many threads are running and which of them consume the most CPU.)
  2. Make sure you execute the task completely in your microbenchmark, and wait for all threads to finish it. (Maybe your microbenchmark doesn't wait for the tasks to complete?)
  3. Don't use Thread.sleep() to imitate real work; instead, JMH provides the Blackhole.consumeCPU(long tokens) method, which you can freely use as an imitation of work.
  4. Make sure you know the component you are benchmarking. (Obvious, but I didn't know Java thread pools very well before this post.)
  5. Make sure you know the compiler optimization effects described in the JMH samples; basically, know JMH very well.

