How Would You Benchmark the Performance of a Function

How would you benchmark the performance of a function?

One of the best available open-source solutions is Google Benchmark.

You create simple wrappers around the code you want to benchmark and link either statically or dynamically against the benchmark library. It is often useful to keep such micro-benchmarks built alongside your code. For inspiration, see one of the many conference presentations on micro-benchmarking.

#include <benchmark/benchmark.h>

// F and D are the functions under test, declared elsewhere.

static void BM_F(benchmark::State& state) {
    const auto input1 = state.range(0);
    const auto input2 = state.range(1);

    for (auto _ : state)
        F(input1, input2);
}

static void BM_D(benchmark::State& state) {
    const auto input1 = state.range(0);
    const auto input2 = state.range(1);

    for (auto _ : state)
        D(input1, input2);
}

BENCHMARK(BM_F)
    ->Args({1, 10})
    ->Args({10, 100})
    ->Args({100, 1000});

BENCHMARK(BM_D)
    ->Args({1, 10})
    ->Args({10, 100})
    ->Args({100, 1000});

BENCHMARK_MAIN();

If you want to measure raw CPU cycles, your only choice is to use CPU instructions directly. On x86 you can use the Time Stamp Counter (RDTSC).

But be aware that such a measurement will not survive context switches performed by the OS or migration between CPUs. Your only option in such situations is to use an algorithm with a single flow of execution: remember the ID of the CPU and the TSC value before entering the test function, check the CPU ID again after the test function, and only then calculate the difference between the two TSC values. You may also set CPU affinity for your process to pin it to a specific CPU.
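
To illustrate, here is a minimal sketch using the __rdtscp intrinsic from <x86intrin.h> (GCC/Clang on x86); work() is a hypothetical function standing in for the code under test. __rdtscp reads the TSC and also reports which logical CPU executed the instruction, so a migration between the two readings can be detected and the sample discarded.

#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtscp (GCC/Clang, x86 only)

// Hypothetical function under test.
static void work() {
    volatile std::uint64_t acc = 0;
    for (int i = 0; i < 100000; ++i) acc = acc + i;
}

int main() {
    unsigned cpu_before = 0, cpu_after = 0;

    // __rdtscp returns the TSC and stores the contents of IA32_TSC_AUX,
    // which the OS typically sets up to identify the current CPU.
    const std::uint64_t start = __rdtscp(&cpu_before);
    work();
    const std::uint64_t end = __rdtscp(&cpu_after);

    if (cpu_before != cpu_after) {
        std::puts("migrated between CPUs; discard this sample");
    } else {
        std::printf("elapsed: %llu reference cycles\n",
                    static_cast<unsigned long long>(end - start));
    }
}

Note that on modern x86 the TSC ticks at a constant reference frequency rather than the core clock, a point that comes up again in the pitfalls below.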

Another Linux-specific way to benchmark functions is to use the perf tool.

In any case, any measurement method adds some amount of error to the result.

Idiomatic way of performance evaluation?

Generally: for repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; it's easy to distort results unless you understand the implications of doing that. For very short things, throughput and latency are different, so measure both separately, by making each iteration either use the result of the previous one or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program: e.g. loop unrolling and lookup tables often look good because there's no pressure on the I-cache or D-cache from anything else.)

Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
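
As a minimal C++ sketch of that approach (f() is a hypothetical stand-in for the function under test): time each iteration with std::chrono::steady_clock, store the samples in a preallocated vector, and only print after the loop has finished.

#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical function under test.
static long f(long x) { return x * 3 + 1; }

int main() {
    using clock = std::chrono::steady_clock;
    constexpr int iterations = 10000;

    std::vector<clock::duration> samples;
    samples.reserve(iterations);   // no allocation inside the timed loop

    volatile long sink = 0;        // keep the compiler from discarding the work
    for (int i = 0; i < iterations; ++i) {
        const auto start = clock::now();
        sink = f(i);
        samples.push_back(clock::now() - start);
    }

    // Printing happens after the loop, outside any timed region.
    for (int i = 0; i < 5; ++i) {
        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(samples[i]);
        std::printf("iteration %d: %lld ns\n", i, static_cast<long long>(ns.count()));
    }
}

For a function this cheap, the clock overhead dominates each individual sample, which is exactly why the advice above is to time the whole repeat loop instead whenever you can.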

This question is way too broad to say anything more specific.

Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs, and it runs the code for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.



Beware common microbenchmark pitfalls

  • Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)

  • Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses, e.g. for a large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.

  • Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).

    related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.

  • Most integer and FP arithmetic asm instructions (except divide and square root, which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. The exceptions are subnormal aka denormal floating point, which is very slow, and, in some cases (e.g. legacy x87 but not SSE2), producing NaN or Inf, which can also be slow.

  • On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has latency from inputs to outputs, but its throughput when run repeatedly with independent inputs is different and higher. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).

    Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.

  • Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.

  • You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.) See the sketch after this list for a minimal example.

    For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.

  • If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.

  • Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.

Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
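
As a concrete illustration of the optimize-away pitfall from the list above, here is a minimal sketch; sum_array() and the buffer size are made up for the example. The input depends on argc so the compiler can't constant-propagate it, the input changes each iteration so the call can't be hoisted out of the loop, and the result goes into a volatile so the work can't be discarded.

#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical function under test.
static double sum_array(const std::vector<double>& v) {
    double total = 0.0;
    for (double x : v) total += x;
    return total;
}

int main(int argc, char**) {
    // Non-zero initializer so we don't benchmark reads of the shared zero page.
    std::vector<double> data(1 << 20, 1.5);

    volatile double sink = 0.0;            // forces the result to be produced
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; ++i) {
        data[0] = argc + i;                // vary the input so the call can't be hoisted
        sink = sum_array(data);
    }
    const auto stop = std::chrono::steady_clock::now();

    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::printf("total: %lld us (sink=%f)\n",
                static_cast<long long>(us.count()), static_cast<double>(sink));
}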

Litmus tests:

  • If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)

  • Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop so it runs only once instead of N times). See the sketch after this list.

  • Vary other test parameters as a sanity check.
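
A sketch of that iteration-count litmus test (f() and time_calls() are hypothetical names for this example): run the repeat loop at N and at 2N and compare the per-call time; if it changes noticeably, the measurement overhead is significant or the work was optimized away.

#include <chrono>
#include <cstdio>

// Hypothetical function under test.
static long f(long x) { return (x * 2654435761u) % 1000003; }

// Time n calls of f() and return the average nanoseconds per call.
static double time_calls(long n) {
    volatile long sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < n; ++i) sink = f(i);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / n;
}

int main() {
    // Doubling the iteration count should leave time-per-call roughly unchanged.
    std::printf("per call at N:  %.2f ns\n", time_calls(1000000));
    std::printf("per call at 2N: %.2f ns\n", time_calls(2000000));
}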


For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.
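
For the inline-asm variant of that trick (GCC/Clang only), a commonly used pair of helpers looks like the sketch below. The names escape and clobber are conventional, not part of any library; the empty asm statements just tell the optimizer that the pointed-to memory (or all memory) may be observed, so the work feeding it can't be deleted.

#include <vector>

// Pretend anything reachable through p may be read or written,
// so stores to it can't be optimized away (GCC/Clang inline asm).
static void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

// Pretend all memory may have been read or written, forcing pending
// stores to actually happen before this point.
static void clobber() {
    asm volatile("" : : : "memory");
}

int main() {
    std::vector<int> v;
    v.reserve(1024);
    escape(v.data());   // the buffer now counts as observable
    v.push_back(42);    // so this store has to really happen
    clobber();
}

Google Benchmark's benchmark::DoNotOptimize and benchmark::ClobberMemory wrap essentially the same idea behind a portable interface.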

Measuring execution time of a function in C++

A very easy-to-use method, available since C++11, is std::chrono::high_resolution_clock from the <chrono> header; note that the 150ms duration literal in the example below requires C++14.

Use it like so:

#include <chrono>

/* Only needed for the sake of this example. */
#include <iostream>
#include <thread>

void long_operation()
{
    /* Simulating a long, heavy operation. */

    using namespace std::chrono_literals;
    std::this_thread::sleep_for(150ms);
}

int main()
{
    using std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::duration;
    using std::chrono::milliseconds;

    auto t1 = high_resolution_clock::now();
    long_operation();
    auto t2 = high_resolution_clock::now();

    /* Getting number of milliseconds as an integer. */
    auto ms_int = duration_cast<milliseconds>(t2 - t1);

    /* Getting number of milliseconds as a double. */
    duration<double, std::milli> ms_double = t2 - t1;

    std::cout << ms_int.count() << "ms\n";
    std::cout << ms_double.count() << "ms\n";
    return 0;
}

This will measure the duration of the function long_operation.

Possible output:

150ms
150.068ms

Working example: https://godbolt.org/z/oe5cMd

Performance benchmarking of function wrapper over class vs only constructor call invocation and HaveSameMap

In your first snippet, you were not only creating the instance in makePoint but also the class Point itself. Every time you called makePoint(), you made a new class:

console.log(makePoint().constructor === makePoint().constructor) // false

I think you are looking for

class Point {
  constructor(x, y) {
    this.x = x;
    this.y = y;
  }
}
const makePoint = () => new Point(1, 2);
const a = makePoint();
const b = makePoint();

// %HaveSameMap is a V8 runtime function; run with --allow-natives-syntax.
console.log(%HaveSameMap(a, b)); // true
console.log(a.constructor == b.constructor, a.constructor == Point); // true, true

How can I benchmark method execution time in Java?

Other than using a profiler, a simple way of getting what you want is the following:

public class SomeClass {
    public void somePublicMethod()
    {
        // For short intervals, System.nanoTime() gives better resolution
        // than System.currentTimeMillis().
        long startTime = System.currentTimeMillis();
        someMethodWhichYouWantToProfile();
        long endTime = System.currentTimeMillis();
        System.out.println("Total execution time: " + (endTime - startTime) + "ms");
    }
}

