Benchmarks for C compiler optimization
SPECint (https://en.wikipedia.org/wiki/SPECint) is mostly written in C, and is the industry-standard benchmark for real hardware, for theoretical computer-architecture research (e.g. evaluating a larger ROB or a cache difference in a simulated CPU), and for compiler developers testing proposed patches that change code-gen.
The C parts of SPECfp (https://en.wikipedia.org/wiki/SPECfp) are also good choices. For a compiler back-end optimizer, the choice of front-end language isn't very significant, so the Fortran programs are fine too.
Related: "Tricks of a Spec master" is a paper that covers the different benchmarks, possibly originally from a conference.
In this lightning round talk, I will cover at a high level the performance characteristics of these benchmarks in terms of optimizations that GCC does. For example, some benchmarks are classic floating point applications and benefit from SIMD (single instruction, multiple data) instructions, while other benchmarks don't.
Wikipedia is out of date. SPECint/fp 2017 was a long time coming, but it was released in 2017 and is a significant improvement over 2006, where some benchmarks were trivialized by clever compiler optimizations like loop inversion. (Some compilers over the years have added what is basically pattern recognition to optimize the hot loop in libquantum, but they can't always do that in general for other loops even when it would be safe. Apparently it can also be easily auto-parallelized.)
For testing a compiler, you might actually want code that aggressive optimization can find major simplifications in, so SPECcpu 2006 is a good choice. Just be aware of the issues with libquantum.
https://www.anandtech.com/show/10353/investigating-cavium-thunderx-48-arm-cores/12 describes gcc as a compiler that "does not try to "break" benchmarks (libquantum...)". But compilers like ICC and SunCC, which CPU vendors use or have used for SPEC submissions for their own hardware (Intel x86, and Sun UltraSPARC and later x86), are as aggressive as possible on SPEC benchmarks.
SPEC result submissions are required to include compiler version and options used (and OS tuning options), so you can hopefully replicate them.
Using volatile to prevent compiler optimization in benchmarking code?
The C++03 standard says that reads and writes to volatile data are observable behavior (C++ 2003, 1.9 [intro.execution] / 6). I believe this guarantees that assignments to volatile data cannot be optimized away. Another kind of observable behavior is calls to I/O functions.
The C++11 standard is even more explicit in this regard: in 1.9/8 it says that
The least requirements on a conforming implementation are:
— Access to volatile objects are evaluated strictly according to the rules of the abstract machine.
If a compiler can prove that code does not produce any observable behavior, then it can optimize the code away. In your update (where volatile is not used), the copy constructor and other function calls & overloaded operators might avoid any I/O calls and accesses to volatile data, and the compiler may well see this. However, if gNumCopies is a global variable that is later used in an expression with observable behavior (e.g. printed), then this code will not be removed.
Do Go testing.B benchmarks prevent unwanted optimizations?
Converting my comment to an answer.
To be completely accurate, any benchmark should be careful to avoid
compiler optimisations eliminating the function under test and
artificially lowering the run time of the benchmark.
var result int

func BenchmarkFibComplete(b *testing.B) {
    var r int
    for n := 0; n < b.N; n++ {
        // always record the result of Fib to prevent
        // the compiler eliminating the function call.
        r = Fib(10)
    }
    // always store the result to a package level variable
    // so the compiler cannot eliminate the Benchmark itself.
    result = r
}
Source
The following page can also be useful.
Compiler And Runtime Optimizations
Another interesting read is
One other interesting flag is -N, which will disable the optimisation
pass in the compiler.
Source1 Source2
I'm not 100% sure, but the following should disable optimisations; someone with more experience needs to confirm it:
go test -gcflags=-N -bench=.
Possible unwanted compiler optimization on benchmark test
Of course the just-in-time compiler will optimize your code (assuming a reasonably high iteration count), just as it would in an actual program running that code. Optimization, by itself, is therefore desirable in a benchmark. You'll have a problem, though, if the artificial nature of your code permits optimizations not available to the real code. In your case, the compiler may conclude that notNullInline will never throw and therefore has no effect, and choose to remove the entire loop.
Writing a correct benchmark is already discussed at How do I write a correct micro-benchmark in Java?
Measure time for function-runtime in tests with compiler optimization in c++
To prevent the compiler from optimizing away function calls, just make the input and output of that function volatile variables.
The result is then guaranteed to be computed and stored into the volatile output variable on each loop iteration.
The volatile input prevents the optimizer from precomputing the value of your function in advance; if you don't mark the input as volatile, then the compiler may just write a constant to the output variable on each loop iteration.
Click the Try it online! link below to see the program in action, along with an assembly listing.
Your code example with improvements is below:
Try it online!
#include <cmath>
#include <iostream>
#include <chrono>

int function(int x) {
    return int(std::log2(x));
}

bool testFunctionSpeed() {
    size_t const num_iterations = 1 << 20;
    auto volatile input = 123;
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_iterations; ++i) {
        auto volatile result = function(input);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(
        end - start).count() / num_iterations;
    std::cout << duration << " ns per iteration" << std::endl;
    return true;
}

int main() {
    testFunctionSpeed();
}
Output:
8 ns per iteration
I don't understand the definition of DoNotOptimizeAway
You haven't included the definition, just the documentation. I think you're asking for help understanding why it even exists, rather than the definition.
It stops compilers from CSEing and hoisting work out of repeat-loops, so you can repeat the same work enough times to be measurable, e.g. put something short in a loop that runs 1 billion times, and then you can measure the time for the whole loop easily (a second or so). See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of doing this by hand in asm. If you want compiler-generated code like that, you need a function / macro like DoNotOptimizeAway.
Compiling the whole program with optimization disabled would be useless: storing/reloading everything between C++ statements gives very different bottlenecks (usually store-forwarding latency). See Adding a redundant assignment speeds up code when compiled without optimization.
See also Idiomatic way of performance evaluation? for general microbenchmarking pitfalls.
Perhaps looking at the actual definition can also help.
This Q&A (Optimization barrier for microbenchmarks in MSVC: tell the optimizer you clobber memory?) describes how one implementation of a DoNotOptimize macro works (and asks how to port it from GNU C++ to MSVC).
The escape macro is from Chandler Carruth's CppCon 2015 talk, "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!". That talk also goes into detail about exactly why it's needed when writing microbenchmarks: to stop whole loops from optimizing away when you compile with optimization enabled.
(Having the compiler hoist things out of loops instead of computing them repeatedly is harder to prevent if it's a problem. Making a function __attribute__((noinline)) can help if it's big enough that it didn't need to inline. Check the compiler's asm output to see how much setup it hoisted.)
And BTW, a good definition for GNU C / C++ normally has zero extra cost: asm volatile("" :: "r"(my_var)); compiles to zero asm instructions, but requires the compiler to have the value of my_var in a register of its choice. (And because of asm volatile, it has to "run" that many times in the C++ abstract machine.)
This will only impact optimization if the compiler could have transformed the calculation it was part of into something else (e.g. using this on a loop counter would stop the compiler from using just pointer increments and a compare against an end pointer to do the right number of iterations of for(i=0;i<n;i++) sum+=a[i];).
Using a read-modify-write operand like asm volatile("" : "+r"(my_var)); would force the compiler to forget all range-restriction or constant-propagation info it knows about the value (e.g. that it's 42, or that it's non-negative) and treat it like an incoming function arg. This could impact optimization more.
When they say the "overhead is cancelled out in comparisons", they're hopefully not talking about explicitly subtracting anything from a single timing result, and not talking about benchmarking DoNotOptimizeAway on its own.
That wouldn't work. Performance analysis for modern CPUs does not work by adding up the costs of each instruction. Out-of-order pipelined execution means that an extra asm instruction can easily have zero extra cost if the front-end (total instruction throughput) wasn't the bottleneck, and if the execution unit it needs wasn't either.
If their portable definition is something like volatile T sink = input;, the extra asm store would only have a cost if your code bottlenecked on store throughput to cache.
So that claim about cancelling out sounds a bit optimistic; as explained above, whether a DoNotOptimizeAway costs anything at all depends on the surrounding context and on which optimizations it happens to block.
Related Q&As about the same functions:
- Preventing compiler optimizations while benchmarking
- Avoid optimizing away variable with inline asm
- "Escape" and "Clobber" equivalent in MSVC
How to prevent GCC from optimizing out a busy wait loop?
I developed this answer after following a link from dmckee's answer, but it takes a different approach than his/her answer.
Function Attributes documentation from GCC mentions:
noinline
This function attribute prevents a function from being considered for inlining. If the function does not have side effects, there are optimizations other than inlining that cause function calls to be optimized away, although the function call is live. To keep such calls from being optimized away, put asm ("");
This gave me an interesting idea... Instead of adding a nop instruction in the inner loop, I tried adding an empty assembly statement in there, like this:
unsigned char i, j;

j = 0;
while (--j) {
    i = 0;
    while (--i)
        asm("");
}
And it worked! That loop has not been optimized out, and no extra nop instructions were inserted.
What's more, if you use volatile, gcc will store those variables in RAM and add a bunch of ldd and std instructions to copy them to temporary registers. This approach, on the other hand, doesn't use volatile and generates no such overhead.
Update: If you are compiling code using -ansi or -std, you must replace the asm keyword with __asm__, as described in the GCC documentation.
In addition, you can also use __asm__ __volatile__("") if your assembly statement must execute where we put it (i.e. must not be moved out of a loop as an optimization).
C# compiler optimizations for benchmarking purposes
The compiler (JIT) may optimize a whole function call out if it finds that it has no side effects. It likely needs to be able to inline the function to detect that.
I tried a small function that only acts on its input arguments and saw that it is optimized out by checking the resulting assembly (make sure to try a Release build with "Suppress optimization on module load" unchecked).
...
for (int i = 0; i < 1000; i++)
{
    int res = (int) Func(i);
}
...

static int Func(int arg1)
{
    return arg1 * arg1;
}
Disassembly:
for (int i = 0; i < 1000; i++)
00000016 xor eax,eax
00000018 inc eax
00000019 cmp eax,3E8h
0000001e jl 00000018
}