Benchmarks for C compiler optimization
SPECint (https://en.wikipedia.org/wiki/SPECint) is mostly written in C, and is the industry-standard benchmark for real hardware, for theoretical computer-architecture research (e.g. evaluating a larger ROB or a cache difference in a simulated CPU), and for compiler developers testing proposed patches that change code-gen.
The C parts of SPECfp (https://en.wikipedia.org/wiki/SPECfp) are also good choices. For a compiler back-end optimizer, the choice of front-end language isn't very significant, so the Fortran programs are fine too.
Related: "Tricks of a Spec master" is a paper that covers the different benchmarks, possibly originally from a conference.
In this lightning round talk, I will cover at a high level the performance characteristics of these benchmarks in terms of optimizations that GCC does. For example, some benchmarks are classic floating point applications and benefit from SIMD (single instruction, multiple data) instructions, while other benchmarks don't.
Wikipedia is out of date. SPECint/fp 2017 was a long time coming, but it was released in 2017 and is a significant improvement over 2006, where some benchmarks were trivialized by clever compiler optimizations like loop inversion. (Some compilers over the years have added what is basically pattern recognition to optimize the hot loop in libquantum, but they can't always do that in general for other loops even when it would be safe. Apparently it can also be easily auto-parallelized.)
For testing a compiler, you might actually want code that aggressive optimization can find major simplifications in, so SPECcpu 2006 is a good choice. Just be aware of the issues with libquantum.
https://www.anandtech.com/show/10353/investigating-cavium-thunderx-48-arm-cores/12 describes gcc as a compiler that "does not try to "break" benchmarks (libquantum...)". But compilers like ICC and SunCC, which CPU vendors use or have used for SPEC submissions for their own hardware (Intel x86, and Sun UltraSPARC and later x86), are as aggressive as possible on SPEC benchmarks.
SPEC result submissions are required to include compiler version and options used (and OS tuning options), so you can hopefully replicate them.
Using volatile to prevent compiler optimization in benchmarking code?
The C++03 standard says that reads and writes to volatile data are observable behavior (C++ 2003, 1.9 [intro.execution] / 6). I believe this guarantees that assignments to volatile data cannot be optimized away. Another kind of observable behavior is calls to I/O functions.
The C++11 standard is even more explicit in this regard: in 1.9/8 it says that
The least requirements on a conforming implementation are:
— Access to volatile objects are evaluated strictly according to the rules of the abstract machine.
If a compiler can prove that code does not produce any observable behavior, then it can optimize the code away. In your update (where volatile is not used), the copy constructor and other function calls & overloaded operators might avoid any I/O calls and accesses to volatile data, and the compiler may well see this. However, if gNumCopies is a global variable that is later used in an expression with observable behavior (e.g. printed), then this code will not be removed.
Do Go testing.B benchmarks prevent unwanted optimizations?
Converting my comment to an answer.
To be completely accurate, any benchmark should be careful to avoid
compiler optimisations eliminating the function under test and
artificially lowering the run time of the benchmark.
var result int

func BenchmarkFibComplete(b *testing.B) {
    var r int
    for n := 0; n < b.N; n++ {
        // always record the result of Fib to prevent
        // the compiler eliminating the function call.
        r = Fib(10)
    }
    // always store the result to a package level variable
    // so the compiler cannot eliminate the Benchmark itself.
    result = r
}
Source
The following page can also be useful.
Compiler And Runtime Optimizations
Another interesting read is
One other interesting flag is -N, which will disable the optimisation
pass in the compiler.
Source1 Source2
I'm not 100% sure, but the following should disable optimisations; someone with more experience needs to confirm it:
go test -gcflags=-N -bench=.
Possible unwanted compiler optimization on benchmark test
Of course the just-in-time compiler will optimize your code (assuming a reasonably high iteration count), just as it would in an actual program running that code. Optimization, by itself, is therefore desirable in a benchmark. You'll have a problem, though, if the artificial nature of your code permits optimizations not available to the real code. In your case, the compiler may conclude that notNullInline will never throw and therefore has no effect, and choose to remove the entire loop.
Writing a correct benchmark is already discussed at How do I write a correct micro-benchmark in Java?
Measure time for function-runtime in tests with compiler optimization in c++
To prevent the compiler from optimizing away function calls, just make the input and output of that function volatile variables.
The result is then guaranteed to be computed and stored into the volatile output variable on each loop iteration.
The volatile input prevents the optimizer from precomputing the value of your function in advance; if you don't mark the input as volatile, then the compiler may just write a constant to the output variable on each loop iteration.
Click the Try it online! link below to see the program in action, along with an assembly listing.
Your code example with improvements is below:
Try it online!
#include <cmath>
#include <iostream>
#include <chrono>

int function(int x) {
    return int(std::log2(x));
}

bool testFunctionSpeed() {
    size_t const num_iterations = 1 << 20;
    auto volatile input = 123;
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_iterations; ++i) {
        auto volatile result = function(input);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(
        end - start).count() / num_iterations;
    std::cout << duration << " ns per iteration" << std::endl;
    return true;
}

int main() {
    testFunctionSpeed();
}
Output:
8 ns per iteration
I don't understand the definition of DoNotOptimizeAway
You haven't included the definition, just the documentation. I think you're asking for help understanding why it even exists, rather than the definition.
It stops compilers from CSEing and hoisting work out of repeat-loops, so you can repeat the same work enough times to be measurable, e.g. put something short in a loop that runs 1 billion times, and then you can measure the time for the whole loop easily (a second or so). See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of doing this by hand in asm. If you want compiler-generated code like that, you need a function / macro like DoNotOptimizeAway.
Compiling the whole program with optimization disabled would be useless: storing/reloading everything between C++ statements gives very different bottlenecks (usually store-forwarding latency). See Adding a redundant assignment speeds up code when compiled without optimization.
See also Idiomatic way of performance evaluation? for general microbenchmarking pitfalls.
Perhaps looking at the actual definition can also help.
This Q&A (Optimization barrier for microbenchmarks in MSVC: tell the optimizer you clobber memory?) describes how one implementation of a DoNotOptimize macro works (and asks how to port it from GNU C++ to MSVC).
The escape macro is from Chandler Carruth's CppCon 2015 talk, "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!". That talk also goes into detail about exactly why it's needed when writing microbenchmarks: to stop whole loops from optimizing away when you compile with optimization enabled.
(Having the compiler hoist things out of loops instead of computing them repeatedly is harder to prevent if it's a problem. Making a function __attribute__((noinline)) can help if it's big enough that it didn't need to inline. Check the compiler's asm output to see how much setup it hoisted.)
And BTW, a good definition for GNU C / C++ normally has zero extra cost: asm volatile("" :: "r"(my_var)); compiles to zero asm instructions, but requires the compiler to have the value of my_var in a register of its choice. (And because of asm volatile, it has to "run" that many times in the C++ abstract machine.)
This will only impact optimization if the compiler could have transformed the calculation it was part of into something else (e.g. using this on a loop counter would stop the compiler from using just pointer increments and a compare against an end pointer to do the right number of iterations of for(i=0;i<n;i++) sum+=a[i];).
Using a read-modify-write operand like asm volatile("" : "+r"(my_var)); would force the compiler to forget all range-restriction or constant-propagation info it knows about the value (e.g. that it's 42, or that it's non-negative) and treat it like an incoming function arg. This could impact optimization more.
When they say the "overhead is cancelled out in comparisons", they're hopefully not talking about explicitly subtracting anything from a single timing result, and not talking about benchmarking DoNotOptimizeAway on its own.
That wouldn't work. Performance analysis for modern CPUs does not work by adding up the costs of each instruction. Out-of-order pipelined execution means that an extra asm instruction can easily have zero extra cost if the front-end (total instruction throughput) wasn't the bottleneck, and if the execution unit it needs wasn't either.
If their portable definition is something like volatile T sink = input;, the extra asm store would only have a cost if your code bottlenecked on store throughput to cache.
So that claim about cancelling out sounds a bit optimistic; as explained above, whether a DoNotOptimizeAway costs anything at all depends on the surrounding context and on which optimizations it happens to block.
Related Q&As about the same functions:
- Preventing compiler optimizations while benchmarking
- Avoid optimizing away variable with inline asm
- "Escape" and "Clobber" equivalent in MSVC
How to prevent GCC from optimizing out a busy wait loop?
I developed this answer after following a link from dmckee's answer, but it takes a different approach than his/her answer.
Function Attributes documentation from GCC mentions:
noinline
This function attribute prevents a function from being considered for inlining. If the function does not have side effects, there are optimizations other than inlining that cause function calls to be optimized away, although the function call is live. To keep such calls from being optimized away, put asm ("");
This gave me an interesting idea... Instead of adding a nop instruction in the inner loop, I tried adding an empty assembly statement in there, like this:
unsigned char i, j;

j = 0;
while (--j) {
    i = 0;
    while (--i)
        asm("");
}
And it worked! That loop has not been optimized out, and no extra nop instructions were inserted.
What's more, if you use volatile, gcc will store those variables in RAM and add a bunch of ldd and std instructions to copy them to temporary registers. This approach, on the other hand, doesn't use volatile and generates no such overhead.
Update: If you are compiling code using -ansi or -std, you must replace the asm keyword with __asm__, as described in the GCC documentation.
In addition, you can also use __asm__ __volatile__("") if your assembly statement must execute where we put it (i.e. must not be moved out of a loop as an optimization).
C# compiler optimizations for benchmarking purposes
The compiler (JIT) may optimize a whole function call out if it finds that it has no side effects. It likely needs to be able to inline the function to detect that.
I tried a small function that only acts on its input arguments and saw that it is optimized out by checking the resulting assembly (make sure to try a Release build with "Suppress optimization on module load" unchecked).
...
for (int i = 0; i < 1000; i++)
{
    int res = (int) Func(i);
}
...

static int Func(int arg1)
{
    return arg1 * arg1;
}
Disassembly:
for (int i = 0; i < 1000; i++)
00000016 xor eax,eax
00000018 inc eax
00000019 cmp eax,3E8h
0000001e jl 00000018
}