How to use profile guided optimizations in g++?
-fprofile-generate instruments the application with profiling code. While it runs, the application logs events whose frequencies could improve performance if they were known at compile time. Branch outcomes, inlining candidates, and so on can all be logged, but I'm not sure in detail how GCC implements this.
When the program exits, it dumps this data into *.gcda files, which are essentially log data for a test run. After you rebuild the application with the -fprofile-use flag, GCC takes the *.gcda data into account when optimizing, which often improves performance noticeably. Of course, this depends on many factors.
Profile-guided optimization (C)
It works by inserting extra code that counts how many times each code path is taken. When you compile a second time, the compiler uses what it learned about your program's execution, which it could only guess at before. There are a couple of things PGO can work toward:
- Deciding which functions should be inlined or not depending on how often they are called.
- Deciding how to place hints about which branch of an "if" statement should be predicted, based on the percentage of executions going one way or the other.
- Deciding how to optimize loops based on how many iterations are executed each time the loop is entered.
You never really know how much these things can help until you test it.
What information does GCC Profile Guided Optimization (PGO) collect and which optimizations use it?
-fprofile-generate enables -fprofile-arcs, -fprofile-values and -fvpt.
-fprofile-use enables -fbranch-probabilities, -fvpt, -funroll-loops, -fpeel-loops and -ftracer.
Source: http://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Optimize-Options.html#Optimize-Options
PS. That page also has information about LTO.
G++/CMake Profile Guided Optimization cannot find generated .gcda files
For anyone wondering...
I did not write the project's original build system. I discovered that it compiled one executable and one library from the same sources. While the executable was profiled and recompiled successfully (PGO worked flawlessly), the library was not used anywhere, so it never ran and no profile files were written for it. Hence the missing profile files. Because the profiler outputs extremely long names, I thought the error came from the executable. Thank you all for the help.
nested for loop faster after profile guided optimization but with higher cache misses
I noticed only one significant difference between the assembly generated with and without PGO. Without PGO, the sum variable is spilled from a register to memory once per inner-loop iteration. Writing the variable to memory and loading it back could in theory slow things down significantly. Fortunately, modern processors mitigate this with store-to-load forwarding, so the slowdown is not that large. Still, Intel's optimization manual recommends against spilling floating-point variables to memory, especially when they are computed by long-latency operations such as floating-point multiplication.
What is really puzzling here is why GCC needs PGO to avoid spilling the register to memory. There are enough unused floating-point registers, and even without PGO the compiler could get all the information necessary for proper optimization from the single source file...
These unnecessary load/store operations explain not only why the PGO code is faster, but also why it shows a higher percentage of cache misses. Without PGO, the register is always spilled to the same location in memory, so this additional memory access increases both the number of memory accesses and the number of cache hits, while leaving the number of cache misses unchanged. With PGO we have fewer memory accesses but the same number of cache misses, so their percentage increases.
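The percentage argument can be written out with made-up numbers. Here M is the number of cache misses, A the number of memory accesses other than the spill, and S the number of extra spill loads and stores (assumed to always hit the cache). The values M = 100, A = 1000 and S = 1000 are purely hypothetical and are not measurements from the question.

```latex
% Miss rate without PGO (spill traffic included) vs. with PGO (no spill)
\text{rate}_{\text{no PGO}} = \frac{M}{A + S}, \qquad
\text{rate}_{\text{PGO}} = \frac{M}{A}

% Hypothetical values: M = 100, A = 1000, S = 1000
\frac{100}{1000 + 1000} = 5\%, \qquad
\frac{100}{1000} = 10\%
```

The number of misses is identical in both cases; only the denominator shrinks, so the miss percentage rises even though the program does less work.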