How to Use Profile Guided Optimizations in G++

How to use profile guided optimizations in g++?

-fprofile-generate instruments the application with profiling code. While the instrumented application runs, it records events that the compiler could exploit if it knew the usage pattern at compile time, such as which way branches go and which calls are worth inlining. I'm not sure in detail how GCC implements this.

When the program exits, it dumps this data into *.gcda files, which are essentially log data for that test run. After you rebuild the application with the -fprofile-use flag, GCC takes the *.gcda data into account when optimizing, which often improves performance noticeably. How much depends on many factors, including how well the test run reflects real usage.
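As a rough sketch of that generate/run/use cycle, here is a small translation unit with a deliberately lopsided branch, with the build steps in the leading comment. The file names (pgo_demo.cpp, main.cpp), the functions weigh and run_workload, and the workload itself are made up for illustration; only the two -fprofile-* flags come from GCC.

```cpp
// pgo_demo.cpp (hypothetical file name, used only for this sketch).
//
// Assumed build steps; main.cpp is a hypothetical driver that calls run_workload():
//   g++ -O2 -fprofile-generate pgo_demo.cpp main.cpp -o pgo_demo   # 1. instrumented build
//   ./pgo_demo                                                     # 2. training run, writes *.gcda on exit
//   g++ -O2 -fprofile-use pgo_demo.cpp main.cpp -o pgo_demo        # 3. rebuild using the profile
#include <cassert>
#include <cstddef>
#include <cstdint>

// A branch that is heavily skewed for the workload below: most values are small.
std::uint64_t weigh(std::uint32_t v) {
    if (v < 1000) {
        return v;                                   // hot path in the training run
    }
    return static_cast<std::uint64_t>(v) * 3 + 1;   // cold path
}

// Training workload: roughly 99% of the calls take the hot path.
std::uint64_t run_workload(std::size_t n) {
    assert(n > 0);  // an empty run would record nothing useful
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Every 100th value is large; all others stay below 1000.
        std::uint32_t v = (i % 100 == 99) ? 5000u
                                          : static_cast<std::uint32_t>(i % 1000);
        total += weigh(v);
    }
    return total;
}
```

The training run should resemble real usage; a profile gathered from an unrepresentative input can steer the compiler toward the wrong hot path.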

Profile-guided optimization (C)

It works by inserting extra code that counts how many times each code path is taken. When you compile a second time, the compiler uses what it learned about your program's actual execution, which it could only guess at before. There are a few things PGO can work toward:

  • Deciding which functions should be inlined, depending on how often they are called.
  • Deciding which branch of an "if" statement to favor, based on the percentage of executions that go one way or the other.
  • Deciding how to optimize loops, based on how many iterations run each time the loop is entered.

You never really know how much these things can help until you test it.
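To make the counting idea concrete, here is a hand-written analogy, not what GCC actually emits: a branch with explicit counters standing in for the instrumentation, and a second version carrying the kind of hint a "rarely taken" profile would justify. BranchCounts, clamp_counted, taken_ratio and clamp_hinted are hypothetical names; __builtin_expect is a real GCC/Clang builtin for stating such a hint by hand.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the counters an instrumented build keeps. The real counters
// live in compiler-generated data, not in user code.
struct BranchCounts {
    std::uint64_t taken = 0;
    std::uint64_t not_taken = 0;
};

// "Instrumented" version: record which way the branch goes on each call.
int clamp_counted(int x, int limit, BranchCounts& counts) {
    if (x > limit) {
        ++counts.taken;
        return limit;
    }
    ++counts.not_taken;
    return x;
}

// Fraction of executions in which the branch was taken.
double taken_ratio(const BranchCounts& counts) {
    const std::uint64_t total = counts.taken + counts.not_taken;
    assert(total > 0);  // a ratio needs at least one recorded execution
    return static_cast<double>(counts.taken) / static_cast<double>(total);
}

// Version with a manual hint, assuming the profile showed the branch is rare.
int clamp_hinted(int x, int limit) {
    if (__builtin_expect(x > limit, 0)) {
        return limit;   // expected to be the uncommon path
    }
    return x;
}
```

With PGO you do not write either version yourself; the compiler gathers the counts in the instrumented build and applies the equivalent of the hint in the second build.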

What information does GCC Profile Guided Optimization (PGO) collect and which optimizations use it?

-fprofile-generate enables -fprofile-arcs, -fprofile-values and -fvpt.

-fprofile-use enables -fbranch-probabilities, -fvpt, -funroll-loops, -fpeel-loops and -ftracer.

Source: http://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Optimize-Options.html#Optimize-Options

PS: That page also has information about LTO.
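Of the flags listed above, -fvpt (value profile transformations) is probably the least self-explanatory. The GCC documentation mentions specializing division operations based on the observed denominator; the sketch below shows by hand what such a specialization could look like. The function names and the assumed common divisor 16 are hypothetical, and the code GCC really generates may differ.

```cpp
#include <cassert>
#include <cstdint>

// Generic version: the divisor is only known at run time.
std::uint32_t scale_generic(std::uint32_t x, std::uint32_t d) {
    assert(d != 0);
    return x / d;
}

// Hand-written picture of the specialization. Assume the value profile says
// d is almost always 16: test for that value first, so the common case
// divides by a constant (which the compiler can turn into a shift).
std::uint32_t scale_specialized(std::uint32_t x, std::uint32_t d) {
    assert(d != 0);
    if (d == 16) {
        return x / 16;   // fast path for the commonly observed value
    }
    return x / d;        // fallback for every other divisor
}
```

Both functions return the same results; the second only pays off if the profiled value really dominates at run time.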

G++/CMake Profile Guided Optimization cannot find generated .gcda files

For anyone wondering...

I did not write the project's original build system. I discovered that it compiled one executable and one library from the same source. While the executable was profiled and recompiled successfully (PGO worked flawlessly), the library was not used anywhere, so it never ran and never produced profile data. Hence the missing profile files. Because the profiler outputs extremely long file names, I assumed the error came from the executable. Thank you all for the help.

Nested for loop faster after profile guided optimization but with higher cache misses

I noticed only one significant difference between the assembly generated with and without PGO. Without PGO, the sum variable is spilled from a register to memory once per inner-loop iteration. Writing the variable to memory and loading it back could, in theory, slow things down significantly. Fortunately, modern processors mitigate this with store-to-load forwarding, so the slowdown is not that large. Still, Intel's optimization manual recommends against spilling floating-point variables to memory, especially when they are computed by long-latency operations such as floating-point multiplication.

What is really puzzling here is why GCC needs PGO to avoid spilling the register to memory. There are enough unused floating-point registers, and even without PGO the compiler could get all the information it needs for proper optimization from the single source file...

These unnecessary load/store operations explain not only why the PGO code is faster, but also why it shows a higher percentage of cache misses. Without PGO, the register is always spilled to the same memory location, so the extra accesses increase both the total number of memory accesses and the number of cache hits, while leaving the number of cache misses unchanged. With PGO there are fewer memory accesses but the same number of cache misses, so the miss percentage goes up.
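The percentage effect is plain arithmetic, which this small sketch spells out. All counts are invented for illustration and are not measurements from the question; miss_rate and the constants are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Miss rate = cache misses / total memory accesses.
double miss_rate(std::uint64_t misses, std::uint64_t accesses) {
    assert(accesses > 0);
    return static_cast<double>(misses) / static_cast<double>(accesses);
}

// Hypothetical counts, chosen only to show the effect.
constexpr std::uint64_t kMisses        = 1000;    // same in both builds
constexpr std::uint64_t kLoopAccesses  = 100000;  // accesses the loop really needs
constexpr std::uint64_t kSpillAccesses = 100000;  // extra spill stores/reloads, assumed to always hit

// Without PGO: the spill traffic inflates the denominator.
double rate_without_pgo() {
    return miss_rate(kMisses, kLoopAccesses + kSpillAccesses);  // 1000 / 200000
}

// With PGO: same misses, fewer accesses, so a higher rate.
double rate_with_pgo() {
    return miss_rate(kMisses, kLoopAccesses);                   // 1000 / 100000
}
```

With these made-up numbers the miss rate doubles from 0.5% to 1% even though the absolute number of misses is unchanged, which is why the percentage alone says little about speed.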


