List of Common C++ Optimization Techniques

Two ways to write better programs:

Make the best use of the language

  1. Code Complete by Steve McConnell
  2. Effective C++
  3. Exceptional C++

Profile your application

  1. Identify which areas of the code take the most time
  2. See if you can use better data structures/algorithms to make things faster

There is not much language-specific optimization one can do - it is limited to using language constructs well (learn from #1). The main benefit comes from #2 above.

What are the known C/C++ optimizations for GCC

Beware of the reports from profiling tools; they can be misleading. For instance, consider an application that does a large amount of string comparisons and not much else. A report is going to tell you that you spend >90% of your time in string comparison functions. So naturally, you decide to implement an optimized version of that code, only to find out that the profiler tells you that you are still spending 90% of your time there (because that is all your program does...).

You have to have intimate knowledge of your application and apply that knowledge when interpreting the profiler's output, or else you might be wasting effort.

Compilers today do a fairly good job of optimizing (especially with the extra optimization flags enabled). It is unlikely that you will benefit from anything at a language level (i.e. how you handle arrays) - you will probably have to read/write asm if you want to hand-tune things.

What are some good code optimization methods?

  1. Don't think about performance, think about clarity and correctness.
  2. Use a profiler.
  3. Keep using a profiler.
  4. See 1.

Optimization Techniques for C++

The quality is indeed bad, but I think his point is that CPUs are good at calculations, but suffer from poor performance on memory access (RAM is much slower than the CPU) and on branches (because the CPU works as a pipeline, and branches might cause the pipeline to stall).

Here are some cases where more instructions are faster:

  1. Branch prediction - even if we need to execute more instructions, if they result in better branch prediction, the CPU's pipeline will be full more of the time and fewer ops will be "thrown out" of it, which ultimately leads to better performance. This thread, for example, shows how doing the same thing, but on sorted data first, improves performance.

  2. CPU cache - if your code is cache optimized and follows the principle of locality, it is likely to be faster than code that isn't, even if the non-optimized code executes half the number of instructions. This thread gives an example of a small cache optimization - the same number of instructions can result in much slower code if it is not cache optimized.

  3. It also matters which instructions are executed. Some instructions are slower to perform than others; for example, division is usually slower than integer addition.

Note: all of the above are machine dependent, and how/if they actually change the performance might vary from one architecture to another.
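Item 2 can be made concrete with a classic example: summing a 2-D array row by row versus column by column performs exactly the same additions, but with very different memory access patterns (a sketch; the function names are made up):

```cpp
#include <vector>

// Same number of additions either way; only the traversal order differs.
// Row-major order touches consecutive addresses and is cache friendly;
// column-major order jumps `cols` doubles per step and defeats the cache
// once the matrix is larger than it.
double sum_row_major(const std::vector<double>& m, int rows, int cols) {
    double s = 0.0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            s += m[r * cols + c];   // consecutive addresses
    return s;
}

double sum_col_major(const std::vector<double>& m, int rows, int cols) {
    double s = 0.0;
    for (int c = 0; c < cols; ++c)
        for (int r = 0; r < rows; ++r)
            s += m[r * cols + c];   // stride of `cols` doubles
    return s;
}
```

Both return the same value; on a matrix that does not fit in cache, the row-major version is typically several times faster.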

What kinds of C optimization practices in x86 that were historically recommended to use are no longer effective?

Of the optimizations which are commonly recommended, a couple which are basically never fruitful given modern compilers include:

Mathematical transformations

Modern compilers understand mathematics, and will perform transformations on mathematical expressions where appropriate.

Optimizations such as conversion of multiplication to addition, or constant multiplication or division to bit shifting, are already performed by modern compilers, even at low optimization levels. Examples of these optimizations include:

x * 2  ->  x + x
x * 2  ->  x << 1

Note that some specific cases may differ. For instance, x >> 1 is not the same as x / 2 for negative signed values; it is not appropriate to blindly substitute one for the other!
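A minimal demonstration of that pitfall (the function names are made up for illustration):

```cpp
// Why x >> 1 is not a drop-in replacement for x / 2 on signed integers:
// division truncates toward zero, while an arithmetic right shift rounds
// toward negative infinity. (Shifting a negative value is implementation-
// defined before C++20, though arithmetic on all mainstream platforms.)
int half_div(int x)   { return x / 2; }
int half_shift(int x) { return x >> 1; }
```

For example, half_div(-3) yields -1 while half_shift(-3) yields -2; for non-negative values the two agree.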

Additionally, many of these suggested optimizations aren't actually any faster than the code they replace.

Stupid code tricks

I'm not even sure what to call this, but tricks like XOR swapping (a ^= b; b ^= a; a ^= b;) are not optimizations at all. They're just party tricks — they are slower, and more fragile, than the obvious approach. Don't use them.
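The fragility is easy to demonstrate: the XOR trick silently destroys the value when both operands refer to the same object (a sketch):

```cpp
#include <utility>

// XOR swap: works for two distinct objects, but when both references
// alias the same object, a ^= b zeroes it and the value is lost.
// std::swap has no such trap and is at least as fast on modern compilers.
void xor_swap(int& a, int& b) {
    a ^= b;
    b ^= a;
    a ^= b;
}
```

Calling xor_swap(x, x) leaves x equal to 0, whereas std::swap(x, x) leaves it unchanged.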

The register keyword

This keyword is ignored by many modern compilers, as its intended meaning (forcing a variable to be stored in a register) is not meaningful given current register allocation algorithms; in C++ it was deprecated in C++11 and removed outright in C++17.

Code transformations

Compilers will automatically perform a wide variety of code transformations where appropriate. Several such transformations are frequently recommended for manual application, but are rarely useful when applied by hand:

  • Loop unrolling. (This is often actually harmful when applied indiscriminately, as it bloats code size.)
  • Function inlining. (Tag a function as static, and it'll usually be inlined where appropriate when optimization is enabled.)

Does gcc always do this kind of optimization? (common subexpression elimination)

Yes, it is a very common optimisation called loop-invariant code motion, also known as hoisting or scalar promotion, and it is often performed as a side effect of common subexpression elimination.

It is valid to compute sys->pot.atoms just once before the loop if the compiler can ascertain that neither sys nor sys->pot.atoms can be modified inside the loop.

Note however, as commented by Groo, that if sys or sys->pot or sys->pot.atoms are specified as volatile, then it would be incorrect to compute it only once if the expression sys->pot.atoms is evaluated multiple times in the loop body or expressions.
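As a sketch of the transformation (the struct layout here is invented; only the member path sys->pot.atoms comes from the question):

```cpp
struct Pot { int atoms; };
struct Sys { Pot pot; };

// As written in the source: the bound is reloaded conceptually on
// every iteration.
long count_iterations(const Sys* sys, int passes) {
    long total = 0;
    for (int p = 0; p < passes; ++p)
        for (int i = 0; i < sys->pot.atoms; ++i)  // loop-invariant load
            ++total;
    return total;
}

// What the optimizer effectively produces, provided it can prove that
// neither sys nor sys->pot.atoms changes inside the loop (and that
// nothing involved is volatile):
long count_iterations_hoisted(const Sys* sys, int passes) {
    long total = 0;
    const int atoms = sys->pot.atoms;  // computed once, before the loops
    for (int p = 0; p < passes; ++p)
        for (int i = 0; i < atoms; ++i)
            ++total;
    return total;
}
```

Both functions return the same result; the hoisted form is simply what the compiler emits when the invariance proof succeeds.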

What optimizations should be left for compiler?

Looking at how both gcc and clang handle your code, the micro-optimisation you contemplate is vain. The compilers already apply standard common subexpression elimination techniques that remove the overhead you are trying to eliminate.

As a matter of fact, the code generated handles 2 components at a time using XMM registers.

If performance is a must, then here are steps that will save the day:

  • the real judge is the wall clock. Write a benchmark with realistic data and measure performance until you get consistent results.

  • if you have a profiler, use it to determine where the bottlenecks are, if any. Changing algorithms for those parts that appear to hog performance is an effective approach.

  • try and get the best from the compiler: study the optimization options and try and let the compiler use more aggressive techniques if they are appropriate for the target system. For example, -mavx512f -mavx512cd let gcc generate code that handles 8 components at a time using the 512-bit ZMM registers.

    This is a non-intrusive technique, as the code does not change, so you don't risk introducing new bugs by hand-optimizing the code.

Optimisation is a difficult art. In my experience, simplifying the code gets better results and far fewer bugs than adding extra subtle stuff to try and improve performance at the cost of readability and correctness.

Looking at the code, an obvious simplification seems to generate the same results and might facilitate the optimizer's job (but again, let the wall clock be the judge):

double get_kinetic_energy(const double v[], int n, double mass)
{
    double sum = 0.0;

    for (int i = 0; i < 3 * n; i++)
        sum += v[i] * v[i];

    return 0.5 * mass * sum;
}

Optimization of C code

Initially:

for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        x = x + 4*(2*i+j)*(i+2*k);
        if (i > j) {
            y = y + 8*(i-j);
        } else {
            y = y + 8*(j-i);
        }
    }
}

Removing y calculations:

for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        x = x + 4*(2*i+j)*(i+2*k);
    }
}

Splitting i, j, k:

for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        x = x + 8*i*i + 16*i*k ;  // multiple of 1 (no j)
        x = x + (4*i + 8*k)*j ;   // multiple of j
    }
}

Moving them externally (and removing the loop that runs N-i times):

for (i = 0; i <= N; i++) {
    x = x + (8*i*i + 16*i*k) * (N-i) ;
    x = x + (4*i + 8*k) * ((N*N+N)/2 - (i*i+i)/2) ;
}

Rewriting:

for (i = 0; i <= N; i++) {
    x = x + ( 8*k*(N*N+N)/2 ) ;
    x = x + i * ( 16*k*N + 4*(N*N+N)/2 + 8*k*(-1/2) ) ;
    x = x + i*i * ( 8*N + 16*k*(-1) + 4*(-1/2) + 8*k*(-1/2) );
    x = x + i*i*i * ( 8*(-1) + 4*(-1/2) ) ;
}

Rewriting - recalculating:

for (i = 0; i <= N; i++) {
    x = x + 4*k*(N*N+N) ;                      // multiple of 1
    x = x + i * ( 16*k*N + 2*(N*N+N) - 4*k ) ; // multiple of i
    x = x + i*i * ( 8*N - 20*k - 2 ) ;         // multiple of i^2
    x = x + i*i*i * ( -10 ) ;                  // multiple of i^3
}

Another move to external (and removal of the i loop):

x = x + ( 4*k*(N*N+N) )              * (N+1) ;
x = x + ( 16*k*N + 2*(N*N+N) - 4*k ) * ((N*(N+1))/2) ;
x = x + ( 8*N - 20*k - 2 ) * ((N*(N+1)*(2*N+1))/6);
x = x + (-10) * ((N*N*(N+1)*(N+1))/4) ;

Both the above loop removals use the summation formulas:

Sum(1, i = 0..n) = n + 1

Sum(i, i = 0..n) = n(n + 1)/2

Sum(i², i = 0..n) = n(n + 1)(2n + 1)/6

Sum(i³, i = 0..n) = n²(n + 1)²/4
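Algebra like this is easy to get wrong, so it is worth checking numerically: the closed form must match the original double loop for every N and k (a small sketch covering only x, with the y updates left out as above):

```cpp
// Original nested loop (x part only).
long long x_loop(long long N, long long k) {
    long long x = 0;
    for (long long i = 0; i <= N; i++)
        for (long long j = i + 1; j <= N; j++)
            x += 4 * (2 * i + j) * (i + 2 * k);
    return x;
}

// Loop-free closed form derived above; every division is exact
// (each dividend is a product of consecutive integers).
long long x_closed(long long N, long long k) {
    long long x = 0;
    x += (4 * k * (N * N + N))                  * (N + 1);
    x += (16 * k * N + 2 * (N * N + N) - 4 * k) * ((N * (N + 1)) / 2);
    x += (8 * N - 20 * k - 2)                   * ((N * (N + 1) * (2 * N + 1)) / 6);
    x += (-10) * ((N * N * (N + 1) * (N + 1)) / 4);
    return x;
}
```

For example, with N = 2 and k = 1 both functions yield 72.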

Optimizing code for various C/C++ compilers

Don't choose the implementation based on predefined macros. Let the build system control it.

This lets you build and compare multiple implementations against each other during unit testing.
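One way to read that advice (a sketch; the dot_* names and the DOT_IMPL macro are made up): give each variant the same signature, let the build system pass a -D flag to pick the one that ships, and keep every variant linkable so tests can compare them:

```cpp
// Two interchangeable implementations of the same contract.
double dot_scalar(const double* a, const double* b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

double dot_unrolled(const double* a, const double* b, int n) {
    double s0 = 0.0, s1 = 0.0;
    int i = 0;
    for (; i + 1 < n; i += 2) {      // two accumulators per iteration
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    for (; i < n; ++i)               // leftover element when n is odd
        s0 += a[i] * b[i];
    return s0 + s1;
}

// The build system selects the shipped variant, e.g.
//   g++ -DDOT_IMPL=dot_unrolled ...
// Nothing in the code inspects compiler-predefined macros.
#ifndef DOT_IMPL
#define DOT_IMPL dot_scalar
#endif

double dot(const double* a, const double* b, int n) {
    return DOT_IMPL(a, b, n);
}
```

A unit test can then call every variant on the same inputs and assert they agree, regardless of which one the build selected for dot.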


