How to Vectorize My Loop with G++

How to vectorize my loop with g++?

The -O3 flag turns on -ftree-vectorize automatically: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options

So in both cases the compiler is trying to do loop vectorization.

Using g++ 4.8.2 to compile with:

# In newer versions of GCC use -fopt-info-vec (or -fopt-info-vec-missed) instead of -ftree-vectorizer-verbose
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -o test

Gives this:

Analyzing loop at test.cpp:16

Vectorizing loop at test.cpp:16

test.cpp:16: note: create runtime check for data references *it2$_M_current_106 and *_39
test.cpp:16: note: created 1 versioning for alias checks.
test.cpp:16: note: LOOP VECTORIZED.

Analyzing loop at test_old.cpp:29

test.cpp:22: note: vectorized 1 loops in function.

test.cpp:18: note: Unroll loop 7 times
test.cpp:16: note: Unroll loop 7 times
test.cpp:28: note: Unroll loop 1 times

Compiling without the -ftree-vectorize flag:

g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorizer-verbose=1 -o test

Returns only this:

test_old.cpp:16: note: Unroll loop 7 times

test_old.cpp:28: note: Unroll loop 1 times

Line 16 is the start of the loop in the function, so the compiler is definitely vectorizing it. Checking the assembly confirms this too.

I seem to be getting some aggressive caching on the laptop I'm currently using, which makes it very hard to measure accurately how long the function takes to run.

But here are a couple of other things you can try too:

  • Use the __restrict__ qualifier to tell the compiler that there is no overlap between the arrays.

  • Tell the compiler the arrays are aligned with __builtin_assume_aligned (not portable)

Here's my resulting code (I removed the template, since you will want to use a different alignment for different data types):

#include <iostream>
#include <chrono>
#include <vector>

void foo( double * __restrict__ p1,
          double * __restrict__ p2,
          size_t start,
          size_t end )
{
    // Promise the compiler that both pointers are 16-byte aligned,
    // so it can use aligned SIMD loads/stores without a runtime check.
    double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
    double* pA2 = static_cast<double*>(__builtin_assume_aligned(p2, 16));

    for (size_t i = start; i < end; ++i)
    {
        pA1[i] = pA1[i] - pA2[i];
        pA1[i] += 1;
    }
}

int main()
{
    size_t n = 12800000;
    double x, y;
    std::vector<double> v, u;

    for (size_t i = 0; i < n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }

    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(&v[0], &u[0], 0, n);
    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;

    return 0;
}

Like I said, I've had trouble getting consistent time measurements, so I can't confirm whether this will give you a performance increase (or maybe even a decrease!).

How to auto-vectorize a loop with access stride 2 with g++ without OpenCL or intrinsics

Oh, I found @Peter Cordes's comment and combined it with my initial answer:

https://gcc.godbolt.org/z/bxzsfxPGx

and -fopt-info-vec-missed doesn't report anything for it:

void f(const unsigned char *input, const unsigned size, unsigned char *output) {
    constexpr unsigned MAX_SIZE = 2000;
    unsigned char odd[MAX_SIZE / 2];
    unsigned char even[MAX_SIZE / 2];

    // De-interleave the stride-2 input into separate even/odd arrays.
    for (unsigned i = 0, j = 0; size > i; i += 2, ++j) {
        even[j] = input[i];
        odd[j] = input[i + 1];
    }

    // Combine each even/odd pair into one output byte.
    for (unsigned i = 0; size / 2 > i; ++i) {
        output[i] = (even[i] << 4) | odd[i];
    }
}
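
For reference, a compile line along these lines (my own sketch, not from the original answer; it assumes the function lives in a file named f.cpp and an AVX2-capable target) reports which of the two loops GCC managed to vectorize:

g++ -O3 -mavx2 -fopt-info-vec-optimized -c f.cpp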

Automatic vectorization with g++ of a loop with bit operations

There is a solution using AVX2 (compile with -mavx2 and include <immintrin.h>):

__m256 _B = _mm256_setr_ps(B[0], B[1], B[2], B[3], B[0], B[1], B[2], B[3]);
__m256i _shift = _mm256_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14);
__m256i _mask = _mm256_set1_epi32(3);
for (size_t j = 0; j < J/2; j++)
{
    // Broadcast one 16-bit chunk of data and extract its eight 2-bit indices.
    short x = ((short*)data)[j];
    __m256i _index = _mm256_and_si256(_mm256_srlv_epi32(_mm256_set1_epi32(x), _shift), _mask);
    // Select B[index] per lane and accumulate into eight floats of A.
    _mm256_storeu_ps(A, _mm256_add_ps(_mm256_loadu_ps(A), _mm256_permutevar8x32_ps(_B, _index)));
    A += 8;
}
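
For comparison, this is my own scalar sketch of what the intrinsic loop computes (A here is the un-advanced base pointer, and data is assumed to hold J bytes of packed 2-bit indices; A, B, data and J come from the snippet above and are otherwise undefined here):

for (size_t j = 0; j < J/2; j++)
{
    unsigned short x = ((unsigned short*)data)[j];
    for (int k = 0; k < 8; k++)
        A[8*j + k] += B[(x >> (2*k)) & 3];   // same selection the permute performs
}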

How to auto-vectorize range-based for loops?

Posted bug report here:

https://connect.microsoft.com/VisualStudio/feedback/details/807826/range-based-for-loops-are-not-vectorized

Response:

Hi, thanks for the report.

Vectorizing range-based-for-loop-y code is something we are actively
making better. We'll address vectorizing this, plus enabling
auto-vectorization for other C++ language & library features in future
releases of the compiler.

The emission of reason code 1304 (on x64) and reason code 1301 (on
x86) are artifacts of compiler internals. The details of that, for
this particular code, is not important.

Thanks for the report! I am closing this MSConnect item. Feel free to
respond if you need anything else.

Eric Brumer Microsoft Visual C++ Team

g++, range-based for and vectorization

Optimizing loops is very rarely about optimizing the actual loop iteration code (for ( T k : j ) in this case), but very much about optimizing what is IN the loop.

Now, since the loop body is just ... in this case, it's impossible to say whether, for example, unrolling the loop will help, or declaring functions inline [or simply moving them so that the compiler can see them and inline them], or using auto-vectorization, or perhaps a completely different algorithm inside the loop.

The examples in the paragraph above, in a bit more detail:

  1. Unrolling the loop - essentially do several of the loop iterations without going back to the start of the loop. This is most helpful when the loop body is very small. There is automatic unrolling, where the compiler does the unrolling, or you can unroll the code manually by doing, say, four items in each loop iteration and then stepping four items forward in each loop variable update, or by updating the iterator multiple times inside the loop body [but this of course means not using the range-based for-loop]; see the sketch after this list.
  2. Inline functions - the compiler will take (usually small) functions and place them into the loop itself, rather than having the call. This saves on the time it takes for the processor to call out to another place in the code and return back. Most compilers only do this for functions that are "visible" to the compiler during compilation - so the source has to be either in the same source file, or in a header file that is included in the source file that is compiled.
  3. Auto-vectorisation - using SSE, MMX or AVX instructions to process multiple data items in one instruction (e.g. one SSE instruction can add four float values to another four floats in one instruction). This is faster than operating on a single data item at a time (most of the time; sometimes there is no benefit because of the extra work of combining the different data items and sorting out what goes where when the calculation is finished).
  4. Choose different algorithm - there are often several ways to solve a particular problem. Depending on what you are trying to achieve, a for-loop [of whatever kind] may not be the right solution in the first place, or the code inside the loop could perhaps use a more clever way to calculate/rearrange/whatever-it-does to achieve the result you need.

But ... is far too vague to say which, if any, of the above solutions will work to improve your code.

Why can GCC not vectorize this function and loop?

Vectorization is possible with gcc after some slight modifications to the code:

#include <cmath>

double BlackBoxFunction(const double x) {
    return 1.0/sqrt(x);
}

double ComputeIntegral(const int n, const double a, const double b) {
    const double dx = (b - a)/n;
    double I = 0.0;
    double d_i = 0.0;  // double loop counter avoids an int-to-double conversion every iteration
    for (int i = 0; i < n; i++) {
        const double xip12 = a + dx*(d_i + 0.5);
        d_i = d_i + 1.0;
        const double yip12 = BlackBoxFunction(xip12);
        const double dI = yip12*dx;
        I += dI;
    }
    return I;
}

This was compiled with the compiler options: -Ofast -march=haswell -fopt-info-vec-missed -funsafe-math-optimizations. The main loop compiles to

.L7:
vaddpd ymm2, ymm4, ymm7
inc eax
vaddpd ymm4, ymm4, ymm8
vfmadd132pd ymm2, ymm9, ymm5
vsqrtpd ymm2, ymm2
vdivpd ymm2, ymm6, ymm2
vfmadd231pd ymm3, ymm5, ymm2
cmp eax, edx
jne .L7

See the following Godbolt link

I removed the #pragma omp ... directives, because they didn't improve the vectorization, but they didn't make it worse either.

Note that changing the compiler option from -O3 to -Ofast alone is sufficient to enable vectorization. Nevertheless, it is more efficient to use a double counter than an int counter that is converted to double each iteration.

Note also that the vectorization reports are quite misleading. Inspect the generated assembly code to see whether or not the vectorization was successful.

How to vectorize with gcc?

This page offers details on getting gcc to automatically vectorize loops, including a few examples:

http://gcc.gnu.org/projects/tree-ssa/vectorization.html

While the examples are great, the syntax for enabling those options has changed a bit in recent GCC releases; see:

  • https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info

In summary, the following options will work for x86 chips with SSE2,
giving a log of loops that have been vectorized:

gcc -O2 -ftree-vectorize -msse2 -mfpmath=sse -ftree-vectorizer-verbose=5

Note that -msse is also a possibility, but it will only vectorize loops
using floats, not doubles or ints. (SSE2 is baseline for x86-64. For 32-bit code use -mfpmath=sse as well. That's the default for 64-bit but not 32-bit.)


Modern versions of GCC enable -ftree-vectorize at -O3, so just use that in GCC 4.x and later:

gcc -O3 -msse2 -mfpmath=sse -ftree-vectorizer-verbose=5

(Clang enables auto-vectorization at -O2. ICC defaults to optimization enabled + fast-math.)


Most of the following was written by Peter Cordes. Over time, as compilers change, the relevant options and their output will change as well.

To also use instruction set extensions supported by the hardware you're compiling on, and tune for it, use -march=native.

Reduction loops (like sum of an array) will need OpenMP or -ffast-math to treat FP math as associative and vectorize. Example on the Godbolt compiler explorer with -O3 -march=native -ffast-math including a reduction (array sum) which is scalar without -ffast-math. (Well, GCC8 and later do a SIMD load and then unpack it to scalar elements, which is pointless vs. simple unrolling. The loop bottlenecks on the latency of the one addss dependency chain.)
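
For concreteness, here is a minimal reduction sketch of my own (not the exact Godbolt example) that stays scalar at plain -O3 but vectorizes with -ffast-math or an OpenMP reduction clause:

#include <cstddef>

float sum(const float* a, std::size_t n)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];   // FP adds are not reassociated by default, so no SIMD without -ffast-math
    return s;
}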

Sometimes you don't need -ffast-math; just -fno-math-errno can help gcc inline math functions and vectorize code involving sqrt and/or rint / nearbyint.
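
For example, a loop like this sketch of mine will typically vectorize with just -O3 -fno-math-errno, because GCC no longer has to preserve errno for out-of-range inputs to sqrt:

#include <cmath>
#include <cstddef>

void sqrt_all(float* __restrict__ dst, const float* __restrict__ src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = std::sqrt(src[i]);   // can inline to sqrtps/vsqrtps when errno handling isn't required
}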

Other useful options include -flto (link-time optimization for cross-file inlining, constant propagation, etc.) and/or profile-guided optimization with -fprofile-generate, test run(s) with realistic input(s), then -fprofile-use. PGO enables loop unrolling for "hot" loops; in modern GCC that's off by default even at -O3.
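
A sketch of the typical PGO workflow (the file name and training run are placeholders):

g++ -O3 -march=native -fprofile-generate foo.cpp -o foo
./foo                       # training run(s) with realistic input
g++ -O3 -march=native -fprofile-use foo.cpp -o foo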


