Sse Sse2 and Sse3 for Gnu C++

SSE SSE2 and SSE3 for GNU C++

Sorry don't know of a tutorial.

Your best bet (IMHO) is to use SSE via the "intrinsic" functions Intel provides to wrap (generally) single SSE instructions.
These are made available via a set of include files named *mmintrin.h e.g xmmintrin.h is the original SSE instruction set.

Begin familiar with the contents of Intel's Optimization Reference Manual is a good idea (see section 4.3.1.2 for an example of intrinsics) and the SIMD sections are essential reading. The instruction set reference manuals are pretty helpful too, in that each instruction's documentation includes the "intrinsic" function it corresponds to.

Do spend some time inspecting the assembler produced by the compiler from intrinsics (you'll learn a lot) and on profiling/performance measurement (you'll avoid wasting time SSE-ing code for little return on the effort).

Update 2011-05-31: There is some very nice coverage of intrinsics and vectorization in Agner Fog's optimization PDFs (thanks) although it's a bit spread about (e.g section 12 of the first one and section 5 of the second one). These aren't exactly tutorial material (in fact there's a "these manuals are not for beginners" warning) but they do rightly treat SIMD (whether used via asm, intrinsics or compiler vectorization) as just one part of the larger optimization toolbox.

Update 2012-10-04: A nice little Linux Journal article on gcc vector intrinsics deserves a mention here. More general than just SSE (covers PPC and ARM extensions too). There's a good collection of references on the last page, which drew my attention to Intel's "intrinsics guide".

Detect the availability of SSE/SSE2 instruction set in Visual Studio

From the documentation:

_M_IX86_FP
Expands to a value indicating which /arch compiler option was used:
0 if /arch:IA32 was used.
1 if /arch:SSE was used.
2 if /arch:SSE2 was used. This value is the default if /arch was not specified.

I don't see any mention of _SSE_.

GCC SSE code optimization

Vectorization in GCC is enabled at -O3. That's why at -O0, you see only the ordinary scalar SSE2 instructions (movsd, addsd, etc). Using GCC 4.6.1 and your second example:

#define N 10000
#define NTIMES 100000

double a[N] __attribute__ ((aligned (16)));
double b[N] __attribute__ ((aligned (16)));
double c[N] __attribute__ ((aligned (16)));
double r[N] __attribute__ ((aligned (16)));

int
main (void)
{
  int i, times;
  for (times = 0; times < NTIMES; times++)
    {
      for (i = 0; i < N; ++i)
        r[i] = (a[i] + b[i]) * c[i];
    }

  return 0;
}

and compiling with gcc -S -O3 -msse2 sse.c produces for the inner loop the following instructions, which is pretty good:

.L3:
    movapd  a(%eax), %xmm0
    addpd   b(%eax), %xmm0
    mulpd   c(%eax), %xmm0
    movapd  %xmm0, r(%eax)
    addl    $16, %eax
    cmpl    $80000, %eax
    jne .L3

As you can see, with the vectorization enabled GCC emits code to perform two loop iterations in parallel. It can be improved, though - this code uses the lower 128 bits of the SSE registers, but it can use the full the 256-bit YMM registers, by enabling the AVX encoding of SSE instructions (if available on the machine). So, compiling the same program with gcc -S -O3 -msse2 -mavx sse.c gives for the inner loop:

.L3:
    vmovapd a(%eax), %ymm0
    vaddpd  b(%eax), %ymm0, %ymm0
    vmulpd  c(%eax), %ymm0, %ymm0
    vmovapd %ymm0, r(%eax)
    addl    $32, %eax
    cmpl    $80000, %eax
    jne .L3

Note that v in front of each instruction and that instructions use the 256-bit YMM registers, four iterations of the original loop are executed in parallel.

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Most compilers will automatically define:

__SSE__
__SSE2__
__SSE3__
__AVX__
__AVX2__

etc, according to whatever command line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this:

$ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE_MATH__ 1

or:

$ gcc -mavx2 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

or to just check the pre-defined macros for a default build on your particular platform:

$ gcc -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __SSE2_MATH__ 1
#define __SSE2__ 1
#define __SSE3__ 1
#define __SSE_MATH__ 1
#define __SSE__ 1
#define __SSSE3__ 1

More recent Intel processors support AVX-512, which is not a monolithic instruction set. One can see the support available from GCC (version 6.2) for two examples below.

Here is Knights Landing:

$ gcc -march=knl -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __AVX512CD__ 1
#define __AVX512ER__ 1
#define __AVX512F__ 1
#define __AVX512PF__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

Here is Skylake AVX-512:

$ gcc -march=skylake-avx512 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512F__ 1
#define __AVX512VL__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

Intel has disclosed additional AVX-512 subsets (see ISA extensions). GCC (version 7) supports compiler flags and preprocessor symbols associated with the 4FMAPS, 4VNNIW, IFMA, VBMI and VPOPCNTDQ subsets of AVX-512:

for i in 4fmaps 4vnniw ifma vbmi vpopcntdq ; do echo "==== $i ====" ; gcc -mavx512$i -dM -E - < /dev/null | egrep "AVX512" | sort ; done
==== 4fmaps ====
#define __AVX5124FMAPS__ 1
#define __AVX512F__ 1
==== 4vnniw ====
#define __AVX5124VNNIW__ 1
#define __AVX512F__ 1
==== ifma ====
#define __AVX512F__ 1
#define __AVX512IFMA__ 1
==== vbmi ====
#define __AVX512BW__ 1
#define __AVX512F__ 1
#define __AVX512VBMI__ 1
==== vpopcntdq ====
#define __AVX512F__ 1
#define __AVX512VPOPCNTDQ__ 1

Note that the SSE macros won't work with Visual C++. You have to use _M_IX86_FP instead.

Is sse2 enabled by default in g++?

If you are compiling for 64bit, then this is totally fine and documented behavior.

As stated in the gcc docs the SSE instruction set is enabled by default when using an x86-64 compiler:

-mfpmath=unit

Generate floating point arithmetics for selected unit unit. The choices for unit are:

`387'

Use the standard 387 floating point coprocessor present majority of chips and emulated otherwise. Code compiled with this option will run almost everywhere. The temporary results are computed in 80bit precision instead of precision specified by the type resulting in slightly different results compared to most of other chips. See -ffloat-store for more detailed description.

This is the default choice for i386 compiler.

`sse'

Use scalar floating point instructions present in the SSE instruction set. This instruction set is supported by Pentium3 and newer chips, in the AMD line by Athlon-4, Athlon-xp and Athlon-mp chips. The earlier version of SSE instruction set supports only single precision arithmetics, thus the double and extended precision arithmetics is still done using 387. Later version, present only in Pentium4 and the future AMD x86-64 chips supports double precision arithmetics too.

For the i386 compiler, you need to use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make this option effective. For the x86-64 compiler, these extensions are enabled by default.

The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80bit.

This is the default choice for the x86-64 compiler.

GCC generates SSE instructions instead of AVX

Your code does all integer arithmetic; there are no integer operations in the AVX extension. They were added in AVX2, which you haven’t enabled.

Before you go and rewrite all your code to use float or buy a processor with AVX2, I should point out that the array-of-structures memory layout you appear to be using defeats many auto-vectorizers, so it isn’t totally obvious that your code would take advantage of AVX if integer ops were available. You may want to consider using a structure-of-arrays layout instead, though that may also prove to be a relatively invasive change to make.

How to enable sse3 autovectorization in gcc

The problem is that the complex type is not SIMD friendly. I have never been a fan of the complex type because it's a composite object that usually does not map to a primitive type or single operation in hardware (certainly not with x86 hardware).

In order to make complex arithmetic SIMD friendly you need to operate on multiple complex numbers simultaneous. For SSE you need to operate on four complex numbers at once.

We can use GCC's vector extensions to make the syntax easier.

typedef float v4sf __attribute__ ((vector_size (16)));

Then we can delcare a union of an array and the vector extension

typedef union {
  v4sf v;
  float e[4];
} float4

And lastly we define a block of four complex numbers like this

typedef struct {
  float4 x;
  float4 y;
} complex4;

where x is four real parts and y is four imaginary components.

Once we have this we can multiple 4 complex numbers at once like this

static complex4 complex4_mul(complex4 a, complex4 b) {
  return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}

and finally we get to your function modified to operate on four complex numbers at a time.

complex4 f4(complex4 x[], int n) {
  v4sf one = {1,1,1,1};
  complex4 p = {one,one};
  for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
  return p;
}

Let's look at the assembly (Intel syntax) to see if it's optimal

.L3:
    movaps  xmm4, XMMWORD PTR [rsi]
    add     rsi, 32
    movaps  xmm1, XMMWORD PTR -16[rsi]
    cmp     rdx, rsi
    movaps  xmm2, xmm4
    movaps  xmm5, xmm1
    mulps   xmm1, xmm3
    mulps   xmm2, xmm3
    mulps   xmm5, xmm0
    mulps   xmm0, xmm4
    subps   xmm2, xmm5
    addps   xmm0, xmm1
    movaps  xmm3, xmm2
    jne     .L3

That's exactly four 4-wide multiplications, one 4-wide addition, and one 4-wide subtraction. The variable p stays in register and only the array x is loaded from memory just like we want.

Let's look at the algebra for the product of complex numbers

{a, bi}*{c, di} = {(ac - bd),(bc + ad)i}

That's exactly four multiplications, one addition, and one subtraction.

As I explained in this answer efficient SIMD algebraically is often identical to the scalar arithmetic. So we have replaced four 1-wide multiplications, addition, and subtraction, with four 4-wide multiplications, addition, and subtraction. That's the best you can do with 4-wide SIMD: four for the price of one.

Note that this does not need any instructions beyond SSE and no additional SSE instructions (except for FMA4) will be any better. So on a 64-bit system you can compile with -O3.

It is trivial to extend this for 8-wide SIMD with AVX.

One major advantage of using GCC's vector extensions is you get FMA without any additional effort. E.g. if you compile with -O3 -mfma4 the main loop is

.L3:
    vmovaps xmm0, XMMWORD PTR 16[rsi]
    add     rsi, 32
    vmulps  xmm1, xmm0, xmm2
    vmulps  xmm0, xmm0, xmm3
    vfmsubps        xmm1, xmm3, XMMWORD PTR -32[rsi], xmm1
    vmovaps xmm3, xmm1
    vfmaddps        xmm2, xmm2, XMMWORD PTR -32[rsi], xmm0
    cmp     rdx, rsi
    jne     .L3

Where can I find an official reference listing the operation of SSE intrinsic functions?

As well as Intel's vol.2 PDF manual, there is also an online intrinsics guide.

The Intel® Intrinsics Guide contains reference information for Intel intrinsics, which provide access to Intel instructions such as Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Advanced Vector Extensions (Intel® AVX), and Intel® Advanced Vector Extensions 2 (Intel® AVX2).

It has a full-text search, so an intrinsic can be found by its name, or by CPU instruction, CPU feature, etc. It also has a control on which ISA extension to show. This allows, for example, not searching KNC that you wouldn't likely be able to use, or MMX that is far less useful these days.

See also the tag wiki for the sse tag for links to guides and a couple tutorials, as well as this official documentation.

Sse Sse2 and Sse3 for Gnu C++