How to Find an Official Reference Listing the Operation of SSE Intrinsic Functions

Where can I find an official reference listing the operation of SSE intrinsic functions?

As well as Intel's vol. 2 PDF manual (the instruction set reference), there is also an online intrinsics guide.

The Intel® Intrinsics Guide contains reference information for Intel intrinsics, which provide access to Intel instructions such as Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Advanced Vector Extensions (Intel® AVX), and Intel® Advanced Vector Extensions 2 (Intel® AVX2).

It has full-text search, so an intrinsic can be found by its name, by CPU instruction, CPU feature, etc. It also has controls for which ISA extensions to show, so you can, for example, hide KNC (which you are unlikely to be able to use) or MMX (which is far less useful these days).

See also the sse tag wiki for links to guides and a couple of tutorials, as well as this official documentation.

Where can I find a reference for the AMD FMA4 intrinsics?

You can find the intrinsics in the file fma4intrin.h. Here are the 256-bit functions from this file, with some function attributes stripped. Each __builtin_* call emits the FMA instruction that is part of its name, so if you want to find an intrinsic's name, look up the appropriate __builtin_ia32_* call after the return and use the surrounding wrapper function.

/* 256b Floating point multiply/add type instructions.  */
__m256
_mm256_macc_ps (__m256 __A, __m256 __B, __m256 __C)
{
  return (__m256) __builtin_ia32_vfmaddps256 ((__v8sf)__A, (__v8sf)__B, (__v8sf)__C);
}

__m256d
_mm256_macc_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d) __builtin_ia32_vfmaddpd256 ((__v4df)__A, (__v4df)__B, (__v4df)__C);
}

__m256
_mm256_msub_ps (__m256 __A, __m256 __B, __m256 __C)
{
  return (__m256) __builtin_ia32_vfmaddps256 ((__v8sf)__A, (__v8sf)__B, -(__v8sf)__C);
}

__m256d
_mm256_msub_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d) __builtin_ia32_vfmaddpd256 ((__v4df)__A, (__v4df)__B, -(__v4df)__C);
}

__m256
_mm256_nmacc_ps (__m256 __A, __m256 __B, __m256 __C)
{
  return (__m256) __builtin_ia32_vfmaddps256 (-(__v8sf)__A, (__v8sf)__B, (__v8sf)__C);
}

__m256d
_mm256_nmacc_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d) __builtin_ia32_vfmaddpd256 (-(__v4df)__A, (__v4df)__B, (__v4df)__C);
}

__m256
_mm256_nmsub_ps (__m256 __A, __m256 __B, __m256 __C)
{
  return (__m256) __builtin_ia32_vfmaddps256 (-(__v8sf)__A, (__v8sf)__B, -(__v8sf)__C);
}

__m256d
_mm256_nmsub_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d) __builtin_ia32_vfmaddpd256 (-(__v4df)__A, (__v4df)__B, -(__v4df)__C);
}

__m256
_mm256_maddsub_ps (__m256 __A, __m256 __B, __m256 __C)
{
  return (__m256) __builtin_ia32_vfmaddsubps256 ((__v8sf)__A, (__v8sf)__B, (__v8sf)__C);
}

__m256d
_mm256_maddsub_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d) __builtin_ia32_vfmaddsubpd256 ((__v4df)__A, (__v4df)__B, (__v4df)__C);
}

__m256
_mm256_msubadd_ps (__m256 __A, __m256 __B, __m256 __C)
{
  return (__m256) __builtin_ia32_vfmaddsubps256 ((__v8sf)__A, (__v8sf)__B, -(__v8sf)__C);
}

__m256d
_mm256_msubadd_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d) __builtin_ia32_vfmaddsubpd256 ((__v4df)__A, (__v4df)__B, -(__v4df)__C);
}
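As a minimal usage sketch (my own example, not part of fma4intrin.h): with GCC these intrinsics become available through x86intrin.h when you compile with -mfma4, and _mm256_macc_ps(a, b, c) computes a*b + c per element.

#include <x86intrin.h>

// Hypothetical example; compile with: gcc -mfma4 (FMA4 only exists on AMD CPUs such as the Bulldozer family)
__m256 fma4_madd(__m256 a, __m256 b, __m256 c)
{
    return _mm256_macc_ps(a, b, c);   // emits the 4-operand vfmaddps (FMA4 form)
}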

SSE intrinsic over int16[8] to extract the sign of each element

You can use min/max operations to get the desired result, e.g.

#include <emmintrin.h>   // SSE2

inline __m128i _mm_sgn_epi16(__m128i v)
{
    v = _mm_min_epi16(v, _mm_set1_epi16(1));
    v = _mm_max_epi16(v, _mm_set1_epi16(-1));
    return v;
}

This is probably a little more efficient than explicitly comparing with zero + shifting + combining results.

Note that there is already an _mm_sign_epi16 intrinsic in SSSE3 (PSIGNW - see tmmintrin.h), which behaves somewhat differently, so I changed the name of the required function to _mm_sgn_epi16. Using _mm_sign_epi16 might be more efficient when SSSE3 is available, however, so you could do something like this:

inline __m128i _mm_sgn_epi16(__m128i v)
{
#ifdef __SSSE3__
    v = _mm_sign_epi16(_mm_set1_epi16(1), v);   // use PSIGNW on SSSE3 and later (tmmintrin.h)
#else
    v = _mm_min_epi16(v, _mm_set1_epi16(1));    // use PMINSW/PMAXSW on SSE2/SSE3.
    v = _mm_max_epi16(v, _mm_set1_epi16(-1));
#endif
    return v;
}
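As a quick sanity check, here is a small driver of my own (hypothetical, and assuming the _mm_sgn_epi16 helper defined above is in scope); each 16-bit lane maps to -1, 0, or +1:

#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    __m128i v = _mm_setr_epi16(-5, 0, 7, -32768, 32767, 1, -1, 0);
    __m128i s = _mm_sgn_epi16(v);        // -1 / 0 / +1 per element
    short out[8];
    _mm_storeu_si128((__m128i*)out, s);
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);           // prints: -1 0 1 -1 1 1 -1 0
    return 0;
}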

Performance of intrinsic functions with SSE

All that really matters is the addps. In a more realistic use case, where you might be, say, adding two large vectors of floats in a loop, the body of the loop will just contain addps, two loads and a store, and some scalar integer instructions for address arithmetic. On a modern superscalar CPU many of these instructions will execute in parallel.

Note also that you're compiling with optimisation disabled, so you won't get particularly efficient code. Try gcc -O3 -msse3 ....
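For reference, the loop shape described above looks roughly like this (my own sketch with hypothetical names, compiled with something like gcc -O3 -msse3): each iteration is one addps plus two loads, one store, and a little scalar address arithmetic.

#include <stddef.h>
#include <xmmintrin.h>   // SSE

void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);              // load
        __m128 vb = _mm_loadu_ps(b + i);              // load
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));   // addps + store
    }
    for (; i < n; i++)                                // scalar tail
        dst[i] = a[i] + b[i];
}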

SSE - Non-Existent haddsub intrinsic?

Generally you want to avoid designing your code to use horizontal ops in the first place; try to do the same thing to multiple data in parallel, instead of different things with different elements. But sometimes a local optimization is still worth it, and horizontal stuff can be better than pure scalar.

Intel experimented with adding horizontal ops in SSE3, but never added dedicated hardware to support them. They decode to 2 shuffles + 1 vertical op on all CPUs that support them (including AMD). See Agner Fog's instruction tables. More recent ISA extensions have mostly not included more horizontal ops, except for SSE4.1 dpps/dppd (which is also usually not worth using vs. manually shuffling).
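To illustrate that decode cost, here is roughly what _mm_hadd_ps(a, b) does under the hood, written out with intrinsics (a sketch of the equivalent two shuffles + one vertical add, not code from any header):

#include <xmmintrin.h>

static inline __m128 hadd_ps_manual(__m128 a, __m128 b)
{
    __m128 even = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0));  // [b2, b0, a2, a0]
    __m128 odd  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1));  // [b3, b1, a3, a1]
    return _mm_add_ps(even, odd);   // [b3+b2, b1+b0, a3+a2, a1+a0], same result as haddps
}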

SSSE3 pmaddubsw makes sense because element-width is already a problem for widening multiplication, and SSE4.1 phminposuw got dedicated HW support right away to make it worth using (and doing the same thing without it would cost a lot of uops, and it's specifically very useful for video encoding). But AVX / AVX2 / AVX512 horizontal ops are very scarce. AVX512 did introduce some nice shuffles, so you can build your own horizontal ops out of the powerful 2-input lane-crossing shuffles if needed.


If the most efficient solution to your problem already involves shuffling two inputs together two different ways and feeding that to an add or sub, then sure, haddpd is an efficient way to encode that; especially without AVX, where preparing the inputs might also have required a movaps instruction because shufpd is destructive (the compiler emits that movaps silently when you use intrinsics, but it still costs front-end bandwidth, and latency on CPUs like Sandybridge and earlier which don't eliminate reg-reg moves).

But if you were going to use the same input twice, haddpd is the wrong choice. See also Fastest way to do horizontal float vector sum on x86. hadd / hsub are only a good idea with two different inputs, e.g. as part of an on-the-fly transpose as part of some other operation on a matrix.
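For the one-input case, the linked Q&A boils down to something like this sketch (using the SSE3 movshdup trick instead of haddps):

#include <pmmintrin.h>   // SSE3, for _mm_movehdup_ps

static inline float hsum_ps_sse3(__m128 v)
{
    __m128 shuf = _mm_movehdup_ps(v);   // duplicate odd elements: [v3, v3, v1, v1]
    __m128 sums = _mm_add_ps(v, shuf);  // element 0 now holds v1+v0
    shuf = _mm_movehl_ps(shuf, sums);   // element 0 now holds v3+v2
    sums = _mm_add_ss(sums, shuf);      // element 0 = v3+v2+v1+v0
    return _mm_cvtss_f32(sums);
}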


Anyway, the point is, build your own haddsub_pd if you want it, out of two shuffles + SSE3 addsubpd (which does have single-uop hardware support on CPUs that support it.) With AVX, it will be just as fast as a hypothetical haddsubpd instruction, and without AVX will typically cost one extra movaps because the compiler needs to preserve both inputs to the first shuffle. (Code-size will be bigger, but I'm talking about cost in uops for the front-end, and execution-port pressure for the back-end.)

// Requires SSE3 (for addsubpd)
// inputs: a=[a1 a0] b=[b1 b0]
// output: [b1+b0, a1-a0], like haddpd for b and hsubpd for a
static inline
__m128d haddsub_pd(__m128d a, __m128d b) {
    __m128d lows  = _mm_unpacklo_pd(a,b);  // [b0, a0]
    __m128d highs = _mm_unpackhi_pd(a,b);  // [b1, a1]
    return _mm_addsub_pd(highs, lows);     // [b1+b0, a1-a0]
}

With gcc -msse3 and clang (on Godbolt) we get the expected:

    movapd   xmm2, xmm0   # ICC saves a code byte here with movaps, but gcc/clang use movapd on double vectors for no advantage on any CPU
    unpckhpd xmm0, xmm1
    unpcklpd xmm2, xmm1
    addsubpd xmm0, xmm2
    ret

This wouldn't typically matter when inlining, but as a stand-alone function gcc and clang have trouble when they need the return value in the same register that b starts in, instead of a. (e.g. if the args are reversed so it's haddsub(b,a)).

# gcc for haddsub_pd_reverseargs(__m128d b, __m128d a)
    movapd   xmm2, xmm1   # copy b
    unpckhpd xmm1, xmm0
    unpcklpd xmm2, xmm0
    movapd   xmm0, xmm1   # extra copy to put the result in the right register
    addsubpd xmm0, xmm2
    ret

clang actually does a better job, using a different shuffle (movhlps instead of unpckhpd) to still only use one register-copy:

# clang5.0
    movapd   xmm2, xmm1   # clang's comments are in least-significant-element-first order, unlike the source comments above, which follow Intel's convention (docs / diagrams / set_pd() argument order)
    unpcklpd xmm2, xmm0   # xmm2 = xmm2[0],xmm0[0]
    movhlps  xmm0, xmm1   # xmm0 = xmm1[1],xmm0[1]
    addsubpd xmm0, xmm2
    ret

For an AVX version with __m256d vectors, the in-lane behaviour of _mm256_unpacklo/hi_pd is actually what you want, for once, to get the even / odd elements.

static inline
__m256d haddsub256_pd(__m256d b, __m256d a) {
    __m256d lows  = _mm256_unpacklo_pd(a,b);  // [b2, a2 | b0, a0]
    __m256d highs = _mm256_unpackhi_pd(a,b);  // [b3, a3 | b1, a1]
    return _mm256_addsub_pd(highs, lows);     // [b3+b2, a3-a2 | b1+b0, a1-a0]
}

# clang and gcc both have an easy time avoiding wasted mov instructions
    vunpcklpd ymm2, ymm1, ymm0   # ymm2 = ymm1[0],ymm0[0],ymm1[2],ymm0[2]
    vunpckhpd ymm0, ymm1, ymm0   # ymm0 = ymm1[1],ymm0[1],ymm1[3],ymm0[3]
    vaddsubpd ymm0, ymm0, ymm2

Of course, if you have the same input twice, i.e. you want the sum and difference between the two elements of a vector, you only need one shuffle to feed addsubpd:

// returns [a1+a0  a1-a0]
static inline
__m128d sumdiff(__m128d a) {
    __m128d swapped = _mm_shuffle_pd(a,a, 0b01);
    return _mm_addsub_pd(swapped, a);
}

This actually compiles quite clunkily with both gcc and clang:

    movapd   xmm1, xmm0
    shufpd   xmm1, xmm0, 1
    addsubpd xmm1, xmm0
    movapd   xmm0, xmm1
    ret

But the 2nd movapd should go away when inlining, if the compiler doesn't need the result in the same register it started with. I think gcc and clang are both missing an optimization here: they could swap xmm0 after copying it:

# compilers should do this, but don't
    movapd   xmm1, xmm0      # a = xmm1 now
    shufpd   xmm0, xmm0, 1   # swapped = xmm0
    addsubpd xmm0, xmm1      # swapped +- a
    ret

Presumably their SSA-based register allocators don't think of using a 2nd register for the same value of a to free up xmm0 for swapped. Usually it's fine (and even preferable) to produce the result in a different register, so this is rarely a problem when inlining, only when looking at the stand-alone version of a function.

How do compilers treat SSE (or any) intrinsic functions?

The intrinsics compile down to the instructions they represent; whether this is efficient or not depends on how they are used.
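For instance (a minimal example of my own), a function that is nothing but a single intrinsic compiles to essentially just that instruction at -O2 on x86-64:

#include <xmmintrin.h>

__m128 add4(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);   // typically just: addps xmm0, xmm1 / ret
}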

Also, each compiler treats intrinsics a little differently (i.e., it's implementation-specific), but GCC is open source, so you can see how it treats the SSE ones. Open Watcom*, LCC, PCC, and TCC* are all open-source C compilers; although they don't have SSE intrinsics, they should still have intrinsics of some sort, and you can see how they handle them.

I think what you read was related to auto-vectorization of code, something GCC (see this) and ICC are very good at, but they aren't as good as hand-optimized code, at least not yet.

*Might have been updated with SSE support since; I haven't checked lately.

How to use SSE instructions on the x64 architecture in C++?

The modern method to use assembly instructions in C/C++ is to use intrinsics. Intrinsics have several advantages over inline assembly such as:

  • You don't have to worry about 32-bit and 64-bit mode.
  • You don't need to worry about registers and register spilling.
  • No need to worry about AT&T vs. Intel syntax.
  • No need to worry about calling conventions.
  • The compiler can optimize intrinsics further which it won't do with inline assembly.
  • Intrinsics are (for the most part) compatible across GCC, MSVC, ICC, and Clang.

I also like intrinsics because it's easy to emulate hardware with them, for example to prepare for AVX512.

You can find the list of intrinsics that MSVC supports here. Intel has better information on intrinsics as well, which mostly agrees with MSVC's.

But sometimes you still need or want inline assembly. In my opinion it's really stupid that Microsoft does not allow inline assembly in 64-bit mode. This means they have to define intrinsics for several things that other compilers can still do with inline assembly. One example is CPUID. Visual Studio has an intrinsic for CPUID but GCC still uses inline assembly. Another example is adc. For a long time MSVC had no intrinsic for adc but now it appears they do.
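For example, a portable CPUID helper ends up looking something like this (a sketch of my own; MSVC provides the __cpuid intrinsic in <intrin.h>, while GCC/Clang ship <cpuid.h>, which wraps inline assembly):

#ifdef _MSC_VER
#include <intrin.h>
static void cpuid_leaf(int leaf, int regs[4])
{
    __cpuid(regs, leaf);                 // fills EAX, EBX, ECX, EDX into regs[0..3]
}
#else
#include <cpuid.h>
static void cpuid_leaf(int leaf, int regs[4])
{
    unsigned a, b, c, d;
    __get_cpuid((unsigned)leaf, &a, &b, &c, &d);   // GNU wrapper around inline-asm CPUID
    regs[0] = (int)a; regs[1] = (int)b; regs[2] = (int)c; regs[3] = (int)d;
}
#endif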

Additionally, because they have to create intrinsics for everything, it causes confusion. They have to create an intrinsic for mulx, but Intel's documentation for it is wrong. They also have to create intrinsics for adcx and adox, yet their documentation disagrees with Intel's, and the generated assembly shows that no intrinsic produces adox. So once again the programmer is left waiting for an intrinsic for adox. If they had just allowed inline assembly, there would be no problem.

But back to SSE. With a few exceptions, e.g. _mm_set_epi64x in 32-bit mode on MSVC (I don't know if that's been fixed), the SSE/AVX/AVX2 intrinsics work as expected with MSVC, GCC, ICC, and Clang.


