AVX2: Computing dot product of 512 float arrays

_mm256_dp_ps is only useful for dot products of 2 to 4 elements; for longer vectors, use vertical SIMD in a loop and reduce to scalar at the end. Using _mm256_dp_ps and _mm256_add_ps in a loop would be much slower.


Unlike MSVC and ICC, GCC and clang require you to enable (with command-line options) the ISA extensions you use intrinsics for.


The code below is probably close to the theoretical performance limit of your CPU. Untested.

Compile it with clang or gcc, -O3 -march=native. (Requires at least -mavx -mfma, but the -mtune options implied by -march are good too, and so are the other things -march=native enables, like -mpopcnt. Tune options are critical for this to compile efficiently on most CPUs with FMA, specifically -mno-avx256-split-unaligned-load: see "Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?")

Or compile it with MSVC -O2 -arch:AVX2

#include <immintrin.h>
#include <vector>
#include <assert.h>

// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with an integer offset argument makes it easier for the compiler to emit good code.

// Multiply 8 floats by another 8 floats.
template<int offsetRegs>
inline __m256 mul8( const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_mul_ps( a, b );
}

// Returns acc + ( p1 * p2 ), for 8-wide float lanes.
template<int offsetRegs>
inline __m256 fma8( __m256 acc, const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_fmadd_ps( a, b, acc );
}

// Compute dot product of float vectors, using 8-wide FMA instructions.
float dotProductFma( const std::vector<float>& a, const std::vector<float>& b )
{
    assert( a.size() == b.size() );
    assert( 0 == ( a.size() % 32 ) );
    if( a.empty() )
        return 0.0f;

    const float* p1 = a.data();
    const float* const p1End = p1 + a.size();
    const float* p2 = b.data();

    // Process initial 32 values. Nothing to add yet, just multiplying.
    __m256 dot0 = mul8<0>( p1, p2 );
    __m256 dot1 = mul8<1>( p1, p2 );
    __m256 dot2 = mul8<2>( p1, p2 );
    __m256 dot3 = mul8<3>( p1, p2 );
    p1 += 8 * 4;
    p2 += 8 * 4;

    // Process the rest of the data.
    // The code uses FMA instructions to multiply + accumulate, consuming 32 values per loop iteration.
    // Unrolling manually for 2 reasons:
    // 1. To reduce data dependencies. With a single register, every loop iteration would depend on the previous result.
    // 2. Unrolled code checks the exit condition 4x less often, so more CPU cycles are spent computing useful stuff.
    while( p1 < p1End )
    {
        dot0 = fma8<0>( dot0, p1, p2 );
        dot1 = fma8<1>( dot1, p1, p2 );
        dot2 = fma8<2>( dot2, p1, p2 );
        dot3 = fma8<3>( dot3, p1, p2 );
        p1 += 8 * 4;
        p2 += 8 * 4;
    }

    // Add 32 values into 8
    const __m256 dot01 = _mm256_add_ps( dot0, dot1 );
    const __m256 dot23 = _mm256_add_ps( dot2, dot3 );
    const __m256 dot0123 = _mm256_add_ps( dot01, dot23 );
    // Add 8 values into 4
    const __m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0123 ), _mm256_extractf128_ps( dot0123, 1 ) );
    // Add 4 values into 2
    const __m128 r2 = _mm_add_ps( r4, _mm_movehl_ps( r4, r4 ) );
    // Add 2 lower values into the final result
    const __m128 r1 = _mm_add_ss( r2, _mm_movehdup_ps( r2 ) );
    // Return the lowest lane of the result vector.
    // The intrinsic below compiles into a no-op; modern compilers return floats in the lowest lane of the xmm0 register.
    return _mm_cvtss_f32( r1 );
}

Possible further improvements:

  1. Unroll by 8 vectors instead of 4. I've checked gcc 9.2 asm output; the compiler only used 8 of the 16 available vector registers.

  2. Make sure both input vectors are aligned, e.g. use a custom allocator which calls _aligned_malloc / _aligned_free on MSVC, or aligned_alloc / free on gcc & clang. Then replace _mm256_loadu_ps with _mm256_load_ps. A minimal allocator sketch follows below.
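
As a sketch of improvement 2 (assuming C++17, and using over-aligned operator new instead of the platform-specific calls; AlignedAllocator is an illustrative name, not something from the original answer):

#include <cstddef>
#include <new>
#include <vector>

// Minimal C++17 aligned allocator; 32-byte alignment matches __m256 loads.
template<class T, std::size_t Alignment = 32>
struct AlignedAllocator
{
    using value_type = T;
    template<class U> struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() = default;
    template<class U> AlignedAllocator( const AlignedAllocator<U, Alignment>& ) noexcept {}

    T* allocate( std::size_t n )
    {
        return static_cast<T*>( ::operator new( n * sizeof( T ), std::align_val_t{ Alignment } ) );
    }
    void deallocate( T* p, std::size_t ) noexcept
    {
        ::operator delete( p, std::align_val_t{ Alignment } );
    }
};

template<class T, class U, std::size_t A>
bool operator==( const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>& ) noexcept { return true; }
template<class T, class U, std::size_t A>
bool operator!=( const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>& ) noexcept { return false; }

// Usage: std::vector<float, AlignedAllocator<float>> a( 512 );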


To auto-vectorize a simple scalar dot product, you'd also need OpenMP SIMD or -ffast-math (implied by -Ofast) to let the compiler treat FP math as associative, even though it isn't (because of rounding). But GCC won't use multiple accumulators when auto-vectorizing, even if it does unroll, so you'd bottleneck on FMA latency, not load throughput.

(2 loads per FMA means the throughput bottleneck for this code is vector loads, not actual FMA operations.)
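
For reference, this is the kind of scalar loop being discussed; the OpenMP SIMD reduction clause is one way to grant the compiler that associativity without -ffast-math (a sketch, compiled with -fopenmp-simd or -fopenmp):

#include <cstddef>

float dotProductScalar( const float* a, const float* b, std::size_t n )
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for( std::size_t i = 0; i < n; ++i )
        sum += a[i] * b[i];
    return sum;
}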

Dot Product of Vectors with SIMD

First of all: You shouldn't assume that you can optimize better than the compiler could.

Yes, you are now using AVX instructions in your "optimized" code. But you also wrote code which the compiler now has issues unrolling, in addition to the plain vectorization.

For comparison, let's have a look at what a compiler would actually make from your "slow" C implementation, just the hot loop without footer.

ICC, compiled with -O3 -march=skylake -ffast-math:

..B1.13:
vmovups ymm2, YMMWORD PTR [rsi+rdi*4]
vmovups ymm3, YMMWORD PTR [32+rsi+rdi*4]
vfmadd231ps ymm1, ymm2, YMMWORD PTR [r8+rdi*4]
vfmadd231ps ymm0, ymm3, YMMWORD PTR [32+r8+rdi*4]
add rdi, 16
cmp rdi, rax
jb ..B1.13

Clang, with the same parameters, is even more pessimistic and unrolls this to the following:

.LBB0_4:
vmovups ymm4, ymmword ptr [rsi + 4*rcx]
vmovups ymm5, ymmword ptr [rsi + 4*rcx + 32]
vmovups ymm6, ymmword ptr [rsi + 4*rcx + 64]
vmovups ymm7, ymmword ptr [rsi + 4*rcx + 96]
vfmadd132ps ymm4, ymm0, ymmword ptr [rdi + 4*rcx]
vfmadd132ps ymm5, ymm1, ymmword ptr [rdi + 4*rcx + 32]
vfmadd132ps ymm6, ymm2, ymmword ptr [rdi + 4*rcx + 64]
vfmadd132ps ymm7, ymm3, ymmword ptr [rdi + 4*rcx + 96]
vmovups ymm0, ymmword ptr [rsi + 4*rcx + 128]
vmovups ymm1, ymmword ptr [rsi + 4*rcx + 160]
vmovups ymm2, ymmword ptr [rsi + 4*rcx + 192]
vmovups ymm3, ymmword ptr [rsi + 4*rcx + 224]
vfmadd132ps ymm0, ymm4, ymmword ptr [rdi + 4*rcx + 128]
vfmadd132ps ymm1, ymm5, ymmword ptr [rdi + 4*rcx + 160]
vfmadd132ps ymm2, ymm6, ymmword ptr [rdi + 4*rcx + 192]
vfmadd132ps ymm3, ymm7, ymmword ptr [rdi + 4*rcx + 224]
add rcx, 64
add rax, 2
jne .LBB0_4

Surprise, both compilers were already able to use AVX instructions, no intrinsic hacking required.

But what's more interesting is that both compilers decided that one accumulation register wasn't enough to saturate the AVX pipeline and used 2 and 4 accumulation registers, respectively. Having more operations in flight helps mask the latency of the FMA, up to the point where the actual memory throughput limit is reached.

Just don't forget the -ffast-math compiler option; without it, it wouldn't be legal to pull the final accumulation out of the vectorized loop.


GCC, also with the same options, is actually "only" doing as well as your "optimized" solution:

.L7:
add r8, 1
vmovaps ymm3, YMMWORD PTR [r9+rax]
vfmadd231ps ymm1, ymm3, YMMWORD PTR [rcx+rax]
add rax, 32
cmp r8, r10
jb .L7

However, GCC was still a little bit smarter in also adding a header to that loop, so it could use vmovaps (aligned memory access) instead of vmovups (unaligned memory access) for the first load.


For completeness, with pure AVX (-O3 -march=ivybridge -ffast-math):

ICC:

..B1.12:
vmovups xmm2, XMMWORD PTR [r8+rdi*4]
vmovups xmm5, XMMWORD PTR [32+r8+rdi*4]
vinsertf128 ymm3, ymm2, XMMWORD PTR [16+r8+rdi*4], 1
vinsertf128 ymm6, ymm5, XMMWORD PTR [48+r8+rdi*4], 1
vmulps ymm4, ymm3, YMMWORD PTR [rsi+rdi*4]
vmulps ymm7, ymm6, YMMWORD PTR [32+rsi+rdi*4]
vaddps ymm1, ymm1, ymm4
vaddps ymm0, ymm0, ymm7
add rdi, 16
cmp rdi, rax
jb ..B1.12

Clang:

.LBB0_5:
vmovups xmm4, xmmword ptr [rdi + 4*rcx]
vmovups xmm5, xmmword ptr [rdi + 4*rcx + 32]
vmovups xmm6, xmmword ptr [rdi + 4*rcx + 64]
vmovups xmm7, xmmword ptr [rdi + 4*rcx + 96]
vinsertf128 ymm4, ymm4, xmmword ptr [rdi + 4*rcx + 16], 1
vinsertf128 ymm5, ymm5, xmmword ptr [rdi + 4*rcx + 48], 1
vinsertf128 ymm6, ymm6, xmmword ptr [rdi + 4*rcx + 80], 1
vinsertf128 ymm7, ymm7, xmmword ptr [rdi + 4*rcx + 112], 1
vmovups xmm8, xmmword ptr [rsi + 4*rcx]
vmovups xmm9, xmmword ptr [rsi + 4*rcx + 32]
vmovups xmm10, xmmword ptr [rsi + 4*rcx + 64]
vmovups xmm11, xmmword ptr [rsi + 4*rcx + 96]
vinsertf128 ymm8, ymm8, xmmword ptr [rsi + 4*rcx + 16], 1
vmulps ymm4, ymm8, ymm4
vaddps ymm0, ymm4, ymm0
vinsertf128 ymm4, ymm9, xmmword ptr [rsi + 4*rcx + 48], 1
vmulps ymm4, ymm4, ymm5
vaddps ymm1, ymm4, ymm1
vinsertf128 ymm4, ymm10, xmmword ptr [rsi + 4*rcx + 80], 1
vmulps ymm4, ymm4, ymm6
vaddps ymm2, ymm4, ymm2
vinsertf128 ymm4, ymm11, xmmword ptr [rsi + 4*rcx + 112], 1
vmulps ymm4, ymm4, ymm7
vaddps ymm3, ymm4, ymm3
add rcx, 32
cmp rax, rcx
jne .LBB0_5

GCC:

.L5:
vmovups xmm3, XMMWORD PTR [rdi+rax]
vinsertf128 ymm1, ymm3, XMMWORD PTR [rdi+16+rax], 0x1
vmovups xmm4, XMMWORD PTR [rsi+rax]
vinsertf128 ymm2, ymm4, XMMWORD PTR [rsi+16+rax], 0x1
add rax, 32
vmulps ymm1, ymm1, ymm2
vaddps ymm0, ymm0, ymm1
cmp rax, rcx
jne .L5

Pretty much the same optimizations were applied, just with a few additional operations since FMA is missing and unaligned 256-bit loads are not advisable on Ivy Bridge.

Intel AVX: 256-bit version of dot product for double precision floating point variables

I would use a 4*double multiplication, then a hadd (which unfortunately only adds 2*2 values in the upper and lower halves), then extract the upper half (a shuffle should work equally well, maybe faster) and add it to the lower half.

The result is in the low 64 bits of dotproduct.

__m256d xy = _mm256_mul_pd( x, y );
__m256d temp = _mm256_hadd_pd( xy, xy );
__m128d hi128 = _mm256_extractf128_pd( temp, 1 );
__m128d dotproduct = _mm_add_pd( _mm256_castpd256_pd128( temp ), hi128 );
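
To pull that low element out as a plain double, for example:

double result = _mm_cvtsd_f64( dotproduct );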

Edit:

Following an idea of Norbert P., I extended this version to do 4 dot products at a time.

__m256d xy0 = _mm256_mul_pd( x[0], y[0] );
__m256d xy1 = _mm256_mul_pd( x[1], y[1] );
__m256d xy2 = _mm256_mul_pd( x[2], y[2] );
__m256d xy3 = _mm256_mul_pd( x[3], y[3] );

// low to high: xy00+xy01 xy10+xy11 xy02+xy03 xy12+xy13
__m256d temp01 = _mm256_hadd_pd( xy0, xy1 );

// low to high: xy20+xy21 xy30+xy31 xy22+xy23 xy32+xy33
__m256d temp23 = _mm256_hadd_pd( xy2, xy3 );

// low to high: xy02+xy03 xy12+xy13 xy20+xy21 xy30+xy31
__m256d swapped = _mm256_permute2f128_pd( temp01, temp23, 0x21 );

// low to high: xy00+xy01 xy10+xy11 xy22+xy23 xy32+xy33
__m256d blended = _mm256_blend_pd(temp01, temp23, 0b1100);

__m256d dotproduct = _mm256_add_pd( swapped, blended );
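
The four dot products end up in the four lanes of dotproduct, lowest to highest, so they can be written out with a single store, for example:

double results[4];
_mm256_storeu_pd( results, dotproduct );
// results[0] = dot( x[0], y[0] ), ..., results[3] = dot( x[3], y[3] )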

Why don't wider versions of VDPPD / VDPPS exist, like 512-bit?

dpps is usually only helpful in the first place if you're using SIMD "wrong" (inefficiently), for horizontal operations within vectors rather than vertical across multiple vectors. (Sometimes you do that anyway if you can get some speedup for something without re-engineering some data structure that a lot of existing code uses, or if other use-cases for the data are more important.)

Slides + text: SIMD at Insomniac Games (GDC 2015) have some good stuff about designing your data structures to be SIMD friendly (SoA vs. AoS), and how that can give much better speedups than abusing a SIMD vector to hold an x,y,z geometry vector.
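
To illustrate the SoA vs. AoS distinction those slides cover (hypothetical type names, purely for illustration):

#include <vector>

// AoS (array of structs): each object packs its own x,y,z; a SIMD register ends up
// holding mixed components, which invites horizontal operations like dpps.
struct ParticleAoS { float x, y, z; };

// SoA (struct of arrays): each component is contiguous, so 8 particles' x values
// load straight into one __m256 and all the math stays vertical.
struct ParticlesSoA
{
    std::vector<float> x, y, z;
};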



I have only been able to find 128-bit versions of the Vector Dot Product AVX/SIMD instructions

AVX1 includes vdpps ymm, ymm, ymm/m256, imm8. It does two independent 128-bit DPPS operations in the two lanes of a YMM, like for most other horizontal / shuffle instructions that got widened in less-than-useful ways in AVX and especially AVX2. Perhaps that's what you meant?
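
In intrinsics terms, the per-lane behaviour looks like this (a small illustration, not code from the question):

#include <immintrin.h>

// With imm8 = 0xF1, each 128-bit lane gets its own 4-element dot product
// in element 0 of that lane; elements 1..3 of each lane are zeroed.
inline __m256 dotPerLane( __m256 a, __m256 b )
{
    return _mm256_dp_ps( a, b, 0xF1 );
    // lane 0, element 0: a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
    // lane 1, element 0: a[4]*b[4] + a[5]*b[5] + a[6]*b[6] + a[7]*b[7]
}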

It seems like a reasonably important instruction family

Not at all. It's only useful for dot products of 2 to 4 elements, and decodes to multiple uops anyway. It can be worth using on Intel if it does exactly what you need, but AMD just microcodes it. https://uops.info/ / https://agner.org/optimize/. Even Intel mostly "gave up" on it in Ice Lake, which decodes vdpps xmm as 6 uops (14c latency), up from 4 uops (13c latency) in Skylake. Still not as bad as Zen's 8 uops (15c latency).

Geometry vectors of 3 elements in a 4-element SIMD vector get (ab)used some; that's what dpps is for. (And possibly sometimes stuff like 3x3 or 4x4 matrices, maybe even 2x2.) Two separate in-lane DPPS operations for vdpps ymm are probably more useful for getting some SIMD benefit for Array-of-Structs (AoS) data formats than a single 8-wide operation.

dpps isn't useful for dot products of larger arrays. For arrays, FMA into multiple accumulators which you only combine horizontally at the end, like in Improving performance of floating-point dot-product of an array with SIMD. See also this Q&A for some perf-tuning experiments.


A lane-crossing vdpps wouldn't be worth building any more dedicated hardware in execution units for, vs. just letting software handle arrays of 5 or more elements or micro-coding it. AVX-512 masking makes the immediate control operand of dpps less valuable. (The immediate lets you ignore some input elements, and zero vs. broadcast the result to your choice of elements, with two 4-bit bitmasks.)

That raises another point: vdpps already uses all the bits in the immediate. (The YMM version uses the same immediate control for both halves.) So there wouldn't be room to scale it up to an 8-wide operation without changing how the immediate works. (e.g. maybe always broadcast, dropping the output 0-masking, so you'd have an input control mask bit for each of the 8 floats in a YMM? But that would have required custom hardware separate from what's needed for the xmm version.)

Note that without any special instructions / uops, you can do the same thing as vdpps (ignoring the functionality of the immediate) with vmulps and then a horizontal sum. (vshufps / vaddps / vshufps / vaddps to leave the result broadcasted to each element.) Intel's hardware support brings this down from 5 to 4 uops (including handling the masking) on Skylake, 3p01 (the FP math execution units) plus 1 p5 (shuffle unit).

(Unfortunately haddps can't take advantage of any of that horizontal-FP hardware :/)
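
A sketch of that vmulps + shuffle/add sequence, ignoring the immediate's masking features and leaving the sum broadcast to every element:

#include <immintrin.h>

inline __m128 dp4_broadcast( __m128 a, __m128 b )
{
    const __m128 prod = _mm_mul_ps( a, b );                                 // p0, p1, p2, p3
    __m128 shuf = _mm_shuffle_ps( prod, prod, _MM_SHUFFLE( 2, 3, 0, 1 ) );  // p1, p0, p3, p2
    const __m128 pairs = _mm_add_ps( prod, shuf );                          // p0+p1, p0+p1, p2+p3, p2+p3
    shuf = _mm_shuffle_ps( pairs, pairs, _MM_SHUFFLE( 1, 0, 3, 2 ) );       // swap the two halves
    return _mm_add_ps( pairs, shuf );                                       // full sum in every element
}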

Ice Lake must have dropped some of this dedicated hardware since it's so rarely used; it's 6 uops there. (The 2 extra uops are for p0/p6 and p1/p5, so maybe a shuffle and maybe something turning the immediate into a mask? port 6 doesn't have any vector ALU execution units even in Ice Lake, AFAIK.)

Replicating that special-purpose hardware that dpps uses to an even wider vector width for AVX-512 would probably not have been worth the cost in transistors. 512-bits is very wide for a CPU; the FMA unit(s) already take significant area, and AVX-512 introduces a bunch of new instructions, including more lane-crossing shuffles, that are useful more of the time than a wider vdpps zmm would be on average across most code.


Efficiently compute absolute values of std::complex<float> vector with AVX2

Inspired by the answer of Dan M., I first implemented his version with some tweaks:

First, I changed it to use the wider 256-bit registers, then marked the temporary re and im arrays with __attribute__((aligned (32))) so aligned loads can be used:

#include <complex>
#include <immintrin.h>

void computeAbsolute1 (const std::complex<float>* cplxIn, float* absOut, const int length)
{
    for (int i = 0; i < length; i += 8)
    {
        float re[8] __attribute__((aligned (32))) = {cplxIn[i].real(), cplxIn[i + 1].real(), cplxIn[i + 2].real(), cplxIn[i + 3].real(), cplxIn[i + 4].real(), cplxIn[i + 5].real(), cplxIn[i + 6].real(), cplxIn[i + 7].real()};
        float im[8] __attribute__((aligned (32))) = {cplxIn[i].imag(), cplxIn[i + 1].imag(), cplxIn[i + 2].imag(), cplxIn[i + 3].imag(), cplxIn[i + 4].imag(), cplxIn[i + 5].imag(), cplxIn[i + 6].imag(), cplxIn[i + 7].imag()};
        __m256 x4 = _mm256_load_ps (re);
        __m256 y4 = _mm256_load_ps (im);
        __m256 b4 = _mm256_sqrt_ps (_mm256_add_ps (_mm256_mul_ps (x4,x4), _mm256_mul_ps (y4,y4)));
        _mm256_storeu_ps (absOut + i, b4);
    }
}

However, manually shuffling the values this way seemed like a task that could be sped up somehow. This is the solution I came up with; it runs 2 to 3 times faster in a quick test compiled by clang with full optimization:

#include <complex>
#include <immintrin.h>

void computeAbsolute2 (const std::complex<float>* __restrict cplxIn, float* __restrict absOut, const int length)
{
    for (int i = 0; i < length; i += 8)
    {
        // load 8 complex values (--> 16 floats overall) into two SIMD registers
        __m256 inLo = _mm256_loadu_ps (reinterpret_cast<const float*> (cplxIn + i ));
        __m256 inHi = _mm256_loadu_ps (reinterpret_cast<const float*> (cplxIn + i + 4));

        // separate the real and imaginary parts; the values end up in the wrong order
        __m256 re = _mm256_shuffle_ps (inLo, inHi, _MM_SHUFFLE (2, 0, 2, 0));
        __m256 im = _mm256_shuffle_ps (inLo, inHi, _MM_SHUFFLE (3, 1, 3, 1));

        // do the heavy work on the unordered vectors
        __m256 abs = _mm256_sqrt_ps (_mm256_add_ps (_mm256_mul_ps (re, re), _mm256_mul_ps (im, im)));

        // reorder values prior to storing
        __m256d ordered = _mm256_permute4x64_pd (_mm256_castps_pd(abs), _MM_SHUFFLE(3,1,2,0));
        _mm256_storeu_ps (absOut + i, _mm256_castpd_ps(ordered));
    }
}

I think I'll go with that implementation if no one comes up with a faster solution.

This compiles efficiently with gcc and clang (on the Godbolt compiler explorer).
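
A quick usage sketch (length is assumed to be a multiple of 8, matching the loop above):

#include <complex>
#include <vector>

int main()
{
    std::vector<std::complex<float>> input( 1024, { 3.0f, 4.0f } );
    std::vector<float> magnitudes( input.size() );
    computeAbsolute2( input.data(), magnitudes.data(), static_cast<int>( input.size() ) );
    // every element of magnitudes is now 5.0f
}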

Speeding up gather

I would first make sure that you are using the most recent Intel processor possible. Intel has invested a lot of engineering in improving the gather instruction.

This being said, it is not magical. If there are cache misses, you will pay a price for them.

I would try to write the same code without SIMD instructions. Is it about the same speed? If it is, then chances are good that you are limited by memory access. Vectorization is good for solving computational limitations and for loading and storing data in vector-size units, but even in principle it cannot be expected to help much with problems bound by random access and cache issues.
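
A sketch of what that comparison could look like for a plain table lookup (hypothetical helper functions, not the code from the question):

#include <immintrin.h>

// SIMD version: 8 lookups per VPGATHERDPS.
void gatherSimd( const float* table, const int* idx, float* out, int n )
{
    for( int i = 0; i < n; i += 8 )
    {
        __m256i vidx = _mm256_loadu_si256( reinterpret_cast<const __m256i*>( idx + i ) );
        _mm256_storeu_ps( out + i, _mm256_i32gather_ps( table, vidx, 4 ) );
    }
}

// Scalar version: if this runs at about the same speed, memory access is the bottleneck.
void gatherScalar( const float* table, const int* idx, float* out, int n )
{
    for( int i = 0; i < n; ++i )
        out[i] = table[ idx[i] ];
}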

Your code repeatedly calls VPGATHERDPS. According to Agner Fog, this instruction has a latency of 12 cycles and a throughput of one instruction every 4 cycles. The latency is, of course, a best-case scenario, cache misses will increase the latency.

You should benchmark your code and check whether you are close to 4 cycles per loop iteration. The main loop should complete in about 300 cycles, and that's quite fast, all things considered.
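
One rough way to measure that is the TSC (a sketch; the TSC does not tick at exactly the core clock on modern CPUs, so treat the numbers as approximate):

#include <x86intrin.h>
#include <cstdint>

// Returns average TSC ticks per call of 'work' (gcc/clang; MSVC has __rdtsc in <intrin.h>).
template<class F>
std::uint64_t ticksPerCall( F&& work, int iterations )
{
    const std::uint64_t start = __rdtsc();
    for( int i = 0; i < iterations; ++i )
        work();
    return ( __rdtsc() - start ) / iterations;
}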

You do not tell us a lot about your problem but we can guess that it is much slower than 300 cycles. If so, then you are probably having cache issues. If your table is large and you are accessing it randomly, then it is a hard problem. If you need better performance, you may need to reengineer the problem.


