Get sum of values stored in __m256d with SSE/AVX

It appears that you're doing a horizontal sum for every element of an output array. (Perhaps as part of a matmul?) This is usually sub-optimal; try to vectorize over the 2nd-from-inner loop so you can produce result[i + 0..3] in a vector and not need a horizontal sum at all.

For horizontal reductions in general, see Fastest way to do horizontal SSE vector sum (or other reduction): extract the high half and add to the low half. Repeat until you're down to 1 element.

If you're using this inside an inner loop, you definitely don't want to be using hadd(same,same). That costs 2 shuffle uops instead of 1, unless your compiler saves you from yourself. (And gcc/clang don't.) hadd is good for code-size but pretty much nothing else when you only have 1 vector. It can be useful and efficient with two different inputs.


For AVX, this means the only 256-bit operation we need is an extract, which is fast on AMD and Intel. Then the rest is all 128-bit:

#include <immintrin.h>

inline
double hsum_double_avx(__m256d v) {
    __m128d vlow  = _mm256_castpd256_pd128(v);
    __m128d vhigh = _mm256_extractf128_pd(v, 1); // high 128
            vlow  = _mm_add_pd(vlow, vhigh);     // reduce down to 128

    __m128d high64 = _mm_unpackhi_pd(vlow, vlow);
    return  _mm_cvtsd_f64(_mm_add_sd(vlow, high64));  // reduce to scalar
}

If you wanted the result broadcast to every element of a __m256d, you'd use vshufpd and vperm2f128 to swap high/low halves (if tuning for Intel). And use 256-bit FP add the whole time. If you cared about early Ryzen at all, you might reduce to 128, use _mm_shuffle_pd to swap, then vinsertf128 to get a 256-bit vector. Or with AVX2, vbroadcastsd on the final result of this. But that would be slower on Intel than staying 256-bit the whole time while still avoiding vhaddpd.
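For illustration, here's an untested sketch of that stay-256-bit broadcast variant, tuned for Intel as described above (the function name is mine, not from any library):

// Untested sketch: broadcast the horizontal sum to all 4 elements, staying 256-bit the whole time.
__m256d hsum_broadcast_pd_avx(__m256d v) {
    __m256d swapped = _mm256_shuffle_pd(v, v, 0x5);              // vshufpd: swap pairs within each 128-bit lane
    __m256d sums    = _mm256_add_pd(v, swapped);                 // [ d+c c+d | b+a a+b ]
    __m256d crossed = _mm256_permute2f128_pd(sums, sums, 0x01);  // swap the 128-bit halves
    return _mm256_add_pd(sums, crossed);                         // every element = a+b+c+d
}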

hsum_double_avx compiled with gcc7.3 -O3 -march=haswell on the Godbolt compiler explorer:

    vmovapd         xmm1, xmm0          # silly compiler, vextract to xmm1 instead
    vextractf128    xmm0, ymm0, 0x1
    vaddpd          xmm0, xmm1, xmm0
    vunpckhpd       xmm1, xmm0, xmm0    # no wasted code bytes on an immediate for vpermilpd or vshufpd or anything
    vaddsd          xmm0, xmm0, xmm1    # scalar means we never raise FP exceptions for results we don't use
    vzeroupper
    ret

After inlining (which you definitely want), vzeroupper sinks to the bottom of the whole function, and hopefully the vmovapd optimizes away, with vextractf128 going into a different register instead of destroying xmm0, which holds the _mm256_castpd256_pd128 result.
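As a hedged illustration, a hypothetical caller (not part of the question) that keeps the reduction vertical inside the loop and only calls the hsum once, so the above inlines cleanly:

// Hypothetical example: sum a whole array with a vector accumulator, hsumming only once at the end.
// Assumes n is a multiple of 4 and the earlier #include <immintrin.h>.
double sum_array(const double *a, size_t n) {
    __m256d acc = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(a + i));
    return hsum_double_avx(acc);   // after inlining, vzeroupper sinks to the end of this function
}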


On first-gen Ryzen (Zen 1 / 1+), according to Agner Fog's instruction tables, vextractf128 is 1 uop with 1c latency, and 0.33c throughput.

@PaulR's version is unfortunately terrible on AMD before Zen 2; it's like something you might find in an Intel library or compiler output as a "cripple AMD" function. (I don't think Paul did that on purpose, I'm just pointing out how ignoring AMD CPUs can lead to code that runs slower on them.)

On Zen 1, vperm2f128 is 8 uops, 3c latency, and one per 3c throughput. vhaddpd ymm is 8 uops (vs. the 6 you might expect), 7c latency, one per 3c throughput. Agner says it's a "mixed domain" instruction. And 256-bit ops always take at least 2 uops.

     # Paul's version                          # Ryzen      # Skylake
    vhaddpd       ymm0, ymm0, ymm0             # 8 uops     # 3 uops
    vperm2f128    ymm1, ymm0, ymm0, 49         # 8 uops     # 1 uop
    vaddpd        ymm0, ymm0, ymm1             # 2 uops     # 1 uop
                             # total uops:     #  18        #  5

vs.

     # my version with vmovapd optimized out: extract to a different reg
    vextractf128  xmm1, ymm0, 0x1              # 1 uop      # 1 uop
    vaddpd        xmm0, xmm1, xmm0             # 1 uop      # 1 uop
    vunpckhpd     xmm1, xmm0, xmm0             # 1 uop      # 1 uop
    vaddsd        xmm0, xmm0, xmm1             # 1 uop      # 1 uop
                             # total uops:     #  4         #  4

Total uop throughput is often the bottleneck in code with a mix of loads, stores, and ALU, so I expect the 4-uop version is likely to be at least a little better on Intel, as well as much better on AMD. It should also make slightly less heat, and thus allow slightly higher turbo / use less battery power. (But hopefully this hsum is a small enough part of your total loop that this is negligible!)

The latency is not worse, either, so there's really no reason to use an inefficient hadd / vperm2f128 version.


Zen 2 and later have 256-bit wide vector registers and execution units (including shuffle). They don't have to split lane-crossing shuffles into many uops, but conversely vextractf128 is no longer about as cheap as vmovdqa xmm. Zen 2 is a lot closer to Intel's cost model for 256-bit vectors.

Fastest way to do horizontal vector sum with AVX instructions

If you have two __m256d vectors x1 and x2 that each contain four doubles that you want to horizontally sum, you could do:

__m256d x1, x2;
// calculate 4 two-element horizontal sums:
// lower 64 bits contain x1[0] + x1[1]
// next 64 bits contain x2[0] + x2[1]
// next 64 bits contain x1[2] + x1[3]
// next 64 bits contain x2[2] + x2[3]
__m256d sum = _mm256_hadd_pd(x1, x2);
// extract upper 128 bits of result
__m128d sum_high = _mm256_extractf128_pd(sum, 1);
// add upper 128 bits of sum to its lower 128 bits
__m128d result = _mm_add_pd(sum_high, _mm256_castpd256_pd128(sum));
// lower 64 bits of result contain the sum of x1[0], x1[1], x1[2], x1[3]
// upper 64 bits of result contain the sum of x2[0], x2[1], x2[2], x2[3]

So it looks like 3 instructions will do 2 of the horizontal sums that you need. The above is untested, but you should get the concept.
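If it helps, here's the same fragment wrapped into an (untested) function; the name is just for illustration:

// Untested: both horizontal sums packed into one __m128d.
// result[0] = x1[0]+x1[1]+x1[2]+x1[3],  result[1] = x2[0]+x2[1]+x2[2]+x2[3]
__m128d hsum2_pd_avx(__m256d x1, __m256d x2) {
    __m256d sum      = _mm256_hadd_pd(x1, x2);
    __m128d sum_high = _mm256_extractf128_pd(sum, 1);
    return _mm_add_pd(sum_high, _mm256_castpd256_pd128(sum));
}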

Fastest way to do horizontal SSE vector sum (or other reduction)

In general for any kind of vector horizontal reduction, extract / shuffle the high half to line up with the low half, then vertical add (or min/max/or/and/xor/multiply/whatever); repeat until there's just a single element (with high garbage in the rest of the vector).

If you start with vectors wider than 128-bit, narrow in half until you get to 128 (then you can use one of the functions in this answer on that vector). But if you need the result broadcast to all elements at the end, then you can consider doing full-width shuffles all the way.
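For instance, an untested sketch of that narrowing pattern for a 512-bit double vector (assuming AVX-512F, and reusing hsum_double_avx from the first answer above):

// Untested sketch: narrow 512 -> 256, then finish with the 256-bit hsum shown earlier.
double hsum_double_avx512(__m512d v) {
    __m256d lo = _mm512_castpd512_pd256(v);
    __m256d hi = _mm512_extractf64x4_pd(v, 1);   // high 256 bits
    return hsum_double_avx(_mm256_add_pd(lo, hi));
}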

Related Q&As for wider vectors, and integers, and FP

  • __m128 and __m128d: this answer (see below)

  • __m256d with perf analysis for Ryzen 1 vs. Intel (showing why vextractf128 is vastly better than vperm2f128): Get sum of values stored in __m256d with SSE/AVX

  • __m256: How to sum __m256 horizontally?

  • Intel AVX: 256-bits version of dot product for double precision floating point variables of single vectors.

  • Dot product of arrays (not just a single vector of 3 or 4 elements): do vertical mul/add or FMA into multiple accumulators, and hsum at the end. Complete AVX+FMA array dot-product example, including an efficient hsum after the loop. (For the simple sum or other reduction of an array, use that pattern but without the multiply part, e.g. add instead of fma). Do not do the horizontal work separately for each SIMD vector; do it once at the end.

    How to count character occurrences using SIMD as an integer example of counting _mm256_cmpeq_epi8 matches, again over a whole array, only hsumming at the end. (Worth special mention for doing some 8-bit accumulation then widening 8 -> 64-bit to avoid overflow without doing a full hsum at that point.)

Integer

  • __m128i 32-bit elements: this answer (see below). 64-bit elements should be obvious: only one pshufd/paddq step.

  • __m128i 8-bit unsigned uint8_t elements without wrapping/overflow: psadbw against _mm_setzero_si128(), then hsum the two qword halves (or 4 or 8 for wider vectors). Fastest way to horizontally sum SSE unsigned byte vector shows 128-bit with SSE2.
    Summing 8-bit integers in __m512i with AVX intrinsics has an AVX512 example. How to count character occurrences using SIMD has an AVX2 __m256i example.

    (For int8_t signed bytes you can XOR set1_epi8(0x80) to flip to unsigned before SAD, then subtract the bias from the final hsum; see details here, also showing an optimization for doing only 9 bytes from memory instead of 16).

  • 16-bit unsigned: _mm_madd_epi16 with set1_epi16(1) is a single-uop widening horizontal add: SIMD: Accumulate Adjacent Pairs. Then proceed with a 32-bit hsum (see the sketch after this list).

  • __m256i and __m512i with 32-bit elements.
    Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2. For AVX512, Intel added a bunch of "reduce" inline functions (not hardware instructions) that do this for you, like _mm512_reduce_add_ps (and pd, epi32, and epi64). Also reduce_min/max/mul/and/or. Doing it manually leads to basically the same asm.

  • horizontal max (instead of add): Getting max value in a __m128i vector with SSE?
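A sketch of the 16-bit pmaddwd idea from the list above (untested; note pmaddwd treats its inputs as signed, so full-range unsigned data would need a bias fix-up):

// Untested: widen adjacent 16-bit pairs to 32-bit sums with one pmaddwd,
// then finish with a 32-bit hsum (hsum_epi32_sse2 is defined later on this page).
int hsum_epi16_sse2(__m128i v) {
    __m128i pair_sums = _mm_madd_epi16(v, _mm_set1_epi16(1));  // 8 x 16-bit -> 4 x 32-bit
    return hsum_epi32_sse2(pair_sums);
}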



Main answer to this question: mostly float and __m128

Here are some versions tuned based on Agner Fog's microarch guide and instruction tables. See also the x86 tag wiki. They should be efficient on any CPU, with no major bottlenecks. (e.g. I avoided things that would help one uarch a bit but be slow on another uarch). Code-size is also minimized.

The common SSE3 / SSSE3 2x hadd idiom is only good for code-size, not speed on any existing CPUs. There are use-cases for it (like transpose and add, see below), but a single vector isn't one of them.

I've also included an AVX version. Any kind of horizontal reduction with AVX / AVX2 should start with a vextractf128 and a "vertical" operation to reduce down to one XMM (__m128) vector. In general for wide vectors, your best bet is to narrow in half repeatedly until you're down to a 128-bit vector, regardless of element type. (Except for 8-bit integer, then vpsadbw as a first step if you want to hsum without overflow to wider elements.)

See the asm output from all this code on the Godbolt Compiler Explorer. See also my improvements to Agner Fog's C++ Vector Class Library horizontal_add functions. (message board thread, and code on github). I used CPP macros to select optimal shuffles for code-size for SSE2, SSE4, and AVX, and for avoiding movdqa when AVX isn't available.


There are tradeoffs to consider:

  • code size: smaller is better for L1 I-cache reasons, and for code fetch from disk (smaller binaries). Total binary size mostly matters for compiler decisions made repeatedly all over a program. If you're bothering to hand-code something with intrinsics, it's worth spending a few code bytes if it gives any speedup for the whole program (be careful of microbenchmarks that make unrolling look good).
  • uop-cache size: Often more precious than L1 I$. 4 single-uop instructions can take less space than 2 haddps, so this is highly relevant here.
  • latency: Sometimes relevant
  • throughput (back-end ports): usually irrelevant, horizontal sums shouldn't be the only thing in an innermost loop. Port pressure matters only as part of the whole loop that contains this.
  • throughput (total front-end fused-domain uops): If surrounding code doesn't bottleneck on the same port that the hsum uses, this is a proxy for the impact of the hsum on the throughput of the whole thing.

When a horizontal add is infrequent:

CPUs with no uop-cache might favour 2x haddps if it's very rarely used: It's slowish when it does run, but that's not often. Being only 2 instructions minimizes the impact on the surrounding code (I$ size).

CPUs with a uop-cache will probably favour something that takes fewer uops, even if it's more instructions / more x86 code-size. Total uops cache-lines used is what we want to minimize, which isn't as simple as minimizing total uops (taken branches and 32B boundaries always start a new uop cache line).

Anyway, with that said, horizontal sums come up a lot, so here's my attempt at carefully crafting some versions that compile nicely. Not benchmarked on any real hardware, or even carefully tested. There might be bugs in the shuffle constants or something.


If you're making a fallback / baseline version of your code, remember that only old CPUs will run it; newer CPUs will run your AVX version, or SSE4.1 or whatever.

Old CPUs like K8, and Core2(merom) and earlier only have 64bit shuffle units. Core2 has 128bit execution units for most instructions, but not for shuffles. (Pentium M and K8 handle all 128b vector instructions as two 64bit halves).

Shuffles like movhlps that move data in 64-bit chunks (no shuffling within 64-bit halves) are fast, too.

Related: shuffles on new CPUs, and tricks for avoiding 1/clock shuffle throughput bottleneck on Haswell and later: Do 128bit cross lane operations in AVX512 give better performance?

On old CPUs with slow shuffles:

  • movhlps (Merom: 1uop) is significantly faster than shufps (Merom: 3uops). On Pentium-M, cheaper than movaps. Also, it runs in the FP domain on Core2, avoiding the bypass delays from other shuffles.
  • unpcklpd is faster than unpcklps.
  • pshufd is slow, pshuflw/pshufhw are fast (because they only shuffle a 64bit half)
  • pshufb mm0 (MMX) is fast, pshufb xmm0 is slow.
  • haddps is very slow (6uops on Merom and Pentium M)
  • movshdup (Merom: 1uop) is interesting: It's the only 1uop insn that shuffles within 64b elements.

shufps on Core2(including Penryn) brings data into the integer domain, causing a bypass delay to get it back to the FP execution units for addps, but movhlps is entirely in the FP domain. shufpd also runs in the float domain.

movshdup runs in the integer domain, but is only one uop.

AMD K10, Intel Core2(Penryn/Wolfdale), and all later CPUs, run all xmm shuffles as a single uop. (But note the bypass delay with shufps on Penryn, avoided with movhlps)


Without AVX, avoiding wasted movaps/movdqa instructions requires careful choice of shuffles. Only a few shuffles work as a copy-and-shuffle, rather than modifying the destination. Shuffles that combine data from two inputs (like unpck* or movhlps) can be used with a tmp variable that's no longer needed instead of _mm_movehl_ps(same,same).

Some of these can be made faster (save a MOVAPS) but uglier / less "clean" by taking a dummy arg for use as a destination for an initial shuffle. For example:

// Use dummy = a recently-dead variable that vec depends on,
// so it doesn't introduce a false dependency,
// and the compiler probably still has it in a register
__m128d highhalf_pd(__m128d dummy, __m128d vec) {
#ifdef __AVX__
    // With 3-operand AVX instructions, don't create an extra dependency on something we don't need anymore.
    (void)dummy;
    return _mm_unpackhi_pd(vec, vec);
#else
    // Without AVX, we can save a MOVAPS with MOVHLPS into a dead register
    __m128 tmp = _mm_castpd_ps(dummy);
    __m128d high = _mm_castps_pd(_mm_movehl_ps(tmp, _mm_castpd_ps(vec)));
    return high;
#endif
}


SSE1 (aka SSE):

float hsum_ps_sse1(__m128 v) {                                  // v = [ D C | B A ]
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // [ C D | A B ]
    __m128 sums = _mm_add_ps(v, shuf);       // sums = [ D+C C+D | B+A A+B ]
    shuf        = _mm_movehl_ps(shuf, sums); //        [   C   D | D+C C+D ]  // let the compiler avoid a mov by reusing shuf
    sums        = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

    # gcc 5.3 -O3:  looks optimal
    movaps  xmm1, xmm0     # I think one movaps is unavoidable, unless we have a 2nd register with known-safe floats in the upper 2 elements
    shufps  xmm1, xmm0, 177
    addps   xmm0, xmm1
    movhlps xmm1, xmm0     # note the reuse of shuf, avoiding a movaps
    addss   xmm0, xmm1

    # clang 3.7.1 -O3:
    movaps  xmm1, xmm0
    shufps  xmm1, xmm1, 177
    addps   xmm1, xmm0
    movaps  xmm0, xmm1
    shufpd  xmm0, xmm0, 1
    addss   xmm0, xmm1

I reported a clang bug about pessimizing the shuffles. It has its own internal representation for shuffling, and turns that back into shuffles. gcc more often uses the instructions that directly match the intrinsic you used.

Often clang does better than gcc, in code where the instruction choice isn't hand-tuned, or constant-propagation can simplify things even when the intrinsics are optimal for the non-constant case. Overall it's a good thing that compilers work like a proper compiler for intrinsics, not just an assembler. Compilers can often generate good asm from scalar C that doesn't even try to work the way good asm would. Eventually compilers will treat intrinsics as just another C operator as input for the optimizer.



SSE3

float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf        = _mm_movehl_ps(shuf, sums); // high half -> low half
    sums        = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

    # gcc 5.3 -O3: perfectly optimal code
    movshdup    xmm1, xmm0
    addps       xmm0, xmm1
    movhlps     xmm1, xmm0
    addss       xmm0, xmm1

This has several advantages:

  • doesn't require any movaps copies to work around destructive shuffles (without AVX): movshdup xmm1, xmm2's destination is write-only, so it creates tmp out of a dead register for us. This is also why I used movehl_ps(tmp, sums) instead of movehl_ps(sums, sums).

  • small code-size. The shuffling instructions are small: movhlps is 3 bytes, movshdup is 4 bytes (same as shufps). No immediate byte is required, so with AVX, vshufps is 5 bytes but vmovhlps and vmovshdup are both 4.

I could save another byte with addps instead of addss. Since this won't be used inside inner loops, the extra energy to switch the extra transistors is probably negligible. FP exceptions from the upper 3 elements aren't a risk, because all elements hold valid FP data. However, clang/LLVM actually "understands" vector shuffles, and emits better code if it knows that only the low element matters.

Like the SSE1 version, adding the odd elements to themselves may cause FP exceptions (like overflow) that wouldn't happen otherwise, but this shouldn't be a problem. Denormals are slow, but IIRC producing a +Inf result isn't on most uarches.



SSE3 optimizing for code-size

If code-size is your major concern, two haddps (_mm_hadd_ps) instructions will do the trick (Paul R's answer). This is also the easiest to type and remember. It is not fast, though. Even Intel Skylake still decodes each haddps to 3 uops, with 6 cycle latency. So even though it saves machine-code bytes (L1 I-cache), it takes up more space in the more-valuable uop-cache. Real use-cases for haddps: a transpose-and-sum problem, or doing some scaling at an intermediate step in this SSE atoi() implementation.
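For reference, the 2x haddps version being referred to looks something like this (small, but slow for the reasons above):

// Code-size-optimized version: each haddps is 3 uops on Intel SnB-family (see above).
float hsum_ps_sse3_small(__m128 v) {
    v = _mm_hadd_ps(v, v);   // [ d+c b+a | d+c b+a ]
    v = _mm_hadd_ps(v, v);   // every element now holds a+b+c+d
    return _mm_cvtss_f32(v);
}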



AVX:

This version saves a code byte vs. Marat's answer to the AVX question.

#ifdef __AVX__
float hsum256_ps_avx(__m256 v) {
    __m128 vlow  = _mm256_castps256_ps128(v);
    __m128 vhigh = _mm256_extractf128_ps(v, 1); // high 128
           vlow  = _mm_add_ps(vlow, vhigh);     // add the low 128
    return hsum_ps_sse3(vlow);    // and inline the sse3 version, which is optimal for AVX
    // (no wasted instructions, and all of them are the 4B minimum)
}
#endif

    vmovaps      xmm1,xmm0        # huh, what the heck gcc?  Just extract to xmm1
    vextractf128 xmm0,ymm0,0x1
    vaddps       xmm0,xmm1,xmm0
    vmovshdup    xmm1,xmm0
    vaddps       xmm0,xmm1,xmm0
    vmovhlps     xmm1,xmm1,xmm0
    vaddss       xmm0,xmm0,xmm1
    vzeroupper
    ret


Double-precision:

double hsum_pd_sse2(__m128d vd) {                       // v = [ B | A ]
    __m128 undef   = _mm_undefined_ps();                      // don't worry, we only use addSD, never touching the garbage bits with an FP add
    __m128 shuftmp = _mm_movehl_ps(undef, _mm_castpd_ps(vd)); // there is no movhlpd
    __m128d shuf   = _mm_castps_pd(shuftmp);
    return _mm_cvtsd_f64(_mm_add_sd(vd, shuf));
}

    # gcc 5.3.0 -O3
    pxor     xmm1, xmm1    # hopefully when inlined, gcc could pick a register it knew wouldn't cause a false dep problem, and avoid the zeroing
    movhlps  xmm1, xmm0
    addsd    xmm0, xmm1

    # clang 3.7.1 -O3 again doesn't use movhlps:
    xorpd    xmm2, xmm2    # with  #define _mm_undefined_ps _mm_setzero_ps
    movapd   xmm1, xmm0
    unpckhpd xmm1, xmm2
    addsd    xmm1, xmm0
    movapd   xmm0, xmm1    # another clang bug: wrong choice of operand order

// gcc actually compiles this to a haddpd instead of the store + reload written here
// (see the asm below); ICC does use the store.
double hsum_pd_scalar_sse2(__m128d vd) {
    double tmp;
    _mm_storeh_pd(&tmp, vd);       // store the high half
    double lo = _mm_cvtsd_f64(vd); // cast the low half
    return lo+tmp;
}

    # gcc 5.3 -O3
    haddpd  xmm0, xmm0    # Lower latency but less throughput than storing to memory

    # ICC13
    movhpd  QWORD PTR [-8+rsp], xmm0    # only needs the store port, not the shuffle unit
    addsd   xmm0, QWORD PTR [-8+rsp]

Storing to memory and back avoids an ALU uop. That's good if shuffle port pressure, or ALU uops in general, are a bottleneck. (Note that it doesn't need to sub rsp, 8 or anything because the x86-64 SysV ABI provides a red-zone that signal handlers won't step on.)

Some people store to an array and sum all the elements, but compilers usually don't realize that the low element of the array is still there in a register from before the store.



Integer:

pshufd is a convenient copy-and-shuffle. Bit and byte shifts are unfortunately in-place, and punpckhqdq puts the high half of the destination in the low half of the result, opposite of the way movhlps can extract the high half into a different register.

Using movhlps for the first step might be good on some CPUs, but only if we have a scratch reg. pshufd is a safe choice, and fast on everything after Merom.

int hsum_epi32_sse2(__m128i x) {
#ifdef __AVX__
    __m128i hi64  = _mm_unpackhi_epi64(x, x);  // 3-operand non-destructive AVX lets us save a byte without needing a mov
#else
    __m128i hi64  = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
#endif
    __m128i sum64 = _mm_add_epi32(hi64, x);
    __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2));  // Swap the low two 32-bit elements
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);         // SSE2 movd
    //return _mm_extract_epi32(sum32, 0);    // SSE4, even though it compiles to movd instead of a literal pextrd r32,xmm,0
}

    # gcc 5.3 -O3
    pshufd  xmm1,xmm0,0x4e
    paddd   xmm0,xmm1
    pshuflw xmm1,xmm0,0x4e
    paddd   xmm0,xmm1
    movd    eax,xmm0

int hsum_epi32_ssse3_slow_smallcode(__m128i x){
    x = _mm_hadd_epi32(x, x);
    x = _mm_hadd_epi32(x, x);
    return _mm_cvtsi128_si32(x);
}
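And an untested sketch of the same idea for __m256i with AVX2 (the case mentioned in the related-links list earlier): narrow to 128 bits first, then reuse hsum_epi32_sse2.

#ifdef __AVX2__
// Untested: reduce a __m256i of 32-bit elements by adding the high 128 to the low 128.
int hsum_epi32_avx2(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);   // vextracti128
    return hsum_epi32_sse2(_mm_add_epi32(lo, hi));
}
#endif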

On some CPUs, it's safe to use FP shuffles on integer data. I didn't do this, since on modern CPUs that will at most save 1 or 2 code bytes, with no speed gains (other than code size/alignment effects).

How to find the horizontal maximum in a 256-bit AVX vector

I don't think you can do much better than 4 instructions: 2 shuffles and 2 max operations.

__m256d x = ...; // input

__m128d y  = _mm256_extractf128_pd(x, 1);               // extract x[2] and x[3]
__m128d m1 = _mm_max_pd(_mm256_castpd256_pd128(x), y);  // m1[0] = max(x[0], x[2]), m1[1] = max(x[1], x[3])
__m128d m2 = _mm_permute_pd(m1, 1);                     // set m2[0] = m1[1], m2[1] = m1[0]
__m128d m  = _mm_max_pd(m1, m2);                        // both m[0] and m[1] contain the horizontal max(x[0], x[1], x[2], x[3])

Trivial modification to only work with 256-bit vectors:

__m256d x = ...; // input

__m256d y = _mm256_permute2f128_pd(x, x, 1); // permute 128-bit values
__m256d m1 = _mm256_max_pd(x, y); // m1[0] = max(x[0], x[2]), m1[1] = max(x[1], x[3]), etc.
__m256d m2 = _mm256_permute_pd(m1, 5); // set m2[0] = m1[1], m2[1] = m1[0], etc.
__m256d m = _mm256_max_pd(m1, m2); // all m[0] ... m[3] contain the horizontal max(x[0], x[1], x[2], x[3])

(untested)
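Wrapped into a function returning a scalar, the first (128-bit-finish) variant might look like this (also untested; the name is just for illustration):

// Untested: horizontal max of a __m256d, reducing to 128 bits first.
double hmax_double_avx(__m256d x) {
    __m128d y  = _mm256_extractf128_pd(x, 1);
    __m128d m1 = _mm_max_pd(_mm256_castpd256_pd128(x), y);
    __m128d m2 = _mm_permute_pd(m1, 1);         // swap the two halves (vpermilpd, AVX)
    return _mm_cvtsd_f64(_mm_max_pd(m1, m2));
}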

AVX2: What is the best way to multiply and sum 4 complex values with 4 double values?

Here are some pieces you're going to need. Untested.

// 4 FP64 complex numbers in 2 AVX SIMD vectors
struct Complex4 { __m256d vec1, vec2; };

// Change the macro if you have FMA and are optimizing for Intel
#define USE_FMA 0

Complex4 add( Complex4 a, Complex4 b )
{
    Complex4 res;
    res.vec1 = _mm256_add_pd( a.vec1, b.vec1 );
    res.vec2 = _mm256_add_pd( a.vec2, b.vec2 );
    return res;
}

// Multiply vectors component-wise.
// This is not a complex multiplication, just 2 vmulpd instructions
Complex4 mul_cwise( Complex4 a, Complex4 b )
{
    Complex4 res;
    res.vec1 = _mm256_mul_pd( a.vec1, b.vec1 );
    res.vec2 = _mm256_mul_pd( a.vec2, b.vec2 );
    return res;
}

// Multiply + accumulate vectors component-wise
// This is not a complex multiplication.
Complex4 fma_cwise( Complex4 a, Complex4 b, Complex4 acc )
{
#if USE_FMA
    Complex4 res;
    res.vec1 = _mm256_fmadd_pd( a.vec1, b.vec1, acc.vec1 );
    res.vec2 = _mm256_fmadd_pd( a.vec2, b.vec2, acc.vec2 );
    return res;
#else
    return add( mul_cwise( a, b ), acc );
#endif
}

// Load 4 scalars from memory, and duplicate them into 2 AVX vectors
// This is for the complex * real use case
Complex4 load_duplicating( double* rsi )
{
    Complex4 res;
#ifdef __AVX2__
    __m256d tmp = _mm256_loadu_pd( rsi );
    res.vec1 = _mm256_permute4x64_pd( tmp, _MM_SHUFFLE( 1, 1, 0, 0 ) );
    res.vec2 = _mm256_permute4x64_pd( tmp, _MM_SHUFFLE( 3, 3, 2, 2 ) );
#else
    res.vec1 = _mm256_setr_m128d( _mm_loaddup_pd( rsi ), _mm_loaddup_pd( rsi + 1 ) );
    res.vec2 = _mm256_setr_m128d( _mm_loaddup_pd( rsi + 2 ), _mm_loaddup_pd( rsi + 3 ) );
#endif
    return res;
}

// Load 4 complex numbers
Complex4 load_c4( double* rsi )
{
    Complex4 res;
    res.vec1 = _mm256_loadu_pd( rsi );
    res.vec2 = _mm256_loadu_pd( rsi + 4 );
    return res;
}

// Multiply 2 complex numbers by another 2 complex numbers
__m256d mul_CxC_2( __m256d a, __m256d b )
{
    // The formula is [ a.x*b.x - a.y*b.y, a.x*b.y + a.y*b.x ]
    // Computing it as a.xy * b.xx -+ a.yx * b.yy
    // vpermilpd takes 2 immediate bits per 128-bit lane, so the immediates 5 / 15 / 0
    // (0b0101 / 0b1111 / 0b0000) apply the same shuffle to both complex numbers.

    __m256d ayx = _mm256_permute_pd( a, 5 );   // a.yx in both lanes
    __m256d byy = _mm256_permute_pd( b, 15 );  // b.yy in both lanes
    __m256d p1  = _mm256_mul_pd( ayx, byy );   // a.yx * b.yy

    __m256d bxx = _mm256_permute_pd( b, 0 );   // b.xx in both lanes

#if USE_FMA
    return _mm256_fmaddsub_pd( a, bxx, p1 );
#else
    return _mm256_addsub_pd( _mm256_mul_pd( a, bxx ), p1 );
#endif
}

A couple more notes.

Consider C++ for such code. For one, operators and overloaded functions help. Also, your whole hotspot can be written in a few lines with the Eigen library, with performance often comparable to manually written intrinsics.

Use compiler flags or GCC-specific function attributes to make sure all of these functions are always inlined.
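For example, with GCC/clang one such attribute is always_inline; FORCE_INLINE below is a name I'm introducing just for this sketch:

// GCC/clang-specific: force these small helpers to be inlined (MSVC would use __forceinline).
// Apply FORCE_INLINE to the definitions above, e.g. FORCE_INLINE Complex4 add( Complex4 a, Complex4 b ) { ... }
#define FORCE_INLINE inline __attribute__((always_inline))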


"Add the sum from #2 to *dst"

It’s never a good idea to store intermediate values for code like that. If you have 6 things to multiply/add on input, compute them all, and only store the result once. Memory access is inefficient in general, read-after-write especially so.

"such that gcc/clang will reduce to FMA if available"

Reducing things to FMA is usually a good idea on Intel, not necessarily on AMD. AMD chips have twice the throughput for a 50% additions / 50% multiplications instruction mix compared to 100% FMA. Unlike Intel, on AMD floating-point additions and multiplications don't compete for execution units. By "throughput" I mean instructions per second; marketing people decided that 1 FMA operation = 2 floating-point operations, so the FLOPS number is the same.

How to get data out of AVX registers?

Careful: _mm256_fmadd_ps isn't part of AVX1. FMA3 has its own feature bit, and was only introduced on Intel with Haswell. AMD introduced FMA3 with Piledriver (AVX1+FMA4+FMA3, no AVX2).


At the asm level, if you want to get eight 32-bit elements into integer registers, it is actually faster to store to the stack and then do scalar loads. pextrd is a 2-uop instruction on SnB-family and Bulldozer-family (and on Nehalem and Silvermont, which don't support AVX).

The only CPU where vextractf128 + 2xmovd + 6xpextrd isn't terrible is AMD Jaguar. (cheap pextrd, and only one load port.) (See Agner Fog's insn tables)

A wide aligned store can forward to overlapping narrow loads. (Of course, you can use movd to get the low element, so you have a mix of load port and ALU port uops).
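A rough untested sketch of that store-and-reload approach (the function name and plain-array destination are just for illustration):

#include <immintrin.h>
#include <stdalign.h>

// Untested: spill a __m256i to the stack with one aligned store, then read back scalars.
// The wide store can forward to the narrow reloads; movd could still grab element 0 directly.
void spill_eight_ints(__m256i v, int out[8]) {
    alignas(32) int tmp[8];
    _mm256_store_si256((__m256i*)tmp, v);
    for (int i = 0; i < 8; i++)
        out[i] = tmp[i];
}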


Of course, you seem to be extracting floats by using an integer extract and then converting it back to a float. That seems horrible.


