Preventing Gcc from Automatically Using Avx and Fma Instructions When Compiled with -Mavx and -Mfma

How to check if compiled code uses SSE and AVX instructions?

Under Linux, you could also decompile your binary:

objdump -d YOURFILE > YOURFILE.asm

Then find all SSE instructions:

awk '/[ \t](addps|addss|andnps|andps|cmpps|cmpss|comiss|cvtpi2ps|cvtps2pi|cvtsi2ss|cvtss2s|cvttps2pi|cvttss2si|divps|divss|ldmxcsr|maxps|maxss|minps|minss|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movss|movups|mulps|mulss|orps|rcpps|rcpss|rsqrtps|rsqrtss|shufps|sqrtps|sqrtss|stmxcsr|subps|subss|ucomiss|unpckhps|unpcklps|xorps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|psadbw|pshufw)[ \t]/' YOURFILE.asm

Find only packed SSE instructions (suggested by @Peter Cordes in comments):

awk '/[ \t](addps|andnps|andps|cmpps|cvtpi2ps|cvtps2pi|cvttps2pi|divps|maxps|minps|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movntq|movups|mulps|orps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|pmulhuw|psadbw|pshufw|rcpps|rsqrtps|shufps|sqrtps|subps|unpckhps|unpcklps|xorps)[ \t]/' YOURFILE.asm

Find all SSE2 instructions (except MOVSD and CMPSD, which were first introduced in 80386):

awk '/[ \t](addpd|addsd|andnpd|andpd|cmppd|comisd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvtsd2si|cvtsd2ss|cvtsi2sd|cvtss2sd|cvttpd2dq|cvttpd2pi|cvtps2dq|cvttsd2si|divpd|divsd|maxpd|maxsd|minpd|minsd|movapd|movhpd|movlpd|movmskpd|movupd|mulpd|mulsd|orpd|shufpd|sqrtpd|sqrtsd|subpd|subsd|ucomisd|unpckhpd|unpcklpd|xorpd|movdq2q|movdqa|movdqu|movq2dq|paddq|pmuludq|pshufhw|pshuflw|pshufd|pslldq|psrldq|punpckhqdq|punpcklqdq)[ \t]/' YOURFILE.asm

Find only packed SSE2 instructions:

awk '/[ \t](addpd|andnpd|andpd|cmppd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvttpd2dq|cvttpd2pi|cvttps2dq|divpd|maxpd|minpd|movapd|movapd|movhpd|movhpd|movlpd|movlpd|movmskpd|movntdq|movntpd|movupd|movupd|mulpd|orpd|pshufd|pshufhw|pshuflw|pslldq|psrldq|punpckhqdq|shufpd|sqrtpd|subpd|unpckhpd|unpcklpd|xorpd)[ \t]/' YOURFILE.asm

Find all SSE3 instructions:

awk '/[ \t](addsubpd|addsubps|haddpd|haddps|hsubpd|hsubps|movddup|movshdup|movsldup|lddqu|fisttp)[ \t]/' YOURFILE.asm

Find all SSSE3 instructions:

awk '/[ \t](psignw|psignd|psignb|pshufb|pmulhrsw|pmaddubsw|phsubw|phsubsw|phsubd|phaddw|phaddsw|phaddd|palignr|pabsw|pabsd|pabsb)[ \t]/' YOURFILE.asm

Find all SSE4 instructions:

awk '/[ \t](mpsadbw|phminposuw|pmulld|pmuldq|dpps|dppd|blendps|blendpd|blendvps|blendvpd|pblendvb|pblenddw|pminsb|pmaxsb|pminuw|pmaxuw|pminud|pmaxud|pminsd|pmaxsd|roundps|roundss|roundpd|roundsd|insertps|pinsrb|pinsrd|pinsrq|extractps|pextrb|pextrd|pextrw|pextrq|pmovsxbw|pmovzxbw|pmovsxbd|pmovzxbd|pmovsxbq|pmovzxbq|pmovsxwd|pmovzxwd|pmovsxwq|pmovzxwq|pmovsxdq|pmovzxdq|ptest|pcmpeqq|pcmpgtq|packusdw|pcmpestri|pcmpestrm|pcmpistri|pcmpistrm|crc32|popcnt|movntdqa|extrq|insertq|movntsd|movntss|lzcnt)[ \t]/' YOURFILE.asm

Find most common AVX instructions (including scalar, including AVX2, AVX-512 family and some FMA like vfmadd132pd):

awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmosd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' YOURFILE.asm

NOTE: tested with gawk and nawk.

Compile-time AVX detection when using multi-versioning

As answered in the comments:

I'd recommend moving each of the CPU targets to a separate translation unit, which is compiled with the corresponding compiler flags. The common doStuffImpl function can be implemented in a header, included in each of the TUs. In that header, you can use predefined macros like __AVX__ to test for available ISA extensions. The __attribute__((target)) attributes are no longer needed and can be removed in this case.

gcc (6.1.0) using 'wrong' instructions in SSE intrinsics

If you give the -msse4.2 flag to gcc's command line during a compilation step, it will assume that it is free to use up to the SSE 4.2 instruction set for the entire translation unit. This can lead to the behavior that you described. If you need code that only uses SSE2 and below code, then using -msse2 (or no flag at all if you're building for x86_64) is required.

Some options that I can think of are:

If you can easily break down your code at the function level, then gcc's multiversioning feature can help. It requires a relatively recent version of the compiler, but it allows you to do things like this (taken from the link above):

 __attribute__ ((target ("default")))
 int foo ()
 {
   // The default version of foo.
   return 0;
 }

 __attribute__ ((target ("sse4.2")))
 int foo ()
 {
   // foo version for SSE4.2
   return 1;
 }

 __attribute__ ((target ("arch=atom")))
 int foo ()
 {
   // foo version for the Intel ATOM processor
   return 2;
 }

 __attribute__ ((target ("arch=amdfam10")))
 int foo ()
 {
   // foo version for the AMD Family 0x10 processors.
   return 3;
 }

 int main ()
 {
   int (*p)() = &foo;
   assert ((*p) () == foo ());
   return 0;
 }

In this example, gcc will automatically compile the different versions of foo() and dispatch to the appropriate one at runtime based on the CPU's capabilities.

You can break the different implementations (SSE2, SSE4.2, etc.) into different translation units, then dispatch appropriately to the right implementation at runtime.
You can put all of the SIMD code into a shared library and build the shared library multiple times with different compiler flags. Then at runtime, you can detect the CPU's capabilities and load the appropriate version of the shared library. This is the approach taken by libraries like Intel's Math Kernel Library.

Implementing matrix operation using AVX in C

b[k][j] is your problem: the elements b[k + 0..3][j] aren't contiguous in memory. Using SIMD (in a reasonable / useful way) is not something you can drop in to the classic naive matmul loop. See What Every Programmer Should Know About Memory? - there's an appendix with an example of an SSE2 matmul (with cache-blocking) which shows how to do operations in a different order that's SIMD-friendly.

Soonts's answer shows how to vectorize at all, by vectorizing over j, the middle loop. But that leaves a relatively poor memory access pattern, and 3 loads + 3 ALU operations inside the loop. (This answer started out as a comment on it, see it for the code I'm talking about and proposing changes to.)

Loop inversion should be possible to do j as the inner-most loop. That would mean doing stores for d[i][j] += ... inside the inner-most loop, but OTOH it makes more loop invariants in 2 * a[i][k] * ( b[k][j]- c[k] ) so you can usefully transform to d[i][j] += (2*a_ik) * b[k][j] - (2*a_ik*c_k), i.e. one VFMSUBPD and one VADDPD per load&store. (With the bv load folding into the FMSUB as a memory source operand, and the dv load folding into VADDPD, so hopefully only 3 uops for the front-end, including a separate store, not including loop overhead.)

The compiler will have to unroll and avoid an indexed addressing mode so the store-address uop can stay micro-fused and run on port 7 on Intel CPUs (Haswell through Skylake-family), not competing with the two loads. Ice Lake doesn't have that problem, having two full independent store-AGUs separate from the two load AGUs. But probably still needs some loop unrolling to avoid a front-end bottleneck.

Here's an example, untested (original version contributed by Soonts, thanks). It optimizes down to 2 FP math ops in the loop in a different way: simply hoisting 2*a out of the loop, doing SUB then FMA for dv += (2av)*(sub_result). But bv can't be a source operand for vsubpd because we need bv - cv. But we can fix that by negating cv to allow (-cv) + bv in the inner loop, with bv as a memory source operand. Sometimes compilers will do things like that for you, but here it seems they didn't, so I did it manually. Otherwise we get a separate vmovupd load going through the front-end.

#include <stdint.h>
#include <stdlib.h>
#include <immintrin.h>

// This double [N][N] C99 VLA syntax isn't portable to C++ even with GNU extensions
// restrict tells the compiler the output doesn't overlap with any of the inputs
void matop(size_t N, size_t K, double d[restrict N][N], const double a[restrict N][K], const double b[restrict K][N], const double c[restrict K])
{
    for( size_t i = 0; i < N; i++ ) {
        // loop-invariant pointers for this outer iteration
        //double* restrict rowDi = &d[ i ][ 0 ];
        const double* restrict rowAi = &a[ i ][ 0 ];
        for( size_t k = 0; k < K; k++ ) {
            const double* restrict rowBk = &b[ k ][ 0 ];
            double* restrict rowDi = &d[ i ][ 0 ];
#if 0  // pure scalar
            // auto-vectorizes ok; still a lot of extra checking outside outermost loop even with  restrict
            for (size_t j=0 ; j<N ; j++){
                rowDi[j] += 2*rowAi[k] * (rowBk[j] - c[k]);
            }
#else   // SIMD inner loop with cleanup

         // *** TODO: unroll over 2 or 3 i values
         // and maybe also 2 or 3 k values, to reuse each bv a few times while it's loaded.
            __m256d av = _mm256_broadcast_sd( rowAi + k );
            av = _mm256_add_pd( av, av );   // 2*a[ i ][ k ] broadcasted
            const __m256d cv = _mm256_broadcast_sd( &c[ k ] );
            const __m256d minus_ck = _mm256_xor_pd(cv, _mm256_set1_pd(-0.0));    // broadcasted -c[k]

            //const size_t N_aligned = ( (size_t)N / 4 ) * 4;
            size_t N_aligned = N & -4;    // round down to a multiple of 4 j iterations
            const double* endBk = rowBk + N_aligned;
            //for( ; j < N_aligned; j += 4 )
            for ( ; rowBk != endBk ; rowBk += 4, rowDi += 4) {  // coax GCC into using pointer-increments in the asm, instead of j+=4
                // Load the output vector to update
                __m256d dv = _mm256_loadu_pd( rowDi );
                // Update with FMA
                __m256d bv = _mm256_loadu_pd( rowBk );
                __m256d t2 = _mm256_add_pd( minus_ck, bv );   // bv - cv
                dv = _mm256_fmadd_pd( av, t2, dv );
                // Store back to the same address
                _mm256_storeu_pd( rowDi, dv );
            }
            // rowDi and rowBk point to the double after the last full vector

            // The remainder, if you can't pad your rows to a multiple of 4 and step on that padding
            for(int j=0 ; j < (N&3); j++ )
                rowDi[ j ] += _mm256_cvtsd_f64( av ) * ( rowBk[ j ] + _mm256_cvtsd_f64( minus_ck ) );
#endif
        }
    }
}

Without unrolling (https://godbolt.org/z/6WeYKbnYY), GCC11's inner loop asm looks like this, all single-uop instructions that can stay micro-fused even in the back-end on Haswell and later.

.L7:                                             # do{
        vaddpd  ymm0, ymm2, YMMWORD PTR [rax]     # -c[k] + rowBk[0..3]
        add     rax, 32                           # rowBk += 4
        add     rdx, 32                           # rowDi += 4
        vfmadd213pd     ymm0, ymm1, YMMWORD PTR [rdx-32]  # fma(2aik, Bkj-ck, Dij)
        vmovupd YMMWORD PTR [rdx-32], ymm0        # store FMA result
        cmp     rcx, rax
        jne     .L7                              # }while(p != endp)

But it's 6 total uops, 3 of them loop overhead (pointer increments and fused cmp+jne), so Haswell through Skylake could only run it at 1 iteration per 1.5 clocks, bottlenecked on the 4-wide issue stage in the front-end. (Which wouldn't let OoO exec get ahead on executing the pointer increments and loop branch, to notice early and recover while the back-end was still chewing on older loads and FP math.)

So loop unrolling should be helpful, since we managed to coax GCC into using indexed addressing modes. Without that it's relatively useless with AVX code on Intel Haswell/Skylake CPUs, with each vaddpd ymm5, ymm4, [rax + r14] decoding as 1 micro-fused uop, but unlaminating into 2 at issue into the back-end, not helping us get more work through the narrowest part of the front-end. (A lot like if we'd used a separate vmovupd load like we got with _mm256_sub_pd(bv, cv) instead of add(bv, -cv).)

The vmovupd ymmword ptr [rbp + r14], ymm5 store stays micro-fused but can't run on port 7, limiting us to a total of 2 memory operations per clock (up to 1 of which can be a store.) So a best case of 1.5 cycles per vector.

Compiled on https://godbolt.org/z/rd3rn9zor with GCC and clang -O3 -march=skylake -funroll-loops. GCC does actually use pointer increments with loads folded into 8x vaddpd and 8x vfmadd213pd. But clang uses indexed addressing modes and doesn't unroll. (You probably don't want -funroll-loops for your whole program, so either compile this separately or manually unroll. GCC's unrolling fully peels a prologue that does 0..7 vector iterations before entering the actual SIMD loop, so it's quite aggressive.)

GCC's loop-unrolling looks useful here for large N, amortizing the pointer increments and loop overhead over multiple vectors. (GCC doesn't know how to invent multiple accumulators for FP dep chains in a dot product for example, making its unrolling useless in that case, unlike clang.)

Unfortunately clang doesn't unroll the inner loop for us, but it does use vmaskmovpd in an interesting way for the cleanup.

It's maybe good that we use a separate loop counter for cleanup, in a way that lets the compiler easily prove the trip-count for the cleanup is 0..3, so it doesn't try to auto-vectorize with YMM.

The other way to do it, using an actual j variable for the inner loop and its cleanup, more like Soonts' edit. IIRC, compilers did try to auto-vectorize the cleanup for this, wasting code size and some always-false branching.

            size_t j = 0;     // used for cleanup loop after
            for( ; j < N_aligned; j += 4 )
            {
                // Load the output vector to update
                __m256d dv = _mm256_loadu_pd( rowDi + j );
                // Update with FMA
                __m256d bv = _mm256_loadu_pd( rowBk + j );
                __m256d t2 = _mm256_sub_pd( bv, cv );   // bv - cv
                dv = _mm256_fmadd_pd( av, t2, dv );
                // Store back to the same address
                _mm256_storeu_pd( rowDi + j, dv );
            }
            // The remainder, if you can't pad your rows to a multiple of 4
            for( ; j < N; j++ )
                rowDi[ j ] += _mm256_cvtsd_f64( av ) * ( rowBk[ j ] - _mm256_cvtsd_f64( cv ) );

This has a fairly good mix of load&store vs. FP math for modern CPUs (https://agner.org/optimize/ and https://uops.info/), especially Intel where we can do 2 loads and 1 store. I think Zen 2 or 3 can also do 2 loads + 1 store. It needs to hit in L1d cache to sustain that kind of throughput, though. (And even then, Intel's optimization manual says the max sustained L1d bandwidth on Skylake is less than the full 96 bytes/cycle that would require. More like mid-80s IIRC, so we can't quite expect one result vector per cycle, even with sufficient unrolling to avoid front-end bottlenecks.)

There's no latency bottleneck, since we move on to a new dv every iteration instead of accumulating anything across loop iterations.

The other advantage to this is that memory access to d[i][j] and b[k][j] would be sequential, with no other memory access in the inner-most loop. (The middle loop would do broadcast-loads of a[i][k] and c[k]. Those seem likely to cache-miss if the inner loop evicts too much; with some unrolling of the outer loop, one SIMD load and some shuffling could help, but probably cache-blocking would avoid a need for that.)

Looping over the same d[i] row repeatedly for different b[k] rows gives us locality for the part that we're modifying (i.e. use k as the middle loop, keeping i as the outer-most.) With k as the outer loop, we'd be looping K times over the whole d[0..N-1][0..N-1], probably needing to write + read each pass all that way out to whichever level of cache or memory could hold it.

But really you'd still want to cache-block if each row is really long, so you avoid the cache misses to bring all of b[][] in from DRAM N times. And avoid evicting the stuff you're going to broadcast-load next.

Smarter unrolling: a first step towards cache-blocking

Some of the above problems with maxing out load/store execution unit throughput, and requiring the compiler to use non-indexed addressing modes, can go away if we do more with each vector of data while it's loaded.

For example, instead of working on just one row of d[][], we could be working on 2, 3, or 4. Then every (rowBk[j] - c[k]) result can be used that many times (with a different 2aik) for a d[i+unroll][j + 0..vec] vector.

And we can also load a couple different (rowBk+K*0..unroll)[j+0..3], each with a corresponding minus_ck0, minus_ck1, etc. (Or keep an array of vectors; as long as it's small and the compiler has enough registers, the elements won't exist in memory.)

With multiple bv-cv and dv vectors in registers all at the same time, we can do significantly more FMAs per load without increasing the total amount of FP work. It takes more registers for constants, though, otherwise we could be defeating the purpose by forcing more reloads.

The d[i][j] += (2*a_ik) * b[k][j] - (2*a_ik*c_k) transformation wouldn't be useful here; we want to keep bv-cv separate from i so we can reuse that result as an input for different FMAs.

The b[k][j]+(-c[k]) can still benefit from micro-fusion of a load with a vaddpd so ideally it would still use a pointer increment, but the front-end might not be a bottleneck anymore.

Don't overdo it with this; too many memory input streams can be a problem for cache conflict misses especially for some N values that might create aliasing, and also for HW prefetching tracking them all. (Although Intel's L2 streamer is said to track 1 forward and 1 backward stream per 4k page, IIRC.) Probably about 4 to 8 ish streams is ok. But if d[][] isn't missing in L1d, then it's not really an input stream from memory. You don't want your b[][] input rows to be evicting the d data, though, since you'll be looping over 2 to 4 rows of d data repeatedly.

By comparison: Soonts's loop - less frequent cleanup, but worse memory access pattern.

Soonts's current loop with 3 loads and 3 ALU operations isn't ideal, although 1 load per FMA operation is already ok if they hit in cache (most modern CPUs can do 2 each per clock, although AMD Zen can also do 2 FP adds in parallel with mul/fma). If that extra ALU operation was a bottleneck, we could pre-multiply a[][] by 2 once, taking only O(N*K) work vs. O(N^2*K) to do it on the fly. But it's probably not a bottleneck and thus not worth it.

More importantly, the memory access pattern in Soonts's current answer is looping forward 1 double at a time for broadcast loads of c[k] and a[i][k] which is good, but the bv = _mm256_loadu_pd of b[k][j + 0..3] is unfortunately striding down a column.

If you're going to unroll as Soonts suggested, don't just do two dep chains for one dv, do at least two vectors, d[i][j + 0..3] and 4..7 so you use a whole 64 bytes (full cache line) from every b[k][j] you touch. Or four vectors for a pair of cache-lines. (Intel CPUs at least use an adjacent-line prefetcher, which likes to complete a 128-byte aligned pair of cache lines, so you'd benefit from aligning the rows of b[][] by 128. Or at least by 64, and get some benefit from adjacent-line prefetching.

If a vertical slice of b[][] fits in some level of cache (along with the row of d[i][] you're currently accumulating into), the next stride down the next group of columns can benefit from that prefetching and locality. If not, fully using the lines you touch is more important, so they don't have to get pulled in again later.

So with Soonts's vectorization strategy, for large problems where this won't fit in L1d cache, probably good to make sure b's rows are aligned by 64, even if that means padding at the end of each row. (The storage geometry doesn't have to match the actual matrix dimension; you pass N and row_stride separately. You use one for index calculations, the other for loop bounds.)

How to use AVX instructions to optimize ReLU written in C

There is usually no benefit to pass a pointer by reference (unless you want to modify the pointer itself). Furthermore, you can help your compiler using the (non-standard) __restrict keyword, telling it that no aliasing happens between input and output (of course, this will likely give wrong results, if e.g., Amem == Z_curr+1 -- but Amem == Z_curr should (in this case) be fine).

void relu(double* __restrict Amem, double* Z_curr, int bo)

Using that alone, clang actually is capable of vectorizing your loop using vcmpltpd and masked moves (for some reasons, only using 256bit registers).

If you simplify your expression to std::max(Z_curr[i], 0.1*Z_curr[i]) even gcc easily is capable of vectorizing it: https://godbolt.org/z/eTv4PnMWb

Generally, I would suggest compiling crucial routines of your code with different compilers and different compile options (sometimes trying -ffast-math can show you ways to simplify your expressions) and have a look at the generated code. For portability you could then re-translate the generated code into intrinsics (or leave it as is, if every compiler you care about gives good-enough results).

For completeness, here is a possible manually vectorized implementation using intrinsics:

void relu_avx512(double* __restrict Amem, double* Z_curr, int bo)
{
    int i;
    for (i=0; i<=bo-8; i+=8)
    {
        __m512d z = _mm512_loadu_pd(Z_curr+i);
        __mmask8 positive = _mm512_cmplt_pd_mask (_mm512_setzero_pd(), z);
        __m512d res = _mm512_mask_mul_pd(z, positive, z, _mm512_set1_pd(0.9));
        _mm512_storeu_pd(Amem+i, res);
    }
    // remaining elements in scalar loop
    for (; i<bo; ++i) {
        Amem[i] = 0.0 < Z_curr[i] ? Z_curr[i] : Z_curr[i]*0.1;;
    }
}

Godbolt: https://godbolt.org/z/s6br5KEEc (if you compile this with -O2 or -O3 on clang, it will heavily unroll the cleanup loop, even though it can't have more than 7 iterations. Theoretically, you could do the remaining elements with a masked or overlapping store (or maybe you have use-cases where the size is guaranteed to be a multiple of 8 and you can leave it away).

How to constexpr initialize intrinsic SSE/AVX register?

Not that it would change anything semantically or performance-wise

Correct, just use const __m128i like most code does.

I don't see any benefit to constexpr for this use-case, just pain for no gain.

Maybe if there was a way, it would let you initialize vectors in static storage (global or static) without the usual mess you get if you use _mm_set, where the compiler reserves space in .bss and runs a constructor at run-time to copy from an anonymous constant in .rodata.

(Yes, it's really that bad with gcc/clang/MSVC; godbolt. Don't use static const __m128i or at global scope. Do const __m128i foo = _mm_set_epi32() or whatever inside functions; compilers + linkers will eliminate duplicates, like with string literals. Or use plain arrays with alignas(16) and _mm_load_si128 from them inside functions if that works better.)

just curious why in the year 2022 I can't declare constexpr __m128i

You can declare constexpr __m128i, you just can't portably initialize it¹, because Intel intrinsics like _mm_set_* were defined before the year 2000 (for MMX and then SSE1), and aren't constexpr. (And later intrinsics still follow the same pattern established for SSE1.) Remember, in C / C++ terms they're actual functions that just happen to inline. (Or macros around __builtin functions to get a compile-time constant for an operand that becomes an immediate.)

Foonote 1: In C++20, GCC lets you use constexpr auto y = std::bit_cast<__m128i>(x);, as shown in https://godbolt.org/z/YGMGM69qs. Other compilers accept bit_cast<float> or whatever, but not __m128, so this may be an implementation detail of GCC. In any case, it doesn't save typing, and isn't useful for much even if it was portable to clang and MSVC.

There's little point to it because intrinsic functions like _mm_add_epi32 are also not constexpr, and you can't portably do v1 += v2; in GNU C/C++ that does compile (to a paddq).

Example with non-portable braced initializers; don't do this:

#include <immintrin.h>

__m128i foo() {
    // different meaning in GCC/clang vs. MSVC
    constexpr __m128i v = {1, 2};
    return v;
}

GCC11.2 -O3 asm output (Godbolt) - two long long halves, as per the way GCC/clang define __m128i as typdef long long __m128i __attribute__((vector_size(16),may_alias))

foo():
        movdqa  xmm0, XMMWORD PTR .LC0[rip]
        ret
.LC0:
        .quad   1
        .quad   2

MSVC 19.30 - the first two bytes of 16x int8_t - MSVC defines __m128i as a union of arrays of various element-widths, apparently with the char[16] first.

__xmm@00000000000000000000000000000201 DB 01H, 02H, 00H, 00H, 00H, 00H, 00H
        DB      00H, 00H, 00H, 00H, 00H, 00H, 00H, 00H, 00H

__m128i foo(void) PROC                   ; foo, COMDAT
        movdqa  xmm0, XMMWORD PTR __xmm@00000000000000000000000000000201
        ret     0
__m128i foo(void) ENDP                   ; foo

So you could initialize a vector to {0} and get the same result on gcc/clang as on MSVC, or I guess any {0..255}. But that's still taking advantage of implementation details on each specific compiler, not strictly using Intel's documented intrinsics API.

And MS says you should never directly access those fields of the union (the way MSVC defines __m128i).

GCC does define semantics for GNU C native vectors; GCC / clang implement the Intel intrinsics API (including __m128i) on top of their portable vector extensions which work like a struct or class with operators like + - & | * / [] and so on.

See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? re: what a __m128i object is and how it works.