How to Know If I Can Compile with Fma Instruction Sets

How to programmatically check if fused mul add (FMA) instruction are enabled on the CPU?

I used __cpuid to code my function by modifying the microsoft code. Thank you very much to all for your help.

#include <intrin.h>
#include <vector>
#include <bitset>
#include <array>

bool CheckFMA()
{
    std::array<int, 4> cpui;
    std::bitset<32> ECX;
    int nIds;
    bool fma;

    __cpuid(cpui.data(), 0);
    nIds = cpui[0];

    if (nIds < 1)
    {
        return false;
    }

    __cpuidex(cpui.data(), 1, 0);
    ECX = cpui[2];

    return ECX[12];
}

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Most compilers will automatically define:

__SSE__
__SSE2__
__SSE3__
__AVX__
__AVX2__

etc, according to whatever command line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this:

$ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE_MATH__ 1

or:

$ gcc -mavx2 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

or to just check the pre-defined macros for a default build on your particular platform:

$ gcc -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __SSE2_MATH__ 1
#define __SSE2__ 1
#define __SSE3__ 1
#define __SSE_MATH__ 1
#define __SSE__ 1
#define __SSSE3__ 1

More recent Intel processors support AVX-512, which is not a monolithic instruction set. One can see the support available from GCC (version 6.2) for two examples below.

Here is Knights Landing:

$ gcc -march=knl -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __AVX512CD__ 1
#define __AVX512ER__ 1
#define __AVX512F__ 1
#define __AVX512PF__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

Here is Skylake AVX-512:

$ gcc -march=skylake-avx512 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512F__ 1
#define __AVX512VL__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

Intel has disclosed additional AVX-512 subsets (see ISA extensions). GCC (version 7) supports compiler flags and preprocessor symbols associated with the 4FMAPS, 4VNNIW, IFMA, VBMI and VPOPCNTDQ subsets of AVX-512:

for i in 4fmaps 4vnniw ifma vbmi vpopcntdq ; do echo "==== $i ====" ; gcc -mavx512$i -dM -E - < /dev/null | egrep "AVX512" | sort ; done
==== 4fmaps ====
#define __AVX5124FMAPS__ 1
#define __AVX512F__ 1
==== 4vnniw ====
#define __AVX5124VNNIW__ 1
#define __AVX512F__ 1
==== ifma ====
#define __AVX512F__ 1
#define __AVX512IFMA__ 1
==== vbmi ====
#define __AVX512BW__ 1
#define __AVX512F__ 1
#define __AVX512VBMI__ 1
==== vpopcntdq ====
#define __AVX512F__ 1
#define __AVX512VPOPCNTDQ__ 1

Note that the SSE macros won't work with Visual C++. You have to use _M_IX86_FP instead.

How to check if compiled code uses SSE and AVX instructions?

Under Linux, you could also decompile your binary:

objdump -d YOURFILE > YOURFILE.asm

Then find all SSE instructions:

awk '/[ \t](addps|addss|andnps|andps|cmpps|cmpss|comiss|cvtpi2ps|cvtps2pi|cvtsi2ss|cvtss2s|cvttps2pi|cvttss2si|divps|divss|ldmxcsr|maxps|maxss|minps|minss|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movss|movups|mulps|mulss|orps|rcpps|rcpss|rsqrtps|rsqrtss|shufps|sqrtps|sqrtss|stmxcsr|subps|subss|ucomiss|unpckhps|unpcklps|xorps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|psadbw|pshufw)[ \t]/' YOURFILE.asm

Find only packed SSE instructions (suggested by @Peter Cordes in comments):

awk '/[ \t](addps|andnps|andps|cmpps|cvtpi2ps|cvtps2pi|cvttps2pi|divps|maxps|minps|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movntq|movups|mulps|orps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|pmulhuw|psadbw|pshufw|rcpps|rsqrtps|shufps|sqrtps|subps|unpckhps|unpcklps|xorps)[ \t]/' YOURFILE.asm

Find all SSE2 instructions (except MOVSD and CMPSD, which were first introduced in 80386):

awk '/[ \t](addpd|addsd|andnpd|andpd|cmppd|comisd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvtsd2si|cvtsd2ss|cvtsi2sd|cvtss2sd|cvttpd2dq|cvttpd2pi|cvtps2dq|cvttsd2si|divpd|divsd|maxpd|maxsd|minpd|minsd|movapd|movhpd|movlpd|movmskpd|movupd|mulpd|mulsd|orpd|shufpd|sqrtpd|sqrtsd|subpd|subsd|ucomisd|unpckhpd|unpcklpd|xorpd|movdq2q|movdqa|movdqu|movq2dq|paddq|pmuludq|pshufhw|pshuflw|pshufd|pslldq|psrldq|punpckhqdq|punpcklqdq)[ \t]/' YOURFILE.asm

Find only packed SSE2 instructions:

awk '/[ \t](addpd|andnpd|andpd|cmppd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvttpd2dq|cvttpd2pi|cvttps2dq|divpd|maxpd|minpd|movapd|movapd|movhpd|movhpd|movlpd|movlpd|movmskpd|movntdq|movntpd|movupd|movupd|mulpd|orpd|pshufd|pshufhw|pshuflw|pslldq|psrldq|punpckhqdq|shufpd|sqrtpd|subpd|unpckhpd|unpcklpd|xorpd)[ \t]/' YOURFILE.asm

Find all SSE3 instructions:

awk '/[ \t](addsubpd|addsubps|haddpd|haddps|hsubpd|hsubps|movddup|movshdup|movsldup|lddqu|fisttp)[ \t]/' YOURFILE.asm

Find all SSSE3 instructions:

awk '/[ \t](psignw|psignd|psignb|pshufb|pmulhrsw|pmaddubsw|phsubw|phsubsw|phsubd|phaddw|phaddsw|phaddd|palignr|pabsw|pabsd|pabsb)[ \t]/' YOURFILE.asm

Find all SSE4 instructions:

awk '/[ \t](mpsadbw|phminposuw|pmulld|pmuldq|dpps|dppd|blendps|blendpd|blendvps|blendvpd|pblendvb|pblenddw|pminsb|pmaxsb|pminuw|pmaxuw|pminud|pmaxud|pminsd|pmaxsd|roundps|roundss|roundpd|roundsd|insertps|pinsrb|pinsrd|pinsrq|extractps|pextrb|pextrd|pextrw|pextrq|pmovsxbw|pmovzxbw|pmovsxbd|pmovzxbd|pmovsxbq|pmovzxbq|pmovsxwd|pmovzxwd|pmovsxwq|pmovzxwq|pmovsxdq|pmovzxdq|ptest|pcmpeqq|pcmpgtq|packusdw|pcmpestri|pcmpestrm|pcmpistri|pcmpistrm|crc32|popcnt|movntdqa|extrq|insertq|movntsd|movntss|lzcnt)[ \t]/' YOURFILE.asm

Find most common AVX instructions (including scalar, including AVX2, AVX-512 family and some FMA like vfmadd132pd):

awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmosd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' YOURFILE.asm

NOTE: tested with gawk and nawk.

FMA3 in GCC: how to enable

Only answering a very small part of the question here. If you write _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0), gcc-4.9 handles it almost like inline asm and does not optimize it much. If you replace it with areg0*breg0+tmp0, a syntax that is supported by both gcc and clang, then gcc starts optimizing and may use FMA if available. I improved that for gcc-5, _mm256_add_ps for instance is now implemented as an inline function that simply uses +, so the code with intrinsics can be optimized as well.

Preventing GCC from automatically using AVX and FMA instructions when compiled with -mavx and -mfma

What you want to do is compile different object files for each instruction set you are targeting. Then create a cpu dispatcher which asks CPUID for the available instruction set and then jumps to the appropriate version of the function.

I already described this in several different questions and answers

disable-avx2-functions-on-non-haswell-processors
do-i-need-to-make-multiple-executables-for-targetting-different-instruction-set
how-to-check-with-intel-intrinsics-if-avx-extensions-is-supported-by-the-cpu
cpu-dispatcher-for-visual-studio-for-avx-and-sse
create-separate-object-files-from-the-same-source-code-and-link-to-an-executable

Illegal Instruction when compiling with -mfma

To debug illegal instruction you should first of all look at the disassembly, not backtrace or source code. In your case though, even from the source code you can easily see that the offending (illegal) instruction is vfmadd231pd, which is from the FMA instruction set extension. But SandyBridge CPUs, one of which you have, don't support this ISA extension, so by enabling it in the compiler you have shot yourself in the foot.

On Linux you can check whether your CPU supports FMA by this shell command:

grep -q '\<fma\>' /proc/cpuinfo && echo supported || echo not supported

FMA intrinsics not working: is it Hardware or Compiler?

One of the quirks of C is that the language indicates that the compiler is to assume a symbol it's not seen before must return int if you call it like a function. Since you didn't include the header that actually defines the signature for _mm_fmadd_ps, you get the strange error about converting int to __m128.

The original organization of the intrinsics headers was to have a unique header per instruction generations, so you had:

mmintrin.h     The original MMX instruction set (deprecated for x64 native)
mm3dnow.h      The AMD 3D Now! instruction set (deprecated for x64 native)
emmintrin.h    SSE (i.e. single-precision 4-wide SIMD)
xmmintrin.h    SSE2 (i.e. double-precision and integer 4-wide SIMD)

After that, they started using the code names of the processor architecture where the new instructions were introduced.

pmmintrin.h    SSE3 (the p stands for Prescott)
tmmintrin.h    Supplemental SSE3 (the t stands for Tejas)
smmintrin.h    SSE4.1 (not sure what the s is here for.
               They were added for Penryn but p
               was already used for Prescott)
nmmintrin.h    SSE4.2 (the n stands for Nehalem)
wmmintrin.h    AES (the w stands for Westmere)

These days the new instruction sets tend to end up in either ammintrin.h for AMD-originated stuff (ABM, BMI, LWP, TBM, XOP, FMA4, SSE4a, SSE5) or immintrin.h for Intel-originated stuff (AVX, FMA3, F16C, AVX2, etc.). AVX-512 is in zmmintrin.h.

The older system wasn't particularly intuitive, but neither is the new one. A number of AMD instruction subsets are defined in immintrin.h because they are the same instruction. Looking it up in the documentation or the header is really the only way to know which intrinsic is where.

For Intel this website is a good reference. Otherwise you need to see the developer guides for AMD and/or Intel.

You might find this blog series of mine useful.

Do FMA instructions of different CPUs have different intermediate accuracy? If yes, then how does a compiler equalize the floating-point behavior?

Regarding the C++ language, there are no strong requirements and it is mostly implementation-defined as stated in the previous answer pointed out by @Maxpm in the comments.

The main standard for floating-point precision is the IEEE-754. It is generally properly implemented by most vendors nowadays (at least nearly all recent mainstream x86-64 CPUs and most mainstream GPUs). It is not required by the C++ standard, but you can check this with std::numeric_limits<T>::is_iec559.

The IEEE-754 standard requires operations to be correctly computed (ie. error less than 1 ULP) using the correct rounding method. There are different rounding methods supported by the norm but the most common is the round to nearest. The standard also require some operation like FMA to be implemented with the same requirements. As a result, you cannot expect results to be computed with a precision better than 1 ULP per operation with this standard (rounding may help to reach 0.5 ULP in average or even better regarding the actual algorithm used).

In practice, computing units of IEEE-754-compliant hardware vendors use a higher precision internally so to fulfil the requirements whatever the provided input. Still, when results are stored in memory, then they need to be rounded properly the way the IEEE-754. On x86-64 processors, SIMD registers like the one of SSE, AVX and AVX-512 have a well-known fixed size. Each lane are either 16-bit (half-float), 32-bit (float) or 64-bit (double) for floating-point operations. The IEEE-754 compliant rounding should applied for each instructions. While processors could theoretically implement clever optimizations like fusing two FP instructions in one (as long as the precision is <1 ULP), AFAIK none does that yet (though fusing is done for some instructions like conditional branches).

Difference between IEEE-754 platforms can be due to either to the compiler or the configuration of FP units of the hardware vendor.

Regarding the compiler, optimizations can increase the precision while being IEEE-754 compliant. For example, the use of FMA instruction in your code is an optimization that increase the precision of the result but it is not mandatory for the compiler to do that on x86-64 platforms (in fact, not all x86-64 processors support it). Compilers may use separate multiply+add instructions for some reasons (Clang sometime does that). A compiler can pre-compute some constants using a better precision than the target processor (GCC for example operate on FP numbers with a much higher precision to generate compile-time constants). Additionally, different rounding methods could be used to compute constants.

Regarding the hardware vendor, as the default rounding mode can change from one platform to another. In your case, the very small difference may be due to that. The rounding mode could be "Round to nearest, ties to even" on one platform and "Round to nearest, ties away from zero" on the other platform, resulting in a very small, but visible, difference. You can set the rounding mode using the C code provided in this answer. Note also that denormal numbers are sometimes disabled on some platforms because of their very-high overhead (see this for more informations) though it makes the results not IEEE-754 compliant. You should check if this is the case.

Put it shortly, difference <1 ULP is totally normal between two IEEE-754-compliant platforms and it is actually quite frequent between very different platforms (eg. Clang on POWER vs GCC on x86-64).