How to Tell If a Linux Machine Supports AVX/AVX2 Instructions

How to tell if a Linux machine supports AVX/AVX2 instructions?

On Linux (and other Unix-like systems), information about your CPU is in /proc/cpuinfo. You can inspect that file by hand, or extract the feature flags with a grep command (grep flags /proc/cpuinfo) and look for avx and avx2 among them.

Also, most compilers will automatically define __AVX2__ when AVX2 code generation is enabled, so you can check for that too.
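
For example, a minimal compile-time guard looks like this (a sketch; what goes in each branch is up to you):

#ifdef __AVX2__
#include <immintrin.h>
/* safe to use _mm256_* AVX2 intrinsics in this branch */
#else
/* scalar or SSE fallback goes here */
#endif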

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Most compilers will automatically define:

__SSE__
__SSE2__
__SSE3__
__AVX__
__AVX2__

etc., according to whatever command-line switches you pass. You can easily check this with gcc (or a gcc-compatible compiler such as clang), like this:

$ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE_MATH__ 1

or:

$ gcc -mavx2 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

or to just check the pre-defined macros for a default build on your particular platform:

$ gcc -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __SSE2_MATH__ 1
#define __SSE2__ 1
#define __SSE3__ 1
#define __SSE_MATH__ 1
#define __SSE__ 1
#define __SSSE3__ 1

More recent Intel processors support AVX-512, which is not a monolithic instruction set. The support available from GCC (version 6.2 here) is shown for two example targets below.

Here is Knights Landing:

$ gcc -march=knl -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __AVX512CD__ 1
#define __AVX512ER__ 1
#define __AVX512F__ 1
#define __AVX512PF__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

Here is Skylake AVX-512:

$ gcc -march=skylake-avx512 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512F__ 1
#define __AVX512VL__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

Intel has disclosed additional AVX-512 subsets (see ISA extensions). GCC (version 7) supports compiler flags and preprocessor symbols associated with the 4FMAPS, 4VNNIW, IFMA, VBMI and VPOPCNTDQ subsets of AVX-512:

for i in 4fmaps 4vnniw ifma vbmi vpopcntdq ; do echo "==== $i ====" ; gcc -mavx512$i -dM -E - < /dev/null | egrep "AVX512" | sort ; done
==== 4fmaps ====
#define __AVX5124FMAPS__ 1
#define __AVX512F__ 1
==== 4vnniw ====
#define __AVX5124VNNIW__ 1
#define __AVX512F__ 1
==== ifma ====
#define __AVX512F__ 1
#define __AVX512IFMA__ 1
==== vbmi ====
#define __AVX512BW__ 1
#define __AVX512F__ 1
#define __AVX512VBMI__ 1
==== vpopcntdq ====
#define __AVX512F__ 1
#define __AVX512VPOPCNTDQ__ 1
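
In source code, these subset macros can gate subset-specific intrinsics. A minimal sketch:

#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
/* safe to use VPOPCNTDQ intrinsics such as _mm512_popcnt_epi64 here */
#endif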

Note that the SSE macros won't work with Visual C++. You have to use _M_IX86_FP instead.
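
A portable compile-time check therefore has to handle both compiler families. A sketch using only documented predefined macros:

/* GCC/Clang define __SSE2__; 32-bit MSVC reports SSE2 via _M_IX86_FP >= 2,
 * and x64 MSVC (_M_X64) always implies SSE2. MSVC does define __AVX__ and
 * __AVX2__ when building with /arch:AVX or /arch:AVX2. */
#if defined(__SSE2__) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2) || defined(_M_X64)
#define HAVE_SSE2 1
#endif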

How to verify that the operating system supports AVX2 instructions

If you want to check for support for a particular register set, you basically have two options:

  • assembly using the CPUID instruction
  • builtin functions provided by your compiler (if any)

Writing assembly that detects which register sets are supported is a tedious, long, and potentially error-prone task. Assembly is not portable across different OSes, SoCs, and ABIs, and there is the added burden that CPUID is not always implemented with the same pattern on all CPUs: different vendors, or even different CPU families from the same vendor, can expose the same bit of information in different ways. But this approach has one big advantage: it is not limited by anything. If you really need to know everything about your CPU/SoC, assembly plus CPUID is the way to go.

GCC and other compilers cover the basic cases here with builtin functions: these special functions generate the equivalent assembly for you and hand you the answer you want.

Using gcc, the check for AVX2 is as easy as calling __builtin_cpu_supports("avx2"). A complete minimal program:

#include <stdio.h>

int main(void)
{
    if (__builtin_cpu_supports("avx2"))
        printf("AVX2 is supported\n");
    else
        printf("AVX2 is not supported\n");
    return 0;
}

docs : http://gcc.gnu.org/onlinedocs/gcc/X86-Built-in-Functions.html

For Visual Studio / MSVC there are intrinsics such as __cpuid and __cpuidex that you can use to retrieve the same information; the documentation linked below includes a complete, working example.

docs : http://msdn.microsoft.com/en-us/library/hskdteyh.aspx
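
A minimal sketch of such a check with MSVC's intrinsics (the helper name is illustrative; a production check should also verify OS support via xgetbv, as discussed further below):

#ifdef _MSC_VER
#include <intrin.h>

/* AVX2 support is reported in CPUID leaf 7, sub-leaf 0, EBX bit 5. */
int has_avx2(void)
{
    int regs[4];                       /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 0);                  /* regs[0] = highest supported leaf */
    if (regs[0] < 7)
        return 0;
    __cpuidex(regs, 7, 0);             /* query leaf 7, sub-leaf 0 */
    return (regs[1] & (1 << 5)) != 0;  /* EBX bit 5 = AVX2 */
}
#endif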

How to check if compiled code uses SSE and AVX instructions?

Under Linux, you could also disassemble your binary:

objdump -d YOURFILE > YOURFILE.asm

Then find all SSE instructions:

awk '/[ \t](addps|addss|andnps|andps|cmpps|cmpss|comiss|cvtpi2ps|cvtps2pi|cvtsi2ss|cvtss2s|cvttps2pi|cvttss2si|divps|divss|ldmxcsr|maxps|maxss|minps|minss|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movss|movups|mulps|mulss|orps|rcpps|rcpss|rsqrtps|rsqrtss|shufps|sqrtps|sqrtss|stmxcsr|subps|subss|ucomiss|unpckhps|unpcklps|xorps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|psadbw|pshufw)[ \t]/' YOURFILE.asm

Find only packed SSE instructions (suggested by @Peter Cordes in comments):

awk '/[ \t](addps|andnps|andps|cmpps|cvtpi2ps|cvtps2pi|cvttps2pi|divps|maxps|minps|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movntq|movups|mulps|orps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|pmulhuw|psadbw|pshufw|rcpps|rsqrtps|shufps|sqrtps|subps|unpckhps|unpcklps|xorps)[ \t]/' YOURFILE.asm

Find all SSE2 instructions (except MOVSD and CMPSD, whose mnemonics collide with the string instructions introduced back in the 80386 and would give false positives):

awk '/[ \t](addpd|addsd|andnpd|andpd|cmppd|comisd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvtsd2si|cvtsd2ss|cvtsi2sd|cvtss2sd|cvttpd2dq|cvttpd2pi|cvtps2dq|cvttsd2si|divpd|divsd|maxpd|maxsd|minpd|minsd|movapd|movhpd|movlpd|movmskpd|movupd|mulpd|mulsd|orpd|shufpd|sqrtpd|sqrtsd|subpd|subsd|ucomisd|unpckhpd|unpcklpd|xorpd|movdq2q|movdqa|movdqu|movq2dq|paddq|pmuludq|pshufhw|pshuflw|pshufd|pslldq|psrldq|punpckhqdq|punpcklqdq)[ \t]/' YOURFILE.asm

Find only packed SSE2 instructions:

awk '/[ \t](addpd|andnpd|andpd|cmppd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvttpd2dq|cvttpd2pi|cvttps2dq|divpd|maxpd|minpd|movapd|movapd|movhpd|movhpd|movlpd|movlpd|movmskpd|movntdq|movntpd|movupd|movupd|mulpd|orpd|pshufd|pshufhw|pshuflw|pslldq|psrldq|punpckhqdq|shufpd|sqrtpd|subpd|unpckhpd|unpcklpd|xorpd)[ \t]/' YOURFILE.asm

Find all SSE3 instructions:

awk '/[ \t](addsubpd|addsubps|haddpd|haddps|hsubpd|hsubps|movddup|movshdup|movsldup|lddqu|fisttp)[ \t]/' YOURFILE.asm

Find all SSSE3 instructions:

awk '/[ \t](psignw|psignd|psignb|pshufb|pmulhrsw|pmaddubsw|phsubw|phsubsw|phsubd|phaddw|phaddsw|phaddd|palignr|pabsw|pabsd|pabsb)[ \t]/' YOURFILE.asm

Find all SSE4 instructions:

awk '/[ \t](mpsadbw|phminposuw|pmulld|pmuldq|dpps|dppd|blendps|blendpd|blendvps|blendvpd|pblendvb|pblenddw|pminsb|pmaxsb|pminuw|pmaxuw|pminud|pmaxud|pminsd|pmaxsd|roundps|roundss|roundpd|roundsd|insertps|pinsrb|pinsrd|pinsrq|extractps|pextrb|pextrd|pextrw|pextrq|pmovsxbw|pmovzxbw|pmovsxbd|pmovzxbd|pmovsxbq|pmovzxbq|pmovsxwd|pmovzxwd|pmovsxwq|pmovzxwq|pmovsxdq|pmovzxdq|ptest|pcmpeqq|pcmpgtq|packusdw|pcmpestri|pcmpestrm|pcmpistri|pcmpistrm|crc32|popcnt|movntdqa|extrq|insertq|movntsd|movntss|lzcnt)[ \t]/' YOURFILE.asm

Find the most common AVX instructions (including scalar ones, AVX2, the AVX-512 family, and some FMA instructions such as vfmadd132pd):

awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmosd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' YOURFILE.asm

NOTE: tested with gawk and nawk.

What happens when I compile on machine that supports avx2 and run the binary on another machine that only supports avx?

It's not rare to have AVX2 instructions in a binary that uses CPU detection to make sure it only runs them on CPUs that support them (e.g. via cpuid and setting function pointers, as in the sketch below).
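
A common runtime-dispatch pattern, as a sketch using GCC's builtin (the kernel functions are illustrative stand-ins):

#include <stdio.h>

static void kernel_avx2(void)    { puts("AVX2 path"); }    /* illustrative */
static void kernel_generic(void) { puts("generic path"); } /* illustrative */

static void (*kernel)(void);

int main(void)
{
    /* Choose the implementation once, at startup, based on runtime support */
    kernel = __builtin_cpu_supports("avx2") ? kernel_avx2 : kernel_generic;
    kernel();
    return 0;
}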


If the AVX2 instruction actually executed on a CPU without AVX2 support, it raises #UD, so the OS delivers SIGILL (illegal instruction) to your process, or the Windows equivalent.

There are a few cases where an instruction like lzcnt decodes as rep bsr, which runs as plain bsr on CPUs without BMI1 (giving a different answer). But VEX-coded AVX2 instructions simply fault on older CPUs.

Are the xgetbv and CPUID checks sufficient to guarantee AVX2 support?


Yes, CPUID + checking those XCR0 bits is sufficient, assuming an OS that isn't broken (and follows the expected conventions).

And assuming a virtual machine or emulator's CPUID instruction doesn't lie and tell you AVX2 is available when the instructions would actually fault. But if either of those things happens, it's the OS's or VM's fault, not your program's.

(For compat with quite old CPUs, you need to use CPUID to check whether XGETBV is even supported before using it, otherwise that will fault. A good AVX detection function will do this.

See also Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) - my answer there focuses mostly on the latter and isn't Windows specific.)


If you just checked CPUID, you'd find that the CPU supported AVX2 even if that CPU was running an old OS that didn't know about AVX, and only saved/restored XMM registers on context-switch, not YMM.

Intel designed things so the failure mode would be an illegal-instruction fault (#UD) in that case, rather than silently corrupting user-space state if multiple threads / processes used YMM or ZMM registers. (Because that would be horrible.)

(Every task is supposed to have its own private register state, including integer and FP/SIMD registers. Context switching without saving/restoring the high halves of the YMM registers would effectively be asynchronously corrupting registers, if you look at program-order execution for a single thread.)

The mechanism is that the OS has to set some bits in XCR0 (extended control-register 0), which user-space can check via xgetbv. If those bits are set, it's effectively a promise that the OS is AVX-aware and will save/restore YMM regs. And that it will set other control-register bits so SSE and AVX instructions actually work without faulting.
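
A sketch of that user-space check in C (with recent GCC/Clang the _xgetbv intrinsic needs -mxsave; MSVC also provides it; the caller must first confirm CPUID.1:ECX.OSXSAVE, bit 27, or xgetbv itself can fault):

#include <immintrin.h>

/* Check that the OS has enabled XMM (XCR0 bit 1) and YMM (XCR0 bit 2)
 * state, i.e. that it promises to save/restore those registers. */
static int os_supports_avx(void)
{
    unsigned long long xcr0 = _xgetbv(0);  /* read XCR0 */
    return (xcr0 & 0x6) == 0x6;            /* XMM and YMM state enabled */
}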

I'm not sure if these bits actually affect the CPU behaviour at all, or if they only exist as a communication mechanism for the kernel to advertise AVX support to user-space.

(YMM registers were new with AVX1, and XMM were new with SSE1. The OS doesn't need to know about SSE4.x or AVX2, just how to save the new architectural state. So AVX-512 is the next SIMD extension that needed new OS support.)
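
Extending the sketch above, an AVX-512 check additionally needs the opmask and ZMM state bits in XCR0:

/* AVX-512 needs opmask (XCR0 bit 5), ZMM0-15 upper halves (bit 6) and
 * ZMM16-31 (bit 7) enabled, on top of the XMM (bit 1) and YMM (bit 2) state. */
static int os_supports_avx512(void)
{
    unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0xE6) == 0xE6;  /* bits 1, 2, 5, 6, 7 all set */
}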

Update: I think those XCR0 bits actually do control whether AVX1/2 and AVX-512 instructions will #UD. MacOS's Darwin kernel apparently only does on-demand AVX-512 support, so the first use will fault (but then the kernel handles it and re-runs the instruction, transparently to user-space, I hope). See the source:

// darwin-xnu .../i386/fpu.c#L176
* On-demand AVX512 support
* ------------------------
* On machines with AVX512 support, by default, threads are created with
* AVX512 masked off in XCR0 and an AVX-sized savearea is used. However, AVX512
* capabilities are advertised in the commpage and via sysctl. If a thread
* opts to use AVX512 instructions, the first will result in a #UD exception.
* Faulting AVX512 instructions are recognizable by their unique prefix.
* This exception results in the thread being promoted to use an AVX512-sized
* savearea and for the AVX512 bit masks being set in its XCR0. The faulting
* instruction is re-driven and the thread can proceed to perform AVX512
* operations.
*
* In addition to AVX512 instructions causing promotion, the thread_set_state()
* primitive with an AVX512 state flavor results in promotion.
*
* AVX512 promotion of the first thread in a task causes the default xstate
* of the task to be promoted so that any subsequently created or subsequently
* DNA-faulted thread will have AVX512 xstate and it will not need to fault-in
* a promoted xstate.
*
* Two savearea zones are used: the default pool of AVX-sized (832 byte) areas
* and a second pool of larger AVX512-sized (2688 byte) areas.
*
* Note the initial state value is an AVX512 object but that the AVX initial
* value is a subset of it.
*/

So on MacOS, it seems XGETBV + checking XCR0 might not be a guaranteed way to detect usability of AVX-512 instructions! The comment says "capabilities are advertised in the commpage and via sysctl", so you need some OS-specific way.

But that's AVX-512; probably AVX1/2 is always enabled so checking XCR0 for that will work everywhere, including MacOS.
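
For completeness, such a macOS-specific check could look like this sketch (the sysctl key hw.optional.avx512f is what recent Darwin kernels expose, but treat the exact key name as an assumption worth verifying on your target):

#include <sys/sysctl.h>

static int macos_has_avx512f(void)
{
    int val = 0;
    size_t len = sizeof val;
    /* Key name assumed from Darwin's commpage/sysctl advertisement */
    if (sysctlbyname("hw.optional.avx512f", &val, &len, NULL, 0) != 0)
        return 0;  /* key absent: no AVX-512F */
    return val;
}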



Lazy context switches used to be a thing

Some OSes used to use "lazy" context switches, not actually saving/restoring the x87, XMM, and maybe even YMM registers until the new process actually used them. This was done by using a separate control-register bit that made those types of instructions fault if executed; in that fault handler, the OS would save state from the previous task on this core, and load state from the new task. Then change the control bit and return to user-space to rerun the instruction.

But these days most processes use XMM (and YMM) registers all over the place, in memcpy and other libc functions, and for copying/initializing structs. So a lazy strategy isn't worth it, and is just a lot of extra complexity, especially on an SMP system. That's why modern kernels don't do that anymore.

The control-register bits that a kernel would use to make x87, XMM, or YMM instructions fault are separate from the XCR0 bits we're checking, so even on a system using lazy context switching, your detection won't be fooled by the OS temporarily having the CPU set up so that vaddps xmm0, xmm1, xmm2 would fault.

When SSE1 was new, there was no user-space-visible bit for detecting SSE-aware OSes without using an OS-specific API, but Intel learned from that mistake for AVX. (With SSE, the failure mode is still faulting, not corruption, though. The CPU boots up with SSE instructions set to fault: How do I enable SSE for my freestanding bootable code?)

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2


What is this warning about?

Modern CPUs provide a lot of low-level instructions, besides the usual arithmetic and logic, known as extensions, e.g. SSE2, SSE4, AVX, etc. From Wikipedia:

Advanced Vector Extensions (AVX) are extensions to the x86 instruction
set architecture for microprocessors from Intel and AMD proposed by
Intel in March 2008 and first supported by Intel with the Sandy
Bridge processor shipping in Q1 2011 and later on by AMD with the
Bulldozer processor shipping in Q3 2011. AVX provides new features,
new instructions and a new coding scheme.

In particular, AVX-capable CPUs commonly also support fused multiply-add (FMA) operations (strictly a separate extension, usually advertised alongside AVX), which speed up linear algebra computations, namely dot products, matrix multiplication, convolution, etc. Almost every machine-learning training workload involves a great deal of these operations, hence it will be faster on a CPU that supports AVX and FMA (reportedly up to 300%). The warning states that your CPU does support AVX (hooray!).

I'd like to stress here: it's all about CPU only.

Why isn't it used then?

Because the default TensorFlow distribution is built without CPU extensions such as SSE4.1, SSE4.2, AVX, AVX2, FMA, etc. The default builds (the ones from pip install tensorflow) are intended to be compatible with as many CPUs as possible. Another argument is that even with these extensions a CPU is a lot slower than a GPU, and medium- to large-scale machine-learning training is expected to be performed on a GPU.

What should you do?

If you have a GPU, you shouldn't care about AVX support, because most expensive ops will be dispatched on a GPU device (unless explicitly set not to). In this case, you can simply ignore this warning by

# Just disables the warning, doesn't take advantage of AVX/FMA to run faster
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

... or by setting export TF_CPP_MIN_LOG_LEVEL=2 in your shell if you're on Unix. TensorFlow will work fine either way; you just won't see these annoying warnings.


If you don't have a GPU and want to use the CPU as efficiently as possible, you should build TensorFlow from source, optimized for your CPU, with AVX, AVX2, and FMA enabled if your CPU supports them. This has been discussed in this question and also in this GitHub issue. TensorFlow uses Google's Bazel build system; building it is not trivial, but it is certainly doable. After this, not only will the warning disappear, TensorFlow performance should also improve.


