CPU Dispatcher for Visual Studio for AVX and SSE

Forcing AVX intrinsics to use SSE instructions instead

Use Agner Fog's Vector Class Library and add this to the command line in Visual Studio: -D__SSE4_2__ -D__XOP__.

Then use an AVX-sized vector such as Vec8f for eight floats. When you compile without AVX enabled, it will use the file vectorf256e.h, which emulates AVX with two SSE registers. For example, Vec8f inherits from Vec256fe, which starts like this:

class Vec256fe {
protected:
__m128 y0; // low half
__m128 y1; // high half

If you compile with /arch:AVX -D__XOP__ the VCL will instead use the file vectorf256.h and one AVX register. Then your code works for AVX and SSE with only a compiler switch change.

If you don't want to use XOP, don't use -D__XOP__.
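
To make the switch concrete, here is a minimal sketch (the function name and loop are just an illustration, not part of the VCL) of code written once with Vec8f: built without /arch:AVX it runs on the two-register SSE emulation from vectorf256e.h, built with /arch:AVX it uses a single YMM register.

#include "vectorclass.h"

// y[i] = a*x[i] + b, assuming n is a multiple of 8 for brevity
void scale_add(float* y, const float* x, float a, float b, int n) {
    for (int i = 0; i < n; i += 8) {
        Vec8f v;
        v.load(x + i);   // 8 floats: one YMM load, or two XMM loads when emulated
        v = a * v + b;   // same source for the native and emulated Vec8f
        v.store(y + i);
    }
}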


As Peter Cordes pointed out in his answer, if your goal is only to avoid 256-bit loads/stores then you may still want VEX-encoded instructions (though it's not clear this will make a difference except in some special cases). You can do that with the vector class like this:

Vec8f a;                  // 256-bit vector (native or emulated)
float b[8];               // destination float array
Vec4f lo = a.get_low();   // lower 128 bits
Vec4f hi = a.get_high();  // upper 128 bits
lo.store(&b[0]);          // two 128-bit stores instead of one 256-bit store
hi.store(&b[4]);

then compile with /arch:AVX -D__XOP__.

Another option would be to have one source file that uses Vecnf and then do

//foo.cpp
#include "vectorclass.h"
#if SIMDWIDTH == 4
typedef Vec4f Vecnf;
#else
typedef Vec8f Vecnf;
#endif

and compile like this

cl /O2 /DSIMDWIDTH=4                     foo.cpp /Fofoo_sse
cl /O2 /DSIMDWIDTH=4 /arch:AVX /D__XOP__ foo.cpp /Fofoo_avx128
cl /O2 /DSIMDWIDTH=8 /arch:AVX foo.cpp /Fofoo_avx256

This would create three executables from one source file. Instead of linking them into separate executables, you could compile with /c to produce object files and then make a CPU dispatcher (a sketch follows below). I used XOP with avx128 because I don't think there is a good reason to use avx128 except on AMD.
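
A minimal dispatcher sketch is below. It assumes the function was renamed per build (the names foo_sse and foo_avx256, and the signature void foo_*(float*, int), are placeholders, not something the VCL provides) and that the objects were compiled with /c and linked into one executable. The CPUID/XGETBV check confirms both that the CPU has AVX and that the OS saves the YMM state.

#include <intrin.h>   // __cpuid, _xgetbv (MSVC)

void foo_sse(float*, int);     // from foo_sse.obj
void foo_avx256(float*, int);  // from foo_avx256.obj

static bool os_supports_avx() {
    int info[4];
    __cpuid(info, 1);
    bool osxsave = (info[2] & (1 << 27)) != 0;  // OS has enabled XGETBV
    bool avx     = (info[2] & (1 << 28)) != 0;  // CPU reports AVX
    if (!osxsave || !avx) return false;
    // XCR0 bits 1 and 2: OS saves/restores XMM and YMM state
    return (_xgetbv(0) & 0x6) == 0x6;
}

void foo(float* data, int n) {
    static auto impl = os_supports_avx() ? foo_avx256 : foo_sse;
    impl(data, n);
}

An avx128 path would need the same check plus a vendor test to pick it only on AMD, which is omitted here.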

What's the proper way to use different versions of SSE intrinsics in GCC?

I think Mysticial's tip is fine, but if you really want to do it in one file, you can use the appropriate pragmas, for instance:

#pragma GCC target("sse4.1")

GCC 4.4 is needed, AFAIR.
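
If you want to limit the pragma's effect to a few functions rather than the whole translation unit, push_options/pop_options works; the function below is just a hypothetical example of something that needs SSE4.1.

#include <smmintrin.h>   // SSE4.1 intrinsics

#pragma GCC push_options
#pragma GCC target("sse4.1")
// Minimum of the eight unsigned 16-bit lanes (PHMINPOSUW, an SSE4.1 instruction)
static inline int min_u16(__m128i v) {
    return _mm_extract_epi16(_mm_minpos_epu16(v), 0);
}
#pragma GCC pop_options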

Does anyone know of a fix for an MSVC compiler bug/annoyance where SIMD Extension settings get stuck on AVX?

I figured this out (it's simple and boring). For the incremental object files, I'm compiling 3 .obj files from the same .cpp (the .cpp with the vector code). When the MSVC SIMD settings are changed in the project-level Properties, they may or may not get inherited by the .cpp file's Properties. This is where the project gets "stuck" on AVX (sometimes, not always). You just need to check the .cpp file's properties and make sure they are correct.

BTW I'm using VS 2019, /std:c++17 and the context above is the 32-bit build.

Checking if SSE is supported at runtime

GCC has a way of doing this that starts by calling __builtin_cpu_init then calling __builtin_cpu_is and __builtin_cpu_supports to check features. https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/X86-Built-in-Functions.html
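
A short sketch of those builtins in use; the two kernel function names are placeholders for your own SSE and AVX2 builds.

void kernel_avx2();   // built with -mavx2 (or a target attribute)
void kernel_sse2();   // baseline build

void kernel_dispatch() {
    __builtin_cpu_init();                   // initialize the CPU feature data first
    if (__builtin_cpu_supports("avx2"))
        kernel_avx2();
    else
        kernel_sse2();
}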

On x86, when using the C++ frontend, GCC supports "function multiversioning", which allows you to write multiple versions of the function, specify the target it should be used on, and let GCC take care of making sure it is called. https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Function-Multiversioning.html
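
A minimal multiversioning sketch (GCC, C++ front end): you write the same function more than once with different target attributes, one of them marked "default", and GCC emits a resolver that picks the right version at load time. The function itself is just an illustration.

__attribute__((target("default")))
int count_bits(const unsigned* a, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s += __builtin_popcount(a[i]);
    return s;
}

__attribute__((target("popcnt")))
int count_bits(const unsigned* a, int n) {   // chosen on CPUs with POPCNT
    int s = 0;
    for (int i = 0; i < n; ++i) s += __builtin_popcount(a[i]);
    return s;
}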

Using AVX CPU instructions: Poor performance without /arch:AVX

2021 update: Modern versions of MSVC don't need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did.


The behavior that you are seeing is the result of expensive state-switching.

See page 102 of Agner Fog's manual:

http://www.agner.org/optimize/microarchitecture.pdf

Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high (~70) cycle penalty.

When you compile without /arch:AVX, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you'll get code that has both SSE and AVX instructions - which will have those state-switching penalties. (VS2010 knows this, so it emits that warning you're seeing.)

Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVX tells the compiler to use all AVX.

It sounds like you're trying to make multiple code paths: one for SSE, and one for AVX.
For this, I suggest you separate your SSE and AVX code into two different compilation units (one compiled with /arch:AVX and one without). Then link them together and make a dispatcher that chooses based on what hardware it's running on.

If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.
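
On an old compiler like the VS2010 in this question, that manual approach looks roughly like the sketch below (the function name and array sizes are illustrative); with modern compilers the next answer applies and the vzeroupper is inserted for you.

#include <immintrin.h>

// dst, a and b are assumed to hold at least 12 floats here
void avx_then_sse(float* dst, const float* a, const float* b) {
    __m256 v = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
    _mm256_storeu_ps(dst, v);
    _mm256_zeroupper();   // clear the dirty upper halves before legacy-SSE code runs
    __m128 w = _mm_add_ps(_mm_loadu_ps(a + 8), _mm_loadu_ps(b + 8));
    _mm_storeu_ps(dst + 8, w);
}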

Do I need to use _mm256_zeroupper in 2021?

TL:DR: Don't use the _mm256_zeroupper() intrinsic manually, compilers understand SSE/AVX transition stuff and emit vzeroupper where needed for you. (Including when auto-vectorizing or expanding memcpy/memset/whatever with YMM regs.)


"Some Intel processors" being all except Xeon Phi.

Xeon Phi (KNL / KNM) don't have a state optimized for running legacy SSE instructions because they're purely designed to run AVX-512. Legacy SSE instructions probably always have false dependencies merging into the destination.

On mainstream CPUs with AVX or later, there are two different mechanisms: saving dirty uppers (SnB through Haswell, and Ice Lake) or false dependencies (Skylake). See Why is this SSE code 6 times slower without VZEROUPPER on Skylake? for the two different styles of SSE/AVX penalty.

Related Q&As about the effects of asm vzeroupper (in the compiler-generated machine code):

  • Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?
  • Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?


Intrinsics in C or C++ source

You should pretty much never use _mm256_zeroupper() in C/C++ source code. Things have settled on having the compiler insert a vzeroupper instruction automatically where it might be needed, which is pretty much the only sensible way for compilers to be able to optimize functions containing intrinsics and still reliably avoid transition penalties (especially when considering inlining). All the major compilers can auto-vectorize and/or inline memcpy/memset/array init with YMM registers, so they need to keep track of using vzeroupper after that.

The convention is to have the CPU in clean-uppers state when calling or returning, except when calling functions that take __m256 / __m256i/d args by value (in registers or at all), or when returning such a value. The target function (callee or caller) inherently must be AVX-aware and expecting a dirty-upper state because a full YMM register is in-use as part of the calling convention.

x86-64 System V passes vectors in vector regs. Windows vectorcall does, too, but the original Windows x64 convention (now named "fastcall" to distinguish from "vectorcall") passes vectors by value in memory via hidden pointer. (This optimizes for variadic functions by making every arg always fit in an 8-byte slot.) IDK how compilers compiling Windows non-vectorcall calls handle this, whether they assume the function probably looks at its args or at least is still responsible for using a vzeroupper at some point even if it doesn't. Probably yes, but if you're writing your own code-gen back-end, or hand-written asm, have a look at what some compilers you care about actually do if this case is relevant for you.

Some compilers optimize by also omitting vzeroupper before returning from a function that took vector args, because clearly the caller is AVX-aware. And crucially, apparently compilers shouldn't expect that calling a function like void foo(__m256i) will leave the CPU in clean-upper state, so the callee does still need a vzeroupper after such a function, before call printf or whatever.



Compilers have options to control vzeroupper usage

For example, GCC -mno-vzeroupper / clang -mllvm -x86-use-vzeroupper=0. (The default is -mvzeroupper, which does the behaviour described above, inserting it where it might be needed.)

-mno-vzeroupper is implied by -march=knl (Knights Landing) because vzeroupper isn't needed and is very slow on Xeon Phi CPUs (thus should actively be avoided).

Or you might possibly want it if you build libc (and any other libraries you use) with -mavx -mno-vzeroupper. glibc has some hand-written asm for functions like strlen, but most of those have AVX2 versions. So as long as you're not on an AVX1-only CPU, legacy-SSE versions of string functions might not get used at all.

For MSVC, you should definitely prefer using /arch:AVX when compiling code that uses AVX intrinsics. I think some versions of MSVC could generate code that caused transition penalties if you mixed __m128 and __m256 without /arch:AVX. But beware that this option will make even 128-bit intrinsics like _mm_add_ps use the AVX encoding (vaddps) instead of legacy SSE (addps), and will let the compiler auto-vectorize with AVX. There is an undocumented switch /d2vzeroupper to enable automatic vzeroupper generation (the default); /d2vzeroupper- disables it - see What is the /d2vzeroupper MSVC compiler optimization flag doing?
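
For example, the second line below would suppress the automatic vzeroupper insertion that the first (default) build performs; since the /d2 switches are undocumented, treat this as illustrative:

cl /O2 /arch:AVX foo.cpp
cl /O2 /arch:AVX /d2vzeroupper- foo.cpp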


