How to Monitor the Amount of SIMD Instruction Usage

How do I monitor the amount of SIMD instruction usage?

I think the only reliable way to count all SIMD instructions (not just FP math) is dynamic instrumentation (e.g. via something like Intel PIN / SDE).

See How to characterize a workload by obtaining the instruction type breakdown? and How do I determine the number of x86 machine instructions executed in a C program? Specifically, sde64 -mix -- ./my_program prints the instruction mix for your program for that run; there's example output there for libsvm compiled with AVX vs. no AVX.
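For example, a session might look like this (a sketch: it assumes Intel SDE is installed with sde64 on your PATH, and that your SDE version writes the default report file name):

sde64 -mix -- ./my_program    # writes the instruction-mix report (sde-mix-out.txt by default)
less sde-mix-out.txt          # per-thread, per-function, and global breakdowns by category / ISA extension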

I don't think there's a good way to make this work like top / htop, if it's even possible to safely attach to already-running processes, especially multi-threaded ones.

It might also be possible to get dynamic instruction counts using last-branch-record stuff to record / reconstruct the path of execution and count everything, but I don't know of tools for that. In theory that could attach to already-running programs without much danger, but it would take a lot of computation (disassembling and counting instructions) to do it on the fly for all running processes. Not like just asking the kernel for CPU usage stats that it tracks anyway on context switches.

You'd need hardware instruction-counting support for this to be really efficient the way top is.


For SIMD floating point math specifically (not FP shuffles, just real FP math like vaddps), there are perf counter events.

e.g. from perf list output:

fp_arith_inst_retired.128b_packed_single
[Number of SSE/AVX computational 128-bit packed single precision
floating-point instructions retired. Each count represents 4
computations. Applies to SSE* and AVX* packed single precision
floating-point instructions: ADD SUB MUL DIV MIN MAX RCP RSQRT SQRT
DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as
they perform multiple calculations per element]

So it's not even counting uops, it's counting FLOPs. There are other events for ...pd packed double, and 256-bit versions of each. (I assume on CPUs with AVX512, there are also 512-bit vector versions of these events.)

You can use perf to count their execution globally across processes and on all cores, or for a single process:

# count FP math instructions only, not SIMD integer, load/store, or anything else
# (events spelled out in full: shell brace expansion would separate them with
#  spaces, not commas, breaking the single -e argument)
perf stat -e cycles:u,instructions:u,fp_arith_inst_retired.128b_packed_double:u,fp_arith_inst_retired.128b_packed_single:u,fp_arith_inst_retired.256b_packed_double:u,fp_arith_inst_retired.256b_packed_single:u ./my_program

(Intentionally omitting fp_arith_inst_retired.scalar_{double,single} because you only asked about SIMD, and scalar instructions on XMM registers don't count, IMO.)
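If you want one FLOP total instead of raw event counts, you can weight each event by its lane count, per the event descriptions above (FM(N)ADD/SUB already counts twice in these events). A sketch of mine using perf's CSV mode, where -x, puts the raw count in the first field:

perf stat -x, -e fp_arith_inst_retired.128b_packed_single:u,fp_arith_inst_retired.128b_packed_double:u,fp_arith_inst_retired.256b_packed_single:u,fp_arith_inst_retired.256b_packed_double:u ./my_program 2>&1 |
  awk -F, '/128b_packed_single/{f+=$1*4} /128b_packed_double/{f+=$1*2}
           /256b_packed_single/{f+=$1*8} /256b_packed_double/{f+=$1*4}
           END{printf "total SIMD FLOPs: %.0f\n", f}'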

(You can attach perf to a running process by using -p PID instead of a command. Or use perf top, as suggested in Ubuntu - how to tell if AVX or SSE is currently being used by CPU app?)

You can run perf stat -a to monitor globally across all cores, regardless of what process is executing. But again, this only counts FP math, not SIMD in general.
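For example, a quick system-wide sample (a sketch; sleep 10 is just a dummy command to set the measurement interval, and the exact event names depend on your CPU, so check perf list):

perf stat -a -e fp_arith_inst_retired.256b_packed_single:u,fp_arith_inst_retired.256b_packed_double:u sleep 10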

Still, it is hardware-supported and thus could be cheap enough for something like htop to use without wasting a lot of CPU time if you leave it running long-term.

Does using SIMD load the main CPU registers?

This looks like a question about out-of-order execution. Modern x64 CPUs have a number of execution ports, and each can dispatch a new instruction per clock cycle (so about 8 CPU ops can run in parallel on an Intel Skylake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.

So for example, you may be able to dispatch 2 AVX float mults, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general purpose registers in a single cycle [you will still have to wait for each operation to complete - the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should be able to keep each of those ports busy (or at least, that's the basic aim!).

Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self-evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule).

Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [a slight terminology error on your part here: the ALU is simply the bit of the CPU that does arithmetic. The computation for general purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general purpose parts of the CPU power down when SIMD units are in use? To which the answer is no...]. Consider this AVX optimised method (which does nothing interesting!)

#include <immintrin.h>
typedef __m256 float8;
#define mul8f _mm256_mul_ps

void computeThing(float8 a[], float8 b[], float8 c[], int count)
{
    for(int i = 0; i < count; ++i)
    {
        a[i] = mul8f(a[i], b[i]);
        b[i] = mul8f(b[i], c[i]);
    }
}

Since there are no dependencies between a, b, and c (which I should really be explicit about by specifying __restrict), the two SIMD multiply instructions can both be dispatched in a single clock cycle (there are two execution ports that can handle floating-point multiply).

The General Purpose ALU doesn't suddenly power down here - the general purpose registers & instructions are still being used:
1. to compute memory addresses (for: a[i], b[i], c[i])
2. to load/store into those memory locations
3. to increment the loop counter
4. to test whether the count has been reached

It just so happens that we are also making use of the SIMD units to do a couple of multiplications...

Simple Rule 3: For floating point operations, using 'float' or '__m256' makes next to no difference. The CPU hardware used to compute float and float8 types is exactly the same. There are simply a couple of bits in the machine-code encoding that specify the choice between float/__m128/__m256.

See https://godbolt.org/z/xTcLrf
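To make Rule 3 concrete, here's a small sketch (mine, not from the original answer). The commented asm is roughly what gcc or clang emit at -O2 with AVX enabled, under the x86-64 SysV calling convention:

#include <immintrin.h>

float  mul1(float a, float b)   { return a * b; }               // vmulss xmm0, xmm0, xmm1
__m128 mul4(__m128 a, __m128 b) { return _mm_mul_ps(a, b); }    // vmulps xmm0, xmm0, xmm1
__m256 mul8(__m256 a, __m256 b) { return _mm256_mul_ps(a, b); } // vmulps ymm0, ymm0, ymm1

Same operation and the same multiplier hardware; only the encoding (and the number of lanes) differs.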

Population count in AVX512

_mm512_popcnt_epi64 is part of AVX512-VPOPCNTDQ. The 256 and 128-bit versions also require AVX512VL to use AVX512 instructions with 128 or 256-bit vectors.

Mainstream AVX512 CPUs all have AVX512-VL. Xeon Phi CPUs don't have AVX512-VL.

(_mm512_popcnt_epi8 and epi16 are also new in Ice Lake, as part of AVX512-BITALG)

Perhaps you forgot to enable the necessary compiler options (like GCC -march=native to enable everything the machine you're compiling on can do), or you're compiling for a target that doesn't have both features. If so, the compiler won't have a definition for _mm256_popcnt_epi64 as an intrinsic, so in C it will assume it's an undeclared function and emit a call to it (which will of course not be found at link time), and/or it will warn or error (C or C++) about a missing prototype.

Very few CPUs currently have AVX512-VPOPCNTDQ (wikipedia AVX512 feature vs. CPU matrix):

  • Knights Mill (final-generation Xeon Phi): only AVX512-VPOPCNTDQ, no AVX512VL and no BITALG. So only the __m512i versions are available for gcc -O3 -march=knm. You should definitely be using 512-bit vectors on Xeon Phi anyway, unless your data layout works perfectly for 256-bit and would take extra shuffling for 512-bit. But beware that it's slow for some AVX / AVX2 instructions that it doesn't have 512-bit versions of, like shuffles with elements smaller than 32 bits (no AVX512BW).

  • Ice Lake / Tiger Lake: have AVX512-VPOPCNTDQ, BITALG, and AVX512VL, so _mm256_popcnt_epi64 and epi8 are supported when compiling for this target microarchitecture, e.g. gcc -O3 -march=icelake-client (assuming your compiler's headers are correct; a usage sketch follows this list).

    GCC8.3 and earlier have a bug where -march=icelake-client / icelake-server doesn't enable -mavx512vpopcntdq. (GCC7 doesn't know about -march=icelake-client.) It's fixed in GCC8.4, so either upgrade to the latest GCC8, or better, upgrade to the latest stable GCC; a couple more years of development usually help GCC make better code with new ISA extensions like AVX-512, especially with mask registers. Or just manually use -march=icelake-client -mavx512vpopcntdq; that does work: https://godbolt.org/z/a7bhcjdhr
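Where VL and VPOPCNTDQ are both available, usage is straightforward. A minimal sketch of mine (not from the question): total popcount over an array of 64-bit words, assuming n is a multiple of 8; compile with e.g. gcc -O3 -march=icelake-client:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

uint64_t popcnt_avx512(const uint64_t *data, size_t n)
{
    __m512i totals = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 8) {
        __m512i v = _mm512_loadu_si512(data + i);                  // 64-byte unaligned load
        totals = _mm512_add_epi64(totals, _mm512_popcnt_epi64(v)); // per-qword popcnt
    }
    return _mm512_reduce_add_epi64(totals);                        // horizontal sum of 8 qwords
}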


Choosing between 256 vs. 512-bit vectors on Ice Lake is a tradeoff like on Skylake-X: while 512-bit vector uops are in flight, the vector ALUs on port 1 don't get used, and max turbo clock speed may be lowered (see SIMD instructions lowering CPU frequency). So if you don't get much speedup from wider vectors (e.g. because of a memory bottleneck, or because your SIMD loops are only a tiny part of a larger program), using 512-bit vectors in one loop can hurt overall performance.

But note that Ice Lake client CPUs aren't affected much, and I'm not sure vpopcnt instructions even count as "heavy", so they may not reduce max turbo much, if at all, on client CPUs. Most integer SIMD instructions don't count as heavy. See discussion on LLVM [X86] Prefer 512-bit vectors on Ice/Rocket/TigerLake (PR48336). The vector ALU part of port 1 still shuts down while 512-bit uops are in flight, though.


Other CPUs don't have hardware SIMD popcnt support at all, and no form of _mm512_popcnt_epi64 is available.

Even if you only have AVX2, not AVX512 at all, SIMD popcnt is a win vs. scalar popcnt over non-tiny arrays on modern CPUs with fast vpshufb (_mm256_shuffle_epi8). https://github.com/WojciechMula/sse-popcount/ has AVX2 versions, and AVX512 versions that use vpternlogd for Harley-Seal accumulation to reduce the number of SIMD LUT lookups when popcounting.

Also on Stack Overflow, Counting 1 bits (population count) on large data using AVX-512 or AVX-2 shows some code copied from that repo a couple of years ago.

If you need counts for each element separately, use the standard vpshufb nibble-LUT unpack, and vpsadbw against a zero vector to hsum the byte counts into 64-bit qword chunks.

If you need positional popcount (separate sum for each bit-position), see https://github.com/mklarqvist/positional-popcount.

Why does SIMD have single data instructions when it's called SIMD?

It isn't a SIMD instruction in that sense

vaddss is a scalar FP math instruction that operates on data in the FP/SIMD registers (XMM0..15). It exists because x87 is not a very convenient compiler target with its stack-based registers that often need fxch, and other quirks. Intel added a new way to do scalar FP math along with SSE1 (float) and SSE2 (double), which is fortunately baseline for x86-64 so everyone can just use it.

People who call that a SIMD instruction are talking about one of:

  • Which registers it operates on. (XMM0 is 16 bytes wide and clearly a SIMD register, even when you only care about the low element holding a scalar value.)
  • The fact that it's an AVX instruction, so it was introduced with an ISA extension that was primarily aimed at SIMD usage, and thus is called a SIMD extension or instruction set.
  • Which also means it uses the MXCSR for rounding mode and FP exception recording / unmasking, and the kinds of exceptions it can take are the same as other SSE/AVX instructions which Intel documents as "SIMD Floating-Point Exceptions" as concise terminology to distinguish it from legacy x87.
  • Or they're talking about the use-case of doing something to just the low element when the high elements have actual data. (Quite rare, but something you could do. Maybe more likely with sd scalar double, where the low double is one half of an XMM register.)

Or they're just plain wrong if they actually mean it in terms of Flynn's taxonomy of SISD vs. SIMD vs. MIMD etc. I highly doubt anyone would actually mean that, though. The ss and sd scalar FP math instructions are SISD, single-instruction single-data. And BTW, they only exist for FP math; x86 already has instructions like add eax, ecx for scalar integer math, and doesn't have scalar versions of paddb or even xorps.


One reason for having separate scalar FP math instructions is that using addps would also operate on whatever garbage might be in the high elements of XMM registers. This can raise extra FP exceptions (usually masked, so only recorded in MXCSR (fenv.h), but if unmasked would trap to the OS.)

With the upper elements all 0.0 (which isn't required by the calling convention, BTW), addps wouldn't raise any extra exceptions, but divps would divide by zero.

With non-zero garbage like small integers, it might be a bit-pattern for a subnormal float, or a result might be subnormal, causing huge slowdowns (a factor of ~100) as the CPU takes a microcode assist to handle subnormal input or output in many cases (or in all cases of subnormals back when SSE1 was new in Pentium III). Unless you set FTZ and DAZ (flush to zero, denormals are zero) like gcc -ffast-math does.
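For reference, a sketch of setting those MXCSR bits yourself with the standard intrinsics (the same bits that -ffast-math's startup code sets):

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

int main(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // subnormal results flush to 0.0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // subnormal inputs read as 0.0
    // ... FP work here runs without subnormal microcode assists ...
    return 0;
}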

For instructions like xorps or paddq which don't do actual FP math, no FP exceptions or microcode assists are possible. You can just use them even if you only care about the low 32 or 64 bits of an XMM.

MMX or SSE2 had occasional uses in 32-bit code for doing scalar 64-bit integer math, with zeros or garbage in the upper bytes. MMX paddq mm0, mm1 is a SISD instruction, but SSE2 paddq xmm0, xmm1 is a SIMD instruction.

SSE1 was new in Pentium III, where the SIMD execution units and registers were only 64 bits wide. addps decoded to 2 uops; addss decoded to 1. So there was a performance motivation, too, even in the best case.

This is also likely the reason for Intel's unfortunate design where sqrtss and cvtsi2ss and others merge into the destination, requiring either spending extra front-end bandwidth on xor-zeroing, or risking false dependencies: Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? It was a short-sighted design decision to make them single-uop on Pentium III, which Intel unfortunately followed in SSE2 for double precision, and stuck with for AVX and AVX-512 when they had a chance to introduce better versions with different semantics. At least the AVX versions take a 2nd source register to merge with, so you can pick a "cold" reg as a workaround; see my answer on the linked duplicate.
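To make the merge semantics concrete in intrinsics terms, a sketch of mine (this is the xor-zeroing approach; compilers typically emit the xorps for you anyway):

#include <immintrin.h>

// int -> float in an XMM register. _mm_cvtsi32_ss merges into its first
// operand, so feed it a freshly zeroed register (xor-zeroing is a
// dependency-breaking idiom) rather than whatever register the compiler
// happens to reuse, avoiding a false dependency on its old value.
static inline __m128 int_to_ss(int x)
{
    return _mm_cvtsi32_ss(_mm_setzero_ps(), x);
}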



It's normal for scalar FP to share registers with SIMD

It isn't necessary or useful to have yet another set of registers for scalar FP, and sharing with the x87 FPU or the general-purpose integer registers would each be worse for separate reasons.

It's totally normal on other ISAs for the SIMD registers to overlap or be the same as the scalar FP registers; Some ISAs (like ARM) that didn't have weirdo designs like x87 didn't need new architectural state to introduce SIMD. e.g. ARM's NEON q0..q15 16-byte registers map to pairs of d0..d31 double-precision FP registers that existed with VFPv3.

(I'm not sure if the partial-register aliasing was actually common in SIMD extensions for other ISAs, though. Probably some introduced new architectural state, or just used FP double-precision registers as 64-bit integer SIMD instead of 128-bit.)

In an OS kernel you often talk about saving "FPU state" on context switch (as opposed to just the general-purpose integer registers), and these days that's short-hand for FPU and SIMD state. e.g. in the Linux kernel, you need to use kernel_fpu_begin() before running instructions that use XMM/YMM/ZMM registers. (e.g. in the RAID5 / RAID6 drivers).
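A sketch of what that looks like in kernel code, assuming a recent kernel where the API lives in asm/fpu/api.h (this is kernel-mode code, not something you call from userspace):

#include <asm/fpu/api.h>

static void do_simd_work(void)
{
    kernel_fpu_begin();  /* makes it safe to touch XMM/YMM/ZMM registers */
    /* ... SSE/AVX instructions here ... */
    kernel_fpu_end();    /* after this, SIMD register use is forbidden again */
}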

How to check if compiled code uses SSE and AVX instructions?

Under Linux, you could also disassemble your binary:

objdump -d YOURFILE > YOURFILE.asm

Then find all SSE instructions:

awk '/[ \t](addps|addss|andnps|andps|cmpps|cmpss|comiss|cvtpi2ps|cvtps2pi|cvtsi2ss|cvtss2si|cvttps2pi|cvttss2si|divps|divss|ldmxcsr|maxps|maxss|minps|minss|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movss|movups|mulps|mulss|orps|rcpps|rcpss|rsqrtps|rsqrtss|shufps|sqrtps|sqrtss|stmxcsr|subps|subss|ucomiss|unpckhps|unpcklps|xorps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|psadbw|pshufw)[ \t]/' YOURFILE.asm

Find only packed SSE instructions (suggested by @Peter Cordes in comments):

awk '/[ \t](addps|andnps|andps|cmpps|cvtpi2ps|cvtps2pi|cvttps2pi|divps|maxps|minps|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movntq|movups|mulps|orps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|pmulhuw|psadbw|pshufw|rcpps|rsqrtps|shufps|sqrtps|subps|unpckhps|unpcklps|xorps)[ \t]/' YOURFILE.asm

Find all SSE2 instructions (except MOVSD and CMPSD, whose mnemonics collide with the string instructions that date back to the 80386, so matching them would give false positives):

awk '/[ \t](addpd|addsd|andnpd|andpd|cmppd|comisd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvtsd2si|cvtsd2ss|cvtsi2sd|cvtss2sd|cvttpd2dq|cvttpd2pi|cvttps2dq|cvttsd2si|divpd|divsd|maxpd|maxsd|minpd|minsd|movapd|movhpd|movlpd|movmskpd|movupd|mulpd|mulsd|orpd|shufpd|sqrtpd|sqrtsd|subpd|subsd|ucomisd|unpckhpd|unpcklpd|xorpd|movdq2q|movdqa|movdqu|movq2dq|paddq|pmuludq|pshufhw|pshuflw|pshufd|pslldq|psrldq|punpckhqdq|punpcklqdq)[ \t]/' YOURFILE.asm

Find only packed SSE2 instructions:

awk '/[ \t](addpd|andnpd|andpd|cmppd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvttpd2dq|cvttpd2pi|cvttps2dq|divpd|maxpd|minpd|movapd|movhpd|movlpd|movmskpd|movntdq|movntpd|movupd|mulpd|orpd|pshufd|pshufhw|pshuflw|pslldq|psrldq|punpckhqdq|shufpd|sqrtpd|subpd|unpckhpd|unpcklpd|xorpd)[ \t]/' YOURFILE.asm

Find all SSE3 instructions:

awk '/[ \t](addsubpd|addsubps|haddpd|haddps|hsubpd|hsubps|movddup|movshdup|movsldup|lddqu|fisttp)[ \t]/' YOURFILE.asm

Find all SSSE3 instructions:

awk '/[ \t](psignw|psignd|psignb|pshufb|pmulhrsw|pmaddubsw|phsubw|phsubsw|phsubd|phaddw|phaddsw|phaddd|palignr|pabsw|pabsd|pabsb)[ \t]/' YOURFILE.asm

Find all SSE4 instructions:

awk '/[ \t](mpsadbw|phminposuw|pmulld|pmuldq|dpps|dppd|blendps|blendpd|blendvps|blendvpd|pblendvb|pblendw|pminsb|pmaxsb|pminuw|pmaxuw|pminud|pmaxud|pminsd|pmaxsd|roundps|roundss|roundpd|roundsd|insertps|pinsrb|pinsrd|pinsrq|extractps|pextrb|pextrd|pextrw|pextrq|pmovsxbw|pmovzxbw|pmovsxbd|pmovzxbd|pmovsxbq|pmovzxbq|pmovsxwd|pmovzxwd|pmovsxwq|pmovzxwq|pmovsxdq|pmovzxdq|ptest|pcmpeqq|pcmpgtq|packusdw|pcmpestri|pcmpestrm|pcmpistri|pcmpistrm|crc32|popcnt|movntdqa|extrq|insertq|movntsd|movntss|lzcnt)[ \t]/' YOURFILE.asm

Find the most common AVX instructions (including scalar, AVX2, the AVX-512 family, and some FMA like vfmadd132pd):

awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmovsd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' YOURFILE.asm

NOTE: tested with gawk and nawk.
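If you'd rather see which instructions dominate than just whether any match, this sketch of mine tallies mnemonic frequencies from the disassembly (it relies on GNU objdump's tab-separated output columns):

objdump -d YOURFILE | awk -F'\t' 'NF>=3 {print $3}' | awk '{print $1}' |
  sort | uniq -c | sort -rn | head -20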

C++ Centralizing SIMD usage

I do exactly this with a fractal project. It works with vector sizes of 1, 2, 4, 8, and 16 for float and 1, 2, 4, 8 for double. I use a CPU dispatcher at run-time to select the following instruction sets: SSE2, SSE4.1, AVX, AVX+FMA, and AVX512.

The reason I use a vector size of 1 is to test performance. There is already a SIMD library that does all this: Agner Fog's Vector Class Library. He even includes example code for a CPU dispatcher.

The VCL emulates wider hardware, such as AVX on systems that only have SSE (or even AVX512 on SSE-only systems): it just implements an AVX operation as two SSE operations (or four for AVX512), so in most cases you can use the largest vector size you want to target.

//#include "vectorclass.h"
void Memcopy(float *dst, float *src, size_t size)
{
    Vec8f v; // eight floats, using AVX hardware or AVX emulated with SSE twice
    for(size_t i = 0; i < size; i += v.size())
    {
        v.load(src);
        v.store(dst);
        dst += v.size();
        src += v.size();
    }
}

(However, writing an efficient memcpy is complicated: for large sizes you should consider non-temporal stores, and on IVB and above, rep movsb instead.) Notice that the code is identical to what you asked for, except I changed the word vector to Vec8f.

Using the VCL, a CPU dispatcher, templating, and macros, you can write your code/kernel so that it looks nearly identical to scalar code, without source-code duplication for every instruction set and vector size. It's your binaries that will be bigger, not your source code.

I have described CPU dispatchers several times. You can also see some examples using templating and macros for a dispatcher here: alias of a function template

Edit: Here is an example of part of my kernel that calculates the Mandelbrot set for a set of pixels equal to the vector size. At compile time I set TYPE to float, double, or doubledouble, and N to 1, 2, 4, 8, or 16. The doubledouble type, which I created and added to the VCL, is described here. This produces vector types of Vec1f, Vec4f, Vec8f, Vec16f, Vec1d, Vec2d, Vec4d, Vec8d, doubledouble1, doubledouble2, doubledouble4, doubledouble8.

template<typename TYPE, unsigned N>
static inline intn calc(floatn const &cx, floatn const &cy, floatn const &cut, int32_t maxiter) {
    floatn x = cx, y = cy;
    intn n = 0;
    for(int32_t i = 0; i < maxiter; i++) {
        floatn x2 = square(x), y2 = square(y);
        floatn r2 = x2 + y2;
        booln mask = r2 < cut;
        if(!horizontal_or(mask)) break;
        add_mask(n, mask);
        floatn t = x*y; mul2(t);
        x = x2 - y2 + cx;
        y = t + cy;
    }
    return n;
}

So my SIMD code for several different data types and vector sizes is nearly identical to the scalar code I would use. I have not included the part of my kernel which loops over each super-pixel.

My build file looks something like this

g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse2          -Ivectorclass  kernel.cpp -okernel_sse2.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse4.1 -Ivectorclass kernel.cpp -okernel_sse41.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx -Ivectorclass kernel.cpp -okernel_avx.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx2 -mfma -Ivectorclass kernel.cpp -okernel_avx2.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx2 -mfma -Ivectorclass kernel_fma.cpp -okernel_fma.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx512f -mfma -Ivectorclass kernel.cpp -okernel_avx512.o
g++ -m64 -Wall -Wextra -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse2 -Ivectorclass frac.cpp vectorclass/instrset_detect.cpp kernel_sse2.o kernel_sse41.o kernel_avx.o kernel_avx2.o kernel_avx512.o kernel_fma.o -o frac

Then the dispatcher looks something like this

int iset = instrset_detect();
fp_float1 = NULL;
fp_floatn = NULL;
fp_double1 = NULL;
fp_doublen = NULL;
fp_doublefloat1 = NULL;
fp_doublefloatn = NULL;
fp_doubledouble1 = NULL;
fp_doubledoublen = NULL;
fp_float128 = NULL;
fp_floatn_fma = NULL;
fp_doublen_fma = NULL;

if (iset >= 9) {
    fp_float1  = &manddd_AVX512<float,1>;
    fp_floatn  = &manddd_AVX512<float,16>;
    fp_double1 = &manddd_AVX512<double,1>;
    fp_doublen = &manddd_AVX512<double,8>;
    fp_doublefloat1  = &manddd_AVX512<doublefloat,1>;
    fp_doublefloatn  = &manddd_AVX512<doublefloat,16>;
    fp_doubledouble1 = &manddd_AVX512<doubledouble,1>;
    fp_doubledoublen = &manddd_AVX512<doubledouble,8>;
}
else if (iset >= 8) {
    fp_float1  = &manddd_AVX<float,1>;
    fp_floatn  = &manddd_AVX2<float,8>;
    fp_double1 = &manddd_AVX2<double,1>;
    fp_doublen = &manddd_AVX2<double,4>;
    fp_doublefloat1  = &manddd_AVX2<doublefloat,1>;
    fp_doublefloatn  = &manddd_AVX2<doublefloat,8>;
    fp_doubledouble1 = &manddd_AVX2<doubledouble,1>;
    fp_doubledoublen = &manddd_AVX2<doubledouble,4>;
}
....

This sets function pointers for each of the possible datatype/vector-size combinations for the instruction set found at runtime. Then I can call whatever function I'm interested in.

Scan binary for CPU feature usage

There are two good approaches:

  • Run under a debugger and look at the instruction that caused an illegal-instruction fault
  • Run under a simulator/emulator that can show you an instruction mix, like SDE.

But your idea, statically scanning the binary, can't distinguish code in functions that are only called after checking cpuid.
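For example, a binary built from a sketch like this (hypothetical function names; __builtin_cpu_supports is a real GCC/Clang builtin) contains AVX-512 instructions that a static scan would flag, even on machines where they can never execute:

#include <stddef.h>

void process_avx512(float *p, size_t n);  // compiled with AVX-512 instructions
void process_sse2(float *p, size_t n);    // baseline fallback

void process(float *p, size_t n)
{
    if (__builtin_cpu_supports("avx512f"))  // runtime CPUID check
        process_avx512(p, n);               // only reached on AVX-512-capable CPUs
    else
        process_sse2(p, n);
}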



Using a debugger to look at the faulting instruction

Pick any debugger. GDB is easy to install on any Linux distro, and probably also on Windows or Mac (or lldb there). Or pick any other debugger, e.g. one with a GUI.

Run the program. Once it faults, use the debugger to examine the faulting instruction.

Look it up in Intel or AMD's x86 asm reference manual, e.g. https://www.felixcloutier.com/x86/ is an HTML scrape of Intel's PDFs. See which ISA extension this form of this instruction requires.

For example, this source can compile to use AVX-512 instructions if you let the compiler do so, but only needs SSE2 to compile in the first place.

#include <immintrin.h>
// stores to global vars typically aren't optimized out, even without volatile
int buf[16];

int main(int argc, char **argv)
{
    __m128i v = _mm_set1_epi32(argc); // broadcast scalar to vector
    _mm_storeu_si128((__m128i*)buf, v);
}

(See it on Godbolt with different compile options.)

Build with gcc -march=skylake-avx512 -O3 ill.c.

Then try to run it, e.g. on my Skylake-client (non-AVX512) GNU/Linux desktop. (I also used strip a.out to remove the symbol table (function names), like a binary-only software release).

$ ./a.out 
Illegal instruction (core dumped)
$ gdb a.out
...
(gdb) run
Starting program: /tmp/a.out

Program received signal SIGILL, Illegal instruction.
0x0000555555555020 in ?? ()

(gdb) disas
No function contains program counter for selected frame.

(gdb) disas /r $pc,+20     # from the current program counter to +20 bytes
Dump of assembler code from 0x555555555020 to 0x555555555034:
=> 0x0000555555555020:  62 f2 7d 08 7c c7               vpbroadcastd xmm0,edi
   0x0000555555555026:  c5 f9 7f 05 32 30 00 00         vmovdqa XMMWORD PTR [rip+0x3032],xmm0   # 0x555555558060
   0x000055555555502e:  31 c0                           xor    eax,eax
   0x0000555555555030:  c3                              ret
   0x0000555555555031:  66 2e 0f 1f 84 00 00 00 00 00   cs nop WORD PTR [rax+rax*1+0x0]
End of assembler dump.

The => indicates the current program counter (RIP in x86-64, but GDB portably defines $pc as an alias on any ISA.)

So we faulted on vpbroadcastd xmm0,edi. (The way GCC implemented _mm_set1_epi32(argc) when we told it AVX512 was available.)


