How to Test If Your Linux Supports SSE2

  1. It's both. The compiler/assembler need to be able to emit/handle SSE2 instructions, and then the CPU needs to support them. If your binary contains SSE2 instructions with no conditions attached and you try to run it on a Pentium II, you are out of luck.

  2. The best way is to check your GCC manual. For example, my GCC manpage documents the -msse2 option, which lets you explicitly enable SSE2 instructions in the generated binaries. Any relatively recent GCC or ICC should support it. As for your CPU, check the flags line in /proc/cpuinfo.

It would be best, though, to have checks in your code using CPUID etc., so that the SSE2 sections can be disabled on CPUs that do not support them and your code can fall back to a more common instruction set.
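
For illustration, here is a minimal sketch of that dispatch pattern, assuming GCC or Clang on x86 (both provide the __builtin_cpu_supports builtin); the hello_* functions are just placeholders for a real SSE2 path and its scalar fallback:

#include <cstdio>

// Two versions of the same routine: one that may use SSE2, one portable fallback.
static void hello_sse2()   { std::puts("running the SSE2 code path"); }
static void hello_scalar() { std::puts("running the scalar fallback"); }

int main()
{
    // __builtin_cpu_supports checks a CPUID-derived feature mask (GCC/Clang, x86 only).
    if (__builtin_cpu_supports("sse2"))
        hello_sse2();
    else
        hello_scalar();
}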

EDIT:

Note that your compiler needs to either be a native compiler running on an x86 system, or a cross-compiler targeting x86. Otherwise it will not have the options needed to compile binaries for x86 processors, which includes anything using SSE2.

In your case the CPU does not support x86 at all. Depending on your Linux distribution, there might be packages with the Intel IA32EL emulation layer (x86 software on IA-64), which may allow you to run x86 software.

Therefore you have the following options:

  • Use a cross-compiler that runs on IA-64 and produces binaries for x86. Cross-compiler toolchains are not an easy thing to set up, though, because you need far more than just the compiler (binutils, libraries, etc.).

  • Use Intel IA32EL to run a native x86 compiler. I don't know how you would go about installing a native x86 toolchain and all the libraries your project needs if your distribution does not support that directly. Perhaps a full-blown chroot'ed installation of an x86 distribution?

Then if you want to test your build on this system you have to install Intel's IA32EL for Linux.

EDIT2:

I suppose you could also run a full x86 Linux distribution in an emulator like Bochs or QEMU (with no hardware virtualization, of course, since the host is not x86). You are definitely not going to be dazzled by the resulting speed, though.

How to probe a computer if it supports SSE2 in Delphi 32?

You can do that without assembler as well. It works on Windows XP and newer only, though.

function IsProcessorFeaturePresent(ProcessorFeature: DWORD): BOOL; stdcall;
  external kernel32 name 'IsProcessorFeaturePresent';

const
  PF_XMMI64_INSTRUCTIONS_AVAILABLE = 10;

function HasSSE2: Boolean;
begin
  Result := IsProcessorFeaturePresent(PF_XMMI64_INSTRUCTIONS_AVAILABLE);
end;
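
The same Win32 API can also be called from C or C++; a minimal sketch (Windows only, naturally):

#include <windows.h>
#include <cstdio>

int main()
{
    // PF_XMMI64_INSTRUCTIONS_AVAILABLE (10) asks specifically about SSE2.
    BOOL sse2 = IsProcessorFeaturePresent(PF_XMMI64_INSTRUCTIONS_AVAILABLE);
    std::printf("SSE2 %s\n", sse2 ? "available" : "not available");
}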

Determine processor support for SSE2?

Call CPUID with eax = 1 to load the feature flags into edx. Bit 26 of edx is set if SSE2 is available. Some code for demonstration purposes, using MSVC++ inline assembly (32-bit x86 only, and not portable!):

inline unsigned int get_cpu_feature_flags()
{
    unsigned int features;

    __asm
    {
        // Save registers
        push    eax
        push    ebx
        push    ecx
        push    edx

        // Get the feature flags (eax=1) from edx
        mov     eax, 1
        cpuid
        mov     features, edx

        // Restore registers
        pop     edx
        pop     ecx
        pop     ebx
        pop     eax
    }

    return features;
}

// Bit 26 of the feature flags indicates SSE2 support
static const bool cpu_supports_sse2 = (get_cpu_feature_flags() & 0x04000000) != 0;
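
Since MSVC does not support inline assembly when targeting x64, a sketch of the same check using the __cpuid intrinsic from <intrin.h> works for both 32-bit and 64-bit builds:

#include <intrin.h>

inline bool cpu_has_sse2()
{
    int info[4];
    __cpuid(info, 1);                   // leaf 1: feature flags in info[2] (ecx) and info[3] (edx)
    return (info[3] & (1 << 26)) != 0;  // edx bit 26 = SSE2
}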

How to tell if a Linux machine supports AVX/AVX2 instructions?

On Linux (or other Unix-like systems) the information about your CPU is in /proc/cpuinfo. You can extract information from there by hand, or with a grep command (grep flags /proc/cpuinfo).

Also, most compilers will automatically define __AVX2__ when AVX2 code generation is enabled (e.g. with -mavx2), so you can check for that at compile time too.
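
If you would rather do the /proc/cpuinfo check from a program than by hand, here is a small sketch that scans the flags line for a feature name (here "avx2"; any other flag works the same way):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Returns true if the given feature name appears in the "flags" line of /proc/cpuinfo.
static bool cpuinfo_has_flag(const std::string& feature)
{
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.compare(0, 5, "flags") != 0)
            continue;
        std::istringstream words(line);
        std::string word;
        while (words >> word)
            if (word == feature)
                return true;
    }
    return false;
}

int main()
{
    std::cout << "avx2: " << (cpuinfo_has_flag("avx2") ? "yes" : "no") << '\n';
}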

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Most compilers will automatically define:

__SSE__
__SSE2__
__SSE3__
__AVX__
__AVX2__

etc, according to whatever command line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this:

$ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE_MATH__ 1

or:

$ gcc -mavx2 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

or to just check the pre-defined macros for a default build on your particular platform:

$ gcc -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __SSE2_MATH__ 1
#define __SSE2__ 1
#define __SSE3__ 1
#define __SSE_MATH__ 1
#define __SSE__ 1
#define __SSSE3__ 1

More recent Intel processors support AVX-512, which is not a monolithic instruction set. One can see the support reported by GCC (version 6.2) for two example targets below.

Here is Knights Landing:

$ gcc -march=knl -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __AVX512CD__ 1
#define __AVX512ER__ 1
#define __AVX512F__ 1
#define __AVX512PF__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

Here is Skylake AVX-512:

$ gcc -march=skylake-avx512 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512F__ 1
#define __AVX512VL__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

Intel has disclosed additional AVX-512 subsets (see ISA extensions). GCC (version 7) supports compiler flags and preprocessor symbols associated with the 4FMAPS, 4VNNIW, IFMA, VBMI and VPOPCNTDQ subsets of AVX-512:

$ for i in 4fmaps 4vnniw ifma vbmi vpopcntdq ; do echo "==== $i ====" ; gcc -mavx512$i -dM -E - < /dev/null | egrep "AVX512" | sort ; done
==== 4fmaps ====
#define __AVX5124FMAPS__ 1
#define __AVX512F__ 1
==== 4vnniw ====
#define __AVX5124VNNIW__ 1
#define __AVX512F__ 1
==== ifma ====
#define __AVX512F__ 1
#define __AVX512IFMA__ 1
==== vbmi ====
#define __AVX512BW__ 1
#define __AVX512F__ 1
#define __AVX512VBMI__ 1
==== vpopcntdq ====
#define __AVX512F__ 1
#define __AVX512VPOPCNTDQ__ 1

Note that the SSE macros won't work with Visual C++. You have to use _M_IX86_FP instead.
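
One way to fold the two conventions into a single compile-time check is a small preprocessor sketch like this (HAVE_SSE2 is just an example name, not a standard macro); note that 64-bit MSVC builds define _M_X64 rather than _M_IX86_FP, and SSE2 is always available there:

// Compile-time SSE2 detection covering GCC/Clang/ICC and MSVC.
#if defined(__SSE2__)                         // GCC, Clang, ICC: set by -msse2 / -march=...
    #define HAVE_SSE2 1
#elif defined(_M_X64) || defined(_M_AMD64)    // 64-bit MSVC: SSE2 is part of the x86-64 baseline
    #define HAVE_SSE2 1
#elif defined(_M_IX86_FP) && _M_IX86_FP >= 2  // 32-bit MSVC: /arch:SSE2 or higher
    #define HAVE_SSE2 1
#else
    #define HAVE_SSE2 0
#endif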

How can I check if my installed numpy is compiled with SSE/SSE2 instruction set?

Take a look at:

import numpy.distutils.system_info as sysinfo
sysinfo.show_all()

This will print out all of the information about what numpy was compiled against.

Detect the availability of SSE/SSE2 instruction set in Visual Studio

From the documentation:

_M_IX86_FP

Expands to a value indicating which /arch compiler option was used:

  • 0 if /arch:IA32 was used.
  • 1 if /arch:SSE was used.
  • 2 if /arch:SSE2 was used. This value is the default if /arch was not specified.

I don't see any mention of _SSE_.

How to check if a CPU supports the SSE3 instruction set?

I've created a GitHub repo that will detect CPU and OS support for all the major x86 ISA extensions: https://github.com/Mysticial/FeatureDetector

Here's a shorter version:


First you need to access the CPUID instruction:

#ifdef _WIN32

// Windows
#include <intrin.h>   // for __cpuidex
#define cpuid(info, x)    __cpuidex(info, x, 0)

#else

// GCC Intrinsics
#include <cpuid.h>
void cpuid(int info[4], int InfoType){
    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
}

#endif

Then you can run the following code:

//  Misc.
bool HW_MMX;
bool HW_x64;
bool HW_ABM; // Advanced Bit Manipulation
bool HW_RDRAND;
bool HW_BMI1;
bool HW_BMI2;
bool HW_ADX;
bool HW_PREFETCHWT1;

// SIMD: 128-bit
bool HW_SSE;
bool HW_SSE2;
bool HW_SSE3;
bool HW_SSSE3;
bool HW_SSE41;
bool HW_SSE42;
bool HW_SSE4a;
bool HW_AES;
bool HW_SHA;

// SIMD: 256-bit
bool HW_AVX;
bool HW_XOP;
bool HW_FMA3;
bool HW_FMA4;
bool HW_AVX2;

// SIMD: 512-bit
bool HW_AVX512F; // AVX512 Foundation
bool HW_AVX512CD; // AVX512 Conflict Detection
bool HW_AVX512PF; // AVX512 Prefetch
bool HW_AVX512ER; // AVX512 Exponential + Reciprocal
bool HW_AVX512VL; // AVX512 Vector Length Extensions
bool HW_AVX512BW; // AVX512 Byte + Word
bool HW_AVX512DQ; // AVX512 Doubleword + Quadword
bool HW_AVX512IFMA; // AVX512 Integer 52-bit Fused Multiply-Add
bool HW_AVX512VBMI; // AVX512 Vector Byte Manipulation Instructions

int info[4];
cpuid(info, 0);
int nIds = info[0];

cpuid(info, 0x80000000);
unsigned nExIds = info[0];

// Detect Features
if (nIds >= 0x00000001){
    cpuid(info, 0x00000001);
    HW_MMX    = (info[3] & ((int)1 << 23)) != 0;
    HW_SSE    = (info[3] & ((int)1 << 25)) != 0;
    HW_SSE2   = (info[3] & ((int)1 << 26)) != 0;
    HW_SSE3   = (info[2] & ((int)1 <<  0)) != 0;

    HW_SSSE3  = (info[2] & ((int)1 <<  9)) != 0;
    HW_SSE41  = (info[2] & ((int)1 << 19)) != 0;
    HW_SSE42  = (info[2] & ((int)1 << 20)) != 0;
    HW_AES    = (info[2] & ((int)1 << 25)) != 0;

    HW_AVX    = (info[2] & ((int)1 << 28)) != 0;
    HW_FMA3   = (info[2] & ((int)1 << 12)) != 0;

    HW_RDRAND = (info[2] & ((int)1 << 30)) != 0;
}
if (nIds >= 0x00000007){
    cpuid(info, 0x00000007);
    HW_AVX2        = (info[1] & ((int)1 <<  5)) != 0;

    HW_BMI1        = (info[1] & ((int)1 <<  3)) != 0;
    HW_BMI2        = (info[1] & ((int)1 <<  8)) != 0;
    HW_ADX         = (info[1] & ((int)1 << 19)) != 0;
    HW_SHA         = (info[1] & ((int)1 << 29)) != 0;
    HW_PREFETCHWT1 = (info[2] & ((int)1 <<  0)) != 0;

    HW_AVX512F     = (info[1] & ((int)1 << 16)) != 0;
    HW_AVX512CD    = (info[1] & ((int)1 << 28)) != 0;
    HW_AVX512PF    = (info[1] & ((int)1 << 26)) != 0;
    HW_AVX512ER    = (info[1] & ((int)1 << 27)) != 0;
    HW_AVX512VL    = (info[1] & ((int)1 << 31)) != 0;
    HW_AVX512BW    = (info[1] & ((int)1 << 30)) != 0;
    HW_AVX512DQ    = (info[1] & ((int)1 << 17)) != 0;
    HW_AVX512IFMA  = (info[1] & ((int)1 << 21)) != 0;
    HW_AVX512VBMI  = (info[2] & ((int)1 <<  1)) != 0;
}
if (nExIds >= 0x80000001){
    cpuid(info, 0x80000001);
    HW_x64   = (info[3] & ((int)1 << 29)) != 0;
    HW_ABM   = (info[2] & ((int)1 <<  5)) != 0;
    HW_SSE4a = (info[2] & ((int)1 <<  6)) != 0;
    HW_FMA4  = (info[2] & ((int)1 << 16)) != 0;
    HW_XOP   = (info[2] & ((int)1 << 11)) != 0;
}

Note that this only detects whether the CPU supports the instructions. To actually run them, you also need to have operating system support.

Specifically, operating system support is required for:

  • x64 instructions. (You need a 64-bit OS.)
  • Instructions that use the (AVX) 256-bit ymm registers. See Andy Lutomirski's answer for how to detect this.
  • Instructions that use the (AVX512) 512-bit zmm and opmask registers. Detecting OS support for AVX512 is the same as with AVX, but checking the XCR0 mask 0xe6 instead of 0x6; a sketch of both checks follows this list.
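
A minimal sketch of those OS-level checks, reusing the cpuid() wrapper from above; it assumes MSVC's _xgetbv intrinsic on Windows and falls back to inline assembly elsewhere:

#include <stdint.h>
#ifdef _WIN32
#include <intrin.h>   // _xgetbv
#endif

// Read XCR0, which reports which register states the OS saves and restores.
static uint64_t read_xcr0(){
#ifdef _WIN32
    return _xgetbv(0);
#else
    uint32_t eax, edx;
    __asm__ volatile ("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
    return ((uint64_t)edx << 32) | eax;
#endif
}

bool os_supports_avx(){
    int info[4];
    cpuid(info, 1);
    bool osxsave = (info[2] & ((int)1 << 27)) != 0;  // XGETBV is usable
    bool avx     = (info[2] & ((int)1 << 28)) != 0;  // CPU itself has AVX
    // Bits 1 and 2 of XCR0: the OS saves xmm and ymm state.
    return osxsave && avx && (read_xcr0() & 0x6) == 0x6;
}

bool os_supports_avx512(){
    // AVX-512 additionally needs opmask and zmm state (bits 5-7), hence mask 0xe6.
    return os_supports_avx() && (read_xcr0() & 0xe6) == 0xe6;
}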

sse2 instruction set not enabled

-msse2 is the specific option, so passing that to GCC will work, if you get your build scripts set up to actually do that. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#x86-Options

Or better, use -march=native to enable everything your CPU has, if you're building for local use, not for distributing a binary that might have to work on an old-but-not-ancient CPU. (Of course, if you care about performance, it's weird to be building for 32-bit mode. SSE2 is baseline for x86-64. Unless your CPU is too old to support SSE2, e.g. a Pentium III. Or for example, there are embedded x86 CPUs without SSE, like AMD Geode. In that case, a binary built (successfully) with -msse2 will probably crash with an illegal instruction on such a CPU.)

-mfpmath=sse just tells GCC to use SSE for scalar FP math, assuming SSE is available; it is unrelated to telling GCC that the target CPU supports SSE2. It can be good to use it as well for performance, but it's not going to matter for getting your code to compile.

And yes, SSE1/2 intrinsic types like __m128i only get defined when SSE is enabled, so error: ‘__m128i’ does not name a type is a clear sign that -msse wasn't enabled.
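
As a quick sanity check of the build flags, here is a minimal program using SSE2 intrinsics; it should compile with g++ -msse2 (or by default when targeting x86-64), while a build without SSE2 enabled is rejected, with the exact message depending on the GCC version:

#include <emmintrin.h>   // SSE2 intrinsics: __m128i, _mm_add_epi32, ...
#include <cstdio>

int main()
{
    __m128i a = _mm_set1_epi32(20);
    __m128i b = _mm_set1_epi32(22);
    __m128i c = _mm_add_epi32(a, b);   // four 32-bit additions in parallel
    int out[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), c);
    std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);   // prints 42 four times
}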


If using autoconf or something, maybe use this:

./configure CXXFLAGS="-O3 -march=native -fno-math-errno"

If you have .c files as well as .cpp, set CFLAGS as well as CXXFLAGS; these are compiler options, so they belong there rather than in CPPFLAGS (which autoconf reserves for preprocessor flags). More options like -flto can be helpful for optimization (cross-file inlining at link time), if you get those added to your linker options as well. So can any other optimization options like -ffast-math, if you want to use it. Or at least -fno-trapping-math helps some, and GCC already did optimizations that violated the semantics trapping-math was supposed to provide. See this Q&A re: -fno-trapping-math -fno-math-errno being safe to use basically everywhere, even in code that depends on strict FP like Kahan summation.


