Get Member of _M128 by Index

Get member of __m128 by index?

A union is probably the most portable way to do this:

union {
    __m128 v;    // SSE 4 x float vector
    float a[4];  // scalar array of 4 floats
} U;

float vectorGetByIndex(__m128 V, unsigned int i)
{
    U u;

    assert(i <= 3);
    u.v = V;
    return u.a[i];
}

Why can't Clang get __m128's data by index in constexpr function

Regardless of constexpr, a[pos] is only valid as a GNU C extension, not portable to MSVC. Storing to an array, or C++20 std::bit_cast to a struct might work. bit_cast is constexpr-compatible, unlike other type-punning methods. Although I'd be worried about how efficiently that would compile across compilers for runtime-variable pos

bit_cast does compile ok with clang, and works in a constexpr function. But compiles inefficiently for GCC.

Correction: clang compiles this, but rejects it if called in a context that requires it to be constant-evaluated. note: constexpr bit_cast involving type '__attribute__((__vector_size__(4 * sizeof(float)))) float const' (vector of 4 'float' values) is not yet supported.

Other failed attempts with current clang in a constexpr context:

_mm_store_ps - not supported. Nor is *(__m128*)f = a; because it's a reinterpret_cast.
f[0] = vec[0] etc. initializers: no, even literal constant indexing of a GNU C native vector isn't supported in clang in constexpr.
union type punning: reading an inactive member not allowed in a constexpr context
_mm_cvtss_f32(vec) - non-constexpr function unusable, so no chance of using if constexpr for separate shuffles and returns.

Not-working answer, may work at some point in the future but not with clang trunk pre 15.0

#include <cstddef>
#include <immintrin.h>
#include <bit>

// portable, but inefficient with GCC
constexpr float get_data(__m128 a, std::size_t pos) {
    struct foo { float f[4]; } s;
    s = std::bit_cast<foo>(a);
    return s.f[pos];
}

float test_idx2(__m128 a){
    return get_data(a, 2);
}

float test_idxvar(__m128 a, size_t pos){
    return get_data(a, pos);
}

These compile to decent asm on Godbolt, the same you'd get from clang with a[pos]. I used -O3 -march=haswell -std=gnu++20

# clang 14 -O3 -march=haswell -std=gnu++20
# get_data has no asm output; constexpr is like inline in that respect

test_idx2(float __vector(4)):
        vpermilpd       xmm0, xmm0, 1           # xmm0 = xmm0[1,0]
        ret
test_idxvar(float __vector(4), unsigned long):
        vmovups xmmword ptr [rsp - 16], xmm0
        vmovss  xmm0, dword ptr [rsp + 4*rdi - 16] # xmm0 = mem[0],zero,zero,zero
        ret

Store/reload is a sensible strategy for a runtime-variable index, although vmovd / vpermilps would be an option since AVX introduced a variable-control shuffle that uses dword indices. An out-of-range index is UB so the compiler doesn't have any requirement to return any specific data in that case.

Using vpermilpd for the constant index 2 is a waste of code-size vs. vmovhlps xmm0, xmm0, xmm0 or vunpckhpd. It costs a longer VEX prefix and an immediate, so 2 bytes of machine-code size, but otherwise same performance on most CPUs.

Unfortunately GCC doesn't do such a good job

We get a store/reload even for the fixed index of 2, and even worse, reload by bouncing through a GP-integer register. This is a missed optimization, but IDK how quickly it would get fixed if reported. So if you're going to do this, perhaps #ifdef __clang__ or #ifdef __llvm__ for bit_cast, and #ifdef __GNUC__ for a[pos]. (Clang defines __GNUC__ so check for that after special-casing clang.)

# gcc12 -O3 -march=haswell -std=gnu++20
test_idx2(float __vector(4)):
        vmovaps XMMWORD PTR [rsp-24], xmm0
        mov     rax, QWORD PTR [rsp-16]
        vmovd   xmm0, eax              # slow: should have loaded directly from mem
        ret

test_idxvar(float __vector(4), unsigned long):
        vmovdqa XMMWORD PTR [rsp-24], xmm0
        vmovss  xmm0, DWORD PTR [rsp-24+rdi*4]   # this is fine, same as clang
        ret

Interestingly the runtime-variable version didn't have the same anti-optimization for GCC.

accessing __m128 fields across compilers

To load a __m128, you can write _mm_setr_ps(1.f, 2.f, 3.f, 4.f), which is supported by GCC, ICC, MSVC and clang.

So far as I know, clang and recent versions of GCC support accessing __m128 fields by index. I don't know how to do this in ICC or MSVC. I guess _mm_extract_ps works for all 4 compilers but its return type is insane making it painful to use.

Accessing the fields of a __m128i variable in a portable way

_mm_extract_epi16 for a compile-time known index.

For the first element _mm_cvtsi128_si32 gives more efficient instructions. This would work, given that:

_mm_sad_epu8 fills the the bits 16 thru 63 to zero
you truncate the result to 16 bits via uint16_t return type

Compilers may be able to do this optimization on their own, based on either of the reasons, but not all of them, so it is better to use _mm_cvtsi128_si32.

How to find the max member in a __m128(F32vec4)

There's no easy way to do this. SSE isn't particularly meant for horizontal operations. So you have to shuffle...

Here's one approach:

__m128 a = _mm_set_ps(10,9,7,8);

__m128 b = _mm_shuffle_ps(a,a,78);  //  {a,b,c,d} -> {c,d,a,b}
a = _mm_max_ps(a,b);

b = _mm_shuffle_ps(a,a,177);        //  {a,b,c,d} -> {b,a,d,c}
a = _mm_max_ss(a,b);

float out;
_mm_store_ss(&out,a);

I note that the final store isn't really supposed to be a store. It's just a hack to get the value into the float datatype.

In reality no instruction is needed because float types will be stored in the same SSE registers. (It's just that the top 3 values are ignored.)

Broadcast one arbitrary element of __m128 vector

I think you have to see to _mm_shuffle_epi32(). Its using will be easy with next helper function:

#include <emmintrin.h>

template <int index> inline __m128 Broadcast(const __m128 & a)
{
    return _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(a), index * 0x55));
}

int main()
{
    __m128 a = {a0, a1, a2, a3};
    __m128 b = Broadcast<1>(a);
    return 0;
}

How to compare __m128 types?

You should probably use _mm_cmpneq_ps. However the interpretation of comparisons is a little different with SIMD code than with scalar code. Do you want to test for any corresponding element not being equal ? Or all corresponding elements not being equal ?

To test the results of the 4 comparisons from _mm_cmpneq_ps you can use _mm_movemask_epi8.

Note that comparing floating point values for equality or inequality is usually a bad idea, except in very specific cases.

__m128i vcmp = (__m128i)_mm_cmpneq_ps(a, b); // compare a, b for inequality
uint16_t test = _mm_movemask_epi8(vcmp); // extract results of comparison
if (test == 0xffff)
    // *all* elements not equal
else if (test != 0)
    // *some* elements not equal
else
    // no elements not equal, i.e. all elements equal

For documentation you want these two volumes from Intel:

Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 2A: Instruction Set Reference, A-M

Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 2B: Instruction Set Reference, N-Z

Get Member of _M128 by Index