Most efficient way to check if all __m128i components are 0 [using <= SSE4.1 intrinsics]

You can use the PTEST instruction via the _mm_testz_si128 intrinsic (SSE4.1), like this:

#include <smmintrin.h> // SSE4.1 header

// xor is your variable holding the XOR of the old and new rectangle data
if (!_mm_testz_si128(xor, xor))
{
    // rectangle has changed
}

Note that _mm_testz_si128 returns 1 if the bitwise AND of the two arguments is zero, so passing the same vector as both arguments tests whether that vector is all-zero.
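
As a self-contained sketch (the function and parameter names are illustrative, not from the question):

#include <smmintrin.h> // SSE4.1

// Compare two 16-byte slices of the rectangle; their XOR is all-zero
// exactly when they are identical.
static inline int row_changed(__m128i old_row, __m128i new_row)
{
    __m128i diff = _mm_xor_si128(old_row, new_row);
    return !_mm_testz_si128(diff, diff); // 1 if any bit differs
}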

Accessing the fields of a __m128i variable in a portable way

Use _mm_extract_epi16 (PEXTRW) when the index is a compile-time constant.
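
For example (a minimal sketch; the function name is mine):

#include <emmintrin.h> // SSE2
#include <stdint.h>

// Extract the 16-bit element at index 3; the index must be a
// compile-time constant.
static inline uint16_t third_word(__m128i v)
{
    return (uint16_t)_mm_extract_epi16(v, 3);
}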

For the first element, _mm_cvtsi128_si32 gives more efficient instructions. It works here, given that:

  • _mm_sad_epu8 zeroes bits 16 through 63 (each 16-bit sum is zero-extended within its 64-bit half)
  • you truncate the result to 16 bits via the uint16_t return type

Compilers may be able to do this optimization on their own based on either of those facts, but not all of them will, so it is better to use _mm_cvtsi128_si32 explicitly.
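
To make that concrete, here is a hedged sketch of a horizontal byte-sum in this style (the function name is mine):

#include <emmintrin.h> // SSE2
#include <stdint.h>

// Sum all 16 unsigned bytes of v. _mm_sad_epu8 against zero leaves a
// 16-bit sum zero-extended in each 64-bit half, so after adding the two
// halves, element 0 holds the full sum and bits 16..31 stay zero.
static inline uint16_t hsum_epu8(__m128i v)
{
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());
    __m128i hi  = _mm_unpackhi_epi64(sad, sad); // high half down to low
    __m128i sum = _mm_add_epi16(sad, hi);       // max 16*255 = 4080, no carry
    return (uint16_t)_mm_cvtsi128_si32(sum);    // MOVD, then truncate
}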

How to efficiently convert from two __m128d to one __m128i in MSVC?

If you got a linker error, you're probably ignoring a warning about an undeclared intrinsic function.

Your current code has a high risk of compiling to terrible asm. If it compiled to a vector shift and an OR, it was already sub-optimal. (Update: that's not what it compiles to; IDK where you got that idea.)

Use 2x _mm_cvtpd_epi32 to get two __m128i vectors with the ints you want in the low 2 elements of each. Use _mm_unpacklo_epi64 to combine those two low halves into one vector with all 4 elements you want.


Compiler output from clang 3.8.1 on the Godbolt compiler explorer. (Xcode uses clang by default, I think.)

#include <immintrin.h>

// the good version
__m128i pack_double_to_int(__m128d a, __m128d b) {
    return _mm_unpacklo_epi64(_mm_cvtpd_epi32(a), _mm_cvtpd_epi32(b));
}
cvtpd2dq xmm0, xmm0
cvtpd2dq xmm1, xmm1
punpcklqdq xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0]
ret

// the original
__m128i pack_double_to_int_badMMX(__m128d a, __m128d b) {
    return _mm_set_epi64(_mm_cvtpd_pi32(b), _mm_cvtpd_pi32(a));
}
cvtpd2pi mm0, xmm1
cvtpd2pi mm1, xmm0
movq2dq xmm1, mm0
movq2dq xmm0, mm1
punpcklqdq xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0]
# note the lack of EMMS, because the code never uses the _mm_empty() intrinsic
ret

MMX is almost totally useless when SSE2 and later are available; just avoid it. See the sse tag wiki for some guides.

Is an __m128i variable zero?

In SSE2 you can do:

__m128i zero = _mm_setzero_si128();
if (_mm_movemask_epi8(_mm_cmpeq_epi32(x, zero)) == 0xFFFF)
{
    // the code...
}

This tests four ints against zero and returns a 16-bit mask with one bit per byte, so each int's result occupies 4 mask bits starting at offsets 0, 4, 8 and 12. The test above catches any set bit; if you preserve the mask, you can work with the finer-grained parts directly if need be.
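
If you do keep the mask, here is a hedged sketch of using those per-int bit groups (the function name is mine):

#include <emmintrin.h> // SSE2

// Return the index (0-3) of the first 32-bit lane that is nonzero,
// or -1 if the whole vector is zero.
static inline int first_nonzero_lane(__m128i x)
{
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi32(x, _mm_setzero_si128()));
    for (int lane = 0; lane < 4; ++lane)
        if (((mask >> (lane * 4)) & 0xF) != 0xF) // lane compared not-equal to zero
            return lane;
    return -1; // all 16 mask bits set: vector is zero
}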

Fastest way to initialize a __m128i constant with intrinsics?

This answer is purely about the case of a constant C. If you have non-constant inputs, it matters where they're coming from (memory, registers, a recent computation that you could maybe do in vector registers in the first place?) and potentially what you're going to do with the resulting vector. Shuffling separate scalar variables into / out of SIMD vectors kinda sucks, with a tradeoff between ALU port bottlenecks vs. latency and throughput of store/reload (and the store-forwarding stall for scalar -> vector). Store/reload is good in asm for getting lots of small elements out of a SIMD vector when you do want them all, though.
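
For instance, a minimal sketch of that store/reload idiom (the buffer name and the byte-summing loop are just illustration):

#include <emmintrin.h> // SSE2
#include <stdalign.h>
#include <stdint.h>

// Store once, then read all 16 bytes back as scalars; the narrow reloads
// are fully contained in the wide store, so store forwarding works.
static inline int sum_all_bytes(__m128i v)
{
    alignas(16) uint8_t buf[16];
    _mm_store_si128((__m128i *)buf, v);
    int total = 0;
    for (int i = 0; i < 16; ++i)
        total += buf[i];
    return total;
}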


For constant C_a and C_b, even MSVC does a good job at constant-propagation through that _mm_set. So there's no advantage to writing an implementation-specific initializer like the one in SSE Error - Using m128i_i32 to define fields of a __m128i variable.

Remember that the real determiner of performance is the assembly you can coax the compiler into producing, not really which intrinsics you use to do that.

#include <immintrin.h>

__m128i xor_const(__m128i v) {
    return _mm_xor_si128(v, _mm_set_epi64x(0x789abc, 0x123456));
}

Compiled (on Godbolt) with x64 MSVC -O2 -Gv (to use vectorcall so we can see what it does when a vector is already in a register, like when this inlines), we get this fairly stupid asm which hopefully wouldn't be this bad in a larger function after inlining:

;; MSVC 19.10
;; this is in the .rdata section; godbolt just filters directives that aren't interesting
;; "everyone knows" that compilers put data in the right sections
__xmm@0000000000789abc0000000000123456 DB 'V4', 012H, 00H, 00H, 00H, 00H, 00H
DB 0bcH, 09aH, 'x', 00H, 00H, 00H, 00H, 00H

xor_const@@16 PROC ; COMDAT
movdqa xmm1, XMMWORD PTR __xmm@0000000000789abc0000000000123456
pxor xmm1, xmm0
movdqa xmm0, xmm1
ret 0
xor_const@@16 ENDP

We can see that the _mm_set intrinsic compiled to a 16-byte constant in static storage, like we want. The failure to use pxor xmm0, xmm1 is surprising, but MSVC is well known for asm that's often not quite as good as GCC's or clang's. Again, as part of a larger function where it has a choice of registers, we'd probably have no extra movdqa. And if the xor was in a loop, loading the constant once outside the loop is what we want anyway. This wasn't the most recent MSVC version; Godbolt only has the most up-to-date MSVC versions installed for C++, not C, but you tagged this C.


By comparison, GCC 9.2 -O3 compiles to the expected memory-source PXOR that's efficient on all CPUs.

xor_const:
pxor xmm0, XMMWORD PTR .LC0[rip]
ret

.section .rodata # Godbolt strips out stuff like section directive; re-added manually
.LC0:
.quad 1193046 # 0x123456
.quad 7903932 # 0x789abc

You could probably get a compiler to emit the same asm with a static alignas(16) array holding the constant, and _mm_load_si128() from that. But why bother?
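
If you want to see it anyway, here is a hedged sketch of that version (the array and function names are mine; the element order matches _mm_set_epi64x(0x789abc, 0x123456) on a little-endian target, low qword first):

#include <emmintrin.h> // SSE2
#include <stdalign.h>
#include <stdint.h>

static alignas(16) const uint64_t XOR_CONST[2] = { 0x123456, 0x789abc };

__m128i xor_const_manual(__m128i v)
{
    return _mm_xor_si128(v, _mm_load_si128((const __m128i *)XOR_CONST));
}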

One thing to avoid is writing static const __m128i C = _mm_set... - compilers are super dumb with this and will not fold the _mm_set into a static constant initializer for the __m128i. C compilers will refuse to compile it, because a static initializer must be a constant expression. C++ compilers will reserve some BSS space and run a constructor-like function at startup to copy from a read-only constant into that BSS space, because _mm_set doesn't fully optimize away in that case.

Check XMM register for all zeroes

_mm_testz_si128 is SSE4.1, which isn't supported on some CPUs (e.g. first-generation Intel Atom, AMD Phenom).

Here is an SSE2-compatible variant:

#include <emmintrin.h> // SSE2

inline bool isAllZeros(__m128i xmm) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(xmm, _mm_setzero_si128())) == 0xFFFF;
}
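
A quick usage check, assuming the function above is in scope (the values are hypothetical):

__m128i a = _mm_setzero_si128();
__m128i b = _mm_set_epi32(0, 0, 1, 0);
// isAllZeros(a) -> true
// isAllZeros(b) -> false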

Equal zero instruction in SSE

If it is SSE4.1, you can use _mm_testz_si128 with the vector as both arguments (note that AND-ing against a zero mask like _mm_set1_epi32(0) would always report "zero"), e.g.

_mm_testz_si128(idata, idata)

Also see Check XMM register for all zeroes above for an SSE2-compatible solution.


