Check XMM register for all zeroes
_mm_testz_si128 is SSE4.1, which isn't supported on some CPUs (e.g. Intel Atom, AMD Phenom). Here is an SSE2-compatible variant:
inline bool isAllZeros(__m128i xmm) {
return _mm_movemask_epi8(_mm_cmpeq_epi8(xmm, _mm_setzero_si128())) == 0xFFFF;
}
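A quick sanity check of the helper (the test values are my own picks; any single set bit anywhere in the register must defeat the all-zero test):

```c
#include <emmintrin.h>
#include <stdbool.h>

/* SSE2-only all-zero test, repeated from above so this example is self-contained */
static bool isAllZeros(__m128i xmm) {
    /* pcmpeqb against zero gives 0xFF per zero byte; movemask collects the MSBs */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(xmm, _mm_setzero_si128())) == 0xFFFF;
}
```

isAllZeros(_mm_setzero_si128()) yields true, while isAllZeros(_mm_set_epi32(0, 0, 0, 1)) yields false, because pcmpeqb produces a 0x00 byte wherever the input byte is non-zero, so the mask falls short of 0xFFFF.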
Faster way to test if xmm/ymm register is zero?
Rather than being "quite slow", your existing approach is actually reasonable. Sure, each individual test has a latency of 4 cycles¹, but if you want the result in a general-purpose register you usually pay a 3-cycle latency for that move anyway (e.g., movmskb also has a latency of 3). In any case, you want to test 8 registers, and you don't simply add the latencies because each test is mostly independent, so uop count and port use will likely end up being more important than the latency of testing a single register, as most of the latencies will overlap with other work.
An approach that is likely to be a bit faster on Intel hardware is to use successive PCMPEQ instructions to test several vectors, and then fold the results together (e.g., if you use PCMPEQQ you effectively have 4 quadword results and need to and-fold them into 1). You can fold either before or after the PCMPEQ, but it would help to know more about how/where you want the results to come up with something better. Here's an untested sketch for 8 registers, xmm1-xmm8, with xmm0 assumed zero and xmm15 holding the pblendvb mask used to select alternate bytes in the last blend instruction.
# test the 2 qwords in each vector against zero
vpcmpeqq xmm11, xmm1, xmm0
vpcmpeqq xmm12, xmm3, xmm0
vpcmpeqq xmm13, xmm5, xmm0
vpcmpeqq xmm14, xmm7, xmm0
# blend the results down into xmm10 (origin register per word shown in comments)
vpblendw xmm10, xmm11, xmm12, 0xAA # 3131 3131
vpblendw xmm13, xmm13, xmm14, 0xAA # 7575 7575
vpblendw xmm10, xmm10, xmm13, 0xCC # 7531 7531
# test the 2 qwords in each vector against zero
vpcmpeqq xmm11, xmm2, xmm0
vpcmpeqq xmm12, xmm4, xmm0
vpcmpeqq xmm13, xmm6, xmm0
vpcmpeqq xmm14, xmm8, xmm0
# blend the results down into xmm11 (origin register per word shown in comments)
vpblendw xmm11, xmm11, xmm12, 0xAA # 4242 4242
vpblendw xmm13, xmm13, xmm14, 0xAA # 8686 8686
vpblendw xmm11, xmm11, xmm13, 0xCC # 8642 8642
# blend xmm10 and xmm11 together into xmm10, byte-wise
# origin bytes
# xmm10 77553311 77553311
# xmm11 88664422 88664422
# res 87654321 87654321
vpblendvb xmm10, xmm10, xmm11, xmm15
# move the mask bits into eax
vpmovmskb eax, xmm10
and al, ah
The intuition is that you test each QWORD in each xmm against zero, giving 16 results for the 8 registers, and then blend those results together into xmm10, ending up with one result per byte, in register order (with all the high-QWORD results before all the low-QWORD results). Then you move those 16 byte masks as 16 bits into eax with pmovmskb, and finally combine the high- and low-QWORD bits for each register inside eax.
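For reference, here is my own translation of that sketch into intrinsics (an untested sketch, not the answer's code; it requires SSE4.1 for pcmpeqq and the blends, and the function name is made up). Build with -msse4.1, or rely on the pragma below on GCC/clang:

```c
#if defined(__GNUC__) && !defined(__SSE4_1__)
#pragma GCC target("sse4.1")   /* allow SSE4.1 intrinsics without -msse4.1 */
#endif
#include <smmintrin.h>  /* SSE4.1: _mm_cmpeq_epi64, _mm_blend_epi16, _mm_blendv_epi8 */

/* Hypothetical helper: returns an 8-bit mask, bit k set iff register x(k+1)
   is all zero. Direct rendering of the asm sketch above. */
static int testEightZero(__m128i x1, __m128i x2, __m128i x3, __m128i x4,
                         __m128i x5, __m128i x6, __m128i x7, __m128i x8) {
    const __m128i zero = _mm_setzero_si128();
    /* per-qword compares: all-ones in each qword that is zero */
    __m128i c1 = _mm_cmpeq_epi64(x1, zero), c3 = _mm_cmpeq_epi64(x3, zero);
    __m128i c5 = _mm_cmpeq_epi64(x5, zero), c7 = _mm_cmpeq_epi64(x7, zero);
    __m128i c2 = _mm_cmpeq_epi64(x2, zero), c4 = _mm_cmpeq_epi64(x4, zero);
    __m128i c6 = _mm_cmpeq_epi64(x6, zero), c8 = _mm_cmpeq_epi64(x8, zero);
    /* word-blend the odd-numbered results down to one vector: 7531 7531 */
    __m128i odd  = _mm_blend_epi16(_mm_blend_epi16(c1, c3, 0xAA),
                                   _mm_blend_epi16(c5, c7, 0xAA), 0xCC);
    /* and the even-numbered ones: 8642 8642 */
    __m128i even = _mm_blend_epi16(_mm_blend_epi16(c2, c4, 0xAA),
                                   _mm_blend_epi16(c6, c8, 0xAA), 0xCC);
    /* byte-blend: odd bytes from 'even', giving bytes 87654321 87654321 */
    __m128i sel = _mm_set1_epi16((short)0xFF00);  /* MSB set in odd bytes */
    __m128i res = _mm_blendv_epi8(odd, even, sel);
    int m = _mm_movemask_epi8(res);  /* bits 8..15 hold the high-qword results */
    return m & (m >> 8) & 0xFF;      /* the 'and al, ah' step */
}
```

With x2 and x5 all-zero and the other six registers non-zero, this returns 0x12 (bits 1 and 4 set); with all eight registers zero it returns 0xFF.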
That looks to me like 16 uops total for 8 registers, so about 2 uops per register. The total latency is reasonable since it is largely a "reduce"-type parallel tree. A limiting factor would be the 6 vpblendw operations, which all go only to port 5 on modern Intel. It would be better to replace 4 of those with VPBLENDD, which is the one "blessed" blend that can run on any of p015. That should be straightforward.
All the ops are simple and fast. The final and al, ah is a partial-register write, but if you mov it into eax afterwards perhaps there is no penalty. You could also do that last line a couple of different ways if that's an issue. This approach also scales naturally to ymm registers, with slightly different folding in eax at the end.
EDIT
A slightly faster ending uses packed shifts to avoid two expensive instructions:
;combine bytes of xmm10 and xmm11 together into xmm10, byte wise
; xmm10 77553311 77553311
; xmm11 88664422 88664422 before shift
; xmm10 07050301 07050301
; xmm11 80604020 80604020 after shift
;result 87654321 87654321 combined
vpsrlw xmm10,xmm10,8
vpsllw xmm11,xmm11,8
vpor xmm10,xmm10,xmm11
;combine the low and high qword results to make sure both are zero
vpsrldq xmm12,xmm10,8        ; shift count is in *bytes*: 8 bytes = 64 bits
vpand   xmm10,xmm10,xmm12    ; VEX vpand takes three operands
vpmovmskb eax,xmm10
This saves 2 cycles by avoiding the 2-cycle vpblendvb and the partial-register penalty of and al, ah. It also removes the dependency on the slow vpmovmskb if you don't need to use the result of that instruction right away.
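The shift-based ending can also be rendered as intrinsics (my own sketch; here odd and even stand for the 77553311/88664422 per-byte result vectors built earlier, and the names are made up):

```c
#include <emmintrin.h>

/* Combine the two per-byte result vectors with shifts instead of pblendvb,
   then fold the high-qword results onto the low ones before pmovmskb. */
static int combineShift(__m128i odd  /* 77553311 77553311 */,
                        __m128i even /* 88664422 88664422 */) {
    __m128i lo = _mm_srli_epi16(odd, 8);   /* 07050301 ... results in even bytes */
    __m128i hi = _mm_slli_epi16(even, 8);  /* 80604020 ... results in odd bytes  */
    __m128i r  = _mm_or_si128(lo, hi);     /* 87654321 87654321 */
    /* bring the high-qword results down; shift count is in bytes (8 = 64 bits) */
    r = _mm_and_si128(r, _mm_srli_si128(r, 8));
    return _mm_movemask_epi8(r) & 0xFF;    /* bit k set iff register k+1 is zero */
}
```

For example, if both inputs are all-ones (all 8 registers zero) this returns 0xFF; if only register 2's compare result is set it returns 0x02.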
¹ Actually it seems to be only on Skylake that PTEST has a latency of 3 cycles; before that it seems to be 2. I'm also not sure about the 1-cycle latency you listed for rcl eax, 1: according to Agner, it seems to be 3 uops and 2 cycles latency/reciprocal throughput on modern Intel.
Test if any byte in an xmm register is 0
You can use _mm_movemask_epi8 (the pmovmskb instruction) to obtain a bit mask from the result of a comparison (the resulting mask contains the most significant bit of each byte in the vector). Testing whether any of the bytes is zero then means testing whether any of the 16 bits in the mask is non-zero.
pxor xmm4, xmm4
pcmpeqb xmm4, [rdi]
pmovmskb eax, xmm4
test eax, eax ; ZF=0 if there are any set bits = any matches
jnz .found_a_zero
After finding a vector with any matches, you can find the first match position with bsf eax, eax to get the bit index in the bitmask, which is also the byte index in the 16-byte vector. Alternatively, you can check for all bytes matching (e.g. as you'd do in memcmp / strcmp) with pcmpeqb / pmovmskb / cmp eax, 0xffff to check that all bits are set, instead of checking for at least one set bit.
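The same search can be sketched with intrinsics (the function name is mine; __builtin_ctz plays the role of bsf and is a GCC/clang builtin):

```c
#include <emmintrin.h>

/* Return the index of the first zero byte in a 16-byte block, or -1 if none.
   Mirrors the pcmpeqb / pmovmskb / bsf sequence above. */
static int firstZeroByte(const void *p) {
    __m128i v  = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    int mask = _mm_movemask_epi8(eq);  /* bit i set iff byte i == 0 */
    if (mask == 0) return -1;
    return __builtin_ctz(mask);        /* count trailing zeros, like bsf */
}
```

For a buffer initialized as char buf[16] = "hello" (the remaining bytes are zero-filled in C), firstZeroByte(buf) returns 5, the index of the terminating NUL.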
Why doesn't gcc zero the upper values of an XMM register when only using the lower value with SS/SD instructions?
The return value is the same width as the args, so no extension is needed. The parts of registers outside the type widths are allowed to hold garbage in x86 and x86-64 calling conventions. (This applies to both GP-integer and vector registers.) Except for an undocumented extension which clang depends on, where callers extend narrow args to 32-bit; clang will skip the movsx instructions in your char example: https://godbolt.org/z/Gv5e4h3Eh. Is a sign or zero extension required when adding a 32bit offset to a pointer for the x86-64 ABI? covers both the high garbage and the unofficial extension to the calling convention.
Since you asked about false dependencies, note that compilers do use movaps xmm,xmm to copy a scalar. (e.g. in GCC's missed-optimization example (a-b) + (a-d) we need to subtract from a twice. It's non-commutative, so we need a copy: https://godbolt.org/z/Tvx19raa3.) Indeed, movss xmm1, xmm0 has a dependency on XMM1 where movaps doesn't, and it would be a false dependency if you didn't actually care about merging with the old high bytes. (When tuning for Pentium III or Pentium M it might make sense to use movss because it was single-uop there, but current GCC with -O3 -m32 -mtune=pentium3 -mfpmath=sse uses movaps, spending the 2nd uop to avoid a false dependency. It wasn't until Core 2 that the SIMD execution units widened to 128-bit for P6-family, matching Pentium 4.)
C integer promotion rules mean that a+b for narrow inputs is equivalent to (int)a + (int)b. In all x86 / x86-64 ABIs, char is a signed type (unlike on ARM, for example), so it needs to be sign-extended to int width, not zero-extended, and definitely not truncated. If you truncated the result again by returning a char, compilers could, if they wanted, just do 8-bit adds. But actually they'll use 32-bit adds and leave whatever high garbage there: https://godbolt.org/z/hGdbecPqv. It's not doing this for dep-breaking / performance, just correctness.
As far as performance, GCC's behaviour of reading the 32-bit reg for a char is good if the caller wrote the full register (which the unofficial extension to the calling convention requires anyway), or on CPUs that don't rename the low 8 bits separately from the rest of the register (everything other than P6-family: SnB-family only renames high-8 regs, except for original Sandybridge itself. Why doesn't GCC use partial registers?)
PS: there's no such instruction as movd xmm0, xmm0, only a different form of movq xmm0, xmm0, which yes would zero-extend the low 64 bits of an XMM register into the full register.
If you want to see various compiler attempts to zero-extend the low dword, with/without SSE4.1 insertps, look at the asm for __m128 foo(float f) { return _mm_set_ss(f); } in the Godbolt link above. E.g. with just SSE2, zero a register with pxor, then movss xmm1, xmm0. Otherwise, insertps, or xor-zero and blendps.
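To see the zero-extension semantics of _mm_set_ss from C, here is a minimal check of my own (function name is made up):

```c
#include <emmintrin.h>

/* Store _mm_set_ss(f) to memory so the lane values can be inspected:
   lane 0 holds f, lanes 1-3 are zeroed (unlike a raw movss register merge). */
static void set_ss_demo(float f, float out[4]) {
    _mm_storeu_ps(out, _mm_set_ss(f));  /* out = { f, 0, 0, 0 } */
}
```

Storing _mm_set_ss(3.0f) gives { 3, 0, 0, 0 }, which is why compilers need the pxor/movss or insertps sequences mentioned above to implement it.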
Checking if TWO SSE registers are not both zero without destroying them
I learned something useful from this question. Let's first look at some scalar code
extern void foo2(int x, int y);
void foo(int x, int y) {
if((x || y)!=0) foo2(x,y);
}
Compile this with gcc -O3 -S -masm=intel test.c and the important assembly is
mov eax, edi ; edi = x, esi = y -> copy x into eax
or eax, esi ; eax = x | y and set zero flag in FLAGS if zero
jne .L4 ; jump not zero
Now let's look at testing SIMD registers for zero. Unlike scalar code there is no SIMD FLAGS register. However, with SSE4.1 there are SIMD test instructions which can set the zero flag (and carry flag) in the scalar FLAGS register.
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
__m128i z = _mm_or_si128(x,y);
if (!_mm_testz_si128(z,z)) foo2(x,y);
}
Compile with c99 -msse4.1 -O3 -masm=intel -S test_SSE.c and the important assembly is
movdqa xmm2, xmm0 ; xmm0 = x, xmm1 = y, copy x into xmm2
por xmm2, xmm1 ; xmm2 = x | y
ptest xmm2, xmm2 ; set zero flag if zero
jne .L4 ; jump not zero
Notice that this takes one more instruction, because the packed bitwise OR does not set the zero flag. Notice also that both the scalar version and the SIMD version need an additional register (eax in the scalar case and xmm2 in the SIMD case). So to answer your question: your current solution is the best you can do.
However, if you do not have a processor with SSE4.1 or better, an alternative which needs only SSE2 is to use _mm_movemask_epi8:
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(_mm_or_si128(x,y))) foo2(x,y);
}
The important assembly is
movdqa xmm2, xmm0
por xmm2, xmm1
pmovmskb eax, xmm2
test eax, eax
jne .L4
Notice that this needs one more instruction than the SSE4.1 ptest version.
Until now I had been using the pmovmskb instruction because its latency is better than ptest's on pre-Sandy-Bridge processors. However, that was before Haswell: on Haswell the latency of pmovmskb is worse than the latency of ptest. They both have the same throughput, but in this case that is not really important. What's important (and what I did not realize before) is that pmovmskb does not set the FLAGS register and so requires another instruction. So now I'll be using ptest in my critical loop. Thank you for your question.
Edit: as suggested by the OP there is a way this can be done without using another SSE register.
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) foo2(x,y);
}
The relevant assembly from GCC is:
pmovmskb eax, xmm0
pmovmskb edx, xmm1
or edx, eax
jne .L4
Instead of using another xmm register, this uses two scalar registers. Note that fewer instructions do not necessarily mean better performance; which of these solutions is best is something you have to benchmark to find out.
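One caveat worth keeping in mind when testing: pmovmskb reads only the sign bit of each byte, so the movemask-based variants are interchangeable with the ptest version only when x and y are compare results (each byte all-ones or all-zero), as they typically are in this kind of code. A small harness of my own (function names made up) checking that the three variants agree on such inputs:

```c
#if defined(__GNUC__) && !defined(__SSE4_1__)
#pragma GCC target("sse4.1")   /* allow _mm_testz_si128 without -msse4.1 */
#endif
#include <smmintrin.h>  /* SSE4.1 for _mm_testz_si128 */

/* por + ptest, as in the answer's first SIMD version */
static int viaPtest(__m128i x, __m128i y) {
    __m128i z = _mm_or_si128(x, y);
    return !_mm_testz_si128(z, z);
}

/* por + pmovmskb, the SSE2-only version */
static int viaOneMask(__m128i x, __m128i y) {
    return _mm_movemask_epi8(_mm_or_si128(x, y)) != 0;
}

/* two pmovmskb results OR'ed in scalar registers, the OP's suggestion */
static int viaTwoMasks(__m128i x, __m128i y) {
    return (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) != 0;
}
```

On compare-mask inputs (e.g. x produced by pcmpeqb, y all-zero) all three return the same answer; on arbitrary data such as a vector holding a single 0x01 byte, only viaPtest would see it.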
Can I check the values of XMM or YMM registers in Visual C++ breakpoint conditions?
No. XMM registers cannot be compared directly in a breakpoint condition, nor does an expression like XMM0 != XMM1 work as a breakpoint expression.
SSE2 test xmm bitmask directly without using 'pmovmskb'
It's generally not worth using SSE4.1 ptest xmm0,xmm0
on a pcmpeqb
result, especially not if you're branching.
pmovmskb is 1 uop, and cmp or test can macro-fuse with jnz into another single uop on both Intel and AMD CPUs: a total of 2 uops to branch on a pcmpeqb result with pmovmskb + test/jcc.
But ptest is 2 uops, and its 2nd uop can't macro-fuse with a following branch: a total of 3 uops to branch on a vector with ptest + jcc.
It's break-even when you can use ptest
directly, without needing a pcmp
, e.g. testing any / all bits in the whole vector (or with a mask, some bits). And actually a win if you use it for cmov or setcc instead of a branch. It's also a win for code-size, even though same number of uops.
You can amortize the checking over multiple vectors, e.g. por some vectors together and then check that all of the bytes are zero. Or pminub some vectors together and then check for any zeros. (glibc string functions like strlen and strchr use this trick to check a whole cache line of vectors in parallel, sorting out where the match came from only after leaving the loop.)
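The pminub trick can be sketched with intrinsics (my own example, modeled on the glibc approach described above; the function name is made up):

```c
#include <emmintrin.h>

/* Amortized check: fold 64 bytes with unsigned byte-min (pminub), then a
   single pcmpeqb / pmovmskb tells us whether ANY of the 64 bytes is zero.
   A lane of the folded vector is zero iff some input lane there was zero. */
static int anyZeroIn64(const unsigned char *p) {
    __m128i v0 = _mm_loadu_si128((const __m128i *)(p +  0));
    __m128i v1 = _mm_loadu_si128((const __m128i *)(p + 16));
    __m128i v2 = _mm_loadu_si128((const __m128i *)(p + 32));
    __m128i v3 = _mm_loadu_si128((const __m128i *)(p + 48));
    __m128i m  = _mm_min_epu8(_mm_min_epu8(v0, v1), _mm_min_epu8(v2, v3));
    return _mm_movemask_epi8(_mm_cmpeq_epi8(m, _mm_setzero_si128())) != 0;
}
```

Only after this cheap whole-block test fires would a real strlen-style loop go back and locate which vector and byte contained the zero.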
You can combine pcmpeq results instead of raw inputs, e.g. for memchr. In that case you can use pand instead of pminub to get a zero in an element where any input has a zero. Some CPUs run pand on more ports than pminub, so there's less competition for the vector ALUs.
Also note that pmovmskb zero-extends into EAX; you can test eax, eax instead of wasting a prefix byte to test only AX.
XMM register 0 not being used
You are misunderstanding something, probably the placeholders in the manual. When an instruction description says xmm1 or xmm2, it usually means any xmm register; the number just indicates the operand position.
For example, ADDPS xmm1, xmm2/m128 can add two arbitrary xmm registers, or add a memory operand to an arbitrary xmm register.