Check XMM register for all zeroes
_mm_testz_si128 is SSE4.1, which isn't supported on some CPUs (e.g. Intel Atom, AMD Phenom). Here is an SSE2-compatible variant:
inline bool isAllZeros(__m128i xmm) {
return _mm_movemask_epi8(_mm_cmpeq_epi8(xmm, _mm_setzero_si128())) == 0xFFFF;
}
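A quick sanity check of the helper (the test values are my own picks; any single set bit anywhere in the register must defeat the all-zero test):

```c
#include <emmintrin.h>
#include <stdbool.h>

/* SSE2-only all-zero test, repeated from above so this example is self-contained */
static bool isAllZeros(__m128i xmm) {
    /* pcmpeqb against zero gives 0xFF per zero byte; movemask collects the MSBs */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(xmm, _mm_setzero_si128())) == 0xFFFF;
}
```

isAllZeros(_mm_setzero_si128()) yields true, while isAllZeros(_mm_set_epi32(0, 0, 0, 1)) yields false, because pcmpeqb produces a 0x00 byte wherever the input byte is non-zero, so the mask falls short of 0xFFFF.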
Faster way to test if xmm/ymm register is zero?
Rather than being "quite slow", your existing approach is actually reasonable. Sure, each individual test has a latency of 4 cycles¹, but if you want the result in a general-purpose register you usually pay a 3-cycle latency for that move anyway (e.g., movmskb also has a latency of 3). In any case, you want to test 8 registers, and you don't simply add the latencies because each test is mostly independent, so uop count and port use will likely end up being more important than the latency of testing a single register, as most of the latencies will overlap with other work.
An approach that is likely to be a bit faster on Intel hardware is to use successive PCMPEQ instructions to test several vectors, and then fold the results together (e.g., if you use PCMPEQQ you effectively have 4 quadword results and need to and-fold them into 1). You can fold either before or after the PCMPEQ, but it would help to know more about how/where you want the results to come up with something better. Here's an untested sketch for 8 registers, xmm1-xmm8, with xmm0 assumed zero and xmm15 holding the pblendvb mask used to select alternate bytes in the last blend instruction.
# test the 2 qwords in each vector against zero
vpcmpeqq xmm11, xmm1, xmm0
vpcmpeqq xmm12, xmm3, xmm0
vpcmpeqq xmm13, xmm5, xmm0
vpcmpeqq xmm14, xmm7, xmm0
# blend the results down into xmm10 (origin register per word shown in comments)
vpblendw xmm10, xmm11, xmm12, 0xAA # 3131 3131
vpblendw xmm13, xmm13, xmm14, 0xAA # 7575 7575
vpblendw xmm10, xmm10, xmm13, 0xCC # 7531 7531
# test the 2 qwords in each vector against zero
vpcmpeqq xmm11, xmm2, xmm0
vpcmpeqq xmm12, xmm4, xmm0
vpcmpeqq xmm13, xmm6, xmm0
vpcmpeqq xmm14, xmm8, xmm0
# blend the results down into xmm11 (origin register per word shown in comments)
vpblendw xmm11, xmm11, xmm12, 0xAA # 4242 4242
vpblendw xmm13, xmm13, xmm14, 0xAA # 8686 8686
vpblendw xmm11, xmm11, xmm13, 0xCC # 8642 8642
# blend xmm10 and xmm11 together into xmm10, byte-wise
# origin bytes
# xmm10 77553311 77553311
# xmm11 88664422 88664422
# res 87654321 87654321
vpblendvb xmm10, xmm10, xmm11, xmm15
# move the mask bits into eax
vpmovmskb eax, xmm10
and al, ah
The intuition is that you test each QWORD in each xmm against zero, giving 16 results for the 8 registers, and then blend those results together into xmm10, ending up with one result per byte, in register order (with all the high-QWORD results before all the low-QWORD results). Then you move those 16 byte masks as 16 bits into eax with pmovmskb, and finally combine the high- and low-QWORD bits for each register inside eax.
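For reference, here is my own translation of that sketch into intrinsics (an untested sketch, not the answer's code; it requires SSE4.1 for pcmpeqq and the blends, and the function name is made up). Build with -msse4.1, or rely on the pragma below on GCC/clang:

```c
#if defined(__GNUC__) && !defined(__SSE4_1__)
#pragma GCC target("sse4.1")   /* allow SSE4.1 intrinsics without -msse4.1 */
#endif
#include <smmintrin.h>  /* SSE4.1: _mm_cmpeq_epi64, _mm_blend_epi16, _mm_blendv_epi8 */

/* Hypothetical helper: returns an 8-bit mask, bit k set iff register x(k+1)
   is all zero. Direct rendering of the asm sketch above. */
static int testEightZero(__m128i x1, __m128i x2, __m128i x3, __m128i x4,
                         __m128i x5, __m128i x6, __m128i x7, __m128i x8) {
    const __m128i zero = _mm_setzero_si128();
    /* per-qword compares: all-ones in each qword that is zero */
    __m128i c1 = _mm_cmpeq_epi64(x1, zero), c3 = _mm_cmpeq_epi64(x3, zero);
    __m128i c5 = _mm_cmpeq_epi64(x5, zero), c7 = _mm_cmpeq_epi64(x7, zero);
    __m128i c2 = _mm_cmpeq_epi64(x2, zero), c4 = _mm_cmpeq_epi64(x4, zero);
    __m128i c6 = _mm_cmpeq_epi64(x6, zero), c8 = _mm_cmpeq_epi64(x8, zero);
    /* word-blend the odd-numbered results down to one vector: 7531 7531 */
    __m128i odd  = _mm_blend_epi16(_mm_blend_epi16(c1, c3, 0xAA),
                                   _mm_blend_epi16(c5, c7, 0xAA), 0xCC);
    /* and the even-numbered ones: 8642 8642 */
    __m128i even = _mm_blend_epi16(_mm_blend_epi16(c2, c4, 0xAA),
                                   _mm_blend_epi16(c6, c8, 0xAA), 0xCC);
    /* byte-blend: odd bytes from 'even', giving bytes 87654321 87654321 */
    __m128i sel = _mm_set1_epi16((short)0xFF00);  /* MSB set in odd bytes */
    __m128i res = _mm_blendv_epi8(odd, even, sel);
    int m = _mm_movemask_epi8(res);  /* bits 8..15 hold the high-qword results */
    return m & (m >> 8) & 0xFF;      /* the 'and al, ah' step */
}
```

With x2 and x5 all-zero and the other six registers non-zero, this returns 0x12 (bits 1 and 4 set); with all eight registers zero it returns 0xFF.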
That looks to me like 16 uops total for 8 registers, so about 2 uops per register. The total latency is reasonable since it is largely a "reduce"-type parallel tree. A limiting factor would be the 6 vpblendw operations, which all go only to port 5 on modern Intel. It would be better to replace 4 of those with VPBLENDD, which is the one "blessed" blend that can run on any of p015. That should be straightforward.
All the ops are simple and fast. The final and al, ah is a partial-register write, but if you mov it into eax afterwards perhaps there is no penalty. You could also do that last line a couple of different ways if that's an issue. This approach also scales naturally to ymm registers, with slightly different folding in eax at the end.
EDIT
A slightly faster ending uses packed shifts to avoid two expensive instructions:
;combine bytes of xmm10 and xmm11 together into xmm10, byte wise
; xmm10 77553311 77553311
; xmm11 88664422 88664422 before shift
; xmm10 07050301 07050301
; xmm11 80604020 80604020 after shift
;result 87654321 87654321 combined
vpsrlw xmm10,xmm10,8
vpsllw xmm11,xmm11,8
vpor xmm10,xmm10,xmm11
;combine the low and high qword results to make sure both are zero
vpsrldq xmm12,xmm10,8        ; shift count is in *bytes*: 8 bytes = 64 bits
vpand   xmm10,xmm10,xmm12    ; VEX vpand takes three operands
vpmovmskb eax,xmm10
This saves 2 cycles by avoiding the 2-cycle vpblendvb and the partial-register penalty of and al, ah. It also removes the dependency on the slow vpmovmskb if you don't need to use the result of that instruction right away.
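The shift-based ending can also be rendered as intrinsics (my own sketch; here odd and even stand for the 77553311/88664422 per-byte result vectors built earlier, and the names are made up):

```c
#include <emmintrin.h>

/* Combine the two per-byte result vectors with shifts instead of pblendvb,
   then fold the high-qword results onto the low ones before pmovmskb. */
static int combineShift(__m128i odd  /* 77553311 77553311 */,
                        __m128i even /* 88664422 88664422 */) {
    __m128i lo = _mm_srli_epi16(odd, 8);   /* 07050301 ... results in even bytes */
    __m128i hi = _mm_slli_epi16(even, 8);  /* 80604020 ... results in odd bytes  */
    __m128i r  = _mm_or_si128(lo, hi);     /* 87654321 87654321 */
    /* bring the high-qword results down; shift count is in bytes (8 = 64 bits) */
    r = _mm_and_si128(r, _mm_srli_si128(r, 8));
    return _mm_movemask_epi8(r) & 0xFF;    /* bit k set iff register k+1 is zero */
}
```

For example, if both inputs are all-ones (all 8 registers zero) this returns 0xFF; if only register 2's compare result is set it returns 0x02.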
¹ Actually it seems to be only on Skylake that PTEST has a latency of 3 cycles; before that it seems to be 2. I'm also not sure about the 1-cycle latency you listed for rcl eax, 1: according to Agner, it seems to be 3 uops and 2 cycles latency/reciprocal throughput on modern Intel.
Test if any byte in an xmm register is 0
You can use _mm_movemask_epi8 (the pmovmskb instruction) to obtain a bit mask from the result of a comparison (the resulting mask contains the most significant bit of each byte in the vector). Testing whether any of the bytes is zero then means testing whether any of the 16 bits in the mask is non-zero.
pxor xmm4, xmm4
pcmpeqb xmm4, [rdi]
pmovmskb eax, xmm4
test eax, eax ; ZF=0 if there are any set bits = any matches
jnz .found_a_zero
After finding a vector with any matches, you can find the first match position with bsf eax, eax to get the bit index in the bitmask, which is also the byte index in the 16-byte vector. Alternatively, you can check for all bytes matching (e.g. as you'd do in memcmp / strcmp) with pcmpeqb / pmovmskb / cmp eax, 0xffff to check that all bits are set, instead of checking for at least one set bit.
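The same search can be sketched with intrinsics (the function name is mine; __builtin_ctz plays the role of bsf and is a GCC/clang builtin):

```c
#include <emmintrin.h>

/* Return the index of the first zero byte in a 16-byte block, or -1 if none.
   Mirrors the pcmpeqb / pmovmskb / bsf sequence above. */
static int firstZeroByte(const void *p) {
    __m128i v  = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    int mask = _mm_movemask_epi8(eq);  /* bit i set iff byte i == 0 */
    if (mask == 0) return -1;
    return __builtin_ctz(mask);        /* count trailing zeros, like bsf */
}
```

For a buffer initialized as char buf[16] = "hello" (the remaining bytes are zero-filled in C), firstZeroByte(buf) returns 5, the index of the terminating NUL.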
Why doesn't gcc zero the upper values of an XMM register when only using the lower value with SS/SD instructions?
The return value is the same width as the args, so no extension is needed. The parts of registers outside the type widths are allowed to hold garbage in x86 and x86-64 calling conventions. (This applies to both GP-integer and vector registers.) Except for an undocumented extension which clang depends on, where callers extend narrow args to 32-bit; clang will skip the movsx instructions in your char example: https://godbolt.org/z/Gv5e4h3Eh. Is a sign or zero extension required when adding a 32bit offset to a pointer for the x86-64 ABI? covers both the high garbage and the unofficial extension to the calling convention.
Since you asked about false dependencies, note that compilers do use movaps xmm,xmm to copy a scalar. (e.g. in GCC's missed-optimization example (a-b) + (a-d) we need to subtract from a twice. It's non-commutative, so we need a copy: https://godbolt.org/z/Tvx19raa3.) Indeed, movss xmm1, xmm0 has a dependency on XMM1 where movaps doesn't, and it would be a false dependency if you didn't actually care about merging with the old high bytes. (When tuning for Pentium III or Pentium M it might make sense to use movss because it was single-uop there, but current GCC with -O3 -m32 -mtune=pentium3 -mfpmath=sse uses movaps, spending the 2nd uop to avoid a false dependency. It wasn't until Core 2 that the SIMD execution units widened to 128-bit for P6-family, matching Pentium 4.)
C integer promotion rules mean that a+b for narrow inputs is equivalent to (int)a + (int)b. In all x86 / x86-64 ABIs, char is a signed type (unlike on ARM, for example), so it needs to be sign-extended to int width, not zero-extended, and definitely not truncated. If you truncated the result again by returning a char, compilers could, if they wanted, just do 8-bit adds. But actually they'll use 32-bit adds and leave whatever high garbage there: https://godbolt.org/z/hGdbecPqv. It's not doing this for dep-breaking / performance, just correctness.
As far as performance, GCC's behaviour of reading the 32-bit reg for a char is good if the caller wrote the full register (which the unofficial extension to the calling convention requires anyway), or on CPUs that don't rename the low 8 bits separately from the rest of the register (everything other than P6-family: SnB-family only renames high-8 regs, except for original Sandybridge itself. Why doesn't GCC use partial registers?)
PS: there's no such instruction as movd xmm0, xmm0, only a different form of movq xmm0, xmm0, which yes would zero-extend the low 64 bits of an XMM register into the full register.
If you want to see various compiler attempts to zero-extend the low dword, with/without SSE4.1 insertps, look at the asm for __m128 foo(float f) { return _mm_set_ss(f); } in the Godbolt link above. E.g. with just SSE2, zero a register with pxor, then movss xmm1, xmm0. Otherwise, insertps, or xor-zero and blendps.
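To see the zero-extension semantics of _mm_set_ss from C, here is a minimal check of my own (function name is made up):

```c
#include <emmintrin.h>

/* Store _mm_set_ss(f) to memory so the lane values can be inspected:
   lane 0 holds f, lanes 1-3 are zeroed (unlike a raw movss register merge). */
static void set_ss_demo(float f, float out[4]) {
    _mm_storeu_ps(out, _mm_set_ss(f));  /* out = { f, 0, 0, 0 } */
}
```

Storing _mm_set_ss(3.0f) gives { 3, 0, 0, 0 }, which is why compilers need the pxor/movss or insertps sequences mentioned above to implement it.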
Checking if TWO SSE registers are not both zero without destroying them
I learned something useful from this question. Let's first look at some scalar code
extern void foo2(int x, int y);
void foo(int x, int y) {
if((x || y)!=0) foo2(x,y);
}
Compile this with gcc -O3 -S -masm=intel test.c and the important assembly is
mov eax, edi ; edi = x, esi = y -> copy x into eax
or eax, esi ; eax = x | y and set zero flag in FLAGS if zero
jne .L4 ; jump not zero
Now let's look at testing SIMD registers for zero. Unlike scalar code there is no SIMD FLAGS register. However, with SSE4.1 there are SIMD test instructions which can set the zero flag (and carry flag) in the scalar FLAGS register.
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
__m128i z = _mm_or_si128(x,y);
if (!_mm_testz_si128(z,z)) foo2(x,y);
}
Compile with c99 -msse4.1 -O3 -masm=intel -S test_SSE.c and the important assembly is
movdqa xmm2, xmm0 ; xmm0 = x, xmm1 = y, copy x into xmm2
por xmm2, xmm1 ; xmm2 = x | y
ptest xmm2, xmm2 ; set zero flag if zero
jne .L4 ; jump not zero
Notice that this takes one more instruction, because the packed bitwise OR does not set the zero flag. Notice also that both the scalar version and the SIMD version need an additional register (eax in the scalar case and xmm2 in the SIMD case). So to answer your question: your current solution is the best you can do.
However, if you do not have a processor with SSE4.1 or better, an alternative which needs only SSE2 is to use _mm_movemask_epi8:
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(_mm_or_si128(x,y))) foo2(x,y);
}
The important assembly is
movdqa xmm2, xmm0
por xmm2, xmm1
pmovmskb eax, xmm2
test eax, eax
jne .L4
Notice that this needs one more instruction than the SSE4.1 ptest version.
Until now I had been using the pmovmskb instruction because its latency is better than ptest's on pre-Sandy-Bridge processors. However, that was before Haswell: on Haswell the latency of pmovmskb is worse than the latency of ptest. They both have the same throughput, but in this case that is not really important. What's important (and what I did not realize before) is that pmovmskb does not set the FLAGS register and so requires another instruction. So now I'll be using ptest in my critical loop. Thank you for your question.
Edit: as suggested by the OP there is a way this can be done without using another SSE register.
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) foo2(x,y);
}
The relevant assembly from GCC is:
pmovmskb eax, xmm0
pmovmskb edx, xmm1
or edx, eax
jne .L4
Instead of using another xmm register, this uses two scalar registers. Note that fewer instructions do not necessarily mean better performance; which of these solutions is best is something you have to benchmark to find out.
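One caveat worth keeping in mind when testing: pmovmskb reads only the sign bit of each byte, so the movemask-based variants are interchangeable with the ptest version only when x and y are compare results (each byte all-ones or all-zero), as they typically are in this kind of code. A small harness of my own (function names made up) checking that the three variants agree on such inputs:

```c
#if defined(__GNUC__) && !defined(__SSE4_1__)
#pragma GCC target("sse4.1")   /* allow _mm_testz_si128 without -msse4.1 */
#endif
#include <smmintrin.h>  /* SSE4.1 for _mm_testz_si128 */

/* por + ptest, as in the answer's first SIMD version */
static int viaPtest(__m128i x, __m128i y) {
    __m128i z = _mm_or_si128(x, y);
    return !_mm_testz_si128(z, z);
}

/* por + pmovmskb, the SSE2-only version */
static int viaOneMask(__m128i x, __m128i y) {
    return _mm_movemask_epi8(_mm_or_si128(x, y)) != 0;
}

/* two pmovmskb results OR'ed in scalar registers, the OP's suggestion */
static int viaTwoMasks(__m128i x, __m128i y) {
    return (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) != 0;
}
```

On compare-mask inputs (e.g. x produced by pcmpeqb, y all-zero) all three return the same answer; on arbitrary data such as a vector holding a single 0x01 byte, only viaPtest would see it.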
Can I check the values of XMM or YMM registers in Visual C++ breakpoint conditions?
No. XMM registers cannot be compared directly in a breakpoint condition, nor does an expression like XMM0 != XMM1 work as a breakpoint expression.
SSE2 test xmm bitmask directly without using 'pmovmskb'
It's generally not worth using SSE4.1 ptest xmm0,xmm0
on a pcmpeqb
result, especially not if you're branching.
pmovmskb is 1 uop, and cmp or test can macro-fuse with jnz into another single uop on both Intel and AMD CPUs: a total of 2 uops to branch on a pcmpeqb result with pmovmskb + test/jcc.
But ptest is 2 uops, and its 2nd uop can't macro-fuse with a following branch: a total of 3 uops to branch on a vector with ptest + jcc.
It's break-even when you can use ptest
directly, without needing a pcmp
, e.g. testing any / all bits in the whole vector (or with a mask, some bits). And actually a win if you use it for cmov or setcc instead of a branch. It's also a win for code-size, even though same number of uops.
You can amortize the checking over multiple vectors, e.g. por some vectors together and then check that all of the bytes are zero. Or pminub some vectors together and then check for any zeros. (glibc string functions like strlen and strchr use this trick to check a whole cache line of vectors in parallel, sorting out where the match came from only after leaving the loop.)
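The pminub trick can be sketched with intrinsics (my own example, modeled on the glibc approach described above; the function name is made up):

```c
#include <emmintrin.h>

/* Amortized check: fold 64 bytes with unsigned byte-min (pminub), then a
   single pcmpeqb / pmovmskb tells us whether ANY of the 64 bytes is zero.
   A lane of the folded vector is zero iff some input lane there was zero. */
static int anyZeroIn64(const unsigned char *p) {
    __m128i v0 = _mm_loadu_si128((const __m128i *)(p +  0));
    __m128i v1 = _mm_loadu_si128((const __m128i *)(p + 16));
    __m128i v2 = _mm_loadu_si128((const __m128i *)(p + 32));
    __m128i v3 = _mm_loadu_si128((const __m128i *)(p + 48));
    __m128i m  = _mm_min_epu8(_mm_min_epu8(v0, v1), _mm_min_epu8(v2, v3));
    return _mm_movemask_epi8(_mm_cmpeq_epi8(m, _mm_setzero_si128())) != 0;
}
```

Only after this cheap whole-block test fires would a real strlen-style loop go back and locate which vector and byte contained the zero.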
You can combine pcmpeq results instead of raw inputs, e.g. for memchr. In that case you can use pand instead of pminub to get a zero in an element where any input has a zero. Some CPUs run pand on more ports than pminub, so there's less competition for the vector ALUs.
Also note that pmovmskb zero-extends into EAX; you can test eax, eax instead of wasting a prefix byte to test only AX.
XMM register 0 not being used
You are misunderstanding something, probably the placeholders in the manual. When an instruction description says xmm1 or xmm2, it usually means any xmm register; the number just indicates the operand position.
For example, ADDPS xmm1, xmm2/m128 can add two arbitrary xmm registers, or add a memory operand to an arbitrary xmm register.