VS: unexpected optimization behavior with _BitScanReverse64 intrinsic

AFAICT, the intrinsic leaves garbage in index when the input is zero, weaker than the behaviour of the asm instruction. This is why it has a separate boolean return value and integer output operand.

Despite the index arg being taken by reference, the compiler treats it as output-only.


unsigned char _BitScanReverse64 (unsigned __int32* index, unsigned __int64 mask)
Intel's intrinsics guide documentation for the same intrinsic seems clearer than the Microsoft docs you linked, and sheds some light on what the MS docs are trying to say. But on careful reading, they do both seem to say the same thing, and describe a thin wrapper around the bsr instruction.

Intel documents the BSR instruction as producing an "undefined value" when the input is 0, but setting the ZF in that case. But AMD documents it as leaving the destination unchanged:

AMD's BSF entry in AMD64 Architecture Programmer's Manual, Volume 3: General-Purpose and System Instructions:

... If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. ...

On current Intel hardware, the actual behaviour matches AMD's documentation: it leaves the destination register unmodified when the src operand is 0. Perhaps this is why MS describes it as only setting Index when the input is non-zero (and the intrinsic's return value is non-zero).

On Intel (but maybe not AMD), this goes as far as not even truncating a 64-bit register to 32-bit. e.g. mov rax,-1 ; bsf eax, ecx (with zeroed ECX) leaves RAX=-1 (64-bit), not the 0x00000000ffffffff you'd get from xor eax, 0. But with non-zero ECX, bsf eax, ecx has the usual effect of zero-extending into RAX, leaving for example RAX=3.
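If you want to see this for yourself, a quick test along these lines (GNU C inline asm, my own throwaway code, not from the MS or Intel docs) should reproduce it on hardware that behaves as described above. Intel only guarantees "undefined", so treat the printed value as a curiosity:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t dst = (uint64_t)-1;   // pre-load the full 64-bit destination with all-ones
    uint32_t src = 0;              // zero source: ZF=1, destination "undefined" per Intel
    // 32-bit operand-size bsf, the same situation as the `bsf eax, ecx` example above
    __asm__("bsf %k[s], %k[d]" : [d] "+r"(dst) : [s] "r"(src) : "cc");
    // On CPUs with the leave-unmodified behaviour this prints 0xffffffffffffffff,
    // i.e. not even the usual zero-extension of a 32-bit register write happened.
    printf("dst = %#llx\n", (unsigned long long)dst);
    return 0;
}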


IDK why Intel still hasn't documented it. Perhaps a really old x86 CPU (like original 386?) implements it differently? Intel and AMD frequently go above and beyond what's documented in the x86 manuals in order to not break existing widely-used code (e.g. Windows), which might be how this started.

At this point it seems unlikely that Intel will ever drop that output dependency and leave actual garbage or -1 or 32 for input=0, but the lack of documentation leaves that option open.

Skylake dropped the false dependency for lzcnt and tzcnt (and a later uarch dropped the false dep for popcnt) while still preserving the dependency for bsr/bsf. (Why does breaking the "output dependency" of LZCNT matter?)


Of course, since MSVC optimized away your index = 0 initialization, presumably it just uses whatever destination register it wants, not necessarily the register that held the previous value of the C variable. So even if you wanted to, I don't think you could take advantage of the dst-unmodified behaviour even though it's guaranteed on AMD.

So in C++ terms, the intrinsic has no input dependency on index. But in asm, the instruction does have an input dependency on the dst register, like an add dst, src instruction. This can cause unexpected performance issues if compilers aren't careful.

Unfortunately on Intel hardware, the popcnt / lzcnt / tzcnt asm instructions also have a false dependency on their destination, even though the result never depends on it. Compilers work around this now that it's known, though, so you don't have to worry about it when using intrinsics (unless you have a compiler more than a couple years old, since it was only recently discovered).


You need to check the return value to make sure index is valid, unless you know the input was non-zero. e.g.

unsigned long idx;   // MSVC's prototype takes unsigned long* for Index
if (_BitScanReverse64(&idx, input)) {
    // idx is valid.
    // (MS docs say "Index was set")
} else {
    // input was zero, idx holds garbage.
    // (MS docs don't say Index was even set)
    idx = -1; // might make sense, one lower than the result for bsr(1)
}

If you want to avoid this extra check and branch, you can use the lzcnt instruction via different intrinsics, if you're targeting new-enough hardware (Intel Haswell or AMD Bulldozer or later, IIRC). It "works" even when the input is all-zero, and actually counts leading zeros instead of returning the index of the highest set bit.
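For example, using _lzcnt_u64 from immintrin.h (that's the GCC/clang spelling, where you also need -mlzcnt or a broader -march= like haswell; I believe MSVC accepts the same name, and it definitely has __lzcnt64):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    unsigned long long zero = 0;
    // lzcnt is fully defined for a zero input: it returns the operand width.
    printf("%u\n", (unsigned)_lzcnt_u64(zero));        // 64
    printf("%u\n", (unsigned)_lzcnt_u64(1ULL << 40));  // 23 leading zeros, i.e. 63 - 40
    return 0;
}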

How can a literal 0 and 0 as a variable yield different behavior with the function __builtin_clz?

When you compile with optimization disabled, the compiler doesn't do constant-propagation across statements. That part is a duplicate of Why does integer division by -1 (negative one) result in FPE? - read my answer there, and/or Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?

This is why a literal zero can be different from a variable with value 0: only the variable, with optimization disabled, results in a runtime bsr + xor $31, %reg.
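A minimal repro of the literal-vs-variable difference might look like this (my own sketch; the zero-input result is undefined, so what the variable case prints is whatever bsr + xor happens to produce):

#include <stdio.h>

int main(void) {
    int zero_var = 0;
    printf("%d\n", __builtin_clz(0));         // constant-folded even at -O0; GCC gives the type width (32)
    printf("%d\n", __builtin_clz(zero_var));  // at -O0: runtime bsr + xor $31, result undefined for input 0
    return 0;
}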


As documented in the GCC manual for __builtin_clz

Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

This allows clz / ctz to compile to 31-bsr or bsf instructions respectively, on x86. 31-bsr is implemented as bsr + xor $31, %reg: for bit-indices 0..31, 31 - x == x ^ 31, because 31 is all-ones in the low 5 bits so the subtraction can never borrow. (BSR produces the index of the highest set bit, not the leading-zero count).
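If the xor trick looks too magic, here's a quick sanity check of the identity (my own throwaway check, nothing more):

#include <assert.h>

int main(void) {
    for (int x = 0; x < 32; x++)
        assert(31 - x == (x ^ 31));   // holds for every possible bsr result 0..31
    return 0;
}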

Note that it only says result, not behaviour. It's not C++ UB (where the whole program could do absolutely anything); it's limited to that result, just like in x86 asm. But anyway, it seems that when the input is a compile-time constant 0, GCC produces the type width (32), like x86 lzcnt would, and like clz instructions on other ISAs. (This probably happens in target-independent GIMPLE tree optimization, where constant-propagation through operations, including builtins, is done.)


Intel documents bsf/bsr as "If the content of the source operand is 0, the content of the destination operand is undefined." In real life, Intel hardware implements the same behaviour AMD documents: it leaves the destination unmodified in that case.

But since Intel refuses to document it, compilers won't let you write code which takes advantage of it. GCC does not know or care about that behaviour, and provides no way to take advantage of it. Neither does MSVC, even though its intrinsic takes an output pointer arg and so could easily have worked that way. See VS: unexpected optimization behavior with _BitScanReverse64 intrinsic


With -march=native, GCC can use BMI1 lzcnt directly, which is well-defined for every possible input bit pattern including 0. It directly produces a leading-zero count instead of the index of the first set bit.

(This is why BSR/BSF don't make sense for input=0; there is no index for them to find. Fun fact: bsr %eax, %eax does "work" for eax=0. In asm, the instructions also set ZF according to whether the input was zero so you can detect when the output is "undefined" instead of a separate test+branch before bsr. Or on AMD and everything else in real life, that it left the destination unmodified.)


Further reading about bsf vs. lzcnt and false dependencies

On Intel before Skylake, lzcnt / tzcnt have a false dependency on the output register even though the result never depends on it. (IIRC, Cannon Lake / Ice Lake also dropped the false dep for popcnt.) All of these run on the same execution unit as bsr/bsf.

  • Why does breaking the "output dependency" of LZCNT matter?
  • VS: unexpected optimization behavior with _BitScanReverse64 intrinsic
  • How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?

Find position of first (lowest) set bit in 32-bit number

#ifdef __GNUC__, use __builtin_ctz(unsigned) to Count Trailing Zeros (GCC manual). GCC, clang, and ICC all support it on all target ISAs. (On ISAs where there's no native instruction, it will call a GCC helper function.)

Leading vs. Trailing is when written in printing order, MSB-first, like 8-bit binary 00000010 has 6 leading zeros and one trailing zero. (And when cast to 32-bit binary, will have 24+6 = 30 leading zeros.)

For 64-bit integers, use __builtin_ctzll(unsigned long long). It's unfortunate that GNU C bitscan builtins don't take fixed-width types (especially the leading zeros versions), but unsigned is always 32-bit on GNU C for x86 (although not for AVR or MSP430). unsigned long long is always uint64_t on all GNU C targets I'm aware of.
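A quick usage sketch (values chosen by me; remember the input must be non-zero):

#include <stdio.h>

int main(void) {
    unsigned x = 0x58;                            // binary 0101 1000
    printf("%d\n", __builtin_ctz(x));             // 3: the lowest set bit is bit 3
    printf("%d\n", __builtin_ctzll(1ULL << 40));  // 40
    return 0;
}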


On x86, it will compile to bsf or tzcnt depending on tuning + target options. tzcnt is a single uop with 3 cycle latency on modern Intel, and only 2 uops with 2 cycle latency on AMD (perhaps a bit-reverse to feed an lzcnt uop?) https://agner.org/optimize/ / https://uops.info/. Either way it's directly supported by fast hardware, and is much faster than anything you can do in pure C++. About the same cost as x * 1234567 (on Intel CPUs, bsf/tzcnt has the same cost as imul r, r, imm, in front-end uops, back-end port, and latency.)

The builtin has undefined behaviour for inputs with no bits set, allowing it to avoid any extra checks if it might run as bsf.


In other compilers (specifically MSVC), you might want an intrinsic for TZCNT, like _mm_tzcnt_32 from immintrin.h. (Intel intrinsics guide). Or you might need to include intrin.h (MSVC) or x86intrin.h for non-SIMD intrinsics.

Unlike GCC/clang, MSVC doesn't stop you from using intrinsics for ISA extensions you haven't enabled for the compiler to use on its own.

MSVC also has _BitScanForward / _BitScanReverse for actual BSF/BSR, but the leave-destination-unmodified behaviour that AMD guarantees (and Intel also implements) is still not exposed by these intrinsics, despite their pointer-output API.

  • VS: unexpected optimization behavior with _BitScanReverse64 intrinsic - pointer-output is assumed to always be written :/
  • _BitScanForward _BitScanForward64 missing (VS2017) Snappy - correct headers
  • How to use MSVC intrinsics to get the equivalent of this GCC code?
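Putting the GNU and MSVC spellings together, a portable wrapper might look something like this (a sketch only; the helper name is mine, and the precondition x != 0 matters for both paths):

#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// 0-based index of the lowest set bit; caller must ensure x != 0
static inline int ctz32(uint32_t x)
{
#ifdef __GNUC__
    return __builtin_ctz(x);          // GCC / clang / ICC
#elif defined(_MSC_VER)
    unsigned long idx;
    _BitScanForward(&idx, x);         // return value is 0 only for x == 0, which we exclude
    return (int)idx;
#else
#error "no known bit-scan intrinsic for this compiler"
#endif
}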

TZCNT decodes as BSF on CPUs without BMI1, because its machine-code encoding is rep bsf. They give identical results for non-zero inputs, so compilers can and do always just use tzcnt because that's much faster on AMD. (They're the same speed on Intel, so there's no downside. And on Skylake and later, tzcnt has no false output dependency; BSF does, because it leaves its output unmodified for input=0.)

(The situation is less convenient for bsr vs. lzcnt: bsr returns the bit-index, lzcnt returns the leading-zero count. So for best performance on AMD, you need to know that your code will only run on CPUs supporting BMI1 / TBM so the compiler can use lzcnt.)

Note that with exactly 1 bit set, scanning from either direction will find the same bit. So 31 - lzcnt = bsr is the same in this case as bsf = tzcnt. Possibly useful if porting to another ISA that only has leading-zero count and no bit-reverse instruction.
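That identity is easy to spot-check with the GNU builtins (throwaway code; it only holds because exactly one bit is set):

#include <assert.h>

int main(void) {
    for (int i = 0; i < 32; i++) {
        unsigned x = 1u << i;   // exactly one bit set
        assert(31 - __builtin_clz(x) == __builtin_ctz(x));
    }
    return 0;
}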


Related:

  • Why does breaking the "output dependency" of LZCNT matter? modern compilers generally know to break the false dependency for lzcnt/tzcnt/popcnt. bsf/bsr have one, too; I'd expect GCC to be smart about that as well, but ironically it might not be.
  • How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows? - the pseudocode is not the hardware implementation.
  • https://en.wikipedia.org/wiki/Find_first_set has more about bitscan functions across ISAs. Including POSIX ffs() which returns a 1-based index and has to do extra work to account for the possibility of the input being 0.

Compilers do recognize ffs() and inline it like a builtin (like they do for memcpy or sqrt), but don't always manage to optimize away all the work their canned sequence does to implement it when you actually want a 0-based index. It's especially hard to tell the compiler there's only 1 bit set.
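For reference, the relationship is ffs(x) == (x ? __builtin_ctz(x) + 1 : 0): 1-based, with an input-0 special case the compiler has to keep unless it can prove x is non-zero. A throwaway check, using POSIX ffs from <strings.h>:

#include <strings.h>
#include <assert.h>

int main(void) {
    unsigned tests[] = { 0u, 1u, 2u, 0x58u, 0x80000000u };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        unsigned x = tests[i];
        assert(ffs((int)x) == (x ? __builtin_ctz(x) + 1 : 0));
    }
    return 0;
}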

_BitScanReverse returns 0 when index is 1 which means according to MS no set bits were found

You're misreading the documentation. The return value is 1 or 0, depending on whether any bits in mask are set. The index of the highest set bit is returned separately, through *Index. There's no conflict between the two.
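A concrete example of the two separate pieces of information (MSVC; the values are mine):

#include <intrin.h>
#include <stdio.h>

int main(void) {
    unsigned long idx;
    unsigned char found = _BitScanReverse(&idx, 2);   // mask = 0b10
    printf("%u %lu\n", found, idx);                   // "1 1": a set bit was found, at index 1
    found = _BitScanReverse(&idx, 0);                 // mask = 0
    printf("%u\n", found);                            // "0": no set bits, so idx is not meaningful
    return 0;
}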

Can a C++ Compiler Eliminate a Volatile Local Var that is not Read

So according to the C++ standard, can the compiler eliminate the volatile int c?

No. volatile-qualified objects are used for reading from or writing to hardware, and the side-effects of assigning to a volatile object are observable.

Therefore, in compliance with the constraints on optimizations placed by the so-called "as-if" rule, conforming implementations are not allowed to optimize away c. The "as-if" rule is formally introduced in Paragraph 1.9/1 of the C++11 Standard:

The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.

And the concept of "observable behavior" is defined in paragraph 1.9/8:

The least requirements on a conforming implementation are:

— Access to volatile objects are evaluated strictly according to the rules of the abstract machine.

— At program termination, all data written into files shall be identical to one of the possible results that execution of the program according to the abstract semantics would have produced.

— The input and output dynamics of interactive devices shall take place in such a fashion that prompting output is actually delivered before a program waits for input. What constitutes an interactive device is implementation-defined.

These collectively are referred to as the observable behavior of the program. [...]

Since access to volatile objects must be evaluated strictly according to the rules of the abstract machine, compilers are not allowed to optimize away c and the corresponding assignment operation.
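A minimal illustration (the function name is my own): even at full optimization, the store has to stay in the generated code, because the assignment is an access to a volatile object and therefore observable behaviour:

int square_keep_store(int x)
{
    volatile int c = x * x;   // never read again, but the volatile write may not be removed
    return x;
}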

Find most significant bit (left-most) that is set in a bit array

GCC has __builtin_clz that translates to BSR on x86/x64, CLZ on ARM, etc. and emulates the instruction if the hardware does not implement it.

Visual C++ 2005 and up has _BitScanReverse.
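A small helper combining the two, in the spirit of this answer (a sketch; the name is mine, and x must be non-zero):

#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// 0-based index of the highest (most significant) set bit; caller must ensure x != 0
static inline int msb_index32(uint32_t x)
{
#ifdef __GNUC__
    return 31 - __builtin_clz(x);     // clz counts from the top, so convert to an index
#elif defined(_MSC_VER)
    unsigned long idx;
    _BitScanReverse(&idx, x);
    return (int)idx;
#else
#error "no known bit-scan intrinsic for this compiler"
#endif
}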


