Position of Least Significant Bit That Is Set

Position of least significant bit that is set

Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:

unsigned int v;  // find the number of trailing zeros in 32-bit v 
int r;           // result goes here
static const int MultiplyDeBruijnBitPosition[32] = 
{
  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 
  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];

Helpful references:

"Using de Bruijn Sequences to Index a 1 in a Computer Word" - Explanation about why the above code works.
"Board Representation > Bitboards > BitScan" - Detailed analysis of this problem, with a particular focus on chess programming

How does clearing least significant bit repeatedly work in this algorithm?

c - 1 unsets the least significant bit in the binary representation of c and sets all the unset bits to the right of that bit.

When you binary and c - 1 with c you effectively unset all the bits to the right of the least significant set bit and also the least significant set bit. In other words the least significant set bit in c and everything to its right become zeros.

You count this as one, and rightly so. Because it was just one set bit fomr a ^ b.

Now, you continue this operation until c becomes zero and the number of operations is the number of set bits in c which is the number of different bits between a and b.

To give you an example for what c - 1 does to the binary representation of c:

c = 6, in binary 110
c-1 = 5, in binary 101
(c-1) & (c) = 110 & 101 = 100
The above is one iteration of the loop

Next iteration
c = 4, in binary 100
c-1 = 3, in binary 011
(c-1) & (c) = 100 & 101 = 0

The above successfully counts the number of set bits in 6.

This optimization helps you to improve the algorithm compared to when you right shift the number at each iteration and check whether the least significant bit is set. In the former case you operate in the number of positions where the most significant set bit is. Say for a power of 2 number, 2^7, you iterate 8 times until the number becomes zero. While with the optimized method you iterate according to the number of set bits. For 2^7 it would be just one iteration.

How do I find the position of the least significant bit in which 2 numbers differ in python?

you can XOR the two numbers and XOR will return 0 when both bits are the same and 1 when both bits are different. so the result of 01110 ^ 10010 will be 11100 we can then parse this as a string using rindex to look for the first 1 we encounter from the right hand side as this will be the first place the bits were different. This will give us the index from the left. we then subtract this from the length of the xor string to get the bit count from the right where the difference occurs.

def sdb(a: int, b: int) -> int:
    try:
        xor_string = f'{a^b:b}'
        index = len(xor_string) - xor_string.rindex('1')
    except ValueError as ve:
        #if no bits are different return index 0
        index = 0
    return index

num1 = int('01110',2)
num2 = int('10010',2)
print(sdb(num1, num2))

How to get position of right most set bit in C

Try this

int set_bit = n ^ (n&(n-1));

Explanation:

As noted in this answer, n&(n-1) unsets the last set bit.

So, if we unset the last set bit and xor it with the number; by the nature of the xor operation, the last set bit will become 1 and the rest of the bits will return 0

return index of least significant bit in Python

It is available in the gmpy wrapper for the GNU Multi-Precision library. On my system, it is about 4x faster than the ctypes solution.

>>> import gmpy
>>> gmpy.scan1(136)
3
>>> bin(136)
'0b10001000'

What is the fastest/most efficient way to find the least significant set bit in an integer in R?

If you have long vectors and want to go to C++ then the following code might help you (together with Rcpp and the ffs function from strings.h):

#include <Rcpp.h>
#include <strings.h>
using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::IntegerVector lsb(const IntegerVector x)
{
  IntegerVector res(x.size());
  std::transform(x.begin(), x.end(), res.begin(), ffs);
  return(res-1);  # To start from 0
}

Save the code above as a file, say lsb.cpp, and compile it using sourceCpp("lsb.cpp") from the Rcpp package.

It is slightly faster - at least for longer input vectors where the overhead becomes negligible

> x <- floor(runif(10000,1,2^31))
> microbenchmark::microbenchmark(f(x), g(x), lsb(x))
Unit: microseconds
   expr       min         lq        mean    median         uq       max neval
   f(x)   121.771   129.6360   168.91273   133.241   151.0110  1294.667   100
   g(x) 36165.757 40508.1740 50371.45183 42608.686 60460.5270 94664.255   100
 lsb(x)    25.767    26.8015    34.58856    33.035    35.2385   156.852   100

Efficiently find least significant set bit in a large array?

The best way to find the first set bit within a whole vector (AFAIK) involves finding the first non-zero SIMD element (e.g. a byte or dword), then using a bit-scan on that. (__builtin_ctz / bsf / tzcnt / ffs-1) . As such, ctz(vector) is not itself a useful building block for searching an array, only for after the loop.

Instead you want to loop over the array searching for a non-zero vector, using a whole-vector check involving SSE4.1 ptest xmm0,xmm0 / jz .loop (3 uops), or with SSE2 pcmpeqd v, zero / pmovmskb / cmp eax, 0xffff / je .loop (3 uops after cmp/jcc macro-fusion). https://uops.info/

Once you do find a non-zero vector, pcmpeqb / movmskps / bsf on that to find a dword index, then load that dword and bsf it. Add the start-bit position (CHAR_BIT*4*dword_idx) to the bsf bit-position within that element. This is a fairly long dependency chain for latency, including an integer L1d load latency. But since you just loaded the vector, at least you can be fairly confident you'll hit in cache when you load it again with integer. (If the vector was generated on the fly, then probably still best to store / reload it and let store-forwarding work, instead of trying to generate a shuffle control for vpermilps/movd or SSSE3 pshufb/movd/movzx ecx, al.)

The loop problem is very much like strlen or memchr, except we're rejecting a single value (0) and looking for anything else. Still, we can take inspiration from hand-optimized asm strlen / memchr implementations like glibc's, for example loading multiple vectors and doing one check to see if any of them have what they're looking for. (For strlen, combine with pminub to get a 0 if any element is 0. For pcmpeqb compare results, OR for memchr). For our purposes, the reduction operation we want is OR - any non-zero input will make the output non-zero, and bitwise boolean ops can run on any vector ALU port.

(If the expected first-bit-position isn't very high, it's not worth being too aggressive with this: if the first set bit is in the first vector, sorting things out between 2 vectors you've loaded will be slower. 5000 bits is only 625 bytes, or 19.5 AVX2 __m256i vectors. And the first set bit is probably not always right at the end)

AVX2 version:

This checks pairs of 32-byte vectors (i.e. whole cache lines) for non-zero, and if found then sorts that out into one 64-bit bitmap for a single CTZ operation. That extra shift/OR costs latency in the critical path, but the hope is that we get to the first 1 bit sooner.

Combining 2 vectors down to one with OR means it's not super useful to know which element of the OR result was non-zero. We basically redo the work inside the if. That's the price we pay for keeping the amount of uops low for the actual search part.

(The if body ends with a return, so in the asm it's actually like an if()break, or actually an if()goto out of the loop since it goes to a difference place than the not-found return -1 from falling through out of the loop.)

// untested, especially the pointer end condition, but compiles to asm that looks good
// Assumes len is a multiple of 64 bytes

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// aliasing-safe: p can point to any C data type
int bitscan_avx2(const char *p, size_t len /* in bytes */)
{
    //assert(len % 64 == 0);
    //optimal if p is 64-byte aligned, so we're checking single cache-lines
    const char *p_init = p;
    const char *endp = p + len - 64;
    do {
        __m256i v1 = _mm256_loadu_si256((const __m256i*)p);
        __m256i v2 = _mm256_loadu_si256((const __m256i*)(p+32));
        __m256i or = _mm256_or_si256(v1,v2);
        if (!_mm256_testz_si256(or, or)){        // find the first non-zero cache line
            __m256i v1z = _mm256_cmpeq_epi32(v1, _mm256_setzero_si256());
            __m256i v2z = _mm256_cmpeq_epi32(v2, _mm256_setzero_si256());
            uint32_t zero_map = _mm256_movemask_ps(_mm256_castsi256_ps(v1z));
            zero_map |= _mm256_movemask_ps(_mm256_castsi256_ps(v2z)) << 8;

            unsigned idx = __builtin_ctz(~zero_map);  // Use ctzll for GCC, because GCC is dumb and won't optimize away a movsx
            uint32_t nonzero_chunk;
            memcpy(&nonzero_chunk, p+4*idx, sizeof(nonzero_chunk));  // aliasing / alignment-safe load

            return (p-p_init + 4*idx)*8 + __builtin_ctz(nonzero_chunk);
        }
        p += 64;
    }while(p < endp);
    return -1;
}

On Godbolt with clang 12 -O3 -march=haswell:

bitscan_avx2:
        lea     rax, [rdi + rsi]
        add     rax, -64                 # endp
        xor     ecx, ecx
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        vmovdqu ymm1, ymmword ptr [rdi]  # do {
        vmovdqu ymm0, ymmword ptr [rdi + 32]
        vpor    ymm2, ymm0, ymm1
        vptest  ymm2, ymm2
        jne     .LBB0_2                       # if() goto out of the inner loop
        add     ecx, 512                      # bit-counter incremented in the loop, for (p-p_init) * 8
        add     rdi, 64
        cmp     rdi, rax
        jb      .LBB0_1                  # }while(p<endp)

        mov     eax, -1               # not-found return path
        vzeroupper
        ret

.LBB0_2:
        vpxor   xmm2, xmm2, xmm2
        vpcmpeqd        ymm1, ymm1, ymm2
        vmovmskps       eax, ymm1
        vpcmpeqd        ymm0, ymm0, ymm2
        vmovmskps       edx, ymm0
        shl     edx, 8
        or      edx, eax             # mov ah,dl  would be interesting, but compilers won't do it.
        not     edx                  # one_positions = ~zero_positions
        xor     eax, eax                # break false dependency
        tzcnt   eax, edx             # dword_idx
        xor     edx, edx
        tzcnt   edx, dword ptr [rdi + 4*rax]   # p[dword_idx]
        shl     eax, 5               # dword_idx * 4 * CHAR_BIT
        add     eax, edx
        add     eax, ecx
        vzeroupper
        ret

This is probably not optimal for all CPUs, e.g. maybe we could use a memory-source vpcmpeqd for at least one of the inputs, and not cost any extra front-end uops, only back-end. As long as compilers keep using pointer-increments, not indexed addressing modes that would un-laminate. That would reduce the amount of work needed after the branch (which probably mispredicts).

To still use vptest, you might have to take advantage of the CF result from the CF = (~dst & src == 0) operation against a vector of all-ones, so we could check that all elements matched (i.e. the input was all zeros). Unfortunately, Can PTEST be used to test if two registers are both zero or some other condition? - no, I don't think we can usefully use vptest without a vpor.

Clang decided not to actually subtract pointers after the loop, instead to do more work in the search loop. :/ The loop is 9 uops (after macro-fusion of cmp/jb), so unfortunately it can only run a bit less than 1 iteration per 2 cycles. So it's only managing less than half of L1d cache bandwidth.

But apparently a single array isn't your real problem.

Without AVX

16-byte vectors mean we don't have to deal with the "in-lane" behaviour of AVX2 shuffles. So instead of OR, we can combine with packssdw or packsswb. Any set bits in the high half of a pack input will signed-saturate the result to 0x80 or 0x7f. (So signed saturation is key, not unsigned packuswb which will saturate signed-negative inputs to 0.)

However, shuffles only run on port 5 on Intel CPUs, so beware of throughput limits. ptest on Skylake for example is 2 uops, p5 and p0, so using packsswb + ptest + jz would limit to one iteration per 2 clocks. But pcmpeqd + pmovmskb don't.

Unfortunately, using pcmpeq on each input separately before packing / combining would cost more uops. But would reduce the amount of work left for the cleanup, and if the loop-exit usually involves a branch mispredict, that might reduce overall latency.

2x pcmpeqd => packssdw => pmovmskb => not => bsf would give you a number you have to multiply by 2 to use as a byte offset to get to the non-zero dword. e.g. memcpy(&tmp_u32, p + (2*idx), sizeof(tmp_u32));. i.e. bsf eax, [rdi + rdx*2].

With AVX-512:

You mentioned 512-bit vectors, but none of the CPUs you listed support AVX-512. Even if so, you might want to avoid 512-bit vectors because SIMD instructions lowering CPU frequency, unless your program spends a lot of time doing this, and your data is hot in L1d cache so you can truly benefit instead of still bottlenecking on L2 cache bandwidth. But even with 256-bit vectors, AVX-512 has new instructions that are useful for this:

integer compares (vpcmpb/w/d/q) have a choice of predicate, so you can do not-equal instead of having to invert later with NOT. Or even test-into-register vptestmd so you don't need a zeroed vector to compare against.
compare-into-mask is sort of like pcmpeq + movmsk, except the result is in a k register, still need a kmovq rax, k0 before you can tzcnt.
kortest - set FLAGS according to the OR of two mask registers being non-zero. So the search loop could do vpcmpd k0, ymm0, [rdi] / vpcmpd k1, ymm0, [rdi+32] / kortestw k0, k1

ANDing multiple input arrays

You mention your real problem is that you have up-to-20 arrays of bits, and you want to intersect them with AND and find the first set bit in the intersection.

You may want to do this in blocks of a few vectors, optimistically hoping that there will be a set bit somewhere early.

AND groups of 4 or 8 inputs, accumulating across results with OR so you can tell if there were any 1s in this block of maybe 4 vectors from each input. (If there weren't any 1 bits, do another block of 4 vectors, 64 or 128 bytes while you still have the pointers loaded, because the intersection would definitely be empty if you moved on to the other inputs now). Tuning these chunk sizes depends on how sparse your 1s are, e.g. maybe always work in chunks of 6 or 8 vectors. Power-of-2 numbers are nice, though, because you can pad your allocations out to a multiple of 64 or 128 bytes so you don't have to worry about stopping early.)

(For odd numbers of inputs, maybe pass the same pointer twice to a function expecting 4 inputs, instead of dispatching to special versions of the loop for every possible number.)

L1d cache is 8-way associative (before Ice Lake with 12-way), and a limited number of integer/pointer registers can make it a bad idea to try to read too many streams at once. You probably don't want a level of indirection that makes the compiler loop over an actual array in memory of pointers either.

How can I get the value of the least significant bit in a number?

x &= -x; /* clears all but the lowest bit of x */

Position of Least Significant Bit That Is Set