What Is the Efficient Way to Count Set Bits At a Position or Lower

What is the efficient way to count set bits at a position or lower?

This C++ gets g++ to emit very good x86 asm (see the Godbolt compiler explorer). I expect it will compile efficiently on other 64bit architectures, too, if there's a HW popcount for std::bitset::count to use; otherwise that will always be the slow part. E.g. be sure to use g++ -march=nehalem or higher, or -mpopcnt if you don't want to enable anything else, if you can limit your code to only running on CPUs that support that x86 instruction:

#include <bitset>

int popcount_subset(std::bitset<64> A, int pos) {
    int high_bits_to_eliminate = 63 - pos;
    A <<= (high_bits_to_eliminate & 63);  // puts A[pos] at A[63].

    return (A[63]? ~0ULL : 0) & A.count();  // most efficient way: great code with gcc and clang
    // see the godbolt link for some #ifdefs with other ways to do the check, like
    //     return A[BSET_SIZE-1] ? A.count() : 0;
}

This probably isn't optimal on 32bit architectures, so compare other alternatives if you need to make a 32bit build.

This will work for other sizes of bitset, as long as you do something about the hard-coded 63s, and change the & 63 mask for the shift count into a more general range-check. For optimal performance with strange size bitsets, make a template function with a specialization for size <= register width of the target machine. In that case, extract the bitset to an unsigned type of the appropriate width, and shift to the top of the register instead of the top of the bitset.

You'd expect this to also generate ideal code for bitset<32>, but it doesn't quite. gcc/clang still use 64bit registers on x86-64.
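A rough sketch of that single-register specialization (my code, not from the answer above; it only covers sizes that fit in one 64-bit register, so std::bitset::to_ullong() can't throw):

#include <bitset>
#include <cstddef>
#include <cstdint>

// Sketch only: handles bitset sizes that fit in a 64-bit register.
template <std::size_t N>
int popcount_subset_small(std::bitset<N> A, int pos) {
    static_assert(N <= 64, "this sketch only covers single-register sizes");
    uint64_t v = A.to_ullong();    // extract to an unsigned type of register width
    v <<= (63 - pos) & 63;         // put bit 'pos' at the top of the register
    return (int)((v >> 63 ? ~0ULL : 0) & std::bitset<64>(v).count());
}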

For large bitsets, shifting the whole thing will be slower than just popcounting the words below the one containing pos, and using this on that word. (This is where a vectorized popcount really shines on x86 if you can assume SSSE3 but not the popcnt insn hardware support, or for 32bit targets. AVX2 256bit pshufb is the fastest way to do bulk popcounts, but without AVX2 I think 64bit popcnt is pretty close to a 128-bit pshufb implementation. See the comments for more discussion.)

If you have an array of 64-bit elements, and want to count bits below a certain position in each one separately, then you definitely should use SIMD. The shift parts of this algorithm vectorize, not just the popcnt part. Use psadbw against an all-zero register to horizontal-sum bytes in 64-bit chunks after a pshufb-based popcnt that produces counts for the bits in each byte separately. SSE/AVX doesn't have 64-bit arithmetic right shift, but you can use a different technique to blend on the high bit of each element.


How I came up with this:

The asm instructions you want to get the compiler to output will:

  1. remove the unwanted bits from the 64bit value
  2. test the highest of the wanted bits.
  3. popcount it.
  4. return 0 or popcount, depending on the result of the test. (Branchless or branching implementations both have advantages. If the branch is predictable, a branchless implementation tends to be slower.)

The obvious way to do 1 is to generate a mask ((1<<(pos+1)) -1) and & it. A more efficient way is to left-shift by 63-pos, leaving the bits you want packed at the top of a register.

This also has the interesting side-effect of putting the bit you want to test as the top bit in the register. Testing the sign bit, rather than any other arbitrary bit, takes slightly fewer instructions. An arithmetic right shift can broadcast the sign bit to the rest of the register, allowing for more-efficient-than-usual branchless code.
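As a minimal illustration of those two options for step 1 (the function names here are mine, just for this sketch):

#include <cstdint>

// Option A: build a low mask and AND. Needs a special case because 1ULL << 64 is undefined.
uint64_t keep_bits_up_to_mask(uint64_t x, int pos) {
    uint64_t mask = (pos >= 63) ? ~0ULL : ((1ULL << (pos + 1)) - 1);
    return x & mask;
}

// Option B: shift the wanted bits to the top of the register. No special case,
// and bit 'pos' conveniently ends up as the sign bit.
uint64_t keep_bits_up_to_shift(uint64_t x, int pos) {
    return x << ((63 - pos) & 63);
}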


Doing the popcount is a much-discussed problem, but is actually the trickier part of the puzzle. On x86, there is extremely efficient hardware support for it, but only on recent-enough hardware. On Intel CPUs, the popcnt instruction is only available on Nehalem and newer. I forget when AMD added support.

So to use it safely, you either need to do CPU dispatching, with a fallback that doesn't use popcnt, or build separate binaries that do/don't depend on that CPU feature.

popcount without the popcnt instruction can be done a few ways. One uses SSSE3 pshufb to implement a 4-bit LUT. This is most effective when used on a whole array, rather than a single 64b at a time, though. Scalar bithacks might be best here, and wouldn't require SSSE3 (and so would be compatible with ancient AMD CPUs that have 64bit but not pshufb.)


The Bitbroadcast:

(A[63]? ~0ULL : 0) asks the compiler to broadcast the high bit to all other bit positions, allowing it to be used as an AND-mask to zero (or not) the popcount result. Note that even for large bitset sizes, it's still only masking the output of popcnt, not the bitset itself, so ~0ULL is fine. I used ULL to make sure I wasn't ever asking the compiler to broadcast the bit only to the low 32 bits of a register (which UL would do on Windows, for example).

This broadcast can be done with an arithmetic right shift by 63, which shifts in copies of the high bit.

clang generated this code from the original version. After some prodding from Glenn about different implementations for step 4, I realized that I could lead gcc towards clang's optimal solution by writing the source more like the asm I want. The obvious ((int64_t)something) >> 63 to more directly request an arithmetic right shift would not be strictly portable, because signed right shifts are implementation-defined as either arithmetic or logical. The standard doesn't provide any portable arithmetic right-shift operator. (It's not undefined behaviour, though.) Anyway, fortunately compilers are smart enough: gcc sees the best way once you give it enough of a hint.

This source makes great code on x86-64 and ARM64 with gcc and clang. Both simply use an arithmetic right shift on the input to popcnt (so the shift can run in parallel with the popcnt). It also compiles great on 32bit x86 with gcc, because the masking only happens to a 32bit variable (after multiple popcnt results are added). It's the rest of the function that's nasty on 32bit (when the bitset is larger than a register).
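For reference, the sign-bit broadcast by itself, spelled out in scalar code (a sketch; as discussed above, it relies on the implementation-defined but ubiquitous choice of arithmetic right shift for signed types):

#include <cstdint>

// 0 if the high bit of x is clear, all-ones if it is set.
uint64_t broadcast_sign_bit(uint64_t x) {
    return (uint64_t)((int64_t)x >> 63);   // arithmetic shift copies the sign bit everywhere
}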


Original ternary-operator version with gcc

Compiled with gcc 5.3.0 -O3 -march=nehalem -mtune=haswell (older gcc versions, like 4.9.2, also still emit this):

; the original ternary-operator version.  See below for the optimal version we can coax gcc into emitting.
popcount_subset(std::bitset<64ul>, int):
    ; input bitset in rdi, input count in esi (SysV ABI)
    mov     ecx, esi    ; x86 variable-count shift requires the count in cl
    xor     edx, edx    ; edx = 0
    xor     eax, eax    ; gcc's workaround for popcnt's false dependency on the old value of dest, on Intel
    not     ecx         ; two's complement bithack for 63-pos (in the low bits of the register)
    sal     rdi, cl     ; rdi << ((63-pos) & 63);  same insn as shl (arithmetic == logical left shift)
    popcnt  rdx, rdi
    test    rdi, rdi    ; sets SF if the high bit is set
    cmovs   rax, rdx    ; conditional-move on the sign flag
    ret

See How to prove that the C statement -x, ~x+1, and ~(x-1) yield the same results? for background on gcc's use of the -x == ~x + 1 two's complement identity. (And Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? which tangentially mentions that shl masks the shift count, so we only need the low 6 bits of ecx to hold 63 - pos. Mostly linking that because I wrote it recently and anyone still reading this paragraph might find it interesting.)

Some of those instructions will go away when inlining. (e.g. gcc would generate the count in ecx in the first place.)

With Glenn's multiply instead of ternary operator idea (enabled by USE_mul), gcc does

    shr     rdi, 63
    imul    eax, edi

at the end instead of xor / test / cmovs.


Haswell perf analysis, using microarch data from Agner Fog (Multiply version):

  • mov r,r: 1 fused-domain uop, 0 latency, no execution unit
  • xor-zeroing: 1 fused-domain uop, no execution unit
  • not: 1 uop for p0/p1/p5/p6, 1c latency, 1 per 0.25c throughput
  • shl (aka sal) with count in cl: 3 uops for p0/p6: 2c latency, 1 per 2c throughput. (Agner Fog's data indicates that IvyBridge only takes 2 uops for this, strangely.)
  • popcnt: 1 uop for p1, 3c latency, 1 per 1c throughput
  • shr r,imm: 1 uop for p0/p6, 1c latency. 1 per 0.5c throughput.
  • imul r,r: 1uop for p1, 3c latency.
  • not counting the ret

Totals:

  • 9 fused-domain uops, can issue in 2.25 cycles (in theory; uop cache-line effects usually bottleneck the frontend slightly).
  • 4 uops (shifts) for p0/p6. 2 uops for p1. 1 any-ALU-port uop. Can execute at one per 2c (saturating the shift ports), so the frontend is the worst bottleneck.

Latency: critical path from when the bitset is ready to when the result is ready: shl(2) -> popcnt(3) -> imul(3). Total 8 cycles. Or 9c from when pos is ready, because the not adds an extra 1c of latency for it.

The optimal bitbroadcast version replaces shr with sar (same perf), and imul with and (1c latency instead of 3c, runs on any port). So the only perf change is reducing the critical path latency to 6 cycles. Throughput is still bottlenecked on the frontend. and being able to run on any port doesn't make a difference, unless you're mixing this with code that bottlenecks on port1 (instead of looking at the throughput for running just this code in a tight loop).

cmov (ternary operator) version: 11 fused-domain uops (frontend: one per 2.75c). execution units: still bottlenecked on the shift ports (p0/p6) at one per 2c. Latency: 7c from bitset to result, 8c from pos to result. (cmov is 2c latency, 2 uops for any of p0/p1/p5/p6.)


Clang has some different tricks up its sleeve: Instead of test/cmovs, it generates a mask of either all-ones or all-zeros by using an arithmetic right-shift to broadcast the sign bit to all positions of a register. I love it: Using and instead of cmov is more efficient on Intel. It still has the data-dependency and does the work for both sides of the branch (which is the main downside to cmov in general), though. Update: with the right source code, gcc will use this method, too.

clang 3.7 -O3 -Wall -march=nehalem -mtune=haswell

popcount_subset(std::bitset<64ul>, int):
    mov     ecx, 63
    sub     ecx, esi    ; larger code size, but faster on CPUs without mov-elimination
    shl     rdi, cl     ; rdi << ((63-pos) & 63)
    popcnt  rax, rdi    ; doesn't start a fresh dep chain before this, like gcc does
    sar     rdi, 63     ; broadcast the sign bit
    and     eax, edi    ; eax = 0 or its previous value
    ret

sar / and replaces xor / test / cmov, and cmov is a 2-uop instruction on Intel CPUs, so that's really nice. (For the ternary-operator version).

Clang still does the sar / and trick instead of an actual imul when using the multiply source version, or the "bitbroadcast" source version. So those help gcc without hurting clang. (sar/and is definitely better than shr/imul: 2c less latency on the critical path.) The pow_of_two_sub version does hurt clang (see the first godbolt link: omitted from this answer to avoid clutter with ideas that didn't pan out).

The mov ecx, 63 / sub ecx, esi is actually faster on CPUs without mov-elimination for reg,reg moves (zero latency and no execution port, handled by register renaming). This includes Intel pre-IvyBridge, but not more recent Intel and AMD CPUs.

Clang's mov imm / sub method puts only one cycle of latency for pos onto the critical path (beyond the bitset->result latency), instead of two for a mov ecx, esi / not ecx on CPUs where mov r,r has 1c latency.


With BMI2 (Haswell and later), an optimal ASM version can save a mov to ecx. Everything else works the same, because shlx masks its shift-count input register down to the operand-size, just like shl.

x86 shift instructions have crazy CISC semantics where if the shift count is zero, the flags aren't affected. So variable-count shift instructions have a (potential) dependency on the old value of the flags. "Normal" x86 shl r, cl decodes to 3 uops on Haswell, but BMI2 shlx r, r, r is only 1. So it's too bad that gcc still emits sal with -march=haswell, instead of using shlx (which it does use in some other cases).

; hand-tuned BMI2 version using the NOT trick and the bitbroadcast
popcount_subset(std::bitset<64ul>, int):
    not     esi            ; the low 6 bits hold 63-pos.  gcc's two's complement trick
    xor     eax, eax       ; break false dependency on Intel.  maybe not needed when inlined.
    shlx    rdi, rdi, rsi  ; rdi << ((63-pos) & 63)
    popcnt  rax, rdi
    sar     rdi, 63        ; broadcast the sign bit: rdi = 0 or -1
    and     eax, edi       ; eax = 0 or its previous value
    ret

Perf analysis for Intel Haswell: 6 fused-domain uops (frontend: one per 1.5c). Execution units: 2 p0/p6 shift uops. 1 p1 uop. 2 any-port uops: (one per 1.25c from total execution port limits). Critical path latency: shlx(1) -> popcnt(3) -> and(1) = 5c bitset->result. (or 6c from pos->result).

Note that when inlining, a human (or smart compiler) could avoid the need for the xor eax, eax. It's only there because of popcnt's false dependency on the output register (on Intel), and we need the output in eax (which the caller may have used recently for a long dep chain). With -mtune=bdver2 or something, gcc won't zero the register it's going to use for popcnt output.

When inlining, we could use an output register that already has to be ready at least as early as popcnt's source reg to avoid the problem. Compilers will do an in-place popcnt rdi,rdi when the source isn't needed later, but that's not the case here. Instead, we can pick another register that already has to be ready before the source. popcnt's input depends on 63-pos, and we can clobber it, so popcnt rsi,rdi's dependency on rsi can't delay it. Or if we had 63 in a register, we could popcnt rsi,rdi / sarx rax, rsi, reg_63 / and eax, esi. Or BMI2 3-operand shift instructions would also let us not clobber inputs in case they're needed afterwards.


This is so light-weight that loop overhead and setting up the input operands / storing the results are going to be major factors. (And the 63-pos can optimize away with a compile-time constant, or into wherever a variable count comes from.)


The Intel compiler amusingly shoots itself in the foot and doesn't take advantage of the fact that A[63] is the sign bit: it does shl / bt rdi, 63 / jc. It even sets up the branches in a really dumb way. It could zero eax, and then jump over popcnt or not based on the sign flag set by shl.

An optimal branching implementation, starting from ICC13 output from -O3 -march=corei7 on godbolt:

; hand-tuned, not compiler output
    mov     ecx, esi    ; ICC uses neg/add/mov :/
    not     ecx
    xor     eax, eax    ; breaks the false dep, or is the return value in the taken-branch case
    shl     rdi, cl
    jns     .bit_not_set
    popcnt  rax, rdi
.bit_not_set:
    ret

That's pretty much optimal: The A[pos] == true case has one not-taken branch. It doesn't save very much over the branchless method, though.

If the A[pos] == false case is more common: jump over a ret instruction, to a popcnt / ret. (Or after inlining: jump to a block at the end that does the popcnt and jumps back).

Can someone explain why this works to count set bits in an unsigned integer?
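(The loop under discussion isn't quoted above; it's presumably the classic v &= v - 1 counting loop, roughly this sketch:)

unsigned int count_set_bits(unsigned int v) {
    unsigned int c;           // accumulates the count
    for (c = 0; v; c++) {
        v &= v - 1;           // clears the least significant set bit of v
    }
    return c;
}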

Walkthrough

Let's walk through the loop with an example: let's set v = 42 which is 0010 1010 in binary.

  • First iteration: c=0, v=42 (0010 1010).
    Now v-1 is 41 which is 0010 1001 in binary.
    Let's compute v & v-1:

      0010 1010
    & 0010 1001
      ---------
      0010 1000

    Now v&v-1's value is 0010 1000 in binary or 40 in decimal. This value is stored into v.

  • Second iteration: c=1, v=40 (0010 1000). Now v-1 is 39 which is 0010 0111 in binary. Let's compute v & v-1:

      0010 1000
    & 0010 0111
      ---------
      0010 0000

    Now v&v-1's value is 0010 0000 which is 32 in decimal. This value is stored into v.

  • Third iteration: c=2, v=32 (0010 0000). Now v-1 is 31 which is 0001 1111 in binary. Let's compute v & v-1:

      0010 0000
    & 0001 1111
      ---------
      0000 0000

    Now v&v-1's value is 0.

  • Fourth iteration: c=3, v=0. The loop terminates. c contains 3, which is the number of bits set in 42.

Why it works

You can see that in the binary representation of v-1, the lowest set bit of v (i.e. the rightmost bit that is a 1) is flipped from 1 to 0, and all the bits to the right of it are flipped from 0 to 1.

When you do a bitwise AND between v and v-1, the bits to the left of the lowest set bit are the same in v and v-1, so the bitwise AND leaves them unchanged. All bits from the lowest set bit rightwards (including that bit itself) differ between the two values, so the resulting bits will be 0.

In our original example of v=42 (0010 1010), the lowest set bit is the second bit from the right. You can see that v-1 has the same bits as 42 except the last two: the 0 became a 1 and the 1 became a 0.

Similarly, for v=40 (0010 1000) the lowest set bit is the fourth bit from the right. When computing v-1 (0010 0111) you can see that the upper four bits remain unchanged while the lower four bits are inverted (zeroes became ones and ones became zeroes).

The effect of v = v & (v-1) is therefore to clear the lowest set bit of v and leave the rest unchanged. When all bits have been cleared this way, v is 0 and we have counted every set bit.

How to count the number of set bits in a 32-bit integer?

This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. Instructions like x86's popcnt (on CPUs where it's supported) will almost certainly be fastest for a single integer. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed - hardware popcount is normally fast if it exists at all.).

The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.

Your compiler may know how to do something that's good for the specific CPU you're compiling for, e.g. C++20 std::popcount(), or C++ std::bitset<32>::count(), as a portable way to access builtin / intrinsic functions (see another answer on this question). But your compiler's choice of fallback for target CPUs that don't have hardware popcnt might not be optimal for your use-case. Or your language (e.g. C) might not expose any portable function that could use a CPU-specific popcount when there is one.
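For example, a couple of the portable options mentioned above (std::popcount needs a C++20 compiler; std::bitset::count works on older standards):

#include <bit>        // std::popcount, C++20
#include <bitset>
#include <cstdint>

int count_bits_cpp20(uint32_t x) {
    return std::popcount(x);                  // compiles to a hardware popcnt where available
}

int count_bits_cpp11(uint32_t x) {
    return (int)std::bitset<32>(x).count();   // also maps to the builtin / popcnt when enabled
}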



Portable algorithms that don't need (or benefit from) any HW support

A pre-populated table lookup method can be very fast if your CPU has a large cache and you are doing lots of these operations in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory. (Look up each byte separately to keep the table small.) If you want popcount for a contiguous range of numbers, only the low byte is changing for groups of 256 numbers, making this very good.
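A byte-at-a-time lookup-table version might look like this sketch (the table and function names are mine, not a standard API):

#include <cstdint>

static uint8_t popcount_table[256];       // popcount of every possible byte value

void init_popcount_table(void) {          // call once before use; table[0] is already 0
    for (int i = 1; i < 256; i++)
        popcount_table[i] = (uint8_t)((i & 1) + popcount_table[i >> 1]);
}

int popcount32_lut(uint32_t x) {
    return popcount_table[x & 0xff]
         + popcount_table[(x >> 8) & 0xff]
         + popcount_table[(x >> 16) & 0xff]
         + popcount_table[x >> 24];
}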

If you know that your bytes will be mostly 0's or mostly 1's then there are efficient algorithms for these scenarios, e.g. clearing the lowest set bit with a bithack in a loop until the value becomes zero.

I believe a very good general-purpose algorithm is the following, known as the 'parallel' or 'variable-precision SWAR' algorithm. I have expressed this in a C-like pseudo-language; you may need to adjust it to work for a particular language (e.g. using uint32_t for C++, or >>> in Java):

GCC10 and clang 10.0 can recognize this pattern / idiom and compile it to a hardware popcnt or equivalent instruction when available, giving you the best of both worlds. (https://godbolt.org/z/qGdh1dvKK)

int numberOfSetBits(uint32_t i)
{
    // Java: use int, and use >>> instead of >>. Or use Integer.bitCount()
    // C or C++: use uint32_t
    i = i - ((i >> 1) & 0x55555555);                 // add pairs of bits
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // quads
    i = (i + (i >> 4)) & 0x0F0F0F0F;                 // groups of 8
    return (i * 0x01010101) >> 24;                   // horizontal sum of bytes
}

For JavaScript: coerce to integer with |0 for performance: change the first line to i = (i|0) - ((i >> 1) & 0x55555555);

This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it. (Its performance is not data-dependent on normal CPUs where all integer operations including multiply are constant-time. It doesn't get any faster with "simple" inputs, but it's still pretty decent.)

References:

  • https://graphics.stanford.edu/~seander/bithacks.html
  • https://en.wikipedia.org/wiki/Hamming_weight
  • http://gurmeet.net/puzzles/fast-bit-counting-routines/
  • http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count)


How this SWAR bithack works:

i = i - ((i >> 1) & 0x55555555);

The first step is an optimized version of masking to isolate the odd / even bits, shifting to line them up, and adding. This effectively does 16 separate additions in 2-bit accumulators (SWAR = SIMD Within A Register). Like (i & 0x55555555) + ((i>>1) & 0x55555555).

The next step adds adjacent pairs of those 16x 2-bit accumulators (the odd-numbered ones shifted down onto the even-numbered ones), producing 8x 4-bit sums. The i - ... optimization isn't possible this time, so it just masks before / after shifting. Using the same 0x33... constant both times, instead of 0xcc... before shifting, is a good thing when compiling for ISAs that need to construct 32-bit constants in registers separately.

The final shift-and-add step of (i + (i >> 4)) & 0x0F0F0F0F widens to 4x 8-bit accumulators. It masks after adding instead of before, because the maximum value in any 4-bit accumulator is 4, if all 4 of the corresponding input bits were set. 4+4 = 8, which still fits in 4 bits, so carry between nibble elements is impossible in i + (i >> 4).

So far this is just fairly normal SIMD using SWAR techniques with a few clever optimizations. Continuing on with the same pattern for 2 more steps can widen to 2x 16-bit then 1x 32-bit counts. But there is a more efficient way on machines with fast hardware multiply:

Once we have few enough "elements", a multiply with a magic constant can sum all the elements into the top element. In this case, byte elements. Multiply is done by left-shifting and adding, so a multiply of x * 0x01010101 results in x + (x<<8) + (x<<16) + (x<<24). Our 8-bit elements are wide enough (and holding small enough counts) that this doesn't produce carry into the top 8 bits.

A 64-bit version of this can do 8x 8-bit elements in a 64-bit integer with a 0x0101010101010101 multiplier, and extract the high byte with >>56. So it doesn't take any extra steps, just wider constants. This is what GCC uses for __builtin_popcountll on x86 systems when the hardware popcnt instruction isn't enabled. If you can use builtins or intrinsics for this, do so to give the compiler a chance to do target-specific optimizations.
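The 64-bit widening described above looks roughly like this (same structure, wider constants; my sketch, not GCC's exact fallback):

#include <cstdint>

int numberOfSetBits64(uint64_t i) {
    i = i - ((i >> 1) & 0x5555555555555555ULL);                            // pairs of bits
    i = (i & 0x3333333333333333ULL) + ((i >> 2) & 0x3333333333333333ULL);  // 4-bit sums
    i = (i + (i >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            // 8-bit sums
    return (int)((i * 0x0101010101010101ULL) >> 56);                       // sum bytes into the top byte
}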



With full SIMD for wider vectors (e.g. counting a whole array)

This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)

However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16 entry table held in a vector register).
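A sketch of that nibble-LUT idea with SSSE3 intrinsics; it only produces per-byte counts, so you'd still follow it with a psadbw-style horizontal sum (e.g. _mm_sad_epu8 against an all-zero vector), as the libraries linked below do:

#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8

// Returns a vector holding the popcount of each byte of v.
static inline __m128i popcount_bytes_ssse3(__m128i v) {
    const __m128i lut  = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);  // popcounts of 0..15
    const __m128i mask = _mm_set1_epi8(0x0f);
    __m128i lo = _mm_and_si128(v, mask);                      // low nibble of each byte
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);   // high nibble of each byte
    return _mm_add_epi8(_mm_shuffle_epi8(lut, lo),            // table lookup per nibble
                        _mm_shuffle_epi8(lut, hi));
}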

On Intel CPUs, the hardware 64bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.

  • https://github.com/WojciechMula/sse-popcount state-of-the-art x86 SIMD popcount for SSSE3, AVX2, AVX512BW, AVX512VBMI, or AVX512 VPOPCNT. Using Harley-Seal across vectors to defer popcount within an element. (Also ARM NEON)
  • Counting 1 bits (population count) on large data using AVX-512 or AVX-2
  • related: https://github.com/mklarqvist/positional-popcount - separate counts for each bit-position of multiple 8, 16, 32, or 64-bit integers. (Again, x86 SIMD including AVX-512 which is really good at this, with vpternlogd making Harley-Seal very good.)

Position of nearest and farthest set bit, Count of set bits of an integer

The number of bits set is found fastest using a built-in function or inline assembler, if your CPU supports it: e.g. __builtin_popcount() (GCC, which uses the x86 population-count instruction when available).

Whether - in your question - you intended "set" to apply to MSB or LSB is unclear. Trivially, bits are usually numbered from 0 as the LSB through sizeof(Type)*8-1 for the MSB. That makes the most significant bit in a 32 bit int bit "31", and the least significant always bit 0.

If you mean the most significant set bit: the fastest way to find it is similarly with a built-in, e.g. __builtin_clz() (GCC for x86 CLZ count-leading-zeros instruction): just subtract the result from the index of your most significant bit (i.e. likely 31). Similarly, __builtin_ctz can count trailing zeros to tell you the index of the least significant set bit, if any.
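For example, with the GCC/Clang builtins mentioned above (note that __builtin_clz and __builtin_ctz are undefined for an input of 0, so callers must handle that case separately):

#include <climits>

int highest_set_bit_index(unsigned int x) {   // 0-based index of the most significant set bit
    return (int)(sizeof(unsigned int) * CHAR_BIT) - 1 - __builtin_clz(x);
}

int lowest_set_bit_index(unsigned int x) {    // 0-based index of the least significant set bit
    return __builtin_ctz(x);
}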


Those builtins won't be available for python, but there are likely to be alternatives for at least some, e.g.

  • int.bit_count() gives you a population count (Python 3.10+).
  • int.bit_length() tells you the number of bits needed to represent a value, which will be one more than the 0-based position of the most significant set bit

Python answers will be different if you're interested in arbitrary length integers...

Efficiently finding the position of the k'th set bit in a bitset

If you can efficiently query the number of bits set in every range, you can perform binary search on #set_bits(0,i) to find the first index where this value equals k.

It will take O(log(n)*f(n)), where f(n) is the complexity of #set_bits(0,i) op.
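A sketch of that binary search; set_bits_up_to() here is a naive O(i) stand-in for whatever O(f(n)) prefix-count query you actually have:

#include <bitset>
#include <cstddef>

template <std::size_t N>
std::size_t set_bits_up_to(const std::bitset<N>& bs, std::size_t i) {   // #set_bits(0, i), inclusive
    std::size_t c = 0;
    for (std::size_t b = 0; b <= i; ++b) c += bs[b];
    return c;
}

// Position of the k'th set bit (k is 1-based), or -1 if fewer than k bits are set.
template <std::size_t N>
long kth_set_bit(const std::bitset<N>& bs, std::size_t k) {
    long lo = 0, hi = (long)N - 1, ans = -1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        if (set_bits_up_to(bs, (std::size_t)mid) >= k) {   // the k'th set bit is at or below mid
            ans = mid;
            hi = mid - 1;                                  // look for a smaller index
        } else {
            lo = mid + 1;
        }
    }
    return ans;
}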

Set all bits in A which are set in B between given Lth position to Rth Position

Using bitwise operators:
A | (((1 << R) - (1 << (L-1))) & B)

Explanation: Left-shift 1 by R bits to get 2^R. Subtract 2^(L-1) from it to get a number with all 1s in positions L through R (inclusive, 1-based). Bitwise-AND with B to keep only B's bits in that range. Bitwise-OR the result into A to set those bits of A.
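Concretely, with explicit parentheses (1-based L and R, and assuming R < 32 so the shift doesn't overflow):

#include <cstdint>

uint32_t copy_bits_L_to_R(uint32_t A, uint32_t B, int L, int R) {
    uint32_t mask = (1u << R) - (1u << (L - 1));   // ones exactly in positions L..R (1-based)
    return A | (mask & B);                         // OR B's bits from that range into A
}

// Example: A = 0b0000, B = 0b1111, L = 2, R = 3  ->  mask = 0b0110, result = 0b0110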



