How Can std::bitset Be Faster Than std::vector&lt;bool&gt;?

How can std::bitset be faster than std::vector<bool>?

Measurements on Visual Studio 2010 show that std::bitset is not generally faster than std::vector<bool>. I cannot say exactly why -- only that bitset is implemented significantly differently from the std::vector<bool> full specialization.

std::bitset stores its full content in the object via a

template<size_t _Bits>
class bitset { // ...
    _Ty _Array[_Words + 1]; // the set of bits
};

array, and that makes large bitsets unsuitable to be put on the stack -- which isn't a performance argument per se.
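A tiny program makes the difference visible (a sketch; the exact sizes are implementation-specific, so treat the numbers as ballpark):

#include <bitset>
#include <vector>
#include <iostream>

int main() {
    // All bits live inside the bitset object itself: roughly
    // 10000000 / 8 bytes, far too large for a typical stack frame.
    std::cout << sizeof(std::bitset<10000000>) << '\n';
    // vector<bool> is a small handle; its packed bits live on the heap.
    std::cout << sizeof(std::vector<bool>) << '\n';
}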

vector<bool> doesn't suffer from the stack problem, and testing with sizes of 1e6 and 1e7 suggests that, on my box, querying values in a loop is actually 2x faster with a vector.

Well, the usual timing caveats apply and YMMV, but here's the test code I used, should anyone care to try it themselves:

The output on my box is:

1
vector<bool> loop with a size of 10000000 and 10 iterations*n: 11187 ms
bitset<10000000> loop with 10 iterations*n: 22719 ms
101250010
Press any key to continue . . .

BitMap.cpp

#include "stdafx.h"
#include "BitMap.h"

using namespace std;

// Global var to prevent optimizer from messing things up
volatile size_t ext;

volatile clock_t t1;
volatile clock_t t2;
double delta1;
double delta2;

int main(int argc, _TCHAR* argv[])
{
    ext = 1;
    printf("%u\n", (unsigned)ext);

    vb_t *const vec = new vb_t(bssz);
    bs_t *const bits = new bs_t(); // must put the large bitset on the heap

    const int iter = 10;
    delta1 = 0;
    delta2 = 0;
    for(int o = 0; o < 5; ++o) {
        t1 = clock();
        for(int i = 0; i != 5; ++i)
            bs_loop(iter, *vec);
        t2 = clock();
        delta1 += t2 - t1;

        t1 = clock();
        for(int i = 0; i != 5; ++i)
            bs_loop(iter, *bits);
        t2 = clock();
        delta2 += t2 - t1;
    }

    delta1 /= CLOCKS_PER_SEC;
    delta2 /= CLOCKS_PER_SEC;
    delta1 *= 1000;
    delta2 *= 1000;

    cout << "vector<bool> loop with a size of " << bssz << " and " << iter << " iterations*n: " << delta1 << " ms\n";
    cout << "bitset<" << bssz << "> loop with " << iter << " iterations*n: " << delta2 << " ms\n";

    printf("%u\n", (unsigned)ext);
    delete vec;
    delete bits;
    return 0;
}

BitMap.h

#pragma once
#include <vector>
#include <bitset>

extern volatile size_t ext;
const size_t bssz = size_t(1e7); // 1e7, i.e. about 10 million

using namespace std; // Test code, using here is OK.
typedef vector<bool> vb_t;
typedef bitset<bssz> bs_t;

template<class COLL>
void bs_loop(const int iterations, COLL const& v);

bs_loop.cpp

#include "stdafx.h"
#include "BitMap.h"

template<class COLL>
void bs_loop(const int iterations, COLL const& v)
{
    ext = sizeof(COLL);
    for(size_t i = 0; i != iterations; ++i) {
        ++ext;
        for(size_t j = 0, e = v.size(); j != e; ++j) {
            if(v[j]) {
                --ext;
            }
            else {
                ++ext;
            }
        }
    }
}

template void bs_loop(const int iterations, vb_t const& v);

template void bs_loop(const int iterations, bs_t const& v);

Compiler command line:

/Zi /nologo /W3 /WX- /O2 /Oi /Oy- /D "WIN32" /D "NDEBUG"
/D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS /Gy
/fp:precise /Zc:wchar_t /Zc:forScope /Yu"StdAfx.h" /Fp"Release\BitMap.pch"
/Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd /analyze-
/errorReport:queue

Note the /O2 and the missing /GL (no whole-program optimization).

What is the performance of std::bitset?

Update

It's been ages since I posted this one, but:

I already know that it is easier and clearer than bit-fiddling on an
integer, but is it as fast?

If you are using bitset in a way that actually makes it clearer and cleaner than bit-fiddling, like checking one bit at a time instead of using a bit mask, then you inevitably lose all the benefits that bitwise operations provide, like being able to check whether 64 bits are set at once against a mask, or using FFS instructions to quickly determine which bit is set among 64 bits.
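To make that concrete, here is roughly the kind of whole-word test you give up (a sketch using C++20 <bit>; on older compilers, __builtin_ctzll or _BitScanForward64 would play the FFS role):

#include <bit>
#include <cstdint>

// All 64 bits tested against a mask in one shot:
bool all_set(std::uint64_t bits, std::uint64_t mask) {
    return (bits & mask) == mask;
}

// FFS-style query: index of the lowest set bit (64 if none are set).
int first_set(std::uint64_t bits) {
    return std::countr_zero(bits);
}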

I'm not sure that bitset incurs a penalty in every possible use (e.g., using its bitwise operator&), but if you use it like a fixed-size boolean array, which is pretty much the way I always see people using it, then you generally lose all the benefits described above. We unfortunately can't get that level of expressiveness of just accessing one bit at a time with operator[] and have the optimizer figure out all the bitwise manipulation and FFS and FFZ and so forth going on for us, at least not as of the last time I checked (otherwise bitset would be one of my favorite structures).

Now if you are going to use bitset<N> bits interchangeably with, say, uint64_t bits[N/64], accessing both the same way using bitwise operations, it might be on par (I haven't checked since this ancient post). But then you lose many of the benefits of using bitset in the first place.

for_each method

In the past I think I got into some misunderstandings when I proposed a for_each method to iterate through things like vector<bool>, deque, and bitset. The point of such a method is to utilize the container's internal knowledge to iterate through its elements more efficiently while invoking a functor, just as some associative containers offer a find method of their own instead of using std::find to do a better-than-linear-time search.

For example, an implementation with internal knowledge of these containers could iterate through all the set bits of a vector<bool> or bitset by checking 64 elements at a time with a 64-bit mask when 64 contiguous indices are occupied, and by using FFS instructions when that's not the case.

But an iterator design having to do this type of scalar logic in operator++ would inevitably have to do something considerably more expensive, just by the nature in which iterators are designed in these peculiar cases. bitset lacks iterators outright, and that often pushes people who want to avoid dealing with bitwise logic into using operator[] to check each bit individually in a sequential loop that just wants to find out which bits are set. That, too, is not nearly as efficient as what a for_each method implementation could do.
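As a rough sketch of what such a method could exploit internally -- a hypothetical free function over raw words, using C++20 <bit>, not anything the standard containers actually offer:

#include <bit>
#include <cstddef>
#include <cstdint>

// Invoke f(bit_index) for every set bit, one word at a time.
// All-zero words are skipped with a single compare; within a word,
// countr_zero (an FFS instruction) jumps straight to the next set bit.
template <class F>
void for_each_set_bit(const std::uint64_t* words, std::size_t nwords, F f) {
    for (std::size_t w = 0; w < nwords; ++w) {
        std::uint64_t bits = words[w];
        while (bits) {
            const int b = std::countr_zero(bits);
            f(w * 64 + b);
            bits &= bits - 1; // clear the lowest set bit
        }
    }
}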

Double/Nested Iterators

Another alternative to the for_each container-specific method proposed above would be to use double/nested iterators: that is, an outer iterator which points to a sub-range of a different type of iterator. Client code example:

for (auto outer_it = bitset.nbegin(); outer_it != bitset.nend(); ++outer_it)
{
    for (auto inner_it = outer_it->first; inner_it != outer_it->last; ++inner_it)
    {
        // do something with *inner_it (bit index)
    }
}

While not conforming to the flat type of iterator design available now in standard containers, this can allow some very interesting optimizations. As an example, imagine a case like this:

bitset<64> bits = 0x1fbf; // 0b1111110111111;

In that case, the outer iterator can, with just a few bitwise instructions (FFZ/or/complement), deduce that the first range of bits to process is [0, 6), at which point we can iterate through that sub-range very cheaply through the inner/nested iterator (it would just increment an integer, making ++inner_it equivalent to just ++int). Then when we increment the outer iterator, it can very quickly, again with a few bitwise instructions, determine that the next range is [7, 13). After we iterate through that sub-range, we're done. Take this as another example:

bitset<16> bits = 0xffff;

In such a case, the first and last sub-range would be [0, 16), and the bitset could determine that with a single bitwise instruction at which point we can iterate through all set bits and then we're done.
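For a concrete sense of how cheap finding those sub-ranges can be, here is a minimal sketch of the FFS/FFZ arithmetic on a single word (C++20 <bit>; the Run struct and next_run are invented for illustration):

#include <bit>
#include <cstdint>

struct Run { int first, last; }; // half-open range [first, last) of set bits

// Find the next run of consecutive set bits at or after position 'pos'.
// Returns {64, 64} when no set bits remain. (FFS = countr_zero on the
// word, FFZ = countr_zero on its complement.)
Run next_run(std::uint64_t bits, int pos) {
    if (pos >= 64) return {64, 64};
    const std::uint64_t rest = bits >> pos;
    if (rest == 0) return {64, 64};
    const int first = pos + std::countr_zero(rest);      // start of the run
    const int len = std::countr_zero(~(bits >> first));  // its length
    return {first, first + len};
}

// next_run(0x1fbf, 0) -> {0, 6};  next_run(0x1fbf, 6) -> {7, 13}
// next_run(0xffff, 0) -> {0, 16}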

This type of nested iterator design would map particularly well to vector<bool>, deque, and bitset as well as other data structures people might create like unrolled lists.

I say that in a way that goes beyond armchair speculation, since I have a set of data structures resembling deque that are actually on par with sequential iteration of vector (still noticeably slower for random access, especially if we're just storing a bunch of primitives and doing trivial processing). However, to achieve times comparable to vector for sequential iteration, I had to use these types of techniques (the for_each method and double/nested iterators) to reduce the amount of processing and branching going on in each iteration. I could not rival the times otherwise, using just the flat iterator design and/or operator[]. I'm certainly not smarter than the standard library implementers, yet I came up with a deque-like container that can be sequentially iterated much faster, and that strongly suggests to me that it's an issue with the standard interface design of iterators, which comes with some overhead in these peculiar cases that the optimizer cannot optimize away.

Old Answer

I'm one of those who would give you a similar performance answer, but I'll try to give you something a bit more in-depth than "just because". It is something I came across through actual profiling and timing, not merely distrust and paranoia.

One of the biggest problems with bitset and vector<bool> is that their interface design is "too convenient" if you want to use them like an array of booleans. Optimizers are great at obliterating all that structure you establish to provide safety, reduce maintenance cost, make changes less intrusive, etc. They do an especially fine job with selecting instructions and allocating the minimal number of registers to make such code run as fast as the not-so-safe, not-so-easy-to-maintain/change alternatives.

The part that makes the bitset interface "too convenient" at the cost of efficiency is the random-access operator[], as well as the iterator design for vector<bool>. When you access one of these at index n, the code first has to figure out which byte the nth bit belongs to, and then the sub-index of the bit within it. That first phase typically involves a division/right-shift along with a modulo/bitwise-and, and it is more costly than the actual bit operation you're trying to perform.
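Spelled out, each such access hides something like the following (a sketch assuming a 64-bit word layout; real implementations differ in detail):

#include <cstddef>
#include <cstdint>

// What a packed-bit container must do for every single-bit read:
bool test_bit(const std::uint64_t* words, std::size_t n) {
    const std::size_t word = n / 64;   // usually compiled to n >> 6
    const std::size_t bit  = n % 64;   // usually compiled to n & 63
    return (words[word] >> bit) & 1u;  // the actual bit test comes last
}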

The iterator design for vector<bool> faces a similar awkward dilemma where it either has to branch into different code every 8+ times you iterate through it or pay that kind of indexing cost described above. If the former is done, it makes the logic asymmetrical across iterations, and iterator designs tend to take a performance hit in those rare cases. To exemplify, if vector had a for_each method of its own, you could iterate through, say, a range of 64 elements at once by just masking the bits against a 64-bit mask for vector<bool> if all the bits are set without checking each bit individually. It could even use FFS to figure out the range all at once. An iterator design would tend to inevitably have to do it in a scalar fashion or store more state which has to be redundantly checked every iteration.

For random access, optimizers can't seem to optimize away this indexing overhead to figure out which byte and relative bit to access (perhaps a bit too runtime-dependent), even when it's not needed, and you tend to see significant performance gains with more manual code that processes bits sequentially with advance knowledge of which byte/word/dword/qword it's working on. It's somewhat of an unfair comparison, but the difficulty with std::bitset is that there's no way to make a fair comparison in cases where the code knows which byte it wants to access in advance, and more often than not you have this info in advance. It's an apples-to-oranges comparison in the random-access case, but you often only need oranges.

Perhaps that wouldn't be the case if the interface design involved a bitset where operator[] returned a proxy, requiring a two-index access pattern to use. For example, in such a case, you would set bits 6 and 7 of the first word by writing bitset[0][6] = true; bitset[0][7] = true; with a template parameter to indicate the size of the proxy (64 bits, e.g.). A good optimizer may be able to take such a design and make it rival the manual, old-school way of doing the bit manipulation by hand by translating that into: bitset |= 0xC0;
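A rough sketch of what that two-index design could look like (all names here are invented; nothing like this exists in the standard library):

#include <cstddef>
#include <cstdint>

template <std::size_t WordBits = 64>
class word_bitset {
    std::uint64_t words_[1024 / WordBits]{}; // fixed capacity for the sketch
public:
    struct bit_ref {
        std::uint64_t& w;
        std::uint64_t mask;
        bit_ref& operator=(bool b) {
            if (b) w |= mask; else w &= ~mask;
            return *this;
        }
    };
    struct word_ref {
        std::uint64_t& w;
        bit_ref operator[](std::size_t bit) { return {w, std::uint64_t{1} << bit}; }
    };
    word_ref operator[](std::size_t word) { return {words_[word]}; }
};

// word_bitset<> bits;
// bits[0][6] = true;
// bits[0][7] = true;  // an optimizer could fold both into words_[0] |= 0xC0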

Another design that might help is if bitsets provided a for_each_bit kind of method, passing a bit proxy to the functor you provide. That might actually be able to rival the manual method.

std::deque has a similar interface problem. Its performance shouldn't be that much slower than std::vector for sequential access, yet unfortunately we access it sequentially either using operator[], which is designed for random access, or through an iterator, and the internal representation of a deque simply doesn't map very efficiently to an iterator-based design. If deque provided a for_each kind of method of its own, it could potentially get a lot closer to std::vector's sequential access performance. These are some of the rare cases where the Sequence interface design comes with some efficiency overhead that optimizers often can't obliterate. Often good optimizers can make convenience come free of runtime cost in a production build, but unfortunately not in all cases.

Sorry!

In retrospect I wandered a bit with this post, talking about vector<bool> and deque in addition to bitset. That's because we had a codebase where the use of these three, and particularly iterating through them or using them with random access, were often hotspots.

Apples to Oranges

As emphasized in the old answer, comparing straightforward usage of bitset to primitive types with low-level bitwise logic is comparing apples to oranges. It's not that bitset is implemented inefficiently for what it does. If you genuinely need to access a bunch of bits with a random access pattern which, for some reason or other, needs to check and set just one bit at a time, then it may be ideally implemented for that purpose. But my point is that almost all use cases I've encountered didn't require that, and when it's not required, the old-school way involving bitwise operations tends to be significantly more efficient.

Is using a vector of boolean values slower than a dynamic bitset?

A great deal here depends on how many Boolean values you're working with.

Both bitset and vector<bool> normally use a packed representation where a Boolean is stored as only a single bit.

On one hand, that imposes some overhead in the form of bit manipulation to access a single value.

On the other hand, that also means many more of your Booleans will fit in your cache.

If you're using a lot of Booleans (e.g., implementing a sieve of Eratosthenes) fitting more of them in the cache will almost always end up a net gain. The reduction in memory use will gain you a lot more than the bit manipulation loses.

Most of the arguments against std::vector<bool> come back to the fact that it is not a standard container (i.e., it does not meet the requirements for a container). IMO, this is mostly a question of expectations -- since it says vector, many people expect it to be a container (other types of vectors are), and they often react negatively to the fact that vector<bool> isn't a container.
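The most common way this surprises people, sketched briefly (vector<char> shown for contrast):

#include <vector>

int main() {
    std::vector<bool> vb(8);
    auto r = vb[0];       // r is a proxy object, not a bool&
    // bool* p = &vb[0];  // does not compile: packed bits aren't addressable
    std::vector<char> vc(8);
    char* p = &vc[0];     // fine: a real container hands out real references
    (void)r; (void)p;
}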

If you're using the vector in a way that really requires it to be a container, then you probably want to use some other combination -- either deque<bool> or vector<char> can work fine. Think before you do that though -- there's a lot of (lousy, IMO) advice that vector<bool> should be avoided in general, with little or no explanation of why it should be avoided at all, or under what circumstances it makes a real difference to you.

Yes, there are situations where something else will work better. If you're in one of those situations, using something else is clearly a good idea. But, be sure you're really in one of those situations first. Anybody who tells you (for example) that "Herb says you should use vector<char>" without a lot of explanation about the tradeoffs involved should not be trusted.

Let's give a real example. Since it was mentioned in the comments, let's consider the Sieve of Eratosthenes:

#include <vector>
#include <iostream>
#include <iterator>
#include <chrono>

unsigned long primes = 0;

template <class bool_t>
unsigned long sieve(unsigned max) {
    std::vector<bool_t> sieve(max, false);
    sieve[0] = sieve[1] = true;

    for (unsigned i = 2; i < max; i++) {
        if (!sieve[i]) {
            ++primes;
            for (unsigned temp = 2 * i; temp < max; temp += i)
                sieve[temp] = true;
        }
    }
    return primes;
}

// Warning: auto return type will fail with older compilers
// Fine with g++ 5.1 and VC++ 2015 though.
//
template <class F>
auto timer(F f, int max) {
    auto start = std::chrono::high_resolution_clock::now();
    primes += f(max);
    auto stop = std::chrono::high_resolution_clock::now();

    return stop - start;
}

int main() {
    using namespace std::chrono;

    unsigned number = 100000000;

    auto using_bool = timer(sieve<bool>, number);
    auto using_char = timer(sieve<char>, number);

    std::cout << "ignore: " << primes << "\n";
    std::cout << "Time using bool: " << duration_cast<milliseconds>(using_bool).count() << "\n";
    std::cout << "Time using char: " << duration_cast<milliseconds>(using_char).count() << "\n";
}

We've used a large enough array that we can expect a large portion of it to occupy main memory. I've also taken some trouble to ensure that the only thing that changes between one invocation and the other is the use of vector<char> vs. vector<bool>. Here are some results. First with VC++ 2015:

ignore: 34568730
Time using bool: 2623
Time using char: 3108

...then the time using g++ 5.1:

ignore: 34568730
Time using bool: 2359
Time using char: 3116

Obviously, the vector<bool> wins in both cases--by around 15% with VC++, and over 30% with gcc. Also note that in this case, I've chosen the size to show vector<char> in quite favorable light. If, for example, I reduce number from 100000000 to 10000000, the time differential becomes much larger:

ignore: 3987474
Time using bool: 72
Time using char: 249

Although I haven't done a lot of work to confirm, I'd guess that in this case, the version using vector<bool> is saving enough space that the array fits entirely in the cache, while the vector<char> is large enough to overflow the cache, and involve a great deal of main memory access.

Choosing between set<int> vs. vector<bool> vs. vector<boolean_t> to use as a bitmap (bitset / bit array)

Without knowing the platform you are running this code on and your access patterns, it's hard to say whether vector<bool> will be faster than vector<char> (or vector<int>) or even set<int> or unordered_set<int>.

For example, if you have an extremely sparse array, a linear search of a vector<int> that just contains the indices set might be the best answer. (See Mike Abrash's article on optimizing Pixomatic for x86.)

On the other hand, you might have a somewhat sparse array. By somewhat sparse, I mean that the number of set elements is much greater than what fits in the L1 or L2 cache. In that case, more low-level details start to come into play, as well as your actual access patterns.

For example, on some platforms variable bit shifts are incredibly expensive. So, if you are querying a random set of identifiers, then the more frequently you do this, the more a vector<char> or vector<int> becomes a better idea than bitset<...> or vector<bool> (the latter two use bit shifts to look up bits). On the other hand, if you are iterating through the sparse bit vector in order and just want the set bits, you can optimize that iteration to get rid of the overhead of variable shifts.

At this point, you might also want to know how your sparse identifiers are actually distributed. If they are clumped, you need to know the tradeoff between the optimal memory read size and reading a char at a time. That will dictate whether hitting the cache more often will offset reading in non-native sized data.

If the identifiers are scattered, you may get a significant win by using a hash set (unordered_set<int>) instead of a bit vector. That depends on the load, however.

Converting Between std::bitset and std::vector<bool>

Matthew Austern wrote an iterator for bitset here: http://www.drdobbs.com/the-standard-librarian-bitsets-and-bit-v/184401382?pgno=2

It's over 100 lines so I feel just lifting it and putting it in this answer may be a bit out of bounds. But it works marvelously for STL algorithms.

Austern answers this exact question:

While bitset doesn’t have the STL container interface, it’s still a perfectly good (fixed-size) container. If it makes sense for you to use a bitset, and if you also need iterators, then you can define a simple “index iterator” adaptor that translates iterator notation like *i into array notation like b[n]. The implementation is straightforward: maintain an index and a pointer to a container.

He does warn of his iterator:

If we were willing to accept a slightly more cumbersome interface, we could define a class that worked with arbitrary array-like types. A general purpose index iterator adaptor is often useful when dealing with pre-STL container classes, and sometimes even when dealing with STL containers like vector.

It should also be noted that just as with a vector<bool>::iterator, Austern's bitset_iterator::operator* doesn't return a bool& but a proxy reference: bitset<>::reference.

Usage of bitset_iterator looks like this:

std::bitset<10> foo;
std::vector<bool> bar(begin(foo), end(foo));
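If pulling in the full iterator adaptor is overkill, a plain index loop converts in either direction (a minimal sketch; no iterators required):

#include <bitset>
#include <cstddef>
#include <vector>

template <std::size_t N>
std::vector<bool> to_vector(const std::bitset<N>& b) {
    std::vector<bool> v(N);
    for (std::size_t i = 0; i < N; ++i) v[i] = b[i];
    return v;
}

template <std::size_t N>
std::bitset<N> to_bitset(const std::vector<bool>& v) {
    std::bitset<N> b; // bits beyond v.size() stay false
    for (std::size_t i = 0; i < N && i < v.size(); ++i) b[i] = v[i];
    return b;
}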

How does one store a vector<bool> or a bitset into a file, but bit-wise?

Simplest approach: take 8 consecutive boolean values, represent them as a single byte, and write that byte to your file. That saves a lot of space.

At the beginning of the file, you can write the number of boolean values you intend to store; that count is needed when reading the bytes back from the file and converting them into boolean values.
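A minimal sketch of that scheme (the function names are mine; the count is written first as a fixed-width header in host byte order, so it round-trips on the same machine; error handling omitted):

#include <cstdint>
#include <fstream>
#include <vector>

void save_bits(const std::vector<bool>& v, const char* path) {
    std::ofstream out(path, std::ios::binary);
    const std::uint64_t n = v.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n); // bit count header
    char byte = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        if (v[i]) byte |= char(1 << (i % 8));               // pack 8 bools per byte
        if (i % 8 == 7 || i + 1 == n) { out.put(byte); byte = 0; }
    }
}

std::vector<bool> load_bits(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::uint64_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    std::vector<bool> v(n);
    char byte = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        if (i % 8 == 0) in.get(byte);                       // fetch the next packed byte
        v[i] = (byte >> (i % 8)) & 1;
    }
    return v;
}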

Bitset, vector of bool or vector of ints for simple big integer

The operations you describe (leaving out the implicit interpretation as an integer) are actually those provided efficiently by a deque. If you can tolerate the memory overhead, you could use std::deque<bool> (std::list<bool> would also work, but would have even higher overhead).

If the overhead is too much, you could start with

struct Bits {
    std::deque<unsigned> deq;
    int ms_free, ls_free; // unused bits in the two end words
};

and write methods to push bits on either end (for the right end, you would deq.push_back() if ls_free == 0 and store into deq.back() otherwise). Comparison would use deq.size() and ms_free + ls_free to know how to align the two sequences.
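A sketch of the right-end push just described (the choice to fill each word from its high bit down is arbitrary; ms_free handling for the left end would mirror it):

#include <deque>

struct Bits {
    std::deque<unsigned> deq;
    int ms_free = 0, ls_free = 0; // unused bits in the two end words
    static const int WORD = 8 * sizeof(unsigned);

    // Append one bit at the right end of the sequence.
    void push_right(bool bit) {
        if (ls_free == 0) {       // back word is full: start a new one
            deq.push_back(0);
            ls_free = WORD;
        }
        --ls_free;                // consume one slot, high bits first
        if (bit) deq.back() |= 1u << ls_free;
    }
};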

Fast std::vector<bool> Reset

Method 1 "Dense":

If you need a "dense" container I would suggest you use a mixture of vector and bitset.

We store bits as a sequence of bitsets, thus we can use bitset::reset on each "chunk" to reset them all. DynamicBitset::resize can be used to make room for the correct number of bits.

class DynamicBitset
{
private:
    static const unsigned BITS_LEN = 64; // or 32, or more?
    typedef std::bitset<BITS_LEN> Bits;
    std::vector<Bits> data_;

public:
    DynamicBitset(size_t len = 0) {
        resize(len);
    }
    // reset all bits to false
    void reset() {
        for(auto it = data_.begin(); it != data_.end(); ++it) {
            it->reset(); // we can use the fast bitset::reset :)
        }
    }
    // reset the whole chunk belonging to bit i
    void reset_approx(size_t i) {
        data_[i / BITS_LEN].reset();
    }
    // make room for len bits
    void resize(size_t len) {
        data_.resize(len / BITS_LEN + 1);
    }
    // access a bit
    Bits::reference operator[](size_t i) {
        size_t k = i / BITS_LEN;
        return data_[k][i - k * BITS_LEN];
    }
    bool operator[](size_t i) const {
        size_t k = i / BITS_LEN;
        return data_[k][i - k * BITS_LEN];
    }
};
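Usage then looks like a plain bitset whose size is chosen at runtime:

DynamicBitset bits(1000);   // room for at least 1000 bits
bits[42] = true;
bits[999] = true;
bits.reset();               // one fast bitset::reset per 64-bit chunk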

Method 2 "Sparse":

If you store only very few bits, you can also go with a mixture of map and bitset.

Here we store a chunk only if at least one bit in it is set. This adds a cost for accessing bits, since we need a lookup into a std::map, which is O(log N), but it uses much less memory.

Additionally, the reset function does exactly what you stated in your question -- it only touches areas where you have set a bit.

SparseDynamicBitset is a good choice when you have very long sequences of bits which are always false, e.g. for 1000...000100...010, and not for 0101000111010110.

class SparseDynamicBitset
{
private:
    static const unsigned BITS_LEN = 64; // ?
    typedef std::bitset<BITS_LEN> Bits;
    std::map<unsigned, Bits> data_;

public:
    // reset all "stored" bits to false
    void reset() {
        for(auto it = data_.begin(); it != data_.end(); ++it) {
            it->second.reset(); // uses bitset::reset :)
        }
    }
    // access a bit (inserts an empty chunk if none exists yet)
    Bits::reference operator[](size_t i) {
        size_t k = i / BITS_LEN;
        return data_[k][i - k * BITS_LEN];
    }
    bool operator[](size_t i) const {
        size_t k = i / BITS_LEN;
        auto it = data_.find(k);
        if(it == data_.end()) {
            return false; // the default
        }
        return it->second[i - k * BITS_LEN];
    }
};

Is using unsigned int instead of std::vector<bool> or std::bitset a recommended practice?

The use of some size of unsigned integer to represent a fixed-size sequence of bits was the only option in C++ before the 1998 Standard, when std::vector<bool> and std::bitset were introduced. The practice was inherited from C, where it is considered part of a competent programmer's proficiency, and it remains so regarded in C++.

std::vector<bool> has come to be regarded with regret. See e.g. vector<bool>: More Problems, Better Solutions and Effective STL, Item 18. std::bitset is considered fit for purpose.

The unsigned integer practice inherently represents a hand-rolled simulation of a fixed-size bit-sequence by overt artifice, constrained and complicated by the fact that there are only a few sizes of unsigned integer (even if the chosen size is made precise). If you require a fixed-size bit-sequence to be subject to operations that are all supported by the interface of std::bitset, then, other things being equal, the Standard Library's provision is to be preferred to hand-rolled code, for all of the reasons to which the Library owes its existence.

You must be the judge of whether, in the context of your application, other things are equal.

Performance gap between vector<bool> and array

std::vector<bool> isn't like any other vector. The documentation says:

std::vector<bool> is a possibly space-efficient specialization of
std::vector for the type bool.

That's why it may use less memory than an array: it might represent multiple boolean values with one byte, like a bitset. That also explains the performance difference, since accessing it isn't as simple anymore. And according to the documentation, it doesn't even have to store its elements as a contiguous array.


