Magic Number in Boost::Hash_Combine

Magic number in boost::hash_combine

The magic number is supposed to be 32 random bits, where each is equally likely to be 0 or 1, and with no simple correlation between the bits. A common way to find a string of such bits is to use the binary expansion of an irrational number; in this case, that number is the reciprocal of the golden ratio:

phi = (1 + sqrt(5)) / 2
2^32 / phi = 0x9e3779b9

So including this number "randomly" changes each bit of the seed; as you say, this means that consecutive values will be far apart. Including the shifted versions of the old seed makes sure that, even if hash_value() has a fairly small range of values, differences will soon be spread across all the bits.
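
For what it's worth, the constant can be reproduced with a few lines (a standalone sketch, not Boost code, assuming a 32-bit unsigned int):

#include <cmath>
#include <cstdio>

int main()
{
    // phi = (1 + sqrt(5)) / 2; the magic number is 2^32 / phi, truncated.
    double phi = (1.0 + std::sqrt(5.0)) / 2.0;
    unsigned magic = static_cast<unsigned>(4294967296.0 / phi);
    std::printf("0x%08x\n", magic); // prints 0x9e3779b9
}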

Why is the magic number in boost::hash_combine specified in hex

Literals are literals and different representations of the same literal are... literally identical.

However, expressions (literal or not) also have a type.

The equivalent literal would have been 2654435769u (note the type suffix making it unsigned).

Look at this simple test Live On Coliru

  • 0x9e3779b9 has type unsigned int (32 bit) and
  • 2654435769 has type long (64 bit)
  • 2654435769u has type unsigned int (32 bit) again

As you can see, the hex representation favours unsigned and the decimal representation favours signed, making the type bigger¹.
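
These deductions can be checked with static_assert; a minimal sketch, assuming a typical LP64 platform (int is 32 bits, long is 64 bits):

#include <type_traits>

int main()
{
    static_assert(std::is_same<decltype(0x9e3779b9), unsigned int>::value, "hex literal");
    static_assert(std::is_same<decltype(2654435769), long>::value, "decimal literal");
    static_assert(std::is_same<decltype(2654435769u), unsigned int>::value, "decimal literal with u suffix");
}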


¹ native integer sizes are implementation defined

(Beyond types, one could argue that maybe, perhaps, bit-distribution is slightly more apparent in hex, octal or ultimately binary representations)

C++ - Why is boost::hash_combine the best way to combine hash-values?

It being the "best" is argumentative.

It being "good", or even "very good", at least superficially, is easy.

seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);

We'll presume seed is a previous result of hasher or this algorithm.

^= means that the bits already in the seed (on the left) and the bits of the expression on the right both feed into the bits of the result.

hasher(v) is presumed to be a decent hash on v. But the rest is defence in case it isn't a decent hash.

0x9e3779b9 is a 32 bit value (arguably it could be extended to 64 bits where size_t is 64 bits) that contains roughly half 0s and half 1s. It is basically a random-looking series of 0s and 1s, obtained by approximating a particular irrational constant as a base-2 fixed-point value. This helps ensure that even if the hasher returns bad values, we still get a smear of 1s and 0s in our output.

(seed<<6) + (seed>>2) is a bit shuffle of the incoming seed.

Imagine the 0x9e3779b9 constant was missing, and imagine hasher returns the constant 0x00001000 for almost every v passed in. Now each bit of the seed is spread out over the next iteration of the hash, during which it is spread out again.

Under seed ^= (seed<<6) + (seed>>2), the seed 0x00001000 becomes 0x00041400 after one iteration, then 0x01021100 after the next. As you repeat the operation, any set bits are "smeared out" over the output bits. Eventually bits shifted right and bits shifted left collide, and the carry from the addition moves set bits from "even locations" to "odd locations".
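
Here is a small sketch of that simplified recurrence (the hasher output and the magic constant are deliberately left out, so only the shift-and-XOR smearing is visible):

#include <cstdint>
#include <iomanip>
#include <iostream>

int main()
{
    std::uint32_t seed = 0x00001000;
    for (int i = 1; i <= 8; ++i)
    {
        seed ^= (seed << 6) + (seed >> 2);   // the shuffle step alone
        std::cout << "iteration " << i << ": 0x"
                  << std::hex << std::setw(8) << std::setfill('0') << seed
                  << std::dec << '\n';
    }
}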

The number of output bits that depend on any given input seed bit grows relatively fast and in complex ways as the combine operation is applied repeatedly. Adding causes carries, which smear things even more. The 0x constant adds a bunch of pseudo-random bits that make boring hash values occupy more than a few bits of the hash space after being combined.

It is asymmetric thanks to addition (combining the hashes of "dog" and "god" gives different results); it handles boring hash values (such as mapping characters to their ASCII values, which only twiddles a handful of bits); and it is reasonably fast.
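
A quick sketch of that asymmetry (this re-implements the combine step locally rather than pulling in Boost, and combine/hash_string are just illustrative helper names):

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

void combine(std::size_t& seed, char c)
{
    // Same shape of combine step as discussed above.
    seed ^= std::hash<char>{}(c) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

std::size_t hash_string(const std::string& s)
{
    std::size_t seed = 0;
    for (char c : s)
        combine(seed, c);
    return seed;
}

int main()
{
    // "dog" and "god" contain the same characters, but the combine step is
    // order-sensitive, so the results differ.
    std::cout << (hash_string("dog") != hash_string("god")) << '\n'; // prints 1
}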

Slower hash combines that are cryptographically strong can be better in other situations. I, naively, would presume that making the shifts a combination of even and odd amounts might be a good idea (but maybe addition, whose carries move bits from even positions into odd positions, makes that less of a problem: after 3 iterations, incoming lone seed bits collide, add, and cause a carry).

The downside to this kind of analysis is that it only takes one mistake to make a hash function really bad. Pointing out all the good things doesn't help that much. So another thing that makes it good now is that it is reasonably famous and in an open-source repository, and I haven't heard anyone point out why it is bad.

getting too many collisions with hash_combine

You are correct. Boost's hash_combine function does poorly for this data set. You can test with this code which shows almost 600,000 collisions for the one million test entries.

Here's a simple fix:

for (std::size_t i = 0; i < myvec.size(); ++i)
    boost::hash_combine(seed, myvec[i] * 2654435761);

The magic number is a prime close to 2^32 * (sqrt(5)-1)/2, i.e. 2^32 divided by the golden ratio -- see Knuth's treatment of multiplicative (Fibonacci) hashing for an explanation of why multiplying by it spreads values out over the intervals.
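
Wrapped up as a hash functor for use with an unordered container (VecHash is just an illustrative name; the multiplication is the fix from above):

#include <cstddef>
#include <vector>
#include <boost/functional/hash.hpp>

struct VecHash
{
    std::size_t operator()(const std::vector<int>& v) const
    {
        std::size_t seed = 0;
        for (std::size_t i = 0; i < v.size(); ++i)
            boost::hash_combine(seed, v[i] * 2654435761);
        return seed;
    }
};

It can then be passed as the Hash template argument of, for example, an unordered_set of vectors.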

Determinism guaranteed only during one program run in boost::hash_combine

The reason is that hashes are frequently used in hash tables. A malicious user trying to attack a service (that uses C++ code involving hash tables) may massively degrade its performance by forcing hash collisions for items inserted into a hash table (with performance for common operations going from O(1) to O(N)). By having every run use a different hash function, this becomes a lot harder to do.

std::hash is standardized like that too. To cite https://en.cppreference.com/w/cpp/utility/hash:

Hash functions are only required to produce the same result for the same input within a single execution of a program; this allows salted hashes that prevent collision DoS attacks. (since C++14)
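
As a sketch of the salting idea (this is only an assumption about how an implementation could do it, not how Boost or any standard library actually does): pick one random value per process and fold it into every hash.

#include <cstddef>
#include <functional>
#include <random>

// Hypothetical per-run salt: chosen once at startup, different on every run.
static const std::size_t per_run_salt = std::random_device{}();

std::size_t salted_hash(int v)
{
    std::size_t h = std::hash<int>{}(v);
    h ^= per_run_salt + 0x9e3779b9 + (h << 6) + (h >> 2);
    return h;
}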

How do I combine hash values in C++0x?

Well, just do it like the boost guys did it:

#include <cstddef>     // std::size_t
#include <functional>  // std::hash

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
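
For example, to hash a struct member by member with it (Person here is just a made-up illustration type, not anything from Boost or the standard library):

#include <string>

struct Person
{
    std::string name;
    int age;
};

std::size_t hash_value(const Person& p)
{
    std::size_t seed = 0;
    hash_combine(seed, p.name);   // uses std::hash<std::string>
    hash_combine(seed, p.age);    // uses std::hash<int>
    return seed;
}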

C++ some questions on boost::unordered_map & boost::hash

This is a bit muddled.

  • What you describe are not "things that you can do to speed things up"; rather, they are mandatory requirements your type must satisfy to be eligible as the element type of an unordered map, or of an unordered set (which you might rather want).

  • You need to provide an equality operator that compares objects, not hash values. The whole point of the equality is to distinguish elements with the same hash.

  • size_t is an unsigned integral type, 32 bits on x86 and 64 bits on x64. Since you want "billions of elements", which means many gigabytes of data, I assume you have a solid x64 machine anyway.

  • What's crucial is that your hash function is good, i.e. has few collisions.

  • You want a set, not a map. Put the objects directly in the set: std::unordered_set<State>. Use a map if you are mapping to something, i.e. states to something else. Oh, use C++0x, not boost, if you can.

  • Using hash_combine is good.


Baby example:

#include <cstddef>
#include <functional>
#include <unordered_set>

struct State
{
    inline bool operator==(const State &) const;
    /* Stuff */
};

namespace std
{
    template <> struct hash<State>
    {
        inline std::size_t operator()(const State & s) const
        {
            /* your hash algorithm here */
            return 0; // placeholder
        }
    };
}

// An alternative hash must be a callable type (a functor), not a free
// function, to be usable as the Hash template argument.
struct Foo
{
    std::size_t operator()(const State & s) const { /* some code */ return 0; }
};

int main()
{
    std::unordered_set<State> states;           // uses std::hash<State>, no extra data needed
    std::unordered_set<State, Foo> more_states; // another hash function
}

