C++ - Why is boost::hash_combine the best way to combine hash-values?

C++ - Why is boost::hash_combine the best way to combine hash-values?

It being the "best" is debatable.

It being "good", or even "very good", at least superficially, is easy.

seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);

We'll presume seed is a previous result of hasher or this algorithm.

^= means that the bits on the left and the bits on the right both influence the bits of the result.

hasher(v) is presumed to be a decent hash on v. But the rest is defence in case it isn't a decent hash.

0x9e3779b9 is a 32-bit value (arguably it could be extended to a 64-bit constant when size_t is 64 bits) that contains half 0s and half 1s. It is basically a pseudo-random series of 0s and 1s, obtained by approximating a particular irrational constant as a base-2 fixed-point value. This helps ensure that even if the hasher returns bad values, we still get a smear of 1s and 0s in our output.

(seed<<6) + (seed>>2) is a bit shuffle of the incoming seed.

Imagine the 0x constant was missing. Imagine the hasher returns the constant 0x00001000 for almost every v passed in. Now, each bit of the seed is spread out over the next iteration of the hash, during which it is again spread out.

Under seed ^= (seed<<6) + (seed>>2), the value 0x00001000 becomes 0x00041400 after one iteration, then 0x01021100 after another. As you repeat the operation, any set bits are "smeared out" over the output bits. Eventually the left-shifted and right-shifted copies of a bit collide, and the carry from the addition moves set bits from "even locations" to "odd locations".
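To see the smearing concretely, here is a small standalone sketch (not boost code) that iterates just the shift-and-add part of the formula and prints the seed after each round:

#include <cstdint>
#include <cstdio>

int main()
{
    // start with a single set bit, as in the example above
    std::uint32_t seed = 0x00001000;
    for (int i = 0; i < 6; ++i)
    {
        seed ^= (seed << 6) + (seed >> 2);
        std::printf("after iteration %d: 0x%08x\n", i + 1, (unsigned)seed);
    }
}

The first two lines of output are 0x00041400 and 0x01021100; after a handful of rounds the single input bit has touched most of the word.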

The set of output bits that depend on a given bit of the input seed grows quite quickly, and in complex ways, as the combine operation is repeated. Addition causes carries, which smear things even more. The 0x constant adds a bunch of pseudo-random bits that make boring hash values occupy more than a few bits of the hash space after being combined.

It is asymmetric thanks to the addition (combining the hashes of "dog" and "god" gives different results); it handles boring hash values (such as mapping characters to their ASCII values, which only twiddles a handful of bits); and it is reasonably fast.
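As a quick illustration of the order-sensitivity, here is a standalone sketch that uses raw character codes as deliberately boring hashes and applies the combine step in both orders:

#include <cstdint>
#include <cstdio>

// the boost combine step, inlined for the demonstration
std::uint32_t combine(std::uint32_t seed, std::uint32_t h)
{
    return seed ^ (h + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

int main()
{
    std::uint32_t dog = combine(combine(combine(0, 'd'), 'o'), 'g');
    std::uint32_t god = combine(combine(combine(0, 'g'), 'o'), 'd');
    // prints two different values
    std::printf("dog: 0x%08x\ngod: 0x%08x\n", (unsigned)dog, (unsigned)god);
}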

Slower hash combines that are cryptographically strong can be better in other situations. I would naively presume that making the shifts a combination of even and odd amounts might be a good idea (though perhaps the addition, which carries bits from even positions into odd ones, makes that less of a problem: after 3 iterations, a lone incoming seed bit's copies will collide, add, and cause a carry).

The downside to this kind of analysis is that it only takes one mistake to make a hash function really bad. Pointing out all the good things doesn't help that much. So another thing that makes it good now is that it is reasonably famous and in an open-source repository, and I haven't heard anyone point out why it is bad.

How do I combine hash values in C++0x?

Well, just do it like the boost guys did it:

#include <cstddef>    // std::size_t
#include <functional> // std::hash

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}
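For illustration, here is how that helper might be used to hash the members of a small aggregate; Person is a made-up type for this sketch, and hash_combine is the helper above:

#include <string>

struct Person
{
    std::string name;
    int age;
};

std::size_t hash_person(const Person& p)
{
    std::size_t seed = 0;
    hash_combine(seed, p.name); // uses std::hash<std::string> under the hood
    hash_combine(seed, p.age);  // then mixes in the int
    return seed;
}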

I am using the boost hash function in C++ to combine 3 doubles into a hash, and it is running into collisions

I have a list of vertices and am using the hash created from the doubles as a key in the map.

Not the best use of a hash, in the sense of what std::hash or boost::hash represents.

You're looking for uniqueness. A hash in this sense is not unique.

I was wondering if a more reliable hash combine function for this specific application exists.

Not unless the hash space has a 1:1 correlation with the space of possible values of x, y and z - which essentially means using the vertex itself as the identifier.

Summary:

If you want a container indexed by unique vertices, you may want to consider a std::unordered_map. You will need to provide an equality operator.

An example:

#include <boost/functional/hash.hpp> // see below
#include <tuple>
#include <unordered_map>
#include <string>

// a simple Vertex class
struct Vertex
{
    double x, y, z;
};

// a useful general-purpose accessor
auto as_tuple(Vertex const& v) -> decltype(auto)
{
    return std::tie(v.x, v.y, v.z);
}

// equality implemented in terms of tuple, for simplicity
bool operator==(Vertex const& l, Vertex const& r)
{
    return as_tuple(l) == as_tuple(r);
}

// hash_value implemented in terms of tuple, for consistency and simplicity
std::size_t hash_value(Vertex const& v)
{
    using boost::hash_value;
    return hash_value(as_tuple(v));
}

// the boring bit - injecting a hash specialisation into the std:: namespace
// but let's derive from boost's hash class, which is much better
// in that it allows easy hashing using free functions
namespace std {
    template<> struct hash<::Vertex> : boost::hash<::Vertex> {};
}

using vertex_map = std::unordered_map<Vertex, std::string>;

int main()
{
    auto m = vertex_map();

    m[{0, 0, 0}] = "Sol";
    m[{1, 3, 5}] = "Mars";
    m[{100.4, 343.2, 92.44}] = "Pluto";
}

Note: The numbers above are in Zargian NonLinear Megaunits - you won't find them in any Earthbound textbook on the Solar System.

boost hash returning same value for different inputs

Look up perfect hashing, and the birthday paradox, and for completeness's sake the pigeonhole principle.

What it boils down to is that hash functions generally do produce collisions, unless what you're hashing has very specific properties you've taken advantage of. Your chances of seeing a hash collision for any given set of keys are going to be counterintuitively high, just because that's one of the mathematical realities we're not wired for: with a 1/365 chance of getting any particular hash, your odds of a collision are 50/50 given just 23 keys.
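The birthday figure is easy to reproduce. This small standalone sketch multiplies out the probability that 23 keys drawn uniformly from 365 buckets are all distinct:

#include <cstdio>

int main()
{
    const int buckets = 365;
    const int keys = 23;
    double all_distinct = 1.0;
    for (int k = 0; k < keys; ++k)
        all_distinct *= double(buckets - k) / buckets;
    // prints roughly 0.507
    std::printf("P(at least one collision) = %.3f\n", 1.0 - all_distinct);
}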

Magic number in boost::hash_combine

The magic number is supposed to be 32 random bits, where each is equally likely to be 0 or 1, and with no simple correlation between the bits. A common way to find a string of such bits is to use the binary expansion of an irrational number; in this case, that number is the reciprocal of the golden ratio:

phi = (1 + sqrt(5)) / 2
2^32 / phi = 0x9e3779b9

So including this number "randomly" changes each bit of the seed; as you say, this means that consecutive values will be far apart. Including the shifted versions of the old seed makes sure that, even if hash_value() has a fairly small range of values, differences will soon be spread across all the bits.
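You can check the arithmetic for the constant with a couple of lines (a standalone sketch):

#include <cmath>
#include <cstdint>
#include <cstdio>

int main()
{
    const double phi = (1.0 + std::sqrt(5.0)) / 2.0;
    // 2^32 divided by the golden ratio, truncated to 32 bits
    const std::uint32_t magic = static_cast<std::uint32_t>(4294967296.0 / phi);
    std::printf("0x%08x\n", (unsigned)magic); // prints 0x9e3779b9
}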

getting too many collisions with hash_combine

You are correct. Boost's hash_combine function does poorly for this data set; a quick test over the one million test entries shows almost 600,000 collisions.

Here's a simple fix:

std::size_t seed = 0;
for (std::size_t i = 0; i < myvec.size(); ++i)
    boost::hash_combine(seed, myvec[i] * 2654435761u);

The magic number is a prime close to 2^32 * (sqrt(5)-1)/2 -- see Knuth's discussion of multiplicative hashing for an explanation of why this choice spreads values out evenly.
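To see why the multiplier helps, here is a standalone sketch showing how it scatters small consecutive integers across the 32-bit range before they ever reach hash_combine:

#include <cstdint>
#include <cstdio>

int main()
{
    // the prime near 2^32 * (sqrt(5)-1)/2 used above
    const std::uint32_t knuth = 2654435761u;
    for (std::uint32_t i = 1; i <= 4; ++i)
        std::printf("%u -> 0x%08x\n", (unsigned)i, (unsigned)(i * knuth));
}

Consecutive inputs land far apart, so the bits fed into hash_combine differ much more than the raw values do.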

Determinism guaranteed only during one program run in boost::hash_combine

The reason is that hashes are frequently used in hash tables. A malicious user trying to attack a service (that uses C++ code involving hash tables) may massively degrade its performance by forcing hash collisions for items inserted into a hash table (with performance for common operations going from O(1) to O(N)). By having every run use a different hash function, this becomes a lot harder to do.

std::hash is standardized like that too. To cite https://en.cppreference.com/w/cpp/utility/hash:

Hash functions are only required to produce the same result for the same input within a single execution of a program; this allows salted hashes that prevent collision DoS attacks. (since C++14)
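A per-run salt can be as simple as mixing one process-wide random value into every seed. The following is only an illustrative sketch of the idea, not the scheme any particular standard library actually uses:

#include <cstddef>
#include <random>

// one random value chosen at program start-up
static const std::size_t process_salt = std::random_device{}();

// hypothetical salted wrapper around an existing hasher
template <class T, class Hasher>
std::size_t salted_hash(const T& v, Hasher hasher)
{
    std::size_t seed = process_salt;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    return seed;
}

Because process_salt differs from run to run, an attacker cannot precompute a set of keys that all land in the same bucket.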

Creating 'good' hash codes for .NET ala Boost.Functional/Hash

The answers to this question contain some examples of helper classes that resemble Boost.Functional/Hash. None looks quite as elegant, though.

I am not aware of any real .NET library that provides the equivalent.


