Quick and Simple Hash Code Combinations

I would personally avoid XOR - it means that any two equal values will result in 0 - so hash(1, 1) == hash(2, 2) == hash(3, 3) etc. Also hash(5, 0) == hash(0, 5) etc which may come up occasionally. I have deliberately used it for set hashing - if you want to hash a sequence of items and you don't care about the ordering, it's nice.
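
For illustration, here is a minimal sketch (not part of the original answer) of the kind of order-insensitive set hashing that XOR is suited for:

using System.Collections.Generic;

// Sketch: XOR-based hash of an unordered collection. Because XOR is
// commutative and associative, the result does not depend on item order.
static int GetUnorderedHash(IEnumerable<int> itemHashes)
{
    int hash = 0;
    foreach (int itemHash in itemHashes)
    {
        hash ^= itemHash;
    }
    return hash;
}

Note that the trade-off described above still applies: any two equal items cancel each other out.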

I usually use:

unchecked
{
    int hash = 17;
    hash = hash * 31 + firstField.GetHashCode();
    hash = hash * 31 + secondField.GetHashCode();
    return hash;
}

That's the form that Josh Bloch suggests in Effective Java. Last time I answered a similar question I managed to find an article where this was discussed in detail - IIRC, no-one really knows why it works well, but it does. It's also easy to remember, easy to implement, and easy to extend to any number of fields.
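
As a sketch of that extensibility (not part of the original answer), the same seed-and-multiply pattern can be wrapped in a helper that takes any number of pre-computed field hashes:

// Sketch: the 17/31 pattern generalized to any number of fields.
static int CombineHashes(params int[] fieldHashes)
{
    unchecked
    {
        int hash = 17;
        foreach (int fieldHash in fieldHashes)
        {
            hash = hash * 31 + fieldHash;
        }
        return hash;
    }
}

Usage would be e.g. CombineHashes(firstField.GetHashCode(), secondField.GetHashCode()).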

Alternative to hashing for quick comparison to avoid conflicts

I'll leave here the solution I ended up using. It is based on the fact that conflicts can be avoided entirely by storing the whole values of multiple fields in a single hash, where they fit:

byte b1 = 42, b2 = 255;
int naiveHash = CombineHash(b1.GetHashCode(), b2.GetHashCode()); // will always have conflicts
int typeAwareHash = (b1 << 8) + b2; // no conflicts

To know how many bits a field requires, I introduced the IObjectDescriptorField interface:

interface IObjectDescriptorField
{
    int GetHashCodeBitCount();
}

I then updated the ObjectDescriptor class to use a HashCodeBuilder class:

class ObjectDescriptor
{
    public IObjectDescriptorField[] Fields;

    public override int GetHashCode()
    {
        HashCodeBuilder hash = new HashCodeBuilder();
        for (int i = 0; i < Fields.Length; i++)
        {
            hash.AddBits(Fields[i].GetHashCode(), Fields[i].GetHashCodeBitCount());
        }
        return hash.GetHashCode();
    }
}

HashCodeBuilder stacks up bits until all 32 are used, and then falls back to a simple hash combination function like before:

public class HashCodeBuilder
{
    private const int HASH_SEED = 352654597;

    private static int Combine(int hash1, int hash2)
    {
        return ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ hash2;
    }

    private int hashAccumulator;
    private int bitAccumulator;
    private int bitsLeft;

    public HashCodeBuilder()
    {
        hashAccumulator = HASH_SEED;
        bitAccumulator = 0;
        bitsLeft = 32;
    }

    public void AddBits(int bits, int bitCount)
    {
        if (bitsLeft < bitCount)
        {
            // The 32-bit window is full: mix it into the hash and start a new window.
            hashAccumulator = Combine(hashAccumulator, bitAccumulator);
            bitsLeft = 32;
            bitAccumulator = 0;
        }

        bitAccumulator = (bitAccumulator << bitCount) + bits;
        bitsLeft -= bitCount;
    }

    public override int GetHashCode()
    {
        return Combine(hashAccumulator, bitAccumulator);
    }
}
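
For example (a hypothetical usage sketch, not code from the original answer), packing a bool and a small enum costs only a few bits of the 32-bit window:

// Hypothetical usage sketch: two small fields share one 32-bit window,
// so no information is lost before the values are mixed together.
var builder = new HashCodeBuilder();
builder.AddBits(1, 1);   // e.g. a bool field, encoded in 1 bit
builder.AddBits(5, 3);   // e.g. an enum field with up to 8 values, in 3 bits
int hashCode = builder.GetHashCode();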

This solution of course still has conflicts if more than 32 bits are used, but it worked for me because many of the fields were just bools or enums with few values, which benefit greatly from being combined like this.

What is the best algorithm for overriding GetHashCode?

I usually go with something like the implementation given in Josh Bloch's fabulous Effective Java. It's fast and creates a pretty good hash which is unlikely to cause collisions. Pick two different prime numbers, e.g. 17 and 23, and do:

public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        // Suitable nullity checks etc, of course :)
        hash = hash * 23 + field1.GetHashCode();
        hash = hash * 23 + field2.GetHashCode();
        hash = hash * 23 + field3.GetHashCode();
        return hash;
    }
}

As noted in comments, you may find it's better to pick a large prime to multiply by instead. Apparently 486187739 is good... and although most examples I've seen with small numbers tend to use primes, there are at least similar algorithms where non-prime numbers are often used. In the not-quite-FNV example later, for example, I've used numbers which apparently work well - but the initial value isn't a prime. (The multiplication constant is prime though. I don't know quite how important that is.)
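
As a sketch only (the field names are placeholders), the same shape with the large multiplier mentioned above would be:

public override int GetHashCode()
{
    unchecked
    {
        int hash = 17;
        // Same pattern as above, but multiplying by a large prime.
        hash = hash * 486187739 + field1.GetHashCode();
        hash = hash * 486187739 + field2.GetHashCode();
        hash = hash * 486187739 + field3.GetHashCode();
        return hash;
    }
}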

This is better than the common practice of XORing hashcodes for two main reasons. Suppose we have a type with two int fields:

XorHash(x, x) == XorHash(y, y) == 0 for all x, y
XorHash(x, y) == XorHash(y, x) for all x, y

By the way, the earlier algorithm is the one currently used by the C# compiler for anonymous types.

This page gives quite a few options. I think for most cases the above is "good enough" and it's incredibly easy to remember and get right. The FNV alternative is similarly simple, but uses different constants and XOR instead of ADD as a combining operation. It looks something like the code below, but the normal FNV algorithm operates on individual bytes, so this would require modifying to perform one iteration per byte, instead of per 32-bit hash value. FNV is also designed for variable lengths of data, whereas the way we're using it here is always for the same number of field values. Comments on this answer suggest that the code here doesn't actually work as well (in the sample case tested) as the addition approach above.

// Note: Not quite FNV!
public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = (int) 2166136261;
        // Suitable nullity checks etc, of course :)
        hash = (hash * 16777619) ^ field1.GetHashCode();
        hash = (hash * 16777619) ^ field2.GetHashCode();
        hash = (hash * 16777619) ^ field3.GetHashCode();
        return hash;
    }
}
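
For comparison, a true per-byte FNV-1a over a byte sequence looks more like the following sketch (using the standard 32-bit FNV constants; the method name is mine, not from the answer):

// Sketch: FNV-1a over raw bytes, one iteration per byte rather than per field.
static int Fnv1aHash(byte[] data)
{
    unchecked
    {
        const int fnvPrime = 16777619;
        int hash = (int) 2166136261;   // FNV offset basis
        foreach (byte b in data)
        {
            hash = (hash ^ b) * fnvPrime;
        }
        return hash;
    }
}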

Note that, ideally, you should prevent your equality-sensitive (and thus hashcode-sensitive) state from changing after adding it to a collection that depends on the hash code.

As per the documentation:

You can override GetHashCode for immutable reference types. In general, for mutable reference types, you should override GetHashCode only if:

  • You can compute the hash code from fields that are not mutable; or
  • You can ensure that the hash code of a mutable object does not change while the object is contained in a collection that relies on its hash code.
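
To illustrate why this matters, here is a minimal sketch (not from the original answer; the Point class is invented for the demo):

using System;
using System.Collections.Generic;

// Sketch: mutating hash-relevant state breaks hash-based collections.
class Point
{
    public int X, Y;
    public override int GetHashCode() => unchecked(X * 31 + Y);
    public override bool Equals(object obj) =>
        obj is Point p && p.X == X && p.Y == Y;
}

class Demo
{
    static void Main()
    {
        var set = new HashSet<Point>();
        var p = new Point { X = 1, Y = 2 };
        set.Add(p);

        p.X = 99;                            // hash code changes while p is in the set
        Console.WriteLine(set.Contains(p));  // usually prints False: p is now "lost"
    }
}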

The link to the FNV article is broken but here is a copy in the Internet Archive: Eternally Confuzzled - The Art of Hashing

Integer hash function colliding after few iterations

The reason is that your multiplication part is moving the bits out to the left, and if you have enough loop iterations the bits obtained from the first numbers in the list will eventually be thrown out completely and no longer have an effect on the final result.

The number 9176 can be written in binary as 10001111011000, and in practice the lowest 1-bit will dictate how many rounds you need to run before the first entry completely falls off the list.

The lowest 1-bit is at position 3 (the 4th position from the right), which means every multiplication shifts the bits from the first number at least 3 positions to the left. After roughly 11 iterations of this, that number has been moved completely out of the 32-bit buffer (int is 32-bit).

A better method (but see my comment below) would be to at least ensure no bits are completely lost, so a different but still fairly simple way of calculating the hash code could be like this:

hashCode = ((hashCode << 27) | (int)((uint)hashCode >> 5)) ^ c; // unsigned shift so the rotation also works for negative values

This basically rotates the current hash code 27 bits to the left (equivalently, 5 bits to the right): the bits that fall off one end are rotated back in at the other, so no bits are discarded, and the exclusive OR with c then bakes that value into the number as well.
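
Applied to a whole sequence, that combine step might look like this sketch (the method and parameter names are mine, not taken from the question):

using System.Collections.Generic;

// Sketch: hashing a sequence of ints with the rotate-and-XOR combine above.
static int HashSequence(IReadOnlyList<int> values)
{
    int hashCode = 0;
    foreach (int c in values)
    {
        hashCode = ((hashCode << 27) | (int)((uint)hashCode >> 5)) ^ c;
    }
    return hashCode;
}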


You should, however, use a more standardized way of calculating these hashes. My suggested change above is bound to have problems of its own; they're just not as obvious.

And really, because of the pigeonhole principle, you cannot calculate a unique 32-bit number for every possible list of numbers, and this has nothing to do with which hash code algorithm you're using. None of them will solve that part of the problem. So I would really ask you to rethink what you're trying to do in the first place.

How to combine hash codes in Python 3?

The easiest way to produce good hashes is to put your values in a standard hashable Python container, then hash that. This includes combining hashes in subclasses. I'll explain why, and then how.

Base requirements

First things first:

  • If two objects test as equal, then they MUST have the same hash value.
  • Objects that have a hash MUST produce the same hash over time.

Only when you follow those two rules can your objects safely be used in dictionaries and sets. The hash not changing is what keeps dictionaries and sets from breaking, as they use the hash to pick a storage location, and won't be able to locate the object again given another object that tests equal if the hash changed.

Note that it doesn’t even matter if the two objects are of different types; True == 1 == 1.0 so all have the same hash and will all count as the same key in a dictionary.

What makes a good hash value

You'd want to combine the components of your object value in ways that will produce, as much as possible, different hashes for different values. That includes things like ordering and specific meaning, so that two attributes that represent different aspects of your value, but that can hold the same type of Python objects, still result in different hashes, most of the time.

Note that it's fine if two objects that represent different values (won't test equal) have equal hashes. Reusing a hash value won't break sets or dictionaries. However, if a lot of different object values produce equal hashes then that reduces their efficiency, as you increase the likelihood of collisions. Collisions require collision resolution, and collision resolution takes more time, so much so that you can mount denial of service attacks on servers with predictable hashing implementations (*).

So you want a nice wide spread of possible hash values.

Pitfalls to watch out for

The documentation for the object.__hash__ method includes some advice on how to combine values:

The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.

but using only XOR will not produce good hash values, not when the values whose hashes you XOR together can be of the same type but have different meanings depending on the attribute they've been assigned to. To illustrate with an example:

>>> class Foo:
...     def __init__(self, a, b):
...         self.a = a
...         self.b = b
...     def __hash__(self):
...         return hash(self.a) ^ hash(self.b)
...
>>> hash(Foo(42, 'spam')) == hash(Foo('spam', 42))
True

Because the hashes for self.a and self.b were just XOR-ed together, we got the same hash value for either order, effectively halving the number of usable hashes. Do so with more attributes and you cut the number of unique hashes down rapidly. So you may want to include a bit more information in the hash about each attribute, if the same values can be used in different elements that make up the hash.

Next, know that while Python integers are unbounded, hash values are not. That is to say, hash values have a finite range. From the same documentation:

Note: hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds.

This means that if you use addition or multiplication or other operations that increase the number of bits needed to store the hash value, you will end up losing the upper bits and so reduce the number of different hash values again.

Next, if you combine multiple hashes with XOR that already have a limited range, chances are you end up with an even smaller number of possible hashes. Try XOR-ing the hashes of 1000 random integers in the range 0-10, for an extreme example.

Hashing, the easy way

Python developers have long since wrestled with the above pitfalls, and solved it for the standard library types. Use this to your advantage. Put your values in a tuple, then hash that tuple.

Python tuples use a simplified version of the xxHash algorithm to capture order information and to ensure a broad range of hash values. So for different attributes, you can capture the different meanings by giving them different positions in a tuple, then hashing the tuple:

def __hash__(self):
    return hash((self.a, self.b))

This way, different orderings virtually always produce different hash values.

If you are subclassing something, put the hash of the parent implementation into one of the tuple positions:

def __hash__(self):
    return hash((super().__hash__(), self.__more_data))

Hashing a hash value does reduce it to a 60-bit or 30-bit value (on 64-bit or 32-bit platforms, respectively), but that's not a big problem when combined with other values in a tuple. If you are really concerned about this, put None in the tuple as a placeholder and XOR the parent hash (so super().__hash__() ^ hash((None, self.__more_data))). But this is overkill, really.

If you have multiple values whose relative order doesn't matter, and you don't want to XOR these all together one by one, consider using a frozenset() object for fast processing, combined with a collections.Counter() object if values are not meant to be unique. The frozenset() hash operation accounts for small hash ranges by reshuffling the bits in hashes first:

# unordered collection hashing
from collections import Counter
hash(frozenset(Counter(...).items()))

As always, all values in the tuple or frozenset() must be hashable themselves.

Consider using dataclasses

For most objects you write __hash__ functions for, you actually want to be using a dataclass generated class:

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Foo:
    a: Union[int, str]
    b: Union[int, str]

Dataclasses are given a sane __hash__ implementation when frozen=True or unsafe_hash=True, using a tuple() of all the field values.


(*) Python protects your code against such hash collision attacks by using a process-wide random hash seed to hash strings, bytes and datetime objects.

Quickly creating 32 bit hash code uniquely identifying a struct composed of (mostly) primitive values

Well, even in a sparse table one had better be prepared for collisions, depending on what "sparse" means.

[Graph: hash collision probability as a function of the number of hashed values, assuming a uniform distribution]

You would need to be able to make very specific assumptions about the data you will be hashing at the same time to beat this graph with 32 bits.
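
To put rough numbers on that graph, here is a back-of-the-envelope sketch using the usual birthday-bound approximation (these figures are my illustration, not from the original answer):

using System;

// Sketch: birthday-bound approximation of the collision probability for a
// 32-bit hash after hashing itemCount distinct values.
static double CollisionProbability(long itemCount)
{
    const double hashSpace = 4294967296.0;   // 2^32 possible hash values
    return 1.0 - Math.Exp(-(double)itemCount * itemCount / (2.0 * hashSpace));
}

// CollisionProbability(9_300)   -> about 0.01 (1% chance of at least one collision)
// CollisionProbability(77_000)  -> about 0.50 (coin-flip odds)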

Go with SHA256. Your hashes will not depend on the CLR version and you will have no collisions. Well, you will still have some, but less frequently than meteorite impacts, so you can afford not to anticipate any.
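
A minimal sketch of that approach (the field layout, names, and serialization are assumptions, not code from the original answer):

using System.IO;
using System.Security.Cryptography;

// Sketch: a CLR-independent hash of a struct's fields via SHA-256.
static byte[] ComputeStableHash(int id, long timestamp, double value)
{
    using var stream = new MemoryStream();
    using (var writer = new BinaryWriter(stream))
    {
        writer.Write(id);
        writer.Write(timestamp);
        writer.Write(value);
    }
    using var sha256 = SHA256.Create();
    return sha256.ComputeHash(stream.ToArray());   // 256-bit result, effectively collision-free
}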


