What's a Correct and Good Way to Implement _Hash_()

What's a correct and good way to implement hash()?

An easy, correct way to implement __hash__() is to use a key tuple. It won't be as fast as a specialized hash, but if you need that then you should probably implement the type in C.

Here's an example of using a key for hash and equality:

class A:
    def __key(self):
        return (self.attr_a, self.attr_b, self.attr_c)

    def __hash__(self):
        return hash(self.__key())

    def __eq__(self, other):
        if isinstance(other, A):
            return self.__key() == other.__key()
        return NotImplemented

Also, the documentation of __hash__ has more information, that may be valuable in some particular circumstances.

How to implement a good hash function in python

__hash__ should return the same value for objects that are equal. It also shouldn't change over the lifetime of the object; generally you only implement it for immutable objects.

A trivial implementation would be to just return 0. This is always correct, but performs badly.

Your solution, returning the hash of a tuple of properties, is good. But note that you don't need to list all properties that you compare in __eq__ in the tuple. If some property usually has the same value for inequal objects, just leave it out. Don't make the hash computation any more expensive than it needs to be.

Edit: I would recommend against using xor to mix hashes in general. When two different properties have the same value, they will have the same hash, and with xor these will cancel eachother out. Tuples use a more complex calculation to mix hashes, see tuplehash in tupleobject.c.

What is the default hash in python?

What you can rely on: custom objects have a default hash() that is based in some way on the identity of the object. i.e. any object using the default hash will have a constant value for that hash over its lifetime and different objects may or may not have a different hash value.

You cannot rely on any particular relationship between the value returned by id() and the value returned by hash(). In the standard C implementation of Python 2.6 and earlier they were the same, in Python 2.7-3.2 hash(x)==id(x)/16.

Edit: originally I wrote that in releases 3.2.3 and later or 2.7.3 or later the hash value may be randomised and in Python 3.3 the relationship will always be randomised. In fact that randomisation at present only applies to hashing strings so in fact the divide by 16 relationship may continue to hold for now, but don't bank on it.

Hash collisions don't usually matter: in a dictionary lookup to find an object it must have the same hash and must also compare equal. Collisions only matter if you get a very high proportion of collisions such as in the denial of service attack that led to recent versions of Python being able to randomise the hash calculation.

Recommended way to implement eq and hash

Answering my own question. It seems one way of performing this is to define an auxillary __members function and to use that in defining __hash__ and __eq__. This way, there is no duplication:

class MyClass(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __members(self):
        return (self.a, self.b)

    def __eq__(self, other):
        if type(other) is type(self):
            return self.__members() == other.__members()
        else:
            return False

    def __hash__(self):
        return hash(self.__members())

Best implementation for hashCode method for a collection

The best implementation? That is a hard question because it depends on the usage pattern.

A for nearly all cases reasonable good implementation was proposed in Josh Bloch's Effective Java in Item 8 (second edition). The best thing is to look it up there because the author explains there why the approach is good.

A short version

Create a int result and assign a non-zero value.
For every field f tested in the equals() method, calculate a hash code c by:
- If the field f is a boolean:
  calculate (f ? 0 : 1);
- If the field f is a byte, char, short or int: calculate (int)f;
- If the field f is a long: calculate (int)(f ^ (f >>> 32));
- If the field f is a float: calculate Float.floatToIntBits(f);
- If the field f is a double: calculate Double.doubleToLongBits(f) and handle the return value like every long value;
- If the field f is an object: Use the result of the hashCode() method or 0 if f == null;
- If the field f is an array: see every field as separate element and calculate the hash value in a recursive fashion and combine the values as described next.
Combine the hash value c with result:
```
result = 37 * result + c
```
Return result

This should result in a proper distribution of hash values for most use situations.

Is the hash of a GUID unique?

Not as reliably unique as the GUID itself, no.

Just to expand, you are reducing your uniqueness by a factor of 4, going from 16 bytes to 4 bytes of possible combinations.

As pointed out in the comments the hash size will make a difference. The 4 byte thing was an assumption, horrible at best I know, that it may be used in .NET, where the default hash size is 4 bytes (int). So you can replace what I said above with whatever byte size your hash may be.

Built in Python hash() function

Use hashlib as hash() was designed to be used to:

quickly compare dictionary keys during a dictionary lookup

and therefore does not guarantee that it will be the same across Python implementations.

What's a Correct and Good Way to Implement _Hash_()