What Is the Default __hash__ in Python

What is the default __hash__ in Python?

What you can rely on: custom objects have a default hash() that is based in some way on the identity of the object. That is, any object using the default hash will have a constant hash value over its lifetime, and different objects may or may not have different hash values.

You cannot rely on any particular relationship between the value returned by id() and the value returned by hash(). In the standard C implementation of Python 2.6 and earlier they were the same; in Python 2.7-3.2, hash(x) == id(x)/16.
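A short sketch of what you can and cannot rely on (the class name Plain is made up for illustration):

```python
class Plain:
    pass  # no __eq__/__hash__, so the identity-based defaults apply

obj = Plain()
first = hash(obj)
assert hash(obj) == first  # constant over the object's lifetime
# The exact relationship to id() is implementation-specific; don't rely on it.
print(hash(obj), id(obj))
```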

Edit: I originally wrote that in releases 3.2.3 and later, or 2.7.3 and later, the hash value may be randomised, and that in Python 3.3 the relationship will always be randomised. In fact the randomisation at present only applies to hashing strings, so the divide-by-16 relationship may continue to hold for now, but don't bank on it.

Hash collisions don't usually matter: in a dictionary lookup to find an object it must have the same hash and must also compare equal. Collisions only matter if you get a very high proportion of collisions such as in the denial of service attack that led to recent versions of Python being able to randomise the hash calculation.
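To see collision handling in action, here is a hypothetical class whose instances all share one hash value; the dict still keeps them apart because they do not compare equal:

```python
class Collider:
    def __hash__(self):
        return 42  # every instance lands in the same hash bucket

a, b = Collider(), Collider()
d = {a: "first", b: "second"}
# A lookup requires an equal hash AND an __eq__ match, so both entries survive.
print(len(d))        # 2
print(d[a], d[b])    # first second
```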

What is the default hash of user-defined classes?

The relevant function appears to be:

Py_hash_t
_Py_HashPointer(void *p)
{
    Py_hash_t x;
    size_t y = (size_t)p;
    /* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
       excessive hash collisions for dicts and sets */
    y = (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4));
    x = (Py_hash_t)y;
    if (x == -1)
        x = -2;
    return x;
}

(That code comes from here, and is then used as the tp_hash slot in type here.) The comment there seems to give a reason for not using the pointer (which is the same thing as the id) directly. Indeed, the commit that introduced that change to the function is here, and states that the reason for the change is:

Issue #5186: Reduce hash collisions for objects with no hash
method by rotating the object pointer by 4 bits to the right.

which refers to this issue, which explains more why the change was made.
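The rotation can be sketched in Python (SIZEOF_VOID_P assumed to be 8, i.e. a 64-bit build; whether this matches hash(obj) on your interpreter is implementation-specific):

```python
def rotated_pointer_hash(p, sizeof_void_p=8):
    # Emulate _Py_HashPointer: rotate the pointer value right by 4 bits.
    bits = 8 * sizeof_void_p
    mask = (1 << bits) - 1
    y = ((p >> 4) | (p << (bits - 4))) & mask
    # Reinterpret the unsigned result as a signed Py_hash_t.
    x = y - (1 << bits) if y >= (1 << (bits - 1)) else y
    return -2 if x == -1 else x  # -1 is reserved for errors in CPython

print(rotated_pointer_hash(0x10))  # 1: the low zero bits are rotated away
```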

Using an object's id() as a hash value

The __hash__ method has to satisfy the following requirement in order to work:

For all x and y such that x == y, it must hold that hash(x) == hash(y).

In your case your class does not implement __eq__, which means that x == y if and only if id(x) == id(y), and thus your hash implementation satisfies the above property.

Note however that if you do implement __eq__ then this implementation will likely fail.
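In Python 3 this failure mode is explicit: defining __eq__ without __hash__ sets __hash__ to None, making instances unhashable (WithEq is a made-up example):

```python
class WithEq:
    def __init__(self, v):
        self.v = v

    def __eq__(self, other):
        return isinstance(other, WithEq) and self.v == other.v

# __hash__ is implicitly set to None, so hashing raises TypeError.
try:
    hash(WithEq(1))
except TypeError as e:
    print(e)  # unhashable type: 'WithEq'
```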

Also: there is a difference between having a "valid" __hash__ and having a good hash. For example the following is a valid __hash__ definition for any class:

def __hash__(self):
    return 1

A good hash should try to distribute objects uniformly so as to avoid collisions as much as possible. Usually this requires a more complex definition. I'd avoid trying to come up with formulas and instead rely on Python's built-in hash function.

For example if your class has fields a, b and c then I'd use something like this as __hash__:

def __hash__(self):
    return hash((self.a, self.b, self.c))

The definition of hash for tuples should be good enough for the average case.
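For instance, a hypothetical Point class with those three fields:

```python
class Point:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

    def __eq__(self, other):
        return (isinstance(other, Point) and
                (self.a, self.b, self.c) == (other.a, other.b, other.c))

    def __hash__(self):
        # Delegate to the tuple hash of the same fields used in __eq__.
        return hash((self.a, self.b, self.c))

p, q = Point(1, 2, 3), Point(1, 2, 3)
print(p == q, hash(p) == hash(q), len({p, q}))  # True True 1
```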

Finally: you should not define __hash__ in classes that are mutable (in the fields used for equality). That's because modifying the instances will change their hash, and this will break things.
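A short sketch of how mutation breaks things (the Box class is hypothetical):

```python
class Box:
    def __init__(self, v):
        self.v = v

    def __eq__(self, other):
        return isinstance(other, Box) and self.v == other.v

    def __hash__(self):
        return hash(self.v)

b = Box(1)
s = {b}
b.v = 2           # mutating a field used for equality changes the hash...
print(b in s)     # False: ...so the set can no longer find b
```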

Python - Using the default __hash__ method in __hash__ method definition

To call parent implementation use:

super(Foo, self).__hash__()

It also occurred to me that I could rewrite it as return …. This works, but seems even worse, as special methods are not intended to be called directly.

You are overriding a magic method, so it's ok to call parent's implementation directly.
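For example (the Foo class and its value field are made up):

```python
class Foo:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        # Mix the identity-based default hash with instance state.
        return hash((super().__hash__(), self.value))

f = Foo(10)
print(hash(f) == hash(f))  # True: still stable over the object's lifetime
```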

hash function in Python 3.3 returns different results between sessions

Python uses a random hash seed to prevent attackers from tar-pitting your application by sending you keys designed to collide. See the original vulnerability disclosure. By offsetting the hash with a random seed (set once at startup) attackers can no longer predict what keys will collide.

You can set a fixed seed or disable the feature by setting the PYTHONHASHSEED environment variable; the default is random but you can set it to a fixed positive integer value, with 0 disabling the feature altogether.

Python versions 2.7 and 3.2 have the feature disabled by default (use the -R switch or set PYTHONHASHSEED=random to enable it); it is enabled by default in Python 3.3 and up.
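A quick way to see the effect from a shell (assuming a python3 on PATH):

```shell
# With randomisation (the 3.3+ default), string hashes differ between runs:
python3 -c 'print(hash("spam"))'
python3 -c 'print(hash("spam"))'

# With a fixed seed, the hash is reproducible across processes:
PYTHONHASHSEED=42 python3 -c 'print(hash("spam"))'
PYTHONHASHSEED=42 python3 -c 'print(hash("spam"))'
```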

If you were relying on the order of keys in a Python set, then don't. Python uses a hash table to implement these types and their order depends on the insertion and deletion history as well as the random hash seed. Note that in Python 3.5 and older, this applies to dictionaries, too.

Also see the object.__hash__() special method documentation:

Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).


If you need a stable hash implementation, you probably want to look at the hashlib module; this implements cryptographic hash functions. The pybloom project uses this approach.
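A minimal sketch of deriving a stable integer from hashlib (the helper name stable_hash is made up):

```python
import hashlib

def stable_hash(text):
    # SHA-256 is deterministic across processes and platforms,
    # unlike the seeded built-in hash() for strings.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

print(stable_hash("spam") == stable_hash("spam"))  # True, in every session
```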

Since the offset consists of a prefix and a suffix (start value and final XORed value, respectively) you cannot just store the offset, unfortunately. On the plus side, this does mean that attackers cannot easily determine the offset with timing attacks either.

Python - class __hash__ method and set

Your reading is incorrect. The __eq__ method is used for equality checks. The documents just state that the __hash__ value must also be the same for two objects a and b for which a == b (i.e. a.__eq__(b)) is true.

This is a common logic mistake: a == b being true implies that hash(a) == hash(b) is also true. However, the implication does not run in the other direction: hash(a) == hash(b) does not mean that a == b.

To make all instances of MyClass compare equal to each other, you need to provide an __eq__ method for them; otherwise Python will compare their identities instead. This might do:

class MyClass(object):
    def __hash__(self):
        return 0

    def __eq__(self, other):
        # another object is equal to self, iff
        # it is an instance of MyClass
        return isinstance(other, MyClass)


>>> result = set()
>>> result.add(MyClass())
>>> result.add(MyClass())
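Putting the pieces together (restating the class so the snippet is self-contained):

```python
class MyClass(object):
    def __hash__(self):
        return 0

    def __eq__(self, other):
        return isinstance(other, MyClass)

result = set()
result.add(MyClass())
result.add(MyClass())
print(len(result))  # 1: the second add finds an equal member already present
```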

In reality you'd base the __hash__ on those properties of your object that are used for __eq__ comparison, for example:

class Person:
    def __init__(self, name, ssn):
        self.name = name
        self.ssn = ssn

    def __eq__(self, other):
        return isinstance(other, Person) and self.ssn == other.ssn

    def __hash__(self):
        # use the hash of self.ssn since that is used
        # for equality checks as well
        return hash(self.ssn)

p = Person('Foo Bar', 123456789)
q = Person('Fake Name', 123456789)
print(len({p, q}))  # 1
