Built-in Python hash() Function

Built-in Python hash() function

Use hashlib instead; hash() was designed to "quickly compare dictionary keys during a dictionary lookup" and therefore does not guarantee that it will be the same across Python implementations.
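
As a minimal sketch of the difference (the hash() result is salted per interpreter run on Python 3.3+, while a hashlib digest is stable for the same input bytes):

import hashlib

# hash() is fast but salted per interpreter run on Python 3.3+,
# so this value will usually differ between invocations.
print(hash("foo"))

# A cryptographic digest is stable across runs, implementations
# and machines for the same input bytes.
print(hashlib.sha256(b"foo").hexdigest())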

Python hash() function on strings

Hash values do not depend on an object's memory location but on its contents. From the documentation:

Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).

See CPython's implementation of str.__hash__ in:

  • Objects/unicodeobject.c (for unicode_hash)
  • Python/pyhash.c (for _Py_HashBytes)
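
A small sketch of the points in the quoted documentation: equal numeric values share a hash across types, and equal strings share a hash even when they are distinct objects (the actual hash integers vary between runs, so only the comparisons are shown):

>>> hash(1) == hash(1.0) == hash(True)
True
>>> s = "".join(["Look", " at", " me!"])   # a new, distinct string object
>>> hash(s) == hash("Look at me!")         # same contents, so the same hash
True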

What does hash() do in Python?

A hash is a fixed-size integer that identifies a particular value. Values that compare equal must have the same hash, so for the same value you will get the same hash even if it's not the same object.

>>> hash("Look at me!")
4343814758193556824
>>> f = "Look at me!"
>>> hash(f)
4343814758193556824

Hash values need to be created in such a way that the resulting values are evenly distributed to reduce the number of hash collisions you get. Hash collisions are when two different values have the same hash. Therefore, relatively small changes often result in very different hashes.

>>> hash("Look at me!!")
6941904779894686356

These numbers are very useful, as they enable quick look-up of values in a large collection. Two examples of their use are Python's set and dict. In a list, if you want to check whether a value is in the list, with if x in values:, Python needs to go through the whole list and compare x with each element of values. This can take a long time for a long list. In a set, Python keeps track of each hash, and when you type if x in values:, Python will get the hash value for x, look that up in an internal structure and then only compare x with the values that have the same hash as x.

The same methodology is used for dictionary lookup. This makes lookup in set and dict very fast, while lookup in list is slow. It also means you can have non-hashable objects in a list, but not in a set or as keys in a dict. The typical example of a non-hashable object is a mutable one, meaning an object whose value you can change. A mutable object should not be hashable, as its hash would then change over its lifetime, which would cause a lot of confusion: an object could end up under the wrong hash value in a dictionary.
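
A short sketch of that distinction: an immutable tuple is hashable and can be a set member or dict key, while a mutable list is not:

>>> hash((1, 2, 3)) == hash((1, 2, 3))   # immutable, hashable
True
>>> hash([1, 2, 3])                      # mutable, not hashable
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'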

Note that the hash of a value only needs to be the same for one run of Python. In Python 3.3 and later it will in fact change for every new run of Python:

$ /opt/python33/bin/python3
Python 3.3.2 (default, Jun 17 2013, 17:49:21)
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> hash("foo")
1849024199686380661
>>>
$ /opt/python33/bin/python3
Python 3.3.2 (default, Jun 17 2013, 17:49:21)
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> hash("foo")
-7416743951976404299

This is to make it harder to guess what hash value a certain string will have, which is an important security feature for web applications, etc.

Hash values should therefore not be stored permanently. If you need to use hash values in a permanent way you can take a look at the more "serious" types of hashes, cryptographic hash functions, that can be used for making verifiable checksums of files etc.
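
For example, a minimal sketch of a stable file checksum built with hashlib (the demo writes a small throwaway file just so the snippet is self-contained):

import hashlib
import tempfile

def file_checksum(path, chunk_size=65536):
    # Stream the file in chunks so large files do not need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# The same bytes give the same digest on any machine, in any run.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world\n")
print(file_checksum(tmp.name))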

Is this an appropriate use of python's built-in hash function?

Python's hash function is designed for speed, and maps into a 64-bit space. Due to the birthday paradox, this means you'll likely get a collision at about 5 billion entries (probably way earlier, since the hash function is not cryptographic). Also, the precise definition of hash is up to the Python implementation, and may be architecture- or even machine-specific. Don't use it if you want the same result on multiple machines.

md5 is designed as a cryptographic hash function; even slight perturbations in the input totally change the output. It also maps into a 128-bit space, which makes it unlikely you'll ever encounter a collision at all unless you're specifically looking for one.

If you can handle collisions (i.e. test for equality between all members in a bucket, possibly by using a cryptographic algorithm like MD5 or SHA2), Python's hash function is perfectly fine.

One more thing: to save space, you should store the data in binary form if you write it to disk (e.g. struct.pack('!q', hash('abc')) or hashlib.md5(b'abc').digest()).
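
A rough sketch of the size difference between the two binary forms (8 bytes for a packed hash() value versus 16 for a raw MD5 digest):

import hashlib
import struct

# hash() returns a signed integer that fits in 8 bytes on 64-bit builds.
packed = struct.pack('!q', hash('abc'))
print(len(packed))                    # 8

# An MD5 digest is 16 raw bytes (32 characters as a hex string).
digest = hashlib.md5(b'abc').digest()
print(len(digest))                    # 16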

As a side note: the is operator is not equivalent to == in Python; you mean ==.

hash function in Python 3.3 returns different results between sessions

Python uses a random hash seed to prevent attackers from tar-pitting your application by sending you keys designed to collide. See the original vulnerability disclosure. By offsetting the hash with a random seed (set once at startup) attackers can no longer predict what keys will collide.

You can set a fixed seed or disable the feature by setting the PYTHONHASHSEED environment variable; the default is random but you can set it to a fixed positive integer value, with 0 disabling the feature altogether.

Python versions 2.7 and 3.2 have the feature disabled by default (use the -R switch or set PYTHONHASHSEED=random to enable it); it is enabled by default in Python 3.3 and up.
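
A small sketch of the effect: because PYTHONHASHSEED has to be set before the interpreter starts, the snippet below launches child interpreters (via sys.executable, assumed to be a Python 3 interpreter) with a fixed seed so the string hash repeats across runs:

import os
import subprocess
import sys

def hash_of_foo(seed):
    # Run a child interpreter with PYTHONHASHSEED fixed so str hashes repeat.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('foo'))"],
        env=env, capture_output=True, text=True,
    )
    return result.stdout.strip()

# With the same seed, separate interpreter runs agree on the hash of 'foo'.
print(hash_of_foo(0) == hash_of_foo(0))   # True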

If you were relying on the order of keys in a Python set, then don't. Python uses a hash table to implement these types and their order depends on the insertion and deletion history as well as the random hash seed. Note that in Python 3.5 and older, this applies to dictionaries, too.

Also see the object.__hash__() special method documentation:

Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

See also PYTHONHASHSEED.

If you need a stable hash implementation, you probably want to look at the hashlib module; this implements cryptographic hash functions. The pybloom project uses this approach.

Since the offset consists of a prefix and a suffix (start value and final XORed value, respectively), you cannot just store the offset, unfortunately. On the plus side, this does mean that attackers cannot easily determine the offset with timing attacks either.

Python: Are `hash` values for built-in numeric types, strings standardised?

The hash values for strings and integers are absolutely not standardized. They could change with any new implementation of Python, including between 2.6.1 and 2.6.2, or between a Mac and a PC implementation of the same version, etc.

More importantly, though, stable hash values don't imply repeatable iteration order. You cannot depend on the ordering of values in a set, ever. Even within one process, two sets can be equal and not return their values in the same order. This can happen if one set has had many additions and deletions, but the other has not:

>>> a = set()
>>> for i in range(1000000): a.add(str(i))
...
>>> for i in range(6, 1000000): a.remove(str(i))
...
>>> b = set()
>>> for i in range(6): b.add(str(i))
...
>>> a == b
True
>>> list(a)
['1', '5', '2', '0', '3', '4']
>>> list(b)
['1', '0', '3', '2', '5', '4']

hash() function producing inconsistent hashes

hash() is only suitable for building mappings such as hash tables. It uses a random seed to prevent attacks. It is not a cryptographic hash and should not be counted on to be stable across Python invocations.

From the hash() function documentation:

Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).

and from the __hash__ hook method, which hash() calls if present:

Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

Stick to the hashlib module options; those are stable across calls.

Apart from this, within a single Python process, hash() on objects with the same value will also produce the exact same hash. Since your block dictionary changes between blocks (as it includes the hash of the preceding block in the chain), it will naturally not be the same string and so not the same hash value.

The same applies to the hashlib functions; they produce the same value for the same input only. If your hash values differ, then the input differs. And your inputs naturally differ because each block dictionary includes a reference to the preceding hash.
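
As a sketch of getting a repeatable digest for a dict-like block (the block layout below is hypothetical; the key point is serialising the dict the same way every time before hashing it):

import hashlib
import json

def block_digest(block):
    # Serialise deterministically (sorted keys, no extra whitespace) so that
    # equal content always produces the same bytes, then hash those bytes.
    payload = json.dumps(block, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

block = {"index": 1, "previous_hash": "0" * 64, "data": "hello"}  # hypothetical block
print(block_digest(block) == block_digest(dict(block)))           # True: same content, same digest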

How to implement a good __hash__ function in Python

__hash__ should return the same value for objects that are equal. It also shouldn't change over the lifetime of the object; generally you only implement it for immutable objects.

A trivial implementation would be to just return 0. This is always correct, but performs badly.

Your solution, returning the hash of a tuple of properties, is good. But note that you don't need to list all the properties that you compare in __eq__ in the tuple. If some property usually has the same value for unequal objects, just leave it out. Don't make the hash computation any more expensive than it needs to be.

Edit: I would recommend against using xor to mix hashes in general. When two different properties have the same value, they will have the same hash, and with xor these will cancel each other out. Tuples use a more complex calculation to mix hashes; see tuplehash in tupleobject.c.
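
A sketch of that pattern, hashing a tuple of the same properties used in __eq__ (the Point class and its fields are purely illustrative):

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Delegate mixing to tuple's hash; equal Points hash equal.
        return hash((self.x, self.y))

# Equal values can now be used interchangeably as dict keys or set members.
assert hash(Point(1, 2)) == hash(Point(1, 2))
assert len({Point(1, 2), Point(1, 2)}) == 1

Note that defining __eq__ without __hash__ makes a class unhashable in Python 3, which is why the two methods are implemented together here.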

Is python's hash() persistent?

The contract for the __hash__ method requires that it be consistent within a given run of Python. There is no guarantee that it be consistent across different runs of Python, and in fact, for the built-in str, bytes-like types, and datetime.datetime objects (possibly others), the hash is salted with a per-run value so that it's almost never the same for the same input in different runs of Python.


