Hashing a Dictionary

Hashing a dictionary?

If your dictionary is not nested, you could make a frozenset with the dict's items and use hash():

hash(frozenset(my_dict.items()))

This is much less computationally intensive than generating the JSON string or representation of the dictionary.

UPDATE: Please see the comments below, why this approach might not produce a stable result.

Python - hash() and dict

If the keys and values hash the same, frozenset is designed to be a stable and unique representation of the underlying values. The docs explicitly state:

Two sets are equal if and only if every element of each set is contained in the other (each is a subset of the other).

And the rules for hashable types require that:

Hashable objects which compare equal must have the same hash value.

So by definition frozensets with equal, hashable elements are equal and hash to the same value. This can only be violated if a user-defined class which does not obey the rules for hashing and equality is contained in the resulting frozenset (but then you've got bigger problems).

Note that this does not mean they'll iterate in the same order or produce the same repr; thanks to chaining on hash collisions, two frozensets constructed from the same elements in a different order need not iterate in the same order. But they're still equal to one another, and hash the same (precise outputs and ordering is implementation dependent, could easily vary between different versions of Python; this just happens to work on my Py 3.5 install to create the desired "different iteration order" behavior):

>>> frozenset([1,9])
frozenset({1, 9})
>>> frozenset([9,1])
frozenset({9, 1}) # <-- Different order; consequence of 8 buckets colliding for 1 and 9
>>> hash(frozenset([1,9]))
-7625378979602737914
>>> hash(frozenset([9,1]))
-7625378979602737914 # <-- Still the same hash though
>>> frozenset([1,9]) == frozenset([9,1])
True # <-- And still equal

What is the true difference between a dictionary and a hash table?

A dictionary is a general concept that maps keys to values. There are many ways to implement such a mapping.

A hashtable is a specific way to implement a dictionary.

Besides hashtables, another common way to implement dictionaries is red-black trees.

Each method has it's own pros and cons. A red-black tree can always perform a lookup in O(log N). A hashtable can perform a lookup in O(1) time although that can degrade to O(N) depending on the input.

How does the process of hashing work in DictionaryTKey, TValue

The hashing process in an Dictionary uses a technique that's refered to as chaining.
With chaining, a secondary data structure is utilized to hold any collisions. Specifically, each slot in the Dictionary has an array of elements that map to a bucket. In the event of a collision, the colliding element is prepended to the bucket's list.

See this article on MSDN for more details.

Is a Python dictionary an example of a hash table?

Yes, it is a hash mapping or hash table. You can read a description of python's dict implementation, as written by Tim Peters, here.

That's why you can't use something 'not hashable' as a dict key, like a list:

>>> a = {}
>>> b = ['some', 'list']
>>> hash(b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list objects are unhashable
>>> a[b] = 'some'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list objects are unhashable

You can read more about hash tables or check how it has been implemented in python and why it is implemented that way.

More consistent hashing in dictionary with python objects?

Python dicts are insert ordered. The repr respects that. Your hexdigest of {"A":1,"B":2} will differ from {"B":2,"A":1} whereas == - wise those dicts are the same.

Yours won't work out:

from  hashlib import md5

def yourHash(d):
return md5(repr(d).encode()).hexdigest()

a = {"A":1,"B":2}
b = {"B":2,"A":1}

print(repr(a))
print(repr(b))
print (a==b)

print(yourHash(a) == yourHash(b))

gives

{'A': 1, 'B': 2}   # repr a
{'B': 2, 'A': 1} # repr b
True # a == b

False # your hashes equall'ed

I really do not see the "sense" in hashing dicts at all ... and those ones here are not even "nested".


You could try JSON to sort keys down to the last nested one and using the json.dumps() of the whole structure to be hashed - but still - don't see the sense and it will give you plenty computational overhead:

import json
a = {"A":1,"B":2, "C":{2:1,3:2}}
b = {"B":2,"A":1, "C":{3:2,2:1}}

for di in (a,b):
print(json.dumps(di,sort_keys=True))

gives

{"A": 1, "B": 2, "C": {"2": 1, "3": 2}}  # thouroughly sorted
{"A": 1, "B": 2, "C": {"2": 1, "3": 2}} # recursively...

which is exactly what this answer in Hashing a dictionary? proposes ... why stray from it?

Hashing in dictionary in kdb

The creation of the hash table is not exposed to the user so you're unlikely to find out how it's done behind the scenes. Also the hash table isn't created out of the dictionary, it's created only for the list which forms the key of your dictionary.

Your dictionary lookup requires a lookup of the keys (a list) - the u# attribute speeds up the searching of said list by not defaulting to using linear search.

Some info here: http://www.timestored.com/kdb-guides/table-attributes#unique-attribute

Alter the hash function of a dictionary

You can't change the hash-function - the dict will call hash on the keys it's supposed to insert, and that's that.

However, you can wrap the keys to provide different __hash__ and __eq__-Methods.

class MyHash(object):
def __init__(self, v):
self._v = v

def __hash__(self):
return hash(self._v) * -1

def __eq__(self, other):
return self._v == other._v

If this actually helps anything with your original problem/question I doubt though, it seems rather a custom array/list-based data-structure might be the answer. Or not.

What hash algorithm does Python's dictionary mapping use?

The hash that it uses depends on the object being used as a key -- each class can define its own __hash__() method, and the value that it returns for a particular instance is what is used for the dictionary.

Python itself provides the hash implementation for str and tuple types. A quick look at the source should reveal the exact algorithm for those.

The hash of a tuple is based on the hashes of its contents. The algorithm is essentially this (simplified slightly):

def hash(tuple):
mult = 1000003
x = 0x345678
for index, item in enumerate(tuple):
x = ((x ^ hash(item)) * mult) & (1<<32)
mult += (82520 + (len(tuple)-index)*2)
return x + 97531

For strings, the interpreter also iterates over every character, combining them with this (again, slightly simplified) algorithm:

def hash(string):
x = string[0] << 7
for chr in string[1:]:
x = ((1000003 * x) ^ chr) & (1<<32)
return x

A bigger issue to worry about is avoiding hash collisions. Colliding hash keys will cause a linear search as the dictionary tries to find a place to store the new object (this is now being recognized as a security issue, and tha behavior may be changing in upcoming python versions)



Related Topics



Leave a reply



Submit