Is a Python Dictionary an Example of a Hash Table

Is a Python dictionary an example of a hash table?

Yes, it is a hash map, also known as a hash table. You can read a description of Python's dict implementation, as written by Tim Peters, here.

That's why you can't use something that isn't hashable as a dict key, such as a list:

>>> a = {}
>>> b = ['some', 'list']
>>> hash(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> a[b] = 'some'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
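
By contrast, an immutable object such as a tuple is hashable, so it works fine as a key:

>>> a = {}
>>> b = ('some', 'tuple')
>>> a[b] = 'some'
>>> a[b]
'some'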

You can read more about hash tables or check how it has been implemented in Python and why it is implemented that way.

What is the true difference between a dictionary and a hash table?

A dictionary is a general concept that maps keys to values. There are many ways to implement such a mapping.

A hash table is a specific way to implement a dictionary.

Besides hash tables, another common way to implement dictionaries is with red-black trees.

Each method has its own pros and cons. A red-black tree can always perform a lookup in O(log N). A hash table can perform a lookup in O(1) time, although that can degrade to O(N) depending on the input.

How does Python implement dictionaries?

Some of the following answers are taken from Upgrade your Python skills: Examining the Dictionary. More information on Python hash tables can be found in Python Hash Tables Under The Hood:

When we create a dictionary what is its initial size?


As can be seen in the source code:
/* PyDict_MINSIZE is the starting size for any new dict.
* 8 allows dicts with no more than 5 active entries; experiments suggested
* this suffices for the majority of dicts (consisting mostly of usually-small
* dicts created to pass keyword arguments).
* Making this 8, rather than 4 reduces the number of resizes for most
* dictionaries, without any significant extra memory use.
*/
#define PyDict_MINSIZE 8
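
You can get a rough feel for these resizes from Python itself by watching sys.getsizeof as a dict grows; the exact byte counts, and the exact insert that triggers each resize, depend on the CPython version:

import sys

d = {}
last = sys.getsizeof(d)
for i in range(32):
    d[i] = None
    size = sys.getsizeof(d)
    if size != last:  # the backing table was just reallocated
        print(f"grew after inserting key #{i + 1}: {last} -> {size} bytes")
        last = size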

Imagine we insert a lot of key-value pairs; I suppose the hash table needs to be extended. I suppose the slot indices then need to be recomputed for the new, larger hash table while staying consistent with the previous one...

CPython checks the hash table size every time we add a key. If the table is two-thirds full, it resizes the hash table to GROWTH_RATE (currently used*3) and re-inserts all elements:

/* GROWTH_RATE. Growth rate upon hitting maximum load.
* Currently set to used*3.
* This means that dicts double in size when growing without deletions,
* but have more head room when the number of deletions is on a par with the
* number of insertions. See also bpo-17563 and bpo-33205.
*
* GROWTH_RATE was set to used*4 up to version 3.2.
* GROWTH_RATE was set to used*2 in version 3.3.0
* GROWTH_RATE was set to used*2 + capacity/2 in 3.4.0-3.6.0.
*/
#define GROWTH_RATE(d) ((d)->ma_used*3)

The USABLE_FRACTION is the two thirds I mentioned above:

/* USABLE_FRACTION is the maximum dictionary load.
* Increasing this ratio makes dictionaries more dense resulting in more
* collisions. Decreasing it improves sparseness at the expense of spreading
* indices over more cache lines and at the cost of total memory consumed.
*
* USABLE_FRACTION must obey the following:
* (0 < USABLE_FRACTION(n) < n) for all n >= 2
*
* USABLE_FRACTION should be quick to calculate.
* Fractions around 1/2 to 2/3 seem to work well in practice.
*/
#define USABLE_FRACTION(n) (((n) << 1)/3)
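
Putting those two macros together, here is a small Python model (not CPython's actual code, just a sketch of the policy described above) of how the new table size is chosen:

PyDict_MINSIZE = 8

def usable(n):
    # USABLE_FRACTION(n): a table with n slots holds at most 2n/3 entries
    return (n << 1) // 3

def new_table_size(used):
    # Model of a resize: grow to the smallest power of two that exceeds
    # GROWTH_RATE(d) == used * 3
    target = used * 3
    size = PyDict_MINSIZE
    while size <= target:
        size <<= 1
    return size

print(usable(8))           # 5  -> matches "no more than 5 active entries"
print(new_table_size(5))   # 16 -> a full 8-slot dict grows to 16 slots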

Furthermore, the index calculation is:

i = (size_t)hash & mask;

where mask is HASH_TABLE_SIZE-1.

Here's how hash collisions are dealt with:

perturb >>= PERTURB_SHIFT;
i = (i*5 + perturb + 1) & mask;

Explained in the source code:


The first half of collision resolution is to visit table indices via this
recurrence:
j = ((5*j) + 1) mod 2**i
For any initial j in range(2**i), repeating that 2**i times generates each
int in range(2**i) exactly once (see any text on random-number generation for
proof). By itself, this doesn't help much: like linear probing (setting
j += 1, or j -= 1, on each loop trip), it scans the table entries in a fixed
order. This would be bad, except that's not the only thing we do, and it's
actually *good* in the common cases where hash keys are consecutive. In an
example that's really too small to make this entirely clear, for a table of
size 2**3 the order of indices is:
0 -> 1 -> 6 -> 7 -> 4 -> 5 -> 2 -> 3 -> 0 [and here it's repeating]
If two things come in at index 5, the first place we look after is index 2,
not 6, so if another comes in at index 6 the collision at 5 didn't hurt it.
Linear probing is deadly in this case because there the fixed probe order
is the *same* as the order consecutive keys are likely to arrive. But it's
extremely unlikely hash codes will follow a 5*j+1 recurrence by accident,
and certain that consecutive hash codes do not.
The other half of the strategy is to get the other bits of the hash code
into play. This is done by initializing a (unsigned) vrbl "perturb" to the
full hash code, and changing the recurrence to:
perturb >>= PERTURB_SHIFT;
j = (5*j) + 1 + perturb;
use j % 2**i as the next table index;
Now the probe sequence depends (eventually) on every bit in the hash code,
and the pseudo-scrambling property of recurring on 5*j+1 is more valuable,
because it quickly magnifies small differences in the bits that didn't affect
the initial index. Note that because perturb is unsigned, if the recurrence
is executed often enough perturb eventually becomes and remains 0. At that
point (very rarely reached) the recurrence is on (just) 5*j+1 again, and
that's certain to find an empty slot eventually (since it generates every int
in range(2**i), and we make sure there's always at least one empty slot).
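
To see that recurrence in action, here is a small Python sketch of the probe sequence, mirroring the two lines of code quoted above (PERTURB_SHIFT is 5 in CPython; assume a non-negative hash for simplicity):

from itertools import islice

PERTURB_SHIFT = 5

def probe_sequence(hash_value, table_size):
    # Yield the slot indices that would be tried for this hash,
    # assuming table_size is a power of two.
    mask = table_size - 1
    perturb = hash_value
    i = hash_value & mask
    while True:
        yield i
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask

# The first few candidate slots for an arbitrary hash in an 8-slot table:
print(list(islice(probe_sequence(12345, 8), 6)))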

Why can a Python dict have multiple keys with the same hash?

For a detailed description of how Python's hashing works see my answer to Why is early return slower than else?

Basically it uses the hash to pick a slot in the table. If there is a value in the slot and the hash matches, it compares the items to see if they are equal.

If the hash matches but the items aren't equal, then it tries another slot. There's a formula to pick this (which I describe in the referenced answer), and it gradually pulls in unused parts of the hash value; but once it has used them all up, it will eventually work its way through all slots in the hash table. That guarantees that eventually we either find a matching item or an empty slot. When the search finds an empty slot, it inserts the value or gives up (depending on whether we are adding or getting a value).

The important thing to note is that there are no lists or buckets: there is just a hash table with a particular number of slots, and each hash is used to generate a sequence of candidate slots.
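
For example, in CPython hash(-1) and hash(-2) are both -2 (-1 is reserved as an error code at the C level), yet a dict happily holds both keys because the equality comparison tells them apart:

>>> hash(-1), hash(-2)
(-2, -2)
>>> d = {-1: 'minus one', -2: 'minus two'}
>>> len(d)
2
>>> d[-2]
'minus two'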

Is there a way to create a hash table without dictionaries?

A dictionary is a hash table, so the only way to use a hash table without using a dictionary would be to implement your own, which I would not recommend.

You don't have to compare every tuple to every other tuple. Instead, you can "normalize" the tuples by rotating them such that the smallest element is in front, e.g. 3,2,6,9,4 would be rotated to 2,6,9,4,3. If the list contains the minimum element more than once, you have to find all the positions of the minimum element and get the rotation that is "smallest" by natural order of lists, e.g. 1,2,3,4,1 is normalized to 1,1,2,3,4 and 1,3,1,2 to 1,2,1,3.

def normalize(lst):
    min_ = min(lst)
    return min(lst[i:] + lst[:i] for i, e in enumerate(lst) if e == min_)

You could then just put the normalized tuples into a set of "unique" tuples in O(n).

lsts = [(1, 2, 3, 4, 5), (2, 3, 4, 5, 1), (1, 3, 2, 4, 5)]
unique = set(map(normalize, lsts)) # has length 2

It could be argued, though, that a set is just a dict with no values. In that case, you could sort the list of normalized tuples and check adjacent entries for duplicates in O(n log n).

srtd = sorted(map(normalize, lsts))
n = len(lsts) - sum(srtd[i-1] == srtd[i] for i in range(1, len(srtd))) # 2

Does Python optimize dictionary lookups under the hood?

Python's dictionary implementation reduces the average complexity of dictionary lookups to O(1) by requiring that key objects provide a "hash" function. Such a hash function takes the information in a key object and uses it to produce an integer, called a hash value. This hash value is then used to determine which "bucket" this (key, value) pair should be placed into. Pseudocode for this lookup function might look something like:

def lookup(d, key):
    '''dictionary lookup is done in three steps:

       1. A hash value of the key is computed using a hash function.

       2. The hash value addresses a location in d.data which is
          supposed to be an array of "buckets" or "collision lists"
          which contain the (key, value) pairs.

       3. The collision list addressed by the hash value is searched
          sequentially until a pair is found with pair[0] == key. The
          return value of the lookup is then pair[1].
    '''
    h = hash(key)       # step 1
    cl = d.data[h]      # step 2
    for pair in cl:     # step 3
        if key == pair[0]:
            return pair[1]
    else:
        raise KeyError("Key %s not found." % key)

From the Python Wiki

How does the hash table in a set work in Python?

To be usable in sets and dicts, a __hash__ method must guarantee that if x == y, then hash(x) == hash(y). But that's a one-sided implication. It's not at all required that if hash(x) == hash(y) then x == y must be true. Indeed, that's impossible to achieve in general (for example, there are an unbounded number of distinct Python ints, but only a finite number of hash codes - there must be distinct ints that have the same hash value).

That your hashes are all the same is fine. They only tell the set/dict where to start looking. All objects in the container with the same hash are then compared, one by one, for equality, until success, or until all such objects have been tried without success.

However, while making all hashes the same doesn't hurt correctness, it's a disaster for performance: it effectively turns the set/dict into an exceptionally slow way to do an O(n) linear search.
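
A quick way to see that slowdown, using hypothetical toy classes (timings will vary by machine): both classes below are correct, but the constant hash forces every insertion into a long chain of equality comparisons.

import time

class BadHash:
    def __init__(self, x):
        self.x = x
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.x == other.x
    def __hash__(self):
        return 42             # every instance collides

class GoodHash(BadHash):
    def __hash__(self):
        return hash(self.x)   # hashes spread across the table

for cls in (GoodHash, BadHash):
    start = time.perf_counter()
    s = {cls(i) for i in range(2000)}
    print(cls.__name__, round(time.perf_counter() - start, 3), "seconds")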

Handling hash collisions in Python dictionaries

Generally, you would key on the most distinctive element of your user record. This usually means that the system has a username or a unique ID per record (user), which is guaranteed to be unique. The username or ID would be the unique key for the record. Since this is enforced by the system itself, for example by means of an auto-increment key in a database table, you can be sure that there is no collision.

THAT unique key therefore should be the key in your map to allow you to find a user record.

However, if for some reason you don't have access to such a guaranteed-to-be-unique key, you can certainly create a hash from the record (as described by you) and use any of a number of hash table algorithms to store elements with possibly colliding keys. In that case, you don't avoid the collision, but you simply deal with it.

A quick and commonly used algorithm for that goes like this: Use a hash over the record to create a key, as you already do. This key may potentially not be unique. Now store a list of records at the location indicated by the key. We call those lists 'buckets'. To store a new element, hash it and then append it to the list stored at that location (add it to the bucket). To find an element, hash it, find the entry, then sequentially search through the list/bucket at that location to find the entry you want.

Here's an example:

mymap[123] = [ {'name':'John','age':27}, {'name':'Bob','age':19} ]
mymap[678] = [ {'name':'Frank','age':29} ]

In the example, you have your hash table (implemented via a dict). You have hash key value 678, for which one entry is stored in the bucket. Then you have hash key value 123, but there is a collision: both the 'John' and 'Bob' entries have this hash value. No matter: you find the bucket stored at mymap[123] and iterate over it to find the value.

This is a flexible and very common implementation of hash maps, which doesn't require re-allocation or other complications. It's described in many places, for example here: https://www.cs.auckland.ac.nz/~jmor159/PLDS210/hash_tables.html (in chapter 8.3.1).

Performance generally only becomes an issue when you have lots of collisions (when the lists for each bucket get really long). Something you'll avoid with a good hash function.

However: A true unique ID for your record, enforced by the database for example, is probably still the preferred approach.
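
Rounded out into a small store/find pair, the bucket approach described above could look like the sketch below (record_hash and the field names are just placeholders for illustration):

from collections import defaultdict

def record_hash(name, age):
    # Placeholder hash over the record's identifying fields.
    return hash((name, age)) % 1000

def store(table, record):
    # Append the record to the bucket (list) for its hash value.
    table[record_hash(record['name'], record['age'])].append(record)

def find(table, name, age):
    # Hash the lookup fields, then scan that bucket sequentially.
    for rec in table.get(record_hash(name, age), []):
        if rec['name'] == name and rec['age'] == age:
            return rec
    return None

table = defaultdict(list)
store(table, {'name': 'John', 'age': 27})
store(table, {'name': 'Bob', 'age': 19})
print(find(table, 'Bob', 19))   # {'name': 'Bob', 'age': 19}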

Implementation of a hash table/dictionary in Python

You need to keep your hash and your table size separate. The hash should be based on the key alone, not the size. Store the following information per key-value entry:

  • The key
  • The hash - a constant value derived from the key. Make sure it has enough breadth to avoid too many collisions.
  • The value

You pick a slot based on the table size and the hash:

slot = hash % tablesize

Then, when you run out of space in your current table, produce a new table (say, double the size) to accommodate your growing data set, and reslot everything. You already have the hashes cached; all you have to do is take each (key, hash, value) tuple and re-calculate the new slot with the above formula, now with a larger tablesize.

You'll also have to decide how to handle collisions: two hashes that, given the current table size, end up in the same slot. Python's dict uses open addressing, where the hash is 'perturbed' in a reproducible fashion until another empty slot is found.
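
A minimal sketch of that scheme might look like the class below; it caches the hash next to each key and value, uses plain linear probing rather than CPython's perturbed probing, and ignores deletions for simplicity:

class SimpleHashTable:
    def __init__(self, size=8):
        self.slots = [None] * size    # each slot is None or (key, hash, value)
        self.used = 0

    def _find_slot(self, key, h, slots):
        # Start at hash % tablesize and probe linearly until we hit
        # either an empty slot or the slot already holding this key.
        i = h % len(slots)
        while slots[i] is not None and slots[i][0] != key:
            i = (i + 1) % len(slots)
        return i

    def __setitem__(self, key, value):
        if (self.used + 1) * 3 >= len(self.slots) * 2:   # keep load under 2/3
            self._resize(len(self.slots) * 2)
        h = hash(key)
        i = self._find_slot(key, h, self.slots)
        if self.slots[i] is None:
            self.used += 1
        self.slots[i] = (key, h, value)

    def __getitem__(self, key):
        i = self._find_slot(key, hash(key), self.slots)
        if self.slots[i] is None:
            raise KeyError(key)
        return self.slots[i][2]

    def _resize(self, newsize):
        # Reslot every entry into the bigger table, reusing the cached hash.
        entries = [e for e in self.slots if e is not None]
        self.slots = [None] * newsize
        for key, h, value in entries:
            self.slots[self._find_slot(key, h, self.slots)] = (key, h, value)

t = SimpleHashTable()
for i in range(20):
    t['key%d' % i] = i
print(t['key13'])   # 13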

You may want to study the Python dict source code to see how they do this, see the extensive comments on how collisions are handled. You could also watch this PyCon presentation where Brandon Rhodes explains all those details with some very enlightening graphics. Or you could pick up a copy of Beautiful Code which has a whole chapter on the Python dict implementation.


