How Do Hashtables Deal with Collisions

How do HashTables deal with collisions?

Hash tables deal with collisions in one of two ways.

Option 1: By having each bucket contain a linked list of the elements that hash to that bucket. This is why a bad hash function can make lookups in hash tables very slow: if too many elements hash to the same bucket, a lookup degenerates into a linear scan of that bucket's list.

Option 2: If the hash table becomes too full, it can increase the number of buckets it has and then redistribute all the elements in the table. The hash function returns an integer, and the hash table takes that result and mods it against the number of buckets so that it is sure to land on a valid bucket. By increasing the size, it rehashes and reruns the modulo calculations, which spreads the entries across more buckets and, with luck, sends previously colliding objects to different ones.

Java uses both options 1 and 2 in its hash table implementations.
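
Here is a minimal sketch of how the two options can fit together, written in Java purely for illustration. The class name SimpleChainedMap, the initial capacity of 8, and the 0.75 load-factor threshold are assumptions for the example; this is not how java.util.HashMap is actually coded.

    import java.util.LinkedList;

    // Illustrative chained hash table: option 1 (chains per bucket) plus
    // option 2 (grow and rehash when the table gets too full).
    class SimpleChainedMap<K, V> {

        private static class Node<K, V> {
            final K key;
            V value;
            Node(K key, V value) { this.key = key; this.value = value; }
        }

        private LinkedList<Node<K, V>>[] buckets = newBucketArray(8);
        private int size = 0;

        @SuppressWarnings("unchecked")
        private static <K, V> LinkedList<Node<K, V>>[] newBucketArray(int capacity) {
            return (LinkedList<Node<K, V>>[]) new LinkedList[capacity];
        }

        private int indexFor(Object key, int capacity) {
            // Option 2's key idea: reduce the (possibly huge) hash code to a valid bucket index.
            return Math.floorMod(key.hashCode(), capacity);
        }

        public void put(K key, V value) {
            if (size + 1 > buckets.length * 0.75) {   // assumed load-factor threshold
                resize(buckets.length * 2);
            }
            int i = indexFor(key, buckets.length);
            if (buckets[i] == null) buckets[i] = new LinkedList<>();
            for (Node<K, V> n : buckets[i]) {         // option 1: walk the chain in this bucket
                if (n.key.equals(key)) { n.value = value; return; }   // key already present: overwrite
            }
            buckets[i].add(new Node<>(key, value));
            size++;
        }

        private void resize(int newCapacity) {
            LinkedList<Node<K, V>>[] old = buckets;
            buckets = newBucketArray(newCapacity);
            for (LinkedList<Node<K, V>> chain : old) {
                if (chain == null) continue;
                for (Node<K, V> n : chain) {
                    // Rehash: the modulus changed, so entries may move to different buckets.
                    int i = indexFor(n.key, newCapacity);
                    if (buckets[i] == null) buckets[i] = new LinkedList<>();
                    buckets[i].add(n);
                }
            }
        }
    }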

How are hash collisions handled?

Fundamentally, there are two major ways of handling hash collisions: separate chaining, where items with colliding hash codes are stored in a separate data structure, and open addressing, where colliding data is stored in another available bucket selected by some probing algorithm.

Both strategies have numerous sub-strategies, described in Wikipedia. The exact strategy used by a particular implementation is, not surprisingly, implementation-specific, so the authors can change it at any time for something more efficient without breaking the assumptions of their users.

At this point, the only way to find out how Swift handles collisions would be disassembling the library (that is, unless you work for Apple and have access to the source code). Curious people did that to NSDictionary, and determined that it uses linear probing, the simplest variation of the open addressing technique.
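
To make that concrete, here is a rough sketch of linear probing, written in Java for consistency with the rest of this page. It is purely illustrative, not NSDictionary's actual code, and it omits deletion and resizing (it assumes the table never fills up): on a collision, the table simply tries the next slot, wrapping around, until it finds the key or an empty slot.

    // Illustrative linear probing (open addressing) for String keys and int values.
    class LinearProbingTable {
        private final String[] keys;
        private final int[] values;

        LinearProbingTable(int capacity) {
            keys = new String[capacity];
            values = new int[capacity];
        }

        public void put(String key, int value) {
            int i = Math.floorMod(key.hashCode(), keys.length);
            // On a collision, probe the next slot (wrapping around) until we find
            // the same key or an empty slot. Assumes the table never fills up.
            while (keys[i] != null && !keys[i].equals(key)) {
                i = (i + 1) % keys.length;
            }
            keys[i] = key;
            values[i] = value;
        }

        public Integer get(String key) {
            int i = Math.floorMod(key.hashCode(), keys.length);
            while (keys[i] != null) {
                if (keys[i].equals(key)) return values[i];
                i = (i + 1) % keys.length;   // keep probing past colliding entries
            }
            return null;                     // hit an empty slot: key not present
        }
    }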

How is data retrieved from a HashTable in collision?

There are many different data structures possible for handling the collisions. Here's a good summary: http://en.wikipedia.org/wiki/Hash_table

The hash function narrows the search down to a single bucket in the data structure. The bucket then contains another data structure for resolving collisions. It can be a link to an array, where the keys are maintained in sorted or unsorted order. The link might be to the first element in a linked list of keys, or to the root node of a B-tree. The important point is that the hashing function very quickly narrows down the scope of the search.

Once the scope is narrowed, some other, less efficient search can be used to work through the collisions. It's all about the trade-offs. You want a hashing algorithm that gives a large enough range of hashes (and buckets) to minimize collisions, limited by how much memory you can afford. If collisions are rare, a linear search through a linked list of collisions isn't bad. If there are many collisions, then the efficiency of resizing the arrays used for the buckets becomes more important.
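
As a small illustration of that two-step lookup, here is a hypothetical helper for a chained table whose buckets are lists of Map.Entry pairs (the names and the bucket representation are assumptions for the example, not a real library API): the hash narrows the search to one bucket, then a plain linear scan of that bucket resolves any collisions.

    import java.util.List;
    import java.util.Map;

    // Hypothetical lookup helper for a chained hash table.
    final class ChainedLookup {
        static <K, V> V get(List<Map.Entry<K, V>>[] buckets, K key) {
            int i = Math.floorMod(key.hashCode(), buckets.length);   // step 1: hash narrows to one bucket
            if (buckets[i] == null) {
                return null;                                          // nothing ever hashed to this bucket
            }
            for (Map.Entry<K, V> entry : buckets[i]) {                // step 2: linear scan resolves collisions
                if (entry.getKey().equals(key)) {
                    return entry.getValue();
                }
            }
            return null;                                              // bucket had entries, but not this key
        }
    }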

Why does increasing size of hashtable decrease the number of collisions?

This is because the bucket index is calculated by taking the hash value modulo the array size. When the array size is bigger, the modulo operation has more possible results, so values are spread over more buckets and are less likely to land in the same one.

Eg:

Suppose the array size is 3 and the values passed in are 2 and 5.

Then 2 % 3 and 5 % 3 land in the same place, i.e. index 2, so they collide.

Now take an array size of 5.

Then 2 % 5 and 5 % 5 land in different places, i.e. indices 2 and 0 respectively.

So with an increase in hash table size, the number of collisions decreases.
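
The same arithmetic as a small runnable snippet, for anyone who wants to check it:

    public class ModuloDemo {
        public static void main(String[] args) {
            // With 3 buckets, both values land in the same bucket.
            System.out.println(2 % 3);   // 2
            System.out.println(5 % 3);   // 2  -> collision

            // With 5 buckets, they spread out into different buckets.
            System.out.println(2 % 5);   // 2
            System.out.println(5 % 5);   // 0  -> no collision
        }
    }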

Hope this explanation helps you.

Hash collisions in Java's ConcurrentHashMap: which precautions must be taken?

Java's ConcurrentHashMap uses a hash table as the underlying data structure to store entries. The way this table deals with collisions is described here:
How do HashTables deal with collisions?

Generally you don't need to be concerned about hash collisions when using the HashMap and ConcurrentHashMap types of the standard library. They are guaranteed to behave correctly even for keys that have the same hash values; collisions only affect performance, not correctness.
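
For example, a key class with a deliberately terrible hashCode still works correctly in a HashMap, because the map uses equals to tell colliding keys apart; only the lookup speed suffers. A hypothetical sketch (BadKey and the constant 42 are made up for the demonstration):

    import java.util.HashMap;
    import java.util.Map;

    // A deliberately bad key: every instance gets the same hash code.
    final class BadKey {
        private final String name;
        BadKey(String name) { this.name = name; }

        @Override public int hashCode() { return 42; }   // every key collides

        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).name.equals(name);
        }
    }

    public class CollisionDemo {
        public static void main(String[] args) {
            Map<BadKey, String> map = new HashMap<>();
            map.put(new BadKey("a"), "first");
            map.put(new BadKey("b"), "second");

            // Both lookups succeed despite identical hash codes;
            // correctness is preserved, only performance suffers.
            System.out.println(map.get(new BadKey("a")));   // first
            System.out.println(map.get(new BadKey("b")));   // second
        }
    }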

Understanding of hash tables and collision detection

Your implementation of a hash table is correct. I should point out that what you've described in your question isn't collision detection but the operation of updating a key with a new value. A collision is when two different keys map to the same bucket, not when you insert a key and discover that there is a prior entry with the same key. You're already taking care of collisions by chaining entries in the same bucket.

In any case, you've gone about updating entries correctly. Let's say you've inserted the (key, value) pair ('a', 'ant') into the hash table. This means that 'a' maps to 'ant'. If you insert ('a', 'aardvark'), the intention is to overwrite the 'a' entry so that it now maps to 'aardvark'. Therefore, you iterate over the chain of entries and check for the key 'a' in the bucket. You find it, so you replace the value 'ant' with 'aardvark'. Now 'a' maps to 'aardvark'. Good.

Let's suppose you don't iterate over the chain of entries. What happens if you blindly append ('a', 'aardvark') to the end of the chain? The consequence is that when you look up the key 'a' and you go through the entries in the bucket, you come upon ('a', 'ant') first, so you return 'ant'. This is an incorrect result. You recently inserted ('a', 'aardvark'), so you should have returned 'aardvark'.

Ah, but what if you always start searching through the chain from the end? In other words, you're treating it as a stack. To insert an entry, you push it onto the end of the chain. To look up a key, you start searching from the end. The first entry with the given key is the one that was most recently inserted, so it's the correct one and you can return the value without searching further.

That implementation would be correct, but it would also make the chain longer than necessary. Consider what happens if you're using the hash table to count letter frequencies in a text file. Initially you insert ('a', 0) in the table. When you find the first occurrence of 'a' in the text, you read 0 from the table, add 1 to it, and insert ('a', 1) into the hash table. Now you have two entries in the chain with the key 'a', and only the one nearer to the end is valid. When you find the next occurrence of 'a', a third entry is added to the chain, and so on. Thousands of insertions with the same key result in thousands of entries in the chain.

Not only does this use up memory, it slows down other key insertions. For example, suppose your hash function assigns the same index to the keys 'a' and 'q'. This means that the 'q' entries are in the same bucket as the 'a' entries. If you have a whole bunch of 'a' entries at the end of the chain, you might have to go past many of them before you find the most recent entry with 'q'. For these reasons, it's better to do what you did.
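
In code, the approach recommended here looks roughly like the sketch below. It assumes each bucket holds its entries as a list of two-element {key, value} String arrays; adjust to whatever entry type your implementation actually uses.

    import java.util.ArrayList;
    import java.util.List;

    // One bucket of a chained hash table: scan the chain first and overwrite
    // the value if the key is already present, otherwise append a new entry.
    class Bucket {
        private final List<String[]> entries = new ArrayList<>();   // each entry is {key, value}

        void put(String key, String value) {
            for (String[] entry : entries) {
                if (entry[0].equals(key)) {   // key already present in this chain
                    entry[1] = value;         // overwrite: 'ant' becomes 'aardvark'
                    return;
                }
            }
            entries.add(new String[] { key, value });   // genuinely new key: append to the chain
        }

        String get(String key) {
            for (String[] entry : entries) {
                if (entry[0].equals(key)) return entry[1];
            }
            return null;
        }
    }

With this in place, inserting ('a', 'ant') and then ('a', 'aardvark') leaves a single entry in the chain, and get('a') returns 'aardvark'.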

One more thought: what if each entry is a tuple (key, values), where values is an array of values? Then, as you suggest, you can append a new value to the end of values in the event of a key collision. But if you do that, what is the meaning of values? It contains the values that were inserted with that key, in order of the time they were inserted. If you treat it as a stack and always return the last value in the list, you're wasting space. You may as well overwrite a single value.

Is there ever a case when you can get away with putting a new value into the bucket and not checking for an existing key? Yes, you can do that if you have a perfect hash function, which guarantees that there are no collisions. Each key gets mapped to a different bucket. Now you don't need a chain of entries. You have a maximum of one value in each bucket, so you can implement the hash table as an array that contains, at each index, either undefined or the most recently inserted value at that index. This sounds great, except it isn't easy to come up with a perfect hash function, especially if you want your hash table to contain no more buckets than necessary. You would have to know in advance every possible key that might be used in order to devise a hash function that maps the n possible keys to n distinct buckets.
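
To make that last point concrete: if the only possible keys are the 26 lowercase letters, then c - 'a' is a perfect hash function for that key set, so every key gets its own slot and no chains are needed. A small sketch (the class name and the counting example are just for illustration), which also happens to fit the letter-frequency scenario above:

    // A perfect hash for a known key set: the 26 lowercase letters map to 26 distinct buckets.
    public class LetterCounts {
        private final int[] counts = new int[26];   // one bucket per possible key, no chains

        public void increment(char c) {
            counts[c - 'a']++;        // the "hash function" is just c - 'a'; it never collides
        }

        public int get(char c) {
            return counts[c - 'a'];
        }

        public static void main(String[] args) {
            LetterCounts table = new LetterCounts();
            for (char c : "banana".toCharArray()) table.increment(c);
            System.out.println(table.get('a'));   // 3
            System.out.println(table.get('n'));   // 2
        }
    }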


