How Are Hash Collisions Handled

How are hash collisions handled?

Fundamentally, there are two major ways of handling hash collisions: separate chaining, where items with colliding hash codes are stored in a separate data structure, and open addressing, where a colliding item is stored in another available bucket chosen by some probing algorithm.

Both strategies have numerous sub-strategies, described on Wikipedia. The exact strategy used by a particular implementation is, unsurprisingly, implementation-specific, so the authors can swap it for something more efficient at any time without breaking the assumptions of their users.

At this point, the only way to find out how Swift handles collisions would be disassembling the library (that is, unless you work for Apple and have access to the source code). Curious people did that to NSDictionary and determined that it uses linear probing, the simplest variation of the open addressing technique.
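
To illustrate what linear probing looks like in general, here is a minimal Python sketch of the technique. This is not Swift's or NSDictionary's actual implementation; the class name and fixed capacity are made up for the example, and resizing of a full table is omitted.

class LinearProbingTable:
    def __init__(self, capacity=8):
        self.slots = [None] * capacity              # each slot is either None or a (key, value) pair

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)           # collision: try the next bucket
        self.slots[i] = (key, value)

    def get(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)           # keep probing past colliding entries
        raise KeyError(key)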

How do HashTables deal with collisions?

Hash tables deal with collisions in one of two ways.

Option 1: By having each bucket contain a linked list of elements that are hashed to that bucket. This is why a bad hash function can make lookups in hash tables very slow.

Option 2: If the hash table becomes too full, it can increase the number of buckets it has and then redistribute all the elements in the table. The hash function returns an integer, and the hash table takes that result modulo the table size so that it always lands on a valid bucket. By increasing the size, it rehashes and reruns the modulo calculations, which, if you are lucky, may send the objects to different buckets.

Java uses both options 1 and 2 in its hash table implementations.
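
A rough sketch of option 2's arithmetic in Python (illustrative only, not Java's actual HashMap code; the function names are made up):

def bucket_index(key, num_buckets):
    return hash(key) % num_buckets                              # map the integer hash onto a valid bucket

def grow_and_rehash(old_buckets):
    new_buckets = [[] for _ in range(len(old_buckets) * 2)]     # double the number of buckets
    for bucket in old_buckets:
        for key, value in bucket:
            # rerunning the modulo with the new size may send entries to different buckets
            new_buckets[bucket_index(key, len(new_buckets))].append((key, value))
    return new_buckets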

Understanding of hash tables and collision detection

Your implementation of a hash table is correct. I should point out that what you've described in your question isn't collision detection but the operation of updating a key with a new value. A collision is when two different keys map to the same bucket, not when you insert a key and discover that there is a prior entry with the same key. You're already taking care of collisions by chaining entries in the same bucket.

In any case, you've gone about updating entries correctly. Let's say you've inserted the (key, value) pair ('a', 'ant') into the hash table. This means that 'a' maps to 'ant'. If you insert ('a', 'aardvark'), the intention is to overwrite the 'a' entry so that it now maps to 'aardvark'. Therefore, you iterate over the chain of entries and check for the key 'a' in the bucket. You find it, so you replace the value 'ant' with 'aardvark'. Now 'a' maps to 'aardvark'. Good.
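
In Python, that update logic might look like this minimal sketch of a chained hash table (the class name and bucket count are hypothetical):

class ChainedHashTable:
    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)            # same key: overwrite the old value
                return
        bucket.append((key, value))                 # new key: chain it in this bucket

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for existing_key, value in bucket:
            if existing_key == key:
                return value
        raise KeyError(key)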

Let's suppose you don't iterate over the chain of entries. What happens if you blindly append ('a', 'aardvark') to the end of the chain? The consequence is that when you look up the key 'a' and you go through the entries in the bucket, you come upon ('a', 'ant') first, so you return 'ant'. This is an incorrect result. You recently inserted ('a', 'aardvark'), so you should have returned 'aardvark'.

Ah, but what if you always start searching through the chain from the end? In other words, you're treating it as a stack. To insert an entry, you push it onto the end of the chain. To look up a key, you start searching from the end. The first entry with the given key is the one that was most recently inserted, so it's the correct one and you can return the value without searching further.

That implementation would be correct, but it would also make the chain longer than necessary. Consider what happens if you're using the hash table to count letter frequencies in a text file. Initially you insert ('a', 0) in the table. When you find the first occurrence of 'a' in the text, you read 0 from the table, add 1 to it, and insert ('a', 1) into the hash table. Now you have two entries in the chain with the key 'a', and only the one nearer to the end is valid. When you find the next occurrence of 'a', a third entry is added to the chain, and so on. Thousands of insertions with the same key result in thousands of entries in the chain.

Not only does this use up memory, it slows down other key insertions. For example, suppose your hash function assigns the same index to the keys 'a' and 'q'. This means that the 'q' entries are in the same bucket as the 'a' entries. If you have a whole bunch of 'a' entries at the end of the chain, you might have to go past many of them before you find the most recent entry with 'q'. For these reasons, it's better to do what you did.
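
Using the sketch above, the letter-frequency loop keeps each chain at one entry per distinct key, because put() overwrites in place:

table = ChainedHashTable()
for ch in "abracadabra":
    try:
        count = table.get(ch)
    except KeyError:
        count = 0
    table.put(ch, count + 1)        # overwrites the old count instead of growing the chain

print(table.get('a'))               # 5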

One more thought: what if each entry is a tuple (key, values), where values is an array of values? Then, as you suggest, you can append a new value to the end of values in the event of a key collision. But if you do that, what is the meaning of values? It contains the values that were inserted with that key, in order of the time they were inserted. If you treat it as a stack and always return the last value in the list, you're wasting space. You may as well overwrite a single value.

Is there ever a case when you can get away with putting a new value into the bucket and not checking for an existing key? Yes, you can do that if you have a perfect hash function, which guarantees that there are no collisions. Each key gets mapped to a different bucket. Now you don't need a chain of entries. You have a maximum of one value in each bucket, so you can implement the hash table as an array that contains, at each index, either undefined or the most recently inserted value at that index. This sounds great, except it isn't easy to come up with a perfect hash function, especially if you want your hash table to contain no more buckets than necessary. You would have to know in advance every possible key that might be used in order to devise a hash function that maps the n possible keys to n distinct buckets.
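
For example, if you knew in advance that the keys are exactly the 26 lowercase letters, a trivial perfect hash removes the need for chains entirely (a contrived sketch):

def perfect_hash(key):
    return ord(key) - ord('a')          # 'a'..'z' map to 26 distinct buckets, no collisions possible

table = [None] * 26                     # one slot per possible key, no chain needed
table[perfect_hash('q')] = 'queen'
table[perfect_hash('q')] = 'quack'      # inserting the same key again just overwrites the slot
print(table[perfect_hash('q')])         # 'quack'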

handling hash collisions in python dictionaries

Generally, you would use the most distinctive element of your user record, and this usually means that the system has a username or a unique ID per record (user), which is guaranteed to be unique. The username or ID would be the unique key for the record. Since this is enforced by the system itself, for example by means of an auto-increment key in a database table, you can be sure that there is no collision.

That unique key, therefore, should be the key in your map, allowing you to find a user record.

However, if for some reason you don't have access to such a guaranteed-to-be-unique key, you can certainly create a hash from the record (as you described) and use any of a number of hash table algorithms to store elements with possibly colliding keys. In that case, you don't avoid the collision; you simply deal with it.

A quick and commonly used algorithm for that goes like this: use a hash over the record to create a key, as you already do. This key may not be unique. Now store a list of records at the location indicated by the key; we call those lists 'buckets'. To store a new element, hash it and then append it to the list stored at that location (add it to the bucket). To find an element, hash it, locate the bucket at that location, then search sequentially through the list until you find the entry you want.

Here's an example:

mymap[123] = [ {'name':'John','age':27}, {'name':'Bob','age':19} ]
mymap[678] = [ {'name':'Frank','age':29} ]

In the example, you have your hash table (implemented via a dict). You have hash key value 678, for which one entry is stored in the bucket. Then you have hash key value 123, but there is a collision: Both the 'John' and 'Bob' entry have this hash value. No matter, you find the bucket stored at mymap[123] and iterate over it to find the value.
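
A sketch of the store/find steps for that layout (record_hash here is a hypothetical stand-in for whatever hash you compute over the record):

mymap = {}

def record_hash(record):
    return hash(record['name']) % 1000                          # stand-in for your real record hash

def store(record):
    mymap.setdefault(record_hash(record), []).append(record)    # add the record to its bucket

def find(name):
    for record in mymap.get(record_hash({'name': name}), []):
        if record['name'] == name:                               # linear search within the bucket
            return record
    return None

store({'name': 'John', 'age': 27})
store({'name': 'Bob', 'age': 19})
print(find('Bob'))                                               # {'name': 'Bob', 'age': 19}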

This is a flexible and very common implementation of hash maps, which doesn't require re-allocation or other complications. It's described in many places, for example here: https://www.cs.auckland.ac.nz/~jmor159/PLDS210/hash_tables.html (in chapter 8.3.1).

Performance generally only becomes an issue when you have lots of collisions (when the lists in each bucket get really long), which a good hash function will avoid.

However: A true unique ID for your record, enforced by the database for example, is probably still the preferred approach.

in C++, how to handle hash collision in hash map?

There are dozens of different ways to handle collisions in hash maps depending on what system you're using. Here are a few:

  1. If you use closed addressing (separate chaining), then each bucket typically holds a linked list of the values whose keys hash to that bucket, and you traverse the list looking for the element in question.
  2. If you use linear probing, then following a hash collision you would start looking at adjacent buckets until you found the element you were looking for or an empty spot.
  3. If you use quadratic probing, then following a hash collision you would look at the elements 1, 3, 6, 10, 15, ..., n(n+1)/2, ... positions away from the collision point in search of an empty spot or the element in question (a sketch of this probe sequence appears below).
  4. If you use cuckoo hashing, you would maintain two hash tables, then displace the element that you collided with into the other table, repeating this process until the collisions resolved or you had to rehash.
  5. If you use dynamic perfect hashing, you would build up a perfect hash table from all elements sharing that hash code.

The particular implementation you pick is up to you. Go with whatever is simplest. I personally find chained hashing (closed addressing) the easiest, if that suggestion helps.
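
As a concrete illustration of the probe offsets from point 3, here is a small Python sketch (illustrative only; the table size and key are arbitrary):

def probe_sequence(key, table_size, max_probes=8):
    start = hash(key) % table_size
    indices = [start]
    for n in range(1, max_probes):
        offset = n * (n + 1) // 2                    # triangular numbers: 1, 3, 6, 10, 15, ...
        indices.append((start + offset) % table_size)
    return indices

print(probe_sequence("some key", 16))                # the buckets checked, in order, after a collision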

As for what makes a good hash function, that's really dependent on what type of data you're storing. Hash functions for strings are often very different from those for integers, for example. Depending on the security guarantees you want, you may pick a cryptographically secure hash like SHA-256, or just a simple heuristic like a linear combination of the individual bits. Designing a good hash function is quite tricky, and I'd advise doing a bit of digging for advice on the particular structures you're going to be hashing before coming to a conclusion.

Hope this helps!

What happens when hash collision happens in Dictionary key?

Hash collisions are correctly handled by Dictionary<> - in that so long as an object implements GetHashCode() and Equals() correctly, the appropriate instance will be returned from the dictionary.

First, you shouldn't make any assumptions about how Dictionary<> works internally - that's an implementation detail that is likely to change over time. Having said that....

What you should be concerned with is whether the types you are using for keys implement GetHashCode() and Equals() correctly. The basic rules are that GetHashCode() must return the same value for the lifetime of the object, and that Equals() must return true when two instances represent the same object. Unless you override it, Equals() uses reference equality - which means it only returns true if two objects are actually the same instance. You may override how Equals() works, but then you must ensure that two objects that are 'equal' also produce the same hash code.
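
The same contract exists in Python's __eq__ and __hash__, which makes for a compact illustration (the UserKey class here is hypothetical, not anything from .NET):

class UserKey:
    def __init__(self, user_id):
        self.user_id = user_id

    def __eq__(self, other):
        # two keys representing the same user must compare equal
        return isinstance(other, UserKey) and self.user_id == other.user_id

    def __hash__(self):
        # ...and equal keys must also produce the same hash code
        return hash(self.user_id)

lookup = {UserKey(42): 'Alice'}
print(lookup[UserKey(42)])          # 'Alice': a distinct but equal instance still finds the entry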

From a performance standpoint, you may also want to provide an implementation of GetHashCode() that generates a good spread of values to reduce the frequency of hash code collisions. The primary downside of hash code collisions is that they reduce the dictionary to a list in terms of performance. Whenever two different object instances yield the same hash code, they are stored in the same internal bucket of the dictionary. The result is that a linear scan must be performed, calling Equals() on each instance until a match is found.

What Exactly is Hash Collision


What exactly is a hash collision - is it a feature, or a common phenomenon that happens by mistake but is good to avoid?

It's a feature. It arises out of the nature of a hashCode: a mapping from a large value space to a much smaller value space. There are going to be collisions, by design and intent.

What exactly causes Hash Collision - the bad definition of custom class' hashCode() method,

A bad design can make it worse, but it is endemic in the notion.

OR to leave the equals() method un-overridden while imperfectly overriding the hashCode() method alone,

No.

OR is it not up to the developers, and many popular Java libraries also have classes which can cause hash collisions?

This doesn't really make sense. Hashes are bound to collide sooner or later, and poor algorithms can make it sooner. That's about it.

Does anything go wrong or unexpected when Hash Collision happens?

Not if the hash table is competently written. A hash collision only means that the hashCode is not unique, which means you have to call equals(); the more duplicates there are, the worse the performance.

I mean is there any reason why we should avoid Hash Collision?

You have to trade off ease of computation against spread of values. There is no single black and white answer.

Does Java generate, or at least try to generate, a unique hashCode per class during object instantiation?

No. 'Unique hash code' is a contradiction in terms.

If no, is it right to rely on Java alone to ensure that my program would not run into Hash Collision for JRE classes? If not right, then how to avoid hash collision for hashmaps with final classes like String as key?

The question is meaningless. If you're using String you don't have any choice about the hashing algorithm, and you are also using a class whose hashCode has been slaved over by experts for twenty or more years.

How to handle hash collisions?

Typically hash collisions are handled in two ways:

  1. Use a larger hash, so that collisions are practically impossible.

  2. Consider hash codes to be non-unique, and use an equality comparer for the actual data to determine uniqueness.

A 128-bit GUID uses the first method. The HashSet<T> class in .NET is an example of the second method.
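
A brief Python sketch of the two approaches (illustrative only; the data and sizes are arbitrary):

import hashlib

# Method 1: a hash wide enough (256 bits here) that accidental collisions are
# practically impossible, so the digest alone can identify the data.
fingerprint = hashlib.sha256(b'some record').hexdigest()

# Method 2: treat small hash codes as non-unique; narrow the search with the hash,
# then confirm candidates with a full equality comparison, as hash-based containers do.
def same_entry(candidate, lookup_key):
    return hash(candidate) == hash(lookup_key) and candidate == lookup_key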

Hash collisions in Java's ConcurrentHashMap: which precautions must be taken?

Java's ConcurrentHashMap uses a hash table as the underlying data structure to store entries. The way such a table deals with collisions is described above under "How do HashTables deal with collisions?"

Generally, you don't need to be concerned about hash collisions when using the HashMap and ConcurrentHashMap types of the standard library. They are guaranteed not to cause problems with keys that have the same hash values.

In handling hash collisions, why use a linked list over a BST?

The justification of a hash table is that there should be very few items hashing to the same slot. For such small numbers, a linked list should actually be faster than a BST (better constant factors) and is in any case simpler and more reliable to code. The worst case for an unbalanced BST is the same as for a linked list anyway, unless you go really over the top and use a balanced tree.


