How Does C++ STL unordered_map Resolve Collisions

How does C++ STL unordered_map resolve collisions?

The standard defines a little more about this than most people seem to realize.

Specifically, the standard requires (§23.2.5/9):

The elements of an unordered associative container are organized into buckets. Keys with the same hash code appear in the same bucket.

The interface includes a bucket_count() that runs in constant time (Table 103). It also includes a bucket_size() that has to run in time linear on the size of the bucket.

That's basically describing an implementation that uses collision chaining. When you do use collision chaining, meeting all the requirements is somewhere between easy and trivial. bucket_count() is the number of elements in your array. bucket_size() is the number of elements in the collision chain. Getting them in constant and linear time respectively is simple and straightforward.
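As a rough sketch (hypothetical, not any particular library's internals), the shape those requirements describe is an array of buckets, each heading a chain:

#include <cstddef>
#include <forward_list>
#include <functional>
#include <iterator>
#include <vector>

// A sketch of the chaining layout; real implementations differ in many details
// (growth policy, node layout, duplicate handling, ...).
template <class Key>
struct chained_table {
    std::vector<std::forward_list<Key>> buckets;          // one chain per bucket

    chained_table() : buckets(8) {}                        // start with 8 empty chains

    std::size_t bucket_count() const { return buckets.size(); }   // O(1)
    std::size_t bucket(const Key& k) const {                       // which chain a key lands in
        return std::hash<Key>{}(k) % buckets.size();
    }
    std::size_t bucket_size(std::size_t n) const {                 // O(length of that chain)
        return static_cast<std::size_t>(
            std::distance(buckets[n].begin(), buckets[n].end()));
    }
    void insert(const Key& k) { buckets[bucket(k)].push_front(k); }  // no rehash/dedup here
};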

By contrast, if you use something like linear probing or double hashing, those requirements become all but impossible to meet. Specifically, all the items that hashed to a specific value need to land in the same bucket, and you need to be able to count the items in a given bucket in time linear on that bucket's size.

But, if you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though--it's linear on the number of items that hashed to the same or a colliding value.

With enough extra work and a fair amount of stretching the meaning of some of the requirements almost to the breaking point, it might be barely possible to create a hash table using something other than collision chaining, and still at least sort of meet the requirements--but I'm not really certain it's possible, and it would certainly involve quite a lot of extra work.

Summary: all practical implementations of std::unordered_set (or unordered_map) undoubtedly use collision chaining. While it might be (just barely) possible to meet the requirements using linear probing or double hashing, such an implementation seems to lose a great deal and gain nearly nothing in return.

How does std::unordered_map handle collisions?

It doesn't guarantee O(1); it's O(1) on average. In the worst case it can be O(n), when there are a lot of collisions. Please see the link below for more info:

https://stackoverflow.com/a/2771398/5874704

Update

Since the question has been edited, and now asks specifically about collisions for std::unordered_map, please have a look at the following answer:

https://stackoverflow.com/a/21519560/5874704

I think we can conclude that all practical implementations of std::unordered_set (or unordered_map) almost certainly use collision chaining. While it might be (just barely) possible to meet the requirements using linear probing or double hashing, such an implementation seems to lose a great deal and gain nearly nothing in return.

How to identify whether or not std::unordered_map has experienced hash collisions?

You can use the bucket interface and its bucket_size method.

std::unordered_map<int, int> map;
// ... populate map ...
bool has_collision = false;

for (size_t bucket = 0; bucket < map.bucket_count(); bucket++) {
    if (map.bucket_size(bucket) > 1) {   // more than one key chained into this bucket
        has_collision = true;
        break;
    }
}

C++ Hash Table - How is collision for unordered_map with custom data type as keys resolved?

That's a terrible hash function. But it is legal, so your implementation will work.

The rule (and really the only rule) for Hash and Equals is:

  • if a == b, then std::hash<value_type>(a) == std::hash<value_type>(b).

(It's also important that both Hash and Equals always produce the same value for the same arguments. I used to think that went without saying, but I've seen several SO questions where unordered_map produced unexpected results precisely because one or both of these functions depended on some external value.)

That would be satisfied by a hash function which always returned 42, in which case the map would get pretty slow as it filled up. But other than the speed issue, the code would work.

std::unordered_map uses a chained hash, not an open-addressed hash. All entries with the same hash value are placed in the same bucket, which is a linked list. So low-quality hashes do not distribute entries very well among the buckets.

It's clear that your hash gives {x, y} and {y, x} the same hash value. More seriously, any collection of points in a small rectangle will share the same small number of different hash values, because the high-order bits of the hash values will all be the same.
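The question's own hash isn't reproduced above, but a hypothetical Point key illustrates the pattern: a symmetric hash like x + y is perfectly legal under the rule above, yet it clusters keys badly; mixing the members' hashes asymmetrically spreads them out (the mixing constant below is the one boost::hash_combine uses).

#include <cstddef>
#include <functional>
#include <unordered_map>

struct Point { int x, y; };
bool operator==(const Point& a, const Point& b) { return a.x == b.x && a.y == b.y; }

// Legal but weak: {x, y} and {y, x} always collide, and nearby points cluster.
struct WeakPointHash {
    std::size_t operator()(const Point& p) const {
        return static_cast<std::size_t>(p.x) + static_cast<std::size_t>(p.y);
    }
};

// Better: combine the member hashes asymmetrically (hash_combine-style mixing).
struct BetterPointHash {
    std::size_t operator()(const Point& p) const {
        std::size_t h = std::hash<int>{}(p.x);
        h ^= std::hash<int>{}(p.y) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

std::unordered_map<Point, int, WeakPointHash>   clustered;   // works, but long chains
std::unordered_map<Point, int, BetterPointHash> spread;      // same interface, better spread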

How to get chained values in case of Collision in unordered_map?

This question shows a misunderstanding of how chaining is used, and what people normally mean by a collision.

In your example code:

unordered_map<int,int> diff;
//collision : inserting two entries with same key 1
diff.insert(make_pair(1, 7));
diff.insert(make_pair(1, 26));

The two insertions have the same key, so they map to the same bucket in the hash table. The first insert finds the bucket empty, so {1,7} is linked from that bucket. The second time, the insert function finds the already-inserted key 1 linked from the bucket, and gives up on the insertion.

You shouldn't call this a "collision" though: it was the same key for both insert calls, which gets hashed by a hash function and mapped to a "bucket" in the hash table, but the same key will always map to the same bucket - that's an inherent property of hash functions and tables.

A collision is when two distinct keys map to the same bucket.

Hash tables support collisions: they let you store many {key,value} pairs with distinct keys that collide to the same bucket.

unordered_map does not support duplicate keys: i.e. multiple {key, value} pairs with the same key.

The output is just 7. How can I get both 7 and 26, assuming chaining is used in unordered_map for collision resolution? What is the behavior of unordered_map in such scenarios?

You can't get the two key-value pairs {1,7} and {1,26} into the same unordered_map at the same time. To actually allow the same key to be associated with multiple values - so insert doesn't search the bucket's chained elements to see if the key was already inserted and abort when it's found - you must use an unordered_multimap instead.

When using unordered_multimap, there's some logical justification for calling it a collision when multiple same-key values are inserted, as both pairs can't be first in the list of chained nodes - one has to push the other out of the way or go to the back of the line, so to speak - just as for a differing-key collision at that bucket. Still, to avoid confusion, I recommend you reserve the "collision" terminology for differing-key collisions.
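For illustration, here's a minimal sketch showing how both values can coexist in an unordered_multimap and be retrieved with equal_range:

#include <iostream>
#include <unordered_map>   // also provides std::unordered_multimap
#include <utility>

int main() {
    std::unordered_multimap<int, int> diff;
    diff.insert(std::make_pair(1, 7));
    diff.insert(std::make_pair(1, 26));      // kept: duplicate keys are allowed

    auto range = diff.equal_range(1);        // every value stored under key 1
    for (auto it = range.first; it != range.second; ++it)
        std::cout << it->second << '\n';     // prints 7 and 26 (in unspecified order)
}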

How to count collisions in unordered_set c++

std::unordered_set (like std::unordered_map) will increase its bucket_count as needed to keep load_factor from exceeding max_load_factor.

That means that bucket_count depends only on the number of elements in the map, and is unaffected by the number of collisions.

To check for collisions, count all elements that have a bucket size > 1.

// a and b are the two unordered containers being compared (their setup isn't shown here);
// for each one, count elements that share a bucket with another element.
size_t collisions = 0, empty = 0;
for (auto bucket = a.bucket_count(); bucket--;) {
    if (a.bucket_size(bucket) == 0)
        empty++;
    else
        collisions += a.bucket_size(bucket) - 1;
}
std::cout << "a = " << a.max_load_factor() << ' ' << a.load_factor() << ' '
          << ' ' << a.bucket_count() << ' ' << collisions << ' ' << empty << '\n';

empty = 0, collisions = 0;
for (auto bucket = b.bucket_count(); bucket--;) {
    if (b.bucket_size(bucket) == 0)
        empty++;
    else
        collisions += b.bucket_size(bucket) - 1;
}
std::cout << "b = " << b.max_load_factor() << ' ' << b.load_factor() << ' '
          << ' ' << b.bucket_count() << ' ' << collisions << ' ' << empty << '\n';

Prints

a = 1 0.610352  16384 9999 16383
b = 1 0.610352 16384 4773 11157

That is, with a bad hashing function there are 9999 collisions and 16383 out of 16384 empty buckets.


Unrelated: if you care about hash table performance, have a look at Google's dense_hash_map, which uses open addressing for much better performance in many workloads.

How to retrieve collisions of unordered map?

You're still confusing a key's value with its hash. But to answer the question as asked: you can use unordered_map's bucket() member function together with its bucket iterators:

// dumbest_hash is the (intentionally bad) hash functor from the question.
std::unordered_map<int, int, dumbest_hash> m;
m[0] = 42;
m[1] = 43;

size_t bucket = m.bucket(1);   // index of the bucket that key 1 landed in

for (auto it = m.begin(bucket), e = m.end(bucket); it != e; ++it) {
    std::cout << "bucket " << bucket << ": " << it->first << " -> " << it->second << '\n';
}


In simple and mostly correct terms, unordered containers imitate their ordered counterparts in terms of interface. That means that if a map will not allow you to have duplicate keys, then neither will unordered_map.

The unordered containers do employ a hash function to speed up lookup, but two keys that hash to the same value are not necessarily equal. To keep the behaviour similar to the ordered containers, unordered_set and unordered_map only consider elements equal when they are actually equal (using operator== or the provided equality predicate), not when their hash values collide.

To put things in perspective, let's assume that "eggs" and "chicken" have the same hash value and that there's no equality checking. Then the following code would be "correct":

unordered_map<string, int> m;
m["eggs"] = 42;
m.insert(make_pair("chicken", 0)); // not inserted, key already exists
assert(m["chicken"] == 42);

But if you want to allow duplicate keys in the same map, simply use unordered_multimap.

How std::unordered_map is implemented

The Standard effectively mandates that implementations of std::unordered_set and std::unordered_map - and their "multi" brethren - use open hashing aka separate chaining, which means an array of buckets, each of which holds the head of a linked list†. That requirement is subtle: it is a consequence of:

  • the default max_load_factor() being 1.0 (which means the table will resize whenever size() would otherwise exceed 1.0 times the bucket_count()), and
  • the guarantee that the table will not be rehashed unless grown beyond that load factor.

That would be impractical without chaining, as the collisions with the other main category of hash table implementation - closed hashing aka open addressing - become overwhelming as the load_factor() approaches 1.
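A small sketch to watch that policy in action (the exact bucket counts and growth pattern are implementation-defined, but max_load_factor() reports 1.0 and the bucket array grows as the load factor is about to exceed it):

#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    std::cout << "max_load_factor: " << m.max_load_factor() << '\n';   // 1.0 by default

    std::size_t buckets = m.bucket_count();
    for (int i = 0; i < 100; ++i) {
        m[i] = i;
        if (m.bucket_count() != buckets) {                // a rehash just happened
            std::cout << "rehash at size " << m.size() << ": "
                      << buckets << " -> " << m.bucket_count() << " buckets\n";
            buckets = m.bucket_count();
        }
    }
}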

References:

23.2.5/15: The insert and emplace members shall not affect the validity of iterators if (N+n) < z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container’s bucket count, and z is the container’s maximum load factor.

amongst the Effects of the constructor at 23.5.4.2/1: max_load_factor() returns 1.0.

† To allow optimal iteration without passing over any empty buckets, GCC's implementation fills the buckets with iterators into a single singly-linked list holding all the values: the iterators point to the element immediately before that bucket's elements, so the next pointer there can be rewired if erasing the bucket's last value.

Regarding the text you quote:

No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements.

There is no "oversight"... what was done was very deliberate and done with full awareness. It's true that other compromises could have been struck, but the open hashing / chaining approach is a reasonable compromise for general use, that copes reasonably elegantly with collisions from mediocre hash functions, isn't too wasteful with small or large key/value types, and handles arbitrarily-many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do.

As evidence of that awareness, from Matthew Austern's hash table proposal:

I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:

• It's necessary to distinguish between a vacant position and an occupied one.

• It's necessary either to restrict the hash table to types with a default constructor, and to construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.

• Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best known solutions are complicated.

• Collision management is especially complicated when erasing elements is allowed. (See Knuth for a discussion.) A container class for the standard library ought to allow erasure.

• Collision management schemes for open addressing tend to assume a fixed size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.

Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.

Specifically for insert-only tables with data small enough to store directly in the buckets, a convenient sentinel value for unused buckets, and a good hash function, a closed hashing approach may be roughly an order of magnitude faster and use dramatically less memory, but that's not general purpose.

A full comparison and elaboration of hash table design options and their implications is off topic for S.O. as it's way too broad to address properly here.

C++ unordered_map collision handling, resize and rehash

With every put request, hash the object and map it to a space in this memory

Unfortunately, this isn't exactly true. You are describing an open addressing (aka closed hashing) data structure, which is not how unordered_map is specified.

Every unordered_map implementation stores a linked list to external nodes in the array of buckets. Meaning that inserting an item will always allocate at least once (the new node) if not twice (resizing the array of buckets, then the new node).
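To make those allocations visible, here is a sketch with a hypothetical counting allocator (the allocator and exact counts are mine, and counts vary by implementation, but every new key costs at least one node allocation):

#include <cstddef>
#include <iostream>
#include <new>
#include <unordered_map>
#include <utility>

static std::size_t g_allocations = 0;   // incremented on every allocation the map makes

template <class T>
struct counting_allocator {
    using value_type = T;
    counting_allocator() = default;
    template <class U> counting_allocator(const counting_allocator<U>&) {}

    T* allocate(std::size_t n) {
        ++g_allocations;
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};
template <class T, class U>
bool operator==(const counting_allocator<T>&, const counting_allocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const counting_allocator<T>&, const counting_allocator<U>&) { return false; }

int main() {
    std::unordered_map<int, int, std::hash<int>, std::equal_to<int>,
                       counting_allocator<std::pair<const int, int>>> m;
    for (int i = 0; i < 8; ++i) {
        std::size_t before = g_allocations;
        m[i] = i;
        std::cout << "insert " << i << ": " << (g_allocations - before)
                  << " allocation(s)\n";   // one node, plus occasionally a new bucket array
    }
}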

No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements. Because inserting might cause the bucket array to grow (reallocate), it is not generally possible to have an iterator pointing directly into the bucket array and meet the stability guarantees.
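A short sketch of the guarantee being described: as long as insertions keep the size below max_load_factor() * bucket_count(), no rehash occurs and previously obtained iterators remain usable.

#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    m.reserve(2048);                 // bucket_count() now comfortably exceeds what we insert

    m[42] = 1;
    auto it = m.find(42);            // iterator into the container

    for (int i = 0; i < 1000; ++i)   // stays below max_load_factor() * bucket_count(),
        m[i] = i;                    // so no rehash happens...

    std::cout << it->first << '\n';  // ...and 'it' is still valid; prints 42
}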

unordered_map is a better data structure if you are storing expensive-to-copy items as your key or value. Which makes sense, given that its general design was lifted from Boost's pre-move-semantics design.

Chandler Carruth (Google) mentions this problem in his CppCon '14 talk "Efficiency with Algorithms, Performance with Data Structures".


