Difference between Set and Unordered_Set in C++

what is the difference between set and unordered_set in C++?

I think you've generally answered your own question, however, this:

Not as compact as a tree (for practical purposes the load factor is never 1).

is not necessarily true. Each node of a tree (we'll assume it's a red-black tree) for a type T uses at least 2 * pointer_size + sizeof(T) + sizeof(bool) of space. This may be 3 * pointer_size, depending on whether the tree stores a parent pointer in each node.

Compare this to a hash map: there will be some wasted array space in each hash map due to the fact that the load factor is < 1, as you've said. However, assuming the hash map uses singly linked lists for chaining (and really, there's no good reason not to), each element inserted takes only sizeof(T) + pointer_size.

Note that this analysis ignores any overhead which may come from extra space used by alignment.

For any element type T with a small size (so, any basic type), the size of the pointers and other overhead dominates. At a load factor above 0.5 (for example), std::unordered_set may indeed use less memory than the equivalent std::set.
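
As a rough illustration of that bookkeeping, here is a minimal sketch with hypothetical node layouts (RbTreeNode and HashChainNode are made-up names; real standard-library node types differ in detail), just to show where the per-element overhead comes from:

#include <cstdio>

// Hypothetical node layouts, only to illustrate the per-element bookkeeping
// described above; real standard-library nodes differ in detail.
template <typename T>
struct RbTreeNode {            // typical std::set node
    RbTreeNode* left;
    RbTreeNode* right;
    RbTreeNode* parent;        // many implementations keep a parent pointer
    bool        is_red;        // colour flag (often padded to a full word)
    T           value;
};

template <typename T>
struct HashChainNode {         // typical std::unordered_set node (singly linked chain)
    HashChainNode* next;
    T              value;
};

int main() {
    std::printf("set node:           %zu bytes\n", sizeof(RbTreeNode<int>));
    std::printf("unordered_set node: %zu bytes\n", sizeof(HashChainNode<int>));
    // The hash table's bucket array adds roughly one pointer per bucket on top
    // of this, i.e. about sizeof(void*) / load_factor extra bytes per element.
}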

The other big missing point is the fact that iterating through a std::set is guaranteed to produce an ordering from smallest to largest, based on the given comparison function, while iterating through an std::unordered_set will return the values in a "random" order.
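
A quick sketch of that difference (the exact unordered output is implementation-defined; the values in the comments are just one possibility):

#include <iostream>
#include <set>
#include <unordered_set>

int main() {
    std::set<int> ordered{5, 1, 4, 2, 3};
    std::unordered_set<int> unordered{5, 1, 4, 2, 3};

    for (int x : ordered)          // always prints 1 2 3 4 5
        std::cout << x << ' ';
    std::cout << '\n';

    for (int x : unordered)        // some implementation-defined order, e.g. 3 2 4 1 5
        std::cout << x << ' ';
    std::cout << '\n';
}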

Why would anyone use set instead of unordered_set?

When you want to iterate over the items of the set and the order matters.

How to get (unordered) set difference or symmetric difference in C++?

std::set_difference is for use with arbitrary sorted inputs (pre-sorted std::vectors, std::lists, std::deques, plain arrays, etc.); it just happens to work with std::set (which is sorted) too.

If you're working with std::unordered_set (or std::set, and you're okay with operating in place), you'd just use the erase method to remove every element of one set from the other to get the difference, e.g.:

for (const auto& elem : set_to_remove) {
myset.erase(elem);
}

You can also do it into a new set with std::copy_if; the recipe there is trivially adaptable to the case of symmetric difference (it's just two calls to std::copy_if, where each one runs on one input set and is conditioned on the element not existing in the other input set), as sketched below.
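
Here is a minimal sketch of that recipe, assuming int elements and using a and b as placeholder names for the two input sets:

#include <algorithm>
#include <iterator>
#include <unordered_set>

// a \ b: keep the elements of a that are not in b.
std::unordered_set<int> difference(const std::unordered_set<int>& a,
                                   const std::unordered_set<int>& b) {
    std::unordered_set<int> result;
    std::copy_if(a.begin(), a.end(), std::inserter(result, result.end()),
                 [&](int x) { return b.count(x) == 0; });
    return result;
}

// Symmetric difference: the same call run once in each direction.
std::unordered_set<int> symmetric_difference(const std::unordered_set<int>& a,
                                             const std::unordered_set<int>& b) {
    std::unordered_set<int> result;
    std::copy_if(a.begin(), a.end(), std::inserter(result, result.end()),
                 [&](int x) { return b.count(x) == 0; });
    std::copy_if(b.begin(), b.end(), std::inserter(result, result.end()),
                 [&](int x) { return a.count(x) == 0; });
    return result;
}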

Should I use std::set or std::unordered_set for a set of pointers?

As Ami Tavory commented, if you don't need order, then it's usually best to go for unordered containers. The reason is that if ordering somehow improved performance, unordered containers would still be free to exploit it internally, and hence get the same or better complexity anyway.

A downside of unordered collections is that they usually require a hash function for the key type. If it's too hard or expensive to make one, then containers which don't use hashes might be better.
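
For example, here is a minimal sketch of supplying your own hash for a key type that has no std::hash specialisation (Point and PointHash are hypothetical names, and the hash combination shown is only good enough for illustration):

#include <cstddef>
#include <functional>
#include <unordered_set>

// Hypothetical key type with no std::hash specialisation of its own.
struct Point {
    int x, y;
    bool operator==(const Point& other) const { return x == other.x && y == other.y; }
};

// A simple way to combine the member hashes; fine for a sketch, not a
// recommendation for a production-quality hash function.
struct PointHash {
    std::size_t operator()(const Point& p) const {
        return std::hash<int>{}(p.x) * 31u + std::hash<int>{}(p.y);
    }
};

int main() {
    std::unordered_set<Point, PointHash> points;
    points.insert({1, 2});
    // With std::set<Point> you would instead have to provide operator< or a comparator.
}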

In C++'s standard library, the average insertion complexity for std::set is O(log(N)), whereas for std::unordered_set it's O(1). Aside from that, there are probably fewer cache misses on average when using std::unordered_set.

At the end of the day though, this is just theory. You should try something that sounds good enough and profile it to see if it really is.

Different result between C++ set and unordered_set

It happens because some of your insertions modify the same container that you are currently iterating over with a for loop. Not surprisingly, insertions into set and into unordered_set might end up at different positions in the linear sequence of container elements. In one container the new element ends up in front of the current position and is later visited by the loop; in the other container the new element ends up behind the current position and is never seen by the loop.

It is generally not a good idea to modify a container that you are currently iterating over with a range-based for loop. It might not produce any undefined behavior in your case (if you are using associative containers with stable iterators), but still... in my opinion, range-based for should be reserved for iterating over containers that don't change during the iteration.

In your case insertion of a new element into an std::unordered_set may trigger rehashing and invalidate all iterators of that unordered_set. It means that if that unordered_set is currently being iterated over by a range-based for, you end up with undefined behavior.
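
One common way to avoid the problem is to collect the new elements in a separate container and insert them only after the loop has finished; a minimal sketch (the condition and the new values are made up for illustration):

#include <unordered_set>
#include <vector>

int main() {
    std::unordered_set<int> s{1, 2, 3};

    // Do NOT call s.insert() inside this loop: a rehash would invalidate the
    // iterators driving the range-based for. Collect first, insert afterwards.
    std::vector<int> pending;
    for (int x : s) {
        if (x % 2 == 1)                // hypothetical condition
            pending.push_back(x * 10); // hypothetical new element
    }
    s.insert(pending.begin(), pending.end());
}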

Comparing unordered_map vs unordered_set

They are nearly identical. unordered_set only contains keys, and no values. There is no mapping from a key to a value, so no need for an operator[]. unordered_map maps a key to a value.

You can use the various find methods within unordered_set to locate things.
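
A small example of the distinction (names and values are arbitrary):

#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> names{"alice", "bob"};
    std::unordered_map<std::string, int> ages{{"alice", 30}, {"bob", 25}};

    // The set only answers "is this key present?".
    if (names.find("alice") != names.end())
        std::cout << "alice is present\n";

    // The map additionally stores a value per key, hence operator[] / at().
    std::cout << "alice is " << ages.at("alice") << '\n';
}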

C++: Why is unordered_set::find faster than find?

This is due to how they are implemented. std::find operates as you might expect: it starts at the beginning and compares each element until it finds a match or reaches the end. This is fairly universal, but it can't take advantage of the specific data structure being searched. unordered_set, however, is a hash set, so if there are no hash collisions, every element takes the same (constant) amount of time to find.

The reason it says there is a "worst case of linear in container size" is because if the hash table were to have a length of 1, every entry would be placed at the same position in the table (pseudocode: table[hash(element) % table_length].push(element)). If this were to happen, then depending on the implementation it could end up looking more like a list in memory and it would have to check every item sequentially. In practice though, this would likely never happen.
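
To make the contrast concrete, here is a small sketch of the two lookups (the containers and the searched value are arbitrary):

#include <algorithm>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 4, 5};
    std::unordered_set<int> s{1, 2, 3, 4, 5};

    // std::find: walks the range element by element, O(N) comparisons.
    auto it1 = std::find(v.begin(), v.end(), 4);

    // unordered_set::find: hashes the key and jumps to its bucket, O(1) on average.
    auto it2 = s.find(4);

    (void)it1;
    (void)it2;
}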

Is there any difference in performance of unordered_set (C++) in case of strings vs in case of integers?

Is there any difference in performance of unordered_set (C++) in case of strings vs in case of integers?

There can be. The language specification doesn't have guarantees one way or the other.

You can verify whether this is the case for your program on your target system by measuring the performance.


If you're considering whether to use the string itself as the key, or a separately hashed string (i.e. an integer), then technically separate hashing is more expensive since the integer would be hashed again. That said, hashing an integer is trivial (it may well be the identity function), so this might have no noticeable effect.

Separate hashing + storing the integer does have a potential advantage: you can pre-hash the string keys once and reuse the hashed integer key, while a container with string keys requires the key to be re-hashed on every lookup. Whether this is useful in your case depends on what you're going to do with the container.
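
A minimal sketch of the pre-hashing idea, assuming you only need the hashed keys themselves (note the caveat in the comments about collisions):

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

int main() {
    // Hash each string once up front and keep only the integer key.
    // Caveat: two different strings can hash to the same value, so this sketch
    // is only valid where such collisions are acceptable or the original
    // strings are kept around to resolve them.
    std::hash<std::string> hasher;
    std::unordered_set<std::size_t> seen;

    std::size_t key = hasher("some long string key");
    seen.insert(key);

    // Later lookups reuse the cached integer instead of re-hashing the string.
    bool present = seen.count(key) > 0;
    (void)present;
}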


