Is There Any Advantage of Using Map Over Unordered_Map in Case of Trivial Keys

Is there any advantage of using map over unordered_map in case of trivial keys?

Don't forget that map keeps its elements ordered. If you can't give that up, obviously you can't use unordered_map.
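
For instance, a minimal sketch of the difference in iteration order (the unordered_map order shown is implementation-dependent):

#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

int main()
{
    std::map<int, std::string> ordered{{3, "three"}, {1, "one"}, {2, "two"}};
    std::unordered_map<int, std::string> unordered{{3, "three"}, {1, "one"}, {2, "two"}};

    // map always iterates in ascending key order: 1 2 3
    for (const auto& kv : ordered)
        std::cout << kv.first << ' ';
    std::cout << '\n';

    // unordered_map iterates in an unspecified, implementation-dependent order
    for (const auto& kv : unordered)
        std::cout << kv.first << ' ';
    std::cout << '\n';
}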

Something else to keep in mind is that unordered_map generally uses more memory. A map needs only a few house-keeping pointers per node plus memory for each object. By contrast, an unordered_map maintains a large bucket array (which can get quite big in some implementations) in addition to the per-object memory. If you need to be memory-aware, map should prove better, because it lacks the large array.

So, if you need pure lookup-retrieval, I'd say unordered_map is the way to go. But there are always trade-offs, and if you can't afford them, then you can't use it.

Just from personal experience, I found an enormous improvement in performance (measured, of course) when using unordered_map instead of map in a main entity look-up table.

On the other hand, I found it was much slower at repeatedly inserting and removing elements. It's great for a relatively static collection of elements, but if you're doing tons of insertions and deletions the hashing + bucketing seems to add up. (Note, this was over many iterations.)

How to choose between map and unordered_map?

In practice, if memory is not an issue, unordered_map is almost always faster when all you need is single-element lookup.

The worst case is theoretical: it corresponds to a single hash value being shared by all of the elements. This is not of practical relevance. Relative to map, unordered_map only becomes slower once at least log N elements hash to the same value, which is also not of practical relevance. In some special scenarios you could use a specific hashing algorithm that ensures a more uniform distribution, but for ordinary strings that don't share a specific pattern, the generic hash function shipped with unordered_map is just as good.
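For illustration, a minimal sketch of plugging a custom hash function into unordered_map; FnvHash is an invented name, the FNV-1a constants are the standard 64-bit ones, and std::hash is usually perfectly adequate for ordinary strings:

#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical example hasher (FNV-1a), used only to show how a custom
// hash is supplied; std::hash<std::string> is normally fine.
struct FnvHash
{
    std::size_t operator()(const std::string& s) const noexcept
    {
        std::uint64_t h = 14695981039346656037ull;   // FNV offset basis
        for (unsigned char c : s)
        {
            h ^= c;
            h *= 1099511628211ull;                   // FNV prime
        }
        return static_cast<std::size_t>(h);
    }
};

// The third template parameter selects the hash function.
using CustomHashedMap = std::unordered_map<std::string, int, FnvHash>;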

If you want to traverse the map (using iterators) in sorted order, you cannot use unordered_map. In contrast, map not only allows that, it can also give you the closest element at or after a given key (see the lower_bound and upper_bound member functions).
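A small sketch of that "closest key" capability (the key/value data is invented for illustration):

#include <iostream>
#include <map>
#include <string>

int main()
{
    std::map<int, std::string> prices{{10, "bronze"}, {20, "silver"}, {30, "gold"}};

    // lower_bound: first element whose key is not less than 25 -> {30, "gold"}
    auto it = prices.lower_bound(25);
    if (it != prices.end())
        std::cout << it->first << " -> " << it->second << '\n';

    // upper_bound: first element whose key is greater than 20 -> {30, "gold"}
    auto it2 = prices.upper_bound(20);
    if (it2 != prices.end())
        std::cout << it2->first << " -> " << it2->second << '\n';
}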

Choosing between std::map and std::unordered_map

As already mentioned, map allows you to iterate over the elements in sorted order, while unordered_map does not. This is very important in many situations, for example when displaying a collection (e.g. an address book). It also manifests in more indirect ways, such as (1) being able to start iterating from the iterator returned by find(), or (2) the existence of member functions like lower_bound().
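As a minimal sketch of point (1), a lookup with find() can seed a sorted traversal of everything from that key onwards (the address-book entries here are invented):

#include <iostream>
#include <map>
#include <string>

int main()
{
    std::map<std::string, std::string> address_book{
        {"Alice", "alice@example.com"},
        {"Bob",   "bob@example.com"},
        {"Carol", "carol@example.com"}
    };

    // Start iterating from a found entry; the rest follow in sorted key order.
    for (auto it = address_book.find("Bob"); it != address_book.end(); ++it)
        std::cout << it->first << ": " << it->second << '\n';
}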

There is also a difference in worst-case search complexity:

  • For map, it is O( lg N )

  • For unordered_map, it is O( N ) [this can happen when the hash function is poor, leading to too many hash collisions]

The same is applicable for worst case deletion complexity.

Is an unordered_map really faster than a map in practice?

In response to questions about performance in relation to the number of missed searches, I have refactored the test to parameterise this.

Example results:

searches=1000000 set_size=      0 miss=    100% ordered=   4384 unordered=  12901 flat_map=    681
searches=1000000 set_size=     99 miss=  99.99% ordered=  89127 unordered=  42615 flat_map=  86091
searches=1000000 set_size=    172 miss=  99.98% ordered= 101283 unordered=  53468 flat_map=  96008
searches=1000000 set_size=    303 miss=  99.97% ordered= 112747 unordered=  53211 flat_map= 107343
searches=1000000 set_size=    396 miss=  99.96% ordered= 124179 unordered=  59655 flat_map= 112687
searches=1000000 set_size=    523 miss=  99.95% ordered= 132180 unordered=  51133 flat_map= 121669
searches=1000000 set_size=    599 miss=  99.94% ordered= 135850 unordered=  55078 flat_map= 121072
searches=1000000 set_size=    695 miss=  99.93% ordered= 140204 unordered=  60087 flat_map= 124961
searches=1000000 set_size=    795 miss=  99.92% ordered= 146071 unordered=  64790 flat_map= 127873
searches=1000000 set_size=    916 miss=  99.91% ordered= 154461 unordered=  50944 flat_map= 133194
searches=1000000 set_size=    988 miss=   99.9% ordered= 156327 unordered=  54094 flat_map= 134288

Key:

searches = number of searches performed against each map
set_size = how big each map is (and therefore how many of the searches will result in a hit)
miss = the probability that a given search misses; used when generating the keys, and therefore determines set_size
ordered = the time spent searching the ordered map
unordered = the time spent searching the unordered_map
flat_map = the time spent searching the flat map

note: time is measured in std::chrono::system_clock::duration ticks.

TL;DR

Results: the unordered_map shows its superiority as soon as there is data in the map. The only time it exhibits worse performance than the ordered map is when the maps are empty.

Here's the new code:

#include <iostream>
#include <iomanip>
#include <random>
#include <algorithm>
#include <string>
#include <vector>
#include <map>
#include <unordered_map>
#include <unordered_set>
#include <chrono>
#include <tuple>
#include <future>
#include <stdexcept>
#include <sstream>
#include <memory>   // unique_ptr, make_unique
#include <limits>   // numeric_limits

using namespace std;

// this sets the length of the string we will be using as a key.
// modify this to test whether key complexity changes the performance ratios
// of the various maps
static const size_t key_length = 20;

// the number of keys we will generate (the size of the test)
const size_t nkeys = 1000000;

// use a virtual method to prevent the optimiser from detecting that
// our sink function actually does nothing. otherwise it might skew the test
struct string_user
{
    virtual void sink(const std::string&) = 0;
    virtual ~string_user() = default;
};

struct real_string_user : string_user
{
    virtual void sink(const std::string&) override
    {
    }
};

struct real_string_user_print : string_user
{
    virtual void sink(const std::string& s) override
    {
        cout << s << endl;
    }
};

// generate a sink from a string - this is a runtime operation and therefore
// prevents the optimiser from realising that the sink does nothing
std::unique_ptr<string_user> make_sink(const std::string& name)
{
    if (name == "print")
    {
        return make_unique<real_string_user_print>();
    }
    if (name == "noprint")
    {
        return make_unique<real_string_user>();
    }
    throw logic_error(name);
}

// generate a random key, given a random engine and a distribution
auto gen_string = [](auto& engine, auto& dist)
{
    std::string result(key_length, ' ');
    generate(begin(result), end(result), [&] {
        return static_cast<char>(dist(engine));
    });
    return result;
};

// comparison predicate for our flat map.
struct pair_less
{
    bool operator()(const pair<string, string>& l, const string& r) const {
        return l.first < r;
    }

    bool operator()(const string& l, const pair<string, string>& r) const {
        return l < r.first;
    }
};

template<class F>
auto time_test(F&& f, const vector<string>& keys)
{
    auto start_time = chrono::system_clock::now();

    for (auto const& key : keys)
    {
        f(key);
    }

    auto stop_time = chrono::system_clock::now();
    auto diff = stop_time - start_time;
    return diff;
}

struct report_key
{
    size_t nkeys;
    int miss_chance;
};

std::ostream& operator<<(std::ostream& os, const report_key& key)
{
    return os << "miss=" << setw(2) << key.miss_chance << "%";
}

void run_test(string_user& sink, size_t nkeys, double miss_prob)
{
    // the types of map we will test
    unordered_map<string, string> unordered;
    map<string, string> ordered;
    vector<pair<string, string>> flat_map;

    // a vector of all keys, which we can shuffle in order to randomise
    // access order of all our maps consistently
    vector<string> keys;
    unordered_set<string> keys_record;

    // generate keys
    auto eng = std::default_random_engine(std::random_device()());
    // note: uniform_int_distribution is only specified for the standard
    // integer types, so we draw ints and narrow to char when building keys
    auto alpha_dist = std::uniform_int_distribution<int>('A', 'Z');
    auto prob_dist = std::uniform_real_distribution<double>(0, 1.0 - std::numeric_limits<double>::epsilon());

    auto generate_new_key = [&] {
        while(true) {
            // generate a key
            auto key = gen_string(eng, alpha_dist);
            // record it in the set of keys seen so far;
            // if it was already there, force a regeneration
            if(keys_record.insert(key).second) {
                return key;
            }
        }
    };

    for (size_t i = 0 ; i < nkeys ; ++i)
    {
        auto value = to_string(i);

        auto key = generate_new_key();
        // with probability (1 - miss_prob), store the key in all three maps
        if (prob_dist(eng) >= miss_prob) {
            unordered.emplace(key, value);
            flat_map.emplace_back(key, value);
            ordered.emplace(key, std::move(value));
        }
        // record the key for later use
        keys.emplace_back(std::move(key));
    }

    // turn our vector 'flat map' into an actual flat map by sorting it
    // by pair.first (the key)
    sort(begin(flat_map), end(flat_map),
         [](const auto& l, const auto& r) { return l.first < r.first; });

    // shuffle the keys to randomise access order
    shuffle(begin(keys), end(keys), eng);

    auto unordered_lookup = [&](auto& key) {
        auto i = unordered.find(key);
        if (i != end(unordered)) {
            sink.sink(i->second);
        }
    };

    auto ordered_lookup = [&](auto& key) {
        auto i = ordered.find(key);
        if (i != end(ordered)) {
            sink.sink(i->second);
        }
    };

    auto flat_map_lookup = [&](auto& key) {
        auto i = lower_bound(begin(flat_map),
                             end(flat_map),
                             key,
                             pair_less());
        if (i != end(flat_map) && i->first == key) {
            sink.sink(i->second);
        }
    };

    // spawn a thread to time access to the unordered map
    auto unordered_future = async(launch::async, [&]
    {
        return time_test(unordered_lookup, keys);
    });

    // spawn a thread to time access to the ordered map
    auto ordered_future = async(launch::async, [&]
    {
        return time_test(ordered_lookup, keys);
    });

    // spawn a thread to time access to the flat map
    auto flat_future = async(launch::async, [&]
    {
        return time_test(flat_map_lookup, keys);
    });

    // synchronise all the threads and get the timings
    auto ordered_time = ordered_future.get();
    auto unordered_time = unordered_future.get();
    auto flat_time = flat_future.get();

    cout << "searches=" << setw(7) << nkeys;
    cout << " set_size=" << setw(7) << unordered.size();
    cout << " miss=" << setw(7) << setprecision(6) << miss_prob * 100.0 << "%";
    cout << " ordered=" << setw(7) << ordered_time.count();
    cout << " unordered=" << setw(7) << unordered_time.count();
    cout << " flat_map=" << setw(7) << flat_time.count() << endl;
}

int main()
{
    // generate the sink, preventing the optimiser from realising what it
    // does.
    stringstream ss;
    ss << "noprint";
    string arg;
    ss >> arg;
    auto puser = make_sink(arg);

    for (double chance = 1.0 ; chance >= 0.0 ; chance -= 0.0001)
    {
        run_test(*puser, 1000000, chance);
    }

    return 0;
}

Which one is better, stl map or unordered_map, for the following cases?

Which one is faster for insert, delete, and look-up? Which one takes less memory and less time to clear from memory? Any explanations are heartily welcome!

For a specific use, you should try both with your actual data and usage patterns and see which is actually faster... there are enough factors that it's dangerous to assume either will always "win".

implementation and characteristics of unordered maps / hash tables

Academically, as the number of elements increases towards infinity, those operations on a std::unordered_map (the C++ library offering for what Computing Science terms a "hash map" or "hash table") will tend to keep taking the same amount of time, O(1) (ignoring memory limits, caching, etc.), whereas with a std::map (a balanced binary tree) each doubling of the number of elements typically costs one extra comparison, so it gets gradually slower, O(log2 N).

std::unordered_map implementations necessarily use open hashing (i.e. separate chaining): the fundamental expectation is that there will be a contiguous array of "buckets", each of which is logically a container of the values hashing to it.

It generally serves to picture the hash table as a vector<list<pair<key,value>>>, where getting from a bucket to a value involves at least one pointer dereference as you follow the list-head pointer stored in the bucket to the first list node; the performance of insert/find/erase depends on the length of that list, which on average equals the unordered_map's load_factor().
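A rough sketch of that mental model, deliberately simplified (real implementations typically use singly-linked nodes and may cache hash values):

#include <list>
#include <string>
#include <utility>
#include <vector>

// Conceptual picture only: a bucket array where each bucket owns a list of
// the key/value pairs that hashed to it. A lookup hashes the key, indexes
// the vector, then walks the (hopefully short) list.
using Key = std::string;
using Value = std::string;
using ConceptualHashMap = std::vector<std::list<std::pair<Key, Value>>>;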

If the max_load_factor is lowered (the default is 1.0), then there will be fewer collisions, but more reallocation/rehashing during insertion and more wasted memory (which can itself hurt performance through increased cache misses).
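For instance, a minimal sketch of tuning those knobs (the values chosen are arbitrary):

#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, int> m;

    // Trade memory for fewer collisions: rehash once the average bucket
    // occupancy would exceed 0.5 instead of the default 1.0.
    m.max_load_factor(0.5f);

    // Optionally pre-size the bucket array for an expected element count,
    // avoiding repeated rehashing during bulk insertion.
    m.reserve(100000);
}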

The memory usage for this most obvious of unordered_map implementations involves both the contiguous array of bucket_count() list-head-iterator/pointer-sized buckets and one doubly-linked list node per key/value pair: typically bucket_count() + 2 * size() extra pointers of overhead, adjusted for any rounding-up of dynamic memory allocation request sizes the implementation might do (for example, if you ask for 100 bytes you might get 128, 256 or 512). An implementation's dynamic memory routines may also use some memory to track the allocated/available regions.

Still, the C++ Standard leaves room for real-world implementations to make some of their own performance/memory-usage decisions. They could, for example, keep the old contiguous array of buckets around for a while after allocating a new larger array, so rehashing values into the latter can be done gradually to reduce the worst-case performance at the cost of average-case performance as both arrays are consulted during operations.

implementation and characteristics of maps / balanced binary trees

A map is a balanced binary tree (typically a red-black tree), and can be expected to employ pointers linking distinct heap memory regions returned by different calls to new. As well as the key/value data, each node in the tree needs parent, left and right pointers (see Wikipedia's binary tree article if lost).

comparison

So, both unordered_map and map need to allocate nodes for key/value pairs with the former typically having two-pointer/iterator overhead for prev/next-node linkage, and the latter having three for parent/left/right. But, the unordered_map additionally has the single contiguous allocation for bucket_count() buckets (== size() / load_factor()).
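As a rough worked example of those numbers, a back-of-envelope sketch assuming one million entries, 8-byte pointers, the default max_load_factor of 1.0, and no allocator rounding:

#include <cstddef>
#include <iostream>

int main()
{
    // Back-of-envelope estimate under the model described above (not measured).
    const std::size_t n = 1000000;   // number of key/value pairs
    const std::size_t ptr = 8;       // pointer size in bytes

    // unordered_map: one bucket head per element (load factor 1.0)
    // plus two list pointers per node
    const std::size_t unordered_overhead = (n + 2 * n) * ptr;   // ~24 MB

    // map: parent/left/right pointers per node
    const std::size_t map_overhead = 3 * n * ptr;               // ~24 MB

    std::cout << unordered_overhead << " vs " << map_overhead << " bytes\n";
}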

For most purposes that's not a dramatic difference in memory usage, and the deallocation time difference for one extra region is unlikely to be noticeable.

another alternative

For those occasions when the container is populated up front and then repeatedly searched without further inserts/erases, it can sometimes be fastest to use a sorted vector, searched with the Standard algorithms binary_search, equal_range, lower_bound and upper_bound. A single contiguous memory allocation is much more cache friendly, so this will generally outperform map, though unordered_map may still be faster - measure if you care.
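A minimal sketch of the sorted-vector approach (keys and values are invented):

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main()
{
    // Populate up front, sort once by key, then only search.
    std::vector<std::pair<std::string, int>> data{
        {"banana", 2}, {"apple", 1}, {"cherry", 3}};
    std::sort(data.begin(), data.end());   // sorts by key, then value

    // Binary search with lower_bound; compare elements against a key-only probe.
    auto it = std::lower_bound(data.begin(), data.end(), std::string("banana"),
        [](const std::pair<std::string, int>& p, const std::string& key)
        { return p.first < key; });

    if (it != data.end() && it->first == "banana")
        std::cout << it->second << '\n';   // prints 2
}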

Implement "did you mean" over the keys of an unordered_map

It is almost certainly not worth getting fancy here. Implement the brute-force solution that iterates through all the keys, computes a distance for each, and takes the minimum. Profile it, and you'll probably find it's fast enough.
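A hedged sketch of that brute-force approach, using a plain Levenshtein edit distance (did_you_mean and the sample dictionary are invented for illustration):

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Classic dynamic-programming Levenshtein edit distance.
std::size_t edit_distance(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Scan every key in the map and keep the closest one.
std::string did_you_mean(const std::unordered_map<std::string, int>& dict,
                         const std::string& query)
{
    std::string best;
    std::size_t best_dist = std::string::npos;
    for (const auto& kv : dict)
    {
        const std::size_t d = edit_distance(query, kv.first);
        if (d < best_dist) { best_dist = d; best = kv.first; }
    }
    return best;
}

int main()
{
    std::unordered_map<std::string, int> dict{{"apple", 0}, {"banana", 1}, {"cherry", 2}};
    std::cout << did_you_mean(dict, "banan") << '\n';   // prints "banana"
}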

But if you want to have fun...

String edit distance obeys the triangle inequality, which means any geometric approximate-near-neighbour data structure that accepts arbitrary distance functions applies here. I'm fond of locality-sensitive hashing (LSH).

But approximate nearest-neighbour search degrades as the dimension increases, and the dimension here is roughly the string length. So you might want a less rigorous approach. BLAST (genome search) does substring-based exact lookup; your strings are shorter, so you might need bigrams or trigrams. Alternatively, you might assume the length will be close to correct, and just check everything whose length is a near match.

If you have access to a large database of typos, you could try training a convolutional neural net (one-hot encoding each character) to map strings to low-dimensional feature vectors, with a cost function that puts typos close to their intended strings. Then keep the feature vectors of the legitimate strings in a KD-tree.

But all that's for fun. If the code matters, keep it simple.

In practice, when `std::unordered_map` must be used instead of `std::map`?

I know the difference between them, e.g. internal implementation and the time complexity of searching for an element.

In that case, you should know that the average asymptotic lookup complexity of the unordered map is constant, while that of the ordered map is logarithmic. This means that beyond some container size, a lookup will be faster with the unordered map.

But I really can't find a circumstance where std::unordered_map could not be replaced by std::map.

If the container is large enough and you cannot afford the cost of the ordered map's logarithmic lookup, then you cannot afford to replace the unordered map's faster lookup with it.

Another case where an ordered map cannot be used is when there is no cheap function that compares the relative order of two keys (even though a cheap hash and equality comparison exist).
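As an illustration of that case, a hedged sketch of a key type that naturally supports only equality and hashing (Fingerprint and FingerprintHash are invented names), which unordered_map accepts directly while map would need an artificial ordering:

#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Invented example key: equality and hashing are natural, but there is no
// meaningful ordering (any "less than" would have to be made up just to
// satisfy std::map).
struct Fingerprint
{
    std::uint64_t hi;
    std::uint64_t lo;

    bool operator==(const Fingerprint& other) const
    {
        return hi == other.hi && lo == other.lo;
    }
};

struct FingerprintHash
{
    std::size_t operator()(const Fingerprint& f) const noexcept
    {
        // Simple mix; good enough for a sketch.
        return std::hash<std::uint64_t>{}(f.hi ^ (f.lo * 0x9e3779b97f4a7c15ull));
    }
};

using FingerprintIndex =
    std::unordered_map<Fingerprint, std::string, FingerprintHash>;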


