Boost::Flat_Map and Its Performance Compared to Map and Unordered_Map

boost::flat_map and its performance compared to map and unordered_map

I ran a benchmark on different data structures very recently at my company, so I feel I need to drop a word. It is very complicated to benchmark something correctly.

Benchmarking

On the web, we rarely (if ever) find a well-engineered benchmark. Until today I had only found benchmarks done the journalist way: pretty quickly, sweeping dozens of variables under the carpet.

1) You need to consider cache warming

Most people running benchmarks are afraid of timer discrepancy, so they run their code thousands of times and take the total time; they are just careful to use the same number of runs for every operation, and then consider the results comparable.

The truth is, in the real world this makes little sense, because your cache will not be warm, and your operation will likely be called just once. Therefore you need to benchmark using RDTSC, and time things calling them once only.
Intel has published a paper describing how to use RDTSC (using a cpuid instruction to flush the pipeline, and calling it at least 3 times at the beginning of the program to stabilize it).

2) RDTSC accuracy measure

I also recommend doing this:

#include <intrin.h>    // __cpuid / __rdtsc on MSVC
#include <algorithm>   // std::min / std::max
#include <cstdio>

typedef unsigned long long u64;

u64 g_correctionFactor;  // number of clocks to offset after each measurement to remove the overhead of the measurer itself.
u64 g_accuracy;

static u64 const errormeasure = ~((u64)0);

#ifdef _MSC_VER
#pragma intrinsic(__rdtsc)
inline u64 GetRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000); // flush OOO instruction pipeline
    return __rdtsc();
}

inline void WarmupRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000); // warmup cpuid.
    __cpuid(a, 0x80000000);
    __cpuid(a, 0x80000000);

    // measure the measurer overhead with the measurer (crazy, heh..)
    u64 minDiff = errormeasure;
    u64 maxDiff = 0; // this is going to help calculate our PRECISION ERROR MARGIN
    for (int i = 0; i < 80; ++i)
    {
        u64 tick1 = GetRDTSC();
        u64 tick2 = GetRDTSC();
        minDiff = std::min(minDiff, tick2 - tick1); // make many takes, keep the smallest that ever comes.
        maxDiff = std::max(maxDiff, tick2 - tick1);
    }
    g_correctionFactor = minDiff;

    printf("Correction factor %llu clocks\n", g_correctionFactor);

    g_accuracy = maxDiff - minDiff;
    printf("Measurement Accuracy (in clocks) : %llu\n", g_accuracy);
}
#endif

This is a discrepancy measurer: it takes the minimum of all measured values, to avoid occasionally getting an absurd value around -10^18 (the first negative values of a 64-bit signed integer, when a clock difference goes negative and wraps around).
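A minimal usage sketch, assuming WarmupRDTSC() has already been called at startup (MyFunction is a hypothetical stand-in for the operation under test):

u64 t1 = GetRDTSC();
MyFunction();                                  // the single cold call under test
u64 t2 = GetRDTSC();
u64 cycles = (t2 - t1) - g_correctionFactor;   // subtract the measurer's own overhead
if (cycles < g_accuracy)
    printf("below the error margin; the result is not meaningful\n");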

Notice the use of intrinsics rather than inline assembly. First, inline assembly is rarely supported by compilers nowadays; much worse, the compiler creates a full ordering barrier around inline assembly because it cannot statically analyze its contents, so it is a problem for benchmarking real-world code, especially when calling things just once. An intrinsic is suited here because it doesn't break the compiler's free reordering of instructions.

3) Parameters

The last problem is that people usually test too few variations of the scenario.
A container's performance is affected by:

  1. the allocator
  2. the size of the contained type
  3. the cost of the contained type's copy, assignment, move, and construction operations
  4. the number of elements in the container (the size of the problem)
  5. whether the type has trivial versions of the point-3 operations
  6. whether the type is POD

Point 1 is important because containers allocate from time to time, and it matters a lot whether they allocate using the CRT "new" or some user-defined operation, like pool allocation, a freelist, or other schemes...

(For people interested in point 1, join the mystery thread on gamedev about the system allocator's performance impact.)

Point 2 is because some containers (say A) will lose time copying stuff around, and the bigger the type the bigger the overhead. The problem is that when comparing to another container B, A may win over B for small types, and lose for larger types.

Point 3 is the same as point 2, except it multiplies the cost by some weighting factor.

Point 4 is a question of big O mixed with cache effects. Some containers with bad complexity can largely outperform low-complexity containers for a small number of elements (like vector vs. map: the vector's cache locality is good, whereas map fragments memory). Then, at some crossover point, they lose, because the contained overall size starts to "leak" into main memory and cause cache misses, plus the asymptotic complexity starts to be felt.

Point 5 is about compilers being able to elide code that is empty or trivial at compile time. This can greatly optimize some operations, because the containers are templated, so each type will have its own performance profile.

Point 6 is the same as point 5: PODs can benefit from the fact that copy construction is just a memcpy, and some containers have a specific implementation for these cases, using partial template specialization, or SFINAE to select algorithms according to traits of T.
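As an illustrative sketch of that technique (not taken from any particular container implementation), a bulk-copy helper can dispatch on a trait at compile time; copy_elements and its helpers are hypothetical names:

#include <cstddef>      // std::size_t
#include <cstring>      // std::memcpy
#include <new>          // placement new
#include <type_traits>  // std::is_trivially_copyable

// trivially copyable types: one raw memcpy
template<class T>
void copy_elements_impl(T* dst, const T* src, std::size_t n, std::true_type)
{
    std::memcpy(dst, src, n * sizeof(T));
}

// everything else: element-wise copy construction
template<class T>
void copy_elements_impl(T* dst, const T* src, std::size_t n, std::false_type)
{
    for (std::size_t i = 0; i != n; ++i)
        new (dst + i) T(src[i]);
}

// tag dispatch selects the right overload at compile time; the untaken
// branch is never instantiated for the wrong kind of type
template<class T>
void copy_elements(T* dst, const T* src, std::size_t n)
{
    copy_elements_impl(dst, src, n, std::is_trivially_copyable<T>{});
}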

About the flat map

Apparently, the flat map is a sorted-vector wrapper, like Loki's AssocVector, but with some supplementary modernizations coming with C++11, exploiting move semantics to accelerate insertion and deletion of single elements.

This is still an ordered container. Most people usually don't need the ordering part; hence the existence of unordered...

Have you considered that maybe you need a flat_unorderedmap? Which would be something like google::sparse_map or similar: an open-address hash map.

The problem with open-address hash maps is that at rehash time they have to copy everything around to the new, extended flat land, whereas a standard unordered map just has to recreate the hash index while the allocated data stays where it is. The disadvantage, of course, is that the memory is fragmented like hell.

The criterion for a rehash in an open-address hash map is when the size exceeds the capacity of the bucket vector multiplied by the load factor.

A typical load factor is 0.8, so you need to take care of that: if you can pre-size your hash map before filling it, always pre-size to intended_filling * (1/0.8) + epsilon. This gives you a guarantee of never having to spuriously rehash and recopy everything during filling.
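For std::unordered_map (a closed-addressing map, discussed next), the same pre-sizing idea is expressed through max_load_factor() and reserve(); a minimal sketch, where intended_filling is a placeholder for your expected element count:

#include <cstddef>
#include <unordered_map>

void presize(std::unordered_map<int, int>& m, std::size_t intended_filling)
{
    m.max_load_factor(0.8f);      // match the 0.8 load factor discussed above
    m.reserve(intended_filling);  // reserve() already accounts for the load
                                  // factor, so no manual * (1/0.8) is needed
}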

The advantage of closed-addressing maps (std::unordered..) is that you don't have to care about those parameters.

But boost::flat_map is an ordered vector; therefore, it will always have O(log N) asymptotic lookup, which is worse than the open-address hash map (amortized constant time). You should consider that as well.

Benchmark results

This is a test involving different maps (with int key and __int64/somestruct as value) and std::vector.

tested types information:

typeid=__int64 . sizeof=8 . ispod=yes
typeid=struct MediumTypePod . sizeof=184 . ispod=yes

Insertion

EDIT:

My previous results included a bug: they actually tested ordered insertion, which exhibited very fast behavior for the flat maps.

I left those results further down this page because they are interesting.

This is the correct test:
[chart: random insert of 100 elements]

[chart: random insert of 10000 elements]

I have checked the implementation: there is no deferred sort implemented in the flat maps here. Each insertion sorts on the fly, so this benchmark exhibits the asymptotic tendencies (a sketch of why follows the list):

map: O(N * log(N))

hashmaps: O(N)

vector and flatmaps: O(N * N)
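For intuition, here is what a single sorted-vector insert looks like; the search is cheap, but the insert shifts everything after the insertion point, which is what makes N random insertions O(N * N) overall (a sketch, not boost's actual code):

#include <algorithm>
#include <vector>

void sorted_insert(std::vector<int>& v, int key)
{
    auto pos = std::upper_bound(v.begin(), v.end(), key); // O(log N) search
    v.insert(pos, key);                                   // O(N) shift on average
}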

Warning: hereafter, the two tests for std::map and both flat_maps are buggy and actually test ordered insertion (vs. random insertion for the other containers; yes, it's confusing, sorry):

[chart: mixed insert of 100 elements without reservation]

We can see that ordered insertion results in back pushing and is extremely fast. However, from the non-charted results of my benchmark, I can also say that this is not near the absolute optimum for a back-insertion. At 10k elements, perfect back-insertion optimality is obtained on a pre-reserved vector, which gives 3 million cycles; we observe 4.8M here for ordered insertion into the flat_map (therefore 160% of the optimum).

[chart: mixed insert of 10000 elements without reservation]

Analysis: remember this is 'random insert' for the vector, so the massive 1 billion cycles come from having to shift half the data (on average) upward, one element at a time, at each insertion.

Random search of 3 elements (clocks renormalized to 1)

in size = 100

[chart: random search within a container of 100 elements]

in size = 10000

[chart: random search within a container of 10000 elements]

Iteration

over size 100 (only MediumPod type)

[chart: iteration over 100 medium pods]

over size 10000 (only MediumPod type)

[chart: iteration over 10000 medium pods]

Final grain of salt

In the end, I wanted to come back to "Benchmarking §3 Pt 1" (the system allocator). In a recent experiment I ran on the performance of an open-address hash map I developed, I measured a performance gap of more than 3000% between Windows 7 and Windows 8 on some std::unordered_map use cases (discussed here).

This makes me want to warn the reader about the above results (they were made on Win7): your mileage may vary.

map vs unordered_map for few elements

Under the premise that you always need to measure in order to figure out what's more appropriate in terms of performance, if all these things are true:

  1. Changes to the map are not frequent;
  2. The map contains a maximum of 10 elements;
  3. Lookups will be frequent;
  4. You care a lot about performance;

Then I would say you would be better off putting your elements in an std::vector and performing a plain iteration over all your elements to find the one you're looking for.

An std::vector will allocate its elements in a contiguous region of memory, so cache locality is likely to grant you better performance: the time required to fetch a cache line from main memory after a cache miss is at least an order of magnitude greater than the time required to access the CPU cache.
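A minimal sketch of that approach (Key and Value are placeholders for your own types):

#include <algorithm>
#include <utility>
#include <vector>

// linear search: for ~10 elements this typically beats std::map's
// pointer-chasing lookups thanks to cache locality
template<class Key, class Value>
Value* find_linear(std::vector<std::pair<Key, Value>>& v, const Key& k)
{
    auto it = std::find_if(v.begin(), v.end(),
                           [&](const std::pair<Key, Value>& p) { return p.first == k; });
    return it == v.end() ? nullptr : &it->second;
}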

Quite interestingly, it seems like Boost's flat_map is ideal for your use case (courtesy of Praetorian):

flat_map is similar to std::map but it's implemented like an ordered vector. (from the online documentation)

So if using Boost is an option for you, you may want to try this one.
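Usage is largely a drop-in for std::map in the common cases; a small sketch:

#include <boost/container/flat_map.hpp>
#include <string>

void demo()
{
    boost::container::flat_map<int, std::string> fm;
    fm.reserve(100);        // vector-backed, so pre-reserving is possible
    fm[42] = "answer";      // same interface as std::map
    auto it = fm.find(42);  // binary search over contiguous storage
    (void)it;
}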

Is an unordered_map really faster than a map in practice?

In response to questions about performance in relation to the number of missed searches, I have refactored the test to parameterise this.

Example results:

searches=1000000 set_size=      0 miss=   100% ordered=   4384 unordered=  12901 flat_map=    681
searches=1000000 set_size=     99 miss= 99.99% ordered=  89127 unordered=  42615 flat_map=  86091
searches=1000000 set_size=    172 miss= 99.98% ordered= 101283 unordered=  53468 flat_map=  96008
searches=1000000 set_size=    303 miss= 99.97% ordered= 112747 unordered=  53211 flat_map= 107343
searches=1000000 set_size=    396 miss= 99.96% ordered= 124179 unordered=  59655 flat_map= 112687
searches=1000000 set_size=    523 miss= 99.95% ordered= 132180 unordered=  51133 flat_map= 121669
searches=1000000 set_size=    599 miss= 99.94% ordered= 135850 unordered=  55078 flat_map= 121072
searches=1000000 set_size=    695 miss= 99.93% ordered= 140204 unordered=  60087 flat_map= 124961
searches=1000000 set_size=    795 miss= 99.92% ordered= 146071 unordered=  64790 flat_map= 127873
searches=1000000 set_size=    916 miss= 99.91% ordered= 154461 unordered=  50944 flat_map= 133194
searches=1000000 set_size=    988 miss=  99.9% ordered= 156327 unordered=  54094 flat_map= 134288

Key:

searches = number of searches performed against each map
set_size = how big each map is (and therefore how many of the searches will result in a hit)
miss = the probability of generating a missed search. Used for generating searches and set_size.
ordered = the time spent searching the ordered map
unordered = the time spent searching the unordered_map
flat_map = the time spent searching the flat map

note: time is measured in std::chrono::system_clock::duration ticks.

TL;DR

Results: the unordered_map shows its superiority as soon as there is data in the map. The only time it exhibits worse performance than the ordered map is when the maps are empty.

Here's the new code:

#include <iostream>
#include <iomanip>
#include <random>
#include <algorithm>
#include <string>
#include <vector>
#include <map>
#include <unordered_map>
#include <unordered_set>
#include <chrono>
#include <tuple>
#include <future>
#include <stdexcept>
#include <sstream>
#include <memory>   // std::unique_ptr / std::make_unique
#include <limits>   // std::numeric_limits

using namespace std;

// this sets the length of the string we will be using as a key.
// modify this to test whether key complexity changes the performance ratios
// of the various maps
static const size_t key_length = 20;

// the number of keys we will generate (the size of the test)
const size_t nkeys = 1000000;

// use a virtual method to prevent the optimiser from detecting that
// our sink function actually does nothing. otherwise it might skew the test
struct string_user
{
    virtual void sink(const std::string&) = 0;
    virtual ~string_user() = default;
};

struct real_string_user : string_user
{
    virtual void sink(const std::string&) override
    {
    }
};

struct real_string_user_print : string_user
{
    virtual void sink(const std::string& s) override
    {
        cout << s << endl;
    }
};

// generate a sink from a string - this is a runtime operation and therefore
// prevents the optimiser from realising that the sink does nothing
std::unique_ptr<string_user> make_sink(const std::string& name)
{
    if (name == "print")
    {
        return make_unique<real_string_user_print>();
    }
    if (name == "noprint")
    {
        return make_unique<real_string_user>();
    }
    throw logic_error(name);
}

// generate a random key, given a random engine and a distribution
auto gen_string = [](auto& engine, auto& dist)
{
    std::string result(key_length, ' ');
    generate(begin(result), end(result), [&] {
        return static_cast<char>(dist(engine));
    });
    return result;
};

// comparison predicate for our flat map.
struct pair_less
{
    bool operator()(const pair<string, string>& l, const string& r) const {
        return l.first < r;
    }

    bool operator()(const string& l, const pair<string, string>& r) const {
        return l < r.first;
    }
};

template<class F>
auto time_test(F&& f, const vector<string>& keys)  // take keys by reference, not by value
{
    auto start_time = chrono::system_clock::now();

    for (auto const& key : keys)
    {
        f(key);
    }

    auto stop_time = chrono::system_clock::now();
    auto diff = stop_time - start_time;
    return diff;
}

struct report_key
{
    size_t nkeys;
    int miss_chance;
};

std::ostream& operator<<(std::ostream& os, const report_key& key)
{
    return os << "miss=" << setw(2) << key.miss_chance << "%";
}

void run_test(string_user& sink, size_t nkeys, double miss_prob)
{
    // the types of map we will test
    unordered_map<string, string> unordered;
    map<string, string> ordered;
    vector<pair<string, string>> flat_map;

    // a vector of all keys, which we can shuffle in order to randomise
    // access order of all our maps consistently
    vector<string> keys;
    unordered_set<string> keys_record;

    // generate keys
    auto eng = std::default_random_engine(std::random_device()());
    // draw ints and cast to char in gen_string: uniform_int_distribution<char>
    // is not guaranteed to be supported by the standard
    auto alpha_dist = std::uniform_int_distribution<int>('A', 'Z');
    auto prob_dist = std::uniform_real_distribution<double>(0, 1.0 - std::numeric_limits<double>::epsilon());

    auto generate_new_key = [&] {
        while (true) {
            // generate a key and record it in the key set;
            // if it already exists, force a regeneration
            auto key = gen_string(eng, alpha_dist);
            if (keys_record.insert(key).second) {
                return key;
            }
        }
    };

    for (size_t i = 0 ; i < nkeys ; ++i)
    {
        auto value = to_string(i);

        auto key = generate_new_key();
        if (prob_dist(eng) >= miss_prob) {
            unordered.emplace(key, value);
            flat_map.emplace_back(key, value);
            ordered.emplace(key, std::move(value));
        }
        // record the key for later use
        keys.emplace_back(std::move(key));
    }

    // turn our vector 'flat map' into an actual flat map by sorting it by
    // pair.first (the key)
    sort(begin(flat_map), end(flat_map),
         [](const auto& l, const auto& r) { return l.first < r.first; });

    // shuffle the keys to randomise access order
    shuffle(begin(keys), end(keys), eng);

    auto unordered_lookup = [&](auto& key) {
        auto i = unordered.find(key);
        if (i != end(unordered)) {
            sink.sink(i->second);
        }
    };

    auto ordered_lookup = [&](auto& key) {
        auto i = ordered.find(key);
        if (i != end(ordered)) {
            sink.sink(i->second);
        }
    };

    auto flat_map_lookup = [&](auto& key) {
        auto i = lower_bound(begin(flat_map), end(flat_map), key, pair_less());
        if (i != end(flat_map) && i->first == key) {
            sink.sink(i->second);
        }
    };

    // spawn a thread to time access to the unordered map
    auto unordered_future = async(launch::async, [&]
    {
        return time_test(unordered_lookup, keys);
    });

    // spawn a thread to time access to the ordered map
    auto ordered_future = async(launch::async, [&]
    {
        return time_test(ordered_lookup, keys);
    });

    // spawn a thread to time access to the flat map
    auto flat_future = async(launch::async, [&]
    {
        return time_test(flat_map_lookup, keys);
    });

    // synchronise all the threads and get the timings
    auto ordered_time = ordered_future.get();
    auto unordered_time = unordered_future.get();
    auto flat_time = flat_future.get();

    cout << "searches=" << setw(7) << nkeys;
    cout << " set_size=" << setw(7) << unordered.size();
    cout << " miss=" << setw(7) << setprecision(6) << miss_prob * 100.0 << "%";
    cout << " ordered=" << setw(7) << ordered_time.count();
    cout << " unordered=" << setw(7) << unordered_time.count();
    cout << " flat_map=" << setw(7) << flat_time.count() << endl;
}

int main()
{
    // generate the sink, preventing the optimiser from realising what it
    // does.
    stringstream ss;
    ss << "noprint";
    string arg;
    ss >> arg;
    auto puser = make_sink(arg);

    for (double chance = 1.0 ; chance >= 0.0 ; chance -= 0.0001)
    {
        run_test(*puser, 1000000, chance);
    }

    return 0;
}

Is there a flat unsorted map/set implementation?

Something like this?

#include <algorithm>    // std::find_if, std::remove_if
#include <cstddef>      // size_t
#include <iterator>     // std::begin / std::end
#include <memory>       // std::addressof
#include <type_traits>  // std::enable_if_t, std::is_convertible
#include <utility>      // std::forward
#include <vector>

template<class Key, class Value, template<class...>class Storage=std::vector>
struct flat_map {
    struct kv {
        Key k;
        Value v;
        template<class K, class V>
        kv( K&& kin, V&& vin ):k(std::forward<K>(kin)), v(std::forward<V>(vin)){}
    };
    using storage_t = Storage<kv>;
    storage_t storage;

    // TODO: adl upgrade
    using iterator=decltype(std::begin(std::declval<storage_t&>()));
    using const_iterator=decltype(std::begin(std::declval<const storage_t&>()));
    // boilerplate:
    iterator begin() {
        using std::begin;
        return begin(storage);
    }
    const_iterator begin() const {
        using std::begin;
        return begin(storage);
    }
    const_iterator cbegin() const {
        using std::begin;
        return begin(storage);
    }
    iterator end() {
        using std::end;
        return end(storage);
    }
    const_iterator end() const {
        using std::end;
        return end(storage);
    }
    const_iterator cend() const {
        using std::end;
        return end(storage);
    }
    size_t size() const {
        return storage.size();
    }
    bool empty() const {
        return storage.empty();
    }
    // these only have to be valid if called:
    void reserve(size_t n) {
        storage.reserve(n);
    }
    size_t capacity() const {
        return storage.capacity();
    }
    // map-like interface:
    // TODO: SFINAE check for type of key
    template<class K>
    Value& operator[](K&& k){
        auto it = find(k);
        if (it != end()) return it->v;
        storage.emplace_back( std::forward<K>(k), Value{} );
        return storage.back().v;
    }
private: // C++14, but you can just inject the lambda at point of use in 11:
    template<class K>
    auto key_match( K& k ) {
        return [&k](kv const& kv){
            return kv.k == k;
        };
    }
public:
    template<class K>
    iterator find(K&& k) {
        return std::find_if( begin(), end(), key_match(k) );
    }
    template<class K>
    const_iterator find(K&& k) const {
        return const_cast<flat_map*>(this)->find(k);
    }
    // iterator-less query functions:
    template<class K>
    Value* get(K&& k) {
        auto it = find(std::forward<K>(k));
        if (it==end()) return nullptr;
        return std::addressof(it->v);
    }
    template<class K>
    Value const* get(K&& k) const {
        return const_cast<flat_map*>(this)->get(std::forward<K>(k));
    }
    // key-based erase: (SFINAE should be is_comparable, but that doesn't exist)
    template<class K, class=std::enable_if_t<std::is_convertible<K, Key>{}>>
    bool erase(K&& k) {
        // note remove_if, not remove: we are erasing by predicate
        auto it = std::remove_if(
            storage.begin(), storage.end(), key_match(k)
        );
        if (it == storage.end()) return false;
        storage.erase( it, storage.end() );
        return true;
    }
    // classic erase, for iterating:
    iterator erase(const_iterator it) {
        return storage.erase(it);
    }
    template<class K2, class V2,
        class=std::enable_if_t<
            std::is_convertible< K2, Key >{}&&
            std::is_convertible< V2, Value >{}
        >
    >
    void set( K2&& kin, V2&& vin ) {
        auto it = find(kin);
        if (it != end()){
            it->v = std::forward<V2>(vin);
        } else {
            storage.emplace_back( std::forward<K2>(kin), std::forward<V2>(vin) );
        }
    }
};

I left the container type as a template argument, so you can use a SBO vector-like structure if you choose.

In theory, I should expose a template parameter for replacing equality comparison on the keys. I did, however, make the key-search functions transparent.
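A quick usage example of the sketch above:

#include <string>

void demo()
{
    flat_map<std::string, int> counts;
    counts["apple"] = 3;               // insert via operator[]
    counts.set("pear", 5);             // insert-or-assign
    if (int* p = counts.get("apple"))  // iterator-less lookup
        ++*p;
    counts.erase("pear");              // key-based erase
}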

std::map vs unordered_map memory footprint for small N

For small N, a sorted vector is often the fastest container and has minimal memory footprint. boost::flat_map is one implementation; others call it AssociativeVector. You can implement it easily with std::lower_bound. Since the data has to stay sorted, insert and delete operations are O(N), because elements have to be shifted within the vector.

With a sorted vector, you can do a binary search which is O(log N) for retrieval. The complexity of std::lower_bound is:

The number of comparisons performed is logarithmic in the distance between first and last (At most log2(last - first) + O(1) comparisons)
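A minimal lookup sketch over such a sorted std::vector<std::pair<Key, Value>>, assuming the vector is kept sorted by first (Key and Value are placeholders):

#include <algorithm>
#include <utility>
#include <vector>

template<class Key, class Value>
Value* find_sorted(std::vector<std::pair<Key, Value>>& v, const Key& k)
{
    // binary search: O(log N) comparisons over contiguous storage
    auto it = std::lower_bound(v.begin(), v.end(), k,
        [](const std::pair<Key, Value>& p, const Key& key) { return p.first < key; });
    if (it != v.end() && it->first == k) return &it->second;
    return nullptr;
}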

Unordered_map of unordered_map vs custom hash function for pair key C++?

I would use std::unordered_map<std::pair<std::string, std::string>, Value, pair_hash> for two reasons:

  1. Performance

Of course, you can measure your two versions with your favorite profiler, but based on my experience, the number of allocations is what matters most here, so compare:

flat_map.insert(make_pair(key, value)); will create on average just one new bucket (or extend one), whilst

auto it = map2.insert(make_pair(key.first, map1{})).first;
it->second.insert(make_pair(key.second, value));

has to create an empty map1, which might not be zero-cost. Then it has to add/extend two buckets (the lists associated with the given hash values).
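The pair_hash mentioned above is not part of the standard library; a common sketch combines the two string hashes the way boost::hash_combine does:

#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// hypothetical hasher for the pair key
struct pair_hash
{
    std::size_t operator()(const std::pair<std::string, std::string>& p) const
    {
        std::size_t h1 = std::hash<std::string>()(p.first);
        std::size_t h2 = std::hash<std::string>()(p.second);
        // mixing formula borrowed from boost::hash_combine
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};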


  2. Maintainability/Readability

The second reason is more important to me. The flat (single) map is easy to use. You could already see in the insert example that the nested version is more complicated, but consider erase: it is so complicated that it is easy to make a mistake:


void remove(
    std::unordered_map<std::string,
                       std::unordered_map<std::string, Value>>& map,
    std::pair<std::string, std::string> const& key)
{
    auto it1 = map.find(key.first);
    if (it1 == map.end()) return;
    it1->second.erase(key.second);
    // easy to forget part
    if (it1->second.empty())
    {
        map.erase(it1);
    }
}

Binary search of range in std::map slower than map::find() search of whole map

There is a reason that std::map::find() exists: the implementation already does a binary search, since std::map is implemented as a balanced binary tree.

Your own binary search is much slower because it can't take advantage of that tree.
If you want to reach the middle of the map with std::advance, it starts at the first node (a leaf of the tree) and navigates through several pointers toward what you consider the middle. Afterwards, you again need to go from one of these leaf nodes to the next, again following a lot of pointers.

The result: next to a lot more looping, you get a lot of cache misses, especially when the map is large.

If you want to improve the lookups in your map, I would recommend using a different structure. When ordering isn't important, you could use std::unordered_map. When order is important, you could use a sorted std::vector<std::pair<Key, Value>>. In case you have Boost available, this already exists in a class called boost::container::flat_map.
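The same reasoning applies to range queries: std::map's member lower_bound/upper_bound walk the tree directly, unlike the generic std::lower_bound over bidirectional iterators. A sketch:

#include <map>
#include <utility>

// range query using the tree-aware member functions: O(log N) each,
// no std::advance, no linear pointer chasing
template<class K, class V>
std::pair<typename std::map<K, V>::const_iterator,
          typename std::map<K, V>::const_iterator>
range_in_map(const std::map<K, V>& m, const K& lo, const K& hi)
{
    return std::make_pair(m.lower_bound(lo), m.upper_bound(hi));
}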


