What's the Most Efficient Way to Erase Duplicates and Sort a Vector

What's the most efficient way to erase duplicates and sort a vector?

I agree with R. Pate and Todd Gardner; a std::set might be a good idea here. Even if you're stuck using vectors, if you have enough duplicates, you might be better off creating a set to do the dirty work.

Let's compare three approaches:

Just using vector, sort + unique

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );

Convert to set (manually)

set<int> s;
unsigned size = vec.size();
for( unsigned i = 0; i < size; ++i ) s.insert( vec[i] );
vec.assign( s.begin(), s.end() );

Convert to set (using a constructor)

set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );

Here's how these perform as the number of duplicates changes:

[figure: comparison of the vector and set approaches as the number of duplicates varies]

Summary: when the number of duplicates is large enough, it's actually faster to convert to a set and then dump the data back into a vector.

And for some reason, doing the set conversion manually seems to be faster than using the set constructor -- at least on the toy random data that I used.

How to remove duplicates of a vector of structures in C++

The simplest way is to use:

movements.erase(std::unique(movements.begin(), movements.end()), movements.end());

But std::unique only removes consecutive duplicate elements, so you need to sort the std::vector first, after overloading the < and == operators:

struct posToMove
{
    int fromX;
    int fromY;
    int toX;
    int toY;

    bool operator < (const posToMove& other) const
    {
        // declare how two variables of type posToMove should be compared with <
        return std::make_tuple(fromX, fromY, toX, toY) < std::make_tuple(other.fromX, other.fromY, other.toX, other.toY);
    }

    bool operator == (const posToMove& other) const
    {
        // declare how two variables of type posToMove should be compared with ==
        return std::make_tuple(fromX, fromY, toX, toY) == std::make_tuple(other.fromX, other.fromY, other.toX, other.toY);
    }
};

I declared the < and == operators with std::make_tuple(), but you can replace that with comparison logic of your choice as well.

Code :

#include <iostream>
#include <vector>
#include <tuple>
#include <algorithm>

struct posToMove
{
    int fromX;
    int fromY;
    int toX;
    int toY;

    bool operator < (const posToMove& other) const
    {
        // declare how two variables of type posToMove should be compared with <
        return std::make_tuple(fromX, fromY, toX, toY) < std::make_tuple(other.fromX, other.fromY, other.toX, other.toY);
    }

    bool operator == (const posToMove& other) const
    {
        // declare how two variables of type posToMove should be compared with ==
        return std::make_tuple(fromX, fromY, toX, toY) == std::make_tuple(other.fromX, other.fromY, other.toX, other.toY);
    }
};

std::vector<posToMove> movements;

int main()
{
    movements.push_back({0,1,0,0});
    movements.push_back({1,2,5,7});
    movements.push_back({3,9,9,6});
    movements.push_back({0,1,0,0});
    movements.push_back({4,1,8,0});
    movements.push_back({1,2,5,7});

    std::sort(movements.begin(), movements.end());

    std::cout << "After sort : \n";
    for (auto x : movements)
    {
        std::cout << x.fromX << " " << x.fromY << " " << x.toX << " " << x.toY << "\n";
    }
    std::cout << "\n";

    movements.erase(std::unique(movements.begin(), movements.end()), movements.end());

    std::cout << "After removing : \n";
    for (auto x : movements)
    {
        std::cout << x.fromX << " " << x.fromY << " " << x.toX << " " << x.toY << "\n";
    }
}

Result:

After sort :
0 1 0 0
0 1 0 0
1 2 5 7
1 2 5 7
3 9 9 6
4 1 8 0

After removing :
0 1 0 0
1 2 5 7
3 9 9 6
4 1 8 0

Related : Remove duplicates in vector of structure c++

Documentations:

  • erase() : https://en.cppreference.com/w/cpp/container/vector/erase

  • std::unique : https://en.cppreference.com/w/cpp/algorithm/unique

  • std::sort : https://en.cppreference.com/w/cpp/algorithm/sort

  • Overloading operators : https://en.cppreference.com/w/cpp/language/operators

How to remove duplicates from unsorted std::vector while keeping the original ordering using algorithms?

The naive way is to use std::set as everyone tells you. It's overkill and has poor cache locality (slow).

The smart* way is to use std::vector appropriately (make sure to see footnote at bottom):

#include <algorithm>
#include <vector>

struct target_less
{
    template<class It>
    bool operator()(It const &a, It const &b) const { return *a < *b; }
};

struct target_equal
{
    template<class It>
    bool operator()(It const &a, It const &b) const { return *a == *b; }
};

template<class It> It uniquify(It begin, It const end)
{
    // Collect iterators to every element, de-duplicate by pointed-to value,
    // then re-sort the survivors by position and compact them to the front.
    std::vector<It> v;
    v.reserve(static_cast<size_t>(std::distance(begin, end)));
    for (It i = begin; i != end; ++i)
    { v.push_back(i); }
    std::sort(v.begin(), v.end(), target_less());
    v.erase(std::unique(v.begin(), v.end(), target_equal()), v.end());
    std::sort(v.begin(), v.end());
    size_t j = 0;
    for (It i = begin; i != end && j != v.size(); ++i)
    {
        if (i == v[j])
        {
            using std::iter_swap; iter_swap(i, begin);
            ++j;
            ++begin;
        }
    }
    return begin;
}

Then you can use it like:

int main()
{
    std::vector<int> v;
    v.push_back(6);
    v.push_back(5);
    v.push_back(5);
    v.push_back(8);
    v.push_back(5);
    v.push_back(8);
    v.erase(uniquify(v.begin(), v.end()), v.end());
}

*Note: That's the smart way in typical cases, where the number of duplicates isn't too high. For a more thorough performance analysis, see this related answer to a related question.

Fastest way to remove duplicates from a vector

Question 1: Both 1 and 2 are O(n log n), 3 is O(n^2). Between 1 and 2, it depends on the data.

Question 2: 4 is also O(n log n) and can be better than 1 and 2 if you have lots of duplicates, because it only stores one copy of each. Imagine a million values that are all equal.

Question 3: Well, that really depends on what you need to do.

The only thing that can be said without knowing more is that your alternative number 3 is asymptotically worse than the others.

If you're using C++11 and don't need ordering, you can use std::unordered_set, which is a hash table and can be significantly faster than std::set.

How to remove duplicates in a vector based on an equality predicate?

I believe the simplest method would be to use std::unordered_set and exploit its second and third template parameters.

Method

  1. Define a hash function

The goal of this step is to provide a "pre-filtering" step that eliminates obvious duplicates according to what "duplicate" means in this context (so here, for example, v1 and -v1 should have the same hash).

That's something that has to be decided on a per-class basis. There is no way to come up with a generic, performant hashing function, though non-performance-critical applications may use the first hasher below (but I wouldn't really recommend that anymore).

a. The bad hasher

This is what I originally proposed, before taking the comments from @axnsan and @François Andrieux into account.

The simplest hasher I can think of looks like that:

struct bad_hasher
{
    std::size_t operator()(const value_type&) const
    {
        return 0;
    }
};

It makes all hashes collide and forces std::unordered_set to use KeyEqual to determine object equality.
So indeed, that works, but that does not make it a good choice. @axnsan and @François Andrieux pointed out the following drawbacks in the comments:

  • "it changes its performance characteristics to O(n^2) (it will have to iterate through all elements on each lookup)" (- @axnsan)
  • "[it converts] the set into a simple unsorted linked list. Every element will collide with every other element, and it looks like typical implementations use collision chaining". (- @François Andrieux)

In other words, this makes std::unordered_set behave the same as a std::vector + std::remove_if.

b. The better hasher

@axnsan suggests the following hasher for this particular application:

struct better_hasher
{
    std::size_t operator()(const vec3& v) const
    {
        return static_cast<std::size_t>(std::abs(v.x) + std::abs(v.y) + std::abs(v.z));
    }
};

It fills the following requirements:

  • better_hasher(v) == better_hasher(-v).
  • v1 != v2 => better_hasher(v1) != better_hasher(v2) in most cases ((1, 0, 1) will collide with (1, 1, 0) for example)
  • not all hashes collide.
  • removes obvious duplicates.

Which is probably somewhere near the best we could hope for in this configuration.

We then need to remove those "false positives" due to hash collisions.


  2. Define a key equality predicate

The goal here is to remove duplicates that were not detected by the hasher (here, typically vectors such as (1, 0, 1) / (1, 1, 0) or overflow).

Declare a predicate struct which roughly looks like:

struct key_equal
{
    bool operator()(const value_type& a, const value_type& b) const
    {
        return (a == b) || <...>;
    }
};

Where <...> is anything that makes two values identical in the current context (so here, (a == b) || (-a == b), for example).

Note that this expects operator== to be implemented.


  3. Erase duplicates

Declare an std::unordered_set which removes duplicates:

const std::unordered_set<value_type, hasher, key_equal> s(container.begin(), container.end());
container.assign(s.begin(), s.end());

  4. (alt) Erase duplicates (and preserve the original order of the container)

Basically the same, but this checks whether an element can be inserted into the std::unordered_set, and if it cannot, removes it. Adapted from @yury's answer to What's the most efficient way to erase duplicates and sort a vector?.

std::unordered_set<value_type, hasher, key_equal> s;

const auto predicate = [&s](const value_type& value){return !s.insert(value).second;};

container.erase(std::remove_if(container.begin(), container.end(), predicate), container.end());

Generic (container-independent) templated function:

template<typename key_equal_t, typename hasher_t, bool order_conservative, typename container_t>
void remove_duplicates(container_t& container)
{
    using value_type = typename container_t::value_type;

    if constexpr(order_conservative)
    {
        std::unordered_set<value_type, hasher_t, key_equal_t> s;
        const auto predicate = [&s](const value_type& value){ return !s.insert(value).second; };
        container.erase(std::remove_if(container.begin(), container.end(), predicate), container.end());
    }
    else
    {
        const std::unordered_set<value_type, hasher_t, key_equal_t> s(container.begin(), container.end());
        container.assign(s.begin(), s.end());
    }
}

This expects key_equal_t and hasher_t to be provided (and a bool, known at compile time, indicating whether you care about elements being kept in the same order or not). I did not benchmark either of the two branches in this function, so I do not know if one is significantly better than the other, though this answer seems to show that manual insertion may be faster.

Example in this use case:

/// "Predicate" indicating whether two values are considered duplicates
struct key_equal
{
    bool operator()(const vec3& a, const vec3& b) const
    {
        // Remove identical vectors and their opposites
        return (a == b) || (-a == b);
    }
};

/// Hashes a vec3 by adding the absolute values of its components
struct hasher
{
    std::size_t operator()(const vec3& v) const
    {
        return static_cast<std::size_t>(std::abs(v.x) + std::abs(v.y) + std::abs(v.z));
    }
};

remove_duplicates<key_equal, hasher, order_conservative>(vec);

Test data:

vec3 v1{1, 1, 0};   // Keep
vec3 v2{0, 1, 0};   // Keep
vec3 v3{1, -1, 0};  // Keep
vec3 v4{-1, -1, 0}; // Remove (opposite of v1)
vec3 v5{1, 1, 0};   // Remove (duplicate of v1)

std::vector vec{v1, v2, v3, v4, v5};

Test 1: non-order-conservative:

// ...<print vec values>
remove_duplicates<key_equal, hasher, false>(vec);
// ... <print vec values>

Output (before / after):

(1, 1, 0) (0, 1, 0) (1, -1, 0) (-1, -1, 0) (1, 1, 0)
(1, -1, 0) (0, 1, 0) (1, 1, 0)

Test 2: order-conservative:

// ... <print vec values>
remove_duplicates<key_equal, hasher, true>(vec);
// ... <print vec values>

Output (before / after):

(1, 1, 0) (0, 1, 0) (1, -1, 0) (-1, -1, 0) (1, 1, 0) 
(1, 1, 0) (0, 1, 0) (1, -1, 0)

C++ - Remove duplicates from a sorted vector of structs

You need to define an equality predicate:

#include <algorithm>
#include <vector>
using namespace std;

struct Num
{
    unsigned int key;
    unsigned int val;
};

bool num_less(const Num &a, const Num &b)
{
    // lexicographic order: key first, then val
    return (a.key < b.key) || (!(b.key < a.key) && (a.val < b.val));
}

bool num_equal(const Num &a, const Num &b)
{
    return (a.key == b.key) && (a.val == b.val);
}

int main()
{
    vector<Num> arr;
    Num temp;
    // add some examples
    temp.key = 10; temp.val = 20;
    arr.push_back(temp); arr.push_back(temp);
    temp.key = 11; temp.val = 23;
    arr.push_back(temp); arr.push_back(temp);
    temp.key = 10; temp.val = 20;
    arr.push_back(temp); arr.push_back(temp); arr.push_back(temp);
    // sort
    sort(arr.begin(), arr.end(), num_less);
    // delete duplicates
    arr.erase(unique(arr.begin(), arr.end(), num_equal), arr.end());
    return 0;
}

What is the most efficient way of removing duplicates from a container only using almost equality criteria (no sort)

First, don't remove elements one at a time.

Next, use a hash table (or similar structure) to detect duplicates.

If you don't need to preserve order, then copy all elements into a hashset (this destroys duplicates), then recreate the vector using the values left in the hashset.

If you need to preserve order, then:

  1. Set read and write iterators to the beginning of the vector.
  2. Start moving the read iterator through, checking elements against a hashset or octree or some other structure that allows finding nearby elements quickly.
  3. For each element that collides with one in the hashset/octree, advance the read iterator only.
  4. For elements that do not collide, move from the read iterator to the write iterator, copy into the hashset/octree, then advance both.
  5. When the read iterator reaches the end, call erase to truncate the vector at the write iterator position.

The key advantage of the octree is that while it doesn't let you immediately determine whether there is something close enough to count as a "duplicate", it allows you to test against only near neighbors, excluding most of your dataset. So your algorithm might be O(N lg N) or even O(N lg lg N), depending on the spatial distribution.

Again, if you don't care about the ordering, you can actually move survivors into the hashset/octree and at the end move them back into the vector (compactly).

What is the best way to sort a vector leaving the original one unaltered?

Use partial_sort_copy. Here is an example:

vector<int> v{9,8,6,7,4,5,2,0,3,1};
vector<int> v_sorted(v.size());
partial_sort_copy(begin(v), end(v), begin(v_sorted), end(v_sorted));

Now, v remains untouched but v_sorted contains {0,1,2,3,4,5,6,7,8,9}.
