What's the most efficient way to erase duplicates and sort a vector?
I agree with R. Pate and Todd Gardner; a std::set
might be a good idea here. Even if you're stuck using vectors, if you have enough duplicates, you might be better off creating a set to do the dirty work.
Let's compare three approaches:
Just using vector, sort + unique
sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );
Convert to set (manually)
set<int> s;
unsigned size = vec.size();
for( unsigned i = 0; i < size; ++i ) s.insert( vec[i] );
vec.assign( s.begin(), s.end() );
Convert to set (using a constructor)
set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );
Here's how these perform as the number of duplicates changes:
Summary: when the number of duplicates is large enough, it's actually faster to convert to a set and then dump the data back into a vector.
And for some reason, doing the set conversion manually seems to be faster than using the set constructor -- at least on the toy random data that I used.
how to remove duplicates of a vector of structures c++
The simplest way is to use :
movements.erase(std::unique(movements.begin(), movements.end()), movements.end());
But std::unique
only removes consecutive duplicated elements, so you need to sort the std::vector
first, by overloading <
and ==
operator:
struct posToMove
{
int fromX;
int fromY;
int toX;
int toY;
bool operator < (const posToMove& other) const
{
//declare how 2 variable of type posToMove should be compared with <
return std::make_tuple(fromX, fromY, toX, toY) < std::make_tuple(other.fromX, other.fromY, other.toX, other.toY);
}
bool operator == (const posToMove& other) const
{
//declare how 2 variable of type posToMove should be compared with ==
return std::make_tuple(fromX, fromY, toX, toY) == std::make_tuple(other.fromX, other.fromY, other.toX, other.toY);
}
};
I declared <
and ==
operator with make_tuple()
, but you can replace that with your choice of comparing as well.
Code :
#include <iostream>
#include <vector>
#include <tuple>
#include <algorithm>
struct posToMove
{
int fromX;
int fromY;
int toX;
int toY;
bool operator < (const posToMove& other) const
{
//declare how 2 variable of type posToMove should be compared with <
return std::make_tuple(fromX, fromY, toX, toY) < std::make_tuple(other.fromX, other.fromY, other.toX, other.toY);
}
bool operator == (const posToMove& other) const
{
//declare how 2 variable of type posToMove should be compared with ==
return std::make_tuple(fromX, fromY, toX, toY) == std::make_tuple(other.fromX, other.fromY, other.toX, other.toY);
}
};
std::vector<posToMove>movements;
int main()
{
movements.push_back({0,1,0,0});
movements.push_back({1,2,5,7});
movements.push_back({3,9,9,6});
movements.push_back({0,1,0,0});
movements.push_back({4,1,8,0});
movements.push_back({1,2,5,7});
std::sort(movements.begin(), movements.end());
std::cout << "After sort : \n";
for (auto x : movements)
{
std::cout << x.fromX << " " << x.fromY << " " << x.toX << " " << x.toY << "\n";
}
std::cout << "\n";
movements.erase(std::unique(movements.begin(), movements.end()), movements.end());
std::cout << "After removing : \n";
for (auto x : movements)
{
std::cout << x.fromX << " " << x.fromY << " " << x.toX << " " << x.toY << "\n";
}
}
Result:
After sort :
0 1 0 0
0 1 0 0
1 2 5 7
1 2 5 7
3 9 9 6
4 1 8 0
After removing :
0 1 0 0
1 2 5 7
3 9 9 6
4 1 8 0
Related : Remove duplicates in vector of structure c++
Documentations:
erase()
: https://en.cppreference.com/w/cpp/container/vector/erase2std::unique
: https://en.cppreference.com/w/cpp/algorithm/uniquestd::sort
: https://en.cppreference.com/w/cpp/algorithm/sortOverloading operators : https://en.cppreference.com/w/cpp/language/operators
How to remove duplicates from unsorted std::vector while keeping the original ordering using algorithms?
The naive way is to use std::set
as everyone tells you. It's overkill and has poor cache locality (slow).
The smart* way is to use std::vector
appropriately (make sure to see footnote at bottom):
#include <algorithm>
#include <vector>
struct target_less
{
template<class It>
bool operator()(It const &a, It const &b) const { return *a < *b; }
};
struct target_equal
{
template<class It>
bool operator()(It const &a, It const &b) const { return *a == *b; }
};
template<class It> It uniquify(It begin, It const end)
{
std::vector<It> v;
v.reserve(static_cast<size_t>(std::distance(begin, end)));
for (It i = begin; i != end; ++i)
{ v.push_back(i); }
std::sort(v.begin(), v.end(), target_less());
v.erase(std::unique(v.begin(), v.end(), target_equal()), v.end());
std::sort(v.begin(), v.end());
size_t j = 0;
for (It i = begin; i != end && j != v.size(); ++i)
{
if (i == v[j])
{
using std::iter_swap; iter_swap(i, begin);
++j;
++begin;
}
}
return begin;
}
Then you can use it like:
int main()
{
std::vector<int> v;
v.push_back(6);
v.push_back(5);
v.push_back(5);
v.push_back(8);
v.push_back(5);
v.push_back(8);
v.erase(uniquify(v.begin(), v.end()), v.end());
}
*Note: That's the smart way in typical cases, where the number of duplicates isn't too high. For a more thorough performance analysis, see this related answer to a related question.
Fastest way to remove duplicates from a vector
Question 1: Both 1 and 2 are O(n log n), 3 is O(n^2). Between 1 and 2, it depends on the data.
Question 2: 4 is also O(n log n) and can be better than 1 and 2 if you have lots of duplicates, because it only stores one copy of each. Imagine a million values that are all equal.
Question 3: Well, that really depends on what you need to do.
The only thing that can be said without knowing more is that your alternative number 3 is asymptotically worse than the others.
If you're using C++11 and don't need ordering, you can use std::unordered_set
, which is a hash table and can be significantly faster than std::set
.
How to remove duplicates in a vector based on an equality predicate?
I believe the simplest method would be to use std::unordered_set
and exploit its second and third template
parameters.
Method
- Define a hash function
This step goal is a provide a " pre-filtering " step which eliminates obvious duplicates according to the meaning in the context (so here, v1
and -v1
should for example have the same hash).
That's something that should be on a per-class basis . There is no way to come up with a generic performant hashing function, though non-performant critical application may use the first hasher below (but I won't really recommend that anymore).
a. The bad hasher
This is what I originally proposed, before taking comment from @axnsan and @François Andrieux in count.
The simplest hasher I can think of looks like that:
struct bad_hasher
{
std::size_t operator()(const value_type&) const
{
return 0;
}
};
It makes all hash collide and forces std::unordered_set
to use KeyEqual
to determine objects equality.
So indeed, that works, but that does not make it a good choice. @axnsan and @François Andrieux pointed the following drawbacks in the comment below:
- "it changes its performance characteristics to O(n^2) (it will have to iterate through all elements on each lookup) "(- @axnsan)
- "[it converts] the set into a simple unsorted linked list. Every element will collide with every other element, it it looks like typical implementations use collision chaining". (- @François Andrieux)
In other words, this makes std::unordered_set
become the same as a std::vector
+ std::remove_if
.
b. The better hasher
@axnsan suggests the following hasher for this particular application:
struct better_hasher
{
std::size_t operator()(const vec3& v) const
{
return static_cast<std::size_t>(std::abs(v.x) + std::abs(v.y) + std::abs(v.z));
}
};
It fills the following requirements:
better_hasher(v) == better_hasher(-v)
.v1 != v2
=>better_hasher(v1) != better_hasher(v2)
in most cases ((1, 0, 1)
will collide with(1, 1, 0)
for example)- not all hashes collide.
- removes obvious duplicates.
Which is probably somewhere near the best we could hope for in this configuration.
We then need to remove those "false positives" due to hash collisions.
- Define a key equality predicate
The goal here is to remove duplicates that were not detected by the hasher (here, typically vectors such as (1, 0, 1) / (1, 1, 0)
or overflow).
Declare a predicate struct
which roughly looks like:
struct key_equal
{
bool operator()(const value_type& a, const value_type& b) const
{
return (a == b) || <...>;
}
};
Where <...>
is anything making two values identical in the current context ( so here a == b) || -a == b
for example).
Note that this expects operator==
to be implemented.
- Erase duplicates
Declare an std::unordered_set
which removes duplicates:
const std::unordered_set<value_type, hasher, key_equal> s(container.begin(), container.end());
container.assign(s.begin(), s.end());
- (alt) Erase duplicates (and conserve original order in container)
Basically the same, but this checks if an element can be inserted in the std::unordered_set
, and if does not, remove it. Adapted from @yury's answer in What's the most efficient way to erase duplicates and sort a vector?.
std::unordered_set<value_type, hasher, comparator> s;
const auto predicate = [&s](const value_type& value){return !s.insert(value).second;};
container.erase(std::remove_if(container.begin(), container.end(), predicate), container.end());
Generic (container-independent) templated function:
template<typename key_equal_t, typename hasher_t, bool order_conservative, typename container_t>
void remove_duplicates(container_t& container)
{
using value_type = typename container_t::value_type;
if constexpr(order_conservative)
{
std::unordered_set<value_type, hasher_t, key_equal_t> s;
const auto predicate = [&s](const value_type& value){return !s.insert(value).second;};
container.erase(std::remove_if(container.begin(), container.end(), predicate), container.end());
}
else
{
const std::unordered_set<value_type, hasher, key_equal_t> s(container.begin(), container.end());
container.assign(s.begin(), s.end());
}
}
Expects to be provided key_equal_t
and hasher_t
(and a bool
known a compile time indicating if you care about element being kept in the same order or not). I did not benchmark any of the two branches in this function so I do not know if one is significantly better than another, though this answer seems show manual insertion may be faster.
Example in this use case:
/// "Predicate" indicating if two values are considered duplicates or not
struct key_equal
{
bool operator()(const vec3& a, const vec3& b) const
{
// Remove identical vectors and their opposite
return (a == b) || (-a == b);
}
};
/// Hashes a vec3 by adding absolute values of its components.
struct hasher
{
std::size_t operator()(const vec3& v) const
{
return static_cast<std::size_t>(std::abs(v.x) + std::abs(v.y) + std::abs(v.z));
}
};
remove_duplicates<key_equal, hasher, order_conservative>(vec);
Test data:
vec3 v1{1, 1, 0}; // Keep
vec3 v2{0, 1, 0}; // Keep
vec3 v3{1, -1, 0}; // Keep
vec3 v4{-1, -1, 0}; // Remove
vec3 v5{1, 1, 0}; // Remove
std::vector vec{v1, v2, v3, v4, v5};
Test 1: non-order-conservative:
// ...<print vec values>
remove_duplicates<key_equal, hasher, false>(vec);
// ... <print vec values>
Output (before / after):
(1, 1, 0) (0, 1, 0) (1, -1, 0) (-1, -1, 0) (1, 1, 0)
(1, -1, 0) (0, 1, 0) (1, 1, 0)
Test 2: order-conservative:
// ... <print vec values>
remove_duplicates<key_equal, hasher, true>(vec);
// ... <print vec values>
Output (before / after):
(1, 1, 0) (0, 1, 0) (1, -1, 0) (-1, -1, 0) (1, 1, 0)
(1, 1, 0) (0, 1, 0) (1, -1, 0)
C++ - Remove duplicates from a sorted vector of structs
you need to define equal predicate
struct Num
{
unsigned int key;
unsigned int val;
};
bool num_less(const Num &a, const Num &b)
{
return (a.key<b.key)||(!(b.key < a.key))&&(a.val<b.val));
}
bool num_equal(const Num &a, const Num &b)
{
return (a.key==b.key)&&(a.val==b.val);
}
int main()
{
vector<Num> arr;
Num temp;
// add some examples
temp.key=10; temp.val=20;
arr.push_back(temp); arr.push_back(temp);
temp.key=11; temp.val=23;
arr.push_back(temp); arr.push_back(temp);
temp.key=10; temp.val=20;
arr.push_back(temp); arr.push_back(temp); arr.push_back(temp);
//sort
sort(arr.begin(),arr.end(),num_less);
//delete dublicates
arr.erase(unique(arr.begin(),arr.end(),num_equal),arr.end());
return 0;
}
What is the most efficient way of removing duplicates from a container only using almost equality criteria (no sort)
First, don't remove elements one at a time.
Next, use a hash table (or similar structure) to detect duplicates.
If you don't need to preserve order, then copy all elements into a hashset (this destroys duplicates), then recreate the vector using the values left in the hashset.
If you need to preserve order, then:
- Set read and write iterators to the beginning of the vector.
- Start moving the read iterator through, checking elements against a hashset or octtree or something that allows finding nearby elements quickly.
- For each element that collides with one in the hashset/octtree, advance the read iterator only.
- For elements that do not collide, move from read iterator to write iterator, copy to hashset/octtree, then advance both.
- When read iterator reaches the end, call
erase
to truncate the vector at the write iterator position.
The key advantage of the octtree is that while it doesn't let you immediately determine whether there is something close enough to be a "duplicate", it allows you to test against only near neighbors, excluding most of your dataset. So your algorithm might be O(N lg N)
or even O(N lg lg N)
depending on the spatial distribution.
Again, if you don't care about the ordering, you can actually move survivors into the hashset/octtree and at the end move them back into the vector (compactly).
What is the best way to sort a vector leaving the original one unaltered?
Use partial_sort_copy. Here is an example:
vector<int> v{9,8,6,7,4,5,2,0,3,1};
vector<int> v_sorted(v.size());
partial_sort_copy(begin(v), end(v), begin(v_sorted), end(v_sorted));
Now, v remains untouched but v_sorted contains {0,1,2,3,4,5,6,7,8,9}.
Related Topics
Using Getline(Cin, S) After Cin
Why Should I Use a Pointer Rather Than the Object Itself
Cin and Getline Skipping Input
How to Convert an Instance of Std::String to Lower Case
How to Find the Length of an Array
Why Is Volatile Not Considered Useful in Multithreaded C or C++ Programming
How to Get the Cpu Cycle Count in X86_64 from C++
Does a Const Reference Class Member Prolong the Life of a Temporary
Is It Better in C++ to Pass by Value or Pass by Reference-To-Const
Storing C++ Template Function Definitions in a .Cpp File
Parse (Split) a String in C++ Using String Delimiter (Standard C++)
The Static Keyword and Its Various Uses in C++
Why Is Unsigned Integer Overflow Defined Behavior But Signed Integer Overflow Isn'T
How to Check Whether a String Contains Every Letter in the Alphabet in C++
Derived Template-Class Access to Base-Class Member-Data
Automatically Add All Files in a Folder to a Target Using Cmake