How to Remove Duplicates from an Unsorted std::vector While Keeping the Original Ordering Using Algorithms

How do you remove duplicates from an unsorted std::vector while keeping the original ordering, using standard algorithms?

The naive way is to use std::set, as everyone tells you. It's overkill and, being node-based, has poor cache locality (slow).
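
For reference, the set-based approach usually meant here looks something like the following minimal sketch (the function name is mine):

#include <set>
#include <vector>

// Tracks seen values in a std::set; each insert is O(log n), and every
// node is a separate allocation, hence the poor cache behavior.
std::vector<int> remove_duplicates_naive(std::vector<int> const &in)
{
    std::set<int> seen;
    std::vector<int> out;
    for (int x : in)
        if (seen.insert(x).second)  // true only for the first occurrence
            out.push_back(x);
    return out;
}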

The smart* way is to use std::vector appropriately (make sure to see the footnote at the bottom):

#include <algorithm>
#include <vector>

struct target_less
{
    template<class It>
    bool operator()(It const &a, It const &b) const { return *a < *b; }
};

struct target_equal
{
    template<class It>
    bool operator()(It const &a, It const &b) const { return *a == *b; }
};

template<class It>
It uniquify(It begin, It const end)
{
    // Collect an iterator to every element of the range.
    std::vector<It> v;
    v.reserve(static_cast<size_t>(std::distance(begin, end)));
    for (It i = begin; i != end; ++i)
    { v.push_back(i); }

    // Sort the iterators by the values they point to. stable_sort keeps
    // the earliest occurrence first within each run of equal values, so the
    // following unique() retains the first occurrence (an unstable sort
    // could keep a later duplicate and break the original ordering).
    std::stable_sort(v.begin(), v.end(), target_less());
    v.erase(std::unique(v.begin(), v.end(), target_equal()), v.end());

    // Restore the surviving iterators to their original positional order.
    std::sort(v.begin(), v.end());

    // Swap each survivor into the next free slot at the front of the range.
    size_t j = 0;
    for (It i = begin; i != end && j != v.size(); ++i)
    {
        if (i == v[j])
        {
            using std::iter_swap; iter_swap(i, begin);
            ++j;
            ++begin;
        }
    }
    return begin;  // new logical end of the range
}

Then you can use it like this:

int main()
{
    std::vector<int> v;
    v.push_back(6);
    v.push_back(5);
    v.push_back(5);
    v.push_back(8);
    v.push_back(5);
    v.push_back(8);
    v.erase(uniquify(v.begin(), v.end()), v.end());
}
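
After the erase, v holds 6, 5, 8: the first occurrence of each value, in the original order.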

*Note: that's the smart way in typical cases, where the number of duplicates isn't too high. For a more thorough performance analysis, see this answer to a related question.

Erase duplicate elements keeping order

One of many ways this can be accomplished is to use std::unordered_set to keep track of duplicates and std::stable_partition to separate the duplicates from the unique values while preserving the order of the items:

#include <iostream>
#include <vector>
#include <algorithm>
#include <unordered_set>

int main()
{
    std::unordered_set<int> numSet;
    std::vector<int> v = {1, 7, 2, 3, 8, 4, 5, 3, 2, 3, 2, 6, 2, 3, 2, 9, 10, 1, 2, 2, 1};
    // The lambda returns true if the item has not been "seen" yet.
    auto iter = std::stable_partition(v.begin(), v.end(),
        [&](int n) { bool ret = !numSet.count(n); numSet.insert(n); return ret; });
    v.erase(iter, v.end());
    for (auto p : v)
        std::cout << p << " ";
}

Output:

1  7  2  3  8  4  5  6  9  10 

The lambda returns true if the item has not been seen, so std::stable_partition places it to the left of the partition point. Once done, an iterator to the partition point is returned, and we use it to do one single erasure from that point to the end of the vector. Note that the lambda updates the unordered_set for each item processed.

The reason std::stable_partition was used instead of std::remove_if is that std::remove_if is not guaranteed to process the items in order. For example, an implementation could have processed the second 1 in that data before the first 1. To be safe, stable_partition does not erase elements; it simply places them in the correct position, ready for the single erasure at the end.
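
For contrast, the std::remove_if version being warned against would look something like this (a sketch using the same names as the example above); it usually appears to work, but the standard does not pin down the order in which the predicate sees the elements, nor whether the predicate gets copied:

std::unordered_set<int> numSet;
v.erase(std::remove_if(v.begin(), v.end(),
            // Risky: a stateful predicate relies on in-order, single-copy
            // application, which std::remove_if does not guarantee.
            [&](int n) { return !numSet.insert(n).second; }),
        v.end());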

Way to delete/erase duplicate elements from std::vector while maintaining order?

How about using a temporary container:

// Assumes v is the std::vector<int> to deduplicate; requires <set>.
std::vector<int>::iterator i, j;
std::set<int> t_set;
for (i = v.begin(), j = v.begin(); i != v.end(); ++i)
    if (t_set.insert(*i).second)  // true only for the first occurrence
        *j++ = *i;
v.erase(j, v.end());

Using std::remove_copy_if, I can think of this:

std::set<int> t_set;
std::vector<int> res;  // resultant vector

std::remove_copy_if(v.begin(), v.end(), std::back_inserter(res),
    [&t_set](int x) {
        return !t_set.insert(x).second;  // copy only first occurrences
    });
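
Note that std::remove_copy_if leaves v untouched and writes the unique elements into res; if you need the result back in v, swap or move res into it afterwards.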

How to remove duplicated elements from vector?

I think the best option is to write a binary predicate to sort the vector with, and then use std::unique afterwards. Keep in mind that the predicate must define a strict weak ordering; in particular, it must be transitive!
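
As an illustration, here is a minimal sketch assuming a record type with an integral key to compare on (the item type and the helper names are mine, not from the question):

#include <algorithm>
#include <vector>

struct item { int key; };

// The comparison must define a strict weak ordering (in particular,
// it must be transitive) for std::sort to behave correctly.
bool less_by_key(item const &a, item const &b) { return a.key < b.key; }
bool same_key(item const &a, item const &b)    { return a.key == b.key; }

void dedup(std::vector<item> &v)
{
    std::sort(v.begin(), v.end(), less_by_key);                   // group duplicates together
    v.erase(std::unique(v.begin(), v.end(), same_key), v.end());  // drop all but one per group
}

Note that, unlike the order-preserving approaches above, this reorders the vector.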

If that is not an option, you cannot do anything else but use the quadratic algorithm:

std::vector<type> a;
std::vector<type> result;
for (unsigned i = 0; i < a.size(); ++i) {
    bool repeated = false;
    // Linear scan over the elements already examined.
    for (unsigned j = 0; j < i; ++j) {
        if (a[j] == a[i]) {
            repeated = true;
            break;
        }
    }
    if (!repeated) {
        result.push_back(a[i]);
    }
}

// result stores the unique elements, in their original order.

What's the most efficient way to erase duplicates and sort a vector?

I agree with R. Pate and Todd Gardner; a std::set might be a good idea here. Even if you're stuck using vectors, if you have enough duplicates, you might be better off creating a set to do the dirty work.

Let's compare three approaches:

Just using vector, sort + unique

std::sort( vec.begin(), vec.end() );
vec.erase( std::unique( vec.begin(), vec.end() ), vec.end() );

Convert to set (manually)

std::set<int> s;
unsigned size = vec.size();
for( unsigned i = 0; i < size; ++i ) s.insert( vec[i] );
vec.assign( s.begin(), s.end() );

Convert to set (using a constructor)

std::set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );

Here's how these perform as the number of duplicates changes:

[Figure: runtime comparison of the vector and set approaches as the number of duplicates varies]

Summary: when the number of duplicates is large enough, it's actually faster to convert to a set and then dump the data back into a vector.

And for some reason, doing the set conversion manually seems to be faster than using the set constructor -- at least on the toy random data that I used.

Remove duplicates from an ordered vector of strings

A simple way is to iterate through the vector while keeping track of the elements encountered, and deleting those that have been encountered before.

Here is a piece of code that does exactly that.

// Assumes container is a std::vector<std::string>; requires <unordered_set>.
std::unordered_set<std::string> encounters;
for (auto i = 0u; i < container.size(); ++i) {
    if (!encounters.insert(container[i]).second) {
        // The string was already in encounters: erase it and stay at the
        // same index, since the following elements shifted left by one.
        container.erase(container.begin() + i);
        --i;
    }
}

It could probably be optimized, for example by deleting whole ranges of consecutive duplicates, or by swapping each new element into place just past the last unique element and, at the end, erasing the whole tail of the vector that contains the duplicates.
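
A sketch of that last idea, compacting in a single pass (with move-assignment standing in for the swaps; the function name is mine):

#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

void dedup_keep_order(std::vector<std::string> &container)
{
    std::unordered_set<std::string> encounters;
    auto write = container.begin();
    for (auto read = container.begin(); read != container.end(); ++read)
        if (encounters.insert(*read).second)  // first time we see this string
            *write++ = std::move(*read);      // compact it toward the front
    // One single erase of the tail, which now holds only leftovers.
    container.erase(write, container.end());
}

This does O(n) moves and a single erase instead of up to O(n^2) element shifts from repeated mid-vector erasures.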


