Determining If an Unordered Vector<T> Has All Unique Elements

Your first example should be O(N log N), as std::set takes O(log N) time for each insertion. I don't think a better asymptotic complexity is possible.

The second example is obviously O(N^2). Its constant factor and memory usage are low, though, so it might be faster (or even the fastest) in some cases.
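
For reference, the two approaches under discussion look roughly like this (only a sketch; the function names are made up):

#include <cstddef>
#include <set>
#include <vector>

// First approach: insert everything into a std::set, O(N log N).
template <class T>
bool is_unique_set(const std::vector<T>& v) {
    std::set<T> s(v.begin(), v.end());
    return s.size() == v.size();  // any duplicate collapses in the set
}

// Second approach: compare every pair, O(N^2) but no extra allocation.
template <class T>
bool is_unique_pairs(const std::vector<T>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[i] == v[j]) return false;
    return true;
}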

It depends on what T is, but for generic performance I'd recommend sorting a vector of pointers to the objects.

template <class T>
bool dereference_less( T const *l, T const *r )
    { return *l < *r; }

template <class T>
bool is_unique(vector<T> const &x) {
    vector< T const * > vp;
    vp.reserve( x.size() );
    for ( size_t i = 0; i < x.size(); ++ i ) vp.push_back( &x[i] );
    sort( vp.begin(), vp.end(), ptr_fun( &dereference_less<T> ) ); // O(N log N)
    return adjacent_find( vp.begin(), vp.end(),
            not2( ptr_fun( &dereference_less<T> ) ) ) // "opposite" functor
        == vp.end(); // unique iff every adjacent pair satisfies *vp_n < *vp_(n+1)
}

or in STL style,

template <class I>
bool is_unique(I first, I last) {
    typedef typename iterator_traits<I>::value_type T;
    vector< T const * > vp;
    for ( I it = first; it != last; ++ it ) vp.push_back( &*it );
    sort( vp.begin(), vp.end(), ptr_fun( &dereference_less<T> ) );
    return adjacent_find( vp.begin(), vp.end(),
            not2( ptr_fun( &dereference_less<T> ) ) ) == vp.end();
}

And if you can reorder the original vector, of course,

template <class T>
bool is_unique(vector<T> &x) {
    sort( x.begin(), x.end() ); // O(N log N)
    return adjacent_find( x.begin(), x.end() ) == x.end();
}
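
For illustration, a minimal driver for the mutating variant (the sample data is made up, and the definitions above are assumed to be in scope):

#include <iostream>
#include <vector>

int main() {
    std::vector<int> a = { 3, 1, 4, 1, 5 };
    std::cout << std::boolalpha << is_unique(a) << '\n'; // false: 1 appears twice
}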

Determine if there are duplicates in vector

The "best" option does not exist. It depends on how you define "best".

Here are some solutions, each with their own advantages and disadvantages:

Using map
template <class T>
auto has_duplicates(const std::vector<T>& v) -> bool
{
    std::unordered_map<T, int> m;

    for (const auto& e : v)
    {
        if (++m[e] > 1)
            return true;
    }
    return false;
}
Using set
template <class T>
auto has_duplicates(const std::vector<T>& v) -> bool
{
    std::unordered_set<T> s;
    std::copy(v.begin(), v.end(), std::inserter(s, s.begin()));

    return v.size() != s.size();
}
Using sort and adjacent_find (mutating range)
template <class T>
auto has_duplicates(std::vector<T>& v) -> bool
{
    std::sort(v.begin(), v.end());

    return std::adjacent_find(v.begin(), v.end()) != v.end();
}

Manual iteration with std::find
template <class T>
auto has_duplicates(const std::vector<T>& v) -> bool
{
    for (auto it = v.begin(); it != v.end(); ++it)
        if (std::find(it + 1, v.end(), *it) != v.end())
            return true;

    return false;
}
Manual iteration
template <class T>
auto has_duplicates(const std::vector<T>& v) -> bool
{
    for (auto i1 = v.begin(); i1 != v.end(); ++i1)
        for (auto i2 = i1 + 1; i2 != v.end(); ++i2)
            if (*i1 == *i2)
                return true;

    return false;
}
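
Whichever variant you pick, usage looks the same. A minimal sketch (the sample values are arbitrary):

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> no_dups  = {1, 2, 3, 4};
    std::vector<int> has_dups = {1, 2, 2, 3};

    std::cout << std::boolalpha
              << has_duplicates(no_dups) << '\n'   // false
              << has_duplicates(has_dups) << '\n'; // true
}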

Check std::vector has duplicates

Searching Google for std::unique, I found the reference page for std::unique. I looked at what it does:

Eliminates all except the first element from every consecutive group of equivalent elements from the range [first, last)

So it looks like it does what you want - removes the duplicates.

I then looked at what it returns...

... returns a past-the-end iterator for the new logical end of the range

So the result from std::unique is a sequence which is not necessarily the same as the whole vector.

If nothing was removed, the return value would be the end of the vector.

So you want:

vector<int>::iterator it = std::unique(a.begin(), a.end());
bool wasUnique = (it == a.end());

Or for C++11:

auto it = std::unique(a.begin(), a.end());
bool wasUnique = (it == a.end());

Finally, for std::unique to catch all duplicates (it only removes consecutive ones), the vector needs to be sorted first, so the complete code would be:

std::sort(a.begin(), a.end());
auto it = std::unique(a.begin(), a.end());
bool wasUnique = (it == a.end());
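
Note that this reorders a (and std::unique additionally overwrites the tail). If the original order must be preserved, one option is to run the check on a copy; a minimal sketch (all_unique is a made-up name):

#include <algorithm>
#include <vector>

template <class T>
bool all_unique(std::vector<T> a) // taken by value: only the copy gets sorted
{
    std::sort(a.begin(), a.end());
    return std::unique(a.begin(), a.end()) == a.end();
}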

Number of unique elements only using equality comparisons

If a list contains N elements with only one duplicated pair, then there are N(N-1)/2 possible pairs of elements that can be compared for equality, and only one of those pairs will compare equal.

So, given any algorithm that purports to count distinct elements, an adversary can provide it with a list of N distinct elements and observe which comparisons it makes and what answer it provides. Then:

  • If the algorithm gives an answer that is not N, then it's wrong.
  • Otherwise, if the algorithm makes fewer than N(N-1)/2 comparisons, then there is at least one pair that it didn't compare. The adversary can set those two elements equal and run the algorithm again. Since all the comparisons it makes will have the same results, it will give the answer N again, but this time it will be wrong.

So any algorithm that always makes fewer than N(N-1)/2 comparisons must return the wrong answer for at least one input.
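
For reference, here is the straightforward algorithm that attains this bound, using only equality comparisons (a sketch; count_distinct is a made-up name):

#include <cstddef>
#include <vector>

// Counts distinct elements using only operator==; worst case it compares
// every pair once, i.e. exactly N(N-1)/2 comparisons.
template <class T>
std::size_t count_distinct(const std::vector<T>& v) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        bool seen = false;
        for (std::size_t j = 0; j < i && !seen; ++j)
            seen = (v[j] == v[i]);   // count v[i] only if no earlier equal element
        if (!seen) ++count;
    }
    return count;
}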

C++/STL which algorithm should I use to check if a container has duplicates?

The STL is about efficiency and universality, and there seems to be no universal, efficient way to check whether a container has duplicates without modifying it. Hence it is no wonder that no such algorithm exists in the STL.
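
If you can afford O(N) extra memory, the usual workaround is to run the check on a copy, leaving the container itself unmodified; a minimal sketch (contains_duplicates is a made-up name):

#include <algorithm>
#include <iterator>
#include <vector>

template <class It>
bool contains_duplicates(It first, It last)
{
    // Copy the range, sort the copy, and look for adjacent equal elements.
    std::vector<typename std::iterator_traits<It>::value_type> copy(first, last);
    std::sort(copy.begin(), copy.end());
    return std::adjacent_find(copy.begin(), copy.end()) != copy.end();
}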

Effective unique on unordered elements

Just comparing different approaches (not considering radix sort):

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <deque>
#include <iostream>
#include <set>
#include <unordered_set>
#include <vector>

template <template <typename ...> class Container, typename T, typename ... A, typename Comp>
inline bool insert_sorted(Container<T, A...>& container, T const& e, Comp const& comp) {
    auto const it = std::lower_bound(container.begin(), container.end(), e, comp);
    if (it != container.end() and not comp(e, *it)) { return false; }
    container.insert(it, e);
    return true;
}

template <template <typename ...> class Container, typename T, typename ... A>
inline bool insert_sorted(Container<T, A...>& container, T const& e) {
    return insert_sorted(container, e, std::less<T>{});
}

int main() {
    using namespace std::chrono;
    typedef std::vector<int> data_type;

    const unsigned Size = unsigned(1) << 24;
    const unsigned Limit = 1000;
    data_type data;
    data.reserve(Size);
    for(unsigned i = 0; i < Size; ++i) {
        // Draw a value in [0, Limit) and duplicate it with probability 0.25
        int value = double(Limit) * std::rand() / RAND_MAX - 0.1;
        data.push_back(value);
        while(i < Size - 1 && rand() < RAND_MAX * 0.25) {
            data.push_back(value);
            ++i;
        }
    }

    std::cout
        << "Data\n"
        << "====\n"
        << "  Size of data: " << Size << '\n';

    std::cout
        << "Unordered Set\n"
        << "=============\n";
    {
        auto start = system_clock::now();

        typedef std::unordered_set<int> set_type;
        set_type set;
        unsigned i = 0;
        for( ; i < Size - 1; ++i) {
            // Skip a run of equal values (guard against reading past the end)
            while(i + 1 < Size && data[i] == data[i+1]) ++i;
            set.insert(data[i]);
        }
        if(i < Size)
            set.insert(data[i]);

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << set.size() << '\n';
        std::cout
            << "  Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "Set\n"
        << "===\n";
    {
        auto start = system_clock::now();

        typedef std::set<int> set_type;
        set_type set;
        unsigned i = 0;
        for( ; i < Size - 1; ++i) {
            // Skip a run of equal values (guard against reading past the end)
            while(i + 1 < Size && data[i] == data[i+1]) ++i;
            set.insert(data[i]);
        }
        if(i < Size)
            set.insert(data[i]);

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << set.size() << '\n';
        std::cout
            << "  Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "Sorted Vector\n"
        << "=============\n";
    {
        auto start = system_clock::now();

        typedef std::vector<int> set_type;
        set_type set;
        unsigned i = 0;
        for( ; i < Size - 1; ++i) {
            // Skip a run of equal values (guard against reading past the end)
            while(i + 1 < Size && data[i] == data[i+1]) ++i;
            insert_sorted(set, data[i]);
        }
        if(i < Size)
            insert_sorted(set, data[i]);

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << set.size() << '\n';
        std::cout
            << "  Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "BitVector\n"
        << "=========\n";
    {
        auto start = system_clock::now();

        typedef std::vector<bool> set_type;
        set_type set(Limit);
        unsigned i = 0;
        unsigned elements = 0;
        for( ; i < Size; ++i) {
            if( ! set[data[i]]) {
                set[data[i]] = true;
                ++elements;
            }
        }

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << elements << '\n';
        std::cout
            << "  Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "Sorted Data\n"
        << "===========\n";
    {
        auto start = system_clock::now();

        std::sort(data.begin(), data.end());
        auto last = std::unique(data.begin(), data.end());

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << last - data.begin() << '\n';
        std::cout
            << "  Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    return 0;
}

Compiled with g++ -std=c++11 -O3 gives:

Data
====
Size of data: 16777216
Unordered Set
=============
Number of different elements: 1000
Timing: 0.269752
Set
===
Number of different elements: 1000
Timing: 1.23478
Sorted Vector
=============
Number of different elements: 1000
Timing: 1.13783
BitVector
=========
Number of different elements: 1000
Timing: 0.038408
Sorted Data
===========
Number of different elements: 1000
Timing: 1.32827

Hence, if memory is no issue or the range of the numbers is limited, setting a bit is the best choice. Otherwise, std::unordered_set is a good one.

Getting unique elements from a container [c++]

EDIT based on info about source data:
The reason you're seeing the set insertion complete faster than sorting the vector is that your input data is two already sorted ranges. For quicksort (typically used by std::sort), this is a degenerate case and one of the worst possible inputs you could give it. For an input size of 1e8, changing the sort from std::sort to std::stable_sort cut the runtime from ~25s to <9s.
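
Since the input is known to be two already sorted runs, another option (not benchmarked here) is to skip the general-purpose sort entirely and combine them with std::inplace_merge; a sketch, assuming mid marks the start of the second sorted run:

#include <algorithm>
#include <vector>

// Assumes [v.begin(), mid) and [mid, v.end()) are each already sorted.
void merge_and_unique(std::vector<int>& v, std::vector<int>::iterator mid)
{
    std::inplace_merge(v.begin(), mid, v.end()); // linear when spare memory is available
    v.erase(std::unique(v.begin(), v.end()), v.end());
}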

If you want to keep the original item order, you could try something like the following, which keeps a hash of all the items seen so far. I have no idea what the performance of this would be, but for example you could use hashing together with remove_if, as sketched below:

struct Remover
{
    // Assuming a container of int here; adapt the element type as needed.
    explicit Remover(std::unordered_set<int>& found_items) : found_items_(found_items) { }

    // Return true (i.e. remove the item) if it was seen before; otherwise record it.
    bool operator()(const int& item) { return !found_items_.insert(item).second; }

    std::unordered_set<int>& found_items_;
};

std::unordered_set<int> dup_finder;
Remover remover(dup_finder);
src.erase(std::remove_if(src.begin(), src.end(), remover), src.end());

Original components of my answer:

If the elements in the source container are already mostly sorted, you may see better performance with stable_sort rather than sort prior to calling unique. I'm unable to guess, without more information about your data set, what might cause option 3 to perform better than options 1 and 2.

Option 3 should remove the duplicates, but keep in mind that, in spite of what you assert, it will still reorder the items in exactly the same way that the first two options do.


