Algorithm for Nth_Element

algorithm for nth_element

It's called a selection algorithm and wikipedia has a decent page on it: http://en.wikipedia.org/wiki/Selection_algorithm

Also read about order statistics: http://en.wikipedia.org/wiki/Order_statistic

How is nth_element Implemented?

You asked two questions, the titular one

How is nth_element Implemented?

Which you already answered:

There are a lot of claims on StackOverflow and elsewhere that nth_element is O(n) and that it is typically implemented with Introselect.

Which I also can confirm from looking at my stdlib implementation. (More on this later.)

And the one where you don't understand the answer:

How can an algorithm switch between QSort and Median-of-Medians?

Lets have a look at pseudo code that I extracted from my stdlib:

nth_element(first, nth, last)
{ 
  if (first == last || nth == last)
    return;

  introselect(first, nth, last, log2(last - first) * 2);
}

introselect(first, nth, last, depth_limit)
{
  while (last - first > 3)
  {
      if (depth_limit == 0)
      {
          // [NOTE by editor] This should be median-of-medians instead.
          // [NOTE by editor] See Azmisov's comment below
          heap_select(first, nth + 1, last);
          // Place the nth largest element in its final position.
          iter_swap(first, nth);
          return;
      }
      --depth_limit;
      cut = unguarded_partition_pivot(first, last);
      if (cut <= nth)
        first = cut;
      else
        last = cut;
  }
  insertion_sort(first, last);
}

Without getting into details about the referenced functions heap_select and unguarded_partition_pivot we can clearly see, that nth_element gives introselect 2 * log2(size) subdivision steps (twice as much as needed by quickselect in the best case) until heap_select kicks in and solves the problem for good.

What is nth_element and what does it do exactly? and how to implement it

So where is nth element here?

The n-th element is the 2 at index 2 because thats what you asked for when you passed begin()+2.

The element pointed at by nth is changed to whatever element would occur in that position if [first, last) was sorted.

This means that, if the vector was sorted, the order of elements would be

0 1 2 3 4 5 6 7 8 9 
    ^--- begin() + 2

You asked to have 3rd largest element at index 2 (3rd position), and thats what the algorithm does.

In addition it puts all elements smaller in the front and all elements larger in the back:

!(*j < *i) (for the first version, or comp(*j, *i) == false for the second version) is met for any i in the range [first, nth) and for any j in the range [nth, last).

Let's use indices rather than iterators, then for any i < 2 and any j > 2 it holds that v[i] < v[j]. In other words, 1 and 0 are both smaller than any element in 2 3 6 7 8 5 4 9.

nth_element implementations complexities

The expected running time is O(N)
The worst case running time for most implemententations is O(N * N) because most implementations use QuickSelect and it could be that QuickSelect runs into bad partitions.
That is true for Microsoft VS2008, VS2010 & VS2012.

Now with the new ISO C++ 2011 standard, the complexity for std::sort has been tightened up - it is guaranteed to be O(N * log N) and has no worse case as David Musser's IntroSort is used: - use QuickSort and if parts of the array experience bad partitioning, swap to heapsort.

Ideally exactly the same should apply std::nth_element but the ISO C++ 2011 standard has not tightened up the complexity requirements. So std::nth_element could be O(N * N) in the worst case. This could be because in David Musser's original paper (see here) he did not mention what algorithm should be swapped to if QuickSelect goes bad.

In the worst case, the median-of-medians using groups of 5 could be used (I have seen a paper the recommended groups of 7 but cannot find it). So a quality implementation of std::nth_element could use QuickSelect and swap to median-of-medians if partitioning goes bad. This would guarantee O(N) behaviour. QuickSelect can be improved by using sampling making the worst case unlikely but not impossible.

Towards understanding nth_element

Try:

std::nth_element (myvector.begin(), myvector.begin()+ 4, myvector.end());

instead of :

std::nth_element (myvector.begin() + 1, 
                  myvector.begin() + 4, 
                  myvector.begin() + 5);

You are shuffling and then calling nth_element for only a subsequence. This way you cannot what the returned element should be. If you do it for the whole sequence, then the answer is 5.

Std::nth_element is not giving the correct element

The variable end is not correctly pointing to the end of the vector. Hence, you only perform nth_element() on the first 2 elements. Since the variable median value is 1, the point with greater x value will be returned.

You can use points.end() instead of points.begin() + end.

nth_element on a vector of structs c++

I suppose that that call to nth_element produced a compile error, possibly one of those famously incomprehensible C++ template errors. But there should have been some sort of clue at least.

The second argument -- points.size() + points.size()/2 -- is not an iterator into the container; it's an integer. You meant points.begin() + points.size()/2.

What is an interesting use-case for C++ nth_element algorithm?

A typical use case is when one wants to avoid sorting the whole set because it is large. If you just need the top n, then nth_element is used to partition the set. The first partition then contains the top n elements.

nth_element implementation based on modified quick_sort, not working as expected

In case if anybody tries to implement nth_element based on Hoar partition:

use 2 pointers i,j and maintain loop invariant:

for k = 0..i-1 a[k] <= X and for z = j+1 .. n-1 a[z] >= X

after the loop, the array might be partitioned in 2 ways:

...| j | i |...

or
...| j | X | i |...
all elements(left part) with idx in [L,j] <= X

with idx(right part) in [i,R] >= X

based on the number of elements in left/right part recur on the required path

std::random_device dev;
std::mt19937 rng(dev());
long long kth(vector<long long>& a, int L, int R, int k) {
  if(L>=R) return a[L];

  int i = L,j = R;

  std::uniform_int_distribution<std::mt19937::result_type> dist(L,R);
  long long X = a[dist(rng)];

  while(i <= j) {
    while(a[i] < X) ++i;
    while(a[j] > X) --j;
    if(i <= j) {
      swap(a[i], a[j]);
      ++i;
      --j;
    }
  }

  if(k <= j)
    return kth(a, L,j,k);
  if (k >=i)
    return kth(a,i,R,k);
  return X;
}

Algorithm for Nth_Element