What Is the Right Approach When Using Stl Container for Median Calculation

What is the right approach when using STL container for median calculation?

Any random-access container (like std::vector) can be sorted with the standard std::sort algorithm, available in the <algorithm> header.

For finding the median, it would be quicker to use std::nth_element; this does enough of a sort to put one chosen element in the correct position, but doesn't completely sort the container. So you could find the median like this:

int median(vector<int> &v)
{
    size_t n = v.size() / 2;
    nth_element(v.begin(), v.begin()+n, v.end());
    return v[n];
}

Compute Median of Values Stored In Vector - C++?

You are doing an extra division and overall making it a bit more complex than it needs to be. Also, there's no need to create a DIVISOR when 2 is actually more meaningful in context.

double CalcMHWScore(vector<int> scores)
{
  size_t size = scores.size();

  if (size == 0)
  {
    return 0;  // Undefined, really.
  }
  else
  {
    sort(scores.begin(), scores.end());
    if (size % 2 == 0)
    {
      return (scores[size / 2 - 1] + scores[size / 2]) / 2;
    }
    else 
    {
      return scores[size / 2];
    }
  }
}

How to obtain the index of the median using STL?

If it is accapteble to search the element

vector<int>::iterator itOfMedian = std::find(myarray.begin(), myarray.end(), median);
int index = itOfMedian - myarray.begin();

should do the trick.

EDIT

seems you have point here. nth_element sorts its argument vector... Therefore

vector<int> myArrayCopy = myarray;
// find median in myArrayCopy
vector<int>::iterator itOfMedian = std::find(myarray.begin(), myarray.end(), median);
int index = itOfMedian - myarray.begin();

What's the most efficient way to extract min, max & median from a vector

Use std::nth_element to partition which yields the median, then std::min_element in the left half and std::max_element in the right one.

If you need it to be faster than that then roll your own version based on std::nth_element.

median(x) is does not correctly return the middle value of x

Median is the middle value of sorted values. Have you sorted that column before finding this middle value? Here is a toy demonstration of what can go wrong if values are unsorted.

## a vector of even length
set.seed(0); x <- sample.int(10)
#[1]  9  4  7  1  2  5  3 10  6  8

## true value
median(x)
#[1] 5.5

## values are unsorted
is.unsorted(x)
#[1] TRUE
## "middle" value
0.5 * (x[length(x) / 2] + x[length(x) / 2 + 1])
#[1] 3.5

## correct calculation with sorted values
sx <- sort(x)
## "middle" value
(sx[length(x) / 2] + sx[length(x) / 2 + 1]) / 2
#[1] 5.5

Efficient way to get middle (median) of an std::set?

Depending on how often you insert/remove items versus look up the middle/median, a possibly more efficient solution than the obvious one is to keep a persistent iterator to the middle element and update it whenever you insert/delete items from the set. There are a bunch of edge cases which will need handling (odd vs even number of items, removing the middle item, empty set, etc.), but the basic idea would be that when you insert an item that's smaller than the current middle item, your middle iterator may need decrementing, whereas if you insert a larger one, you need to increment. It's the other way around for removals.

At lookup time, this is of course O(1), but it also has an essentially O(1) cost at each insertion/deletion, i.e. O(N) after N insertions, which needs to be amortised across a sufficient number of lookups to make it more efficient than brute forcing.

C++ Efficiently Calculating a Running Median

The streaming median is computed using two heaps. All the numbers less than or equal to the current median are in the left heap, which is arranged so that the maximum number is at the root of the heap. All the numbers greater than or equal to the current median are in the right heap, which is arranged so that the minimum number is at the root of the heap. Note that numbers equal to the current median can be in either heap. The count of numbers in the two heaps never differs by more than 1.

When the process begins the two heaps are initially empty. The first number in the input sequence is added to one of the heaps, it doesn’t matter which, and returned as the first streaming median. The second number in the input sequence is then added to the other heap, if the root of the right heap is less than the root of the left heap the two heaps are swapped, and the average of the two numbers is returned as the second streaming median.

Then the main algorithm begins. Each subsequent number in the input sequence is compared to the current median, and added to the left heap if it is less than the current median or to the right heap if it is greater than the current median; if the input number is equal to the current median, it is added to whichever heap has the smaller count, or to either heap arbitrarily if they have the same count. If that causes the counts of the two heaps to differ by more than 1, the root of the larger heap is removed and inserted in the smaller heap. Then the current median is computed as the root of the larger heap, if they differ in count, or the average of the roots of the two heaps, if they are the same size.

Code in Scheme and Python is available at my blog.

What Is the Right Approach When Using Stl Container for Median Calculation