C++ Efficiently Calculating a Running Median

The streaming median is computed using two heaps. All the numbers less than or equal to the current median are in the left heap, which is arranged so that the maximum number is at the root of the heap. All the numbers greater than or equal to the current median are in the right heap, which is arranged so that the minimum number is at the root of the heap. Note that numbers equal to the current median can be in either heap. The count of numbers in the two heaps never differs by more than 1.

When the process begins, the two heaps are initially empty. The first number in the input sequence is added to one of the heaps (it doesn't matter which) and returned as the first streaming median. The second number in the input sequence is then added to the other heap; if the root of the right heap is less than the root of the left heap, the two heaps are swapped. The average of the two numbers is returned as the second streaming median.

Then the main algorithm begins. Each subsequent number in the input sequence is compared to the current median, and added to the left heap if it is less than the current median or to the right heap if it is greater than the current median; if the input number is equal to the current median, it is added to whichever heap has the smaller count, or to either heap arbitrarily if they have the same count. If that causes the counts of the two heaps to differ by more than 1, the root of the larger heap is removed and inserted in the smaller heap. Then the current median is computed as the root of the larger heap, if they differ in count, or the average of the roots of the two heaps, if they are the same size.
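
Here is a minimal C++ sketch of the two-heap scheme just described, using std::priority_queue for both heaps (the class and method names are illustrative, not taken from the original code). For simplicity it routes values equal to the current median into the left heap and relies on the rebalancing step to restore the size invariant:

#include <functional>
#include <queue>
#include <vector>

class StreamingMedian {
    // Left heap: max-heap holding the values <= the current median.
    std::priority_queue<double> left;
    // Right heap: min-heap holding the values >= the current median.
    std::priority_queue<double, std::vector<double>, std::greater<double>> right;
public:
    // Add one value from the stream and return the current median.
    double add(double x) {
        if (left.empty() || x <= left.top()) left.push(x); else right.push(x);
        // Rebalance so the heap sizes never differ by more than 1.
        if (left.size() > right.size() + 1)      { right.push(left.top()); left.pop(); }
        else if (right.size() > left.size() + 1) { left.push(right.top()); right.pop(); }
        // Median: root of the larger heap, or the average of both roots.
        if (left.size() > right.size())  return left.top();
        if (right.size() > left.size())  return right.top();
        return (left.top() + right.top()) / 2.0;
    }
};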

Code in Scheme and Python is available at my blog.

Find running median from a stream of integers

There are a number of different solutions for finding the running median from streamed data; I will talk about them briefly at the very end of the answer.

The question is about the details of a specific solution (the max-heap/min-heap solution), and how the heap-based solution works is explained below:

For the first two elements, add the smaller one to the maxHeap on the left and the bigger one to the minHeap on the right. Then process the stream data one item at a time:

Step 1: Add the next item to one of the heaps

If the next item is smaller than the maxHeap root, add it to the maxHeap;
otherwise add it to the minHeap

Step 2: Balance the heaps (after this step the heaps will be either balanced or
one of them will contain 1 more item)

If the number of elements in one of the heaps is greater than the other by
more than 1, remove the root element from the one containing more elements and
add it to the other one

Then at any given time you can calculate the median like this:

    If the heaps contain an equal number of elements:
        median = (root of maxHeap + root of minHeap) / 2
    Else:
        median = root of the heap with more elements

Now I will talk about the problem in general, as promised at the beginning of the answer. Finding the running median from a stream of data is a tough problem, and finding an exact solution efficiently under memory constraints is probably impossible for the general case. On the other hand, if the data has some characteristics we can exploit, we can develop efficient specialized solutions. For example, if we know that the data is an integral type, then we can use counting sort, which gives a constant-memory, constant-time algorithm. The heap-based solution is more general because it can be used for other data types (doubles) as well. And finally, if the exact median is not required and an approximation is enough, you can just try to estimate a probability density function for the data and estimate the median using that.
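
To illustrate the counting-sort idea for integral data, here is a rough sketch assuming 8-bit values (the class name and the 0..255 range are assumptions for the example). It uses constant memory, and each median query scans at most 256 buckets, so the cost per element does not grow with the amount of data seen; for simplicity it returns the lower middle value on even counts:

#include <array>
#include <cstdint>

class HistogramMedian {
    std::array<uint64_t, 256> count{};   // occurrences of each value seen so far
    uint64_t total = 0;
public:
    void add(uint8_t x) { ++count[x]; ++total; }

    int median() const {
        if (total == 0) return -1;                 // no data yet
        uint64_t target = (total + 1) / 2, seen = 0;
        for (int v = 0; v < 256; ++v) {
            seen += count[v];
            if (seen >= target) return v;          // lower middle value
        }
        return -1;                                 // unreachable when total > 0
    }
};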

How to calculate running median efficiently

One approach is below:

import numpy as np

def RunningMedian(x, N):
    # Index matrix whose rows are the consecutive length-N windows over x.
    idx = np.arange(N) + np.arange(len(x) - N + 1)[:, None]
    # Keep only the positive entries in each window.
    b = [row[row > 0] for row in x[idx]]
    return np.array([np.median(c) for c in b])
    # np.array(map(np.median, b)) works the same way on Python 2

I found a much faster one (tens of thousands of times faster), copied below:

from collections import deque
from bisect import insort, bisect_left
from itertools import islice

def running_median_insort(seq, window_size):
    """Contributed by Peter Otten"""
    seq = iter(seq)
    d = deque()       # the window, in arrival order
    s = []            # the window, kept sorted
    result = []
    # Fill the first window, reporting a median for each prefix.
    for item in islice(seq, window_size):
        d.append(item)
        insort(s, item)
        result.append(s[len(d) // 2])
    m = window_size // 2
    # Slide the window: drop the oldest item, add the newest, read the middle.
    for item in seq:
        old = d.popleft()
        d.append(item)
        del s[bisect_left(s, old)]
        insort(s, item)
        result.append(s[m])
    return result

Take a look at the link: running_median

Efficient median calculation for small dataset in C++

This may not scale well to your data sizes, but it's a code snippet I found (can't remember where) and use in my image processing functions to get the median of 9 unsigned char pixels.

// optimised median search on 9 values
#define PIX_SWAP(a, b) { unsigned char uTemp = (a); (a) = (b); (b) = uTemp; }
#define PIX_SORT(a, b) { if ((a) > (b)) PIX_SWAP((a), (b)); }

unsigned char GetMedian9(unsigned char *pNine)
{
    // nb - this is theoretically the fastest way to get the median of 9 values
    PIX_SORT(pNine[1], pNine[2]); PIX_SORT(pNine[4], pNine[5]); PIX_SORT(pNine[7], pNine[8]);
    PIX_SORT(pNine[0], pNine[1]); PIX_SORT(pNine[3], pNine[4]); PIX_SORT(pNine[6], pNine[7]);
    PIX_SORT(pNine[1], pNine[2]); PIX_SORT(pNine[4], pNine[5]); PIX_SORT(pNine[7], pNine[8]);
    PIX_SORT(pNine[0], pNine[3]); PIX_SORT(pNine[5], pNine[8]); PIX_SORT(pNine[4], pNine[7]);
    PIX_SORT(pNine[3], pNine[6]); PIX_SORT(pNine[1], pNine[4]); PIX_SORT(pNine[2], pNine[5]);
    PIX_SORT(pNine[4], pNine[7]); PIX_SORT(pNine[4], pNine[2]); PIX_SORT(pNine[6], pNine[4]);
    PIX_SORT(pNine[4], pNine[2]);
    return pNine[4];
}

#undef PIX_SWAP
#undef PIX_SORT

EDIT - OK, it's also referenced in this answer.

Calculate median for huge amount of data

I would use std::multiset, since it can handle duplicates and maintains sorted order automatically. I would insert the numbers one by one, maintaining an iterator pointing to the median (stepping forward or backward depending on whether the new element is greater or less than the median).

Note that if this gets too large to hold comfortably in memory, you can pack a lot of the highest and lowest elements into files; it's unlikely that the median will ever move that far, and if it does you can unpack and repack.
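
One possible sketch of this multiset-with-median-iterator idea is below (the names are illustrative, not from the original answer). The iterator is kept on the lower middle element and nudged by at most one step per insertion, which relies on the C++11 rule that equal keys are inserted after the existing ones:

#include <iterator>
#include <set>

class MultisetMedian {
    std::multiset<double> data;
    std::multiset<double>::iterator mid;   // always points at the lower middle element
public:
    void insert(double x) {
        data.insert(x);
        if (data.size() == 1) {
            mid = data.begin();
        } else if (x < *mid) {
            // New element landed on the left side: step back when the size becomes even.
            if (data.size() % 2 == 0) --mid;
        } else {
            // New element landed on the right side (or equals the median):
            // step forward when the size becomes odd.
            if (data.size() % 2 != 0) ++mid;
        }
    }
    double median() const {
        // Exact middle for odd sizes; average of the two middle values for even sizes.
        return data.size() % 2 ? *mid : (*mid + *std::next(mid)) / 2.0;
    }
};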

Fastest code C/C++ to select the median in a set of 27 floating point values

Since it sounds like you're performing a median filter on a large array of volume data, you might want to take a look at the Fast Median and Bilateral Filtering paper from SIGGRAPH 2006. That paper deals with 2D image processing, but you might be able to adapt the algorithm for 3D volumes. If nothing else, it might give you some ideas on how to step back and look at the problem from a slightly different perspective.

find median in a fixed-size moving window along a long sequence of data

O(n*lg m) is easy:

Just maintain your window as two std::sets (std::multisets if duplicate values can occur), one for the lower half, one for the upper half. Insertion of a new element costs O(lg m), finding and removal of an old element costs the same. Determining the median using the method you described in your question costs O(1).

As you slide the window over your sequence, in each iteration you remove the item falling out of the window (O(lg m)), insert the new item (O(lg m)) and compute the median (O(1)), resulting in a total of O(n lg m).

This solution uses O(m) space, of course, but I don't think you can get away without storing the window's contents.
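
A rough sketch of that window structure is below; it uses std::multiset (rather than std::set) so that duplicate values inside the window are handled, and the class and method names are illustrative. Sliding the window is then erase(outgoing), insert(incoming), median():

#include <iterator>
#include <set>

class WindowMedian {
    std::multiset<double> low, high;   // lower and upper halves of the window

    // Keep low either the same size as high or exactly one element larger.
    void rebalance() {
        while (low.size() > high.size() + 1) {
            high.insert(*low.rbegin());
            low.erase(std::prev(low.end()));
        }
        while (high.size() > low.size()) {
            low.insert(*high.begin());
            high.erase(high.begin());
        }
    }
public:
    void insert(double x) {                          // O(lg m)
        if (low.empty() || x <= *low.rbegin()) low.insert(x); else high.insert(x);
        rebalance();
    }
    void erase(double x) {                           // O(lg m), x must be in the window
        if (!low.empty() && x <= *low.rbegin()) low.erase(low.find(x));
        else high.erase(high.find(x));
        rebalance();
    }
    double median() const {                          // O(1)
        if (low.size() > high.size()) return *low.rbegin();
        return (*low.rbegin() + *high.begin()) / 2.0;
    }
};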

Data structure to find median

You can use 2 heaps, that we will call Left and Right.

Left is a Max-Heap.

Right is a Min-Heap.

Insertion is done like this:

  • If the new element x is smaller than the root of Left, then we insert x into Left.
  • Otherwise we insert x into Right.
  • If after the insertion Left has more than one element more than Right, then we call Extract-Max on Left and insert the result into Right.
  • Otherwise, if after the insertion Right has more elements than Left, then we call Extract-Min on Right and insert the result into Left.

The median is always the root of Left.

So insertion is done in O(lg n) time and getting the median is done in O(1) time.
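
As a small sketch, the structure above maps directly onto two std::priority_queues (the class name is illustrative); for an even number of elements this variant reports the lower of the two middle values, since the median is always read from the root of Left:

#include <functional>
#include <queue>
#include <vector>

class MedianHeaps {
    std::priority_queue<int> left;                                          // Left: max-heap
    std::priority_queue<int, std::vector<int>, std::greater<int>> right;    // Right: min-heap
public:
    void insert(int x) {                       // O(lg n)
        if (left.empty() || x < left.top()) left.push(x); else right.push(x);
        if (left.size() > right.size() + 1)   { right.push(left.top()); left.pop(); }
        else if (right.size() > left.size())  { left.push(right.top()); right.pop(); }
    }
    int median() const { return left.top(); } // O(1), the root of Left
};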


