Most Efficient Way to Sort an Array into Bins Specified by an Index Array

Binning then sorting arrays in each bin but keeping their indices together

Many numpy functions have arg... variants that don't operate "by value" but rather "by index". In your case argsort does what you want:

order = np.argsort(y)
# order is an array of indices such that
# y[order] is sorted
top50 = order[len(order) // 2 :]
top50x = x[top50]
# now top50x are the x corresponding 1-to-1 to the 50% best y

How to sort unique np array elements by occurence?

To exactly mimic sorted/list behavior @Divakar's soln can be used with a small modification:

al = [1,2,3,2,1,3,2]
aa = np.array(al)

sorted(al, key=al.count, reverse=True)
# [2, 2, 2, 1, 3, 1, 3]

u, ids, c = np.unique(aa, return_counts=True, return_inverse=True)
aa[(-c[ids]).argsort(kind="stable")]
# array([2, 2, 2, 1, 3, 1, 3])

If aa is large,

from scipy import sparse
sparse.csc_matrix((aa, (c.max()-c[ids]), np.arange(len(ids)+1))).tocsr().data
# array([2, 2, 2, 1, 3, 1, 3], dtype=int64)

may be slightly faster. Not much, though, because in both cases we first call the expensive unique, unless data are none too large integers in which case faster alternatives (to which @WarrenWeckesser appears to allude in the comments) are available including the sparse matrix trick we just used; see for example Most efficient way to sort an array into bins specified by an index array?.

aaa = np.tile(aa,10000)
timeit(lambda:aaa[(-c[ids]).argsort(kind="stable")], number=10)
# 0.040545254945755005
timeit(lambda:sparse.csc_matrix((aaa, (c.max()-c[ids]), np.arange(len(ids)+1))).tocsr().data, number=10)
# 0.0118721229955554

binning data in python with scipy/numpy

It's probably faster and easier to use numpy.digitize():

import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]

An alternative to this is to use numpy.histogram():

bin_means = (numpy.histogram(data, bins, weights=data)[0] /
numpy.histogram(data, bins)[0])

Try for yourself which one is faster... :)

Efficiently get indices of histogram bins in Python

I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic but can be orders of magnitude faster

import numpy as np
from scipy.sparse import csr_matrix

def binned_statistic(x, values, func, nbins, range):
'''The usage is nearly the same as scipy.stats.binned_statistic'''

N = len(values)
r0, r1 = range

digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))

return [func(group) for group in np.split(S.data, S.indptr[1:-1])]

I avoided np.digitize because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.

Split an array into bins of equal numbers

Listed in this post is a NumPy based vectorized approach with the idea of creating equally spaced indices for the length of the input array using np.searchsorted -
Here's the implementation -

def equal_bin(N, m):
sep = (N.size/float(m))*np.arange(1,m+1)
idx = sep.searchsorted(np.arange(N.size))
return idx[N.argsort().argsort()]

Sample runs with bin-counting for each bin to verify results -

In [442]: N = np.arange(1,94)

In [443]: np.bincount(equal_bin(N, 4))
Out[443]: array([24, 23, 23, 23])

In [444]: np.bincount(equal_bin(N, 5))
Out[444]: array([19, 19, 18, 19, 18])

In [445]: np.bincount(equal_bin(N, 10))
Out[445]: array([10, 9, 9, 10, 9, 9, 10, 9, 9, 9])

Here's another approach using linspace to create those equally spaced numbers that could be used as indices, like so -

def equal_bin_v2(N, m):
idx = np.linspace(0,m,N.size+0.5, endpoint=0).astype(int)
return idx[N.argsort().argsort()]

Sample run -

In [689]: N
Out[689]: array([ 0.2, 1.5, 0.3, 1.7, 0.5])

In [690]: equal_bin_v2(N,2)
Out[690]: array([0, 1, 0, 1, 0])

In [691]: equal_bin_v2(N,3)
Out[691]: array([0, 1, 0, 2, 1])

In [692]: equal_bin_v2(N,4)
Out[692]: array([0, 2, 0, 3, 1])

In [693]: equal_bin_v2(N,5)
Out[693]: array([0, 3, 1, 4, 2])


Related Topics



Leave a reply



Submit