Binning then sorting arrays in each bin but keeping their indices together
Many numpy
functions have arg...
variants that don't operate "by value" but rather "by index". In your case argsort
does what you want:
order = np.argsort(y)
# order is an array of indices such that
# y[order] is sorted
top50 = order[len(order) // 2 :]
top50x = x[top50]
# now top50x are the x corresponding 1-to-1 to the 50% best y
How to sort unique np array elements by occurence?
To exactly mimic sorted
/list
behavior @Divakar's soln can be used with a small modification:
al = [1,2,3,2,1,3,2]
aa = np.array(al)
sorted(al, key=al.count, reverse=True)
# [2, 2, 2, 1, 3, 1, 3]
u, ids, c = np.unique(aa, return_counts=True, return_inverse=True)
aa[(-c[ids]).argsort(kind="stable")]
# array([2, 2, 2, 1, 3, 1, 3])
If aa
is large,from scipy import sparse
sparse.csc_matrix((aa, (c.max()-c[ids]), np.arange(len(ids)+1))).tocsr().data
# array([2, 2, 2, 1, 3, 1, 3], dtype=int64)
may be slightly faster. Not much, though, because in both cases we first call the expensive unique
, unless data are none too large integers in which case faster alternatives (to which @WarrenWeckesser appears to allude in the comments) are available including the sparse matrix trick we just used; see for example Most efficient way to sort an array into bins specified by an index array?.aaa = np.tile(aa,10000)
timeit(lambda:aaa[(-c[ids]).argsort(kind="stable")], number=10)
# 0.040545254945755005
timeit(lambda:sparse.csc_matrix((aaa, (c.max()-c[ids]), np.arange(len(ids)+1))).tocsr().data, number=10)
# 0.0118721229955554
binning data in python with scipy/numpy
It's probably faster and easier to use numpy.digitize()
:
import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
An alternative to this is to use numpy.histogram()
:bin_means = (numpy.histogram(data, bins, weights=data)[0] /
numpy.histogram(data, bins)[0])
Try for yourself which one is faster... :) Efficiently get indices of histogram bins in Python
I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic but can be orders of magnitude faster
import numpy as np
from scipy.sparse import csr_matrix
def binned_statistic(x, values, func, nbins, range):
'''The usage is nearly the same as scipy.stats.binned_statistic'''
N = len(values)
r0, r1 = range
digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))
return [func(group) for group in np.split(S.data, S.indptr[1:-1])]
I avoided np.digitize
because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly. Split an array into bins of equal numbers
Listed in this post is a NumPy based vectorized approach with the idea of creating equally spaced indices for the length of the input array using np.searchsorted
-
Here's the implementation -
def equal_bin(N, m):
sep = (N.size/float(m))*np.arange(1,m+1)
idx = sep.searchsorted(np.arange(N.size))
return idx[N.argsort().argsort()]
Sample runs with bin-counting for each bin to verify results -In [442]: N = np.arange(1,94)
In [443]: np.bincount(equal_bin(N, 4))
Out[443]: array([24, 23, 23, 23])
In [444]: np.bincount(equal_bin(N, 5))
Out[444]: array([19, 19, 18, 19, 18])
In [445]: np.bincount(equal_bin(N, 10))
Out[445]: array([10, 9, 9, 10, 9, 9, 10, 9, 9, 9])
Here's another approach using
linspace
to create those equally spaced numbers that could be used as indices, like so -def equal_bin_v2(N, m):
idx = np.linspace(0,m,N.size+0.5, endpoint=0).astype(int)
return idx[N.argsort().argsort()]
Sample run -In [689]: N
Out[689]: array([ 0.2, 1.5, 0.3, 1.7, 0.5])
In [690]: equal_bin_v2(N,2)
Out[690]: array([0, 1, 0, 1, 0])
In [691]: equal_bin_v2(N,3)
Out[691]: array([0, 1, 0, 2, 1])
In [692]: equal_bin_v2(N,4)
Out[692]: array([0, 2, 0, 3, 1])
In [693]: equal_bin_v2(N,5)
Out[693]: array([0, 3, 1, 4, 2])
Related Topics
"Permission Denied" Trying to Run Python on Windows 10
Using Only the Db Part of Django
Pandas Latitude-Longitude to Distance Between Successive Rows
How to Find the Maximum Value in a List of Tuples
When Should an Attribute Be Private and Made a Read-Only Property
Should I Be Adding the Django Migration Files in the .Gitignore File
In-Place Type Conversion of a Numpy Array
Loading Initial Data with Django 1.7 and Data Migrations
Django Post_Save() Signal Implementation
How to Get All the Request Headers in Django
Filename and Line Number of Python Script
How to Plot Empirical Cdf (Ecdf)
Get the String Within Brackets in Python
How to Use Numpy.Correlate to Do Autocorrelation
Why Use Sys.Path.Append(Path) Instead of Sys.Path.Insert(1, Path)
Python [Errno 98] Address Already in Use