Binning Data in Python with Scipy/Numpy

It's probably faster and easier to use numpy.digitize():

import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)  # bin index (1-based) for each sample
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]

An alternative to this is to use numpy.histogram():

bin_means = (numpy.histogram(data, bins, weights=data)[0] /
             numpy.histogram(data, bins)[0])

Try for yourself which one is faster... :)
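For a rough comparison, a timeit sketch along these lines could be used (the repetition count is arbitrary):

import timeit

setup = """
import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
"""

# Mean per bin via digitize() and a list comprehension
digitize_stmt = """
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
"""

# Mean per bin via two histogram() calls
histogram_stmt = """
bin_means = (numpy.histogram(data, bins, weights=data)[0] /
             numpy.histogram(data, bins)[0])
"""

print("digitize: ", timeit.timeit(digitize_stmt, setup=setup, number=1000))
print("histogram:", timeit.timeit(histogram_stmt, setup=setup, number=1000))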

Numpy (or scipy) binning of time series values based on timestamps

As mentioned in the comments, if your main reason for avoiding Pandas is speed, I'd actually recommend using it: it is not written entirely in Python, and many of its internals are implemented in Cython (essentially C), so they are very, very fast.
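As a rough sketch of what that looks like (the frequencies and sizes here are made up for illustration), resampling a timestamp-indexed Series lets Pandas do the timestamp-based binning in compiled code:

import numpy as np
import pandas as pd

# Illustrative data: one value per second for ten minutes
times = pd.date_range("2021-01-01", periods=600, freq="s")
values = np.random.random(600)
series = pd.Series(values, index=times)

# Bin by timestamp into 1-minute buckets and average each bucket
minute_means = series.resample("1min").mean()
print(minute_means.head())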

Binning a numpy array

Just use reshape and then mean(axis=1).

As the simplest possible example:

import numpy as np

data = np.array([4,2,5,6,7,5,4,3,5,7])

print(data.reshape(-1, 2).mean(axis=1))

More generally, we'd need to do something like this to drop the incomplete last bin when the array length is not an even multiple of the bin width:

import numpy as np

width = 3
data = np.array([4,2,5,6,7,5,4,3,5,7])

# Truncate to a multiple of width, then reshape and average each row
result = data[:(data.size // width) * width].reshape(-1, width).mean(axis=1)

print(result)

Binning data and calculating MAE for each bin in Python

If your bins are all the same size, you can use floor division to obtain bin indices from Obs; in your example:

idx = (Obs // 1).astype(int)

If not, use np.digitize instead:

idx = np.digitize(Obs, bin_boundaries)

Once you have the indices, use them with np.bincount to obtain the means:

mn = np.bincount(idx, abs_error) / np.bincount(idx)
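Putting those pieces together, here is a minimal end-to-end sketch; Obs, abs_error, and the unit-width bins are made-up stand-ins for the arrays in the question:

import numpy as np

# Hypothetical stand-ins for the question's arrays
Obs = np.array([0.3, 0.7, 1.2, 1.8, 2.5, 2.9])          # observed values
abs_error = np.array([0.1, 0.2, 0.05, 0.15, 0.3, 0.2])  # |prediction - observation|

# Unit-width bins: floor division gives the bin index directly
idx = (Obs // 1).astype(int)

# MAE per bin: sum of errors in each bin divided by the bin count
mn = np.bincount(idx, abs_error) / np.bincount(idx)
print(mn)  # [0.15 0.1  0.25]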

Vectorized approach to binning with numpy/scipy in Python

If I understand your question correctly:

import numpy as np

vals = np.array([[1, 10], [1, 11], [2, 20], [2, 21], [2, 22]])  # Example

(x, y) = vals.T  # Shortcut
bin_limits = range(min(x)+1, max(x)+2)  # Other limits could be chosen
points_by_bin = [[] for _ in bin_limits]  # Final result
for (bin_num, y_value) in zip(np.searchsorted(bin_limits, x, "right"), y):  # searchsorted() finds the correct bin number
    points_by_bin[bin_num].append(y_value)

print(points_by_bin)  # [[10, 11], [20, 21, 22]]

NumPy's fast array operation searchsorted() is used for maximum efficiency. The values are then appended one by one (since the final result is not a rectangular array, NumPy cannot help much with that part). This solution should be faster than multiple where() calls in a loop, which would force NumPy to re-read the same array many times.

Calculating a binned mean with SciPy: binned_statistic + handling NaNs (ValueError with SciPy and statistic=np.nanmean)

The answer to both questions 1 and 2 is to use np.nanmean to ignore the NaNs in the data. The regression you link to was a bug that I inadvertently introduced and then fixed after it was reported. I'm not sure why you have SciPy 1.5.2 in your environment; 1.5.4 is the latest 1.5.x release, so you probably want to update the environment you're using. The fix was backported to the 1.5.x release line, however, so there should not be an issue on those versions if you have the latest.

I also set this up with SciPy 1.7.3, and it works as intended for me there as well. Here are the snippets.

Version 1.5.4

import scipy
scipy.__version__
'1.5.4'
import scipy.stats, numpy as np
x = [0.5, 0.5, 1.5, 1.5]
values = [10, 20, np.nan, 40]
scipy.stats.binned_statistic(x, values, statistic='mean', bins=(0, 1, 2)).statistic
array([15., nan])
scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
array([15., 40.])

Version 1.7.3

import scipy
scipy.__version__
'1.7.3'
import scipy.stats, numpy as np
x = [0.5, 0.5, 1.5, 1.5]
values = [10, 20, np.nan, 40]
scipy.stats.binned_statistic(x, values, statistic='mean', bins=(0, 1, 2)).statistic
array([15., nan])
scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
array([15., 40.])

Version 1.5.2

import scipy
scipy.__version__
'1.5.2'
import scipy.stats, numpy as np
x = [0.5, 0.5, 1.5, 1.5]
values = [10, 20, np.nan, 40]
scipy.stats.binned_statistic(x, values, statistic='mean', bins=(0, 1, 2)).statistic
array([15., nan])
scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
array([15., 40.])

Please use .__version__ to confirm which version of SciPy your environment is actually using, for example with the snippets above. If it is 1.4.1, it does not have the fix; I suspect that is the version you are running rather than one of the higher versions.

Binning pandas/numpy array in unequal sizes with approx equal computational cost

I think a good approach has been found. Credits to a colleague.

The idea is to sort the group sizes (in descending order) and put groups into bins in a "backward S"-pattern. Let me illustrate with an example. Assume n = 3 (number of bins) and the following data:

group  data
0      359
1      326
2      264
3      262
4      249
5      248
6      245
7      189
8      187
9      153
10      45

The idea is to put one group into one bin at a time, going back and forth between the bins in a "backward S" pattern: first element in bin 0, second element in bin 1, etc., then going backwards after reaching the last bin: fourth element in bin 2, fifth element in bin 1, etc. See below how the elements are put into bins, with the group number in parentheses. The values are the group sizes.

Bins:  |    0    |    1    |    2    |
       | 359 (0) | 326 (1) | 264 (2) |
       | 248 (5) | 249 (4) | 262 (3) |
       | 245 (6) | 189 (7) | 187 (8) |
       |         |  45 (10)| 153 (9) |

The bins will hold approximately the same number of values and, thus, have approximately the same computational "cost". For anyone interested, the bin sizes here are [852, 809, 866]. I have tried it on a real-world dataset and the bins came out similar in size, although similar sizes are not guaranteed for all datasets.

The code can be made more efficient, but this is sufficient to get the idea out:

import numpy as np
import pandas as pd

n = 3
size = 50
rng = np.random.default_rng(2021)
df = pd.DataFrame({
    "one": np.linspace(0, 10, size, dtype=np.uint8),
    "two": np.linspace(0, 5, size, dtype=np.uint8),
    "data": rng.integers(0, 100, size)
})

# Group sizes, sorted in descending order
groups = df.groupby(["one", "two"]).sum()
groups = groups.sort_values("data", ascending=False).reset_index(drop=True)

# Distribute the groups over the bins in the "backward S" pattern
bins = [[] for i in range(n)]
backward = False
i = 0
for group in groups.iterrows():
    bins[i].append(group)
    i = i + 1 if not backward else i - 1
    if i == n:
        backward = True
        i -= 1
    if i == -1 and backward:
        backward = False
        i += 1

# Total size of each bin
[sum([size.iloc[0] for (group, size) in bin]) for bin in bins]
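As a sanity check, here is a minimal list-based sketch (independent of the pandas code above) that reproduces the [852, 809, 866] totals from the illustration; the hard-coded index cycle is just another way of expressing the same "backward S" pattern:

sizes = [359, 326, 264, 262, 249, 248, 245, 189, 187, 153, 45]  # sorted descending
n = 3

bins = [[] for _ in range(n)]
# Bin indices repeat as 0, 1, 2, 2, 1, 0, 0, 1, 2, ...
order = list(range(n)) + list(range(n - 1, -1, -1))
for pos, group_size in enumerate(sizes):
    bins[order[pos % (2 * n)]].append(group_size)

print([sum(b) for b in bins])  # [852, 809, 866]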

Binning data into non-grid based bins

out = X*np.nan
for i in range(max(b)):
    out[i] = np.nanmean(f[x_grid[b==i], y_grid[b==i]])

can be replaced by two calls to np.bincount:

total = np.bincount(b, weights=f[x_grid, y_grid], minlength=len(X))
count = np.bincount(b, minlength=len(X))
out = total/count

or one call to stats.binned_statistic:

out, bin_edges, binnumber = stats.binned_statistic(
    x=b, values=f[x_grid, y_grid], statistic='mean', bins=np.arange(len(X)+1))

For example,

import numpy as np
from scipy.spatial import KDTree
import scipy.stats as stats
np.random.seed(2017)

def rebin(f, X, Y):
    s = f.shape
    x_grid = np.arange(s[0])
    y_grid = np.arange(s[1])
    x_grid, y_grid = np.meshgrid(x_grid, y_grid)
    x_grid, y_grid = x_grid.flatten(), y_grid.flatten()

    tree = KDTree(np.column_stack((X, Y)))
    _, b = tree.query(np.column_stack((x_grid, y_grid)))

    out, bin_edges, binnumber = stats.binned_statistic(
        x=b, values=f[x_grid, y_grid], statistic='mean', bins=np.arange(len(X)+1))
    # total = np.bincount(b, weights=f[x_grid, y_grid], minlength=len(X))
    # count = np.bincount(b, minlength=len(X))
    # out = total/count
    return out

def orig(f, X, Y):
    s = f.shape
    x_grid = np.arange(s[0])
    y_grid = np.arange(s[1])
    x_grid, y_grid = np.meshgrid(x_grid, y_grid)
    x_grid, y_grid = x_grid.flatten(), y_grid.flatten()

    tree = KDTree(np.column_stack((X, Y)))
    _, b = tree.query(np.column_stack((x_grid, y_grid)))

    out = X*np.nan
    for i in range(len(X)):
        out[i] = np.nanmean(f[x_grid[b==i], y_grid[b==i]])
    return out

N = 100
X, Y = np.random.random((2, N))
f = np.random.random((N, N))

expected = orig(f, X, Y)
result = rebin(f, X, Y)
print(np.allclose(expected, result, equal_nan=True))
# True

Averaging Data in Bins

One way is to use numpy.digitize to assign each measurement to a bin.

Then use a dictionary or list comprehension to calculate the results.

import numpy as np

chl = np.array([0.4,0.1,0.04,0.05,0.4,0.2,0.6,0.09,0.23,0.43,0.65,0.22,0.12,0.2,0.33])
depth = np.array([0.1,0.3,0.31,0.44,0.49,1.1,1.145,1.33,1.49,1.53,1.67,1.79,1.87,2.1,2.3])

bins = np.array([0,0.5,1.0,1.5,2.0,2.5])

A = np.vstack((np.digitize(depth, bins), chl)).T

res = {bins[int(i)]: np.mean(A[A[:, 0] == i, 1]) for i in np.unique(A[:, 0])}

# {0.5: 0.198, 1.5: 0.28, 2.0: 0.355, 2.5: 0.265}

Or for the precise format you are after:

res_lst = [np.mean(A[A[:, 0] == i, 1]) for i in range(len(bins))]

# [nan, 0.198, nan, 0.28, 0.355, 0.265]

