
Binning Data in Python with Scipy/Numpy


It's probably faster and easier to use `numpy.digitize()`:

```python
import numpy

data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
```

An alternative to this is to use `numpy.histogram()`:

```python
bin_means = (numpy.histogram(data, bins, weights=data)[0] /
             numpy.histogram(data, bins)[0])
```

Try for yourself which one is faster... :)
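To compare them yourself, a `timeit` benchmark along these lines should work (a sketch; the array size and repetition count are arbitrary):

```python
import timeit

setup = """
import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
"""

# The digitize() approach: bin assignment plus per-bin means.
t_digitize = timeit.timeit(
    "d = numpy.digitize(data, bins); "
    "[data[d == i].mean() for i in range(1, len(bins))]",
    setup=setup, number=10000)

# The two-histogram approach: weighted sums divided by counts.
t_histogram = timeit.timeit(
    "numpy.histogram(data, bins, weights=data)[0]"
    " / numpy.histogram(data, bins)[0]",
    setup=setup, number=10000)

print(t_digitize, t_histogram)
```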

Numpy (or scipy) binning of time series values based on timestamps

As mentioned in the comments, if speed is your main reason for not using Pandas, I'd actually recommend it: Pandas isn't written entirely in Python; many of its internals are implemented in Cython (essentially C), so it's very fast.
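For instance, with a timestamp-indexed Series, Pandas' `resample()` does time-based binning directly. A minimal sketch (the frequency and data are illustrative):

```python
import numpy as np
import pandas as pd

# Toy time series: one value per second for ten minutes.
idx = pd.date_range("2021-01-01", periods=600, freq="s")
ts = pd.Series(np.random.random(600), index=idx)

# Mean of the values falling into each 1-minute bin.
binned = ts.resample("1min").mean()
print(binned)
```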

Binning a numpy array

Just use `reshape` and then `mean(axis=1)`.

As the simplest possible example:

```python
import numpy as np

data = np.array([4, 2, 5, 6, 7, 5, 4, 3, 5, 7])
print(data.reshape(-1, 2).mean(axis=1))
```

More generally, we need something like this to drop the last bin when the array's size is not an even multiple of the bin width:

```python
import numpy as np

width = 3
data = np.array([4, 2, 5, 6, 7, 5, 4, 3, 5, 7])
# Truncate to a multiple of width before reshaping.
result = data[:(data.size // width) * width].reshape(-1, width).mean(axis=1)
print(result)
```

Binning data and calculating MAE for each bin in Python

If your bins are all the same size, you can use floor division to obtain bin indices from `Obs`; in your example:

```python
idx = (Obs // 1).astype(int)
```

If not, use `np.digitize` instead:

```python
idx = np.digitize(Obs, bin_boundaries)
```

Once you have the indices, use them with `np.bincount` to obtain the means:

```python
mn = np.bincount(idx, abs_error) / np.bincount(idx)
```
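Putting the pieces together, a small worked example of per-bin MAE might look like this (a sketch; `Obs` and `Pred` are made-up data and the bins are unit-width):

```python
import numpy as np

Obs = np.array([0.2, 0.7, 1.3, 1.8, 2.5])
Pred = np.array([0.3, 0.5, 1.1, 2.0, 2.2])

abs_error = np.abs(Obs - Pred)
idx = (Obs // 1).astype(int)  # unit-width bins: 0 -> [0, 1), 1 -> [1, 2), ...

# Sum of absolute errors per bin, divided by the number of points per bin.
mae = np.bincount(idx, abs_error) / np.bincount(idx)
print(mae)  # [0.15 0.2  0.3 ]
```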

Vectorized approach to binning with numpy/scipy in Python

If I understand your question correctly:

```python
from numpy import array, searchsorted

vals = array([[1, 10], [1, 11], [2, 20], [2, 21], [2, 22]])  # Example
(x, y) = vals.T  # Shortcut
bin_limits = range(min(x)+1, max(x)+2)  # Other limits could be chosen
points_by_bin = [[] for _ in bin_limits]  # Final result

# searchsorted() finds the correct bin number for each x value:
for (bin_num, y_value) in zip(searchsorted(bin_limits, x, "right"), y):
    points_by_bin[bin_num].append(y_value)

print(points_by_bin)  # [[10, 11], [20, 21, 22]]
```

Numpy's fast array operation `searchsorted()` is used for maximum efficiency. The values are then appended one by one (since the final result is not a rectangular array, Numpy cannot help much here). This solution should be faster than multiple `where()` calls in a loop, which would force Numpy to re-read the same array many times.

Calculating a binned mean with SciPy: binned_statistic + handling NaNs (ValueError with SciPy and statistic=np.nanmean)

The answer to both questions 1 and 2 is to use `np.nanmean` to ignore the `nan`s in the data. The regression you link to was a bug that I inadvertently introduced and then fixed after it was reported. I'm not sure why you have SciPy 1.5.2 in your environment; 1.5.4 is the latest 1.5.x release, so you probably want to update. The fix was backported to the 1.5.x series, so there should be no issue on those versions as long as you have the latest one.

I also set this up with SciPy 1.7.3, and that works as intended for me too. Here are the snippets.

Version 1.5.4

```python
>>> import scipy
>>> scipy.__version__
'1.5.4'
>>> import scipy.stats
>>> import numpy as np
>>> x = [0.5, 0.5, 1.5, 1.5]
>>> values = [10, 20, np.nan, 40]
>>> scipy.stats.binned_statistic(x, values, statistic='mean', bins=(0, 1, 2)).statistic
array([15., nan])
>>> scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
array([15., 40.])
```

Version 1.7.3

```python
>>> import scipy
>>> scipy.__version__
'1.7.3'
>>> import scipy.stats, numpy as np
>>> x = [0.5, 0.5, 1.5, 1.5]
>>> values = [10, 20, np.nan, 40]
>>> scipy.stats.binned_statistic(x, values, statistic='mean', bins=(0, 1, 2)).statistic
array([15., nan])
>>> scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
array([15., 40.])
```

Version 1.5.2

```python
>>> import scipy
>>> scipy.__version__
'1.5.2'
>>> import scipy.stats, numpy as np
>>> x = [0.5, 0.5, 1.5, 1.5]
>>> values = [10, 20, np.nan, 40]
>>> scipy.stats.binned_statistic(x, values, statistic='mean', bins=(0, 1, 2)).statistic
array([15., nan])
>>> scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
array([15., 40.])
```

Please check `scipy.__version__`, as in the snippets above, to confirm which version of SciPy your environment is actually using. If it is 1.4.1, it does not have the fix; I suspect that is the version you are running rather than one of the higher ones.

Binning pandas/numpy array in unequal sizes with approx equal computational cost

I think a good approach has been found. Credit goes to a colleague.

The idea is to sort the group sizes (in descending order) and put groups into bins in a "backward S"-pattern. Let me illustrate with an example. Assume `n = 3` (number of bins) and the following data:

```
groups    data
0          359
1          326
2          264
3          262
4          249
5          248
6          245
7          189
8          187
9          153
10          45
```

The idea is to put one group into one bin at a time, moving back and forth across the bins in a "backward S" pattern: the first element goes in bin 0, the second in bin 1, and so on; then the direction reverses when the last bin is reached, so the fourth element goes in bin 2, the fifth in bin 1, etc. Below you can see how the elements are placed into bins, with the group number in parentheses; the values are the group sizes.

```
Bins:  |    0    |    1    |    2    |
       |  359 (0)|  326 (1)|  264 (2)|
       |  248 (5)|  249 (4)|  262 (3)|
       |  245 (6)|  189 (7)|  187 (8)|
       |         |   45(10)|  153 (9)|
```

The bins will have approximately the same number of values and thus approximately the same computational "cost". The bin sizes here are `[852, 809, 866]`, for anyone interested. I have tried this on a real-world dataset and the bins came out similar in size, although similar sizes are not guaranteed for every dataset.

The code can be made more efficient, but this is sufficient to get the idea out:

```python
import numpy as np
import pandas as pd

n = 3     # number of bins
size = 50
rng = np.random.default_rng(2021)
df = pd.DataFrame({
    "one": np.linspace(0, 10, size, dtype=np.uint8),
    "two": np.linspace(0, 5, size, dtype=np.uint8),
    "data": rng.integers(0, 100, size)
})

# Group sizes, sorted in descending order.
groups = df.groupby(["one", "two"]).sum()
groups = groups.sort_values("data", ascending=False).reset_index(drop=True)

# Distribute the groups over the bins in a "backward S" pattern.
bins = [[] for _ in range(n)]
backward = False
i = 0
for group in groups.iterrows():
    bins[i].append(group)
    i = i + 1 if not backward else i - 1
    if i == n:                 # reached the last bin: turn around
        backward = True
        i -= 1
    if i == -1 and backward:   # reached the first bin: turn around again
        backward = False
        i += 1

# Total "cost" per bin.
print([sum(row["data"] for (_, row) in bin) for bin in bins])
```

Binning data into non-grid based bins

```python
out = X*np.nan
for i in range(max(b)):
    out[i] = np.nanmean(f[x_grid[b == i], y_grid[b == i]])
```

can be replaced by two calls to `np.bincount`:

```python
total = np.bincount(b, weights=f[x_grid, y_grid], minlength=len(X))
count = np.bincount(b, minlength=len(X))
out = total/count
```

or one call to `stats.binned_statistic`:

```python
out, bin_edges, binnumber = stats.binned_statistic(
    x=b, values=f[x_grid, y_grid], statistic='mean',
    bins=np.arange(len(X)+1))
```

For example,

```python
import numpy as np
from scipy.spatial import KDTree
import scipy.stats as stats

np.random.seed(2017)

def rebin(f, X, Y):
    s = f.shape
    x_grid = np.arange(s[0])
    y_grid = np.arange(s[1])
    x_grid, y_grid = np.meshgrid(x_grid, y_grid)
    x_grid, y_grid = x_grid.flatten(), y_grid.flatten()
    tree = KDTree(np.column_stack((X, Y)))
    _, b = tree.query(np.column_stack((x_grid, y_grid)))
    out, bin_edges, binnumber = stats.binned_statistic(
        x=b, values=f[x_grid, y_grid], statistic='mean',
        bins=np.arange(len(X)+1))
    # Equivalent alternative using np.bincount:
    # total = np.bincount(b, weights=f[x_grid, y_grid], minlength=len(X))
    # count = np.bincount(b, minlength=len(X))
    # out = total/count
    return out

def orig(f, X, Y):
    s = f.shape
    x_grid = np.arange(s[0])
    y_grid = np.arange(s[1])
    x_grid, y_grid = np.meshgrid(x_grid, y_grid)
    x_grid, y_grid = x_grid.flatten(), y_grid.flatten()
    tree = KDTree(np.column_stack((X, Y)))
    _, b = tree.query(np.column_stack((x_grid, y_grid)))
    out = X*np.nan
    for i in range(len(X)):
        out[i] = np.nanmean(f[x_grid[b == i], y_grid[b == i]])
    return out

N = 100
X, Y = np.random.random((2, N))
f = np.random.random((N, N))

expected = orig(f, X, Y)
result = rebin(f, X, Y)
print(np.allclose(expected, result, equal_nan=True))
# True
```

Averaging Data in Bins

One way is to use `numpy.digitize` to bin your categories.

Then use a dictionary or list comprehension to calculate results.

```python
import numpy as np

chl = np.array([0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09, 0.23,
                0.43, 0.65, 0.22, 0.12, 0.2, 0.33])
depth = np.array([0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
                  1.53, 1.67, 1.79, 1.87, 2.1, 2.3])
bins = np.array([0, 0.5, 1.0, 1.5, 2.0, 2.5])

A = np.vstack((np.digitize(depth, bins), chl)).T
res = {bins[int(i)]: np.mean(A[A[:, 0] == i, 1]) for i in np.unique(A[:, 0])}
# {0.5: 0.198, 1.5: 0.28, 2.0: 0.355, 2.5: 0.265}
```

Or for the precise format you are after:

```python
res_lst = [np.mean(A[A[:, 0] == i, 1]) for i in range(len(bins))]
# [nan, 0.198, nan, 0.28, 0.355, 0.265]
```