## binning data in python with scipy/numpy

It's probably faster and easier to use `numpy.digitize()`:

```python
import numpy

data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
```

An alternative to this is to use `numpy.histogram()`:

```python
bin_means = (numpy.histogram(data, bins, weights=data)[0] /
             numpy.histogram(data, bins)[0])
```

Try for yourself which one is faster... :)
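One way to try this is with the standard-library `timeit` module. The sketch below is not part of the original answer: the array size, seed, and repetition count are arbitrary assumptions, and the timings will vary by machine and NumPy version.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)      # seed chosen arbitrarily for repeatability
data = rng.random(10_000)
bins = np.linspace(0, 1, 10)

def means_with_digitize():
    digitized = np.digitize(data, bins)
    return np.array([data[digitized == i].mean() for i in range(1, len(bins))])

def means_with_histogram():
    sums = np.histogram(data, bins, weights=data)[0]
    counts = np.histogram(data, bins)[0]
    return sums / counts

# Both approaches should agree before their speed is compared.
same_result = np.allclose(means_with_digitize(), means_with_histogram())

t_digitize = timeit.timeit(means_with_digitize, number=50)
t_histogram = timeit.timeit(means_with_histogram, number=50)
print(f"same result: {same_result}")
print(f"digitize:  {t_digitize:.4f} s for 50 runs")
print(f"histogram: {t_histogram:.4f} s for 50 runs")
```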

## Numpy (or scipy) binning of time series values based on timestamps

As mentioned in the comments, if your primary reason for avoiding Pandas is speed, I'd actually recommend using it: it is not written entirely in Python, and many of its internals are written in Cython (basically C), so they're very fast.
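As a minimal sketch of what timestamp-based binning could look like in Pandas, the example below uses `Series.resample`. The timestamps, values, and the 30-minute bin width are made up for illustration and are not taken from the original question.

```python
import pandas as pd

# Hypothetical time series: one reading every 10 minutes.
timestamps = pd.date_range("2021-01-01 00:00", periods=6, freq="10min")
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=timestamps)

# Bin the readings into 30-minute windows and average each window.
binned_mean = series.resample("30min").mean()
print(binned_mean)
```

Each output row is labelled with the start of its window; other aggregations such as `.sum()` or `.count()` can be swapped in for `.mean()`.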

## Binning a numpy array

Just use `reshape` and then `mean(axis=1)`.

As the simplest possible example:

```python
import numpy as np

data = np.array([4,2,5,6,7,5,4,3,5,7])

print(data.reshape(-1, 2).mean(axis=1))
```

More generally, we'd need to do something like this to drop the last bin when the array length is not an even multiple of the bin width:

```python
import numpy as np

width=3
data = np.array([4,2,5,6,7,5,4,3,5,7])

# Trim the array to a whole number of bins before reshaping
result = data[:(data.size // width) * width].reshape(-1, width).mean(axis=1)

print(result)
```
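If dropping the leftover elements is not wanted, one possible variation (an assumption, not something the answer above proposes) is to pad the array with NaN up to the next multiple of the width and average with `np.nanmean`, so the final partial bin is kept.

```python
import numpy as np

width = 3
data = np.array([4, 2, 5, 6, 7, 5, 4, 3, 5, 7], dtype=float)

# Number of NaN fillers needed to reach the next multiple of `width`.
pad = (-data.size) % width
padded = np.concatenate([data, np.full(pad, np.nan)])

# nanmean ignores the fillers, so the last bin averages only real values.
result = np.nanmean(padded.reshape(-1, width), axis=1)
print(result)
```

Note that the array has to be floating point for this to work, since integer arrays cannot hold NaN.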

## Binning data and calculating MAE for each bin in Python

If your bins are all the same size, you can use floor division to obtain bin indices from `Obs`, in your example.

```python
idx = (Obs // 1).astype(int)
```

If not, use `np.digitize` instead.

```python
idx = np.digitize(Obs, bin_boundaries)
```

Once you have indices, use them with `np.bincount` to obtain the means.

```python
mn = np.bincount(idx, abs_error) / np.bincount(idx)
```
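Put together, the steps above can be sketched as follows. The `Obs` and `Pred` values are hypothetical, and the guarded division is an addition so that bins without data come out as NaN rather than triggering a divide-by-zero warning.

```python
import numpy as np

# Hypothetical observations and predictions.
Obs = np.array([0.2, 0.7, 1.1, 1.9, 3.5])
Pred = np.array([0.0, 1.0, 1.0, 2.5, 3.0])
abs_error = np.abs(Obs - Pred)

idx = (Obs // 1).astype(int)           # unit-width bins: [0, 1), [1, 2), ...
total = np.bincount(idx, weights=abs_error)
count = np.bincount(idx)

# Divide only where a bin has data; empty bins stay NaN.
mae = np.full(total.shape, np.nan)
np.divide(total, count, out=mae, where=count > 0)
print(mae)
```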

## vectorized approach to binning with numpy/scipy in Python

If I understand your question correctly:

```python
from numpy import array, searchsorted

vals = array([[1, 10], [1, 11], [2, 20], [2, 21], [2, 22]])  # Example
(x, y) = vals.T  # Shortcut
bin_limits = range(min(x)+1, max(x)+2)  # Other limits could be chosen
points_by_bin = [ [] for _ in bin_limits ]  # Final result
# searchsorted() finds the correct bin number; tolist() yields plain Python ints
for (bin_num, y_value) in zip(searchsorted(bin_limits, x, "right"), y.tolist()):
    points_by_bin[bin_num].append(y_value)
print(points_by_bin)  # [[10, 11], [20, 21, 22]]
```

Numpy's fast array operation `searchsorted()` is used for maximum efficiency. Values are then added one by one (since the final result is not a rectangular array, Numpy cannot help much with this). This solution should be faster than multiple `where()` calls in a loop, which force Numpy to re-read the *same* array many times.
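If the Python-level loop itself becomes the bottleneck, one possible alternative (a sketch, not the method described above) is to sort the values by bin index and cut the sorted array at the bin boundaries with `np.split`. It assumes every `x` falls below the last bin limit, as in the example data.

```python
import numpy as np

vals = np.array([[1, 10], [1, 11], [2, 20], [2, 21], [2, 22]])
x, y = vals.T
bin_limits = np.arange(x.min() + 1, x.max() + 2)

bin_idx = np.searchsorted(bin_limits, x, side="right")
order = np.argsort(bin_idx, kind="stable")    # keep original order within a bin
counts = np.bincount(bin_idx, minlength=len(bin_limits))

# Cut the sorted y values at the cumulative bin boundaries.
points_by_bin = np.split(y[order], np.cumsum(counts)[:-1])
print([group.tolist() for group in points_by_bin])
```

The result is a list of arrays of unequal length, one per bin, so the ragged shape is preserved without appending values one at a time.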

## Calculating a binned mean with SciPy: binned_statistic + handling NaNs (ValueError with SciPy and statistic=np.nanmean)

The answer to both questions 1 and 2 is to use `np.nanmean` to ignore the `nan`s in the data. The regression you link to was a bug that I inadvertently introduced and then fixed after it was raised. I'm not sure why you have SciPy 1.5.2 in your environment; 1.5.4 looks like the latest 1.5.x version, so you probably want to update the environment you're using. That said, the fix was backported to the 1.5.0 release series, so there should not be an issue on those versions if you have the latest.

Also, I set this up with scipy version 1.7.3 and that also works as intended for me. Here are snippets.

Version 1.5.4

```python
>>> import scipy
>>> scipy.__version__
'1.5.4'
>>> import scipy.stats, numpy as np
>>> x = [0.5, 0.5, 1.5, 1.5]
>>> values = [10, 20, np.nan, 40]
>>> scipy.stats.binned_statistic(x, values, statistic='mean', bins=(0, 1, 2)).statistic
array([15., nan])
>>> scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
array([15., 40.])
```

Version 1.7.3

```python
>>> import scipy
>>> scipy.__version__
'1.7.3'
>>> import scipy.stats, numpy as np
>>> x = [0.5, 0.5, 1.5, 1.5]
>>> values = [10, 20, np.nan, 40]
>>> scipy.stats.binned_statistic(x, values, statistic='mean', bins=(0, 1, 2)).statistic
array([15., nan])
>>> scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
array([15., 40.])
```

Version 1.5.2

```python
>>> import scipy
>>> scipy.__version__
'1.5.2'
>>> import scipy.stats, numpy as np
>>> x = [0.5, 0.5, 1.5, 1.5]
>>> values = [10, 20, np.nan, 40]
>>> scipy.stats.binned_statistic(x, values, statistic='mean', bins=(0, 1, 2)).statistic
array([15., nan])
>>> scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
array([15., 40.])
```

Please check `scipy.__version__` to confirm your version of SciPy. If it is 1.4.1, it does not have the change. I suspect that your version is 1.4.1 rather than one of the later versions. You can use the first lines of the code examples above to confirm which SciPy version your environment uses.
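If upgrading SciPy is not an option, a NaN-ignoring binned mean can also be computed with NumPy alone. The sketch below is a workaround built on `np.digitize` and `np.bincount`, not the `binned_statistic` implementation, and it assumes every `x` lies inside `[edges[0], edges[-1])`.

```python
import numpy as np

x = np.array([0.5, 0.5, 1.5, 1.5])
values = np.array([10, 20, np.nan, 40])
edges = np.array([0, 1, 2])

keep = ~np.isnan(values)                      # drop NaN samples up front
idx = np.digitize(x[keep], edges) - 1         # 0-based bin index
n_bins = len(edges) - 1

total = np.bincount(idx, weights=values[keep], minlength=n_bins)
count = np.bincount(idx, minlength=n_bins)

# Bins with no valid samples stay NaN.
binned_nanmean = np.full(n_bins, np.nan)
np.divide(total, count, out=binned_nanmean, where=count > 0)
print(binned_nanmean)
```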

## Binning pandas/numpy array in unequal sizes with approx equal computational cost

I think a good approach has been found. Credits to a colleague.

The idea is to sort the group sizes (in descending order) and put groups into bins in a "backward S"-pattern. Let me illustrate with an example. Assume `n = 3` (number of bins) and the following data:

```
groups  data
0        359
1        326
2        264
3        262
4        249
5        248
6        245
7        189
8        187
9        153
10        45
```

The idea is to put one group in one bin at a time, sweeping back and forth across the bins in a "backward S"-pattern. First element in bin 0, second element in bin 1, etc. Then go backwards when reaching the last bin: fourth element in bin 2, fifth element in bin 1, etc. See below how the elements are put into bins, with the group number in parentheses. The values are the group sizes.

```
Bins:  |    0    |    1    |    2    |
       | 359 (0) | 326 (1) | 264 (2) |
       | 248 (5) | 249 (4) | 262 (3) |
       | 245 (6) | 189 (7) | 187 (8) |
       |         |  45 (10)| 153 (9) |
```

The bins will have approximately the same number of values and, thus, approximately the same computational "cost". The bin sizes are `[852, 809, 866]`, for anyone interested. I have tried this on a real-world dataset and the bins are of similar sizes. It is not guaranteed that bins will be of similar size for all datasets.

The code can be made more efficient, but this is sufficient to get the idea out:

```python
import numpy as np
import pandas as pd

n = 3  # number of bins
size = 50
rng = np.random.default_rng(2021)

df = pd.DataFrame({
    "one": np.linspace(0, 10, size, dtype=np.uint8),
    "two": np.linspace(0, 5, size, dtype=np.uint8),
    "data": rng.integers(0, 100, size)
})

# Group sizes, largest first
groups = df.groupby(["one", "two"]).sum()
groups = groups.sort_values("data", ascending=False).reset_index(drop=True)

bins = [[] for i in range(n)]
backward = False
i = 0
for group in groups.iterrows():  # each item is an (index, row) tuple
    bins[i].append(group)
    i = i + 1 if not backward else i - 1
    # Turn around at either end, revisiting the end bin once
    if i == n:
        backward = True
        i -= 1
    if i == -1 and backward:
        backward = False
        i += 1

# Total size of each bin
[sum([row["data"] for (group, row) in bin]) for bin in bins]
```
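The same pattern can be checked against the worked example above without Pandas. In this sketch `snake_bins` is a hypothetical helper name, and the modular arithmetic is just one way to express the back-and-forth sweep; it is not the original author's code.

```python
def snake_bins(sizes, n):
    """Assign item indices to n bins in a back-and-forth ("backward S") order."""
    # Visit items from largest to smallest.
    order = sorted(range(len(sizes)), key=lambda k: sizes[k], reverse=True)
    bins = [[] for _ in range(n)]
    for rank, item in enumerate(order):
        pos = rank % (2 * n)                  # position within one full sweep
        target = pos if pos < n else 2 * n - 1 - pos
        bins[target].append(item)
    return bins

sizes = [359, 326, 264, 262, 249, 248, 245, 189, 187, 153, 45]
bins = snake_bins(sizes, 3)
totals = [sum(sizes[k] for k in members) for members in bins]
print(bins)     # group numbers per bin
print(totals)   # [852, 809, 866]
```

The totals match the bin sizes quoted in the example, which gives some confidence that the sweep order is the one illustrated in the table.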

## Binning data into non-grid based bins

```python
out = X*np.nan
for i in range(max(b)):
    out[i] = np.nanmean(f[x_grid[b==i], y_grid[b==i]])
```

can be replaced by two calls to `np.bincount`:

```python
total = np.bincount(b, weights=f[x_grid, y_grid], minlength=len(X))
count = np.bincount(b, minlength=len(X))
out = total/count
```

or one call to `stats.binned_statistic`:

```python
out, bin_edges, binnumber = stats.binned_statistic(
    x=b, values=f[x_grid, y_grid], statistic='mean', bins=np.arange(len(X)+1))
```

For example,

```python
import numpy as np
from scipy.spatial import KDTree
import scipy.stats as stats

np.random.seed(2017)

def rebin(f, X, Y):
    s = f.shape
    x_grid = np.arange(s[0])
    y_grid = np.arange(s[1])
    x_grid, y_grid = np.meshgrid(x_grid,y_grid)
    x_grid, y_grid = x_grid.flatten(), y_grid.flatten()

    tree = KDTree(np.column_stack((X,Y)))
    _, b = tree.query(np.column_stack((x_grid, y_grid)))

    out, bin_edges, binnumber = stats.binned_statistic(
        x=b, values=f[x_grid, y_grid], statistic='mean', bins=np.arange(len(X)+1))
    # total = np.bincount(b, weights=f[x_grid, y_grid], minlength=len(X))
    # count = np.bincount(b, minlength=len(X))
    # out = total/count
    return out

def orig(f, X, Y):
    s = f.shape
    x_grid = np.arange(s[0])
    y_grid = np.arange(s[1])
    x_grid, y_grid = np.meshgrid(x_grid,y_grid)
    x_grid, y_grid = x_grid.flatten(), y_grid.flatten()

    tree = KDTree(np.column_stack((X,Y)))
    _, b = tree.query(np.column_stack((x_grid, y_grid)))

    out = X*np.nan
    for i in range(len(X)):
        out[i] = np.nanmean(f[x_grid[b==i], y_grid[b==i]])
    return out

N = 100
X, Y = np.random.random((2, N))
f = np.random.random((N, N))

expected = orig(f, X, Y)
result = rebin(f, X, Y)
print(np.allclose(expected, result, equal_nan=True))
# True
```

## Averaging Data in Bins

One way is to use `numpy.digitize` to bin your categories.

Then use a dictionary or list comprehension to calculate results.

```python
import numpy as np

chl = np.array([0.4,0.1,0.04,0.05,0.4,0.2,0.6,0.09,0.23,0.43,0.65,0.22,0.12,0.2,0.33])
depth = np.array([0.1,0.3,0.31,0.44,0.49,1.1,1.145,1.33,1.49,1.53,1.67,1.79,1.87,2.1,2.3])
bins = np.array([0,0.5,1.0,1.5,2.0,2.5])

A = np.vstack((np.digitize(depth, bins), chl)).T
res = {bins[int(i)]: np.mean(A[A[:, 0] == i, 1]) for i in np.unique(A[:, 0])}

# {0.5: 0.198, 1.5: 0.28, 2.0: 0.355, 2.5: 0.265}
```

Or for the precise format you are after:

```python
res_lst = [np.mean(A[A[:, 0] == i, 1]) for i in range(len(bins))]

# [nan, 0.198, nan, 0.28, 0.355, 0.265]
```
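For comparison, the same per-bin averages can be sketched with two `np.histogram` calls, one weighted by `chl` and one plain. This is an alternative to the `digitize` approach above, not a replacement for it. It returns one value per bin (five here, with no leading entry for values below the first edge), and the guarded division is an addition that leaves empty bins as NaN.

```python
import numpy as np

chl = np.array([0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09,
                0.23, 0.43, 0.65, 0.22, 0.12, 0.2, 0.33])
depth = np.array([0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33,
                  1.49, 1.53, 1.67, 1.79, 1.87, 2.1, 2.3])
bins = np.array([0, 0.5, 1.0, 1.5, 2.0, 2.5])

sums = np.histogram(depth, bins, weights=chl)[0]    # sum of chl per depth bin
counts = np.histogram(depth, bins)[0]               # samples per depth bin

# Divide only where a bin has samples; empty bins stay NaN.
bin_mean = np.full(len(bins) - 1, np.nan)
np.divide(sums, counts, out=bin_mean, where=counts > 0)
print(bin_mean)
```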
