Is There Any Numpy Group by Function

Vectorized groupby with NumPy

If you want a more flexible implementation of groupby that can group using any of numpy's ufuncs:

def groupby_np(X, groups, axis = 0, uf = np.add, out = None, minlength = 0, identity = None):
    if minlength < groups.max() + 1:
        minlength = groups.max() + 1
    if identity is None:
        identity = uf.identity
    i = list(range(X.ndim))
    del i[axis]
    i = tuple(i)
    n = out is None
    if n:
        if identity is None:  # fallback to loops over 0-index for identity
            assert np.all(np.in1d(np.arange(minlength), groups)), "No valid identity for unassinged groups"
            s = [slice(None)] * X.ndim
            for i_ in i:
                s[i_] = 0
            out = np.array([uf.reduce(X[tuple(s)][groups == i]) for i in range(minlength)])
        else:
            out = np.full((minlength,), identity, dtype = X.dtype)
    uf.at(out, groups, uf.reduce(X, i))
    if n:
        return out

groupby_np(X, groups)
array([15, 30])

groupby_np(X, groups, uf = np.multiply)
array([   0, 3024])

groupby_np(X, groups, uf = np.maximum)
array([5, 9])

groupby_np(X, groups, uf = np.minimum)
array([0, 6])

NumPy apply function to groups of rows corresponding to another numpy array

The problem is that the groups of rows with the same Z can have different sizes so you cannot stack them into one 3D numpy array which would allow to easily apply a function along the third dimension. One solution is to use a for-loop, another is to use np.split:

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])


a_sorted = a[a[:,2].argsort()]

inds = np.unique(a_sorted[:,2], return_index=True)[1]

a_split = np.split(a_sorted, inds)[1:]

# [array([[0, 0, 1],
#         [4, 5, 1],
#         [4, 3, 1]]),

#  array([[1, 1, 2],
#         [4, 5, 2]])]

f = np.sum  # example of a function

result = list(map(f, a_split))
# [19, 15]

~~But imho the best solution is to use pandas and groupby as suggested by FBruzzesi. You can then convert the result to a numpy array.~~

EDIT: For completeness, here are the other two solutions

List comprehension:

b = np.unique(a[:,2])
result = [f(a[a[:,2] == z]) for z in b]

Pandas:

df = pd.DataFrame(a, columns=list('XYZ'))
result = df.groupby(['Z']).apply(lambda x: f(x.values)).tolist()

This is the performance plot I got for a = np.random.randint(0, 100, (n, 3)):

enter image description here

As you can see, approximately up to n = 10^5 the "split solution" is the fastest, but after that the pandas solution performs better.

Get mean of numpy array using pandas groupby

Mean it twice, one at array level and once at group level:

df['data'].map(np.mean).groupby(df['id']).mean().reset_index()

   id      data
0   1  0.278333
1   2  0.596667
2   3  0.298889
3   4  0.241667

Based on comment, you can do:

pd.DataFrame(df['data'].tolist(),index=df['id']).mean(level=0).agg(np.array,1)

id
1                                 [0.42, 0.215, 0.2]
2                 [0.86, 0.635, 0.29500000000000004]
3    [0.3433333333333333, 0.29, 0.26333333333333336]
4                                 [0.31, 0.315, 0.1]
dtype: object

Or:

df.groupby("id")['data'].apply(np.mean)

Applying a custom groupby aggregate function to find average of Numpy Array

What you need is possible with convert values to 2d array and then using np.mean:

f = lambda x: np.mean(np.array(x.tolist()), axis=0)
df2 = df.groupby('C')['B'].apply(f).reset_index()
print (df2)
   C                     B
0  X  [1.5, 2.5, 4.0, 5.0]
1  Y  [2.0, 3.0, 4.0, 4.0]
2  Z  [2.0, 3.0, 5.0, 6.0]

Last option solution is possible, but less effient (thank you @Abhik Sarkar for test):

df1 = pd.DataFrame(df.B.tolist()).groupby(df['C']).mean()
df2 = pd.DataFrame({'B': df1.values.tolist(), 'C': df1.index})
print (df2)
                      B  C
0  [1.5, 2.5, 4.0, 5.0]  X
1  [2.0, 3.0, 4.0, 4.0]  Y
2  [2.0, 3.0, 5.0, 6.0]  Z

numpy group by, returning original indexes sorted by the result

There is sadly no group-by in Numpy, but you can use np.unique to find the unique elements and their index which is enough to implement what you need. One the keys as been identified, you can perform a key-based reduction using np.add.at. For the sort by value, you can use np.argsort. See this post and this one for more information.

keys, index = np.unique(df[:,0], return_inverse=True) # Find the unique key to group
values = np.zeros(len(keys), dtype=np.int64)          # Sum-based accumulator
np.add.at(values, index, df[:,1])                     # Key-based accumulation
tmp = np.hstack([keys[:,None], values[:,None]])       # Build the key-sum 2D array
res = tmp[tmp[:, 1].argsort()[::-1]]                  # Sort by value

Note that the index can be easily obtained from the index variable (which is a reversed index). There is no way to build it with Numpy but this is possible using a simple python loop accumulating the index i in lists stored in a dictionary for in each key keys[index[i]]. Here is an example:

from collections import defaultdict
d = defaultdict(list)
for i in range(len(df)): d[keys[index[i]]].append(i)

Group by max or min in a numpy array

I've been seeing some very similar questions on stack overflow the last few days. The following code is very similar to the implementation of numpy.unique and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a python loop.

import numpy as np
def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order] # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

#max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order] #this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]

Product of array elements by group in numpy (Python)

If you groups are already sorted (if they are not you can do that with np.argsort), you can do this using the reduceat functionality to ufuncs (if they are not sorted, you would have to sort them first to do it efficiently):

# you could do the group_changes somewhat faster if you care a lot
group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
Vprods = np.multiply.reduceat(values, group_changes)

Or mgilson answer if you have few groups. But if you have many groups, then this is much more efficient. Since you avoid boolean indices for every element in the original array for every group. Plus you avoid slicing in a python loop with reduceat.

Of course pandas does these operations conveniently.

Edit: Sorry had prod in there. The ufunc is multiply. You can use this method for any binary ufunc. This means it works for basically all numpy functions that can work element wise on two input arrays. (ie. multiply normally multiplies two arrays elementwise, add adds them, maximum/minimum, etc. etc.)

Using pandas groupby and numpy where together in Python

Use GroupBy.transform with mean of boolean mask, so get Series with same size like original, so possible pass to np.where for new column:

df = pd.DataFrame({
        'Occupation':list('dddeee'),
         'Emp_Code':list('aabbcc'),
         'Gender':list('MFMFMF')
})
print (df)
  Occupation Emp_Code Gender
0          d        a      M
1          d        a      F
2          d        b      M
3          e        b      F
4          e        c      M
5          e        c      F

m = df['Gender'].eq('M')
df['new'] = np.where(m, m.groupby(df['Occupation']).transform('mean').mul(100), 0)
print (df)
  Occupation Emp_Code Gender        new
0          d        a      M  66.666667
1          d        a      F   0.000000
2          d        b      M  66.666667
3          e        b      F   0.000000
4          e        c      M  33.333333
5          e        c      F   0.000000

If want new DataFrame filled by normlize values one possible solution with crosstab and normalize parameter:

df2 = pd.crosstab(df['Occupation'], df['Gender'],normalize='index')
print (df2)
Gender             F         M
Occupation                    
d           0.333333  0.666667
e           0.666667  0.333333