Vectorized groupby with NumPy
If you want a more flexible implementation of groupby
that can group using any of numpy
's ufunc
s:
def groupby_np(X, groups, axis = 0, uf = np.add, out = None, minlength = 0, identity = None):
if minlength < groups.max() + 1:
minlength = groups.max() + 1
if identity is None:
identity = uf.identity
i = list(range(X.ndim))
del i[axis]
i = tuple(i)
n = out is None
if n:
if identity is None: # fallback to loops over 0-index for identity
assert np.all(np.in1d(np.arange(minlength), groups)), "No valid identity for unassinged groups"
s = [slice(None)] * X.ndim
for i_ in i:
s[i_] = 0
out = np.array([uf.reduce(X[tuple(s)][groups == i]) for i in range(minlength)])
else:
out = np.full((minlength,), identity, dtype = X.dtype)
uf.at(out, groups, uf.reduce(X, i))
if n:
return out
groupby_np(X, groups)
array([15, 30])
groupby_np(X, groups, uf = np.multiply)
array([ 0, 3024])
groupby_np(X, groups, uf = np.maximum)
array([5, 9])
groupby_np(X, groups, uf = np.minimum)
array([0, 6])
NumPy apply function to groups of rows corresponding to another numpy array
The problem is that the groups of rows with the same Z can have different sizes so you cannot stack them into one 3D numpy array which would allow to easily apply a function along the third dimension. One solution is to use a for-loop, another is to use np.split
:
a = np.array([[0, 0, 1],
[1, 1, 2],
[4, 5, 1],
[4, 5, 2],
[4, 3, 1]])
a_sorted = a[a[:,2].argsort()]
inds = np.unique(a_sorted[:,2], return_index=True)[1]
a_split = np.split(a_sorted, inds)[1:]
# [array([[0, 0, 1],
# [4, 5, 1],
# [4, 3, 1]]),
# array([[1, 1, 2],
# [4, 5, 2]])]
f = np.sum # example of a function
result = list(map(f, a_split))
# [19, 15]
But imho the best solution is to use pandas and groupby as suggested by FBruzzesi. You can then convert the result to a numpy array.
EDIT: For completeness, here are the other two solutions
List comprehension:
b = np.unique(a[:,2])
result = [f(a[a[:,2] == z]) for z in b]
Pandas:
df = pd.DataFrame(a, columns=list('XYZ'))
result = df.groupby(['Z']).apply(lambda x: f(x.values)).tolist()
This is the performance plot I got for a = np.random.randint(0, 100, (n, 3))
:
As you can see, approximately up to n = 10^5
the "split solution" is the fastest, but after that the pandas solution performs better.
Get mean of numpy array using pandas groupby
Mean it twice, one at array level and once at group level:
df['data'].map(np.mean).groupby(df['id']).mean().reset_index()
id data
0 1 0.278333
1 2 0.596667
2 3 0.298889
3 4 0.241667
Based on comment, you can do:
pd.DataFrame(df['data'].tolist(),index=df['id']).mean(level=0).agg(np.array,1)
id
1 [0.42, 0.215, 0.2]
2 [0.86, 0.635, 0.29500000000000004]
3 [0.3433333333333333, 0.29, 0.26333333333333336]
4 [0.31, 0.315, 0.1]
dtype: object
Or:
df.groupby("id")['data'].apply(np.mean)
Applying a custom groupby aggregate function to find average of Numpy Array
What you need is possible with convert values to 2d array and then using np.mean
:
f = lambda x: np.mean(np.array(x.tolist()), axis=0)
df2 = df.groupby('C')['B'].apply(f).reset_index()
print (df2)
C B
0 X [1.5, 2.5, 4.0, 5.0]
1 Y [2.0, 3.0, 4.0, 4.0]
2 Z [2.0, 3.0, 5.0, 6.0]
Last option solution is possible, but less effient (thank you @Abhik Sarkar for test):
df1 = pd.DataFrame(df.B.tolist()).groupby(df['C']).mean()
df2 = pd.DataFrame({'B': df1.values.tolist(), 'C': df1.index})
print (df2)
B C
0 [1.5, 2.5, 4.0, 5.0] X
1 [2.0, 3.0, 4.0, 4.0] Y
2 [2.0, 3.0, 5.0, 6.0] Z
numpy group by, returning original indexes sorted by the result
There is sadly no group-by in Numpy, but you can use np.unique
to find the unique elements and their index which is enough to implement what you need. One the keys as been identified, you can perform a key-based reduction using np.add.at
. For the sort by value, you can use np.argsort
. See this post and this one for more information.
keys, index = np.unique(df[:,0], return_inverse=True) # Find the unique key to group
values = np.zeros(len(keys), dtype=np.int64) # Sum-based accumulator
np.add.at(values, index, df[:,1]) # Key-based accumulation
tmp = np.hstack([keys[:,None], values[:,None]]) # Build the key-sum 2D array
res = tmp[tmp[:, 1].argsort()[::-1]] # Sort by value
Note that the index can be easily obtained from the index
variable (which is a reversed index). There is no way to build it with Numpy but this is possible using a simple python loop accumulating the index i
in lists stored in a dictionary for in each key keys[index[i]]
. Here is an example:
from collections import defaultdict
d = defaultdict(list)
for i in range(len(df)): d[keys[index[i]]].append(i)
Group by max or min in a numpy array
I've been seeing some very similar questions on stack overflow the last few days. The following code is very similar to the implementation of numpy.unique and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a python loop.
import numpy as np
def group_min(groups, data):
# sort with major key groups, minor key data
order = np.lexsort((data, groups))
groups = groups[order] # this is only needed if groups is unsorted
data = data[order]
# construct an index which marks borders between groups
index = np.empty(len(groups), 'bool')
index[0] = True
index[1:] = groups[1:] != groups[:-1]
return data[index]
#max is very similar
def group_max(groups, data):
order = np.lexsort((data, groups))
groups = groups[order] #this is only needed if groups is unsorted
data = data[order]
index = np.empty(len(groups), 'bool')
index[-1] = True
index[:-1] = groups[1:] != groups[:-1]
return data[index]
Product of array elements by group in numpy (Python)
If you groups are already sorted (if they are not you can do that with np.argsort
), you can do this using the reduceat
functionality to ufunc
s (if they are not sorted, you would have to sort them first to do it efficiently):
# you could do the group_changes somewhat faster if you care a lot
group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
Vprods = np.multiply.reduceat(values, group_changes)
Or mgilson answer if you have few groups. But if you have many groups, then this is much more efficient. Since you avoid boolean indices for every element in the original array for every group. Plus you avoid slicing in a python loop with reduceat.
Of course pandas does these operations conveniently.
Edit: Sorry had prod
in there. The ufunc is multiply
. You can use this method for any binary ufunc
. This means it works for basically all numpy functions that can work element wise on two input arrays. (ie. multiply normally multiplies two arrays elementwise, add adds them, maximum/minimum, etc. etc.)
Using pandas groupby and numpy where together in Python
Use GroupBy.transform
with mean of boolean mask, so get Series
with same size like original, so possible pass to np.where
for new column:
df = pd.DataFrame({
'Occupation':list('dddeee'),
'Emp_Code':list('aabbcc'),
'Gender':list('MFMFMF')
})
print (df)
Occupation Emp_Code Gender
0 d a M
1 d a F
2 d b M
3 e b F
4 e c M
5 e c F
m = df['Gender'].eq('M')
df['new'] = np.where(m, m.groupby(df['Occupation']).transform('mean').mul(100), 0)
print (df)
Occupation Emp_Code Gender new
0 d a M 66.666667
1 d a F 0.000000
2 d b M 66.666667
3 e b F 0.000000
4 e c M 33.333333
5 e c F 0.000000
If want new DataFrame filled by normlize values one possible solution with crosstab
and normalize
parameter:
df2 = pd.crosstab(df['Occupation'], df['Gender'],normalize='index')
print (df2)
Gender F M
Occupation
d 0.333333 0.666667
e 0.666667 0.333333
Related Topics
Find Matching Rows in 2 Dimensional Numpy Array
How to Write 2 Lists of Items in 2 Columns Instead of 2 Arrays
Unpivot Multiple Columns With Same Name in Pandas Dataframe
Drop Rows Containing Empty Cells from a Pandas Dataframe
Import Local Module in Jupyter Notebook
How to Decompile a Compiled .Pyc File into a .Py File
How to Create Dynamic Workflows in Airflow
Pyspark Data Frame Converting False and True to 0 and 1
How to Make Tkinter Frames in a Loop and Update Object Values
How to Check for an Exact Word in a String in Python
Finding an Exact Substring in a String in Python
How to Display the Value of the Bar on Each Bar With Pyplot.Barh()
Get All the Diagonals in a Matrix/List of Lists in Python
Matrix Flip Horizontal or Vertical
Why Is This Python Script With Matplotlib So Slow
Pandas | Merge Rows With Same Id
Scheduling Python Script to Run Every Hour Accurately
Accuracy Score Valueerror: Can't Handle Mix of Binary and Continuous Target