Can pandas groupby aggregate into a list, rather than sum, mean, etc?
my solution is a bit longer than you may expect, I'm sure it could be shortened, but:
g = df.groupby("A").apply(lambda x: pd.concat((x["B"], x["C"])))
k = g.reset_index()
k["i"] = k1.index
k["rn"] = k1.groupby("A")["i"].rank()
k.pivot_table(rows="A", cols="rn", values=0)
# output
# rn 1 2 3 4 5 6
# A
# 1 10 12 11 22 20 8
# 2 10 11 10 13 NaN NaN
# 3 14 10 NaN NaN NaN NaN
A bit of explanation. First line, g = df.groupby("A").apply(lambda x: pd.concat((x["B"], x["C"])))
. This one group df
by A
and then put columns B
and C
into one column:
A
1 0 10
1 12
2 11
0 22
1 20
2 8
2 3 10
4 11
3 10
4 13
3 5 14
5 10
Then k = g.reset_index()
, creating sequential index, result is:
A level_1 0
0 1 0 10
1 1 1 12
2 1 2 11
3 1 0 22
4 1 1 20
5 1 2 8
6 2 3 10
7 2 4 11
8 2 3 10
9 2 4 13
10 3 5 14
11 3 5 10
Now I want to move this index into column (I'd like to hear how I can make a sequential column without resetting index), k["i"] = k1.index
:
A level_1 0 i
0 1 0 10 0
1 1 1 12 1
2 1 2 11 2
3 1 0 22 3
4 1 1 20 4
5 1 2 8 5
6 2 3 10 6
7 2 4 11 7
8 2 3 10 8
9 2 4 13 9
10 3 5 14 10
11 3 5 10 11
Now, k["rn"] = k1.groupby("A")["i"].rank()
will add row_number inside each A
(like row_number() over(partition by A order by i)
in SQL:
A level_1 0 i rn
0 1 0 10 0 1
1 1 1 12 1 2
2 1 2 11 2 3
3 1 0 22 3 4
4 1 1 20 4 5
5 1 2 8 5 6
6 2 3 10 6 1
7 2 4 11 7 2
8 2 3 10 8 3
9 2 4 13 9 4
10 3 5 14 10 1
11 3 5 10 11 2
And finally, just pivoting with k.pivot_table(rows="A", cols="rn", values=0)
:
rn 1 2 3 4 5 6
A
1 10 12 11 22 20 8
2 10 11 10 13 NaN NaN
3 14 10 NaN NaN NaN NaN
Multiple aggregations of the same column using pandas GroupBy.agg()
As of 2022-06-20, the below is the accepted practice for aggregations:
df.groupby('dummy').agg(
Mean=('returns', np.mean),
Sum=('returns', np.sum))
Below the fold included for historical versions of pandas
.
You can simply pass the functions as a list:
In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:
mean sum
dummy
1 0.036901 0.369012
or as a dictionary:
In [21]: df.groupby('dummy').agg({'returns':
{'Mean': np.mean, 'Sum': np.sum}})
Out[21]:
returns
Mean Sum
dummy
1 0.036901 0.369012
pandas groupby aggregate element-wise list addition
It's possible to use apply
on the grouped dataframe to get element-wise addition:
df.groupby('X')['Y'].apply(lambda x: [sum(y) for y in zip(*x)])
Which results in a pandas series object:
X
57674 [54.0, 114.0, 124.0, 103.0]
71455 [337.0, 327.0, 311.0, 333.0]
Group dataframe and get sum AND count?
try this:
In [110]: (df.groupby('Company Name')
.....: .agg({'Organisation Name':'count', 'Amount': 'sum'})
.....: .reset_index()
.....: .rename(columns={'Organisation Name':'Organisation Count'})
.....: )
Out[110]:
Company Name Amount Organisation Count
0 Vifor Pharma UK Ltd 4207.93 5
or if you don't want to reset index:
df.groupby('Company Name')['Amount'].agg(['sum','count'])
or
df.groupby('Company Name').agg({'Amount': ['sum','count']})
Demo:
In [98]: df.groupby('Company Name')['Amount'].agg(['sum','count'])
Out[98]:
sum count
Company Name
Vifor Pharma UK Ltd 4207.93 5
In [99]: df.groupby('Company Name').agg({'Amount': ['sum','count']})
Out[99]:
Amount
sum count
Company Name
Vifor Pharma UK Ltd 4207.93 5
Groupby with both sum() and mean()
Use DataFrameGroupBy.agg
with a dictionary:
df.groupby([pd.Grouper(freq="60min", key="ds"), "name"]).agg({'val1': 'mean', 'val2': sum, 'val3': 'mean'})
How can I aggregate (sum, mean, etc.) values and create a new Pandas dataframe based on that?
This operation can be done by;
agg_col = "Year"
new_df = df.groupby(by=agg_col, as_index=False).agg({"Count1": "sum", "Count2": "sum"})
And you can change agg_col
to Month
if you want to group by month.
When using Pandas .groupby, why use .agg versus directly using the function eg .sum()
Setup
df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
The primary benefit of using agg
is stated in the docs:
Aggregate using one or more operations over the specified axis.
If you have separate operations that need to be applied to each individual column, agg
takes a dictionary (or a function, string, or list of strings/functions) that allows you to create that mapping in a single statement. So if you'd like the sum
of column a
, and the mean
of column b
:
df.agg({'a': 'sum', 'b': 'mean'})
a 6.0
b 5.0
dtype: float64
It also allows you to apply multiple operations to a single column in a single statement. For example, to find the sum
, mean
, and std
of column a
:
df.agg({'a': ['sum', 'mean', 'std']})
a
sum 6.0
mean 2.0
std 1.0
There's no difference in outcome when you use agg
with a single operation. I'd argue that df.agg('sum')
is less clear than df.sum()
, but the results will be the same:
df.agg('sum')
a 6
b 15
dtype: int64
df.sum()
a 6
b 15
dtype: int64
The main benefit agg
provides is the convenience of applying multiple operations.
Related Topics
Remove Reverse Duplicates from Dataframe
Why Is the Borg Pattern Better Than the Singleton Pattern in Python
Pythonic Way to Create Union of All Values Contained in Multiple Lists
How Come a File Doesn't Get Written Until I Stop the Program
Converting "Yield From" Statement to Python 2.7 Code
<Django Object > Is Not JSON Serializable
Convert Words Between Verb/Noun/Adjective Forms
Calling Dot Products and Linear Algebra Operations in Cython
How to Add Conda Environment to Jupyter Lab
Schedule a Repeating Event in Python 3
How to Import a Text File on Aws S3 into Pandas Without Writing to Disk
Python:2D Contour Plot from 3 Lists:X, Y and Rho
Python: Mocking a Context Manager