Can Pandas Groupby Aggregate into a List, Rather Than Sum, Mean, etc

Can pandas groupby aggregate into a list, rather than sum, mean, etc?

my solution is a bit longer than you may expect, I'm sure it could be shortened, but:

g = df.groupby("A").apply(lambda x: pd.concat((x["B"], x["C"])))
k = g.reset_index()
k["i"] = k1.index
k["rn"] = k1.groupby("A")["i"].rank()
k.pivot_table(rows="A", cols="rn", values=0)

# output
# rn 1 2 3 4 5 6
# A
# 1 10 12 11 22 20 8
# 2 10 11 10 13 NaN NaN
# 3 14 10 NaN NaN NaN NaN

A bit of explanation. First line, g = df.groupby("A").apply(lambda x: pd.concat((x["B"], x["C"]))). This one group df by A and then put columns B and C into one column:

A   
1 0 10
1 12
2 11
0 22
1 20
2 8
2 3 10
4 11
3 10
4 13
3 5 14
5 10

Then k = g.reset_index(), creating sequential index, result is:

    A  level_1   0
0 1 0 10
1 1 1 12
2 1 2 11
3 1 0 22
4 1 1 20
5 1 2 8
6 2 3 10
7 2 4 11
8 2 3 10
9 2 4 13
10 3 5 14
11 3 5 10

Now I want to move this index into column (I'd like to hear how I can make a sequential column without resetting index), k["i"] = k1.index:

    A  level_1   0   i
0 1 0 10 0
1 1 1 12 1
2 1 2 11 2
3 1 0 22 3
4 1 1 20 4
5 1 2 8 5
6 2 3 10 6
7 2 4 11 7
8 2 3 10 8
9 2 4 13 9
10 3 5 14 10
11 3 5 10 11

Now, k["rn"] = k1.groupby("A")["i"].rank() will add row_number inside each A (like row_number() over(partition by A order by i) in SQL:

    A  level_1   0   i  rn
0 1 0 10 0 1
1 1 1 12 1 2
2 1 2 11 2 3
3 1 0 22 3 4
4 1 1 20 4 5
5 1 2 8 5 6
6 2 3 10 6 1
7 2 4 11 7 2
8 2 3 10 8 3
9 2 4 13 9 4
10 3 5 14 10 1
11 3 5 10 11 2

And finally, just pivoting with k.pivot_table(rows="A", cols="rn", values=0):

rn   1   2   3   4   5   6
A
1 10 12 11 22 20 8
2 10 11 10 13 NaN NaN
3 14 10 NaN NaN NaN NaN

Multiple aggregations of the same column using pandas GroupBy.agg()

As of 2022-06-20, the below is the accepted practice for aggregations:

df.groupby('dummy').agg(
Mean=('returns', np.mean),
Sum=('returns', np.sum))

Below the fold included for historical versions of pandas.

You can simply pass the functions as a list:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:
mean sum
dummy
1 0.036901 0.369012

or as a dictionary:

In [21]: df.groupby('dummy').agg({'returns':
{'Mean': np.mean, 'Sum': np.sum}})
Out[21]:
returns
Mean Sum
dummy
1 0.036901 0.369012

pandas groupby aggregate element-wise list addition

It's possible to use apply on the grouped dataframe to get element-wise addition:

df.groupby('X')['Y'].apply(lambda x: [sum(y) for y in zip(*x)])

Which results in a pandas series object:

X
57674 [54.0, 114.0, 124.0, 103.0]
71455 [337.0, 327.0, 311.0, 333.0]

Group dataframe and get sum AND count?

try this:

In [110]: (df.groupby('Company Name')
.....: .agg({'Organisation Name':'count', 'Amount': 'sum'})
.....: .reset_index()
.....: .rename(columns={'Organisation Name':'Organisation Count'})
.....: )
Out[110]:
Company Name Amount Organisation Count
0 Vifor Pharma UK Ltd 4207.93 5

or if you don't want to reset index:

df.groupby('Company Name')['Amount'].agg(['sum','count'])

or

df.groupby('Company Name').agg({'Amount': ['sum','count']})

Demo:

In [98]: df.groupby('Company Name')['Amount'].agg(['sum','count'])
Out[98]:
sum count
Company Name
Vifor Pharma UK Ltd 4207.93 5

In [99]: df.groupby('Company Name').agg({'Amount': ['sum','count']})
Out[99]:
Amount
sum count
Company Name
Vifor Pharma UK Ltd 4207.93 5

Groupby with both sum() and mean()

Use DataFrameGroupBy.agg with a dictionary:

df.groupby([pd.Grouper(freq="60min", key="ds"), "name"]).agg({'val1': 'mean', 'val2': sum, 'val3': 'mean'})

How can I aggregate (sum, mean, etc.) values and create a new Pandas dataframe based on that?

This operation can be done by;

agg_col = "Year"

new_df = df.groupby(by=agg_col, as_index=False).agg({"Count1": "sum", "Count2": "sum"})

And you can change agg_col to Month if you want to group by month.

When using Pandas .groupby, why use .agg versus directly using the function eg .sum()

Setup

df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

The primary benefit of using agg is stated in the docs:

Aggregate using one or more operations over the specified axis.

If you have separate operations that need to be applied to each individual column, agg takes a dictionary (or a function, string, or list of strings/functions) that allows you to create that mapping in a single statement. So if you'd like the sum of column a, and the mean of column b:

df.agg({'a': 'sum', 'b': 'mean'})

a 6.0
b 5.0
dtype: float64

It also allows you to apply multiple operations to a single column in a single statement. For example, to find the sum, mean, and std of column a:

df.agg({'a': ['sum', 'mean', 'std']})

a
sum 6.0
mean 2.0
std 1.0

There's no difference in outcome when you use agg with a single operation. I'd argue that df.agg('sum') is less clear than df.sum(), but the results will be the same:

df.agg('sum')

a 6
b 15
dtype: int64

df.sum()

a 6
b 15
dtype: int64

The main benefit agg provides is the convenience of applying multiple operations.



Related Topics



Leave a reply



Submit