Naming returned columns in Pandas aggregate function?
For pandas >= 0.25
The functionality to name returned aggregate columns was reintroduced in pandas 0.25. The new syntax is .agg(new_col_name=('col_name', 'agg_func')). Detailed example from the PR linked above:
In [2]: df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
...: 'height': [9.1, 6.0, 9.5, 34.0],
...: 'weight': [7.9, 7.5, 9.9, 198.0]})
...:
In [3]: df
Out[3]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
In [4]: df.groupby('kind').agg(min_height=('height', 'min'),
max_weight=('weight', 'max'))
Out[4]:
min_height max_weight
kind
cat 9.1 9.9
dog 6.0 198.0
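The tuples in the example above can also be spelled out with pd.NamedAgg, which makes the column/function roles explicit. A minimal sketch, assuming pandas 0.25 or later; the variable names with_tuples and with_namedagg are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "kind": ["cat", "dog", "cat", "dog"],
    "height": [9.1, 6.0, 9.5, 34.0],
    "weight": [7.9, 7.5, 9.9, 198.0],
})

# Plain (column, aggfunc) tuples, as in the example above
with_tuples = df.groupby("kind").agg(
    min_height=("height", "min"),
    max_weight=("weight", "max"),
)

# The same request written with the pd.NamedAgg namedtuple
with_namedagg = df.groupby("kind").agg(
    min_height=pd.NamedAgg(column="height", aggfunc="min"),
    max_weight=pd.NamedAgg(column="weight", aggfunc="max"),
)
```

Both spellings should produce the same frame; the tuple form is shorter, while pd.NamedAgg is arguably easier to read in long aggregations.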
It will also be possible to use multiple lambda expressions with this syntax and the two-step rename syntax I suggested earlier (below) as per this PR. Again, copying from the example in the PR:
In [2]: df = pd.DataFrame({"A": ['a', 'a'], 'B': [1, 2], 'C': [3, 4]})
In [3]: df.groupby("A").agg({'B': [lambda x: 0, lambda x: 1]})
Out[3]:
B
<lambda> <lambda 1>
A
a 0 1
and then .rename(), or in one go:
In [4]: df.groupby("A").agg(b=('B', lambda x: 0), c=('B', lambda x: 1))
Out[4]:
b c
A
a 0 1
For pandas < 0.25
The currently accepted answer by unutbu describes a great way of doing this in pandas versions before 0.20. However, as of pandas 0.20, using this method raises a warning indicating that the syntax will not be available in future versions of pandas.
Series:
FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version
DataFrames:
FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
According to the pandas 0.20 changelog, the recommended way of renaming columns while aggregating is as follows.
# Create a sample data frame
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
'B': range(5),
'C': range(5)})
# ==== SINGLE COLUMN (SERIES) ====
# Syntax soon to be deprecated
df.groupby('A').B.agg({'foo': 'count'})
# Recommended replacement syntax
df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})
# ==== MULTI COLUMN ====
# Syntax soon to be deprecated
df.groupby('A').agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
# Recommended replacement syntax
df.groupby('A').agg({'B': 'sum', 'C': 'min'}).rename(columns={'B': 'foo', 'C': 'bar'})
# As the recommended syntax is more verbose, parentheses can
# be used to introduce line breaks and increase readability
(df.groupby('A')
.agg({'B': 'sum', 'C': 'min'})
.rename(columns={'B': 'foo', 'C': 'bar'})
)
Please see the 0.20 changelog for additional details.
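For readers who have since moved to pandas 0.25 or later, the single-column (Series) case can also be written with keyword names instead of a rename step. A minimal sketch under that version assumption; the variable names renamed and named are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": range(5),
                   "C": range(5)})

# Two-step form recommended by the 0.20 changelog
renamed = df.groupby("A").B.agg(["count"]).rename(columns={"count": "foo"})

# Keyword form for a Series groupby: no column selection is needed,
# so each value is just the aggregation function
named = df.groupby("A").B.agg(foo="count")
```

Both results should be a DataFrame indexed by A with a single column foo.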
Update 2017-01-03 in response to @JunkMechanic's comment.
With the old style dictionary syntax, it was possible to pass multiple lambda functions to .agg, since these would be renamed with the key in the passed dictionary:
>>> df.groupby('A').agg({'B': {'min': lambda x: x.min(), 'max': lambda x: x.max()}})
B
max min
A
1 2 0
2 4 3
Multiple functions can also be passed to a single column as a list:
>>> df.groupby('A').agg({'B': [np.min, np.max]})
B
amin amax
A
1 0 2
2 3 4
However, this does not work with lambda functions, since they are anonymous and all return <lambda>, which causes a name collision:
>>> df.groupby('A').agg({'B': [lambda x: x.min(), lambda x: x.max()]})
SpecificationError: Function names must be unique, found multiple named <lambda>
To avoid the SpecificationError, named functions can be defined a priori instead of using lambda. Suitable function names also avoid calling .rename on the data frame afterwards. These functions can be passed with the same list syntax as above:
>>> def my_min(x):
...     return x.min()
...
>>> def my_max(x):
...     return x.max()
...
>>> df.groupby('A').agg({'B': [my_min, my_max]})
B
my_min my_max
A
1 0 2
2 3 4
Rename result columns from Pandas aggregation (FutureWarning: using a dict with renaming is deprecated)
Use groupby apply and return a Series to rename columns
Use the groupby apply method to perform an aggregation that:
- Renames the columns
- Allows for spaces in the names
- Allows you to order the returned columns in any way you choose
- Allows for interactions between columns
- Returns a single level index and NOT a MultiIndex
To do this:
- Create a custom function that you pass to apply
- This custom function is passed each group as a DataFrame
- Return a Series
- The index of the Series will be the new columns
Create fake data
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1", "user3"],
"Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0, 9],
'Score': [9, 1, 8, 7, 7, 6, 9]})
Create a custom function that returns a Series
The variable x inside of my_agg is a DataFrame
def my_agg(x):
    names = {
        'Amount mean': x['Amount'].mean(),
        'Amount std': x['Amount'].std(),
        'Amount range': x['Amount'].max() - x['Amount'].min(),
        'Score Max': x['Score'].max(),
        'Score Sum': x['Score'].sum(),
        'Amount Score Sum': (x['Amount'] * x['Score']).sum()}
    return pd.Series(names, index=['Amount range', 'Amount std', 'Amount mean',
                                   'Score Sum', 'Score Max', 'Amount Score Sum'])
Pass this custom function to the groupby apply method
df.groupby('User').apply(my_agg)
The big downside is that this function will be much slower than agg for the cythonized aggregations.
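If the speed of apply becomes a problem, part of the same result can be obtained with agg alone. The sketch below is one possible approach, not the method from this answer: it assumes pandas 0.25 or later, precomputes the cross-column product in a helper column (amount_x_score, a name made up here), and uses dictionary unpacking so that the output names may contain spaces.

```python
import pandas as pd

df = pd.DataFrame({
    "User": ["user1", "user2", "user2", "user3", "user2", "user1", "user3"],
    "Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0, 9],
    "Score": [9, 1, 8, 7, 7, 6, 9],
})

out = (
    # Column interactions are computed row-wise before grouping
    df.assign(amount_x_score=df["Amount"] * df["Score"])
      .groupby("User")
      .agg(**{
          "Amount mean": ("Amount", "mean"),
          "Amount min": ("Amount", "min"),
          "Amount max": ("Amount", "max"),
          "Score Max": ("Score", "max"),
          "Amount Score Sum": ("amount_x_score", "sum"),
      })
)

# Derived statistics are computed from the aggregated columns afterwards
out["Amount range"] = out["Amount max"] - out["Amount min"]
out = out.drop(columns=["Amount min", "Amount max"])
```

The trade-off is that anything that cannot be expressed as "row-wise step, then built-in aggregation, then arithmetic on the result" still needs apply.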
Using a dictionary with groupby agg method
Using a dictionary of dictionaries was removed because of its complexity and somewhat ambiguous nature. There is an ongoing discussion on GitHub on how to improve this functionality in the future. Here, you can directly access the aggregating column after the groupby call. Simply pass a list of all the aggregating functions you wish to apply.
df.groupby('User')['Amount'].agg(['sum', 'count'])
Output
sum count
User
user1 18.0 2
user2 20.5 3
user3 10.5 1
It is still possible to use a dictionary to explicitly denote different aggregations for different columns, like here if there was another numeric column named Other.
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1"],
"Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0],
'Other': [1,2,3,4,5,6]})
df.groupby('User').agg({'Amount' : ['sum', 'count'], 'Other':['max', 'std']})
Output
Amount Other
sum count max std
User
user1 18.0 2 6 3.535534
user2 20.5 3 5 1.527525
user3 10.5 1 4 NaN
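The result above has two-level (MultiIndex) column labels. If single-level names are wanted, one common approach is to join the two levels. A minimal sketch; the underscore separator is an arbitrary choice made here:

```python
import pandas as pd

df = pd.DataFrame({
    "User": ["user1", "user2", "user2", "user3", "user2", "user1"],
    "Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0],
    "Other": [1, 2, 3, 4, 5, 6],
})

out = df.groupby("User").agg({"Amount": ["sum", "count"],
                              "Other": ["max", "std"]})

# Each column label is a tuple such as ("Amount", "sum"); join its parts
out.columns = ["_".join(col) for col in out.columns]
```

After this step the columns are plain strings and can be renamed further with .rename if needed.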
Naming returned aggregate columns in Dask dataframes
Dask has docs on performing custom aggregations. They discuss the case of computing the mean, and how it's more complex than the pandas counterpart:
Many reductions can only be implemented with multiple temporaries. To implement these reductions, the steps should return tuples and expect multiple arguments. A mean function can be implemented as:
custom_mean = dd.Aggregation(
'custom_mean',
lambda s: (s.count(), s.sum()),
lambda count, sum: (count.sum(), sum.sum()),
lambda count, sum: sum / count,
)
df.groupby('g').agg(custom_mean)
This hints at the complexity involved in handling all types of user-defined aggregation, but provides a pretty good overview of how to implement them.
As for renaming the column, I don't see a way to do that in one step (at the moment). Could be wrong about this, and I'm sure this will probably change in the future. Complex reshape operations in dask are significantly different from their pandas counterparts because they need to work with data partitions and account for a variety of data locations, so it's not trivial to replicate the full pandas API. Balancing performance considerations on a laptop, a distributed cluster, and a high-performance computing facility (the range of dask deployments is quite broad) with the many feature requests from users is a real challenge for the dask developers.
Generally, the answer "why does this not exist" for open source projects is "if you want to contribute it, PRs are welcome!". See dask's development guidelines for a nice intro to contributing.
Renaming Column Names in Pandas Groupby function
For the first question, I think the answer would be:
<your DataFrame>.rename(columns={'count':'Total_Numbers'})
or
<your DataFrame>.columns = ['ID', 'Region', 'Total_Numbers']
As for the second one, I'd say the answer would be no. It's possible to use it like 'df.ID' because of the Python data model:
Attribute references are translated to lookups in this dictionary,
e.g., m.x is equivalent to m.__dict__["x"]
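One practical consequence is worth a quick illustration: attribute access only reaches a column when the name does not clash with an existing DataFrame attribute. A minimal sketch with made-up data (the frame and its values are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2], "Region": ["N", "S"], "count": [10, 20]})

# 'ID' is not a DataFrame attribute, so both spellings give the column
id_by_attr = df.ID
id_by_key = df["ID"]

# 'count' is also a DataFrame method, so attribute access returns the method
count_attr = df.count
count_col = df["count"]

# After renaming, the column is reachable as an attribute again
renamed = df.rename(columns={"count": "Total_Numbers"})
```

Bracket access works for every column name, which is one more reason to rename a column called count.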
pandas, groupby.agg to return with a given column name
I think you need rename or define the name parameter in reset_index:
Check also deprecate groupby agg with a dictionary when renaming.
1.
df = df.groupby('A')['B'].sum().reset_index(name='total')
2.
df = df.groupby('A', as_index=False)['B'].sum().rename(columns={'B':'total'})
3.
df = df.groupby('A').agg({'B' : 'sum'}).rename(columns={'B':'total'}).reset_index()
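The three variants can be checked against each other on a small frame, together with a fourth option that uses named aggregation. This is only a sketch: the sample data is made up, and the fourth variant assumes pandas 0.25 or later.

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 2, 3]})

# 1. Name the Series values while resetting the index
v1 = df.groupby("A")["B"].sum().reset_index(name="total")

# 2. Keep the group key as a column, then rename
v2 = df.groupby("A", as_index=False)["B"].sum().rename(columns={"B": "total"})

# 3. Dictionary aggregation, rename, then reset the index
v3 = df.groupby("A").agg({"B": "sum"}).rename(columns={"B": "total"}).reset_index()

# 4. Named aggregation sets the output name directly
v4 = df.groupby("A").agg(total=("B", "sum")).reset_index()
```

All four should yield the columns A and total with the same values.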
Naming sublevels in aggregation function in Pandas
With pandas 0.25.0+ renaming aggregation is possible and you don't have to deal with MultiIndex column headers.
data.groupby("Gender").agg(Math_min=('Math score','min'),
Math_max=('Math score','max'),
Math_diff=('Math score',np.ptp),
Lit_mean=('Literature score','mean'))
Output:
Math_min Math_max Math_diff Lit_mean
Gender
F 5 9 4 8.000000
M 3 8 5 7.333333
I think lambdas not working in this format is a reported bug.
data.groupby("Gender").agg(Math_min=('Math score','min'),
Math_max=('Math score','max'),
Math_diff=('Math score',lambda x: np.max(x)-np.min(x)),
Lit_mean=('Literature score', 'mean'))
Yields
KeyError: "[('Math score', '<lambda>')] not in index"
Should be fixed soon.
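On recent pandas releases a lambda in this position appears to work as expected. A minimal sketch, assuming an up-to-date pandas; the data frame below is hypothetical, since the original data is not shown in this answer:

```python
import pandas as pd

# Hypothetical scores, made up for illustration
data = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "M"],
    "Math score": [5, 9, 7, 3, 8, 6],
    "Literature score": [8, 9, 7, 7, 8, 7],
})

result = data.groupby("Gender").agg(
    Math_min=("Math score", "min"),
    Math_max=("Math score", "max"),
    # The keyword supplies the output name, so the lambda needs no name
    Math_diff=("Math score", lambda x: x.max() - x.min()),
    Lit_mean=("Literature score", "mean"),
)
```

If the KeyError still appears, the installed pandas version is probably one of the early 0.25 releases.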
Python pandas perform same aggregation on all other columns, without naming columns
The solution you linked uses df.groupby('id')['x1', 'x2'].agg('sum').
So, to aggregate every column except a few:
columns_to_exclude = ['year', 'month' ,'day']
columns_to_aggregate = [col for col in df.columns if col not in columns_to_exclude]
df.groupby('id')[columns_to_aggregate].agg('sum')
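An equivalent way to express the same idea is to drop the unwanted columns before grouping. A minimal sketch on made-up data (the column names x1 and x2 follow the linked solution, the values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2],
    "year": [2020, 2020, 2021],
    "month": [1, 2, 3],
    "day": [4, 5, 6],
    "x1": [1.0, 2.0, 3.0],
    "x2": [10, 20, 30],
})

columns_to_exclude = ["year", "month", "day"]

# Drop the excluded columns first, then sum everything that is left
result = df.drop(columns=columns_to_exclude).groupby("id").sum()
```

Which spelling reads better is mostly a matter of taste; the list comprehension makes the kept columns explicit, while drop keeps the exclusion in one place.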
renaming columns after group by and sum in pandas dataframe
You cannot rename it, because it is the index. You can add as_index=False to return a DataFrame, or add reset_index:
# After sum() the aggregated column keeps the name 'value', so that is the key to rename
pdf_chart_data1 = (pdf_chart_data.groupby('sell', as_index=False)['value'].sum()
                   .rename(columns={'value': 'valuesum', 'sell': 'selltime'}))
Or:
pdf_chart_data1 = (pdf_chart_data.groupby('sell')['value'].sum()
                   .reset_index()
                   .rename(columns={'value': 'valuesum', 'sell': 'selltime'}))
Multiple aggregations of the same column using pandas GroupBy.agg()
As of 2022-06-20, the below is the accepted practice for aggregations:
df.groupby('dummy').agg(
Mean=('returns', np.mean),
Sum=('returns', np.sum))
The material below the fold is included for historical versions of pandas.
You can simply pass the functions as a list:
In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:
mean sum
dummy
1 0.036901 0.369012
or as a dictionary:
In [21]: df.groupby('dummy').agg({'returns':
{'Mean': np.mean, 'Sum': np.sum}})
Out[21]:
returns
Mean Sum
dummy
1 0.036901 0.369012