Rename result columns from Pandas aggregation ("FutureWarning: using a dict with renaming is deprecated")
Use groupbyapply
and return a Series to rename columnsUse the groupby apply
method to perform an aggregation that
- Renames the columns
- Allows for spaces in the names
- Allows you to order the returned columns in any way you choose
- Allows for interactions between columns
- Returns a single level index and NOT a MultiIndex
To do this:
- create a custom function that you pass to
apply
- This custom function is passed each group as a DataFrame
- Return a Series
- The index of the Series will be the new columns
Create fake data
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1", "user3"],
"Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0, 9],
'Score': [9, 1, 8, 7, 7, 6, 9]})
create custom function that returns a Series
The variable x
inside of my_agg
is a DataFrame
def my_agg(x):
names = {
'Amount mean': x['Amount'].mean(),
'Amount std': x['Amount'].std(),
'Amount range': x['Amount'].max() - x['Amount'].min(),
'Score Max': x['Score'].max(),
'Score Sum': x['Score'].sum(),
'Amount Score Sum': (x['Amount'] * x['Score']).sum()}
return pd.Series(names, index=['Amount range', 'Amount std', 'Amount mean',
'Score Sum', 'Score Max', 'Amount Score Sum'])
Pass this custom function to the groupby apply
method
df.groupby('User').apply(my_agg)
The big downside is that this function will be much slower than agg
for the cythonized aggregations
agg
methodUsing a dictionary of dictionaries was removed because of its complexity and somewhat ambiguous nature. There is an ongoing discussion on how to improve this functionality in the future on github Here, you can directly access the aggregating column after the groupby call. Simply pass a list of all the aggregating functions you wish to apply.
df.groupby('User')['Amount'].agg(['sum', 'count'])
Output
sum count
User
user1 18.0 2
user2 20.5 3
user3 10.5 1
It is still possible to use a dictionary to explicitly denote different aggregations for different columns, like here if there was another numeric column named Other
.
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1"],
"Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0],
'Other': [1,2,3,4,5,6]})
df.groupby('User').agg({'Amount' : ['sum', 'count'], 'Other':['max', 'std']})
Output
Amount Other
sum count max std
User
user1 18.0 2 6 3.535534
user2 20.5 3 5 1.527525
user3 10.5 1 4 NaN
Pandas renaming with dict is deprecated
For Reference, this was something deprecated in 2017 (https://github.com/pandas-dev/pandas/issues/18366)
There are multiple approaches (from this SO Answer)
You can change your code
agg = long_df.reset_index().groupby(['RegionVariable', 'EXP'])[features].agg(['count','mean'])
or use something like a custom pd.Series.apply
function.
Naming returned columns in Pandas aggregate function?
For pandas >= 0.25The functionality to name returned aggregate columns has been reintroduced in the master branch and is targeted for pandas 0.25. The new syntax is .agg(new_col_name=('col_name', 'agg_func')
. Detailed example from the PR linked above:
In [2]: df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
...: 'height': [9.1, 6.0, 9.5, 34.0],
...: 'weight': [7.9, 7.5, 9.9, 198.0]})
...:
In [3]: df
Out[3]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
In [4]: df.groupby('kind').agg(min_height=('height', 'min'),
max_weight=('weight', 'max'))
Out[4]:
min_height max_weight
kind
cat 9.1 9.9
dog 6.0 198.0
It will also be possible to use multiple lambda expressions with this syntax and the two-step rename syntax I suggested earlier (below) as per this PR. Again, copying from the example in the PR:
In [2]: df = pd.DataFrame({"A": ['a', 'a'], 'B': [1, 2], 'C': [3, 4]})
In [3]: df.groupby("A").agg({'B': [lambda x: 0, lambda x: 1]})
Out[3]:
B
<lambda> <lambda 1>
A
a 0 1
and then .rename()
, or in one go:
In [4]: df.groupby("A").agg(b=('B', lambda x: 0), c=('B', lambda x: 1))
Out[4]:
b c
A
a 0 0
For pandas < 0.25
The currently accepted answer by unutbu describes are great way of doing this in pandas versions <= 0.20. However, as of pandas 0.20, using this method raises a warning indicating that the syntax will not be available in future versions of pandas.
Series:
FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version
DataFrames:
FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
According to the pandas 0.20 changelog, the recommended way of renaming columns while aggregating is as follows.
# Create a sample data frame
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
'B': range(5),
'C': range(5)})
# ==== SINGLE COLUMN (SERIES) ====
# Syntax soon to be deprecated
df.groupby('A').B.agg({'foo': 'count'})
# Recommended replacement syntax
df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})
# ==== MULTI COLUMN ====
# Syntax soon to be deprecated
df.groupby('A').agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
# Recommended replacement syntax
df.groupby('A').agg({'B': 'sum', 'C': 'min'}).rename(columns={'B': 'foo', 'C': 'bar'})
# As the recommended syntax is more verbose, parentheses can
# be used to introduce line breaks and increase readability
(df.groupby('A')
.agg({'B': 'sum', 'C': 'min'})
.rename(columns={'B': 'foo', 'C': 'bar'})
)
Please see the 0.20 changelog for additional details.
Update 2017-01-03 in response to @JunkMechanic's comment.
With the old style dictionary syntax, it was possible to pass multiple lambda
functions to .agg
, since these would be renamed with the key in the passed dictionary:
>>> df.groupby('A').agg({'B': {'min': lambda x: x.min(), 'max': lambda x: x.max()}})
B
max min
A
1 2 0
2 4 3
Multiple functions can also be passed to a single column as a list:
>>> df.groupby('A').agg({'B': [np.min, np.max]})
B
amin amax
A
1 0 2
2 3 4
However, this does not work with lambda functions, since they are anonymous and all return <lambda>
, which causes a name collision:
>>> df.groupby('A').agg({'B': [lambda x: x.min(), lambda x: x.max]})
SpecificationError: Function names must be unique, found multiple named <lambda>
To avoid the SpecificationError
, named functions can be defined a priori instead of using lambda
. Suitable function names also avoid calling .rename
on the data frame afterwards. These functions can be passed with the same list syntax as above:
>>> def my_min(x):
>>> return x.min()
>>> def my_max(x):
>>> return x.max()
>>> df.groupby('A').agg({'B': [my_min, my_max]})
B
my_min my_max
A
1 0 2
2 3 4
Solution for SpecificationError: nested renamer is not supported while agg() along with groupby()
change
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']
to
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg(total='count')).reset_index()['total']
temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg(Avg='mean')).reset_index()['Avg']
reason: in new pandas version named aggregation is the recommended replacement for the deprecated “dict-of-dicts” approach to naming the output of column-specific aggregations (Deprecate groupby.agg() with a dictionary when renaming).
source: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html
Pandas aggregation warning with lambdas (FutureWarning: using a dict with renaming is deprecated)
Use map
for flatten
columns names:
df.columns = df.columns.map('_'.join)
How to update pandas DataFrame.drop() for Future Warning - all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only
From the documentation, pandas.DataFrame.drop
has the following parameters:
Parameters
labels: single label or list-like Index or column labels to drop.
axis: {0 or ‘index’, 1 or ‘columns’}, default 0 Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
index: single label or list-like Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
columns: single label or list-like Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
level: int or level name, optional For MultiIndex, level from which the labels will be removed.
inplace: bool, default False If False, return a copy. Otherwise, do operation inplace and return None.
errors: {‘ignore’, ‘raise’}, default ‘raise’ If ‘ignore’, suppress error and only existing labels are dropped.
Moving forward, only labels
(the first parameter) can be positional.
So, for this example, the drop
code should be as follows:
df = df.drop('market', axis=1)
or (more legibly) with columns
:
df = df.drop(columns='market')
Related Topics
Pandas: How to Assign Values Based on Multiple Conditions for Existing Columns
How to Find Rows of One Dataframe in Another Dataframe
How to Start a Background Process in Python
Python: Plotting Percentage in Seaborn Bar Plot
Vscode Import Error for Python Module
Find Value in Dictionary Using Regex in Python
The Right Way to Limit Maximum Number of Threads Running At Once
How to Plot Date and Time in X Axis Against Y Value (Python)
Print Floating Point Values Without Leading Zero
Combine Date and Time Columns Using Python Pandas
Receiving Integers from the User Until They Enter 0
Finding Out Who Got the Highest Mark Among the Students
Json.Decoder.Jsondecodeerror: Expecting Value: Line 1 Column 1 (Char 0) Python
How to Check If a String Column in Pyspark Dataframe Is All Numeric
Print() Prints Only Every Second Input