Pandas: peculiar performance drop for inplace rename after dropna
This is a copy of the explanation on github.
There is no guarantee that an inplace operation is actually faster. Often it is the same operation performed on a copy, with the top-level reference reassigned afterwards.
The reason for the difference in performance in this case is as follows.
The (df1-df2).dropna() call creates a slice of the dataframe. When you apply a new operation to it, this triggers a SettingWithCopy check, because the result could be a copy (but often is not).
This check must perform a garbage collection to wipe out some cache references in order to see whether it is a copy. Unfortunately, Python syntax makes this unavoidable.
You can avoid this by simply making a copy first.
df = (df1-df2).dropna().copy()
An inplace operation that follows this will be as performant as before.
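As a minimal sketch of the workaround described above: the frames df1 and df2, their column names, and the renamed label "x" are made up for illustration. The SettingWithCopy mechanism belongs to the pandas versions discussed here, and newer releases may behave differently, so treat any timing difference as something to measure rather than assume.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the original question's frames are not shown.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])
df2 = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])
df1.iloc[::10, 0] = np.nan  # add missing values so dropna() actually drops rows

# Explicit copy first, then the inplace rename on an object that owns its data.
df = (df1 - df2).dropna().copy()
df.rename(columns={"a": "x"}, inplace=True)

# Same result without inplace: chain the rename and assign once.
df_alt = (df1 - df2).dropna().rename(columns={"a": "x"})
```

Both variants end with the same frame; the chained form avoids the inplace call entirely.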
My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.
Any issues with renaming dataframe columns using inplace=True?
The time an in-place rename takes does not depend on the size of the dataframe. Can I conclude that no copies are made behind the scenes?
Yes, you can conclude that, except that a copy of the column names series may be made. Obviously the performance of that should be immaterial as the number of columns is usually not huge.
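One way to check this observation on your own installation is to time the rename for frames of different sizes. The sketch below is only an assumption about how such a measurement could be set up: the helper time_inplace_rename, the column names, and the row counts are hypothetical, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def time_inplace_rename(n_rows, repeats=5):
    """Return the best time (seconds) for 10 back-and-forth inplace renames."""
    df = pd.DataFrame(np.zeros((n_rows, 4)), columns=list("abcd"))

    def rename_back_and_forth():
        # Rename and restore so every run starts from the same labels.
        df.rename(columns={"a": "x"}, inplace=True)
        df.rename(columns={"x": "a"}, inplace=True)

    return min(timeit.repeat(rename_back_and_forth, number=10, repeat=repeats))


timings = {n: time_inplace_rename(n) for n in (1_000, 100_000, 1_000_000)}
for n, seconds in timings.items():
    print(f"{n:>9} rows: {seconds:.6f} s")
```

If the timings stay roughly flat as the row count grows, that is consistent with no data copy being made; if they grow with the row count, a copy is likely happening in that version.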
Trying to understand changing column name with inplace = true vs false
As you said yourself, inplace=True mutates the original dataframe, thus you don't have to re-assign it.
On the other hand, the default setting is inplace=False, thus you can rename (and re-assign) as:
df_new = df_original.rename(columns = {'WrongName': 'CorrectName'})
I am not sure if this is also too cumbersome for you.
Moreover, as explained in the answer on inplace performance, there is no guarantee that an inplace operation runs faster.
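The difference between the two settings can be sketched as follows. The frame contents are made up for illustration; only the column names come from the answer above.

```python
import pandas as pd

df_original = pd.DataFrame({"WrongName": [1, 2, 3]})

# Default (inplace=False): a new frame is returned, the original is unchanged.
df_new = df_original.rename(columns={"WrongName": "CorrectName"})

# inplace=True: the frame itself is modified and the call returns None,
# so assigning the result back would discard the frame.
df_mutated = df_original.copy()
result = df_mutated.rename(columns={"WrongName": "CorrectName"}, inplace=True)
```

Because the inplace call returns None, writing df = df.rename(..., inplace=True) is a common mistake that replaces the frame with None.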
Why is renaming columns in pandas so slow?
I don't think inplace=True avoids copying your data. There is some discussion on SO saying that it actually does copy and then assigns the result back. Also see this github issue.
You can just override the columns with:
df.columns = df.columns.to_series().replace({'a':'b'})
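A small sketch of overriding the labels directly, using a made-up two-column frame. It shows the to_series().replace() form from above next to a plain list comprehension, which is an alternative suggested here and not part of the original answer.

```python
import pandas as pd

mapping = {"a": "b"}  # old label -> new label

# Variant from the answer: replace labels through a Series.
df = pd.DataFrame({"a": [1, 2], "c": [3, 4]})
df.columns = df.columns.to_series().replace(mapping)

# Alternative (an assumption, not from the answer): build the new labels
# with a list comprehension; labels missing from the mapping are kept.
df_alt = pd.DataFrame({"a": [1, 2], "c": [3, 4]})
df_alt.columns = [mapping.get(col, col) for col in df_alt.columns]
```

Either way only the column index is replaced; the data itself is not touched by the assignment.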