Pandas: Peculiar Performance Drop for Inplace Rename After Dropna

This is a copy of the explanation on GitHub.

There is no guarantee that an inplace operation is actually faster. Often it is the same operation performed on a copy, with the top-level reference reassigned afterwards.

The reason for the difference in performance in this case is as follows.

The (df1-df2).dropna() call creates a slice of the dataframe. When you then apply a new operation, this triggers a SettingWithCopy check, because the result could be a copy (but often is not).

This check must perform a garbage collection to wipe out some cache references to see if it's a copy. Unfortunately, Python syntax makes this unavoidable.

You can avoid this by simply making a copy first:

df = (df1-df2).dropna().copy()

An inplace operation that follows this will be as performant as before.
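To see the effect on your own setup, here is a minimal sketch that times an inplace rename after dropna, with and without the explicit copy. The frames, the column names and the helper rename_after_dropna are made up for illustration, and the size of the gap (if any) depends on your pandas version, since the SettingWithCopy machinery has changed across releases.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data; the original question does not show df1 and df2.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])
df2 = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])
df1.iloc[::10, 0] = np.nan  # introduce NaNs so that dropna() removes rows


def rename_after_dropna(make_copy):
    """Run dropna(), optionally copy, then rename one column in place."""
    df = (df1 - df2).dropna()
    if make_copy:
        # The explicit copy means the frame is no longer treated as a
        # possible slice of another frame.
        df = df.copy()
    df.rename(columns={"a": "x"}, inplace=True)
    return df


# Timings are machine- and version-dependent, so no expected values are given.
print("without copy:", timeit.timeit(lambda: rename_after_dropna(False), number=20))
print("with copy:   ", timeit.timeit(lambda: rename_after_dropna(True), number=20))
```

Both variants produce the same renamed frame; only the time spent may differ.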

My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.

Any issues with renaming dataframe columns using inplace=True?

The time renaming in place takes does not depend on the size of the dataframe. Can I conclude that no copies are made behind the scenes?

Yes, you can conclude that, except that a copy of the column names series may be made. Obviously the performance of that should be immaterial as the number of columns is usually not huge.

Trying to understand changing column name with inplace = true vs false

As you said yourself, inplace=True mutates the original dataframe, thus you don't have to re-assign it.
On the other hand, the default setting is inplace=False, so you can rename (and re-assign) as:

df_new = df_original.rename(columns = {'WrongName': 'CorrectName'})

I am not sure if this is also too cumbersome for you.

Moreover, as explained in the answer on inplace performance, there is no guarantee that an inplace operation runs faster.
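The two forms described in this answer can be compared side by side. This is only a sketch on a toy frame: the column "Other" and the name df_mutated are assumptions added for the example.

```python
import pandas as pd

df_original = pd.DataFrame({"WrongName": [1, 2, 3], "Other": [4, 5, 6]})

# Default (inplace=False): a new frame is returned and the original
# keeps its column names.
df_new = df_original.rename(columns={"WrongName": "CorrectName"})

# inplace=True: the frame the method is called on is modified, so the
# result is not assigned to anything.
df_mutated = df_original.copy()
df_mutated.rename(columns={"WrongName": "CorrectName"}, inplace=True)

print(list(df_original.columns))
print(list(df_new.columns))
print(list(df_mutated.columns))
```

Both approaches end with the same column names; the difference is only whether the original object is changed.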

Why is renaming columns in pandas so slow?

I don't think inplace=True avoids copying your data. There is some discussion on SO saying it actually does copy and then assigns the result back. Also see this GitHub issue.

You can just override the columns with:

df.columns = df.columns.to_series().replace({'a':'b'})
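As a small sketch of this approach on a toy frame (the column names and the mapping dict are made up for the example), the line above can be run as follows. A plain list comprehension is shown as an alternative that skips the intermediate Series; whether either is faster than rename on your data is something to measure rather than assume.

```python
import pandas as pd

mapping = {"a": "b"}  # hypothetical old name -> new name

# Variant from the answer: replace values in a Series of the column names.
df = pd.DataFrame({"a": [1, 2], "c": [3, 4]})
df.columns = df.columns.to_series().replace(mapping)

# Alternative: build the new names directly and assign them.
df_alt = pd.DataFrame({"a": [1, 2], "c": [3, 4]})
df_alt.columns = [mapping.get(col, col) for col in df_alt.columns]

print(list(df.columns))
print(list(df_alt.columns))
```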

