Pandas: Chained Assignments


The point of the SettingWithCopyWarning is to warn you that you may be doing something that will not update the original data frame as you might expect.

Here, data is a DataFrame, possibly of a single dtype (or not). You are taking a reference to data['amount'], which is a Series, and updating it. This probably works in your case because the result has the same dtype as the data that was already there.

However, the selection could instead return a copy, in which case you would be updating a copy of data['amount'] and would never see the change. Then you would be wondering why data is not updating.

Pandas returns a copy of an object in almost all method calls. The inplace operations are a convenience that works, but in general they do not make it clear that data is being modified, and they could potentially operate on copies.

It is much clearer to do this:

data['amount'] = data["amount"].fillna(data.groupby("num")["amount"].transform("mean"))

data["amount"] = data['amount'].fillna(mean_avg)
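A minimal, self-contained sketch of the two assignments above. The sample frame and its values are made up for illustration, and `mean_avg` is assumed here to be the overall column mean, since the original does not show how it was computed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two groups in "num", with gaps in "amount".
data = pd.DataFrame({
    "num": [1, 1, 2, 2],
    "amount": [10.0, np.nan, 30.0, np.nan],
})

# Per-group mean, aligned row by row with the original frame.
group_mean = data.groupby("num")["amount"].transform("mean")

# fillna returns a new Series; nothing in data changes yet.
by_group = data["amount"].fillna(group_mean)

# Assumption: mean_avg is the overall mean of the column.
mean_avg = data["amount"].mean()
by_overall = data["amount"].fillna(mean_avg)

# The explicit assignment is the only step that modifies data.
data["amount"] = by_group
```

Because each step returns a new object, the point at which `data` actually changes is the single assignment on the last line.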

One further plus to working on copies: you can chain operations, which is not possible with inplace ones.

e.g.

data['amount'] = data['amount'].fillna(mean_avg)*2

And just an FYI: inplace operations are neither faster nor more memory efficient. My 2c: they should be banned, but it is too late to change that API.

You can of course turn this off:

pd.set_option('mode.chained_assignment', None)

FYI, pandas runs its entire test suite with this option set to 'raise', so we know if chaining is happening.

How to deal with SettingWithCopyWarning in Pandas

The SettingWithCopyWarning was created to flag potentially confusing "chained" assignments, such as the following, which does not always work as expected, particularly when the first selection returns a copy. [see GH5390 and GH5597 for background discussion.]

df[df['A'] > 2]['B'] = new_val  # new_val not set in df

The warning offers a suggestion to rewrite as follows:

df.loc[df['A'] > 2, 'B'] = new_val
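As a small runnable illustration of the suggested rewrite, here is a sketch with a made-up frame; the column values and `new_val = 99` are assumptions chosen only for the example.

```python
import pandas as pd

# Hypothetical frame; new_val is an arbitrary replacement value.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [0, 0, 0, 0]})
new_val = 99

# A single .loc call selects the rows (boolean mask) and the column ("B")
# together, so the assignment is applied to df itself.
df.loc[df["A"] > 2, "B"] = new_val
```

Only the rows where `A > 2` are touched; the other rows keep their original `B` values.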

However, this doesn't fit your usage, which is equivalent to:

df = df[df['A'] > 2]
df['B'] = new_val

While it's clear that you don't care about writes making it back to the original frame (since you are overwriting the reference to it), unfortunately this pattern cannot be differentiated from the first chained assignment example. Hence the (false positive) warning. The potential for false positives is addressed in the docs on indexing, if you'd like to read further. You can safely disable this new warning with the following assignment.

import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
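If you would rather not change a global option, a commonly used alternative (not part of the original answer, so treat it as a suggestion) is to take an explicit copy of the filtered frame before assigning to it. The frame and `new_val` below are made up for illustration.

```python
import pandas as pd

# Hypothetical frame; new_val is an arbitrary value.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [0, 0, 0, 0]})
new_val = 99

# .copy() makes the filtered result an independent frame, so the
# following assignment is unambiguous.
df = df[df["A"] > 2].copy()
df["B"] = new_val
```

This states the intent directly: the filtered frame is a new object, and writes to it are not expected to reach the original.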



python pandas: how to avoid chained assignment

You should use loc so that the assignment is applied to the original frame rather than to a possible copy. In your example, the following will work and not raise a warning:

df.loc[df['x'] == 10, 'value'] = 1000

So the general form is:

df.loc[<mask or index label values>, <optional column>] = <new scalar value or array-like>
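To make the general form concrete, here is a short sketch covering the three variations it describes. The frame, its index labels, and the assigned values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [10, 20, 10], "value": [1, 2, 3]},
    index=["a", "b", "c"],
)

# Boolean mask plus a column, with a scalar value.
df.loc[df["x"] == 10, "value"] = 1000

# Index labels plus a column, with an array-like value
# (its length must match the number of selected rows).
df.loc[["a", "b"], "x"] = [11, 21]

# Mask only, no column: the scalar is written to every column
# of the matching rows.
df.loc[df["x"] == 21] = 0
```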

The docs highlight these errors, and there is also the intro. Granted, some of the function docs are sparse, so feel free to submit improvements.

pandas chained_assignment warning exception handling

After a long search, the "bad guy" was found.
Another developer had included the following lines in his module:

import warnings
warnings.filterwarnings('error')

This turns warnings into exceptions. For more details, see the warnings module documentation.

Hence my warnings were treated as exceptions, even though the pandas option was set to "warn".
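The effect can be reproduced without pandas. The sketch below uses a hypothetical `emit()` function as a stand-in for any library call that issues a warning, and wraps the filter in `warnings.catch_warnings()` so that it does not leak into the rest of the program.

```python
import warnings


def emit():
    # Stand-in for any library call that issues a warning.
    warnings.warn("something looks off", UserWarning)


with warnings.catch_warnings():
    # Same filter as in the other developer's module, but scoped to
    # this block; the previous filters are restored on exit.
    warnings.filterwarnings("error")
    try:
        emit()
        raised = False
    except UserWarning as exc:
        # The warning arrives as an exception instead of being printed.
        raised = True
        message = str(exc)
```

A module-level `warnings.filterwarnings('error')` has no such scope, which is why it affected code in other modules.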

Pandas method chaining when df not assigned yet

You need some reference to the dataframe in order to use it in multiple independent places. That means binding a reusable name to the value returned by pd.DataFrame.

A "functional" way to create such a binding is to use a lambda expression instead of an assignment statement.

df = (lambda df: df.drop(df.tail(1).index)....)(pd.DataFrame(...))

The lambda expression defines some function that uses whatever value is passed as an argument as the value of the name df; you then immediately call that function on your original dataframe.
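A runnable version of the same idea, with a made-up frame standing in for the elided `pd.DataFrame(...)` call and only the `drop` step from the chain. The second form uses `DataFrame.pipe`, which is not mentioned in the original answer and is shown only as a comparable alternative.

```python
import pandas as pd

# Hypothetical input; the original elides both the frame and the rest
# of the chain.
df = (lambda df: df.drop(df.tail(1).index))(
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
)

# Comparable alternative: pipe passes the frame to the function,
# so the chain reads left to right.
df_piped = (
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    .pipe(lambda d: d.drop(d.tail(1).index))
)
```

In both forms the name inside the lambda refers to the freshly created frame, so the last row can be dropped before any outer name has been assigned.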


