What Rules Does Pandas Use to Generate a View VS a Copy

What rules does Pandas use to generate a view vs a copy?

Here are the rules; later ones override earlier ones:

  • All operations generate a copy

  • If inplace=True is provided, it will modify in-place; only some operations support this

  • An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.

  • An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example above uses .query, which will always return a copy because it is evaluated by numexpr.)

  • An indexer that gets on a multiple-dtyped object is always a copy.

Your example of chained indexing

df[df.C <= df.B].loc[:,'B':'E']

is not guaranteed to work (and thus you should never do this).

Instead do:

df.loc[df.C <= df.B, 'B':'E']

as this is faster and will always work.

Chained indexing is 2 separate Python operations and thus cannot be reliably intercepted by pandas (you will often get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
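To make the difference concrete, here is a minimal sketch. The frame and its values are made up for illustration; only the column names B to E and the mask come from the question. For reading, both forms select the same data. For writing, only the single .loc call is one operation on df itself.

```python
import pandas as pd

# Illustrative data, not from the original question.
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [5, 1, 7, 2],
    "C": [3, 4, 6, 9],
    "D": [0, 0, 0, 0],
    "E": [1, 1, 1, 1],
})

mask = df.C <= df.B  # True for rows 0 and 2

# Reading: both forms pick the same rows and columns.
chained = df[mask].loc[:, "B":"E"]
single = df.loc[mask, "B":"E"]
same_selection = chained.equals(single)

# Writing: one .loc call is a single __setitem__ on df, so df itself changes.
df.loc[mask, "B":"E"] = -1
```

After this runs, rows 0 and 2 of columns B through E hold -1 and everything else is untouched.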

Pandas: Subindexing dataframes: Copies vs views

Your answer lies in the pandas docs: returning-a-view-versus-a-copy.

Whenever an array of labels or a boolean vector are involved
in the indexing operation, the result will be a copy.
With single label / scalar indexing and slicing,
e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.

In your example, bar is a view of slices of foo. If you wanted a copy, you could have used the copy method. Modifying bar also modifies foo. pandas does not appear to have a copy-on-write mechanism.

See my code example below to illustrate:

In [1]: import pandas as pd
...: import numpy as np
...: foo = pd.DataFrame(np.random.random((10,5)))
...:

In [2]: pd.__version__
Out[2]: '0.12.0.dev-35312e4'

In [3]: np.__version__
Out[3]: '1.7.1'

In [4]: # DataFrame has copy method
...: foo_copy = foo.copy()

In [5]: bar = foo.iloc[3:5,1:4]

In [6]: bar == foo.iloc[3:5,1:4] == foo_copy.iloc[3:5,1:4]
Out[6]:
1 2 3
3 True True True
4 True True True

In [7]: # Changing the view
...: bar.ix[3,1] = 5

In [8]: # View and DataFrame still equal
...: bar == foo.iloc[3:5,1:4]
Out[8]:
1 2 3
3 True True True
4 True True True

In [9]: # It is now different from a copy of original
...: bar == foo_copy.iloc[3:5,1:4]
Out[9]:
1 2 3
3 False True True
4 True True True

Checking whether data frame is copy or view in Pandas

Answers from HYRY and Marius in comments!

One can check either by:

  • testing equivalence of the values.base attribute rather than the values attribute, as in:

    df.values.base is df2.values.base instead of df.values is df2.values.

  • or using the (admittedly internal) _is_view attribute (df2._is_view is True).
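Another option that avoids internal attributes is NumPy's np.shares_memory. The sketch below is only an illustration: shares_buffer is a hypothetical helper name, and the frame is made up. The result for an explicit copy is dependable; the result for a slice depends on the pandas version and memory layout, so I don't state it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=list("ABC"))

def shares_buffer(a, b):
    """Hypothetical helper: True if the NumPy data behind a and b overlaps."""
    return np.shares_memory(a.to_numpy(), b.to_numpy())

explicit_copy = df.copy()   # deep copy by default
row_slice = df.iloc[1:3]

copy_shares = shares_buffer(df, explicit_copy)  # an explicit copy owns its own data
slice_shares = shares_buffer(df, row_slice)     # version and layout dependent
```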

Thanks everyone!

In Pandas, does .iloc method give a copy or view?

In general, you can get a view if the data-frame has a single dtype, which is not the case with your original data-frame:

In [4]: df
Out[4]:
age name
student1 21 Marry
student2 24 John

In [5]: df.dtypes
Out[5]:
age int64
name object
dtype: object

However, when you do:

In [6]: df.loc['student3'] = ['old','Tom']
...:

The first column gets coerced to object, since columns cannot have mixed dtypes:

In [7]: df.dtypes
Out[7]:
age object
name object
dtype: object

In this case, the underlying .values will always return an array with the same underlying buffer, and changes to that array will be reflected in the data-frame:

In [11]: vals = df.values

In [12]: vals
Out[12]:
array([[21, 'Marry'],
[24, 'John'],
['old', 'Tom']], dtype=object)

In [13]: vals[0,0] = 'foo'

In [14]: vals
Out[14]:
array([['foo', 'Marry'],
[24, 'John'],
['old', 'Tom']], dtype=object)

In [15]: df
Out[15]:
age name
student1 foo Marry
student2 24 John
student3 old Tom

On the other hand, with mixed types like with your original data-frame:

In [26]: df = pd.DataFrame([{'name':'Marry', 'age':21},{'name':'John','age':24}]
...: ,index=['student1','student2'])
...:

In [27]: vals = df.values

In [28]: vals
Out[28]:
array([[21, 'Marry'],
[24, 'John']], dtype=object)

In [29]: vals[0,0] = 'foo'

In [30]: vals
Out[30]:
array([['foo', 'Marry'],
[24, 'John']], dtype=object)

In [31]: df
Out[31]:
age name
student1 21 Marry
student2 24 John
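The mixed-dtype case can be reproduced as a standalone script. This is a sketch using to_numpy() in place of .values, under the assumption that both behave the same here: with an int column and a string column, pandas has to assemble a fresh object array, so writing to it cannot reach the frame.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [21, 24], "name": ["Marry", "John"]},
    index=["student1", "student2"],
)

# Mixed dtypes: the 2-D array is built fresh from the separate columns.
vals = df.to_numpy()
vals[0, 0] = "foo"  # changes only the temporary array

age_after = df.loc["student1", "age"]  # still the original value
```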

Note, however, that a view will only be returned if it is possible to be a view, i.e. if it is a proper slice, otherwise, a copy will be made regardless of the dtypes:

In [39]: df.loc['student3'] = ['old','Tom']

In [40]: df2
Out[40]:
name
student3 Tom
student2 John

In [41]: df2.loc[:] = 'foo'

In [42]: df2
Out[42]:
name
student3 foo
student2 foo

In [43]: df
Out[43]:
age name
student1 21 Marry
student2 24 John
student3 old Tom

pandas dataframe view vs copy, how do I tell?

If your DataFrame has a simple column index, then there is no difference between indexing with a list and indexing with a tuple.
For example,

In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))

In [9]: df.loc[:, ['A','B']]
Out[9]:
A B
0 0 1
1 3 4
2 6 7
3 9 10

In [10]: df.loc[:, ('A','B')]
Out[10]:
A B
0 0 1
1 3 4
2 6 7
3 9 10

But if the DataFrame has a MultiIndex, there can be a big difference:

df = pd.DataFrame(
    np.random.randint(10, size=(5, 4)),
    columns=pd.MultiIndex.from_arrays([['foo']*2 + ['bar']*2, list('ABAB')]),
    index=pd.MultiIndex.from_arrays([['baz']*2 + ['qux']*3, list('CDCDC')]))

# foo bar
# A B A B
# baz C 7 9 9 9
# D 7 5 5 4
# qux C 5 0 5 1
# D 1 7 7 4
# C 6 4 3 5

In [27]: df.loc[:, ('foo','B')]
Out[27]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64

In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'

The KeyError is saying that the MultiIndex has to be lexsorted. If we do that, then we still get a different result:

In [29]: df.sortlevel(axis=1).loc[:, ('foo','B')]
Out[29]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64

In [30]: df.sortlevel(axis=1).loc[:, ['foo','B']]
Out[30]:
foo
A B
baz C 7 9
D 7 5
qux C 5 0
D 1 7
C 6 4

Why is that? df.sortlevel(axis=1).loc[:, ('foo','B')] is selecting the column where the first column level equals foo, and the second column level is B.

In contrast, df.sortlevel(axis=1).loc[:, ['foo','B']] is selecting the columns where the first column level is either foo or B. With respect to the first column level, there are no B columns, but there are two foo columns.
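Here is a small sketch of the tuple-versus-list distinction that should run on current pandas. It uses sort_index(axis=1) instead of sortlevel, which is not available in recent versions, and deterministic data instead of random integers. I left 'B' out of the list because, as far as I know, recent versions raise a KeyError for a label that is missing from the level rather than silently ignoring it.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([["foo"] * 2 + ["bar"] * 2, list("ABAB")])
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=columns).sort_index(axis=1)

# Tuple: one key per column level, so this is a single column (a Series).
one_column = df.loc[:, ("foo", "B")]

# List: a set of first-level labels, so this is every column under 'foo'.
foo_block = df.loc[:, ["foo"]]
```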

I think the operating principle with Pandas is that if you use df.loc[...] as
an expression, you should assume df.loc may be returning a copy or a view. The Pandas docs do not specify any rules about which you should expect.
However, if you make an assignment of the form

df.loc[...] = value

then you can trust Pandas to alter df itself.

The documentation warns about the distinction between views and copies so that you are aware of the pitfall of using chained assignments of the form

df.loc[...][...] = value

Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then

df.loc[...][...] = value

is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well since there are no references to the copy and thus there is no way to access the copy after the assignment statement completes, and (at least in CPython) it is therefore soon-to-be garbage collected.


I do not know of a practical fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.

However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):

  • If the resultant NDFrame can not be expressed as a basic slice of the
    underlying NumPy array, then it probably will be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy. A selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
  • If the resultant NDFrame has columns of different dtypes, then df.loc
    will again probably return a copy.

However, there is an easy way to determine a posteriori whether x = df.loc[..] is a view: simply see if changing a value in x affects df. If it does, x is a view; if not, x is a copy.

Select part of Dataframe to obtain a view, not a copy

The first point of importance is that loc (and by extension, iloc, at, and iat) will always return a copy.

If you want a view, you'll have to index frame via __getitem__. Now, even this isn't guaranteed to return a view or copy -- that is an implementation detail and it isn't easy to tell.

Between the following indexing operations,

frame2 = frame[frame.iloc[:,0] == 1]
frame3 = frame[frame > 0]
frame4 = frame2[[0, 1]]

frame2._is_view
# False
frame3._is_view
# True
frame4._is_view
# False

only frame3 will be a view. The specifics also depend on dtypes and other factors (such as the shape of the slice), but this is an obvious distinction.

Despite frame3 being a view, modifications to it may either work or not, but they will never result in a change to frame. The devs have put a lot of checks in place (most notably the SettingWithCopyWarning) to prevent unintended side effects arising from modifying views.

frame3.iloc[:, 1] = 12345
frame3
0 1 2
0 1 12345 3
1 1 12345 6
2 7 12345 9

frame
0 1 2
0 1 2 3
1 1 5 6
2 7 8 9

TL;DR: please look for a different way to do whatever it is you're trying to do.

why should I make a copy of a data frame in pandas

This expands on Paul's answer. In Pandas, indexing a DataFrame returns a reference to the initial DataFrame, so changing the subset will change the initial DataFrame. Use the copy if you want to make sure the initial DataFrame does not change. Consider the following code:

from pandas import DataFrame

df = DataFrame({'x': [1,2]})
df_sub = df[0:1]
df_sub.x = -1
print(df)

You'll get:

   x
0 -1
1  2

In contrast, the following leaves df unchanged:

df_sub_copy = df[0:1].copy()
df_sub_copy.x = -1

Note: this answer is out of date for newer versions of pandas, where the behavior described above no longer holds. See the docs.
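The part of this that does not depend on the pandas version is the explicit copy. A minimal sketch, reusing the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2]})

df_sub_copy = df[0:1].copy()  # explicit copy: owns its own data
df_sub_copy["x"] = -1         # cannot reach df

original_values = df["x"].tolist()
```

Whatever df[0:1] happens to be internally, calling .copy() on it gives you an object you can modify freely without touching df.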

Force Return of View rather than copy in Pandas?

There are two parts to your question: (1) how to make a view (see bottom of this answer), and (2) how to make a copy.

I'll demonstrate with some example data:

import pandas as pd

df = pd.DataFrame([[1,2,3],[4,5,6],[None,10,20],[7,8,9]], columns=['x','y','z'])

# which looks like this:
x y z
0 1 2 3
1 4 5 6
2 NaN 10 20
3 7 8 9

How to make a copy: One option is to explicitly copy your DataFrame after whatever operations you perform. For instance, let's say we are selecting rows that do not have NaN:

df2 = df[~df['x'].isnull()]
df2 = df2.copy()

Then, if you modify values in df2, you will find that the modifications do not propagate back to the original data (df), and that Pandas does not warn that "A value is trying to be set on a copy of a slice from a DataFrame":

df2['x'] *= 100

# original data unchanged
print(df)

x y z
0 1 2 3
1 4 5 6
2 NaN 10 20
3 7 8 9

# modified data
print(df2)

x y z
0 100 2 3
1 400 5 6
3 700 8 9

Note: you may take a performance hit by explicitly making a copy.

How to ignore warnings: Alternatively, in some cases you might not care whether a view or copy is returned, because your intention is to permanently modify the data and never go back to the original data. In this case, you can suppress the warning and go merrily on your way (just don't forget that you've turned it off, and that the original data may or may not be modified by your code, because df2 may or may not be a copy):

pd.options.mode.chained_assignment = None  # default='warn'

For more information, see the answers at How to deal with SettingWithCopyWarning in Pandas?

How to make a view: Pandas will implicitly make views wherever and whenever possible. The key to this is to use the df.loc[row_indexer,col_indexer] method. For example, to multiply the values of column y by 100 for only the rows where column x is not null, we would write:

mask = ~df['x'].isnull()
df.loc[mask, 'y'] *= 100

# original data has changed
print(df)

x y z
0 1.0 200 3
1 4.0 500 6
2 NaN 10 20
3 7.0 800 9

