What rules does Pandas use to generate a view vs a copy?
Here are the rules; each later rule overrides the earlier ones:
- All operations generate a copy.
- If inplace=True is provided, the operation will modify in place; only some operations support this.
- An indexer that sets, e.g. .loc/.iloc/.iat/.at, will set in place.
- An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example from above is for .query; this will always return a copy, as it is evaluated by numexpr.)
- An indexer that gets on a multiple-dtyped object is always a copy.
Your example of chained indexing,
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you should never do this).
Instead do:
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work.
The chained indexing is two separate Python operations and thus cannot be reliably intercepted by pandas (you will often get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
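As a minimal sketch of the point above, the following compares the chained form with the single .loc call when only reading data. The frame and its values are made up for illustration; only the column names follow the 'B':'E' slice from the question.

```python
import pandas as pd

# Illustrative data (an assumption, not taken from the question).
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [5, 1, 7], "C": [4, 2, 9], "D": [0, 0, 0], "E": [1, 1, 1]}
)

mask = df["C"] <= df["B"]

# Two separate indexing operations: pandas first builds an intermediate frame.
chained = df[mask].loc[:, "B":"E"]

# One indexing operation: rows and columns are resolved in a single call.
single = df.loc[mask, "B":"E"]

same_selection = chained.equals(single)
print(same_selection)
```

For reading, both forms select the same rows and columns; the difference matters once you assign through the result.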
Pandas: Subindexing dataframes: Copies vs views
Your answer lies in the pandas docs: returning-a-view-versus-a-copy.
Whenever an array of labels or a boolean vector are involved
in the indexing operation, the result will be a copy.
With single label / scalar indexing and slicing,
e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.
In your example, bar is a view of slices of foo. If you wanted a copy, you could have used the copy method. Modifying bar also modifies foo. At the version shown below (0.12), pandas does not appear to have a copy-on-write mechanism.
See my code example below to illustrate:
In [1]: import pandas as pd
...: import numpy as np
...: foo = pd.DataFrame(np.random.random((10,5)))
...:
In [2]: pd.__version__
Out[2]: '0.12.0.dev-35312e4'
In [3]: np.__version__
Out[3]: '1.7.1'
In [4]: # DataFrame has copy method
...: foo_copy = foo.copy()
In [5]: bar = foo.iloc[3:5,1:4]
In [6]: bar == foo.iloc[3:5,1:4] == foo_copy.iloc[3:5,1:4]
Out[6]:
1 2 3
3 True True True
4 True True True
In [7]: # Changing the view
...: bar.loc[3,1] = 5
In [8]: # View and DataFrame still equal
...: bar == foo.iloc[3:5,1:4]
Out[8]:
1 2 3
3 True True True
4 True True True
In [9]: # It is now different from a copy of original
...: bar == foo_copy.iloc[3:5,1:4]
Out[9]:
1 2 3
3 False True True
4 True True True
Checking whether data frame is copy or view in Pandas
Answers from HYRY and Marius in comments!
One can check either by:
- testing equivalence of the values.base attribute rather than the values attribute, as in: df.values.base is df2.values.base instead of df.values is df2.values.
- or using the (admittedly internal) _is_view attribute (df2._is_view is True).
Thanks everyone!
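Another way to look at the same question, sketched here under assumptions, is to ask NumPy whether two frames' arrays overlap in memory. The helper name shares_memory is hypothetical; np.shares_memory and DataFrame.to_numpy are the only library calls used. The result for a slice depends on the pandas version and memory layout, so it is printed rather than relied on.

```python
import numpy as np
import pandas as pd


def shares_memory(a, b):
    """Hypothetical helper: True if the two frames' arrays overlap in memory."""
    return bool(np.shares_memory(a.to_numpy(), b.to_numpy()))


df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))

explicit_copy = df.copy()
row_slice = df.iloc[1:3]

# An explicit copy owns its own data.
copy_shares = shares_memory(df, explicit_copy)

# Version-dependent: inspect the result instead of assuming it.
slice_shares = shares_memory(df, row_slice)

print(copy_shares, slice_shares)
```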
In Pandas, does .iloc method give a copy or view?
In general, you can get a view if the data-frame has a single dtype, which is not the case with your original data-frame:
In [4]: df
Out[4]:
age name
student1 21 Marry
student2 24 John
In [5]: df.dtypes
Out[5]:
age int64
name object
dtype: object
However, when you do:
In [6]: df.loc['student3'] = ['old','Tom']
...:
The first column gets coerced to object, since a column cannot hold mixed dtypes:
In [7]: df.dtypes
Out[7]:
age object
name object
dtype: object
In this case, the underlying .values will always return an array with the same underlying buffer, and changes to that array will be reflected in the data-frame:
In [11]: vals = df.values
In [12]: vals
Out[12]:
array([[21, 'Marry'],
[24, 'John'],
['old', 'Tom']], dtype=object)
In [13]: vals[0,0] = 'foo'
In [14]: vals
Out[14]:
array([['foo', 'Marry'],
[24, 'John'],
['old', 'Tom']], dtype=object)
In [15]: df
Out[15]:
age name
student1 foo Marry
student2 24 John
student3 old Tom
On the other hand, with mixed types like with your original data-frame:
In [26]: df = pd.DataFrame([{'name':'Marry', 'age':21},{'name':'John','age':24}]
...: ,index=['student1','student2'])
...:
In [27]: vals = df.values
In [28]: vals
Out[28]:
array([[21, 'Marry'],
[24, 'John']], dtype=object)
In [29]: vals[0,0] = 'foo'
In [30]: vals
Out[30]:
array([['foo', 'Marry'],
[24, 'John']], dtype=object)
In [31]: df
Out[31]:
age name
student1 21 Marry
student2 24 John
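The mixed-dtype case can be reproduced with a short script. This is a sketch using to_numpy() in place of .values, with the same two students as above: the int and string columns have to be combined into a newly built object array, so writing to that array leaves the frame alone.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [21, 24], "name": ["Marry", "John"]},
    index=["student1", "student2"],
)

# Columns of different dtypes are combined into a new object array.
vals = df.to_numpy()

# This changes the array only, not the frame it came from.
vals[0, 0] = "foo"

print(vals[0, 0])
print(df.loc["student1", "age"])
```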
Note, however, that a view will only be returned if it is possible to be a view, i.e. if it is a proper slice, otherwise, a copy will be made regardless of the dtypes:
In [39]: df.loc['student3'] = ['old','Tom']
In [40]: df2
Out[40]:
name
student3 Tom
student2 John
In [41]: df2.loc[:] = 'foo'
In [42]: df2
Out[42]:
name
student3 foo
student2 foo
In [43]: df
Out[43]:
age name
student1 21 Marry
student2 24 John
student3 old Tom
pandas dataframe view vs copy, how do I tell?
If your DataFrame has a simple column index, then there is no difference.
For example,
In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))
In [9]: df.loc[:, ['A','B']]
Out[9]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
In [10]: df.loc[:, ('A','B')]
Out[10]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
But if the DataFrame has a MultiIndex, there can be a big difference:
df = pd.DataFrame(np.random.randint(10, size=(5,4)),
columns=pd.MultiIndex.from_arrays([['foo']*2+['bar']*2,
list('ABAB')]),
index=pd.MultiIndex.from_arrays([['baz']*2+['qux']*3,
list('CDCDC')]))
# foo bar
# A B A B
# baz C 7 9 9 9
# D 7 5 5 4
# qux C 5 0 5 1
# D 1 7 7 4
# C 6 4 3 5
In [27]: df.loc[:, ('foo','B')]
Out[27]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64
In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
The KeyError is saying that the MultiIndex has to be lexsorted. If we do that, then we still get a different result:
In [29]: df.sortlevel(axis=1).loc[:, ('foo','B')]
Out[29]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64
In [30]: df.sortlevel(axis=1).loc[:, ['foo','B']]
Out[30]:
foo
A B
baz C 7 9
D 7 5
qux C 5 0
D 1 7
C 6 4
Why is that? df.sortlevel(axis=1).loc[:, ('foo','B')] is selecting the column where the first column level equals foo, and the second column level is B.
In contrast, df.sortlevel(axis=1).loc[:, ['foo','B']] is selecting the columns where the first column level is either foo or B. With respect to the first column level, there are no B columns, but there are two foo columns.
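The tuple-versus-list distinction can be sketched on a small deterministic frame. Two assumptions here: sort_index(axis=1) is used as the current replacement for sortlevel, and the list form is shown with only 'foo', because recent pandas versions may reject a list containing a label that is missing from the first level.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([["foo", "foo", "bar", "bar"], list("ABAB")])
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=columns)

# Lexsort the column levels before slicing.
sorted_df = df.sort_index(axis=1)

# Tuple: one key per column level, so a single column comes back as a Series.
one_column = sorted_df.loc[:, ("foo", "B")]

# List: labels matched against the first level, so a DataFrame comes back.
foo_block = sorted_df.loc[:, ["foo"]]

print(type(one_column).__name__, one_column.tolist())
print(foo_block.columns.tolist())
```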
I think the operating principle with Pandas is that if you use df.loc[...] as an expression, you should assume df.loc may be returning a copy or a view. The Pandas docs do not specify any rules about which you should expect.
However, if you make an assignment of the form
df.loc[...] = value
then you can trust Pandas to alter df itself.
The reason why the documentation warns about the distinction between views and copies is so that you are aware of the pitfall of using chain assignments of the form
df.loc[...][...] = value
Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then
df.loc[...][...] = value
is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well: there are no references to the copy, so there is no way to access it after the assignment statement completes, and (at least in CPython) it is therefore soon to be garbage collected.
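A small sketch of this pitfall, with made-up data: a boolean mask forces the intermediate result to be a copy, so the chained assignment is lost, while the single df.loc[...] = value form alters df. The warning filter is only there to keep the demonstration quiet.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
mask = df["A"] > 1

with warnings.catch_warnings():
    # Silence the chained-assignment warning for this demonstration only.
    warnings.simplefilter("ignore")
    # Writes to a temporary copy, which is then discarded.
    df.loc[mask]["B"] = 0

after_chained = df["B"].tolist()

# A single indexing call writes to df itself.
df.loc[mask, "B"] = 0
after_single = df["B"].tolist()

print(after_chained, after_single)
```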
I do not know of a practical fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.
However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):
- If the resultant NDFrame cannot be expressed as a basic slice of the underlying NumPy array, then it probably will be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy. A selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
- If the resultant NDFrame has columns of different dtypes, then df.loc will again probably return a copy.
However, there is an easy way to determine if x = df.loc[..] is a view a posteriori: simply see if changing a value in x affects df. If it does, x is a view; if not, x is a copy.
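The after-the-fact check can be wrapped in a small probe. This is only a sketch: write_reaches_parent is a hypothetical helper, it is destructive (it overwrites the child and possibly the parent), and the result for a slice depends on the pandas version, so only the explicit-copy case has a result you can count on.

```python
import warnings

import pandas as pd


def write_reaches_parent(parent, child, row, col):
    """Hypothetical helper: write into child and report whether parent changed.

    Destructive probe: it overwrites child and, if child is a view, parent too.
    """
    before = parent.loc[row, col]
    with warnings.catch_warnings():
        # A write to a slice may emit a warning; ignore it for the probe.
        warnings.simplefilter("ignore")
        child.loc[row, col] = before + 1
    return bool(parent.loc[row, col] != before)


df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# An explicit copy never writes through to the parent.
copy_result = write_reaches_parent(df, df.copy(), 0, "A")

# Version-dependent: print the result instead of assuming it.
slice_result = write_reaches_parent(df, df.iloc[0:2], 0, "A")

print(copy_result, slice_result)
```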
Select part of Dataframe to obtain a view, not a copy
The first point of importance is that loc (and by extension, iloc, at, and iat) will always return a copy.
If you want a view, you'll have to index frame via __getitem__. Now, even this isn't guaranteed to return a view or copy -- that is an implementation detail and it isn't easy to tell.
Between the following indexing operations,
frame2 = frame[frame.iloc[:,0] == 1]
frame3 = frame[frame > 0]
frame4 = frame2[[0, 1]]
frame2._is_view
# False
frame3._is_view
# True
frame4._is_view
# False
only frame3 will be a view. The specifics also depend on dtypes and other factors (such as the shape of the slice), but this is an obvious distinction.
Despite frame3 being a view, modifications to it may either work or not, but they will never result in a change to frame. The devs have put a lot of checks in place (most notably the SettingWithCopyWarning) to prevent unintended side effects arising from modifying views.
frame3.iloc[:, 1] = 12345
frame3
0 1 2
0 1 12345 3
1 1 12345 6
2 7 12345 9
frame
0 1 2
0 1 2 3
1 1 5 6
2 7 8 9
TL;DR: please look for a different way to do whatever it is you're trying to do.
why should I make a copy of a data frame in pandas
This expands on Paul's answer. In Pandas, indexing a DataFrame returns a reference to the initial DataFrame. Thus, changing the subset will change the initial DataFrame. Thus, you'd want to use the copy if you want to make sure the initial DataFrame shouldn't change. Consider the following code:
from pandas import DataFrame
df = DataFrame({'x': [1,2]})
df_sub = df[0:1]
df_sub.x = -1
print(df)
You'll get:
x
0 -1
1 2
In contrast, the following leaves df unchanged:
df_sub_copy = df[0:1].copy()
df_sub_copy.x = -1
This answer has been deprecated in newer versions of pandas. See docs
Force Return of View rather than copy in Pandas?
There are two parts to your question: (1) how to make a view (see bottom of this answer), and (2) how to make a copy.
I'll demonstrate with some example data:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[None,10,20],[7,8,9]], columns=['x','y','z'])
# which looks like this:
x y z
0 1 2 3
1 4 5 6
2 NaN 10 20
3 7 8 9
How to make a copy: One option is to explicitly copy your DataFrame after whatever operations you perform. For instance, let's say we are selecting rows that do not have NaN:
df2 = df[~df['x'].isnull()]
df2 = df2.copy()
Then, if you modify values in df2 you will find that the modifications do not propagate back to the original data (df), and that Pandas does not warn that "A value is trying to be set on a copy of a slice from a DataFrame"
df2['x'] *= 100
# original data unchanged
print(df)
x y z
0 1 2 3
1 4 5 6
2 NaN 10 20
3 7 8 9
# modified data
print(df2)
x y z
0 100 2 3
1 400 5 6
3 700 8 9
Note: you may take a performance hit by explicitly making a copy.
How to ignore warnings: Alternatively, in some cases you might not care whether a view or copy is returned, because your intention is to permanently modify the data and never go back to the original data. In this case, you can suppress the warning and go merrily on your way (just don't forget that you've turned it off, and that the original data may or may not be modified by your code, because df2 may or may not be a copy):
pd.options.mode.chained_assignment = None # default='warn'
For more information, see the answers at How to deal with SettingWithCopyWarning in Pandas?
How to make a view: Pandas will implicitly make views wherever and whenever possible. The key to this is to use the df.loc[row_indexer,col_indexer] method. For example, to multiply the values of column y by 100 for only the rows where column x is not null, we would write:
mask = ~df['x'].isnull()
df.loc[mask, 'y'] *= 100
# original data has changed
print(df)
x y z
0 1.0 200 3
1 4.0 500 6
2 NaN 10 20
3 7.0 800 9