Pandas: Subindexing Dataframes: Copies VS Views

Your answer lies in the pandas docs: returning-a-view-versus-a-copy.

Whenever an array of labels or a boolean vector are involved
in the indexing operation, the result will be a copy.
With single label / scalar indexing and slicing,
e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.

In your example, bar is a view of slices of foo, so modifying bar also modifies foo. If you wanted a copy, you could have used the copy method. pandas, at least in the version shown below, does not appear to have a copy-on-write mechanism.

See my code example below to illustrate:

In [1]: import pandas as pd
   ...: import numpy as np
   ...: foo = pd.DataFrame(np.random.random((10,5)))
   ...:

In [2]: pd.__version__
Out[2]: '0.12.0.dev-35312e4'

In [3]: np.__version__
Out[3]: '1.7.1'

In [4]: # DataFrame has copy method
   ...: foo_copy = foo.copy()

In [5]: bar = foo.iloc[3:5,1:4]

In [6]: bar == foo.iloc[3:5,1:4] == foo_copy.iloc[3:5,1:4]
Out[6]:
      1     2     3
3  True  True  True
4  True  True  True

In [7]: # Changing the view
   ...: bar.ix[3,1] = 5

In [8]: # View and DataFrame still equal
   ...: bar == foo.iloc[3:5,1:4]
Out[8]:
      1     2     3
3  True  True  True
4  True  True  True

In [9]: # It is now different from a copy of original
   ...: bar == foo_copy.iloc[3:5,1:4]
Out[9]:
       1     2     3
3  False  True  True
4   True  True  True

What rules does Pandas use to generate a view vs a copy?

Here are the rules; later ones override earlier ones:

  • All operations generate a copy.

  • If inplace=True is provided, the operation will modify in place; only some operations support this.

  • An indexer that sets, e.g. .loc/.iloc/.iat/.at, will set in place.

  • An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example from above is for .query; this will always return a copy, as it is evaluated by numexpr.)

  • An indexer that gets on a multiple-dtyped object is always a copy.
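The first three rules can be sketched with a small frame. This is only an illustration, not part of the original answer: the column names and values are made up, and sort_values is just one example of an operation that accepts inplace=True.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
df = pd.DataFrame({"a": [3, 1, 2], "b": [30, 10, 20]})

# Rule 1: an ordinary operation returns a new object and leaves df alone.
sorted_df = df.sort_values("a")
unchanged = df["a"].tolist()   # df keeps its original row order

# Rule 2: with inplace=True the same operation modifies df itself.
df.sort_values("a", inplace=True)

# Rule 3: setting through an indexer writes into df directly.
df.loc[0, "b"] = -1    # label-based: row labelled 0, column "b"
df.iat[0, 0] = 100     # position-based: first row, first column

print(df)
```

Note that after the in-place sort the row labelled 0 is no longer the first row, which is why the .loc and .iat assignments above land on different rows.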

Your example of chained indexing

df[df.C <= df.B].loc[:,'B':'E']

is not guaranteed to work (and thus you should never do this).

Instead do:

df.loc[df.C <= df.B, 'B':'E']

as this is faster and will always work.

Chained indexing is two separate Python operations, so pandas cannot reliably intercept it (you will often get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
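Here is a minimal sketch of the difference when assigning, assuming a small made-up frame with columns B through E. The chained form writes into the temporary copy produced by the boolean mask, so df is left untouched; the single .loc call writes into df. Warnings are silenced only to keep the output quiet, since which warning is emitted (if any) depends on the pandas version.

```python
import warnings

import pandas as pd

# Hypothetical data, chosen only for illustration.
df = pd.DataFrame({"B": [5, 1, 7], "C": [2, 4, 6], "D": [1, 1, 1], "E": [2, 2, 2]})
original = df.copy()
mask = df.C <= df.B            # True for rows 0 and 2

# Chained indexing: df[mask] is a copy (boolean indexing), so the
# assignment modifies that temporary object, not df.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[mask].loc[:, "B":"E"] = 0
chained_changed_df = not df.equals(original)

# Single indexing operation: the assignment goes straight into df.
df.loc[mask, "B":"E"] = 0

print(chained_changed_df)
print(df)
```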

Checking whether data frame is copy or view in Pandas

Answers from HYRY and Marius in comments!

One can check either by:

  • testing equivalence of the values.base attribute rather than the values attribute, as in:

    df.values.base is df2.values.base instead of df.values is df2.values.

  • or using the (admittedly internal) _is_view attribute (df2._is_view is True).
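As a related sketch (not one of the two checks above), NumPy's shares_memory can be applied to the values arrays. The helper name shares_data is hypothetical, and the check is only meaningful for single-dtyped frames, because values builds a fresh array for mixed dtypes. What a plain slice reports depends on the pandas version, so no result is claimed for it here.

```python
import numpy as np
import pandas as pd


def shares_data(a, b):
    """Hypothetical helper: True if the frames' values arrays overlap in memory."""
    return np.shares_memory(a.values, b.values)


# Single-dtyped frame, made up for illustration.
df = pd.DataFrame(np.arange(12.0).reshape(4, 3))

copy_shares = shares_data(df, df.copy())        # explicit copy
mask_shares = shares_data(df, df[df[0] > 3])    # boolean vector: a copy
slice_shares = shares_data(df, df.iloc[1:3])    # slice: version dependent

print(copy_shares, mask_shares, slice_shares)
```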

Thanks everyone!

Pandas: What is a view?

To understand what a view is, you have to know what an array is. An array is not only the "stuff" (items) you put in it. It also needs (among other things) information about the number of elements, the shape of the array and how to interpret the elements.

So an array would be an object at least containing these attributes:

class Series:
    data: int     # A pointer to where your array is stored
    size: int     # The number of items in your array
    shape: tuple  # The shape of your array
    dtype: type   # How to interpret the array

So when you create a view, a new array object is created but (and that's important) the view's data pointer points to the original array. It could be offset, but it still points to a memory location that belongs to the original array. Even though it shares some data with the original, the size, shape, dtype (and so on) might have changed, so it requires a new object. That's why they have different ids.

Think of it like windows. You have a garden (the array) and you have several windows. Each window is a different object, but all of them look out at the same garden (yours). OK, granted, with some slicing operations you would have more Escher-like windows, but a metaphor always lacks some details :-)
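The same idea can be sketched with plain NumPy, where the behaviour is well defined: basic slicing gives a view, while indexing with a list of positions gives a copy. The names garden, window and picked are just illustrative.

```python
import numpy as np

garden = np.arange(10)     # the original array (the "garden")
window = garden[2:6]       # basic slicing returns a view (a "window")

# The view is a different object with its own shape...
different_object = window is not garden
# ...but its data belongs to the original array.
same_data = window.base is garden

window[0] = 99             # writing through the view changes the original
print(garden[2])

picked = garden[[2, 3]]    # indexing with a list of positions returns a copy
picked[0] = -1             # the original is not affected
print(garden[2])
```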

How to create a view of dataframe in pandas?

You generally can't return a view.

Your answer lies in the pandas docs:
returning-a-view-versus-a-copy.

Whenever an array of labels or a boolean vector are involved in the
indexing operation, the result will be a copy. With single label /
scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view
will be returned.



