How can I tell if NumPy creates a view or a copy?
This question is very similar to a question that I asked a while back:
You can check the base
attribute.
a = np.arange(50)
b = a.reshape((5, 10))
print (b.base is a)
However, that's not perfect. You can also check to see if they share memory using np.may_share_memory
.
print (np.may_share_memory(a, b))
There's also the flags attribute that you can check:
print (b.flags['OWNDATA']) #False -- apparently this is a view
e = np.ravel(b[:, 2])
print (e.flags['OWNDATA']) #True -- Apparently this is a new numpy object.
But this last one seems a little fishy to me, although I can't quite put my finger on why...
View of a view of a numpy array is a copy?
Selection by basic slicing always returns a view. Selection by advanced
indexing always returns a copy. Selection by boolean mask is a form of advanced
indexing. (The other form of advanced indexing is selection by integer array.)
However, assignment by advanced indexing affects the original array.
So
mask = np.array([True, False, False])
arr[mask] = 0
affects arr
because it is an assignment. In contrast,
mask_1_arr = arr[mask_1]
is selection by boolean mask, so mask_1_arr
is a copy of part of arr
.
Once you have a copy, the jig is up. When Python executes
mask_2 = np.array([True])
mask_1_arr[mask_2] = 0
the assignment affects mask_1_arr
, but since mask_1_arr
is a copy,
it has no effect on arr
.
| | basic slicing | advanced indexing |
|------------+------------------+-------------------|
| selection | view | copy |
| assignment | affects original | affects original |
Under the hood, arr[mask] = something
causes Python to callarr.__setitem__(mask, something)
. The ndarray.__setitem__
method is
implemented to modify arr
. After all, that is the natural thing one should expect__setitem__
to do.
In contrast, as an expression arr[indexer]
causes Python to callarr.__getitem__(indexer)
. When indexer
is a slice, the regularity of the
elements allows NumPy to return a view (by modifying the strides and offset). When indexer
is an arbitrary boolean mask or arbitrary array of integers, there is in general
no regularity to the elements selected, so there is no way to return a
view. Hence a copy must be returned.
Which numpy index is copy and which is view?
It's true that in order to get a good grasp of what returns a view and what returns a copy, you need to be thorough with the documentation (which sometimes doesn't really mention it as well). I will not be able to provide you a complete set of operations and their output types (view or copy) however, maybe this could help you on your quest.
You can use np.shares_memory()
to check whether a function returns a view or a copy of the original array.
x = np.array([1, 2, 3, 4])
x1 = x
x2 = np.sqrt(x)
x3 = x[1:2]
x4 = x[1::2]
x5 = x.reshape(-1,2)
x6 = x[:,None]
x7 = x[None,:]
x8 = x6+x7
x9 = x5[1,0:2]
x10 = x5[[0,1],0:2]
print(np.shares_memory(x, x1))
print(np.shares_memory(x, x2))
print(np.shares_memory(x, x3))
print(np.shares_memory(x, x4))
print(np.shares_memory(x, x5))
print(np.shares_memory(x, x6))
print(np.shares_memory(x, x7))
print(np.shares_memory(x, x8))
print(np.shares_memory(x, x9))
print(np.shares_memory(x, x10))
True
False
True
True
True
True
True
False
True
False
Notice the last 2 advance+basic indexing examples. One is a view while other is a copy. The explaination of this difference as mentioned in the documentation (also provides insight on how these are implemented) is -
When there is at least one
slice (:)
,ellipsis (...)
ornewaxis
in the index (or the array has more dimensions than there are advanced indexes), then the behaviour can be more complicated. It is like concatenating the indexing result for each advanced index element
Is there a way to check if NumPy arrays share the same data?
I think jterrace's answer is probably the best way to go, but here is another possibility.
def byte_offset(a):
"""Returns a 1-d array of the byte offset of every element in `a`.
Note that these will not in general be in order."""
stride_offset = np.ix_(*map(range,a.shape))
element_offset = sum(i*s for i, s in zip(stride_offset,a.strides))
element_offset = np.asarray(element_offset).ravel()
return np.concatenate([element_offset + x for x in range(a.itemsize)])
def share_memory(a, b):
"""Returns the number of shared bytes between arrays `a` and `b`."""
a_low, a_high = np.byte_bounds(a)
b_low, b_high = np.byte_bounds(b)
beg, end = max(a_low,b_low), min(a_high,b_high)
if end - beg > 0:
# memory overlaps
amem = a_low + byte_offset(a)
bmem = b_low + byte_offset(b)
return np.intersect1d(amem,bmem).size
else:
return 0
Example:
>>> a = np.arange(10)
>>> b = a.reshape((5,2))
>>> c = a[::2]
>>> d = a[1::2]
>>> e = a[0:1]
>>> f = a[0:1]
>>> f = f.reshape(())
>>> share_memory(a,b)
80
>>> share_memory(a,c)
40
>>> share_memory(a,d)
40
>>> share_memory(c,d)
0
>>> share_memory(a,e)
8
>>> share_memory(a,f)
8
Here is a plot showing the time for each share_memory(a,a[::2])
call as a function of the number of elements in a
on my computer.
What's the difference between a view and a shallow copy of a numpy array?
Unlike a Python list object which contains references to the first level of element objects (which in turn may reference deeper levels of objects), a NumPy array references only a single data buffer which stores all the element values for all the dimensions of the array, and there is no hierarchy of element objects beyond this data buffer.
A shallow copy of a list would contain copies of the first level of element references, and share the referenced element objects with the original list. It is less obvious what a shallow copy of a NumPy array should contain. Should it (A) share the data buffer with the original, or (B) have its own copy (which effectively makes it a deep copy)?
A view of a NumPy array is a shallow copy in sense A, i.e. it references the same data buffer as the original, so changes to the original data affect the view data and vice versa.
The library function copy.copy() is supposed to create a shallow copy of its argument, but when applied to a NumPy array it creates a shallow copy in sense B, i.e. the new array gets its own copy of the data buffer, so changes to one array do not affect the other.
Here's some code showing different ways to copy/view NumPy arrays:
import numpy as np
import copy
x = np.array([10, 11, 12, 13])
# Create views of x (shallow copies sharing data) in 2 different ways
x_view1 = x.view()
x_view2 = x[:] # Creates a view using a slice
# Create full copies of x (not sharing data) in 2 different ways
x_copy1 = x.copy()
x_copy2 = copy.copy(x) # Calls x.__copy__() which creates a full copy of x
# Change some array elements to see what happens
x[0] = 555 # Affects x, x_view1, and x_view2
x_view1[1] = 666 # Affects x, x_view1, and x_view2
x_view2[2] = 777 # Affects x, x_view1, and x_view2
x_copy1[0] = 888 # Affects only x_copy1
x_copy2[0] = 999 # Affects only x_copy2
print(x) # [555 666 777 13]
print(x_view1) # [555 666 777 13]
print(x_view2) # [555 666 777 13]
print(x_copy1) # [888 11 12 13]
print(x_copy2) # [999 11 12 13]
The above example creates views of the entire original array index range and with the same array attributes as the original, which is not very interesting (could be replaced with a simple alias, e.g. x_alias = x). What makes views powerful is that they can be views of chosen parts of the original, and have different attributes. This is demonstrated in the next few lines of code which extend the above example:
x_view3 = x[::2].reshape(2,1) # Creates a reshaped view of every 2nd element of x
print(x_view3) # [[555]
# [777]]
x_view3[1] = 333 # Affects 2nd element of x_view3 and 3rd element of x
print(x) # [555 666 333 13]
print(x_view3) # [[555]
# [333]]
How does a numpy view know where the values it's referencing are in the original numpy array?
A NumPy array knows its base address, data type, shape, and strides. Most applications don't need to explicitly deal with the strides, but they are what make some of this work. The strides indicate how many bytes must be added to increment a given dimension by one logical unit (e.g. row).
If you start with a 3x3 array of float64 (aka f8
) at address 0x1000, and you want a view of the 2x2 subarray which starts in the center of the original, all you need is to increment the base address by 4 elements (3 for the entire first row, 1 to move from the left to the center of the middle row) and remember that the each row starts 24 bytes after the previous one (despite being only 16 bytes long).
Conceptually we go from this:
base=0x1000
shape=(3,3)
strides=(24,8)
dtype='f8'
To this:
base=0x1020 (added 1*24 + 1*8 for [1:,1:] view)
shape=(2,2)
strides=(24,8)
dtype='f8'
And the view takes these elements:
. . .
. 4 5
. 7 8
Some flags are adjusted on the view, such as the C_CONTIGUOUS flag which needs to be unset because the view is not a contiguous region anymore.
Strides not only support views of NumPy arrays, but also views of data structures which did not originate in NumPy. For example if you have a C array of structs and the first member of each is a point (x,y)
, you can construct a view of only these points by setting the stride to the size of the entire struct, despite the dtype being just the two numbers.
Numpy: views vs copy by slicing
All that matters is whether you slice by rows or by columns. Slicing by rows can return a view because it is a contiguous segment of the original array. Slicing by column must return a copy because it is not a contiguous segment. For example:
A1 A2 A3
B1 B2 B3
C1 C2 C3
By default, it is stored in memory this way:
A1 A2 A3 B1 B2 B3 C1 C2 C3
So if you want to choose every second row, it is:
[A1 A2 A3] B1 B2 B3 [C1 C2 C3]
That can be described as {start: 0, size: 3, stride: 6}
.
But if you want to choose every second column:
[A1] A2 [A3 B1] B2 [B3 C1] C2 [C3]
And there is no way to describe that using a single start, size, and stride. So there is no way to construct such a view.
If you want to be able to view every second column instead of every second row, you can construct your array in column-major aka Fortran order instead:
np.array(a, order='F')
Then it will be stored as such:
A1 B1 C1 A2 B2 C2 A3 B3 C3
How can I verify when a copy is made in Python?
You can use np.ndarray.flags
:
>>> a = np.arange(5)
>>> a.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
For example, you can set an array to not be writeable, by using np.setflags
; In that case an attempt to modify the array will fail:
>>> a.setflags(write=False) # sets the WRITEABLE flag to False
>>> a[2] = 10 # the modification will fail
ValueError: assignment destination is read-only
Another useful flag is the OWNDATA
, which for example can indicate that the array is in fact a view on another array, so does not own its data:
>>> a = np.arange(5)
>>> b = a[::2]
>>> a.flags['OWNDATA']
True
>>> b.flags['OWNDATA']
False
Checking whether data frame is copy or view in Pandas
Answers from HYRY and Marius in comments!
One can check either by:
testing equivalence of the
values.base
attribute rather than thevalues
attribute, as in:df.values.base is df2.values.base
instead ofdf.values is df2.values
.or using the (admittedly internal)
_is_view
attribute (df2._is_view
isTrue
).
Thanks everyone!
Related Topics
What Does the Term "Broadcasting" Mean in Pandas Documentation
Asyncio.Sleep() VS Time.Sleep()
Nonlocal Keyword in Python 2.X
Cast Base Class to Derived Class Python (Or More Pythonic Way of Extending Classes)
How to Increment a Shared Counter from Multiple Processes
Websocket VS Rest API for Real Time Data
Looping Over All Member Variables of a Class in Python
Split a List into Parts Based on a Set of Indexes in Python
Numpy: Find First Index of Value Fast
Check What Files Are Open in Python
Importing from a Relative Path in Python
Getting Only Element from a Single-Element List in Python
Check If a Process Is Running or Not on Windows
How to Create an Encrypted Zip File
Efficient Numpy 2D Array Construction from 1D Array