How to Tell If Numpy Creates a View or a Copy

How can I tell if NumPy creates a view or a copy?

This question is very similar to a question that I asked a while back:

You can check the base attribute.

import numpy as np

a = np.arange(50)
b = a.reshape((5, 10))
print(b.base is a)  # True: b is a view whose base is a

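However, that's not perfect: the test b.base is a fails whenever a is itself a view, or when both arrays are views of some third array. You can also check whether they share memory using np.may_share_memory.

print(np.may_share_memory(a, b))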
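
To see why the base check alone is not perfect, here is a small sketch (the names owner, a and b are illustrative, not from the original answer). When a is itself a view, NumPy records the array that owns the buffer as b.base, so the identity test comes back False even though the memory is shared.

```python
import numpy as np

owner = np.arange(50)      # owns the data buffer
a = owner[10:]             # a is already a view of owner
b = a.reshape((4, 10))     # b is a view of the same buffer

print(b.base is a)               # False: base refers to owner, not to a
print(b.base is owner)           # True
print(np.shares_memory(a, b))    # True: the data is shared all the same
```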

There's also the flags attribute that you can check:

print(b.flags['OWNDATA'])  # False: apparently this is a view
e = np.ravel(b[:, 2])
print(e.flags['OWNDATA'])  # True: apparently this is a new NumPy object

But this last one seems a little fishy to me, although I can't quite put my finger on why...
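
One possible reason OWNDATA feels fishy: it only says whether this array object owns its buffer, not which other array (if any) you still hold shares it. A minimal sketch, with illustrative variable names:

```python
import numpy as np

# The temporary 1-D array owns the buffer; c is a view of it even though
# no other name in the program refers to that buffer.
c = np.arange(6).reshape(2, 3)
print(c.flags['OWNDATA'])    # False
print(c.base is not None)    # True: the hidden 1-D array is kept alive as base

# A genuine copy owns its data.
d = c.copy()
print(d.flags['OWNDATA'])    # True
```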

View of a view of a numpy array is a copy?

Selection by basic slicing always returns a view. Selection by advanced
indexing always returns a copy. Selection by boolean mask is a form of advanced
indexing. (The other form of advanced indexing is selection by integer array.)

However, assignment by advanced indexing affects the original array.

So

mask = np.array([True, False, False])
arr[mask] = 0

affects arr because it is an assignment. In contrast,

mask_1_arr = arr[mask_1]

is selection by boolean mask, so mask_1_arr is a copy of part of arr.
Once you have a copy, the jig is up. When Python executes

mask_2 = np.array([True])
mask_1_arr[mask_2] = 0

the assignment affects mask_1_arr, but since mask_1_arr is a copy,
it has no effect on arr.


|            | basic slicing    | advanced indexing |
|------------+------------------+-------------------|
| selection  | view             | copy              |
| assignment | affects original | affects original  |
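
The table can be checked with a short runnable version of the snippets above. The contents of arr and mask_1 are not given in the original, so the values here are assumptions chosen only to make the example run.

```python
import numpy as np

arr = np.array([1, 2, 3])          # assumed contents

# Assignment through a boolean mask goes through __setitem__ and edits arr.
mask = np.array([True, False, False])
arr[mask] = 0
print(arr)                                  # [0 2 3]

# Selection through a boolean mask returns a copy.
mask_1 = np.array([False, True, False])     # assumed mask
mask_1_arr = arr[mask_1]
print(np.shares_memory(arr, mask_1_arr))    # False

# Assigning into the copy leaves arr untouched.
mask_2 = np.array([True])
mask_1_arr[mask_2] = 0
print(mask_1_arr)                           # [0]
print(arr)                                  # [0 2 3]
```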

Under the hood, arr[mask] = something causes Python to call
arr.__setitem__(mask, something). The ndarray.__setitem__ method is
implemented to modify arr. After all, that is the natural thing one should expect
__setitem__ to do.

In contrast, as an expression arr[indexer] causes Python to call
arr.__getitem__(indexer). When indexer is a slice, the regularity of the
elements allows NumPy to return a view (by modifying the strides and offset). When indexer
is an arbitrary boolean mask or arbitrary array of integers, there is in general
no regularity to the elements selected, so there is no way to return a
view. Hence a copy must be returned.
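
A rough illustration of the strides-and-offset point, assuming a small 1-D array: the slice reuses the buffer with a new start and step, while an integer-array index that picks the very same elements gets a fresh buffer.

```python
import numpy as np

arr = np.arange(10)

# Slice: same buffer, new offset and stride.
view = arr[2::3]
offset = (view.__array_interface__['data'][0]
          - arr.__array_interface__['data'][0])
print(offset == 2 * arr.itemsize)            # True: starts at element 2
print(view.strides == (3 * arr.itemsize,))   # True: steps 3 elements at a time
print(np.shares_memory(arr, view))           # True

# Integer-array index selecting the same elements: a new buffer.
picked = arr[[2, 5, 8]]
print(np.array_equal(view, picked))          # True: same values
print(np.shares_memory(arr, picked))         # False: but a copy
```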

Which numpy index is copy and which is view?

To get a good grasp of what returns a view and what returns a copy, you need to be thorough with the documentation (which sometimes doesn't mention it either). I can't provide a complete list of operations and their output types (view or copy), but maybe this will help you on your quest.

You can use np.shares_memory() to check whether a function returns a view or a copy of the original array.

x = np.array([1, 2, 3, 4])
x1 = x
x2 = np.sqrt(x)
x3 = x[1:2]
x4 = x[1::2]
x5 = x.reshape(-1,2)
x6 = x[:,None]
x7 = x[None,:]
x8 = x6+x7
x9 = x5[1,0:2]
x10 = x5[[0,1],0:2]

print(np.shares_memory(x, x1))
print(np.shares_memory(x, x2))
print(np.shares_memory(x, x3))
print(np.shares_memory(x, x4))
print(np.shares_memory(x, x5))
print(np.shares_memory(x, x6))
print(np.shares_memory(x, x7))
print(np.shares_memory(x, x8))
print(np.shares_memory(x, x9))
print(np.shares_memory(x, x10))

Output:

True
False
True
True
True
True
True
False
True
False

Notice the last two examples, which combine an index on the first axis with a slice on the second. One is a view while the other is a copy. The explanation of this difference, as given in the documentation (which also provides insight into how these are implemented), is:

When there is at least one slice (:), ellipsis (...) or newaxis in the index (or the array has more dimensions than there are advanced indexes), then the behaviour can be more complicated. It is like concatenating the indexing result for each advanced index element.
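
As a small sketch of where the line falls: a plain integer next to a slice is still basic indexing, while wrapping that same integer in a list makes it advanced indexing. The names row_basic and row_fancy are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
x5 = x.reshape(-1, 2)

row_basic = x5[1, 0:2]       # integer + slice: basic indexing
row_fancy = x5[[1], 0:2]     # list + slice: advanced indexing

print(row_basic, row_fancy)               # [3 4] [[3 4]]
print(np.shares_memory(x, row_basic))     # True  (view)
print(np.shares_memory(x, row_fancy))     # False (copy)
```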

Is there a way to check if NumPy arrays share the same data?

I think jterrace's answer is probably the best way to go, but here is another possibility.

def byte_offset(a):
    """Returns a 1-d array of the byte offset of every element in `a`.
    Note that these will not in general be in order."""
    stride_offset = np.ix_(*map(range, a.shape))
    element_offset = sum(i*s for i, s in zip(stride_offset, a.strides))
    element_offset = np.asarray(element_offset).ravel()
    return np.concatenate([element_offset + x for x in range(a.itemsize)])

def share_memory(a, b):
    """Returns the number of shared bytes between arrays `a` and `b`."""
    # Note: in NumPy 2.0 and later, use np.lib.array_utils.byte_bounds instead.
    a_low, a_high = np.byte_bounds(a)
    b_low, b_high = np.byte_bounds(b)

    beg, end = max(a_low, b_low), min(a_high, b_high)

    if end - beg > 0:
        # memory overlaps
        amem = a_low + byte_offset(a)
        bmem = b_low + byte_offset(b)

        return np.intersect1d(amem, bmem).size
    else:
        return 0

Example:

>>> a = np.arange(10)
>>> b = a.reshape((5,2))
>>> c = a[::2]
>>> d = a[1::2]
>>> e = a[0:1]
>>> f = a[0:1]
>>> f = f.reshape(())
>>> share_memory(a,b)
80
>>> share_memory(a,c)
40
>>> share_memory(a,d)
40
>>> share_memory(c,d)
0
>>> share_memory(a,e)
8
>>> share_memory(a,f)
8

Here is a plot showing the time for each share_memory(a,a[::2]) call as a function of the number of elements in a on my computer.

[Figure: share_memory function]
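
For everyday checks, NumPy's built-in functions cover most of this short of counting bytes. A minimal sketch of how they differ: np.may_share_memory only compares memory bounds by default, while np.shares_memory performs an exact check (which can be slow in hard cases).

```python
import numpy as np

a = np.arange(10)
c = a[::2]     # even positions
d = a[1::2]    # odd positions

# The byte ranges of c and d interleave, so a bounds check cannot rule
# out overlap, while the exact check can.
print(np.may_share_memory(c, d))   # True  (bounds overlap)
print(np.shares_memory(c, d))      # False (no element in common)
print(np.shares_memory(a, c))      # True
```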

What's the difference between a view and a shallow copy of a numpy array?

Unlike a Python list object which contains references to the first level of element objects (which in turn may reference deeper levels of objects), a NumPy array references only a single data buffer which stores all the element values for all the dimensions of the array, and there is no hierarchy of element objects beyond this data buffer.

A shallow copy of a list would contain copies of the first level of element references, and share the referenced element objects with the original list. It is less obvious what a shallow copy of a NumPy array should contain. Should it (A) share the data buffer with the original, or (B) have its own copy (which effectively makes it a deep copy)?

A view of a NumPy array is a shallow copy in sense A, i.e. it references the same data buffer as the original, so changes to the original data affect the view data and vice versa.

The library function copy.copy() is supposed to create a shallow copy of its argument, but when applied to a NumPy array it creates a shallow copy in sense B, i.e. the new array gets its own copy of the data buffer, so changes to one array do not affect the other.

Here's some code showing different ways to copy/view NumPy arrays:

import numpy as np
import copy

x = np.array([10, 11, 12, 13])

# Create views of x (shallow copies sharing data) in 2 different ways
x_view1 = x.view()
x_view2 = x[:] # Creates a view using a slice

# Create full copies of x (not sharing data) in 2 different ways
x_copy1 = x.copy()
x_copy2 = copy.copy(x) # Calls x.__copy__() which creates a full copy of x

# Change some array elements to see what happens
x[0] = 555 # Affects x, x_view1, and x_view2
x_view1[1] = 666 # Affects x, x_view1, and x_view2
x_view2[2] = 777 # Affects x, x_view1, and x_view2
x_copy1[0] = 888 # Affects only x_copy1
x_copy2[0] = 999 # Affects only x_copy2

print(x) # [555 666 777 13]
print(x_view1) # [555 666 777 13]
print(x_view2) # [555 666 777 13]
print(x_copy1) # [888 11 12 13]
print(x_copy2) # [999 11 12 13]

The above example creates views of the entire original array index range and with the same array attributes as the original, which is not very interesting (could be replaced with a simple alias, e.g. x_alias = x). What makes views powerful is that they can be views of chosen parts of the original, and have different attributes. This is demonstrated in the next few lines of code which extend the above example:

x_view3 = x[::2].reshape(2,1) # Creates a reshaped view of every 2nd element of x
print(x_view3) # [[555]
# [777]]
x_view3[1] = 333 # Affects 2nd element of x_view3 and 3rd element of x
print(x) # [555 666 333 13]
print(x_view3) # [[555]
# [333]]

How does a numpy view know where the values it's referencing are in the original numpy array?

A NumPy array knows its base address, data type, shape, and strides. Most applications don't need to explicitly deal with the strides, but they are what make some of this work. The strides indicate how many bytes must be added to increment a given dimension by one logical unit (e.g. row).

If you start with a 3x3 array of float64 (aka f8) at address 0x1000, and you want a view of the 2x2 subarray which starts in the center of the original, all you need is to advance the base address by 4 elements, i.e. 32 bytes (3 elements for the entire first row, 1 to move from the left to the center of the middle row), and remember that each row starts 24 bytes after the previous one (despite being only 16 bytes long).

Conceptually we go from this:

base=0x1000
shape=(3,3)
strides=(24,8)
dtype='f8'

To this:

base=0x1020 (added 1*24 + 1*8 for [1:,1:] view)
shape=(2,2)
strides=(24,8)
dtype='f8'

And the view takes these elements:

. . .
. 4 5
. 7 8

Some flags are adjusted on the view, such as the C_CONTIGUOUS flag which needs to be unset because the view is not a contiguous region anymore.
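
The numbers above can be reproduced directly. This is a sketch that reads the data pointers through __array_interface__ to recover the offset; the absolute addresses will of course differ from 0x1000.

```python
import numpy as np

a = np.arange(9, dtype='f8').reshape(3, 3)
v = a[1:, 1:]

offset = (v.__array_interface__['data'][0]
          - a.__array_interface__['data'][0])
print(offset)                     # 32, i.e. 1*24 + 1*8
print(a.strides, v.strides)       # (24, 8) (24, 8)
print(v)                          # [[4. 5.]
                                  #  [7. 8.]]
print(a.flags['C_CONTIGUOUS'], v.flags['C_CONTIGUOUS'])   # True False
```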

Strides not only support views of NumPy arrays, but also views of data structures which did not originate in NumPy. For example if you have a C array of structs and the first member of each is a point (x,y), you can construct a view of only these points by setting the stride to the size of the entire struct, despite the dtype being just the two numbers.
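
A rough sketch of that idea, staying in Python: a bytearray stands in for the C array, and the record layout (two float64 coordinates followed by an int64 id, 24 bytes in total) is a hypothetical example, not something from the original answer.

```python
import numpy as np

# Hypothetical record layout: a point (x, y) followed by an integer id,
# 24 bytes per record.
record = np.dtype([('x', 'f8'), ('y', 'f8'), ('id', 'i8')])
buf = bytearray(3 * record.itemsize)          # stand-in for the C array

records = np.frombuffer(buf, dtype=record)    # structured view of the buffer
records['x'] = [1.0, 2.0, 3.0]
records['y'] = [10.0, 20.0, 30.0]
records['id'] = [7, 8, 9]

# View only the points: step a whole record per row, 8 bytes per column.
points = np.ndarray(shape=(3, 2), dtype='f8', buffer=buf,
                    strides=(record.itemsize, 8))
print(points)            # rows are (x, y) pairs; the ids are skipped

points[0, 0] = -1.0      # writes through to the shared buffer
print(records['x'][0])   # -1.0
```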

Numpy: views vs copy by slicing

What matters is whether the selection can be described by a start offset and one fixed stride per axis. Basic slicing, by rows or by columns, always can, so it returns a view. Indexing with a list or array of indices (for example a[:, [0, 2]]) is advanced indexing and returns a copy, because arbitrary indices need not follow any regular pattern. For example:

A1 A2 A3
B1 B2 B3
C1 C2 C3

By default, it is stored in memory this way:

A1 A2 A3 B1 B2 B3 C1 C2 C3

So if you want to choose every second row, it is:

[A1 A2 A3] B1 B2 B3 [C1 C2 C3]

That can be described as {start: 0, size: 3, stride: 6}.

But if you want to choose every second column:

[A1] A2 [A3 B1] B2 [B3 C1] C2 [C3]

There is no way to describe that as contiguous runs with a single start, size, and stride. NumPy does not need to, though: it keeps one stride per axis, so a[:, ::2] is still a view, with a row stride of 3 elements and a column stride of 2 elements. It just isn't contiguous.

If you want the elements of each column to be adjacent in memory, you can construct your array in column-major aka Fortran order instead:

np.array(a, order='F')

Then it will be stored as such:

A1 B1 C1 A2 B2 C2 A3 B3 C3
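
A quick check of how NumPy itself treats these selections, as a sketch on a 3x3 integer array. Note that basic slicing by columns comes back as a view as well, because NumPy keeps one stride per axis; only the index-list version is a copy.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)

rows = a[::2]          # every second row (basic slicing)
cols = a[:, ::2]       # every second column (basic slicing)
picked = a[:, [0, 2]]  # same columns via an index list (advanced indexing)

print(np.shares_memory(a, rows))     # True
print(np.shares_memory(a, cols))     # True: a view, just not contiguous
print(cols.flags['C_CONTIGUOUS'])    # False
print(np.shares_memory(a, picked))   # False: a copy

# In Fortran order a single column is a contiguous run of memory.
f = np.array(a, order='F')
print(a[:, 0].flags['C_CONTIGUOUS'], f[:, 0].flags['C_CONTIGUOUS'])  # False True
```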

How can I verify when a copy is made in Python?

You can use np.ndarray.flags:

>>> a = np.arange(5)
>>> a.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

For example, you can make an array read-only with the array's setflags method (ndarray.setflags); an attempt to modify the array will then fail:

>>> a.setflags(write=False)  # sets the WRITEABLE flag to False
>>> a[2] = 10 # the modification will fail
ValueError: assignment destination is read-only
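
The read-only flag can also serve as a rough guard against accidental writes through views. A minimal sketch: a view taken from a read-only array is read-only too, while a copy gets its own writable buffer.

```python
import numpy as np

a = np.arange(5)
a.setflags(write=False)

v = a[::2]                       # a view taken from the read-only array
print(v.flags['WRITEABLE'])      # False: the view is read-only too

try:
    v[0] = 10
except ValueError as exc:
    print("blocked:", exc)

c = a.copy()                     # a copy gets its own writable buffer
c[0] = 10
print(a[0], c[0])                # 0 10
```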

Another useful flag is OWNDATA, which can indicate that the array is in fact a view on another array and so does not own its data:

>>> a = np.arange(5)
>>> b = a[::2]
>>> a.flags['OWNDATA']
True
>>> b.flags['OWNDATA']
False

Checking whether data frame is copy or view in Pandas

Answers from HYRY and Marius in comments!

One can check either by:

  • testing equivalence of the values.base attribute rather than the values attribute, as in:

    df.values.base is df2.values.base instead of df.values is df2.values.

  • or using the (admittedly internal) _is_view attribute (df2._is_view is True).

Thanks everyone!
