Python/Numpy First Occurrence of Subarray

Python/NumPy first occurrence of subarray

I'm assuming you're looking for a numpy-specific solution, rather than a simple list comprehension or for loop. One straightforward approach is to use the rolling window technique to search for windows of the appropriate size.

This approach is simple, works correctly, and is much faster than any pure Python solution. It should be sufficient for many use cases. However, it is not the most efficient approach possible, for a number of reasons. For an approach that is more complicated, but asymptotically optimal in the expected case, see the numba-based rolling hash implementation in norok2's answer.

Here's the rolling_window function:

>>> def rolling_window(a, size):
...     shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
...     strides = a.strides + (a.strides[-1],)
...     return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
...

Then you could do something like

>>> a = numpy.arange(10)
>>> numpy.random.shuffle(a)
>>> a
array([7, 3, 6, 8, 4, 0, 9, 2, 1, 5])
>>> rolling_window(a, 3) == [8, 4, 0]
array([[False, False, False],
       [False, False, False],
       [False, False, False],
       [ True,  True,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

To make this really useful, you'd have to reduce it along axis 1 using all:

>>> numpy.all(rolling_window(a, 3) == [8, 4, 0], axis=1)
array([False, False, False, True, False, False, False, False], dtype=bool)

Then you could use that however you'd use a boolean array. A simple way to get the index out:

>>> bool_indices = numpy.all(rolling_window(a, 3) == [8, 4, 0], axis=1)
>>> numpy.mgrid[0:len(bool_indices)][bool_indices]
array([3])

For lists you could adapt one of these rolling window iterators to use a similar approach.
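Putting the pieces together, a hedged end-to-end sketch (the helper name find_first_subarray is my own) could look like this:

```python
import numpy as np

def rolling_window(a, size):
    """View a as overlapping windows of the given size (no copy)."""
    shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def find_first_subarray(a, sub):
    """Index of the first occurrence of sub in a, or -1 if absent."""
    sub = np.asarray(sub)
    if len(sub) > len(a):
        return -1
    hits = np.all(rolling_window(a, len(sub)) == sub, axis=1)
    idx = hits.nonzero()[0]
    return int(idx[0]) if idx.size else -1

a = np.array([7, 3, 6, 8, 4, 0, 9, 2, 1, 5])
print(find_first_subarray(a, [8, 4, 0]))  # -> 3
print(find_first_subarray(a, [1, 2, 3]))  # -> -1
```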

For very large arrays and subarrays, you could save memory like this:

>>> windows = rolling_window(a, 3)
>>> sub = [8, 4, 0]
>>> hits = numpy.ones((len(a) - len(sub) + 1,), dtype=bool)
>>> for i, x in enumerate(sub):
...     hits &= numpy.in1d(windows[:,i], [x])
...
>>> hits
array([False, False, False, True, False, False, False, False], dtype=bool)
>>> hits.nonzero()
(array([3]),)

On the other hand, this will probably be somewhat slower.

Is there a NumPy function to return the first index of something in an array?

Yes: given an array, array, and a value, item, to search for, you can use np.where:

itemindex = numpy.where(array == item)

The result is a tuple with first all the row indices, then all the column indices.

For example, if the array is two-dimensional and contains your item at two locations, then

array[itemindex[0][0]][itemindex[1][0]]

would be equal to your item, and so would:

array[itemindex[0][1]][itemindex[1][1]]
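A quick sketch of what that tuple looks like for a small 2-D array:

```python
import numpy as np

array = np.array([[1, 2, 3],
                  [4, 3, 6]])
item = 3

itemindex = np.where(array == item)
print(itemindex)  # (array([0, 1]), array([2, 1])): row indices first, then columns
print(array[itemindex[0][0], itemindex[1][0]])  # -> 3, the first occurrence
```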

Python/NumPy first occurrence of masked subarray

Use np.ma.equal instead of == - see end.

========================

A masked array consists of a data array and a mask array. Often the masked array is used in other operations by 'filling' the masked values with something innocuous, or by compressing them out. I'm not entirely sure what's going on with this == test, but let's look at the calculations.

Your striding produces an array:

In [614]: A
Out[614]:
array([[1, 2, 3],
[2, 3, 4],
[3, 4, 5]])

In [615]: b
Out[615]: array([2, 3, 4])

In [612]: A==b
Out[612]:
array([[False, False, False],
       [ True,  True,  True],
       [False, False, False]], dtype=bool)

The masked array has data and mask

In [616]: c
Out[616]:
masked_array(data = [2 -- 4],
mask = [False True False],
fill_value = 999999)
In [617]: c.data
Out[617]: array([ 2, 99, 4])
In [618]: c.mask
Out[618]: array([False, True, False], dtype=bool)
In [619]: (A==c).data
Out[619]:
array([[False, False, False],
       [ True, False,  True],
       [False, False, False]], dtype=bool)

This data is what we'd expect from A==c.data. The center 99 does not match.

But it looks like the mask is applied to the whole boolean array as though c were a column array - it's masking the 2nd row, rather than the 2nd column.

In [624]: A==c
Out[624]:
masked_array(data =
[[False False False]
[-- -- --]
[False False False]],
mask =
[False True False],
fill_value = True)

My first impression is that that is an error. But I'll have to dig more.

The data of A==c is 2d, but the mask is 1d.

If I replicate c to 3 rows, I get the desired results:

In [638]: c[None,:]*np.array([1,1,1])[:,None]
Out[638]:
masked_array(data =
[[2 -- 4]
[2 -- 4]
[2 -- 4]],
mask =
[[False True False]
[False True False]
[False True False]],
fill_value = 999999)
In [639]: c1=c[None,:]*np.array([1,1,1])[:,None]
In [640]: A==c1
Out[640]:
masked_array(data =
[[False -- False]
[True -- True]
[False -- False]],
mask =
[[False True False]
[False True False]
[False True False]],
fill_value = True)
In [641]: (A==c1).all(axis=1)
Out[641]:
masked_array(data = [False True False],
mask = [False False False],
fill_value = True)

I don't know if there's a cleaner way of doing this, but it indicates the direction such a solution needs to take.

============

np.ma.equal does what we want (== comparison with correct mask)

In [645]: np.ma.equal(A,c)
Out[645]:
masked_array(data =
[[False -- False]
[True -- True]
[False -- False]],
mask =
[[False True False]
[False True False]
[False True False]],
fill_value = 999999)
In [646]: np.ma.equal(A,c).any(axis=1)
Out[646]:
masked_array(data = [False True False],
mask = [False False False],
fill_value = True)

np.ma.equal is a mask-aware version of np.equal, which is a ufunc version of the element-by-element == operator.
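Combining this with a windowed view gives a mask-aware subarray search. A sketch, assuming NumPy 1.20+ for sliding_window_view (older versions can use the as_strided recipe earlier on this page):

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])
windows = np.lib.stride_tricks.sliding_window_view(A, 3)  # [[1,2,3],[2,3,4],[3,4,5]]

# masked pattern: match 2 and 4, ignore the middle element
c = np.ma.masked_array([2, 99, 4], mask=[False, True, False])

eq = np.ma.equal(windows, c)   # comparison is masked wherever c is masked
hits = eq.all(axis=1)          # masked entries are ignored by all()
print(hits.nonzero()[0])       # -> [1]: the window starting at index 1 matches
```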

Optimal way to find index of first occurrence of subarray in each frame of batch data without for loop

There is no simple numpy solution for this. However, if you really need it to be fast, you can do the following using numba:

The function find_first does basically what you would do with a for loop, but because numba compiles it, it runs much faster.
Then you just apply the method to each row of the batch using np.apply_along_axis:

import numpy as np
from numba import jit

@jit(nopython=True)
def find_first(seq, arr):
    """Return the index of the first occurrence of seq in arr, or -1."""
    for i in range(len(arr) - len(seq) + 1):
        if np.all(seq == arr[i:i + len(seq)]):
            return i
    return -1

# construct test array
test = np.round(np.random.random((64, 400)))

# this will give you the array of indices
np.apply_along_axis(lambda m: find_first(np.array([1, 1, 1]), m), axis=1, arr=test)

I modified the method from this answer

Deleting numpy subarray based off of first element in subarray

A possible solution using mode:

>>> from scipy.stats import mode
>>> eps = 5
>>> most_freq = mode(circles[:, 0])[0][0]
>>> mask = np.abs(circles[:, 0] - most_freq) <= eps
>>> circles[mask]
array([[288, 300,  25],
       [288, 362,  25],
       [288, 238,  24]])

Edit:
If your circles array is limited to non-negative integers, you can use the following expression for most_freq:

most_freq = np.bincount(circles[:, 0]).argmax()
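For example (assuming small non-negative values in the first column):

```python
import numpy as np

col = np.array([288, 288, 150, 288, 42])
counts = np.bincount(col)      # counts[v] = number of times value v occurs
most_freq = counts.argmax()    # first value with the highest count
print(most_freq)  # -> 288
```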

Efficiently count the number of occurrences of unique subarrays in NumPy?

The question states that the input array is of shape (128, 36, 8) and we are interested in finding unique subarrays of length 8 in the last dimension.
So, I am assuming that the uniqueness is along the first two dimensions merged together. Let us assume A is the input 3D array.

Get the number of unique subarrays

# Reshape the 3D array to a 2D array merging the first two dimensions
Ar = A.reshape(-1,A.shape[2])

# Perform lex sort and get the sorted indices and xy pairs
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx,:]

# Get the count of rows that have at least one TRUE value
# indicating presence of unique subarray there
unq_out = np.any(np.diff(sorted_Ar,axis=0),1).sum()+1

Sample run -

In [159]: A # A is (2,2,3)
Out[159]:
array([[[0, 0, 0],
        [0, 0, 2]],

       [[0, 0, 2],
        [2, 0, 1]]])

In [160]: unq_out
Out[160]: 3

Get the count of occurrences of unique subarrays

# Reshape the 3D array to a 2D array merging the first two dimensions
Ar = A.reshape(-1,A.shape[2])

# Perform lex sort and get the sorted indices and xy pairs
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx,:]

# Get IDs for each element based on their uniqueness
id = np.append([0],np.any(np.diff(sorted_Ar,axis=0),1).cumsum())

# Get counts for each ID as the final output
unq_count = np.bincount(id)

Sample run -

In [64]: A
Out[64]:
array([[[0, 0, 2],
        [1, 1, 1]],

       [[1, 1, 1],
        [1, 2, 0]]])

In [65]: unq_count
Out[65]: array([1, 2, 1], dtype=int64)
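As an aside, on NumPy 1.13+ np.unique with axis=0 can get both results in one call; a sketch on the same sample data:

```python
import numpy as np

A = np.array([[[0, 0, 2],
               [1, 1, 1]],
              [[1, 1, 1],
               [1, 2, 0]]])
Ar = A.reshape(-1, A.shape[2])

# unique rows and their occurrence counts in one shot
uniq, counts = np.unique(Ar, axis=0, return_counts=True)
print(len(uniq))   # -> 3 unique subarrays
print(counts)      # -> [1 2 1]
```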

'Subarray' as condition to numpy.where()

np.cumsum only provides a solution specific to this problem; I'll try addressing a more general solution for any kind of pattern. You could try thinking of this as a sort of string matching problem. You have a big string (say, your array of 1s and 0s) and a particular noise you want to find, which is 11111. Moreover, you want to find that index where the pattern first appears. This can be done easily in a single line of code, in several ways.

import re
import numpy as np

def find_idx_of_first_noise(A, N):
    return re.search(''.join(N.astype(str)), ''.join(A.astype(str))).start()

A = np.array([0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1])
N = np.array([1, 1, 1, 1, 1])

print(find_idx_of_first_noise(A, N))

Out:

7

A and N are numpy arrays of integers, so I'm converting them to arrays of strings with .astype(str). I then join the entire arrays of strings into a single string by calling ''.join(). Thus effectively, I'm running the line: re.search('11111','0100100111111111').start(), which finds 11111 in A, and gives me the index of its first occurrence.

Another Pythonic tool worth knowing is the list comprehension. I'll do the above task again in a single line of code:

print([i for i in range(len(A)-len(N)+1) if (A[i:i+len(N)]==N).all()][0])

Out:

7

Although convenient, list comprehensions are still a brute force method; it's basically a for loop inside a list.

Now the fastest and most Pythonic method, in my opinion, is to use tobytes (tostring in older NumPy is a deprecated alias):

print(A.tobytes().index(N.tobytes())//A.itemsize)

Out:

7

Make the numpy arrays into byte strings, then use .index to find the position of whatever pattern/noise you have. Divide by the size of the items in A, and you have your result.
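One caveat with the byte-level trick: .index can match a byte pattern that starts in the middle of an element, giving a misaligned false hit. A defensive sketch (the helper name find_via_bytes is my own) that only accepts matches on element boundaries:

```python
import numpy as np

def find_via_bytes(A, N):
    """Byte-level subarray search that rejects misaligned matches."""
    a_bytes, n_bytes = A.tobytes(), N.tobytes()
    pos = a_bytes.find(n_bytes)
    while pos != -1 and pos % A.itemsize != 0:
        # a hit straddling element boundaries is spurious; keep looking
        pos = a_bytes.find(n_bytes, pos + 1)
    return pos // A.itemsize if pos != -1 else -1

A = np.array([0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1])
N = np.array([1, 1, 1, 1, 1])
print(find_via_bytes(A, N))  # -> 7
```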

Count the number of exact co-occurences of items in a numpy list

You could do this using np.diff and np.sum:

import numpy as np

mylist = [1, 5, 4, 1, 2, 4, 6, 7, 2, 1, 3, 3, 1, 2]

# Turn your list into a numpy array
myarray = np.array(mylist)

# find occurrences where myarray is 2 and the following element is 1 (i.e. diff == -1)
np.sum((myarray[:-1] == 2) & (np.diff(myarray) == -1))

Which returns 1

Timings on a large array:

On a small list, the time difference between an iterative method and numpy methods will not be noticeable. But on a large array, as in the example below, the performance of numpy is much better.

import timeit

mylist = np.random.choice(range(0,9), 1000000)

def np_method(mylist=mylist):
    return np.sum((mylist[:-1] == 2) & (np.diff(mylist) == -1))

def zip_loop(a=mylist):
    return len([1 for i, j in zip(a, a[1:]) if i == 2 and j == 1])

def for_loop(list1=mylist):
    count = 0
    desired_num = 2
    follower_num = 1
    for i in range(len(list1) - 1):
        if list1[i] == desired_num:
            if list1[i + 1] == follower_num:
                count += 1
    return count

>>> timeit.timeit(np_method, number = 100) / 100
0.006748438189970329

>>> timeit.timeit(zip_loop, number = 100) / 100
0.3811768989200209

>>> timeit.timeit(for_loop, number = 100) / 100
0.3774999916599336

