Testing Whether a Numpy Array Contains a Given Row

NumPy's __contains__ is, at the time of writing, equivalent to (a == b).any(), which is arguably only correct if b is a scalar. It is a bit hairy, but I believe the right general method would be (a == b).all(np.arange(a.ndim - b.ndim, a.ndim)).any(), which makes sense for all combinations of a and b dimensionality. Note that reducing over several axes at once like this only works in NumPy 1.7 or later.

EDIT: Just to be clear, this is not necessarily the expected result when broadcasting is involved. Someone might also argue that it should handle the items of a separately, as np.in1d does. I am not sure there is one clear way it should work.
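To make the difference concrete, here is a small sketch comparing the `in` operator with the row-wise reduction described above. The helper name `contains_subarray` is made up for illustration, and it passes the axes as a tuple, which is the documented form for a multi-axis reduction. It assumes b has no more dimensions than a.

```python
import numpy as np

def contains_subarray(a, b):
    """Return True if `b` occurs in `a` along the trailing axes (hypothetical helper)."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Reduce over the trailing axes of `a` that `b` spans
    trailing_axes = tuple(range(a.ndim - b.ndim, a.ndim))
    return bool((a == b).all(axis=trailing_axes).any())

a = np.array([[1, 2], [3, 4]])

# `in` only asks whether any single element comparison is True, so this
# is True even though [1, 4] is not a row of `a`.
elementwise = [1, 4] in a

print(elementwise)                    # True
print(contains_subarray(a, [1, 4]))   # False
print(contains_subarray(a, [3, 4]))   # True
print(contains_subarray(a, 4))        # True (scalar case)
```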

Now you want NumPy to stop when it finds the first occurrence. As far as I know, this does not exist at this time. It is difficult because NumPy is based mostly on ufuncs, which do the same thing over the whole array.
NumPy does optimize these kinds of reductions, but effectively that only works when the array being reduced is already a boolean array (e.g. np.ones(10, dtype=bool).any()).

Otherwise it would need a special function for __contains__, which does not exist. That may seem odd, but remember that NumPy supports many data types and has a fair amount of machinery to select the correct type and the correct function to work on it. In other words, the ufunc machinery cannot do it, and implementing __contains__ or something similar as a special case is not trivial because of data types.

You can of course write it in Python or, since you probably know your data type, write it yourself in Cython/C, which is very simple.
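As a rough illustration of the pure-Python route, the sketch below loops over the rows and returns at the first match. The function name `contains_row` is hypothetical, and it assumes a is 2-D. It stops early, but each step pays Python-level overhead, so it is not necessarily faster than the vectorized test.

```python
import numpy as np

def contains_row(a, row):
    """Return True as soon as one row of the 2-D array `a` equals `row`."""
    row = np.asarray(row)
    for candidate in a:
        # array_equal compares one row at a time, so the loop can stop early
        if np.array_equal(candidate, row):
            return True
    return False

a = np.arange(12).reshape(4, 3)
print(contains_row(a, [3, 4, 5]))  # True
print(contains_row(a, [3, 4, 6]))  # False
```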


That said, it is often much better anyway to use a sorting-based approach for these things. That is a little tedious as well, as there is no such thing as a searchsorted for a lexsort, but it works (you could also abuse scipy.spatial.cKDTree if you like). This assumes you want to compare along the last axis only:

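# Unfortunately you need to use structured arrays:
a_sorted = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[-1]).ravel()

# Actually at this point, you can also use np.in1d, if you already have many b
# then that is even better.

# np.sort returns a sorted copy, so the rows of a itself are not reordered
a_sorted = np.sort(a_sorted)

b_comp = np.ascontiguousarray(b).view(a_sorted.dtype)
ind = a_sorted.searchsorted(b_comp)
# searchsorted returns len(a_sorted) for values past the end; clip before indexing
ind = np.minimum(ind, len(a_sorted) - 1)

result = a_sorted[ind] == b_comp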

This also works for an array b, and if you keep the sorted array around, it is also much better when you test a single value (row) of b at a time while a stays the same (otherwise I would just use np.in1d after viewing it as a recarray). Important: you must call np.ascontiguousarray for safety. It will typically do nothing, but when it does do something, skipping it would be a big potential bug.
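The idea of keeping the sorted array around can be sketched as a small closure that sorts once and answers repeated queries. The names `make_row_lookup` and `rows_present` are invented for this example. The sketch assumes a is non-empty and 2-D, and it casts b to the dtype of a so that both views describe the same bytes.

```python
import numpy as np

def make_row_lookup(a):
    """Sort the rows of `a` once; return a function that tests rows against them."""
    a = np.ascontiguousarray(a)
    row_dtype = [('', a.dtype)] * a.shape[-1]
    # np.sort returns a copy, so `a` itself is left untouched
    a_sorted = np.sort(a.view(row_dtype).ravel())

    def rows_present(b):
        # View each row of `b` as one structured element with the same layout
        b_comp = np.ascontiguousarray(b, dtype=a.dtype).view(a_sorted.dtype).ravel()
        ind = a_sorted.searchsorted(b_comp)
        # searchsorted may return len(a_sorted); clip before indexing
        ind = np.minimum(ind, len(a_sorted) - 1)
        return a_sorted[ind] == b_comp

    return rows_present

a = np.arange(12).reshape(4, 3)
lookup = make_row_lookup(a)
found = lookup([[3, 4, 5], [3, 4, 6], [100, 0, 0]])
print(found.tolist())  # [True, False, False]
```

Each call to `lookup` costs a binary search per query row, while the sort is paid only once.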

How do you Check if each Row of a Numpy Array is Contained in a Secondary Array?

Broadcasting is an option:

import numpy as np

array = np.array([[1, 2, 4], [3, 5, 1], [5, 5, 1], [1, 2, 1]])

check_array = np.array([[1, 2, 4], [1, 2, 1]])
is_in_check = (check_array[:, None] == array).all(axis=2).any(axis=0)

Produces:

[ True False False  True]

Broadcasting the other way:

is_in_check = (check_array == array[:, None]).all(axis=2).any(axis=1)

Also produces:

[ True False False  True]

np.isin - testing whether a Numpy array contains a given row considering the order

For small b (fewer than 100 rows), try this instead:

a[(a[:, :3] == b[:, None]).all(axis=-1).any(axis=0)]

Example:

a = np.array([[1, 0, 5, 0],
              [1, 2, 3, 1000],
              [2, 1, 3, 2000],
              [0, 0, 1, 1]])

b = np.array([[1, 2, 3], [0, 0, 1]])

>>> a[(a[:, :3] == b[:, None]).all(axis=-1).any(axis=0), 3]
array([1000,    1])

Explanation:

The key is to "distribute" equality tests for all rows of a (the first 3 columns) to all rows of b:

# on the example above

>>> a[:, :3] == b[:, None]
array([[[ True, False, False],
        [ True,  True,  True],   # <-- a[1, :3] matches b[0]
        [False, False,  True],
        [False, False, False]],

       [[False,  True, False],
        [False, False, False],
        [False, False, False],
        [ True,  True,  True]]])  # <-- a[3, :3] matches b[1]

Be warned that this can be large: the shape is (len(b), len(a), 3).

Then the first .all(axis=-1) means that we want entire rows to match:

>>> (a[:, :3] == b[:, None]).all(axis=-1)
array([[False,  True, False, False],
       [False, False, False,  True]])

The final bit .any(axis=0) means: "match any row in b":

>>> (a[:, :3] == b[:, None]).all(axis=-1).any(axis=0)
array([False, True, False, True])

I.e.: "a[1, :3] matches some row(s) of b, as does a[3, :3]".

Finally, use this as a mask on a and take column 3.

Note on performance

The technique above distributes equality for the product of the rows of a over the rows of b. This can be slow and use a large amount of memory if both a and b have many rows.
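One way to bound the memory use while keeping the broadcasting approach is to process a in blocks. This is only a sketch of that idea, not part of the original technique. The function name `np_rows_in_chunked` and the `chunk_rows` parameter are hypothetical, and a suitable chunk size depends on your data.

```python
import numpy as np

def np_rows_in_chunked(a, b, chunk_rows=1024):
    """Broadcast-compare `a` against `b` one block of `a` rows at a time."""
    a = np.asarray(a)
    b = np.asarray(b)
    out = np.empty(len(a), dtype=bool)
    for start in range(0, len(a), chunk_rows):
        block = a[start:start + chunk_rows]
        # The temporary has shape (len(b), len(block), n_columns)
        out[start:start + chunk_rows] = (block == b[:, None]).all(axis=-1).any(axis=0)
    return out

a = np.array([[1, 0, 5, 0],
              [1, 2, 3, 1000],
              [2, 1, 3, 2000],
              [0, 0, 1, 1]])
b = np.array([[1, 2, 3], [0, 0, 1]])

mask = np_rows_in_chunked(a[:, :3], b, chunk_rows=3)
print(mask)  # [False  True False  True]
```

This caps the temporary at len(b) * chunk_rows * n_columns booleans. The total number of comparisons is unchanged.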

As an alternative, you may use set membership in pure Python (without subsetting of columns; that can be done by the caller):

def py_rows_in(a, b):
    # Hash the rows of b once, then test each row of a against the set
    z = set(map(tuple, b))
    return [row in z for row in map(tuple, a)]

When b has more than 50~100 rows, this may be faster than the np version above, written here as a function:

def np_rows_in(a, b):
    return (a == b[:, None]).all(axis=-1).any(axis=0)

# Benchmark both versions with perfplot
import matplotlib.pyplot as plt
import numpy as np
import perfplot

fig, axes = plt.subplots(ncols=2, figsize=(16, 5))
plt.subplots_adjust(wspace=.5)
for ax, alen in zip(axes, [100, 10_000]):
    a = np.random.randint(0, 20, (alen, 4))
    plt.sca(ax)
    ax.set_title(f'a: {a.shape[0]:_} rows')
    perfplot.show(
        setup=lambda n: np.random.randint(0, 20, (n, 3)),
        kernels=[
            lambda b: np_rows_in(a[:, :3], b),
            lambda b: py_rows_in(a[:, :3], b),
        ],
        labels=['np_rows_in', 'py_rows_in'],
        n_range=[2 ** k for k in range(10)],
        xlabel='len(b)',
    )
plt.show()

[Figure: comparative performance of np_rows_in and py_rows_in]
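Before trusting a timing comparison, it is worth confirming that the two functions return the same answer. The check below is a minimal sketch on random data. The seed and array sizes are arbitrary choices, and both functions are repeated so that the block runs on its own.

```python
import numpy as np

def py_rows_in(a, b):
    # Set membership on row tuples
    z = set(map(tuple, b))
    return [row in z for row in map(tuple, a)]

def np_rows_in(a, b):
    # Broadcast every row of a against every row of b
    return (a == b[:, None]).all(axis=-1).any(axis=0)

rng = np.random.default_rng(0)
a = rng.integers(0, 20, (1000, 3))
b = rng.integers(0, 20, (200, 3))

# Both functions should flag exactly the same rows of a
agree = np.array_equal(np_rows_in(a, b), py_rows_in(a, b))
print(agree)  # True
```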

Pythonic way to test if a row is in an array

You can simply subtract your test row from the array, find the zero elements, and sum them along each row. The matches are the rows where that sum equals the number of columns.

For example:

In []: A = np.arange(12).reshape(4, 3)
In []: A
Out[]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In []: 3 == (0 == (A - [3, 4, 5])).sum(1)
Out[]: array([False,  True, False, False], dtype=bool)

Update: based on comments and other answers:

Paul's suggestion does indeed seem to streamline the code:

In []: ~np.all(A - [3, 4, 5], 1)
Out[]: array([False, True, False, False], dtype=bool)

JoshAdel's answer emphasizes, more generally, the problem of determining equality in a 100% reliable manner. So, obviously, my answer is valid only in situations where equality can be determined unambiguously.
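Assuming the reliability concern is floating-point round-off (an assumption on my part; the referenced answer is not reproduced here), a tolerance-based comparison is one possible workaround. The sketch below uses np.isclose with its default tolerances, which may or may not suit your data.

```python
import numpy as np

A = np.array([[0.1 + 0.2, 1.0],
              [0.5, 0.5]])
row = np.array([0.3, 1.0])

# Exact comparison misses the first row because 0.1 + 0.2 != 0.3 in binary floating point
exact = (A == row).all(axis=1)

# Comparing within a tolerance recovers the match
tolerant = np.isclose(A, row).all(axis=1)

print(exact)     # [False False]
print(tolerant)  # [ True False]
```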

Update 2: But as Emma figured out, there exist corner cases where Paul's solution will not produce correct results.
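One such corner case (not necessarily the one Emma had in mind) is a test row that agrees with a row of A in only some columns. ~np.all(...) becomes True as soon as a single difference is zero, so a partial match is reported as a full match. The sketch below contrasts it with a direct equality test.

```python
import numpy as np

A = np.arange(12).reshape(4, 3)
row = np.array([3, 0, 0])   # not a row of A, but shares its first entry with A[1]

# True wherever at least one difference is zero: too permissive
flagged = ~np.all(A - row, axis=1)

# True only where every entry matches
exact = (A == row).all(axis=1)

print(flagged)  # [False  True False False]
print(exact)    # [False False False False]
```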


