testing whether a Numpy array contains a given row
Numpys __contains__
is, at the time of writing this, (a == b).any()
which is arguably only correct if b
is a scalar (it is a bit hairy, but I believe – works like this only in 1.7. or later – this would be the right general method (a == b).all(np.arange(a.ndim - b.ndim, a.ndim)).any()
, which makes sense for all combinations of a
and b
dimensionality)...
EDIT: Just to be clear, this is not necessarily the expected result when broadcasting is involved. Also someone might argue that it should handle the items in a
separately as np.in1d
does. I am not sure there is one clear way it should work.
Now you want numpy to stop when it finds the first occurrence. This AFAIK does not exist at this time. It is difficult because numpy is based mostly on ufuncs, which do the same thing over the whole array.
Numpy does optimize these kind of reductions, but effectively that only works when the array being reduced is already a boolean array (i.e. np.ones(10, dtype=bool).any()
).
Otherwise it would need a special function for __contains__
which does not exist. That may seem odd, but you have to remember that numpy supports many data types and has a bigger machinery to select the correct ones and select the correct function to work on it. So in other words, the ufunc machinery cannot do it, and implementing __contains__
or such specially is not actually that trivial because of data types.
You can of course write it in python, or since you probably know your data type, writing it yourself in Cython/C is very simple.
That said. Often it is much better anyway to use sorting based approach for these things. That is a little tedious as well as there is no such thing as searchsorted
for a lexsort
, but it works (you could also abuse scipy.spatial.cKDTree
if you like). This assumes you want to compare along the last axis only:
# Unfortunatly you need to use structured arrays:
sorted = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[-1]).ravel()
# Actually at this point, you can also use np.in1d, if you already have many b
# then that is even better.
sorted.sort()
b_comp = np.ascontiguousarray(b).view(sorted.dtype)
ind = sorted.searchsorted(b_comp)
result = sorted[ind] == b_comp
This works also for an array b
, and if you keep the sorted array around, is also much better if you do it for a single value (row) in b
at a time, when a
stays the same (otherwise I would just np.in1d
after viewing it as a recarray). Important: you must do the np.ascontiguousarray
for safety. It will typically do nothing, but if it does, it would be a big potential bug otherwise.
How do you Check if each Row of a Numpy Array is Contained in a Secondary Array?
Broadcasting is an option:
import numpy as np
array = np.array([[1, 2, 4], [3, 5, 1], [5, 5, 1], [1, 2, 1]])
check_array = np.array([[1, 2, 4], [1, 2, 1]])
is_in_check = (check_array[:, None] == array).all(axis=2).any(axis=0)
Produces:
[ True False False True]
Broadcasting the other way:
is_in_check = (check_array == array[:, None]).all(axis=2).any(axis=1)
Also Produces
[ True False False True]
np.isin - testing whether a Numpy array contains a given row considering the order
For small b
(less than 100 rows), try this instead:
a[(a[:, :3] == b[:, None]).all(axis=-1).any(axis=0)]
Example:
a = np.array([[1, 0, 5, 0],
[1, 2, 3, 1000],
[2, 1, 3, 2000],
[0, 0, 1, 1]])
b = np.array([[1, 2, 3], [0, 0, 1]])
>>> a[(a[:, :3] == b[:, None]).all(axis=-1).any(axis=0), 3]
array([1000, 1])
Explanation:
The key is to "distribute" equality tests for all rows of a
(the first 3 columns) to all rows of b
:
# on the example above
>>> a[:, :3] == b[:, None]
array([[[ True, False, False],
[ True, True, True], # <-- a[1,:3] matches b[0]
[False, False, True],
[False, False, False]],
[[False, True, False],
[False, False, False],
[False, False, False],
[ True, True, True]]]) # <-- a[3, :3] matches b[1]
Be warned that this can be large: the shape is (len(b), len(a), 3)
.
Then the first .all(axis=-1)
means that we want all entire rows to match:
>>> (a[:, :3] == b[:, None]).all(axis=-1)
array([[False, True, False, False],
[False, False, False, True]])
The final bit .any(axis=0)
means: "match any row in b
":
>>> (a[:, :3] == b[:, None]).all(axis=-1).any(axis=0)
array([False, True, False, True])
I.e.: "a[2, :3]
matches some row(s) of b
as well as a[3, :3]
".
Finally, use this as a mask in a
and take the column 3.
Note on performance
The technique above distributes equality for the product of the rows of a over the rows of b. This can be slow and use a large amount of memory if both a
and b
have many rows.
As an alternative, you may use set
membership in pure Python (without subsetting of columns --that can be done by the caller):
def py_rows_in(a, b):
z = set(map(tuple, b))
return [row in z for row in map(tuple, a)]
When b
has more than 50~100 rows, then this may be faster, compared to the np
version above, written here as a function:
def np_rows_in(a, b):
return (a == b[:, None]).all(axis=-1).any(axis=0)
import perfplot
fig, axes = plt.subplots(ncols=2, figsize=(16, 5))
plt.subplots_adjust(wspace=.5)
for ax, alen in zip(axes, [100, 10_000]):
a = np.random.randint(0, 20, (alen, 4))
plt.sca(ax)
ax.set_title(f'a: {a.shape[0]:_} rows')
perfplot.show(
setup=lambda n: np.random.randint(0, 20, (n, 3)),
kernels=[
lambda b: np_rows_in(a[:, :3], b),
lambda b: py_rows_in(a[:, :3], b),
],
labels=['np_rows_in', 'py_rows_in'],
n_range=[2 ** k for k in range(10)],
xlabel='len(b)',
)
plt.show()
Pythonic way to test if a row is in an array
You can just simply subtract your test row from the array. Then find out the zero elements, and sum over column wise. Then those are matches where the sum equals the number of columns.
For example:
In []: A= arange(12).reshape(4, 3)
In []: A
Out[]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
In []: 3== (0== (A- [3, 4, 5])).sum(1)
Out[]: array([False, True, False, False], dtype=bool)
Update: based on comments and other answers:Paul
's suggestion seems indeed to be able to streamline code:
In []: ~np.all(A- [3, 4, 5], 1)
Out[]: array([False, True, False, False], dtype=bool)
JoshAdel
's answer emphasis more generally the problem related to determine 100% reliable manner the equality. So, obviously my answer is valid only in the situations where equality can be determined unambiguous manner.
Update 2: But as Emma
figured it out, there exists corner cases where Paul
's solution will not produce correct results.
Related Topics
Create a .CSV File with Values from a Python List
Pandas: Rolling Mean by Time Interval
Matplotlib: Format Axis Offset-Values to Whole Numbers or Specific Number
Python Glob Multiple Filetypes
Regular Expression Matching a Multiline Block of Text
Typeerror: Unhashable Type: 'Dict'
How to Construct a Timedelta Object from a Simple String
How to Get the Input from the Tkinter Text Widget
How to Remove Non-Ascii Characters But Leave Periods and Spaces
Python: Access Class Property from String
Opencv Giving Wrong Color to Colored Images on Loading
What Is the Fastest Way to Flatten Arbitrarily Nested Lists in Python
How to Activate an Anaconda Environment
Quick and Easy File Dialog in Python
Appending the Same String to a List of Strings in Python
Too Many Values to Unpack', Iterating Over a Dict. Key=>String, Value=>List
How to One-Hot-Encode from a Pandas Column Containing a List