Find Matching Rows in 2 Dimensional Numpy Array

Using np.where to find matching row in 2D array

You probably wanted all rows that are equal to your arr2:

>>> np.where(np.all(arr1 == arr2, axis=1))
(array([0], dtype=int64),)

This means that the first row (index 0) matched.
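The question's arrays aren't reproduced here, so below is a minimal self-contained sketch. The values in arr1 and arr2 are my assumption, chosen only to be consistent with the outputs shown in this answer (every row starts with 3., but only row 0 ends with 0.):

```python
import numpy as np

# Assumed sample data (not taken from the original question).
arr1 = np.array([[3., 0.], [3., 1.], [3., 2.], [3., 3.], [3., 4.], [3., 5.]])
arr2 = np.array([3., 0.])

# Compare every row with arr2, then keep rows where all columns match.
row_matches = np.all(arr1 == arr2, axis=1)
matching_rows = np.where(row_matches)[0]
print(matching_rows)  # [0]
```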


The problem with your approach is that numpy broadcasts the arrays (visualized with np.broadcast_arrays):

>>> arr1_tmp, arr2_tmp = np.broadcast_arrays(arr1, arr2)
>>> arr2_tmp
array([[ 3.,  0.],
       [ 3.,  0.],
       [ 3.,  0.],
       [ 3.,  0.],
       [ 3.,  0.],
       [ 3.,  0.]])

and then does elementwise-comparison:

>>> arr1 == arr2
array([[ True,  True],
       [ True, False],
       [ True, False],
       [ True, False],
       [ True, False],
       [ True, False]], dtype=bool)

and np.where then gives you the coordinates of every True:

>>> np.where(arr1 == arr2)
(array([0, 0, 1, 2, 3, 4, 5], dtype=int64),
 array([0, 1, 0, 0, 0, 0, 0], dtype=int64))
#       ^---- first match (0, 0)
#          ^--- second match (0, 1)
#             ^--- third match (1, 0)
#  ...

This means (0, 0) (first row, left item) is the first True, then (0, 1) (first row, right item), then (1, 0) (second row, left item), and so on.
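If it helps to see the coordinates as pairs rather than as two separate index arrays, here is a small sketch. The arr1 and arr2 values are assumed sample data, picked to be consistent with the outputs in this answer:

```python
import numpy as np

# Assumed sample data (not taken from the original question).
arr1 = np.array([[3., 0.], [3., 1.], [3., 2.], [3., 3.], [3., 4.], [3., 5.]])
arr2 = np.array([3., 0.])

# np.where returns one index array per dimension; zip pairs them up.
rows, cols = np.where(arr1 == arr2)
coords = list(zip(rows.tolist(), cols.tolist()))
print(coords)  # [(0, 0), (0, 1), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]

# np.argwhere gives the same pairs directly as a 2-D array.
pairs = np.argwhere(arr1 == arr2)
```

Either form makes it obvious that these are element matches, not row matches.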


If you use np.all along axis 1 (that is, across the columns of each row), you get all rows that are completely equal:

>>> np.all(arr1 == arr2, axis=1)
array([ True, False, False, False, False, False], dtype=bool)

This is easier to visualize if you keep the dimensions:

>>> np.all(arr1 == arr2, axis=1, keepdims=True)
array([[ True],
       [False],
       [False],
       [False],
       [False],
       [False]], dtype=bool)
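Since the arrays here hold floats, exact equality can miss rows that differ only by rounding error. A hedged sketch of a tolerance-based variant using np.isclose follows; the data and the 1e-12 perturbation are assumptions made up for illustration:

```python
import numpy as np

# Assumed sample data (not taken from the original question).
arr1 = np.array([[3., 0.], [3., 1.], [3., 2.], [3., 3.], [3., 4.], [3., 5.]])
arr2 = np.array([3., 0.])

# Simulate a tiny rounding error in every element.
noisy = arr1 + 1e-12

# Exact comparison no longer finds any matching row ...
exact = np.all(noisy == arr2, axis=1)

# ... while np.isclose (default tolerances) still matches row 0.
close = np.all(np.isclose(noisy, arr2), axis=1)

# The boolean mask can also be used to pull out the matching rows.
matched = noisy[close]
```

Whether the default tolerances are appropriate depends on your data, so treat them as a starting point.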

Finding a matching row in a numpy matrix

You just needed to look for ALL matches along each row, like so -

np.where((a==(1,3)).all(axis=1))[0]

Steps involved using given sample -

In [17]: a # Input matrix
Out[17]:
matrix([[0, 2],
        [0, 0],
        [1, 3],
        [4, 6],
        [0, 7],
        [0, 3]])

In [18]: (a==(1,3)) # Matrix of broadcasted matches
Out[18]:
matrix([[False, False],
        [False, False],
        [ True,  True],
        [False, False],
        [False, False],
        [False,  True]], dtype=bool)

In [19]: (a==(1,3)).all(axis=1) # Look for ALL matches along each row
Out[19]:
matrix([[False],
        [False],
        [ True],
        [False],
        [False],
        [False]], dtype=bool)

In [20]: np.where((a==(1,3)).all(1))[0] # Use np.where to get row indices
Out[20]: array([2])
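The same steps work on a plain ndarray, which the NumPy docs recommend over np.matrix for new code. The only visible difference is that the row mask comes out 1-D. A small sketch using the same sample values as above (using an ndarray is my substitution, not what the question had):

```python
import numpy as np

# Same values as the sample matrix above, but as a regular 2-D array.
a = np.array([[0, 2], [0, 0], [1, 3], [4, 6], [0, 7], [0, 3]])

# With an ndarray, .all(axis=1) drops the reduced axis: shape (6,), not (6, 1).
row_mask = (a == (1, 3)).all(axis=1)
print(row_mask.shape)  # (6,)

# flatnonzero returns the indices of the True entries.
row_idx = np.flatnonzero(row_mask)
print(row_idx)  # [2]
```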

Finding indexes of matching rows in two numpy arrays

np.flatnonzero((x == y).all(1))
# array([0, 2])

or:

np.nonzero((x == y).all(1))[0]

or:

np.where((x == y).all(1))[0]
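For a quick check that the three spellings agree, here is a sketch with made-up x and y (the original inputs aren't shown here, so these values are assumptions chosen to give the same [0, 2] result):

```python
import numpy as np

# Hypothetical inputs of equal shape; rows 0 and 2 are identical.
x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([[1, 2], [9, 9], [5, 6]])

mask = (x == y).all(1)        # True where row i of x equals row i of y

i1 = np.flatnonzero(mask)     # array([0, 2])
i2 = np.nonzero(mask)[0]      # same indices
i3 = np.where(mask)[0]        # same indices
```

Note that this compares rows at the same position only; it does not search for a row of x anywhere in y.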

Match rows of two 2D arrays and get a row indices map using numpy

Approach #1

Here's one based on views. It makes use of np.argwhere to return the indices of elements that meet a condition, in this case, membership -

import numpy as np

def view1D(a, b): # a, b are arrays
    # View each row as one void element so whole rows compare as scalars
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

def argwhere_nd(a,b):
    A,B = view1D(a,b)
    return np.argwhere(A[:,None] == B)

Approach #2

Here's another that relies on sorting and binary search instead of comparing every pair of rows, so it runs in roughly O(n log n) and is much better on performance, especially on large arrays -

def argwhere_nd_searchsorted(a,b):
    A,B = view1D(a,b)
    sidxB = B.argsort()
    mask = np.isin(A,B)
    cm = A[mask]
    idx0 = np.flatnonzero(mask)
    idx1 = sidxB[np.searchsorted(B,cm, sorter=sidxB)]
    return idx0, idx1 # idx0 : indices in A, idx1 : indices in B

Approach #3

Another sort-based one, again roughly O(n log n), using argsort() -

def argwhere_nd_argsort(a,b):
    A,B = view1D(a,b)
    c = np.r_[A,B]
    idx = np.argsort(c,kind='mergesort')
    cs = c[idx]
    m0 = cs[:-1] == cs[1:]
    return idx[:-1][m0],idx[1:][m0]-len(A)

Sample runs with same inputs as earlier -

In [650]: argwhere_nd_searchsorted(a,b)
Out[650]: (array([0, 1]), array([2, 0]))

In [651]: argwhere_nd_argsort(a,b)
Out[651]: (array([0, 1]), array([2, 0]))
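The inputs for these runs aren't reproduced here. As a sanity check, the sketch below uses hypothetical a and b (my assumption, chosen so that a[0] equals b[2] and a[1] equals b[0]) and finds the same index pairs with a plain pairwise broadcast comparison. This is a cross-check, not one of the approaches above, and it needs O(n*m) memory:

```python
import numpy as np

# Hypothetical inputs: a[0] == b[2] and a[1] == b[0].
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[4, 5, 6], [7, 8, 9], [1, 2, 3]])

# Compare every row of a with every row of b: result has shape (len(a), len(b)).
pairwise = (a[:, None, :] == b[None, :, :]).all(axis=-1)

# Row indices into a and the matching row indices into b.
idx0, idx1 = np.nonzero(pairwise)
print(idx0, idx1)  # [0 1] [2 0]
```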

Find indexes of matching rows in two 2-D arrays

This is an all-NumPy solution, not that it is necessarily better than an iterative Python one. It still has to look at all combinations.

In [53]: np.array(np.all((x[:,None,:]==y[None,:,:]),axis=-1).nonzero()).T.tolist()
Out[53]: [[0, 4], [2, 1], [3, 2], [4, 3]]

The intermediate array has shape (5,5,4). The np.all reduces it to:

array([[False, False, False, False,  True],
       [False, False, False, False, False],
       [False,  True, False, False, False],
       [False, False,  True, False, False],
       [False, False, False,  True, False]], dtype=bool)

The rest is just extracting the indices where this is True.

In crude tests, this times at 47.8 us; the other answer with the L1 dictionary at 38.3 us; and a third with a double loop at 496 us.
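The dictionary answer referred to above isn't reproduced here, so the following is only my sketch of what a dictionary-based lookup might look like, next to the broadcast version for comparison. The x and y values are assumptions, built so that the result matches the pairs shown earlier:

```python
import numpy as np

# Hypothetical 5x4 inputs: y holds rows of x in a different order,
# except y[0], which matches nothing; x[1] has no partner in y.
x = np.arange(20).reshape(5, 4)
y = x[[1, 2, 3, 4, 0]].copy()
y[0] = 99

# Dictionary approach: map each row of y (as a tuple) to its index.
# If y contains duplicate rows, only the last index is kept.
lookup = {tuple(row): j for j, row in enumerate(y.tolist())}
pairs = [[i, lookup[tuple(row)]]
         for i, row in enumerate(x.tolist())
         if tuple(row) in lookup]

# Broadcast approach from this answer, for comparison.
broadcast_pairs = np.argwhere(
    (x[:, None, :] == y[None, :, :]).all(axis=-1)
).tolist()

print(pairs)  # [[0, 4], [2, 1], [3, 2], [4, 3]]
```

The dictionary version does one hash lookup per row instead of comparing all combinations, which is presumably why it can keep up with the NumPy one on small inputs.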

How to retrieve rows matching a criteria from the multi dimensional numpy array?

If you're dealing with CSV files and tabular data handling, I'd recommend using Pandas.

Here's very briefly how that would work in your case (df is the conventional variable name for a Pandas DataFrame).

import pandas as pd

df = pd.read_csv('datafile.csv')
print(df)

results in the output

            code                      filename     value1   value2 yesno  anothervalue  yetanothervalue
0  LIMS_AY60_51X  AY60_51X_61536153d7cdc55.png  857.61389  291.227    NO       728.322          865.442
1  LIMS_AY60_52X  AY60_52X_615f6r53d7cdc55.png  867.61389  292.227    NO       728.322          865.442
2  LIMS_AY60_53X  AY60_53X_615ft153d7cdc55.png  877.61389  293.227    NO       728.322          865.442
3  LIMS_AY60_54X  AY60_54X_615u6153d7cdc55.png  818.61389  294.227    NO       728.322          865.442
4  LIMS_AY60_55X  AY60_55X_615f615od7cdc55.png  847.61389  295.227    NO       728.322          865.442

Note that the very first column is called the index. It is not in the CSV file, but automatically added by Pandas. You can ignore it here.
The column names were made up by me; usually, the first row of the CSV file holds the column names. If your file has no header row, pass header=None to read_csv and Pandas will number the columns 0, 1, 2, etc. (names like "Unnamed: 0" show up only when a header cell is empty).

Then, for the actual selection, you do

df['filename'] == 'AY60_52X_615f6r53d7cdc55.png'

which results in

0    False
1     True
2    False
3    False
4    False
Name: filename, dtype: bool

which is a one-dimensional structure called a Series. Again, it has an index column, but more importantly, the second column shows for which rows the comparison is true.

You can assign the result to a variable instead, and use that variable to access the rows that have True, as follows:

selection = df['filename'] == 'AY60_52X_615f6r53d7cdc55.png'
print(df[selection])

which yields

            code                      filename     value1   value2 yesno  anothervalue  yetanothervalue
1  LIMS_AY60_52X  AY60_52X_615f6r53d7cdc55.png  867.61389  292.227    NO       728.322          865.442

Note that in this case, Pandas is smart enough to figure out whether you want to access a particular column (df['filename']) or a selection of rows (df[selection]). More complicated ways of accessing a dataframe are possible, but you'll have to read up on that.

You can merge some things together, and with the reading of the CSV file, it's just two lines:

df = pd.read_csv('datafile.csv')
df[ df['filename'] == 'AY60_52X_615f6r53d7cdc55.png' ]

which I think is a bit nicer than using purely NumPy. Essentially, use NumPy only when you are really dealing with (multi-dimensional) array data, not when dealing with records / tabular structured data, as in your case. (Side note: under the hood, Pandas uses a lot of NumPy, so for a simple comparison like this the speed should be comparable; it's largely a nicer interface with some extra functionality.)
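If you want to try this without a file on disk, here is a self-contained sketch. It reads a shortened, in-memory copy of the table via io.StringIO (the in-memory CSV and the reduced column set are my assumptions for illustration) and uses df.loc to select matching rows and a subset of columns in one step:

```python
import io

import pandas as pd

# In-memory stand-in for datafile.csv, with only three of the columns.
csv_text = """code,filename,value1
LIMS_AY60_51X,AY60_51X_61536153d7cdc55.png,857.61389
LIMS_AY60_52X,AY60_52X_615f6r53d7cdc55.png,867.61389
LIMS_AY60_53X,AY60_53X_615ft153d7cdc55.png,877.61389
"""
df = pd.read_csv(io.StringIO(csv_text))

# Boolean Series: True for rows whose filename matches.
selection = df['filename'] == 'AY60_52X_615f6r53d7cdc55.png'

# .loc takes the row mask and a list of column labels together.
matched = df.loc[selection, ['code', 'value1']]
print(matched)
```

df.loc is one of the "more complicated ways" of accessing a dataframe mentioned earlier; plain df[selection] is enough when you want all columns.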

Compare a N x 2 2D array row wise with 1 x 2 array

np.all(arr == [255, 0], axis=1)

Output

array([False, False, False, False, False, False,  True, False, False,
       False, False, False, False,  True, False, False])
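The mask can then be turned into row indices or a count. A short sketch with a made-up arr (the question's 16-row array isn't shown here, so these values are assumptions):

```python
import numpy as np

# Hypothetical N x 2 array; rows 1 and 3 equal [255, 0].
arr = np.array([[0, 0], [255, 0], [255, 255], [255, 0]])

# True for each row whose two entries both match the target.
mask = np.all(arr == [255, 0], axis=1)

idx = np.flatnonzero(mask)        # indices of matching rows
count = int(np.count_nonzero(mask))  # number of matching rows
print(idx, count)  # [1 3] 2
```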

