Using np.where to find matching row in 2D array
You probably wanted all rows that are equal to your arr2:
>>> np.where(np.all(arr1 == arr2, axis=1))
(array([0], dtype=int64),)
Which means that the first row (zeroth index) matched.
The problem with your approach is that numpy broadcasts the arrays (visualized with np.broadcast_arrays):
>>> arr1_tmp, arr2_tmp = np.broadcast_arrays(arr1, arr2)
>>> arr2_tmp
array([[ 3., 0.],
[ 3., 0.],
[ 3., 0.],
[ 3., 0.],
[ 3., 0.],
[ 3., 0.]])
and then does elementwise-comparison:
>>> arr1 == arr2
array([[ True, True],
[ True, False],
[ True, False],
[ True, False],
[ True, False],
[ True, False]], dtype=bool)
and np.where then gives you the coordinates of every True:
>>> np.where(arr1 == arr2)
(array([0, 0, 1, 2, 3, 4, 5], dtype=int64),
array([0, 1, 0, 0, 0, 0, 0], dtype=int64))
# Reading both arrays position by position gives the coordinates:
#   (0, 0)  first match
#   (0, 1)  second match
#   (1, 0)  third match
#   ...
Which means (0, 0) (first row, left item) is the first True, then (0, 1) (first row, right item), then (1, 0) (second row, left item), and so on.
If you use np.all along axis 1 (that is, across the columns of each row) you get all rows that are completely equal:
>>> np.all(arr1 == arr2, axis=1)
array([ True, False, False, False, False, False], dtype=bool)
This is easier to see if you keep the dimensions:
>>> np.all(arr1 == arr2, axis=1, keepdims=True)
array([[ True],
[False],
[False],
[False],
[False],
[False]], dtype=bool)
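To try the steps above end to end, here is a minimal sketch. The values of arr1 and arr2 are assumptions: the original arrays are not shown here, so these were picked only to be consistent with the outputs above (first column always 3, second column 0 only in row 0).

```python
import numpy as np

# Assumed sample data, chosen to be consistent with the outputs shown above.
arr1 = np.array([[3., 0.], [3., 1.], [3., 2.], [3., 3.], [3., 4.], [3., 5.]])
arr2 = np.array([3., 0.])

elementwise = arr1 == arr2               # shape (6, 2): arr2 is broadcast against every row
row_matches = elementwise.all(axis=1)    # shape (6,): True only where the whole row matches
row_indices = np.where(row_matches)[0]   # indices of the fully matching rows

print(row_indices.tolist())  # [0]
```

Reducing with all(axis=1) before calling np.where is what turns "coordinates of every matching element" into "indices of every matching row".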
Finding a matching row in a numpy matrix
You just needed to look for ALL matches along each row, like so -
np.where((a==(1,3)).all(axis=1))[0]
Steps involved using given sample -
In [17]: a # Input matrix
Out[17]:
matrix([[0, 2],
[0, 0],
[1, 3],
[4, 6],
[0, 7],
[0, 3]])
In [18]: (a==(1,3)) # Matrix of broadcasted matches
Out[18]:
matrix([[False, False],
[False, False],
[ True, True],
[False, False],
[False, False],
[False, True]], dtype=bool)
In [19]: (a==(1,3)).all(axis=1) # Look for ALL matches along each row
Out[19]:
matrix([[False],
[False],
[ True],
[False],
[False],
[False]], dtype=bool)
In [20]: np.where((a==(1,3)).all(1))[0] # Use np.where to get row indices
Out[20]: array([2])
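The session above uses np.matrix, which is why the result of .all(axis=1) stays two-dimensional. As a rough sketch, the same lookup on a plain ndarray with the same values could be wrapped in a small helper. The name find_row is hypothetical and not part of NumPy.

```python
import numpy as np

def find_row(a, row):
    """Return the indices of the rows of the 2D array `a` that equal `row`."""
    a = np.asarray(a)
    # Broadcast `row` against every row of `a`, then require a match in all columns.
    return np.flatnonzero((a == np.asarray(row)).all(axis=1))

a = np.array([[0, 2], [0, 0], [1, 3], [4, 6], [0, 7], [0, 3]])

print(find_row(a, (1, 3)).tolist())  # [2]
print(find_row(a, (9, 9)).tolist())  # []
```

An empty result means no row matched, so callers should check the size before indexing into it.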
Finding indexes of matching rows in two numpy arrays
np.flatnonzero((x == y).all(1))
# array([0, 2])
or:
np.nonzero((x == y).all(1))[0]
or:
np.where((x == y).all(1))[0]
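The arrays x and y are not shown here. The sketch below assumes two arrays of the same shape that are compared row by row at the same position. The values are hypothetical, picked so that rows 0 and 2 agree, in line with the output comment above.

```python
import numpy as np

# Hypothetical inputs of identical shape; rows 0 and 2 agree, row 1 differs.
x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([[1, 2], [0, 4], [5, 6]])

mask = (x == y).all(axis=1)   # one boolean per row position

# The three spellings return the same indices for a 1D boolean mask.
via_flatnonzero = np.flatnonzero(mask)
via_nonzero = np.nonzero(mask)[0]
via_where = np.where(mask)[0]

print(via_flatnonzero.tolist())  # [0, 2]
```

Note that this only compares row i of x with row i of y; it does not search for a row of x anywhere in y.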
Match rows of two 2D arrays and get a row indices map using numpy
Approach #1
Here's one based on views. It makes use of np.argwhere to return the indices of the elements that meet a condition, in this case, membership -
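import numpy as np

def view1D(a, b):  # a, b are 2D arrays with the same dtype and number of columns
    # Make both arrays C-contiguous so each row is one contiguous block of bytes
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    # View every row as a single opaque (void) element spanning the whole row
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

def argwhere_nd(a, b):
    A, B = view1D(a, b)
    # Compare every row of a with every row of b; each result row is [index in a, index in b]
    return np.argwhere(A[:, None] == B)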
Approach #2
Here's another that would be roughly O(n log n), since it sorts and binary-searches instead of comparing every pair of rows, and hence much better on performance, especially on large arrays -
def argwhere_nd_searchsorted(a, b):
    A, B = view1D(a, b)
    sidxB = B.argsort()
    mask = np.isin(A, B)
    cm = A[mask]
    idx0 = np.flatnonzero(mask)
    idx1 = sidxB[np.searchsorted(B, cm, sorter=sidxB)]
    return idx0, idx1  # idx0 : indices in A, idx1 : indices in B
Approach #3
Another roughly O(n log n) one using argsort() -
def argwhere_nd_argsort(a, b):
    A, B = view1D(a, b)
    c = np.r_[A, B]
    idx = np.argsort(c, kind='mergesort')
    cs = c[idx]
    m0 = cs[:-1] == cs[1:]
    return idx[:-1][m0], idx[1:][m0] - len(A)
Sample runs with same inputs as earlier -
In [650]: argwhere_nd_searchsorted(a,b)
Out[650]: (array([0, 1]), array([2, 0]))
In [651]: argwhere_nd_argsort(a,b)
Out[651]: (array([0, 1]), array([2, 0]))
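The inputs a and b used for these sample runs are not included here. One way to sanity-check such results is a plain-Python dictionary lookup, sketched below. This is not one of the approaches above: the function name match_rows_dict and the sample arrays are hypothetical, and it assumes the rows of b are unique.

```python
import numpy as np

def match_rows_dict(a, b):
    """Map each row of `a` to the index of an equal row in `b` via a dict lookup.

    Assumes the rows of `b` are unique; with duplicates, the last one wins.
    """
    lookup = {tuple(row): j for j, row in enumerate(b.tolist())}
    idx_a, idx_b = [], []
    for i, row in enumerate(a.tolist()):
        j = lookup.get(tuple(row))
        if j is not None:
            idx_a.append(i)
            idx_b.append(j)
    return np.array(idx_a, dtype=int), np.array(idx_b, dtype=int)

# Hypothetical inputs: a[0] equals b[2], a[1] equals b[0], a[2] has no match.
a = np.array([[4, 2], [9, 3], [5, 5]])
b = np.array([[9, 3], [7, 7], [4, 2]])

idx_a, idx_b = match_rows_dict(a, b)
print(idx_a.tolist(), idx_b.tolist())  # [0, 1] [2, 0]
```

The loop runs in Python, so it will likely be slower than the vectorized versions on large inputs, but it is easy to read and useful as a reference result.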
Find indexes of matching rows in two 2-D arrays
This is an all-numpy solution, not that it is necessarily better than an iterative Python one. It still has to look at all combinations.
In [53]: np.array(np.all((x[:,None,:]==y[None,:,:]),axis=-1).nonzero()).T.tolist()
Out[53]: [[0, 4], [2, 1], [3, 2], [4, 3]]
The intermediate array has shape (5,5,4). np.all reduces it to:
array([[False, False, False, False, True],
[False, False, False, False, False],
[False, True, False, False, False],
[False, False, True, False, False],
[False, False, False, True, False]], dtype=bool)
The rest is just extracting the indices where this is True.
In crude tests, this times at 47.8 us; the other answer with the L1 dictionary at 38.3 us; and a third with a double loop at 496 us.
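The shapes involved are easier to follow on a small case. The sketch below uses hypothetical x and y (3 and 4 rows, 2 columns each), not the arrays from the question, and splits the one-liner into named steps.

```python
import numpy as np

# Hypothetical inputs: 3 rows in x, 4 rows in y, 2 columns each.
x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([[5, 6], [0, 0], [1, 2], [7, 8]])

pairwise = x[:, None, :] == y[None, :, :]   # shape (3, 4, 2): every x row vs every y row
row_equal = pairwise.all(axis=-1)           # shape (3, 4): True where whole rows agree
pairs = np.argwhere(row_equal).tolist()     # [x_index, y_index] for each match

print(pairwise.shape)  # (3, 4, 2)
print(pairs)           # [[0, 2], [2, 0]]
```

The intermediate array has len(x) * len(y) * ncols elements, so memory use grows quickly with the number of rows.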
How to retrieve rows matching a criteria from the multi dimensional numpy array?
If you're dealing with CSV files and tabular data handling, I'd recommend using Pandas.
Here's very briefly how that would work in your case (df is the usual variable name for a Pandas DataFrame, hence df).
import pandas as pd

df = pd.read_csv('datafile.csv')
print(df)
results in the output
code filename value1 value2 yesno anothervalue yetanothervalue
0 LIMS_AY60_51X AY60_51X_61536153d7cdc55.png 857.61389 291.227 NO 728.322 865.442
1 LIMS_AY60_52X AY60_52X_615f6r53d7cdc55.png 867.61389 292.227 NO 728.322 865.442
2 LIMS_AY60_53X AY60_53X_615ft153d7cdc55.png 877.61389 293.227 NO 728.322 865.442
3 LIMS_AY60_54X AY60_54X_615u6153d7cdc55.png 818.61389 294.227 NO 728.322 865.442
4 LIMS_AY60_55X AY60_55X_615f615od7cdc55.png 847.61389 295.227 NO 728.322 865.442
Note that the very first column is called the index. It is not in the CSV file, but automatically added by Pandas. You can ignore it here.
The column names are made up by me; usually, the first row of the CSV file holds the column names. If it does not, pass header=None and Pandas will number the columns 0, 1, 2, etc. (names like "Unnamed: 0" show up when a header cell is blank).
Then, for the actual selection, you do
df['filename'] == 'AY60_52X_615f6r53d7cdc55.png'
which results in
0 False
1 True
2 False
3 False
4 False
Name: filename, dtype: bool
which is a one-dimensional dataframe, called a Series. Again, it has an index column, but more importantly, the second column shows for which row the comparison is true.
You can assign the result to a variable instead, and use that variable to access the rows that have True, as follows:
selection = df['filename'] == 'AY60_52X_615f6r53d7cdc55.png'
print(df[selection])
which yields
code filename value1 value2 yesno anothervalue yetanothervalue
1 LIMS_AY60_52X AY60_52X_615f6r53d7cdc55.png 867.61389 292.227 NO 728.322 865.442
Note that in this case, Pandas is smart enough to figure out whether you want to access a particular column (df['filename']) or a selection of rows (df[selection]). More complicated ways of accessing a dataframe are possible, but you'll have to read up on that.
You can merge some things together, and with the reading of the CSV file, it's just two lines:
df = pd.read_csv('datafile.csv')
df[ df['filename'] == 'AY60_52X_615f6r53d7cdc55.png' ]
which I think is a bit nicer than using purely NumPy. Essentially, use NumPy only when you are really dealing with (multi-dimensional) array data, not when dealing with records or tabular structured data, as in your case. (Side note: under the hood, Pandas uses a lot of NumPy, so the speed is usually comparable; it's largely a nicer interface with some extra functionality.)
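For a version that runs without datafile.csv on disk, here is a small sketch that reads two of the rows shown earlier from an in-memory string. Using io.StringIO and trimming the table to five columns are assumptions made only to keep the example self-contained.

```python
import io
import pandas as pd

# Two rows taken from the table above, held in memory instead of datafile.csv.
csv_text = """code,filename,value1,value2,yesno
LIMS_AY60_51X,AY60_51X_61536153d7cdc55.png,857.61389,291.227,NO
LIMS_AY60_52X,AY60_52X_615f6r53d7cdc55.png,867.61389,292.227,NO
"""

df = pd.read_csv(io.StringIO(csv_text))
selection = df["filename"] == "AY60_52X_615f6r53d7cdc55.png"   # boolean Series, one entry per row

matching_rows = df.loc[selection]        # explicit row selection with the boolean mask
print(matching_rows["code"].tolist())    # ['LIMS_AY60_52X']
```

df.loc[selection] selects the same rows as df[selection]; the .loc form just makes it explicit that rows are being selected.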
Compare a N x 2 2D array row wise with 1 x 2 array
np.all(arr == [255, 0], axis=1)
Output
array([False, False, False, False, False, False, True, False, False,
False, False, False, False, True, False, False])
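The array arr is not shown here. As a small sketch with a hypothetical N x 2 array, the boolean mask can be turned into row indices, a count, or the matching rows themselves:

```python
import numpy as np

# Hypothetical N x 2 array; rows 1 and 3 equal [255, 0].
arr = np.array([[0, 0], [255, 0], [255, 255], [255, 0]])
target = np.array([[255, 0]])           # 1 x 2, broadcast against every row of arr

mask = np.all(arr == target, axis=1)    # one boolean per row
indices = np.flatnonzero(mask)          # positions of the matching rows
matching = arr[mask]                    # the matching rows themselves

print(mask.tolist())     # [False, True, False, True]
print(indices.tolist())  # [1, 3]
print(int(mask.sum()))   # 2
```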