Check for Identical Rows in Different Numpy Arrays


Here's a vectorised solution:

res = (a[:, None] == b).all(-1).any(-1)

print(res)

array([ True,  True, False,  True])

Note that a[:, None] == b compares each row of a with every row of b element-wise. We then use all followed by any to check, for each sub-array, whether any row comparison is all True:

print(a[:, None] == b)

[[[ True  True]
  [False  True]
  [False False]]

 [[False  True]
  [ True  True]
  [False False]]

 [[False False]
  [False False]
  [False False]]

 [[False False]
  [False False]
  [ True  True]]]
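The question's arrays a and b are not shown above; here is a minimal runnable sketch with example values (assumed, chosen so they reproduce the output and the comparison table above):

```python
import numpy as np

# Example arrays (assumed): rows 0, 1 and 3 of a also occur in b.
a = np.array([[1, 4], [3, 4], [0, 0], [5, 6]])
b = np.array([[1, 4], [3, 4], [5, 6]])

# a[:, None] has shape (4, 1, 2); comparing with b (3, 2) broadcasts to (4, 3, 2).
res = (a[:, None] == b).all(-1).any(-1)
print(res)  # [ True  True False  True]
```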

How to find identical rows of two arrays with different size?

Let's try broadcasting:

(a[None,:] == b[:,None]).all(-1).any(0)

Output:

array([False,  True, False, False, False,  True, False])
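A runnable sketch with assumed arrays (seven rows in a, two of which also appear in b) that reproduces the output above:

```python
import numpy as np

# Example arrays (assumed): a[1] and a[5] also occur in b.
a = np.array([[0, 0], [1, 2], [3, 3], [4, 4], [5, 5], [6, 7], [8, 8]])
b = np.array([[1, 2], [6, 7], [9, 9]])

# Broadcast to shape (len(b), len(a), 2), reduce the columns with all,
# then reduce over b's rows with any -> one flag per row of a.
res = (a[None, :] == b[:, None]).all(-1).any(0)
print(res)  # [False  True False False False  True False]
```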

How can I find identical rows in different arrays regardless of element order using Numpy?

Here is one way to do it:

import numpy as np

A = np.random.randint(0, 5, (8, 3))
B = np.random.randint(0, 5, (2, 2))

C = (A[..., np.newaxis, np.newaxis] == B)
rows = np.where(C.any((3, 1)).all(1))[0]
print(rows)

Output (for one particular random draw; your result will vary):

[0 2 3 4]
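Since A and B above are random, here is a deterministic sketch (arrays assumed) showing what the expression selects: rows of A that share at least one element with every row of B:

```python
import numpy as np

# Assumed example: rows 0 and 2 share an element with each row of B.
A = np.array([[1, 2, 3],
              [0, 0, 0],
              [2, 4, 7],
              [5, 5, 5]])
B = np.array([[1, 2],
              [3, 4]])

# C has shape (4, 3, 2, 2): every element of A against every element of B.
C = (A[..., np.newaxis, np.newaxis] == B)
# any((3, 1)): does row i of A share any element with row k of B? -> (4, 2)
# all(1):  ... with every row of B?                               -> (4,)
rows = np.where(C.any((3, 1)).all(1))[0]
print(rows)  # [0 2]
```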

I want to check if there are any identical rows in a matrix

Approach 1: Using np.unique

  1. You can use np.unique over axis=0 to fetch unique rows.

  2. The return_counts=True will return the number of times each row repeats.

  3. Putting a condition c>1 and checking if any of the rows matches that condition with .any() will give you what you want.

def is_dup_simple(arr):
    u, c = np.unique(arr, axis=0, return_counts=True)
    return (c > 1).any()

print(is_dup_simple(A))
print(is_dup_simple(B))
True
False
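A quick self-contained check of the approach (the arrays A and B here are assumed examples, one with and one without a repeated row):

```python
import numpy as np

# Assumed examples: A has a repeated row, B does not.
A = np.array([[1, 2], [3, 4], [1, 2]])
B = np.array([[1, 2], [3, 4]])

def is_dup_simple(arr):
    # Count occurrences of each distinct row; any count > 1 means a duplicate.
    u, c = np.unique(arr, axis=0, return_counts=True)
    return (c > 1).any()

print(is_dup_simple(A))  # True
print(is_dup_simple(B))  # False
```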

Approach 2: Broadcasting

Here is how you can do this with broadcasting operations. It's a slightly longer way, but it lets you be very flexible with the approach (for example, finding duplicates between different arrays).

def is_dup(arr):
    mask = ~np.eye(arr.shape[0], dtype=bool)
    out = ((arr[None, :, :] == arr[:, None, :]).all(-1) & mask).any()
    return out

print(is_dup(A))
print(is_dup(B))
True
False

Step-by-step broadcasting table:

arr[None,:,:] -> 1 , 10, 7 (adding new first axis)
arr[:,None,:] -> 10, 1 , 7 (adding new second axis)
--------------------------
== -> 10, 10, 7 (compare elements row-wise)
all(-1) -> 10, 10 (compare rows x rows)
& mask -> 10, 10 (diagonals false)
any() -> 1 (reduce to single bool)
--------------------------

EXPLANATION

  1. From the 10,7 array (10 rows, 7 columns), you want to end up with a (10,10) boolean matrix which indicates, for each pair of rows, whether all 7 elements of one row match the corresponding elements of the other.

  2. The mask = ~np.eye(arr.shape[0], dtype=bool) is specifically a matrix of 10,10 shape with false values along the diagonal. The reason for this is, because you want to ignore comparing the row with itself. More about this later.

  3. Starting with the broadcasted boolean operation - (arr[None,:,:] == arr[:,None,:]). This results in a 10,10,7 boolean array which compares the elements of every row, with all the other rows (10 x 10 comparisons, 7 values matched).

  4. Now, with .all(-1) you reduce the last axis and get a 10,10 matrix which contains True, if all 7 elements match any other row, else false even if a single element is different.

  5. Next, as you realize, row 0 will always match row 0, so will row 1 match row 1. Therefore the diagonal will always be true in this matrix. For us to deduce if there are duplicate rows, we have to ignore the True values of the diagonal. This can be done by doing an & (and) operation between the mask (discussed above) and the 10,10 boolean array. The only change that happens because of this, is that diagonal elements become False instead of True.

  6. Finally, you can reduce the array to a single boolean by using .any() which will be True if even a single element in the new 10,10 matrix is True (which indicates that there is a row x that matches row y exactly AND row x is not the same as row y, thanks to the mask)
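The steps above can be traced on a small assumed array (4x3, with rows 0 and 3 identical):

```python
import numpy as np

# Assumed 4x3 example with rows 0 and 3 identical.
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [1, 2, 3]])

# (4, 4) pairwise row-equality matrix.
pairwise = (arr[None, :, :] == arr[:, None, :]).all(-1)
# Mask out the always-True diagonal (each row trivially equals itself).
mask = ~np.eye(arr.shape[0], dtype=bool)
print(pairwise & mask)
# [[False False False  True]
#  [False False False False]
#  [False False False False]
#  [ True False False False]]
print((pairwise & mask).any())  # True
```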

Fastest way to check if duplicates exist in a python list / numpy ndarray

Here are the four ways I thought of doing it.

TL;DR: if you expect very few (less than 1/1000) duplicates:

def contains_duplicates(X):
    return len(np.unique(X)) != len(X)

If you expect frequent (more than 1/1000) duplicates:

def contains_duplicates(X):
    seen = set()
    seen_add = seen.add
    for x in X:
        if x in seen or seen_add(x):
            return True
    return False

The first timed method below is an early-exit version of an answer that returns the unique values, and the second applies the same idea to another answer's approach.
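For reference, the two TL;DR helpers behave identically on the same input; here they are side by side, renamed for illustration so both can coexist:

```python
import numpy as np

def contains_duplicates_unique(X):
    # Fast when duplicates are rare: np.unique does one vectorised pass.
    return len(np.unique(X)) != len(X)

def contains_duplicates_set(X):
    # Fast when duplicates are common: exits at the first repeat.
    seen = set()
    seen_add = seen.add
    for x in X:
        if x in seen or seen_add(x):
            return True
    return False

print(contains_duplicates_unique([3, 1, 2, 1]))  # True
print(contains_duplicates_set([3, 1, 2]))        # False
```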

>>> import numpy as np
>>> X = np.random.normal(0, 1, [10000])
>>> def terhorst_early_exit(X):
...:     elems = set()
...:     for i in X:
...:         if i in elems:
...:             return True
...:         elems.add(i)
...:     return False
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 10.6 ms per loop
>>> def peterbe_early_exit(X):
...:     seen = set()
...:     seen_add = seen.add
...:     for x in X:
...:         if x in seen or seen_add(x):
...:             return True
...:     return False
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 9.35 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 4.54 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 967 µs per loop

Do things change if you start with an ordinary Python list, and not a numpy.ndarray?

>>> X = X.tolist()
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 9.34 ms per loop
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 8.07 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 3.09 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 1.83 ms per loop

Edit: what if we have a prior expectation of the number of duplicates?

The above comparison is functioning under the assumption that a) there are likely to be no duplicates, or b) we're more worried about the worst case than the average case.

>>> X = np.random.normal(0, 1, [10000])
>>> for n_duplicates in [1, 10, 100]:
...:     print("{} duplicates".format(n_duplicates))
...:     duplicate_idx = np.random.choice(len(X), n_duplicates, replace=False)
...:     X[duplicate_idx] = 0
...:     print("terhorst_early_exit")
...:     %timeit terhorst_early_exit(X)
...:     print("peterbe_early_exit")
...:     %timeit peterbe_early_exit(X)
...:     print("set length")
...:     %timeit len(set(X)) != len(X)
...:     print("numpy unique length")
...:     %timeit len(np.unique(X)) != len(X)
1 duplicates
terhorst_early_exit
100 loops, best of 3: 12.3 ms per loop
peterbe_early_exit
100 loops, best of 3: 9.55 ms per loop
set length
100 loops, best of 3: 4.71 ms per loop
numpy unique length
1000 loops, best of 3: 1.31 ms per loop
10 duplicates
terhorst_early_exit
1000 loops, best of 3: 1.81 ms per loop
peterbe_early_exit
1000 loops, best of 3: 1.47 ms per loop
set length
100 loops, best of 3: 5.44 ms per loop
numpy unique length
1000 loops, best of 3: 1.37 ms per loop
100 duplicates
terhorst_early_exit
10000 loops, best of 3: 111 µs per loop
peterbe_early_exit
10000 loops, best of 3: 99 µs per loop
set length
100 loops, best of 3: 5.16 ms per loop
numpy unique length
1000 loops, best of 3: 1.19 ms per loop

So if you expect very few duplicates, the numpy.unique function is the way to go. As the number of expected duplicates increases, the early exit methods dominate.

finding identical rows and columns in a numpy array

You can use np.array_equal():

for i in range(len(A)):  # generate index pairs
    for j in range(i + 1, len(A)):
        if np.array_equal(A[i], A[j]):            # compare rows
            if np.array_equal(A[:, i], A[:, j]):  # compare columns
                print(i, j)

or using combinations():

import itertools

for pair in itertools.combinations(range(len(A)), 2):
    # compare rows and columns in one condition
    if np.array_equal(A[pair[0]], A[pair[1]]) and np.array_equal(A[:, pair[0]], A[:, pair[1]]):
        print(pair)
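A deterministic sketch (matrix assumed) where rows 0 and 2 and columns 0 and 2 are both identical, so the pair (0, 2) is reported:

```python
import itertools
import numpy as np

# Assumed example: rows 0 and 2 match, and so do columns 0 and 2.
A = np.array([[1, 2, 1],
              [2, 3, 2],
              [1, 2, 1]])

for i, j in itertools.combinations(range(len(A)), 2):
    if np.array_equal(A[i], A[j]) and np.array_equal(A[:, i], A[:, j]):
        print(i, j)  # prints: 0 2
```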

Remove duplicate rows of a numpy array

You can use numpy's unique. Since you want unique rows, you first need to put them into tuples:

import numpy as np

data = np.array([[1,8,3,3,4],
[1,8,9,9,4],
[1,8,3,3,4]])

Just applying np.unique to the data array flattens it and returns the unique elements:

>>> np.unique(data)
array([1, 3, 4, 8, 9])

To keep whole rows instead, convert each row to a tuple, deduplicate through a set, and stack the result back into an array (note that a set does not preserve row order):

new_array = [tuple(row) for row in data]
uniques = np.vstack(list(set(new_array)))

which prints (row order may vary):

>>> uniques
array([[1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4]])

UPDATE

In newer NumPy versions (1.13 and later), you can simply use np.unique(data, axis=0), which returns the unique rows directly.
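With the data array from above, the modern one-liner looks like this:

```python
import numpy as np

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4]])

# axis=0 treats each row as one unit; rows come back lexicographically sorted.
uniques = np.unique(data, axis=0)
print(uniques)
# [[1 8 3 3 4]
#  [1 8 9 9 4]]
```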


