check for identical rows in different numpy arrays
Here's a vectorised solution, assuming a and b are 2-D arrays with the same number of columns:
res = (a[:, None] == b).all(-1).any(-1)
print(res)
array([ True, True, False, True])
Note that a[:, None] == b compares each row of a with every row of b element-wise. We then use all followed by any to deduce, for each sub-array, whether there is any row which is all True:
print(a[:, None] == b)
[[[ True True]
[False True]
[False False]]
[[False True]
[ True True]
[False False]]
[[False False]
[False False]
[False False]]
[[False False]
[False False]
[ True True]]]
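A minimal runnable sketch; the arrays a and b below are assumptions chosen to reproduce the pattern printed above:

```python
import numpy as np

# hypothetical inputs: 4 rows in a, 3 rows in b, 2 columns each
a = np.array([[1, 2], [3, 2], [7, 8], [5, 6]])
b = np.array([[1, 2], [3, 2], [5, 6]])

# broadcast a's rows against b's rows, then reduce:
# (4, 1, 2) == (3, 2) -> (4, 3, 2); all(-1): (4, 3); any(-1): (4,)
res = (a[:, None] == b).all(-1).any(-1)
print(res)  # [ True  True False  True]
```

Each entry of res says whether the corresponding row of a appears anywhere in b.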
How to find identical rows of two arrays with different size?
Let's try broadcasting:
(a[None,:] == b[:,None]).all(-1).any(0)
Output:
array([False, True, False, False, False, True, False])
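A self-contained sketch of the different-size case; a and b here are illustrative assumptions (7 and 2 rows respectively):

```python
import numpy as np

a = np.array([[0, 1], [2, 3], [4, 5], [6, 7],
              [8, 9], [10, 11], [12, 13]])
b = np.array([[2, 3], [10, 11]])

# (1, 7, 2) == (2, 1, 2) -> (2, 7, 2); all(-1): (2, 7); any(0): (7,)
# True where a row of a equals some row of b
res = (a[None, :] == b[:, None]).all(-1).any(0)
print(res)  # [False  True False False False  True False]
```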
How can I find identical rows in different arrays regardless of element order using Numpy?
Here is one way to do it:
import numpy as np
A = np.random.randint(0, 5, (8, 3))
B = np.random.randint(0, 5, (2, 2))
C = (A[..., np.newaxis, np.newaxis] == B)
rows = np.where(C.any((3,1)).all(1))[0]
print(rows)
Output (for one particular random draw of A and B):
[0 2 3 4]
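A deterministic sketch with fixed A and B (assumptions, replacing the random arrays above) showing what the expression actually selects: the rows of A that share at least one element with every row of B:

```python
import numpy as np

A = np.array([[0, 1, 2],
              [3, 4, 5],
              [1, 3, 0],
              [4, 0, 2]])
B = np.array([[0, 3],
              [2, 4]])

# C[a, k, i, j] is True when A[a, k] == B[i, j]; shape (4, 3, 2, 2)
C = (A[..., np.newaxis, np.newaxis] == B)
# any over axes (3, 1): does row a of A share an element with row i of B?
# all over axis 1: require this for every row of B
rows = np.where(C.any((3, 1)).all(1))[0]
print(rows)  # [0 1 3]
```

Row 2 of A ([1, 3, 0]) is excluded because it shares no element with B's second row [2, 4].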
I want to check if there are any identical rows in a matrix
Approach 1: Using np.unique
You can use np.unique over axis=0 to fetch unique rows. Passing return_counts=True also returns the number of times each row repeats. Putting a condition c > 1 and checking whether any row matches that condition with .any() gives you what you want.
def is_dup_simple(arr):
u, c = np.unique(arr, axis=0, return_counts=True)
return (c>1).any()
print(is_dup_simple(A))
print(is_dup_simple(B))
True
False
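For reference, hypothetical inputs that would produce the True / False outputs above (A has a repeated row, B does not):

```python
import numpy as np

def is_dup_simple(arr):
    # count how often each distinct row occurs; any count > 1 means a duplicate
    u, c = np.unique(arr, axis=0, return_counts=True)
    return (c > 1).any()

A = np.array([[1, 2], [3, 4], [1, 2]])   # row [1, 2] repeats
B = np.array([[1, 2], [3, 4]])           # all rows distinct

print(is_dup_simple(A))  # True
print(is_dup_simple(B))  # False
```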
Approach 2: Broadcasting
Here is how you can do this with broadcasting operations. It's a slightly longer way, but it lets you be very flexible with the approach (for example, finding duplicates between different arrays):
def is_dup(arr):
mask = ~np.eye(arr.shape[0], dtype=bool)
out = ((arr[None,:,:] == arr[:,None,:]).all(-1)&mask).any()
return out
print(is_dup(A))
print(is_dup(B))
True
False
Step by step broadcasting table -
arr[None,:,:] -> 1 , 10, 7 (adding new first axis)
arr[:,None,:] -> 10, 1 , 7 (adding new second axis)
--------------------------
== -> 10, 10, 7 (compare elements row-wise)
all(-1) -> 10, 10 (compare rows x rows)
& mask -> 10, 10 (diagonals false)
any() -> 1 (reduce to single bool)
--------------------------
EXPLANATION
From the (10, 7) array (10 rows, 7 columns), you want to match elements such that you end up with a (10, 10) boolean matrix indicating whether all 7 elements of a row match ANY of the other rows.
The mask = ~np.eye(arr.shape[0], dtype=bool) is a (10, 10) matrix with False values along the diagonal. The reason for this is that you want to ignore comparing a row with itself; more on this below.
Starting with the broadcasted boolean operation (arr[None,:,:] == arr[:,None,:]): this results in a (10, 10, 7) boolean array which compares the elements of every row with all the other rows (10 x 10 comparisons, 7 values matched each).
Now, with .all(-1) you reduce the last axis and get a (10, 10) matrix which contains True if all 7 elements match the other row, and False if even a single element differs.
Next, as you realise, row 0 will always match row 0, row 1 will match row 1, and so on; therefore the diagonal of this matrix is always True. To deduce whether there are duplicate rows, we have to ignore the True values on the diagonal. This is done with an & (and) operation between the mask (discussed above) and the (10, 10) boolean array; the only change this causes is that the diagonal elements become False instead of True.
Finally, you can reduce the array to a single boolean with .any(), which is True if even a single element in the new (10, 10) matrix is True (indicating that some row x matches row y exactly AND x is not the same row as y, thanks to the mask).
Fastest way to check if duplicates exist in a python list / numpy ndarray
Here are the four ways I thought of doing it.
TL;DR: if you expect very few (less than 1/1000) duplicates:
def contains_duplicates(X):
return len(np.unique(X)) != len(X)
If you expect frequent (more than 1/1000) duplicates:
def contains_duplicates(X):
seen = set()
seen_add = seen.add
for x in X:
if (x in seen or seen_add(x)):
return True
return False
The first method is an early-exit version of an approach that builds the full set of unique values; the second applies the same early-exit idea to a different recipe.
>>> import numpy as np
>>> X = np.random.normal(0,1,[10000])
>>> def terhorst_early_exit(X):
...: elems = set()
...: for i in X:
...: if i in elems:
...: return True
...: elems.add(i)
...: return False
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 10.6 ms per loop
>>> def peterbe_early_exit(X):
...: seen = set()
...: seen_add = seen.add
...: for x in X:
...: if (x in seen or seen_add(x)):
...: return True
...: return False
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 9.35 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 4.54 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 967 µs per loop
Do things change if you start with an ordinary Python list instead of a numpy.ndarray?
>>> X = X.tolist()
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 9.34 ms per loop
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 8.07 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 3.09 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 1.83 ms per loop
Edit: what if we have a prior expectation of the number of duplicates?
The above comparison is functioning under the assumption that a) there are likely to be no duplicates, or b) we're more worried about the worst case than the average case.
>>> X = np.random.normal(0, 1, [10000])
>>> for n_duplicates in [1, 10, 100]:
>>> print("{} duplicates".format(n_duplicates))
>>> duplicate_idx = np.random.choice(len(X), n_duplicates, replace=False)
>>> X[duplicate_idx] = 0
>>> print("terhorst_early_exit")
>>> %timeit terhorst_early_exit(X)
>>> print("peterbe_early_exit")
>>> %timeit peterbe_early_exit(X)
>>> print("set length")
>>> %timeit len(set(X)) != len(X)
>>> print("numpy unique length")
>>> %timeit len(np.unique(X)) != len(X)
1 duplicates
terhorst_early_exit
100 loops, best of 3: 12.3 ms per loop
peterbe_early_exit
100 loops, best of 3: 9.55 ms per loop
set length
100 loops, best of 3: 4.71 ms per loop
numpy unique length
1000 loops, best of 3: 1.31 ms per loop
10 duplicates
terhorst_early_exit
1000 loops, best of 3: 1.81 ms per loop
peterbe_early_exit
1000 loops, best of 3: 1.47 ms per loop
set length
100 loops, best of 3: 5.44 ms per loop
numpy unique length
1000 loops, best of 3: 1.37 ms per loop
100 duplicates
terhorst_early_exit
10000 loops, best of 3: 111 µs per loop
peterbe_early_exit
10000 loops, best of 3: 99 µs per loop
set length
100 loops, best of 3: 5.16 ms per loop
numpy unique length
1000 loops, best of 3: 1.19 ms per loop
So if you expect very few duplicates, the numpy.unique
function is the way to go. As the number of expected duplicates increases, the early exit methods dominate.
finding identical rows and columns in a numpy array
You can use np.array_equal():
for i in range(len(A)):                            # generate pairs
    for j in range(i + 1, len(A)):
        if np.array_equal(A[i], A[j]):             # compare rows
            if np.array_equal(A[:, i], A[:, j]):   # compare columns
                print(i, j)
or using combinations():
import itertools
for pair in itertools.combinations(range(len(A)), 2):
    # compare rows and columns
    if np.array_equal(A[pair[0]], A[pair[1]]) and np.array_equal(A[:, pair[0]], A[:, pair[1]]):
        print(pair)
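A runnable sketch with an assumed square matrix A whose rows 0 and 2 (and columns 0 and 2) coincide:

```python
import itertools
import numpy as np

A = np.array([[1, 2, 1],
              [2, 5, 2],
              [1, 2, 1]])

# collect the index pairs whose rows AND columns are identical
pairs = [pair for pair in itertools.combinations(range(len(A)), 2)
         if np.array_equal(A[pair[0]], A[pair[1]])           # rows equal
         and np.array_equal(A[:, pair[0]], A[:, pair[1]])]   # columns equal
print(pairs)  # [(0, 2)]
```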
Remove duplicate rows of a numpy array
You can use np.unique
. Since you want the unique rows, we need to put them into tuples:
import numpy as np
data = np.array([[1,8,3,3,4],
[1,8,9,9,4],
[1,8,3,3,4]])
Just applying np.unique to the data array will flatten it and return only the unique elements:
>>> uniques
array([1, 3, 4, 8, 9])
So putting the rows into tuples first results in:
new_array = [tuple(row) for row in data]
uniques = np.unique(new_array)
which prints:
>>> uniques
array([[1, 8, 3, 3, 4],
[1, 8, 9, 9, 4]])
UPDATE
In newer NumPy versions (1.13+) you can simply use np.unique(data, axis=0), which treats each row as a single item.
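A quick sketch of the axis=0 form on the data array from above:

```python
import numpy as np

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4]])

# axis=0 treats each row as one item; duplicate rows are dropped
uniques = np.unique(data, axis=0)
print(uniques)
# [[1 8 3 3 4]
#  [1 8 9 9 4]]
```

Note that np.unique also sorts the rows lexicographically.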