How to Get a List of All Indices of Repeated Elements in a Numpy Array

How to get a list of all indices of repeated elements in a numpy array

A vectorized solution with numpy, built on the magic of unique().

import numpy as np

# create a test array
records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

# creates an array of indices, sorted by unique element
idx_sort = np.argsort(records_array)

# sorts records array so all unique elements are together
sorted_records_array = records_array[idx_sort]

# returns the unique values, the index of the first occurrence of a value, and the count for each element
vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)

# splits the indices into separate arrays
res = np.split(idx_sort, idx_start[1:])

# filter them by size, keeping only the values and index groups occurring more than once
vals = vals[count > 1]
res = [arr for arr in res if arr.size > 1]  # note: filter() would return a lazy iterator in Python 3
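Putting the pieces together, the filtered values and index groups can be zipped into a value-to-indices mapping. A small usage sketch (kind="stable" is an addition here, so that the order of indices within each group is deterministic):

```python
import numpy as np

records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

# stable sort keeps equal elements in their original order
idx_sort = np.argsort(records_array, kind="stable")
sorted_records_array = records_array[idx_sort]
vals, idx_start, count = np.unique(sorted_records_array,
                                   return_counts=True, return_index=True)
res = np.split(idx_sort, idx_start[1:])

# map each repeated value to the array of its indices
repeated = {v: grp for v, grp, c in zip(vals, res, count) if c > 1}
# {1: array([0, 3, 4]), 2: array([1, 8]), 3: array([2, 5, 7])}
```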

The following code was the original answer; it requires a bit more memory, because it uses numpy broadcasting and calls unique twice:

records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
vals, inverse, count = np.unique(records_array, return_inverse=True,
                                 return_counts=True)

idx_vals_repeated = np.where(count > 1)[0]
vals_repeated = vals[idx_vals_repeated]

rows, cols = np.where(inverse == idx_vals_repeated[:, np.newaxis])
_, inverse_rows = np.unique(rows, return_index=True)
res = np.split(cols, inverse_rows[1:])

which yields, as expected, res = [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])].
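To see where the extra memory goes: the broadcasted comparison materializes an (n_repeated × n) boolean matrix before np.where collapses it. A minimal sketch of that step:

```python
import numpy as np

records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
vals, inverse, count = np.unique(records_array,
                                 return_inverse=True, return_counts=True)
idx_vals_repeated = np.where(count > 1)[0]

# one row per repeated value, one column per element of records_array
mask = inverse == idx_vals_repeated[:, np.newaxis]
print(mask.shape)  # (3, 9)

# np.where scans the mask in C-order, so cols come out grouped by row
rows, cols = np.where(mask)
print(cols)        # [0 3 4 1 8 2 5 7]
```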

Find indexes of repeated elements in an array (Python, NumPy)

Using np.diff and the method given here by @WarrenWeckesser for finding runs of zeros in an array:

import numpy as np

def zero_runs(a):  # from @WarrenWeckesser's linked answer
    # mark zero elements, padding both ends with 0 so edge runs get closed
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # starts and ends of runs come out in pairs
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges

a = [34,2,3,22,22,22,22,22,22,18,90,5,-55,-19,22,6,6,6,6,6,6,6,6,23,53,1,5,-42,82]

zero_runs(np.diff(a))
Out[87]:
array([[ 3,  8],
       [15, 22]], dtype=int32)

This can then be filtered on the difference between the start & end of the run:

runs = zero_runs(np.diff(a))

runs[runs[:, 1] - runs[:, 0] > 5]  # runs of 7 or more equal values, to illustrate the filter
Out[96]: array([[15, 22]], dtype=int32)
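Since each run [s, e] in the output spans indices s to e of np.diff(a), the run in a itself covers a[s] through a[e]. The repeated value and its indices can be recovered like so (a small sketch reusing the arrays above):

```python
import numpy as np

def zero_runs(a):  # from @WarrenWeckesser's linked answer
    # mark zero elements, padding both ends with 0 so edge runs get closed
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    return np.where(absdiff == 1)[0].reshape(-1, 2)

a = [34, 2, 3, 22, 22, 22, 22, 22, 22, 18, 90, 5, -55, -19, 22,
     6, 6, 6, 6, 6, 6, 6, 6, 23, 53, 1, 5, -42, 82]

runs = zero_runs(np.diff(a))

# a run [s, e] in diff(a) means a[s] through a[e] are all equal
repeats = [(a[s], list(range(s, e + 1))) for s, e in runs]
# repeats[0] == (22, [3, 4, 5, 6, 7, 8])
```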

Finding indices of duplicate items in Python

If you already have a numpy array, you can use np.unique with the return_inverse flag. The inverse array maps each position back to its unique element, so count[inverse] gives the count of each position's value; positions where that count exceeds 1 are the duplicates.

import numpy as np
arr = np.array([[10,10],[3,6],[2,4],[10,10],[0,0],[2,4]])
vals, inverse, count = np.unique(arr,
                                 return_inverse=True,
                                 return_counts=True,
                                 axis=0)
out = np.where(count[inverse] > 1)[0]  # find all indices where counts > 1
print(out) #array([0, 2, 3, 5], dtype=int64)
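The same trick extends to grouping the duplicate indices by the row they belong to, e.g. with a defaultdict. A small usage sketch:

```python
import numpy as np
from collections import defaultdict

arr = np.array([[10, 10], [3, 6], [2, 4], [10, 10], [0, 0], [2, 4]])
vals, inverse, count = np.unique(arr, return_inverse=True,
                                 return_counts=True, axis=0)
out = np.where(count[inverse] > 1)[0]

# group the duplicate indices by the row they belong to
groups = defaultdict(list)
for i in out:
    groups[tuple(int(v) for v in arr[i])].append(int(i))

print(dict(groups))  # {(10, 10): [0, 3], (2, 4): [2, 5]}
```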

Finding indices of non-unique elements in Numpy array

We want to find rows which are not duplicated in your array, while preserving the order.

I use this solution to combine each row of a into a single element, so that we can find the unique rows using np.unique(b, return_index=True, return_inverse=True). Then, I modified this function to output the counts of the unique rows using the index and inverse. From there, I can select all unique rows which have counts == 1.

import numpy as np

a = np.array([[1, 2, 3],
              [2, 3, 4],
              [1, 2, 3],
              [3, 2, 1],
              [3, 4, 5]])

# use a flexible data type, np.void, to combine the columns of `a`
# the size of np.void is the number of bytes per element of `a` multiplied by the number of columns
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, inv = np.unique(b, return_index=True, return_inverse=True)

def return_counts(index, inv):
    count = np.zeros(len(index), dtype=int)  # np.int was removed in NumPy 1.24; use int
    np.add.at(count, inv, 1)
    return count

counts = return_counts(index, inv)

# if you want the indices to discard, replace the condition with: counts[i] > 1
index_keep = [i for i, j in enumerate(index) if counts[i] == 1]

>>> a[index_keep]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

# if you don't need the indices and just want the array returned while preserving the order
a_unique = np.vstack([a[idx] for i, idx in enumerate(index) if counts[i] == 1])
>>> a_unique
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

For NumPy >= 1.9, return_counts makes the counting helper unnecessary:

b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, counts = np.unique(b, return_index=True, return_counts=True)

index_keep = [i for i, j in enumerate(index) if counts[i] == 1]
>>> a[index_keep]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
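On NumPy >= 1.13 the np.void view trick is no longer needed, since np.unique accepts an axis argument directly. A sketch of the same computation (np.sort on the kept indices preserves the original row order in the general case):

```python
import numpy as np

a = np.array([[1, 2, 3],
              [2, 3, 4],
              [1, 2, 3],
              [3, 2, 1],
              [3, 4, 5]])

# unique rows, their first-occurrence indices, and their counts
_, index, counts = np.unique(a, axis=0, return_index=True, return_counts=True)

# keep rows that occur exactly once, in their original order
index_keep = np.sort(index[counts == 1])
a_unique = a[index_keep]
# array([[2, 3, 4],
#        [3, 2, 1],
#        [3, 4, 5]])
```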

How to find all occurrences of an element in a list

You can use a list comprehension with enumerate:

indices = [i for i, x in enumerate(my_list) if x == "whatever"]

The iterator enumerate(my_list) yields pairs (index, item) for each item in the list. Using i, x as the loop variable target unpacks these pairs into the index i and the list item x. We then filter down to all x that match our criterion and select the indices i of those elements.
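For large inputs, a numpy equivalent (assuming the list's elements can be compared element-wise as an array) avoids the Python-level loop:

```python
import numpy as np

my_list = ["a", "whatever", "b", "whatever", "c"]

# vectorized comparison, then the indices of the True entries
indices = np.flatnonzero(np.asarray(my_list) == "whatever")
print(indices)  # [1 3]
```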

Declarative way to return all indices of matching elements for each element in numpy?

Your output is going to be a list (or worse, an array of objects), since your output is ragged in the general case. If you are OK with having each index corresponding to a repeated element point to the same underlying array, you can do something like the following.

The gist is to take a hint from np.unique, which is a sort-based operation. Instead of using np.unique, we can use np.argsort combined with some fancy index math to get the job done.

First, build an array whose elements are the index groups, e.g. [0, 1, 4] and [2, 3] for an input like arr = [1, 1, 2, 2, 1]. This will generally be a 1D array of objects. (If the groups all happen to be the same size, the split list is non-ragged and the elements recombine into a 2D array instead, but that does not affect the intended interface.)

idx = np.argsort(arr)
# boundaries between runs of equal values in the sorted order
split_points = np.flatnonzero(np.diff(arr[idx])) + 1
# dtype=object is required on NumPy >= 1.24 when the groups are ragged
elements = np.array(np.split(idx, split_points), dtype=object)

Next you can index elements to produce the full array of objects.

inverse_index = np.argsort(idx)
counts = np.diff(np.r_[0, split_points, len(arr)])
result = np.repeat(elements, counts, axis=0)[inverse_index]

result will be a 2D array if you have equal numbers of each unique element. You can choose to turn it into a list if you want.

Notice that the last part works because applying np.argsort twice yields the inverse permutation: the index that puts a sorted array back into its original unsorted order is the argsort of the argsort. So we've implemented most of the features of np.unique (inverse_index, counts) as intermediate results, which makes your specific application possible. To complete np.unique: the forward index (the first occurrence of each group) is idx[np.r_[0, split_points]], and the actual unique values are given by arr[idx][np.r_[0, split_points]].

You can shorten the code from 6 lines to about 3 without recomputing any of the necessary arrays more than once. At the same time, you can say goodbye to any semblance of legibility that was there before:

idx = np.argsort(arr)
split_points = np.flatnonzero(np.diff(arr[idx])) + 1
result = np.repeat(np.array(np.split(idx, split_points), dtype=object), np.diff(np.r_[0, split_points, len(arr)]), axis=0)[np.argsort(idx)]
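A runnable version with an assumed input arr = [1, 1, 2, 2, 1] (two additions here: kind="stable" makes the within-group order deterministic, and packing the splits into an object array explicitly keeps NumPy >= 1.24 from rejecting the ragged groups):

```python
import numpy as np

arr = np.array([1, 1, 2, 2, 1])

idx = np.argsort(arr, kind="stable")
split_points = np.flatnonzero(np.diff(arr[idx])) + 1

# pack the ragged groups into a 1-D object array explicitly
groups = np.split(idx, split_points)
elements = np.empty(len(groups), dtype=object)
elements[:] = groups

# repeat each group by its size, then undo the sort
counts = np.diff(np.r_[0, split_points, len(arr)])
result = np.repeat(elements, counts, axis=0)[np.argsort(idx)]

print([g.tolist() for g in result])
# [[0, 1, 4], [0, 1, 4], [2, 3], [2, 3], [0, 1, 4]]
```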

Handling of duplicate indices in NumPy assignments

In NumPy 1.9 and later this is, in general, not well defined.

The current implementation iterates over all (broadcast) fancy indices and the assignment array at the same time using separate iterators, and these iterators all use C-order. In other words: currently, yes, you can rely on it. If you want to know it more precisely: if you look at mapping.c in NumPy, which handles these things, you will see that it uses PyArray_ITER_NEXT, which is documented to iterate in C-order.

For the future I would paint the picture differently: I think it would be good to iterate all indices plus the assignment array together using the newer iterator. If that is done, the order could be left open for the iterator to decide the fastest way. With the order left to the iterator, it is hard to say what would happen, and you cannot be certain that your example still works (you can probably still rely on the 1-d case, but...).

So, as far as I can tell it works currently, but it is undocumented (for all I know), so if you actually think this should be guaranteed, you would need to lobby for it and ideally write some tests to make sure it stays guaranteed. Because I, at least, am tempted to say: if it makes things faster, there is no reason to ensure C-order, but of course maybe there is a good reason hidden somewhere...

The real question here is: Why do you want that anyway? ;)
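A quick illustration of what the current C-order iteration gives you (the last write wins for a repeated index), plus np.ufunc.at, which is the documented way to accumulate over repeated indices. As the answer stresses, the last-write-wins ordering is observed behavior, not a documented guarantee:

```python
import numpy as np

a = np.zeros(4)
a[np.array([0, 0, 2])] = [10, 20, 30]
print(a)  # [20.  0. 30.  0.] -- the later value, 20, wins at index 0 (currently)

# the documented way to accumulate over repeated indices
b = np.zeros(4)
np.add.at(b, np.array([0, 0, 2]), [10, 20, 30])
print(b)  # [30.  0. 30.  0.]
```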

Determining duplicate values in an array

I think this is most clearly done outside of numpy. You'll have to time it against your numpy solutions if you are concerned with speed.

>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]

Note: this is similar to Burhan Khalid's answer, but using items() without subscripting in the condition should be faster.
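If you also need the indices of each duplicate, not just the values, the same stdlib approach extends with enumerate and a defaultdict. A small sketch:

```python
import numpy as np
from collections import defaultdict

a = np.array([1, 2, 1, 3, 3, 3, 0])

# collect the indices of every value in one pass
positions = defaultdict(list)
for i, item in enumerate(a):
    positions[int(item)].append(i)

# keep only the values that appear more than once
dupes = {item: idxs for item, idxs in positions.items() if len(idxs) > 1}
print(dupes)  # {1: [0, 2], 3: [3, 4, 5]}
```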


