Find Indices of Duplicated Rows

Find indices of duplicate rows in pandas DataFrame

Use DataFrame.duplicated with keep=False to select all duplicated rows, then groupby by all columns and convert the index values of each group to a tuple; finally, convert the output Series to a list:

df = df[df.duplicated(keep=False)]

df = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()
print(df)
[(1, 6), (2, 4), (3, 5)]

If you also want to see the duplicated values:

df1 = (df.groupby(df.columns.tolist())
         .apply(lambda x: tuple(x.index))
         .reset_index(name='idx'))
print(df1)
   param_a  param_b  param_c     idx
0        0        0        0  (1, 6)
1        0        2        1  (2, 4)
2        2        1        1  (3, 5)
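
For reference, here is a hypothetical input consistent with the outputs above that makes both snippets reproducible; row 0 is an arbitrary non-duplicated row, since the original data was not shown:

import pandas as pd

# Hypothetical input consistent with the outputs above; row 0 is an
# arbitrary non-duplicated row (the original data was not shown).
df = pd.DataFrame({'param_a': [1, 0, 0, 2, 0, 2, 0],
                   'param_b': [3, 0, 2, 1, 2, 1, 0],
                   'param_c': [5, 0, 1, 1, 1, 1, 0]})

dupes = df[df.duplicated(keep=False)]
print(dupes.groupby(list(dupes)).apply(lambda x: tuple(x.index)).tolist())
# [(1, 6), (2, 4), (3, 5)]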

How to store index of duplicated rows in pandas dataframe?

Try:

df.reset_index().groupby(df.columns.tolist())["index"].agg(list).reset_index()

To get exactly what you want:

res = (df.reset_index()
         .groupby(df.columns.tolist())["index"].agg(list)
         .reset_index()
         .rename(columns={"index": "duplicated"}))
res.index = res["duplicated"].str[0].tolist()
res["duplicated"] = res["duplicated"].str[1:]

Outputs (dummy data):

# original df:
    a  b
a1  x  4
a2  y  3
b6  z  2
c7  x  4
d   x  4
x   y  3

# transformed one:
    a  b duplicated
a1  x  4    [c7, d]
a2  y  3        [x]
b6  z  2         []
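
For completeness, the dummy data shown above can be reconstructed like this, after which the pipeline yields the transformed frame:

import pandas as pd

# Reconstruction of the dummy data shown above.
df = pd.DataFrame({'a': ['x', 'y', 'z', 'x', 'x', 'y'],
                   'b': [4, 3, 2, 4, 4, 3]},
                  index=['a1', 'a2', 'b6', 'c7', 'd', 'x'])

res = (df.reset_index()
         .groupby(df.columns.tolist())["index"].agg(list)
         .reset_index()
         .rename(columns={"index": "duplicated"}))
res.index = res["duplicated"].str[0].tolist()
res["duplicated"] = res["duplicated"].str[1:]
print(res)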

Find indices of duplicated rows

Here's an example:

df <- data.frame(a = c(1,2,3,4,1,5,6,4,2,1))

duplicated(df) | duplicated(df, fromLast = TRUE)
#[1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE

How does it work?

The function duplicated(df) determines duplicate elements in the original data. The fromLast = TRUE indicates that "duplication should be considered from the reverse side". The two resulting logical vectors are combined using | since a TRUE in at least one of them indicates a duplicated value.
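
For readers coming from the pandas sections above: keep=False expresses the same both-directions logic in a single call, as a quick check with the same data shows:

import pandas as pd

# pandas equivalent of duplicated(df) | duplicated(df, fromLast = TRUE):
# keep=False marks every member of a duplicated group.
df = pd.DataFrame({'a': [1, 2, 3, 4, 1, 5, 6, 4, 2, 1]})
print(df.duplicated(keep=False).tolist())
# [True, True, False, True, True, False, False, True, True, True]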

Finding indices of duplicate items in Python

If you already have a numpy array, you can call np.unique with the return_inverse and return_counts flags. The inverse array maps each row back to its unique row; the positions where the count of that unique row exceeds 1 are the indices of the duplicated rows.

import numpy as np

arr = np.array([[10, 10], [3, 6], [2, 4], [10, 10], [0, 0], [2, 4]])
vals, inverse, count = np.unique(arr,
                                 return_inverse=True,
                                 return_counts=True,
                                 axis=0)
out = np.where(count[inverse] > 1)[0]  # find all indices where counts > 1
print(out)  # array([0, 2, 3, 5], dtype=int64)
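
If you also want the duplicate indices grouped per unique row, the same inverse array can be reused; a minimal sketch:

import numpy as np

arr = np.array([[10, 10], [3, 6], [2, 4], [10, 10], [0, 0], [2, 4]])
vals, inverse, count = np.unique(arr, return_inverse=True,
                                 return_counts=True, axis=0)

# For every unique row that occurs more than once, collect the
# original row indices that map to it.
groups = [np.flatnonzero(inverse == i) for i in np.flatnonzero(count > 1)]
print(groups)  # [array([2, 5]), array([0, 3])]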

Find indices of duplicated rows in Python ndarray

The numpy_indexed package has a lot of functionality to solve this type of problem efficiently.

For instance, this will find your unique images (numpy's builtin unique could not handle rows before it gained the axis argument used in the previous section):

import numpy_indexed as npi
unique_training_images = npi.unique(train)

Or if you want to find all the indices of each unique group, you can use:

indices = npi.group_by(train).split(np.arange(len(train)))

Note that, unlike the approach in your original post, these functions do not have quadratic time complexity, and they are fully vectorized, so in all likelihood they are a lot more efficient. Also, unlike pandas, numpy_indexed does not have a preferred data format and is fully nd-array capable, so acting on arrays with shape [n_images, 28, 28] 'just works'.
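
A self-contained version of the two snippets above might look like this; the train array here is just a random stand-in for the image stack from the question:

import numpy as np
import numpy_indexed as npi

# Random stand-in for the 'train' image stack from the question.
train = np.random.randint(0, 2, size=(100, 28, 28))

unique_training_images = npi.unique(train)

# Original row indices grouped per unique image; groups with more
# than one entry are the duplicates.
indices = npi.group_by(train).split(np.arange(len(train)))
duplicates = [idx for idx in indices if len(idx) > 1]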

Pandas: Get duplicated indexes

df.groupby(level=0).filter(lambda x: len(x) > 1)['type']

We added the filter method for exactly this kind of operation. You can also use masking and transform for equivalent results, but filter is faster and a little more readable too.

Important:

The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue, along with a related issue with transform on Series, was fixed for version 0.13, which should be released any day now.

Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try it on a Series with a nonunique index, it too will fail.

There is no good reason why filter and transform should not be applied to nonunique indexes; it was just poorly implemented at first.
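
Here is a sketch of the masking/transform workaround mentioned above; the frame is hypothetical, with a nonunique index and a 'type' column as in the question. On modern pandas, Index.duplicated is an even more direct route:

import pandas as pd

# Hypothetical frame with a nonunique index and a 'type' column.
df = pd.DataFrame({'type': ['a', 'b', 'c', 'd']},
                  index=['x', 'y', 'x', 'z'])

# transform-based masking: keep index labels that occur more than once
print(df[df.groupby(level=0)['type'].transform('size') > 1]['type'])

# modern alternative: Index.duplicated with keep=False
print(df[df.index.duplicated(keep=False)]['type'])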

How to find row indices where duplicated pairs exist

It seemed to me there was an error in your unique.pairs construction, so I offer this alternative:

unique.pairs <- unique(master[duplicated(master), c(1, 2)])

That is then used to construct the grouping vector you first asked for:

grps <- apply(master, 1, function(x) {
  if (any(duplicated(rbind(unique.pairs, x)))) {
    paste(x[1], x[2], sep = "->")
  } else {
    NA
  }
})
grps

#[1] NA NA "2->3" "2->3" NA NA "4->6" "4->6" NA NA

You can then use that vector to group the other items of interest:

> locs <- tapply(rownames(master), grps, function(x) paste(x, collapse = ","))
> as.data.frame(locs)
     locs
2->3  3,4
4->6  7,8
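
For comparison, the same grouping of row labels by duplicated pairs can be sketched in pandas; the data here is hypothetical, shaped to match the result above:

import pandas as pd

# Hypothetical 'master' with duplicated pairs at the same 1-based
# positions as in the R output (rows 3,4 and 7,8).
master = pd.DataFrame({'c1': [1, 1, 2, 2, 3, 3, 4, 4, 5, 6],
                       'c2': [1, 2, 3, 3, 4, 5, 6, 6, 7, 8]},
                      index=range(1, 11))

pairs = master[master.duplicated(['c1', 'c2'], keep=False)]
print(pairs.groupby(['c1', 'c2']).apply(lambda g: list(g.index)))
# c1  c2
# 2   3     [3, 4]
# 4   6     [7, 8]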

Find all indices of duplicates and write them in new columns

Here is one option with data.table. After grouping by 'string', get the sequence (seq_len(.N)) and the row index (.I), then dcast to 'wide' format and join with the original dataset on the 'string' column:

library(data.table)
dcast(setDT(DT)[, .(seq_len(.N), .I), string],
      string ~ paste0("match", V1))[DT, on = "string"]
#    string match1 match2 match3
# 1:      A      1      7     11
# 2:      B      2     NA     NA
# 3:      C      3      8     NA
# 4:      D      4     NA     NA
# 5:      E      5     NA     NA
# 6:      F      6      9     NA
# 7:      A      1      7     11
# 8:      C      3      8     NA
# 9:      F      6      9     NA
#10:      Z     10     NA     NA
#11:      A      1      7     11

Or another option would be to split the sequence of row indices by 'string', pad the shorter list elements with NA, and merge with the original dataset (using base R methods):

lst <- split(seq_len(nrow(DT)), DT$string)
merge(DT, do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))),
      by.x = "string", by.y = "row.names")

data

DT <- data.frame(string = c("A", "B", "C", "D", "E", "F", "A", "C", "F", "Z", "A"),
                 stringsAsFactors = FALSE)

C++ get indices of duplicating rows in 2D array

You could take advantage of the properties of a std::unordered_set.

A small helper class will further ease things up.

So, we can store the 2nd and 4th values in a class and use a comparison function to detect duplicates.

The std::unordered_set has, besides the data type, two additional template parameters:

  1. A functor for equality and
  2. a functor for calculating a hash function.

So we will add two functions to our class and make it a functor for both parameters at the same time. In the code below you will see:

std::unordered_set<Dupl, Dupl, Dupl> dupl{};

So, we use our class additionally as two functors.

The rest of the functionality is done by the std::unordered_set: its insert rejects an element when an equal one (same 2nd and 4th value) is already stored, so after the loop the set holds one representative per group of duplicated rows.

Please see below one of many potential solutions:

#include <iostream>
#include <unordered_set>
#include <vector>

struct Dupl {
    Dupl() {}
    Dupl(const size_t row, const std::vector<int>& data)
        : index(row), firstValue(data[2]), secondValue(data[4]) {}

    size_t index{};
    int firstValue{};
    int secondValue{};

    // Hash function. It must depend only on the fields used by the
    // comparison below; otherwise equal rows could land in different
    // buckets and would never be recognized as duplicates.
    std::size_t operator()(const Dupl& d) const noexcept {
        return d.firstValue + (d.secondValue << 8);
    }
    // Comparison: rows are duplicates if their 2nd and 4th values match
    bool operator()(const Dupl& lhs, const Dupl& rhs) const {
        return (lhs.firstValue == rhs.firstValue) and
               (lhs.secondValue == rhs.secondValue);
    }
};

std::vector<std::vector<int>> data{
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},         // Index 0
    {2, 3, 4, 5, 6, 7, 8, 9, 10, 11},        // Index 1
    {3, 4, 42, 6, 42, 8, 9, 10, 11, 12},     // Index 2 ***
    {4, 5, 6, 7, 8, 9, 10, 11, 12, 13},      // Index 3
    {5, 6, 42, 8, 42, 10, 11, 12, 13, 14},   // Index 4 ***
    {6, 7, 8, 9, 10, 11, 12, 13, 14, 15},    // Index 5
    {7, 8, 9, 10, 11, 12, 13, 14, 15, 16},   // Index 6
    {8, 9, 10, 11, 12, 13, 14, 15, 16, 17},  // Index 7
    {9, 10, 42, 12, 42, 14, 15, 16, 17, 18}, // Index 8 ***
    {10, 11, 12, 13, 14, 15, 16, 17, 18, 19} // Index 9
};

int main() {
    std::unordered_set<Dupl, Dupl, Dupl> dupl{};

    // Insert every row. For each group of duplicates only the first
    // inserted row survives, so the set ends up with the unique rows.
    for (size_t i{}; i < data.size(); ++i)
        dupl.insert({i, data[i]});

    // Show some debug output
    for (const Dupl& d : dupl) {
        std::cout << "\nIndex:\t " << d.index << "\t\tData: ";
        for (const int i : data[d.index]) std::cout << i << ' ';
    }
    std::cout << '\n';
}

