Find indices of duplicate rows in pandas DataFrame
Use DataFrame.duplicated with keep=False to keep all duplicated rows, then groupby on all columns, convert the index values of each group to a tuple, and finally convert the output Series to a list:
df = df[df.duplicated(keep=False)]
df = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()
print (df)
[(1, 6), (2, 4), (3, 5)]
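For reference, the answer never shows its input; a DataFrame that reproduces the output above could look like this (the row values are assumed, inferred from the grouped table shown further down, with an arbitrary unique row 0):

```python
import pandas as pd

# Hypothetical input: rows 1 & 6, 2 & 4, and 3 & 5 duplicate each other;
# row 0 is unique and is therefore dropped by duplicated(keep=False)
df = pd.DataFrame({'param_a': [9, 0, 0, 2, 0, 2, 0],
                   'param_b': [9, 0, 2, 1, 2, 1, 0],
                   'param_c': [9, 0, 1, 1, 1, 1, 0]})
dupes = df[df.duplicated(keep=False)]
out = dupes.groupby(list(dupes)).apply(lambda x: tuple(x.index)).tolist()
print(out)  # [(1, 6), (2, 4), (3, 5)]
```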
If you also want to see the duplicated values:
df1 = (df.groupby(df.columns.tolist())
.apply(lambda x: tuple(x.index))
.reset_index(name='idx'))
print (df1)
param_a param_b param_c idx
0 0 0 0 (1, 6)
1 0 2 1 (2, 4)
2 2 1 1 (3, 5)
How to store index of duplicated rows in pandas dataframe?
Try:
df.reset_index().groupby(df.columns.tolist())["index"].agg(list).reset_index()
To get exactly what you want:
res = (df.reset_index()
         .groupby(df.columns.tolist())["index"]
         .agg(list).reset_index()
         .rename(columns={"index": "duplicated"}))
res.index = res["duplicated"].str[0].tolist()
res["duplicated"] = res["duplicated"].str[1:]
Outputs (dummy data):
#original df:
a b
a1 x 4
a2 y 3
b6 z 2
c7 x 4
d x 4
x y 3
#transformed one:
a b duplicated
a1 x 4 [c7, d]
a2 y 3 [x]
b6 z 2 []
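Run end to end on the dummy data above, the whole transformation is (a sketch; the frame is reconstructed from the printed tables):

```python
import pandas as pd

# Reconstructed "original df" from the answer's output
df = pd.DataFrame({'a': ['x', 'y', 'z', 'x', 'x', 'y'],
                   'b': [4, 3, 2, 4, 4, 3]},
                  index=['a1', 'a2', 'b6', 'c7', 'd', 'x'])
res = (df.reset_index()
         .groupby(df.columns.tolist())["index"]
         .agg(list).reset_index()
         .rename(columns={"index": "duplicated"}))
res.index = res["duplicated"].str[0].tolist()  # first occurrence becomes the row label
res["duplicated"] = res["duplicated"].str[1:]  # the rest are its duplicates
print(res)
```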
Find indices of duplicated rows
Here's an example:
df <- data.frame(a = c(1,2,3,4,1,5,6,4,2,1))
duplicated(df) | duplicated(df, fromLast = TRUE)
#[1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
How does it work?
The function duplicated(df) determines duplicate elements in the original data. fromLast = TRUE indicates that "duplication should be considered from the reverse side". The two resulting logical vectors are combined with |, since a TRUE in at least one of them indicates a duplicated value.
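The same forward-plus-backward masking carries over to pandas, where it is equivalent to a single duplicated(keep=False) call (a sketch reusing the vector from the R example):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 1, 5, 6, 4, 2, 1]})
# forward pass marks later occurrences, backward pass marks earlier ones;
# OR-ing the two flags every member of each duplicate group
mask = df.duplicated() | df.duplicated(keep='last')
print(mask.tolist())
# [True, True, False, True, True, False, False, True, True, True]
```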
Finding indices of duplicate items in Python
If you already have a numpy array, you can use np.unique with the return_inverse and return_counts flags. Use the inverse array to find all positions where the count of the corresponding unique element exceeds 1, and take their indices.
import numpy as np
arr = np.array([[10,10],[3,6],[2,4],[10,10],[0,0],[2,4]])
vals, inverse, count = np.unique(arr,
return_inverse=True,
return_counts=True,
axis=0)
out = np.where(count[inverse] > 1)[0] #find all indices where counts > 1
print(out) #array([0, 2, 3, 5], dtype=int64)
Find indices of duplicated rows in python ndarray
The numpy_indexed package has a lot of functionality to solve these types of problems efficiently.
For instance, (unlike numpy's builtin unique) this will find your unique images:
import numpy_indexed as npi
unique_training_images = npi.unique(train)
Or if you want to find all the indices of each unique group, you can use:
indices = npi.group_by(train).split(np.arange(len(train)))
Note that these functions do not have the quadratic time complexity of the approach in your original post, and are fully vectorized, so they are in all likelihood a lot more efficient. Also, unlike pandas, numpy_indexed has no preferred data format and is fully nd-array capable, so acting on arrays with shape [n_images, 28, 28] 'just works'.
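If you would rather avoid the extra dependency, the group_by(...).split(...) call can be approximated with plain numpy (a sketch on a small 2-D array; images would first be reshaped to (n_images, 784)):

```python
import numpy as np

arr = np.array([[10, 10], [3, 6], [2, 4], [10, 10], [0, 0], [2, 4]])
# label every row with the id of its unique value
_, inverse = np.unique(arr, axis=0, return_inverse=True)
inverse = inverse.ravel()
# stable-sort the row indices by label, then cut at the label boundaries
order = np.argsort(inverse, kind='stable')
groups = np.split(order, np.flatnonzero(np.diff(inverse[order])) + 1)
print([g.tolist() for g in groups])  # [[4], [2, 5], [1], [0, 3]]
```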
Pandas: Get duplicated indexes
df.groupby(level=0).filter(lambda x: len(x) > 1)['type']
We added the filter method for exactly this kind of operation. You can also use masking and transform for equivalent results, but this is faster, and a little more readable too.
Important:
The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue -- and a related issue with transform on Series -- was fixed for version 0.13, which should be released any day now.
Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try it on a Series with a nonunique index, it too will fail.
There is no good reason why filter and transform should not be applied to nonunique indexes; it was just poorly implemented at first.
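For completeness, the transform workaround mentioned above can be sketched like this (hypothetical data; on any modern pandas both filter and transform handle nonunique indexes fine):

```python
import pandas as pd

df = pd.DataFrame({'type': [10, 100, 20, 30]}, index=['a', 'a', 'b', 'c'])
# broadcast each index label's group size back to its rows, then mask
dup = df[df.groupby(level=0)['type'].transform('size') > 1]['type']
print(dup)  # the two rows labelled 'a'
```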
How to find row indices where duplicated pairs exist
There seemed to be an error in your unique.pairs construction, so I offer this alternative:
unique.pairs <- unique( master[ duplicated(master) ,c(1,2)])
This then uses that to construct a vector of your first sort:
grps <- apply( master, 1, function(x) if ( any( duplicated( rbind(unique.pairs, x))) ) { paste(x[1], x[2], sep="->") } else { NA } )
grps
#[1] NA NA "2->3" "2->3" NA NA "4->6" "4->6" NA NA
You can then use that vector to group the other items of interest:
> locs <- tapply( rownames(master), grps, function(x) paste(x, collapse=",") )
> as.data.frame(locs)
locs
2->3 3,4
4->6 7,8
Find all indices of duplicates and write them in new columns
Here is one option with data.table. After grouping by 'string', get the sequence (seq_len(.N)) and the row index (.I), then dcast to 'wide' format and join with the original dataset on 'string':
library(data.table)
dcast(setDT(DT)[, .(seq_len(.N),.I), string],string ~ paste0("match", V1))[DT, on = "string"]
# string match1 match2 match3
# 1: A 1 7 11
# 2: B 2 NA NA
# 3: C 3 8 NA
# 4: D 4 NA NA
# 5: E 5 NA NA
# 6: F 6 9 NA
# 7: A 1 7 11
# 8: C 3 8 NA
# 9: F 6 9 NA
#10: Z 10 NA NA
#11: A 1 7 11
Or another option would be to split the sequence of rows by 'string', pad the shorter list elements with NA, and merge with the original dataset (using base R methods):
lst <- split(seq_len(nrow(DT)), DT$string)
merge(DT, do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))),
by.x = "string", by.y = "row.names")
data
DT<- data.frame(string=c("A","B","C","D","E","F","A","C",
"F","Z","A"), stringsAsFactors=FALSE)
C++ get indices of duplicating rows in 2D array
You could take advantage of the properties of a std::unordered_set.
A small helper class will further ease things up.
We can store the 2nd and 4th value of each row in a class and use a comparison function to detect duplicates.
The std::unordered_set has, besides the data type, 2 additional template parameters.
- A functor for equality and
- a functor for calculating a hash function.
So we will add 2 functions to our class and make it a functor for both parameters at the same time. In the code below you will see:
std::unordered_set<Dupl, Dupl, Dupl> dupl{};
So, we use our class additionally as 2 functors.
The rest of the functionality is done by the std::unordered_set.
Please see below one of many potential solutions:
#include <vector>
#include <unordered_set>
#include <iostream>
struct Dupl {
Dupl() {}
Dupl(const size_t row, const std::vector<int>& data) : index(row), firstValue(data[2]), secondValue(data[4]){};
size_t index{};
int firstValue{};
int secondValue{};
// Hash function: must depend only on the members compared for equality,
// otherwise equal elements can land in different buckets and go undetected
std::size_t operator()(const Dupl& d) const noexcept {
return d.firstValue + (d.secondValue << 8);
}
// Comparison
bool operator()(const Dupl& lhs, const Dupl& rhs) const {
return (lhs.firstValue == rhs.firstValue) and (lhs.secondValue == rhs.secondValue);
}
};
std::vector<std::vector<int>> data{
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, // Index 0
{2, 3, 4, 5, 6, 7, 8, 9, 10, 11}, // Index 1
{3, 4, 42, 6, 42, 8, 9, 10, 11, 12}, // Index 2 ***
{4, 5, 6, 7, 8, 9, 10, 11, 12, 13}, // Index 3
{5, 6, 42, 8, 42, 10, 11, 12, 13, 14}, // Index 4 ***
{6, 7, 8, 9, 10, 11, 12, 13, 14, 15}, // Index 5
{7, 8, 9, 10, 11, 12, 13, 14, 15, 16}, // Index 6
{8, 9, 10, 11, 12, 13, 14, 15, 16, 17}, // Index 7
{9, 10, 42, 12, 42, 14, 15, 16, 17, 18}, // Index 8 ***
{10, 11, 12, 13, 14, 15, 16, 17, 18, 19}, // Index 9
};
int main() {
std::unordered_set<Dupl, Dupl, Dupl> dupl{};
// Insert all rows; inserting a duplicate (same 2nd and 4th value) fails, so the set keeps only the unique rows
for (size_t i{}; i < data.size(); ++i)
dupl.insert({i, data[i]});
// Show some debug output
for (const Dupl& d : dupl) {
std::cout << "\nIndex:\t " << d.index << "\t\tData: ";
for (const int i : data[d.index]) std::cout << i << ' ';
}
}
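If you also want the indices of the duplicates themselves (the set above only keeps the unique rows), the same key-and-check idea is easy to mirror in Python, keying each row on elements 2 and 4 (data copied from the C++ example):

```python
# Same idea as the C++ code: key each row on elements 2 and 4 and record
# the indices whose key has been seen before
data = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    [3, 4, 42, 6, 42, 8, 9, 10, 11, 12],
    [4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
    [5, 6, 42, 8, 42, 10, 11, 12, 13, 14],
    [6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    [7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
    [8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
    [9, 10, 42, 12, 42, 14, 15, 16, 17, 18],
    [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
]
seen = {}   # key -> index of the first row with that key
dupes = []  # indices that repeat an earlier key
for i, row in enumerate(data):
    key = (row[2], row[4])
    if key in seen:
        dupes.append(i)
    else:
        seen[key] = i
print(dupes)  # [4, 8] -- rows 4 and 8 repeat row 2's (42, 42)
```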