Pair-Wise Duplicate Removal from Dataframe

One solution is to first sort each row of df:

for (i in 1:nrow(df))
{
  df[i, ] = sort(df[i, ])
}
df

a b
1 A A
2 A B
3 A B
4 B C
5 A B
6 A B
7 B C
8 B C

At that point, it's just a matter of removing the duplicated rows:

df = df[!duplicated(df),]
df
a b
1 A A
2 A B
4 B C

As thelatemail mentioned in the comments, your code actually keeps the duplicates. You need to use !duplicated to remove them.

How to remove pair duplication in pandas?

Use numpy.sort to sort the values within each row, then use duplicated to build a boolean mask:

df1 = pd.DataFrame(np.sort(df[['antecedent','descendant']], axis=1))

Or:

# slower solution
#df1 = df[['antecedent','descendant']].apply(frozenset, 1)

df = df[~df1.duplicated()]
print(df)
Id antecedent descendant
0 1 one two
2 3 two three
3 4 one three
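
For readers who want to run the approach end to end, here is a minimal sketch. The sample data is an assumption (the question's input is not reproduced here), chosen so that the row with Id 2 mirrors the row with Id 1; the names sorted_pairs, mask and result are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the second row ("two", "one") mirrors the first.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "antecedent": ["one", "two", "two", "one"],
    "descendant": ["two", "one", "three", "three"],
})

# Sort the pair within each row so that mirrored pairs become identical rows.
sorted_pairs = pd.DataFrame(
    np.sort(df[["antecedent", "descendant"]].to_numpy(), axis=1),
    index=df.index,
)

# duplicated() flags every repeat after the first occurrence.
mask = sorted_pairs.duplicated()
result = df[~mask]
print(result)
```

Passing index=df.index keeps the mask aligned with df even when df does not use the default 0..n-1 index.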

Python data frame drop duplicate rows based on pairwise columns

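Compare the underlying values of the two column pairs and keep only the rows where Col1 differs from Col3 and Col2 differs from Col4: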

out = df[np.all(df[['Col1', 'Col2']].values != df[['Col3', 'Col4']].values,1)]
Out[298]:
Col1 Col2 Col3 Col4
0 A B C D
3 A B B A
4 C D D C
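
A runnable sketch of the same comparison follows. The input frame is assumed (only the filtered output is shown above), with rows 1 and 2 made up so that they repeat their first pair in Col3 and Col4; left, right and keep are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; rows 1 and 2 repeat (Col1, Col2) in (Col3, Col4).
df = pd.DataFrame({
    "Col1": ["A", "A", "C", "A", "C"],
    "Col2": ["B", "B", "D", "B", "D"],
    "Col3": ["C", "A", "C", "B", "D"],
    "Col4": ["D", "B", "D", "A", "C"],
})

left = df[["Col1", "Col2"]].to_numpy()
right = df[["Col3", "Col4"]].to_numpy()

# True only where Col1 != Col3 and Col2 != Col4 in the same row.
keep = np.all(left != right, axis=1)
out = df[keep]
print(out)
```

Note that this mask also drops a row that matches in just one position (for example A B A C). If only rows whose whole pair matches should go, ~np.all(left == right, axis=1) may be the better mask.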

Remove duplicate column pairs, sort rows based on 2 columns

Instead of sorting the whole dataset, sort only the 'Var1' and 'Var2' values within each row, then use duplicated on those two columns to remove the duplicate rows:

testdata[1:2] <- t( apply(testdata[1:2], 1, sort) )
testdata[!duplicated(testdata[1:2]),]
# Var1 Var2 f g
#1 1 4 blue blue
#2 2 3 green green
#5 5 5 orange2 orange1

Pandas removing mirror pairs from dataframe

Sort each (A, B) pair with np.sort, then use drop_duplicates to find the index labels of the rows to keep:

df.loc[pd.DataFrame(np.sort(df[['A','B']],1),index=df.index).drop_duplicates(keep='first').index]
Out[316]:
A B C D E
0 a b 0.1 0.3 0.9
1 c d 0.2 0.4 0.5
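
The one-liner can also be wrapped in a small helper. This is only a sketch: drop_mirror_pairs is a hypothetical name, the sample frame is made up, and it filters with a boolean mask from duplicated rather than looking up index labels, which should also behave sensibly if the index contains repeated labels.

```python
import numpy as np
import pandas as pd


def drop_mirror_pairs(frame, cols, keep="first"):
    """Drop rows whose values in `cols` repeat another row in any order."""
    # Sort the key columns within each row; reuse the original index.
    keys = pd.DataFrame(np.sort(frame[cols].to_numpy(), axis=1), index=frame.index)
    return frame.loc[~keys.duplicated(keep=keep)]


# Hypothetical data: row 2 mirrors row 0, and row 3 mirrors row 1.
df = pd.DataFrame({
    "A": ["a", "c", "b", "d"],
    "B": ["b", "d", "a", "c"],
    "C": [0.1, 0.2, 0.1, 0.2],
})

first_kept = drop_mirror_pairs(df, ["A", "B"])
last_kept = drop_mirror_pairs(df, ["A", "B"], keep="last")
print(first_kept)
print(last_kept)
```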

How do I remove rows with duplicate values of columns in pandas data frame?

Use drop_duplicates, passing subset a list of the columns to check for duplicates and keep='first' to keep the first row of each duplicate group.

If dataframe is:

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
'Column2': ["'bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Then:

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
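
As a quick comparison, the sketch below runs the same frame through the three values that keep accepts in drop_duplicates. The variable names (subset, first, last, none) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": ["'cat'", "'toy'", "'cat'"],
    "Column2": ["'bat'", "'flower'", "'bat'"],
    "Column3": ["'xyz'", "'abc'", "'lmn'"],
})
subset = ["Column1", "Column2"]

# keep='first' retains the first row of each duplicate group.
first = df.drop_duplicates(subset=subset, keep="first")
# keep='last' retains the last row of each duplicate group.
last = df.drop_duplicates(subset=subset, keep="last")
# keep=False drops every row that has a duplicate.
none = df.drop_duplicates(subset=subset, keep=False)

print(list(first.index), list(last.index), list(none.index))
```

Which option fits depends on whether the remaining columns (here Column3) of the first or the last occurrence matter.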

