Pair-wise duplicate removal from dataframe
One solution is to first sort each row of df:
for (i in 1:nrow(df))
{
df[i, ] = sort(df[i, ])
}
df
a b
1 A A
2 A B
3 A B
4 B C
5 A B
6 A B
7 B C
8 B C
At that point it's just a matter of removing the duplicated elements:
df = df[!duplicated(df),]
df
a b
1 A A
2 A B
4 B C
As thelatemail mentioned in the comments, your code actually keeps the duplicates. You need to use !duplicated to remove them.
How to remove pair duplication in pandas?
Use numpy.sort to sort the values within each row, then duplicated to build a boolean mask:
df1 = pd.DataFrame(np.sort(df[['antecedent','descendant']], axis=1))
Or:
# slower solution
#df1 = df[['antecedent','descendant']].apply(frozenset, 1)
df = df[~df1.duplicated()]
print (df)
Id antecedent descendant
0 1 one two
2 3 two three
3 4 one three
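The approach above can be sketched end to end as follows. The sample data is made up for illustration, and the non-default index is an assumption chosen to show one detail: the mask is built with index=df.index so that it aligns with df even when the index is not 0..n-1.

```python
import numpy as np
import pandas as pd

# Made-up sample data; the index is deliberately not 0..n-1.
df = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4],
        "antecedent": ["one", "two", "two", "one"],
        "descendant": ["two", "one", "three", "three"],
    },
    index=[10, 11, 12, 13],
)

# Sort the two values within each row so that (one, two) and (two, one)
# become the same pair.
pairs = np.sort(df[["antecedent", "descendant"]].to_numpy(), axis=1)

# Reuse df.index so the boolean mask lines up with df row for row.
mask = pd.DataFrame(pairs, index=df.index).duplicated()

# Keep only the first occurrence of each unordered pair.
deduped = df[~mask]
print(deduped)
```

Without index=df.index the helper frame gets a fresh 0..n-1 index, which only matches df when df itself uses the default index.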
Python data frame drop duplicate rows based on pairwise columns
Let us try comparing the underlying values of the two column pairs:
out = df[np.all(df[['Col1', 'Col2']].values != df[['Col3', 'Col4']].values,1)]
Out[298]:
Col1 Col2 Col3 Col4
0 A B C D
3 A B B A
4 C D D C
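To make the condition concrete, here is a small sketch on made-up data. It contrasts the np.all form used above with an np.any variant; the variant is not from the original answer and is shown only as an assumption about what some readers may want instead.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame(
    {
        "Col1": ["A", "A", "A", "A"],
        "Col2": ["B", "B", "B", "B"],
        "Col3": ["C", "A", "A", "B"],
        "Col4": ["D", "B", "D", "A"],
    }
)

left = df[["Col1", "Col2"]].to_numpy()
right = df[["Col3", "Col4"]].to_numpy()

# Keep a row only if BOTH positions differ (the condition used above).
strict = df[np.all(left != right, axis=1)]

# Keep a row if AT LEAST ONE position differs, i.e. drop only rows where
# (Col1, Col2) repeats (Col3, Col4) exactly.
loose = df[np.any(left != right, axis=1)]

print(strict.index.tolist())
print(loose.index.tolist())
```

Note that the strict form also drops row 2 (A B / A D), where only the first position matches, while mirrored rows such as A B / B A survive both forms.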
Remove duplicate column pairs, sort rows based on 2 columns
Instead of sorting the whole dataset, sort only 'Var1' and 'Var2' within each row, and then use duplicated to remove the duplicate rows:
testdata[1:2] <- t( apply(testdata[1:2], 1, sort) )
testdata[!duplicated(testdata[1:2]),]
# Var1 Var2 f g
#1 1 4 blue blue
#2 2 3 green green
#5 5 5 orange2 orange1
Pandas removing mirror pairs from dataframe
Use np.sort + drop_duplicates:
df.loc[pd.DataFrame(np.sort(df[['A','B']],1),index=df.index).drop_duplicates(keep='first').index]
Out[316]:
A B C D E
0 a b 0.1 0.3 0.9
1 c d 0.2 0.4 0.5
How do I remove rows with duplicate values of columns in pandas data frame?
Use drop_duplicates with subset set to the list of columns to check for duplicates, and keep='first' to keep the first of each set of duplicates.
If the dataframe is:
df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
'Column2': ["'bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Then:
result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
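For comparison, a short sketch of the other values the keep parameter accepts, using the same data as above with the inner quotes dropped for brevity. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Column1": ["cat", "toy", "cat"],
        "Column2": ["bat", "flower", "bat"],
        "Column3": ["xyz", "abc", "lmn"],
    }
)
cols = ["Column1", "Column2"]

# keep='last' retains the last row of each duplicate group.
keep_last = df.drop_duplicates(subset=cols, keep="last")

# keep=False drops every row that has a duplicate, leaving only unique pairs.
keep_none = df.drop_duplicates(subset=cols, keep=False)

print(keep_last)
print(keep_none)
```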