Deleting reversed duplicates with R
mydf <- read.table(text="gene_x gene_y
AT1 AT2
AT3 AT4
AT1 AT2
AT1 AT3
AT2 AT1", header=TRUE, stringsAsFactors=FALSE)
Here's one strategy using apply, sort, paste, and duplicated:
mydf[!duplicated(apply(mydf,1,function(x) paste(sort(x),collapse=''))),]
gene_x gene_y
1 AT1 AT2
2 AT3 AT4
4 AT1 AT3
And here's a slightly different solution:
mydf[!duplicated(lapply(as.data.frame(t(mydf), stringsAsFactors=FALSE), sort)),]
gene_x gene_y
1 AT1 AT2
2 AT3 AT4
4 AT1 AT3
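One caveat with the paste-based key above: collapsing with an empty string can make two different pairs produce the same key. The sketch below uses a small made-up data frame (pairs, with hypothetical values chosen to collide) and assumes the character "|" never occurs in the data, so it can serve as a separator.

```r
# Hypothetical data: two different pairs that collapse to the same string.
pairs <- data.frame(x = c("a", "ab"), y = c("bc", "c"),
                    stringsAsFactors = FALSE)

# Key without a separator: both rows become "abc".
key_plain <- apply(pairs, 1, function(r) paste(sort(r), collapse = ""))

# Key with a separator assumed absent from the data: "a|bc" and "ab|c".
key_sep <- apply(pairs, 1, function(r) paste(sort(r), collapse = "|"))

kept_plain <- pairs[!duplicated(key_plain), ]  # wrongly drops the second row
kept_sep   <- pairs[!duplicated(key_sep), ]    # keeps both rows
```

The second solution above, which compares the sorted rows as list elements, does not build a string key and so is not affected by this.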
Remove both pairs of a duplicated case
One way to remove every instance of a duplicated row, rather than keeping the first copy, is to run duplicated twice: once in the original order and once with the rows reversed. duplicated flags only the second and later occurrences of a value, so the forward pass leaves the first copy of each group unflagged and the reverse pass leaves the last copy unflagged.
Combining the forward and reverse passes then removes all rows which contain duplicated data.
# first pass
s1 = !duplicated(df[,1:3])
# second pass on the data.frame with reversed order in each column
s2 = !duplicated(apply(df[,1:3], 2, rev))
# the second pass needs to be back-reversed to match the original df
df[s1 & rev(s2), ]
ID Feature From To
5 3 A 2015-01-01 2016-01-01
6 3 B 2015-01-01 2017-01-01
Or we can use a more elegant solution that @dalloliogm pointed out, and call duplicated with the argument fromLast = TRUE.
s2 = !duplicated(df[,1:3], fromLast = TRUE)
df[s1 & s2, ]
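Since the input data frame is not shown here, the following is a self-contained sketch on made-up data. The values in df are hypothetical, chosen so that rows 1 to 4 repeat on the first three columns and rows 5 and 6 do not. It also checks that the reversed pass and fromLast = TRUE give the same mask.

```r
# Hypothetical data: rows 1-2 and rows 3-4 are duplicates on ID, Feature, From.
df <- data.frame(
  ID      = c(1, 1, 2, 2, 3, 3),
  Feature = c("A", "A", "B", "B", "A", "B"),
  From    = as.Date(rep("2015-01-01", 6)),
  To      = as.Date(c("2016-01-01", "2017-01-01", "2016-01-01",
                      "2017-01-01", "2016-01-01", "2017-01-01")),
  stringsAsFactors = FALSE
)

# Forward pass: FALSE for the second and later copies.
s1 <- !duplicated(df[, 1:3])

# Reverse pass, written both ways.
s2_rev <- rev(!duplicated(apply(df[, 1:3], 2, rev)))
s2     <- !duplicated(df[, 1:3], fromLast = TRUE)

# Rows that are unflagged in both passes have no duplicate at all.
unique_only <- df[s1 & s2, ]
```

With this data, unique_only holds only rows 5 and 6, the two rows whose first three columns appear exactly once.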
Removing duplicate rows from data frame in R
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1) and group by pmin(A, B) and pmax(A, B). If a group has more than one row, we keep only its first row; otherwise we return the group unchanged.
library(data.table)
setDT(df1)[, if(.N >1) head(.SD, 1) else .SD ,.(A=pmin(A, B), B= pmax(A, B))]
# A B prob
#1: 1 2 0.1
#2: 1 3 0.2
#3: 1 4 0.3
#4: 2 3 0.1
#5: 2 4 0.4
Or we can just use duplicated on the pmax, pmin output to return a logical index and subset the data based on that.
setDT(df1)[!duplicated(cbind(pmax(A, B), pmin(A, B)))]
# A B prob
#1: 1 2 0.1
#2: 1 3 0.2
#3: 1 4 0.3
#4: 2 3 0.1
#5: 2 4 0.4
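The same pmin/pmax key also works in base R without data.table. This is a minimal sketch: the input df1 is not shown above, so the values below are hypothetical, with rows 6 to 8 being reversed copies of rows 1 to 3.

```r
# Hypothetical input: rows 6-8 repeat rows 1-3 with A and B swapped.
df1 <- data.frame(
  A    = c(1, 1, 1, 2, 2, 2, 3, 4),
  B    = c(2, 3, 4, 3, 4, 1, 1, 1),
  prob = c(0.1, 0.2, 0.3, 0.1, 0.4, 0.1, 0.2, 0.3)
)

# Order-independent key: smaller value first, larger value second.
lo <- pmin(df1$A, df1$B)
hi <- pmax(df1$A, df1$B)

# Keep the first row seen for each unordered pair.
res <- df1[!duplicated(data.frame(lo, hi)), ]
```

Unlike the grouped data.table call, this keeps A and B as they appear in the first row of each pair instead of rewriting them as the minimum and maximum.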
Pair-wise duplicate removal from a data frame
One solution is to first sort each row of df:
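# sort the values within each row so that (B, A) becomes (A, B);
# assumes the columns are character, not factor
for (i in seq_len(nrow(df)))
{
df[i, ] <- sort(unlist(df[i, ], use.names = FALSE))
}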
df
a b
1 A A
2 A B
3 A B
4 B C
5 A B
6 A B
7 B C
8 B C
At that point it's just a matter of removing the duplicated elements:
df = df[!duplicated(df),]
df
a b
1 A A
2 A B
4 B C
As thelatemail mentioned in the comments, your code actually keeps the duplicates. You need to use !duplicated to remove them.
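The row-sorting step can also be written without an explicit loop. The sketch below is one possible variant, not the answer's own code. The input values are hypothetical, chosen so that the sorted rows match the table printed above, and the columns are assumed to be character.

```r
# Hypothetical input whose sorted rows match the table shown above.
df <- data.frame(
  a = c("A", "B", "A", "C", "B", "A", "B", "C"),
  b = c("A", "A", "B", "B", "A", "B", "C", "B"),
  stringsAsFactors = FALSE
)

# apply() returns one sorted row per column, so transpose back to rows.
sorted <- t(apply(df, 1, sort))

# Rebuild a data frame from the sorted matrix, then drop repeated rows.
out <- data.frame(a = sorted[, 1], b = sorted[, 2], stringsAsFactors = FALSE)
out <- out[!duplicated(out), ]
```

With this input, out keeps rows 1, 2 and 4, the same rows as the loop version.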
Remove reversed duplicates from a data frame
Try this one. It's done entirely in pandas, so it should be faster.
This also corrects bugs in my previous answer, but the concept of taking the labels as a pair remains the same.
In [384]: df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)
Keep only the row with the maximum value for each pair:
In [385]: dfd = df.loc[df.groupby('pair')[2].idxmax()]
If you need the names to be in separate columns:
In [398]: dfd[0] = dfd['pair'].transform(lambda x: x.split('-')[0])
In [399]: dfd[1] = dfd['pair'].transform(lambda x: x.split('-')[1])