Deleting Reversed Duplicates with R

Deleting reversed duplicates with R

mydf <- read.table(text="gene_x    gene_y
AT1 AT2
AT3 AT4
AT1 AT2
AT1 AT3
AT2 AT1", header=TRUE, stringsAsFactors=FALSE)

Here's one strategy using apply, sort, paste, and duplicated:

mydf[!duplicated(apply(mydf,1,function(x) paste(sort(x),collapse=''))),]
gene_x gene_y
1 AT1 AT2
2 AT3 AT4
4 AT1 AT3
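
To see what duplicated is comparing here, it helps to run the inner apply call on its own: each row is collapsed into an order-independent key, so the reversed pair in row 5 gets the same key as row 1.

apply(mydf, 1, function(x) paste(sort(x), collapse=''))
# rows 1 to 5 give: "AT1AT2" "AT3AT4" "AT1AT2" "AT1AT3" "AT1AT2"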

And here's a slightly different solution:

mydf[!duplicated(lapply(as.data.frame(t(mydf), stringsAsFactors=FALSE), sort)),]
gene_x gene_y
1 AT1 AT2
2 AT3 AT4
4 AT1 AT3

Remove both pairs of a duplicated case

One way to remove every instance of a duplicated row is to run duplicated a second time over the data in reverse order. duplicated never flags the first occurrence of a row; it only marks the later repeats, so a single pass always keeps one copy of each duplicated row. By combining a forward pass with a reverse pass we flag every occurrence, and can then drop all rows which contain duplicated data.
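
The df from the original question is not reproduced here. For a runnable illustration, assume a data frame with the same ID/Feature/From/To layout; the values in the first four rows are made up, and only rows 5 and 6 are taken from the output shown below.

df <- data.frame(ID      = c(1, 1, 2, 2, 3, 3),
                 Feature = c("A", "A", "B", "B", "A", "B"),
                 From    = c("2014-01-01", "2014-01-01", "2014-06-01", "2014-06-01", "2015-01-01", "2015-01-01"),
                 To      = c("2015-01-01", "2015-01-01", "2015-06-01", "2015-06-01", "2016-01-01", "2017-01-01"),
                 stringsAsFactors = FALSE)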

# first pass
s1 = !duplicated(df[,1:3])
# second pass on the data.frame with reversed order in each column
s2 = !duplicated(apply(df[,1:3], 2, rev))
# the second pass needs to be back-reversed to match the original df
df[s1 & rev(s2), ]
ID Feature From To
5 3 A 2015-01-01 2016-01-01
6 3 B 2015-01-01 2017-01-01

Or we can use a more elegant solution that @dalloliogm pointed out, and apply duplicated with argument fromLast = TRUE.

s2 = !duplicated(df[,1:3], fromLast = TRUE)
df[s1 & s2, ]
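
For comparison, the same two-pass idea applied to mydf from the first example removes only rows 1 and 3, which are exact duplicates of each other; the reversed pair in row 5 is left alone, because duplicated compares the rows as they are.

s1 <- !duplicated(mydf)
s2 <- !duplicated(mydf, fromLast = TRUE)
mydf[s1 & s2, ]
#   gene_x gene_y
# 2    AT3    AT4
# 4    AT1    AT3
# 5    AT2    AT1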

Removing duplicate rows from data frame in R

We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1), group by pmin(A, B) and pmax(A, B), and, within each group, keep only the first row if the group has more than one row; otherwise return the rows unchanged.
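
The df1 from that question is not shown here; for the sake of a runnable example, assume made-up data along these lines, where the pairs (1,2), (1,3) and (2,4) each occur in both orders. With this input, both snippets below reproduce the output shown.

df1 <- data.frame(A    = c(1, 1, 1, 2, 2, 2, 3, 4),
                  B    = c(2, 3, 4, 1, 3, 4, 1, 2),
                  prob = c(0.1, 0.2, 0.3, 0.5, 0.1, 0.4, 0.6, 0.7))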

library(data.table)
setDT(df1)[, if (.N > 1) head(.SD, 1) else .SD, .(A = pmin(A, B), B = pmax(A, B))]
# A B prob
#1: 1 2 0.1
#2: 1 3 0.2
#3: 1 4 0.3
#4: 2 3 0.1
#5: 2 4 0.4

Or we can simply use duplicated on the pmax and pmin output to get a logical index, and subset the data based on that.

setDT(df1)[!duplicated(cbind(pmax(A, B), pmin(A, B)))]
# A B prob
#1: 1 2 0.1
#2: 1 3 0.2
#3: 1 4 0.3
#4: 2 3 0.1
#5: 2 4 0.4
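
If you would rather stay in base R, the same logical index works without the data.table conversion (a small variation, not part of the original answer):

keep <- !duplicated(cbind(pmax(df1$A, df1$B), pmin(df1$A, df1$B)))
df1[keep, ]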

pair-wise duplicate removal from dataframe
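
As before, the original df is not reproduced; a made-up two-column data frame that leads to the output below could be:

df <- data.frame(a = c("A", "A", "B", "C", "A", "B", "B", "C"),
                 b = c("A", "B", "A", "B", "B", "A", "C", "B"),
                 stringsAsFactors = FALSE)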

One solution is to first sort each row of df (sort needs a plain vector, hence the unlist below):

for (i in 1:nrow(df)) {
  # sort the two values within the row; unlist() turns the one-row data frame into a character vector
  df[i, ] <- sort(unlist(df[i, ]))
}
df

a b
1 A A
2 A B
3 A B
4 B C
5 A B
6 A B
7 B C
8 B C

At that point it's just a matter of removing the duplicated elements:

df = df[!duplicated(df),]
df
a b
1 A A
2 A B
4 B C

As thelatemail mentioned in the comments, your code actually keeps the duplicates. You need to use !duplicated to remove them.
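
In other words, for any data frame x:

x[duplicated(x), ]    # subsetting with duplicated() keeps only the repeated rows
x[!duplicated(x), ]   # subsetting with !duplicated() drops them, keeping the first occurrence of each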

Remove reversed duplicates from a data frame

Try this one. It is done entirely in pandas, so it should be faster. It also corrects bugs in my previous answer, but the concept of treating the labels as a pair remains the same.

In [384]: df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)

Then, for each pair, keep only the row with the maximum value in column 2:

In [385]: dfd = df.loc[df.groupby('pair')[2].idxmax()]

If you need the names to be in separate columns:

In [398]: dfd[0] = dfd['pair'].transform(lambda x: x.split('-')[0])
In [399]: dfd[1] = dfd['pair'].transform(lambda x: x.split('-')[1])

