Fastest Way to Remove All Duplicates in R

Fastest way to remove all duplicates in R

Some timings:

set.seed(1001)
d <- sample(1:100000, 100000, replace=T)
d <- c(d, sample(d, 20000, replace=T)) # ensure many duplicates
mb <- microbenchmark::microbenchmark(
d[!(duplicated(d) | duplicated(d, fromLast=TRUE))],
setdiff(d, d[duplicated(d)]),
{tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
as.integer(names(table(d)[table(d)==1])),
d[!(duplicated.default(d) | duplicated.default(d, fromLast=TRUE))],
d[!(d %in% d[duplicated(d)])],
{ ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))]
)
summary(mb)[, c(1, 4)] # in milliseconds
# expr mean
#1 d[!(duplicated(d) | duplicated(d, fromLast = TRUE))] 18.34692
#2 setdiff(d, d[duplicated(d)]) 24.84984
#3 { tmp <- rle(sort(d)) tmp$values[tmp$lengths == 1] } 9.53831
#4 as.integer(names(table(d)[table(d) == 1])) 255.76300
#5 d[!(duplicated.default(d) | duplicated.default(d, fromLast = TRUE))] 18.35360
#6 d[!(d %in% d[duplicated(d)])] 24.01009
#7 { ud = unique(d) ud[tabulate(match(d, ud)) == 1L] } 32.10166
#8 d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))] 18.33475

Given the comments, let's check whether all of these approaches actually return the same result:

results <- list(
  d[!(duplicated(d) | duplicated(d, fromLast = TRUE))],
  setdiff(d, d[duplicated(d)]),
  {tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
  as.integer(names(table(d)[table(d) == 1])),
  d[!(duplicated.default(d) | duplicated.default(d, fromLast = TRUE))],
  d[!(d %in% d[duplicated(d)])],
  {ud <- unique(d); ud[tabulate(match(d, ud)) == 1L]},
  d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))]
)
# some approaches return the values sorted, so compare as sorted vectors
all(sapply(results, function(x) isTRUE(all.equal(sort(x), sort(results[[1]])))))
# TRUE

How can I remove all duplicates so that NONE are left in a data frame?

This will extract the rows which appear only once (assuming your data frame is named df):

df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]

How it works: the function duplicated() flags a row as TRUE if it repeats a row seen earlier, scanning from the first row down. With fromLast = TRUE it scans from the last row up, so the first occurrence of each duplicated row is flagged as well.

Both boolean vectors are combined with | (logical OR) into a vector that marks every row occurring more than once. Negating it with ! then gives a boolean vector marking the rows that occur exactly once.
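A minimal sketch with a small made-up data frame (the values here are purely illustrative):

df <- data.frame(a = c(1, 2, 1, 3), b = c("x", "y", "x", "z"))
# rows 1 and 3 are identical, so both are flagged and dropped
df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]
#   a b
# 2 2 y
# 4 3 z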

Removing duplicates from a data frame, very fast

We can get the indices of the rows (.I[which.min(VALUE)]) that have the minimum 'VALUE' for each 'SYMBOL' and use the resulting column ('V1') to subset the dataset.

library(data.table)
dt[dt[,.I[which.min(VALUE)],by=list(SYMBOL)]$V1]

Or, as @DavidArenburg mentioned, using setkey would be more efficient (although I am not sure why you get an error with the original data):

 setkey(dt, VALUE) 
indx <- dt[,.I[1L], by = SYMBOL]$V1
dt[indx]
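For a self-contained sketch (the dt below, with SYMBOL and VALUE columns, is made up for illustration; the original question's data is not shown here):

library(data.table)
dt <- data.table(SYMBOL = c("A", "A", "B", "B"), VALUE = c(5, 2, 7, 3))
dt[dt[, .I[which.min(VALUE)], by = list(SYMBOL)]$V1]
# keeps one row per SYMBOL: (A, 2) and (B, 3)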

Remove duplicated rows

Just subset your data frame to the columns you need, then use the unique function :D

# in the above example, you only need the first three columns
deduped.data <- unique( yourdata[ , 1:3 ] )
# the fourth column no longer 'distinguishes' them,
# so they're duplicates and thrown out.
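A runnable sketch, with a made-up yourdata whose rows agree on the first three columns and differ only in the fourth:

yourdata <- data.frame(a = c(1, 1), b = c(2, 2), c = c(3, 3), d = c(9, 8))
deduped.data <- unique(yourdata[, 1:3])
deduped.data
#   a b c
# 1 1 2 3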

Delete all duplicated rows in R

You can use table() to get a frequency table of your column, then use the result to subset:

singletons <- names(which(table(test$a) == 1))
test[test$a %in% singletons, ]

a b c
1 1 2 a
2 4 5 b
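For a self-contained version, here is a hypothetical test data frame consistent with the output above (the values 1 and 4 in column a occur exactly once):

test <- data.frame(a = c(1, 4, 2, 2),
                   b = c(2, 5, 3, 3),
                   c = c("a", "b", "c", "d"))
singletons <- names(which(table(test$a) == 1))
test[test$a %in% singletons, ]
#   a b c
# 1 1 2 a
# 2 4 5 b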

Remove all copies of rows with duplicate values in R

duplicated() returns TRUE only from the second occurrence of a value onwards. To flag all copies of the duplicated values, also apply duplicated() in reverse, i.e. from the last row to the first with fromLast = TRUE, combine the two with | (logical OR), negate with !, and use the result to subset the dataset.

db[!(duplicated(db[2:3])|duplicated(db[2:3], fromLast=TRUE)),]
# name position type
# 2 B 13 T
# 4 D 12 T
# 5 E 11 S
# 6 F 10 S

Remove duplicated rows using dplyr

Note: dplyr now contains the distinct function for this purpose.

Original answer below:


library(dplyr)
set.seed(123)
df <- data.frame(
x = sample(0:1, 10, replace = T),
y = sample(0:1, 10, replace = T),
z = 1:10
)

One approach would be to group, and then only keep the first row:

df %>% group_by(x, y) %>% filter(row_number(z) == 1)

## Source: local data frame [3 x 3]
## Groups: x, y
##
## x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4

(In dplyr 0.2 you won't need the dummy z variable and will just be
able to write row_number() == 1)

I've also been thinking about adding a slice() function that would
work like:

df %>% group_by(x, y) %>% slice(from = 1, to = 1)

Or maybe a variation of unique() that would let you select which
variables to use:

df %>% unique(x, y)
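As the note at the top of this answer says, current dplyr versions provide distinct(); a sketch on the same df, assuming you want the first row per (x, y) combination and want to retain the remaining columns:

df %>% distinct(x, y, .keep_all = TRUE)
# keeps the first row for each distinct (x, y) pair, including its z value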

Remove duplicates keeping entry with largest absolute value

First, sort so that the less desired rows come last within each id group:

 aa <- a[order(a$id, -abs(a$value) ), ] #sort by id and reverse of abs(value)

Then, remove all rows after the first within each id group:

 aa[ !duplicated(aa$id), ]              # take the first row within each id
id value
2 1 2
4 2 -4
5 3 -5
6 4 6
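A self-contained sketch, with a hypothetical data frame a chosen to be consistent with the output above:

a <- data.frame(id    = c(1, 1, 2, 2, 3, 4),
                value = c(1, 2, 3, -4, -5, 6))
aa <- a[order(a$id, -abs(a$value)), ]
aa[!duplicated(aa$id), ]  # reproduces the id/value table shown above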

