Fastest way to remove all duplicates in R
Some timings:
set.seed(1001)
d <- sample(1:100000, 100000, replace=T)
d <- c(d, sample(d, 20000, replace=T)) # ensure many duplicates
mb <- microbenchmark::microbenchmark(
d[!(duplicated(d) | duplicated(d, fromLast=TRUE))],
setdiff(d, d[duplicated(d)]),
{tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
as.integer(names(table(d)[table(d)==1])),
d[!(duplicated.default(d) | duplicated.default(d, fromLast=TRUE))],
d[!(d %in% d[duplicated(d)])],
{ ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))]
)
summary(mb)[, c(1, 4)] # in milliseconds
# expr mean
#1 d[!(duplicated(d) | duplicated(d, fromLast = TRUE))] 18.34692
#2 setdiff(d, d[duplicated(d)]) 24.84984
#3 { tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1] } 9.53831
#4 as.integer(names(table(d)[table(d) == 1])) 255.76300
#5 d[!(duplicated.default(d) | duplicated.default(d, fromLast = TRUE))] 18.35360
#6 d[!(d %in% d[duplicated(d)])] 24.01009
#7 { ud = unique(d); ud[tabulate(match(d, ud)) == 1L] } 32.10166
#8 d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))] 18.33475
Given the comments, let's check that all the approaches actually return the same result:
results <- list(d[!(duplicated(d) | duplicated(d, fromLast=TRUE))],
setdiff(d, d[duplicated(d)]),
{tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
as.integer(names(table(d)[table(d)==1])),
d[!(duplicated.default(d) | duplicated.default(d, fromLast=TRUE))],
d[!(d %in% d[duplicated(d)])],
{ ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))])
# Some methods return the values sorted, others in order of appearance,
# so sort before comparing:
all(sapply(results, function(x) isTRUE(all.equal(sort(x), sort(results[[1]])))))
# TRUE
How can I remove all duplicates so that NONE are left in a data frame?
This will extract the rows which appear only once (assuming your data frame is named df):
df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]
How it works: the function duplicated marks a row as TRUE from its second occurrence onwards, scanning from the first row. With the argument fromLast = TRUE, it scans from the last row instead, so the first occurrence of each duplicated row is marked too. Both boolean results are combined with | (logical 'or') into a new vector which flags every row appearing more than once. Negating this with ! creates a boolean vector indicating the rows appearing exactly once.
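A minimal, self-contained illustration of this mechanism (the data frame here is made up for the example):

```r
# Toy data frame: the (1, "a") row appears twice
df <- data.frame(x = c(1, 2, 1, 3), y = c("a", "b", "a", "c"))

# duplicated(df):                 FALSE FALSE  TRUE FALSE  (marks the 2nd copy)
# duplicated(df, fromLast = TRUE): TRUE FALSE FALSE FALSE  (marks the 1st copy)
dup_any <- duplicated(df) | duplicated(df, fromLast = TRUE)

df[!dup_any, ]
#   x y
# 2 2 b
# 4 3 c
```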
Removing duplicates from a data frame, very fast
We could get the index of the rows (.I[which.min(..)]) that have the minimum 'VALUE' for each 'SYMBOL' and use that column ('V1') to subset the dataset.
library(data.table)
dt[dt[,.I[which.min(VALUE)],by=list(SYMBOL)]$V1]
Or, as @DavidArenburg mentioned, using setkey would be more efficient (although I am not sure why you get an error with the original data):
setkey(dt, VALUE)
indx <- dt[,.I[1L], by = SYMBOL]$V1
dt[indx]
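If data.table is not available, the same "keep the row with the minimum VALUE per SYMBOL" logic can be sketched in base R; the dt, SYMBOL, and VALUE names below are stand-ins for the question's data, which is not shown:

```r
# Example data standing in for the question's dt
dt <- data.frame(SYMBOL = c("A", "A", "B", "B", "B"),
                 VALUE  = c(5, 2, 9, 1, 4))

# For each SYMBOL, find the row index holding the minimum VALUE
idx <- tapply(seq_len(nrow(dt)), dt$SYMBOL,
              function(i) i[which.min(dt$VALUE[i])])

dt[idx, ]
#   SYMBOL VALUE
# 2      A     2
# 4      B     1
```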
Remove duplicated rows
Just isolate your data frame to the columns you need, then use the unique function :D
# in the above example, you only need the first three columns
deduped.data <- unique( yourdata[ , 1:3 ] )
# the fourth column no longer 'distinguishes' them,
# so they're duplicates and thrown out.
Delete all duplicated rows in R
You can use table() to get a frequency table of your column, then use the result to subset:
singletons <- names(which(table(test$a) == 1))
test[test$a %in% singletons, ]
a b c
1 1 2 a
2 4 5 b
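The question's test data frame is not shown; the following reconstruction is an assumption, chosen so that the values 1 and 4 in column a occur exactly once and the output matches the result above:

```r
# Hypothetical 'test' data frame: values 2 and 3 in column a are repeated
test <- data.frame(a = c(1, 4, 2, 2, 3, 3),
                   b = c(2, 5, 1, 1, 1, 1),
                   c = c("a", "b", "x", "x", "y", "y"))

singletons <- names(which(table(test$a) == 1))
test[test$a %in% singletons, ]
#   a b c
# 1 1 2 a
# 2 4 5 b
```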
Remove all copies of rows with duplicate values in R
duplicated returns TRUE only from the second occurrence of a value onwards. To flag all the elements that have duplicates, we apply duplicated in the reverse direction as well, i.e. from the last value to the first, combine the two with the OR condition |, negate, and subset the dataset.
db[!(duplicated(db[2:3])|duplicated(db[2:3], fromLast=TRUE)),]
# name position type
# 2 B 13 T
# 4 D 12 T
# 5 E 11 S
# 6 F 10 S
Remove duplicated rows using dplyr
Note: dplyr now contains the distinct function for this purpose.
Original answer below:
library(dplyr)
set.seed(123)
df <- data.frame(
x = sample(0:1, 10, replace = T),
y = sample(0:1, 10, replace = T),
z = 1:10
)
One approach would be to group, and then only keep the first row:
df %>% group_by(x, y) %>% filter(row_number(z) == 1)
## Source: local data frame [3 x 3]
## Groups: x, y
##
## x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4
(In dplyr 0.2 you won't need the dummy z variable and will just be able to write row_number() == 1.)
I've also been thinking about adding a slice() function that would work like:
df %>% group_by(x, y) %>% slice(from = 1, to = 1)
Or maybe a variation of unique() that would let you select which variables to use:
df %>% unique(x, y)
Remove duplicates keeping entry with largest absolute value
First, sort so that the less desired items come last within id groups:
aa <- a[order(a$id, -abs(a$value) ), ] #sort by id and reverse of abs(value)
Then remove all items after the first within id groups:
aa[ !duplicated(aa$id), ] # take the first row within each id
id value
2 1 2
4 2 -4
5 3 -5
6 4 6
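The question's data frame a is not shown; putting the two steps together with a frame consistent with the output above (this input is an assumption):

```r
# Data consistent with the output shown; the question's exact frame is assumed
a <- data.frame(id    = c(1, 1, 2, 2, 3, 4),
                value = c(1, 2, 3, -4, -5, 6))

aa <- a[order(a$id, -abs(a$value)), ]  # largest |value| first within each id
aa[!duplicated(aa$id), ]               # keep that first row per id
#   id value
# 2  1     2
# 4  2    -4
# 5  3    -5
# 6  4     6
```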