Find Duplicated Elements With Dplyr

I guess you could use filter for this purpose:

library(dplyr)

mtcars %>%
  group_by(carb) %>%
  filter(n() > 1)  # keep only groups with more than one row

Small example (note that I added summarize() to show that the resulting data set contains only rows whose 'carb' value occurs more than once. I used 'carb' instead of 'cyl' because 'carb' has values that occur only once (6 and 8), whereas every 'cyl' value is repeated):

mtcars %>% group_by(carb) %>% summarize(n=n())
#Source: local data frame [6 x 2]
#
# carb n
#1 1 7
#2 2 10
#3 3 3
#4 4 10
#5 6 1
#6 8 1

mtcars %>% group_by(carb) %>% filter(n()>1) %>% summarize(n=n())
#Source: local data frame [4 x 2]
#
# carb n
#1 1 7
#2 2 10
#3 3 3
#4 4 10
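
A quick way to double-check the result is to count the rows that survive the filter. The sketch below (the names `dups` and `counted` are just illustrative) also shows `add_count()` as a possible alternative that never groups the data in the first place:

```r
library(dplyr)

# Rows whose 'carb' value occurs more than once
dups <- mtcars %>%
  group_by(carb) %>%
  filter(n() > 1) %>%
  ungroup()

# Same rows via add_count(), which appends the group size as column 'n'
counted <- mtcars %>%
  add_count(carb) %>%
  filter(n > 1)

nrow(dups)               # 30: the two rows with carb 6 and 8 are dropped
sort(unique(dups$carb))  # 1 2 3 4
```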

High-performance way to find duplicated rows (using dplyr) on big data set

For large-ish data, it's often useful to try a data.table approach. In this case you can find duplicate rows using:

library(data.table)
setDT(df1, key = c("valA", "valB", "Score"))
df1[, N := .N, by = key(df1)] # count rows per group
df1[N > 1]
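
Since `df1` is not shown in the answer, here is a minimal sketch on made-up data (the values are hypothetical; only the column names come from the code above):

```r
library(data.table)

# Hypothetical data using the column names from the answer
df1 <- data.frame(
  valA  = c("a", "a", "b", "c", "c"),
  valB  = c(1, 1, 2, 3, 3),
  Score = c(10, 10, 20, 30, 31)
)

setDT(df1, key = c("valA", "valB", "Score"))  # convert by reference and sort by key
df1[, N := .N, by = key(df1)]                 # count rows per key combination
dups <- df1[N > 1]                            # keep combinations seen more than once

nrow(dups)  # 2: only the two identical ("a", 1, 10) rows
```

The two "c" rows differ in Score, so they do not count as duplicates here.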

Find duplicated character values in two columns with dplyr

You could try:

library(dplyr)

dat %>%
  filter(duplicated(paste0(`First name`, `Last name`)))

Output on the basis of data below:

  First name Last name
1 Peter Parker
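
One caveat with pasting the columns together: without a separator, different name pairs can collapse into the same string. A minimal sketch of the problem and one possible way around it, using a hypothetical data frame `people` with made-up columns `given` and `family`:

```r
library(dplyr)

# Hypothetical data: "ab" + "c" and "a" + "bc" both paste to "abc"
people <- data.frame(given = c("ab", "a"), family = c("c", "bc"))

# paste0() flags the second row although the two rows differ
false_pos <- people %>%
  filter(duplicated(paste0(given, family)))

# Comparing the columns as a data frame avoids the collision
safe <- people %>%
  filter(duplicated(data.frame(given, family)))

nrow(false_pos)  # 1
nrow(safe)       # 0
```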

If you'd like to have all the duplications returned, you could do:

dat %>%
  group_by(`First name`, `Last name`) %>%
  filter(n() > 1)

Output on the basis of data below:

# A tibble: 2 x 2
# Groups: First name, Last name [1]
`First name` `Last name`
<fct> <fct>
1 Peter Parker
2 Peter Parker

Example data:

dat <-
  data.frame(
    `First name` = c("Peter", "Peter", "John", "John"),
    `Last name` = c("Parker", "Parker", "Biscuit", "Chocolate"),
    check.names = FALSE
  )

dat

First name Last name
1 Peter Parker
2 Peter Parker
3 John Biscuit
4 John Chocolate

Finding ALL duplicate rows, including elements with smaller subscripts

duplicated() has a fromLast argument. The "Examples" section of ?duplicated shows you how to use it. Just call duplicated() twice, once with fromLast=FALSE and once with fromLast=TRUE, and take the rows where either is TRUE.


Late edit:
The question didn't include a reproducible example, so here's an illustration kindly contributed by @jbaums:

vec <- c("a", "b", "c","c","c") 
vec[duplicated(vec) | duplicated(vec, fromLast=TRUE)]
## [1] "c" "c" "c"

Edit: And an example for the case of a data frame:

df <- data.frame(rbind(c("a","a"),c("b","b"),c("c","c"),c("c","c")))
df[duplicated(df) | duplicated(df, fromLast=TRUE), ]
## X1 X2
## 3 c c
## 4 c c
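
To see why both calls are needed, it helps to look at the two logical vectors separately. The helper `all_duplicated()` below is a hypothetical convenience wrapper, not a base R function:

```r
vec <- c("a", "b", "c", "c", "c")

duplicated(vec)                   # FALSE FALSE FALSE  TRUE  TRUE (misses the first "c")
duplicated(vec, fromLast = TRUE)  # FALSE FALSE  TRUE  TRUE FALSE (misses the last "c")

# Hypothetical helper: TRUE for every element that has a twin anywhere
all_duplicated <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

all_duplicated(vec)               # FALSE FALSE  TRUE  TRUE  TRUE
```

Because duplicated() also has a data frame method, the same helper can be used for row subsetting, for example `df[all_duplicated(df), ]`.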

Find duplicate rows in data frame based on multiple columns in R

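We can drop the duplicates with unique(), which keeps the first row for each combination of the listed columns:

library(data.table)
unique(setDT(data_concern_join2),
       by = c('locid', 'stdate', 'sttime', 'charnam', 'valunit'))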
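
To list the repeated rows themselves rather than the de-duplicated table, one option is data.table's `duplicated()` method, which also accepts `by`. A minimal sketch on hypothetical data (`dt`, `key_cols`, and all values are made up, and only two of the key columns are used for brevity):

```r
library(data.table)

# Hypothetical data; only the column names 'locid' and 'stdate' come from the answer
dt <- data.table(
  locid  = c("A", "A", "B", "C"),
  stdate = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02"),
  value  = c(1.0, 1.5, 2.0, 3.0)
)
key_cols <- c("locid", "stdate")

# TRUE for every row whose key combination appears more than once
is_dup <- duplicated(dt, by = key_cols) |
  duplicated(dt, by = key_cols, fromLast = TRUE)

dt[is_dup]                 # both "A" rows
unique(dt, by = key_cols)  # one row per key combination (3 rows)
```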

Find duplicates with grouped variables

We gather the 'from' and 'to' columns into 'long' format, group by 'val', filter the groups that contain more than one unique 'group', then pull the unique 'val' elements:

library(dplyr)
library(tidyr)
df1 %>%
  gather(key, val, from:to) %>%
  group_by(val) %>%
  filter(n_distinct(group) > 1) %>%
  distinct(val) %>%
  pull(val)
#[1] 1 4

Or, using base R, we can just use table() to find the frequencies and get the ids out of it:

out <-  with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"

data

df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L, 
4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train",
"train", "train")), class = "data.frame", row.names = c(NA, -6L
))
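
gather() still works, but tidyr now recommends pivot_longer() for new code. A sketch of the same pipeline with pivot_longer(), assuming tidyr 1.0.0 or later (the name `shared` is only illustrative):

```r
library(dplyr)
library(tidyr)

# Same data as above, written out so the snippet runs on its own
df1 <- data.frame(
  from  = c(1L, 2L, 3L, 4L, 6L, 8L),
  to    = c(2L, 4L, 4L, 5L, 1L, 7L),
  group = rep(c("metro", "train"), each = 3)
)

shared <- df1 %>%
  pivot_longer(c(from, to), names_to = "key", values_to = "val") %>%
  group_by(val) %>%
  filter(n_distinct(group) > 1) %>%  # ids that occur in more than one group
  distinct(val) %>%
  pull(val)

sort(shared)
#[1] 1 4
```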

Find unique entries in otherwise identical rows

A data.table alternative. Coerce data frame to a data.table (setDT). Melt data to long format (melt(df, id.vars = "ID")).

Within each group defined by 'ID' and 'variable' (corresponding to the columns in the wide format) (by = .(ID, variable)), count number of unique values (uniqueN(value)) and check if it's equal to the number of rows in the subgroup (== .N). If so (if), select the entire subgroup (.SD).

Finally, reshape the data back to wide format (dcast).

library(data.table)
setDT(df)
d = melt(df, id.vars = "ID")
dcast(d[ , if(uniqueN(value) == .N) .SD, by = .(ID, variable)], ID + rowid(ID, variable) ~ variable)
# ID ID_1 x2 x3 x5
# 1: 1 1 <NA> 7 x
# 2: 1 2 <NA> 10 p
# 3: 3 1 c 9 z
# 4: 3 2 d 11 q
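
The filtering step is the core of this approach. A minimal sketch on hypothetical long-format data (the table `d` and its values are made up) shows how `if (...) .SD` keeps or drops whole subgroups:

```r
library(data.table)

# Hypothetical data, already in long format
d <- data.table(
  ID       = c(1, 1, 1, 1),
  variable = c("x1", "x1", "x2", "x2"),
  value    = c("a", "a", "b", "c")
)

# x1 holds the same value twice: uniqueN(value) is 1 but .N is 2, so it is dropped.
# x2 holds two different values: uniqueN(value) equals .N, so it is kept.
kept <- d[, if (uniqueN(value) == .N) .SD, by = .(ID, variable)]

kept$value  # "b" "c"
```

When the `if` condition is FALSE the expression returns NULL, and data.table simply omits that group from the result.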

