Selecting Only Duplicates Based on Multiple Columns in R

Find duplicated rows (based on 2 columns) in Data Frame in R

You can simply pass those first two columns to duplicated():

duplicated(dat[, 1:2])

assuming your data frame is called dat. For more detail, consult the help page by typing ?duplicated at the console, which says:

Determines which elements of a vector or data frame are duplicates of
elements with smaller subscripts, and returns a logical vector
indicating which elements (rows) are duplicates.

So duplicated returns a logical vector, which we can then use to extract a subset of dat:

ind <- duplicated(dat[, 1:2])
dat[ind, ]

or you can skip the separate assignment step and simply use:

dat[duplicated(dat[, 1:2]), ]
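
For a quick sanity check, here is a minimal sketch with made-up data (the data frame and its columns are hypothetical):

dat <- data.frame(x = c(1, 1, 2), y = c("a", "a", "b"), z = 1:3)
duplicated(dat[, 1:2])   # second row repeats the (x, y) pair of the first
# [1] FALSE  TRUE FALSE
dat[duplicated(dat[, 1:2]), ]
#   x y z
# 2 1 a 2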

Find duplicate rows in a data frame based on multiple columns in R

We can do

library(data.table)
unique(setDT(data_concern_join2),
       by = c('locid', 'stdate', 'sttime', 'charnam', 'valunit'))
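
Note that unique() keeps the first row for each combination of those columns. If the goal is instead to pull out the duplicated rows, duplicated() also accepts a by argument on a data.table (a sketch; the setDT() call above has already converted the table by reference):

data_concern_join2[duplicated(data_concern_join2,
    by = c('locid', 'stdate', 'sttime', 'charnam', 'valunit'))]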

R - find and list duplicate rows based on two columns

Here is an option using duplicated twice, the second time with fromLast = TRUE. On its own, duplicated() returns TRUE only from the second occurrence of a value onward; the fromLast = TRUE pass catches the first occurrences as well:

dupe = data[,c('T.N','ID')] # select columns to check duplicates
data[duplicated(dupe) | duplicated(dupe, fromLast=TRUE),]

#      File T.N ID Col1 Col2
# 1 BAI.txt   T  1 sdaf eiri
# 3 BBK.txt   T  1  ter  ase
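
If you need this pattern in several places, it is easy to wrap in a small helper (a sketch; the function name is my own):

# Return every row of d whose combination of `cols` occurs more than once,
# including the first occurrence.
all_dupes <- function(d, cols) {
  key <- d[, cols, drop = FALSE]
  d[duplicated(key) | duplicated(key, fromLast = TRUE), ]
}
all_dupes(data, c('T.N', 'ID'))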

Use R to find duplicates in multiple columns at once

We can use unique() with the by option from data.table:

library(data.table)
unique(setDT(df), by = c("Surname", "Address"))
#    Surname First Name Address
# 1:      A1      Bobby      X1
# 2:      B5        Joe      X2
# 3:      B5       Mary      X3
# 4:      F2        Lou      X4
# 5:      F3      Sarah      X5
# 6:      G4      Bobby      X6
# 7:      H5       Eric      X7
# 8:      K6      Peter      X8
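
By default unique() keeps the first row within each Surname/Address pair; unique.data.table also accepts fromLast = TRUE if you want the last row instead (a sketch):

unique(setDT(df), by = c("Surname", "Address"), fromLast = TRUE)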

Or with tidyverse

library(dplyr)
df %>%
  distinct(Surname, Address, .keep_all = TRUE)
#   Surname First Name Address
# 1      A1      Bobby      X1
# 2      B5        Joe      X2
# 3      B5       Mary      X3
# 4      F2        Lou      X4
# 5      F3      Sarah      X5
# 6      G4      Bobby      X6
# 7      H5       Eric      X7
# 8      K6      Peter      X8

Update

Based on the updated post, perhaps this helps

setDT(df)[, if (uniqueN(FirstName) > 1) .SD, .(Surname, Address)]
#    Surname Address FirstName
# 1:      G4      X6     Bobby
# 2:      G4      X6      Fred
# 3:      G4      X6      Anna
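
A dplyr equivalent of that last step might look like this (a sketch, assuming the same column names):

library(dplyr)
df %>%
  group_by(Surname, Address) %>%
  filter(n_distinct(FirstName) > 1) %>%
  ungroup()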

Duplicates in multiple columns

It works if you use duplicated twice:

df[!(duplicated(df[c("c","d")]) | duplicated(df[c("c","d")], fromLast = TRUE)), ]

  a  b c    d
1 1  2 A 1001
4 4  8 C 1003
7 7 13 E 1005
8 8 14 E 1006
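
Dropping the ! flips the selection and returns only the rows whose (c, d) combination occurs more than once:

df[duplicated(df[c("c","d")]) | duplicated(df[c("c","d")], fromLast = TRUE), ]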

Remove duplicate rows based on multiple columns using dplyr / tidyverse?

duplicated() expects to operate on "a vector or a data frame or an array" (not on two separate vectors; it looks for duplication in its first argument only).

df %>%
  filter(duplicated(.))
#   a b
# 1 1 1
# 2 2 2

df %>%
  filter(!duplicated(.))
#   a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1

If you prefer to reference a specific subset of columns, then use cbind:

df %>%
  filter(duplicated(cbind(a, b)))
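
With dplyr 1.1.0 or later, pick() can stand in for cbind() here:

df %>%
  filter(duplicated(pick(a, b)))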

As a side note, the dplyr verb for this is distinct():

df %>%
  distinct(a, b, .keep_all = TRUE)
#   a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1

though I don't know of a built-in inverse for it.
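
One way to build that inverse yourself, keeping only the rows whose a/b combination occurs more than once, is to count within groups (a sketch):

df %>%
  group_by(a, b) %>%
  filter(n() > 1) %>%
  ungroup()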

R - Identify duplicate rows based on multiple columns and remove them based on date

You can use slice() to select the latest row where Wave == 2.

library(dplyr)

df %>%
  mutate(Date = lubridate::dmy(Date)) %>%
  group_by(ID, Wave) %>%
  slice(if (first(Wave) == 2) which.max(Date) else seq_len(n()))

#    ID     Wave Date
#    <chr> <int> <date>
#  1 A         1 2020-02-12
#  2 A         1 2020-02-11
#  3 A         2 2020-02-20
#  4 B         1 2020-02-13
#  5 B         2 2020-02-21
#  6 C         1 2020-02-14
#  7 C         2 2020-02-23
#  8 D         1 2020-02-15
#  9 D         2 2020-02-24
# 10 E         1 2020-02-16
# 11 E         2 2020-02-25
# 12 F         1 2020-02-17
# 13 F         2 2020-02-26
# 14 G         1 2020-02-18
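
An equivalent formulation with filter() might look like this (a sketch under the same grouping; note that unlike which.max(), it keeps all rows tied for the latest Date):

df %>%
  mutate(Date = lubridate::dmy(Date)) %>%
  group_by(ID, Wave) %>%
  filter(first(Wave) != 2 | Date == max(Date)) %>%
  ungroup()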

How to find duplicates based on values in 2 columns but also the groupings by another column in R?

It was a little unclear if you wanted to return:

  1. only the distinct rows
  2. single examples of duplicated rows
  3. all duplicated rows

So here are some options:

library(dplyr)
library(readr)

"ID,a,b
1, 1, 1
1, 1, 1
1, 1, 2
2, 1, 1
2, 1, 2" %>%
  read_csv() -> exp_dat

# return only distinct rows
exp_dat %>%
  distinct(ID, a, b)

# # A tibble: 4 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1
# 2     1     1     2
# 3     2     1     1
# 4     2     1     2

# return single examples of duplicated rows
exp_dat %>%
  group_by(ID, a, b) %>%
  count() %>%
  filter(n > 1) %>%
  ungroup() %>%
  select(-n)

# # A tibble: 1 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1

# return all duplicated rows
exp_dat %>%
  group_by(ID, a, b) %>%
  add_count() %>%
  filter(n > 1) %>%
  ungroup() %>%
  select(-n)

# # A tibble: 2 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1
# 2     1     1     1
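
If an extra dependency is acceptable, the janitor package offers get_dupes(), which covers the "all duplicated rows" case directly and adds a dupe_count column (a sketch, assuming janitor is installed):

library(janitor)
exp_dat %>%
  get_dupes(ID, a, b)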

Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact

You can group by x2 and x3 and use slice(), i.e.

library(dplyr)

df %>%
  group_by(x2, x3) %>%
  slice(which.max(x4))

# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5
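
With dplyr 1.0 or later, slice_max() expresses the same idea more directly; with_ties = FALSE guarantees a single row per group even when x4 values tie (a sketch):

df %>%
  group_by(x2, x3) %>%
  slice_max(x4, n = 1, with_ties = FALSE) %>%
  ungroup()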

