Find Duplicated Rows (Based on 2 Columns) in Data Frame in R

You can always try simply passing those first two columns to the function duplicated:

duplicated(dat[,1:2])

assuming your data frame is called dat. For more information, we can consult the help file for the duplicated function by typing ?duplicated at the console, which gives the following description:

Determines which elements of a vector or data frame are duplicates of
elements with smaller subscripts, and returns a logical vector
indicating which elements (rows) are duplicates.

So duplicated returns a logical vector, which we can then use to extract a subset of dat:

ind <- duplicated(dat[,1:2])
dat[ind,]

or you can skip the separate assignment step and simply use:

dat[duplicated(dat[,1:2]),]
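
To see what this does, here is a minimal sketch with a made-up data frame (the columns x, y, and z are hypothetical):

dat <- data.frame(x = c(1, 1, 2, 2),
                  y = c("a", "a", "b", "c"),
                  z = 1:4)
duplicated(dat[,1:2])
# [1] FALSE  TRUE FALSE FALSE
dat[duplicated(dat[,1:2]),]
#   x y z
# 2 1 a 2

Row 2 repeats the x/y combination of row 1, so it is the only row flagged.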

R - find and list duplicate rows based on two columns

Here is an option using duplicated twice, the second time with fromLast = TRUE, because duplicated on its own returns TRUE only from the second occurrence of a value onwards; combining both directions flags every duplicated row, including the first occurrence:

dupe = data[,c('T.N','ID')] # select columns to check duplicates
data[duplicated(dupe) | duplicated(dupe, fromLast=TRUE),]

#     File T.N ID Col1 Col2
#1 BAI.txt   T  1 sdaf eiri
#3 BBK.txt   T  1  ter  ase
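
If you prefer dplyr, the same "return all duplicated rows" result can be obtained by keeping groups of size greater than one (a sketch assuming the same data and column names as above):

library(dplyr)
data %>%
  group_by(T.N, ID) %>%
  filter(n() > 1) %>%
  ungroup()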

Find duplicate rows in data frame based on multiple columns in r

We can do:

library(data.table)
unique(setDT(data_concern_join2),
by = c('locid', 'stdate', 'sttime', 'charnam', 'valunit'))
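
Note that unique() keeps one row per combination and drops the rest. If you want to inspect the duplicated rows instead, data.table's duplicated() accepts the same by argument; a sketch on the same object:

dt <- setDT(data_concern_join2)
dt[duplicated(dt, by = c('locid', 'stdate', 'sttime', 'charnam', 'valunit'))]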

Find duplicate rows based on 2 columns and keep rows based on the value of a 3rd column in R

You can do:

library(tidyverse)

df %>%
  group_by(id_number, date) %>%
  filter(!(result == 9 & row_number() > 1)) %>%
  ungroup()

# A tibble: 6 x 3
  id_number date       result
      <dbl> <chr>       <dbl>
1         1 2021-11-03      0
2         1 2021-11-19      1
3         2 2021-11-11      0
4         3 2021-11-05      0
5         3 2021-11-16      0
6         4 2021-11-29      9
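
For reference, here is a hypothetical input consistent with that output; the rows dropped by the filter are the result == 9 rows that share an id_number/date with an earlier row:

df <- tibble::tribble(
  ~id_number, ~date,        ~result,
  1,          "2021-11-03", 0,
  1,          "2021-11-03", 9,  # later duplicate with result 9 -> removed
  1,          "2021-11-19", 1,
  2,          "2021-11-11", 0,
  2,          "2021-11-11", 9,  # later duplicate with result 9 -> removed
  3,          "2021-11-05", 0,
  3,          "2021-11-16", 0,
  4,          "2021-11-29", 9   # only row in its group -> kept
)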

remove duplicate values based on 2 columns

This keeps only the first occurrence of each combination of the values in columns 1 and 4:

df[!duplicated(df[c(1, 4)]), ]
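
A quick sketch with made-up data, where columns 1 and 4 are a and d, shows that only the first occurrence of each a/d combination survives:

df <- data.frame(a = c(1, 1, 2),
                 b = c("x", "y", "z"),
                 c = 1:3,
                 d = c("p", "p", "q"))
df[!duplicated(df[c(1, 4)]), ]
#   a b c d
# 1 1 x 1 p
# 3 2 z 3 q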

How to find duplicates based on values in 2 columns but also the groupings by another column in R?

It was a little unclear if you wanted to return:

  1. only the distinct rows
  2. single examples of duplicated rows
  3. all duplicated rows

So here are some options:

library(dplyr)
library(readr)

"ID,a,b
1, 1, 1
1, 1, 1
1, 1, 2
2, 1, 1
2, 1, 2" %>%
read_csv() -> exp_dat

# return only distinct rows
exp_dat %>%
  distinct(ID, a, b)

# # A tibble: 4 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1
# 2     1     1     2
# 3     2     1     1
# 4     2     1     2

# return single examples of duplicated rows
exp_dat %>%
  group_by(ID, a, b) %>%
  count() %>%
  filter(n > 1) %>%
  ungroup() %>%
  select(-n)

# # A tibble: 1 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1

# return all duplicated rows
exp_dat %>%
  group_by(ID, a, b) %>%
  add_count() %>%
  filter(n > 1) %>%
  ungroup() %>%
  select(-n)

# # A tibble: 2 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1
# 2     1     1     1
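
For comparison, the "all duplicated rows" case can also be written in base R with the duplicated()/fromLast trick shown earlier, applied across all three columns:

exp_dat[duplicated(exp_dat) | duplicated(exp_dat, fromLast = TRUE), ]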

Use R to find duplicates in multiple columns at once

We can use unique with the by option from data.table:

library(data.table)
unique(setDT(df), by = c("Surname", "Address"))
#   Surname First Name Address
#1:      A1      Bobby      X1
#2:      B5        Joe      X2
#3:      B5       Mary      X3
#4:      F2        Lou      X4
#5:      F3      Sarah      X5
#6:      G4      Bobby      X6
#7:      H5       Eric      X7
#8:      K6      Peter      X8

Or with tidyverse

library(dplyr)
df %>%
  distinct(Surname, Address, .keep_all = TRUE)
#  Surname First Name Address
#1      A1      Bobby      X1
#2      B5        Joe      X2
#3      B5       Mary      X3
#4      F2        Lou      X4
#5      F3      Sarah      X5
#6      G4      Bobby      X6
#7      H5       Eric      X7
#8      K6      Peter      X8

Update

Based on the updated post, perhaps this helps:

setDT(df)[, if (uniqueN(FirstName) > 1) .SD, .(Surname, Address)]
#   Surname Address FirstName
#1:      G4      X6     Bobby
#2:      G4      X6      Fred
#3:      G4      X6      Anna
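
A dplyr translation of the same idea, in case you are not using data.table (this assumes the first-name column is called FirstName, as in the updated post):

library(dplyr)
df %>%
  group_by(Surname, Address) %>%
  filter(n_distinct(FirstName) > 1) %>%
  ungroup()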

Remove duplicated rows based on 2 columns in R

For the sake of completeness, the unique() function from the data.table package can be used as well:

library(data.table)
unique(setDT(df), by = "IndexA")
   TimeStamp IndexA IndexB     Value
1:  12:00:01      1     NA   Windows
2:  12:00:48     NA      1 Macintosh
3:  12:02:01      2     NA   Windows

This looks for unique values only in IndexA, which is equivalent to Tito Sanz's answer. This approach returns the expected result for the given sample data set, but checking only one column for duplicate entries is oversimplifying IMHO and may fail with production data.

Or, looking for unique combinations of the values in three columns (which is equivalent to www's answer):

unique(setDT(df), by = 2:4)                              # very terse
unique(setDT(df), by = c("IndexA", "IndexB", "Value"))   # explicitly named cols
   TimeStamp IndexA IndexB     Value
1:  12:00:01      1     NA   Windows
2:  12:00:48     NA      1 Macintosh
3:  12:02:01      2     NA   Windows
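
For completeness, the same three-column dedup can also be written with the base R idiom from earlier on this page (it works whether df is a data.frame or a data.table):

df[!duplicated(df[, c("IndexA", "IndexB", "Value")]), ]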

Data

library(data.table)
df <- fread(
"TimeStamp IndexA IndexB Value
12:00:01 1 NA Windows
12:00:05 1 NA Windows
12:00:13 1 NA Windows
12:00:48 NA 1 Macintosh
12:01:30 NA 1 Macintosh
12:01:45 NA 1 Macintosh
12:02:01 2 NA Windows
12:02:13 2 NA Windows")

r filter duplicate rows based on value in column

Here is an option:

df %>%
  group_by(Id) %>%
  filter(Col3 == "A" | n() == 1) %>%
  ungroup()
# A tibble: 3 x 5
#     Id Date       Col1  Col2 Col3
#  <int> <chr>     <int> <int> <chr>
#1     1 1/1/1995     NA     1 A
#2     2 3/10/1992     0     1 B
#3     3 8/15/2002     1     1 B

This keeps rows where Col3 == "A" as well as groups that consist of a single row. PS: I recommend always ending a pipeline with ungroup() to avoid unwanted surprises downstream.
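
A base R equivalent, using ave() to compute the per-group row counts under the same column names:

df[df$Col3 == "A" | ave(seq_along(df$Id), df$Id, FUN = length) == 1, ]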


