Find duplicated rows (based on 2 columns) in Data Frame in R
You can always try simply passing those first two columns to the function duplicated:
duplicated(dat[,1:2])
assuming your data frame is called dat. For more information, consult the help file for the duplicated function by typing ?duplicated at the console. It gives the following description:
Determines which elements of a vector or data frame are duplicates of
elements with smaller subscripts, and returns a logical vector
indicating which elements (rows) are duplicates.
So duplicated returns a logical vector, which we can then use to extract a subset of dat:
ind <- duplicated(dat[,1:2])
dat[ind,]
or you can skip the separate assignment step and simply use:
dat[duplicated(dat[,1:2]),]
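To make the behaviour concrete, here is a minimal sketch on a made-up data frame (dat below is a hypothetical toy example, not data from the question):

```r
# Hypothetical toy data: rows 1 and 3 share the same values in the first two columns
dat <- data.frame(x = c(1, 2, 1), y = c("a", "b", "a"), z = c(10, 20, 30))

duplicated(dat[, 1:2])      # FALSE FALSE TRUE: row 3 repeats row 1 on x and y

dat[duplicated(dat[, 1:2]), ]   # keeps only row 3, the duplicate
```

Note that the first occurrence is not flagged; only later repeats are marked TRUE.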
Find duplicate rows in data frame based on multiple columns in r
We can do
library(data.table)
unique(setDT(data_concern_join2),
by = c('locid', 'stdate', 'sttime', 'charnam', 'valunit'))
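If you want the duplicated rows themselves rather than the de-duplicated table, data.table's duplicated method also accepts a by argument. A sketch with a hypothetical table (dt and its columns are made up for illustration):

```r
library(data.table)

# Hypothetical table with one repeated (locid, stdate) pair
dt <- data.table(locid  = c(1, 1, 2),
                 stdate = c("d1", "d1", "d2"),
                 val    = c(5, 6, 7))

dt[duplicated(dt, by = c("locid", "stdate"))]   # the repeated row(s) only
</imports>
```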
R - find and list duplicate rows based on two columns
Here is an option using duplicated twice, the second time with fromLast = TRUE, because a single call returns TRUE only from the second occurrence of a duplicate onwards:
dupe = data[,c('T.N','ID')] # select columns to check duplicates
data[duplicated(dupe) | duplicated(dupe, fromLast=TRUE),]
# File T.N ID Col1 Col2
#1 BAI.txt T 1 sdaf eiri
#3 BBK.txt T 1 ter ase
Use R to find duplicates in multiple columns at once
We can use unique with the by option from data.table:
library(data.table)
unique(setDT(df), by = c("Surname", "Address"))
# Surname First Name Address
#1: A1 Bobby X1
#2: B5 Joe X2
#3: B5 Mary X3
#4: F2 Lou X4
#5: F3 Sarah X5
#6: G4 Bobby X6
#7: H5 Eric X7
#8: K6 Peter X8
Or with the tidyverse:
library(dplyr)
df %>%
distinct(Surname, Address, .keep_all = TRUE)
# Surname First Name Address
#1 A1 Bobby X1
#2 B5 Joe X2
#3 B5 Mary X3
#4 F2 Lou X4
#5 F3 Sarah X5
#6 G4 Bobby X6
#7 H5 Eric X7
#8 K6 Peter X8
Update
Based on the updated post, perhaps this helps:
setDT(df)[, if(uniqueN(FirstName) > 1) .SD, .(Surname, Address)]
# Surname Address FirstName
#1: G4 X6 Bobby
#2: G4 X6 Fred
#3: G4 X6 Anna
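A roughly equivalent dplyr version of that update, sketched under the assumption that the columns are named as in the data.table call above (the sample df here is hypothetical):

```r
library(dplyr)

# Hypothetical data shaped like the updated post
df <- data.frame(Surname   = c("G4", "G4", "G4", "A1"),
                 Address   = c("X6", "X6", "X6", "X1"),
                 FirstName = c("Bobby", "Fred", "Anna", "Bobby"))

df %>%
  group_by(Surname, Address) %>%
  filter(n_distinct(FirstName) > 1) %>%  # keep groups with more than one distinct FirstName
  ungroup()
```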
duplicates in multiple columns
It works if you use duplicated twice:
df[!(duplicated(df[c("c","d")]) | duplicated(df[c("c","d")], fromLast = TRUE)), ]
a b c d
1 1 2 A 1001
4 4 8 C 1003
7 7 13 E 1005
8 8 14 E 1006
Remove duplicate rows based on multiple columns using dplyr / tidyverse?
duplicated is expected to operate on "a vector or a data frame or an array" (but not two vectors; it looks for duplication in its first argument only).
df %>%
filter(duplicated(.))
# a b
# 1 1 1
# 2 2 2
df %>%
filter(!duplicated(.))
# a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1
If you prefer to reference a specific subset of columns, then use cbind:
df %>%
filter(duplicated(cbind(a, b)))
As a side note, the dplyr verb for this is distinct:
df %>%
distinct(a, b, .keep_all = TRUE)
# a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1
though I am not aware of a built-in inverse of this function.
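Lacking a built-in inverse of distinct(), a common dplyr idiom is group_by() followed by filter(n() > 1), which keeps every row of any duplicated combination. A small sketch with made-up data:

```r
library(dplyr)

# Hypothetical data: the (a, b) pair (1, 1) appears twice
df <- data.frame(a = c(1, 1, 1, 2), b = c(1, 1, 2, 2))

df %>%
  group_by(a, b) %>%
  filter(n() > 1) %>%  # keep all rows belonging to a duplicated (a, b) combination
  ungroup()
```

Unlike filter(duplicated(.)), this retains the first occurrence as well as the repeats.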
R - Identify duplicate rows based on multiple columns and remove them based on date
You can use slice to select the latest row where Wave == 2.
library(dplyr)
df %>%
mutate(Date = lubridate::dmy(Date)) %>%
group_by(ID, Wave) %>%
slice(if(first(Wave) == 2) which.max(Date) else seq_len(n()))
# ID Wave Date
# <chr> <int> <date>
# 1 A 1 2020-02-12
# 2 A 1 2020-02-11
# 3 A 2 2020-02-20
# 4 B 1 2020-02-13
# 5 B 2 2020-02-21
# 6 C 1 2020-02-14
# 7 C 2 2020-02-23
# 8 D 1 2020-02-15
# 9 D 2 2020-02-24
#10 E 1 2020-02-16
#11 E 2 2020-02-25
#12 F 1 2020-02-17
#13 F 2 2020-02-26
#14 G 1 2020-02-18
How to find duplicates based on values in 2 columns but also the groupings by another column in R?
It was a little unclear if you wanted to return:
- only the distinct rows
- single examples of duplicated rows
- all duplicated rows
So here are some options:
library(dplyr)
library(readr)
"ID,a,b
1, 1, 1
1, 1, 1
1, 1, 2
2, 1, 1
2, 1, 2" %>%
read_csv() -> exp_dat
# return only distinct rows
exp_dat %>%
distinct(ID, a, b)
# # A tibble: 4 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 1 2
# 3 2 1 1
# 4 2 1 2
# return single examples of duplicated rows
exp_dat %>%
group_by(ID, a, b) %>%
count() %>%
filter(n > 1) %>%
ungroup() %>%
select(-n)
# # A tibble: 1 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# return all duplicated rows
exp_dat %>%
group_by(ID, a, b) %>%
add_count() %>%
filter(n > 1) %>%
ungroup() %>%
select(-n)
# # A tibble: 2 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 1 1
Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact
You can group by x2 and x3 and use slice(), i.e.
library(dplyr)
df %>%
group_by(x2, x3) %>%
slice(which.max(x4))
# A tibble: 3 x 4
# Groups: x2, x3 [3]
x1 x2 x3 x4
<chr> <chr> <chr> <int>
1 X A B 4
2 Z A C 1
3 X C B 5
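In more recent dplyr (>= 1.0.0), slice_max() expresses the same idea directly. A sketch with hypothetical data shaped like the question's:

```r
library(dplyr)

# Hypothetical data: two rows share the (x2, x3) pair ("A", "B")
df <- data.frame(x1 = c("X", "Y", "Z", "X"),
                 x2 = c("A", "A", "A", "C"),
                 x3 = c("B", "B", "C", "B"),
                 x4 = c(4L, 2L, 1L, 5L))

df %>%
  group_by(x2, x3) %>%
  slice_max(x4, n = 1, with_ties = FALSE) %>%  # keep the row with the largest x4 per group
  ungroup()
```

with_ties = FALSE guarantees exactly one row per group even when x4 values tie.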