Find duplicated rows (based on 2 columns) in Data Frame in R
You can always try simply passing those first two columns to the duplicated function:
duplicated(dat[,1:2])
assuming your data frame is called dat. For more information, consult the help file for duplicated by typing ?duplicated at the console. It includes the following description:
Determines which elements of a vector or data frame are duplicates of
elements with smaller subscripts, and returns a logical vector
indicating which elements (rows) are duplicates.
So duplicated returns a logical vector, which we can then use to extract a subset of dat:
ind <- duplicated(dat[,1:2])
dat[ind,]
or you can skip the separate assignment step and simply use:
dat[duplicated(dat[,1:2]),]
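As a minimal sketch (the dat below is made up for illustration), duplicated() flags only the second and later occurrences of each two-column combination, which is what the "smaller subscripts" wording in the help file means:

```r
# hypothetical data frame: rows 1 and 3 share the same values in columns 1 and 2
dat <- data.frame(a   = c(1, 2, 1, 3),
                  b   = c("x", "y", "x", "z"),
                  val = c(10, 20, 30, 40))

duplicated(dat[, 1:2])
# [1] FALSE FALSE  TRUE FALSE

dat[duplicated(dat[, 1:2]), ]  # returns only row 3, the flagged duplicate
```

Note that the first occurrence (row 1) is not flagged; if you want every member of a duplicated pair, see the fromLast = TRUE trick below.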
R - find and list duplicate rows based on two columns
Here is an option using duplicated twice, the second time with fromLast = TRUE, because a single duplicated call returns TRUE only from the second occurrence of a value onwards:
dupe = data[,c('T.N','ID')] # select columns to check duplicates
data[duplicated(dupe) | duplicated(dupe, fromLast=TRUE),]
# File T.N ID Col1 Col2
#1 BAI.txt T 1 sdaf eiri
#3 BBK.txt T 1 ter ase
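To see why both directions are needed, here is a small made-up version of the data (the middle row is invented for illustration): each call alone misses one end of the duplicated pair, while their union returns every row of it:

```r
# hypothetical data with one duplicated (T.N, ID) pair in rows 1 and 3
data <- data.frame(File = c("BAI.txt", "BBR.txt", "BBK.txt"),
                   T.N  = c("T", "N", "T"),
                   ID   = c(1, 2, 1))

dupe <- data[, c("T.N", "ID")]
duplicated(dupe)                   # FALSE FALSE  TRUE -> misses row 1
duplicated(dupe, fromLast = TRUE)  #  TRUE FALSE FALSE -> misses row 3
data[duplicated(dupe) | duplicated(dupe, fromLast = TRUE), ]  # rows 1 and 3
```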
Find duplicate rows in data frame based on multiple columns in r
We can do
library(data.table)
unique(setDT(data_concern_join2),
by = c('locid', 'stdate', 'sttime', 'charnam', 'valunit'))
Find duplicate rows based on 2 columns and keep rows based on the value of a 3rd column in R
You can do:
library(tidyverse)
df %>%
group_by(id_number, date) %>%
filter(!(result == 9 & row_number() > 1)) %>%
ungroup()
# A tibble: 6 x 3
id_number date result
<dbl> <chr> <dbl>
1 1 2021-11-03 0
2 1 2021-11-19 1
3 2 2021-11-11 0
4 3 2021-11-05 0
5 3 2021-11-16 0
6 4 2021-11-29 9
remove duplicate values based on 2 columns
This will give you the desired result:
df[!duplicated(df[c(1, 4)]), ]
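For instance, with a made-up df (column names here are only placeholders), this keeps the first row of each combination of columns 1 and 4 and drops later repeats:

```r
# hypothetical data frame: rows 1 and 3 repeat the same (V1, V4) pair
df <- data.frame(V1 = c("a", "b", "a"),
                 V2 = 1:3,
                 V3 = 4:6,
                 V4 = c("x", "y", "x"))

df[!duplicated(df[c(1, 4)]), ]  # rows 1 and 2 remain; row 3 is dropped
```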
How to find duplicates based on values in 2 columns but also the groupings by another column in R?
It was a little unclear if you wanted to return:
- only the distinct rows
- single examples of duplicated rows
- all duplicated rows
So here are some options:
library(dplyr)
library(readr)
"ID,a,b
1, 1, 1
1, 1, 1
1, 1, 2
2, 1, 1
2, 1, 2" %>%
read_csv() -> exp_dat
# return only distinct rows
exp_dat %>%
distinct(ID, a, b)
# # A tibble: 4 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 1 2
# 3 2 1 1
# 4 2 1 2
# return single examples of duplicated rows
exp_dat %>%
group_by(ID, a, b) %>%
count() %>%
filter(n > 1) %>%
ungroup() %>%
select(-n)
# # A tibble: 1 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# return all duplicated rows
exp_dat %>%
group_by(ID, a, b) %>%
add_count() %>%
filter(n > 1) %>%
ungroup() %>%
select(-n)
# # A tibble: 2 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 1 1
Use R to find duplicates in multiple columns at once
We can use unique with the by option from data.table:
library(data.table)
unique(setDT(df), by = c("Surname", "Address"))
# Surname First Name Address
#1: A1 Bobby X1
#2: B5 Joe X2
#3: B5 Mary X3
#4: F2 Lou X4
#5: F3 Sarah X5
#6: G4 Bobby X6
#7: H5 Eric X7
#8: K6 Peter X8
Or with tidyverse
library(dplyr)
df %>%
distinct(Surname, Address, .keep_all = TRUE)
# Surname First Name Address
#1 A1 Bobby X1
#2 B5 Joe X2
#3 B5 Mary X3
#4 F2 Lou X4
#5 F3 Sarah X5
#6 G4 Bobby X6
#7 H5 Eric X7
#8 K6 Peter X8
Update
Based on the updated post, perhaps this helps
setDT(df)[, if (uniqueN(FirstName) > 1) .SD, .(Surname, Address)]
# Surname Address FirstName
#1: G4 X6 Bobby
#2: G4 X6 Fred
#3: G4 X6 Anna
Remove duplicated rows based on 2 columns in R
For the sake of completeness, the unique() function from the data.table package can be used as well:
library(data.table)
unique(setDT(df), by = "IndexA")
TimeStamp IndexA IndexB Value
1: 12:00:01 1 NA Windows
2: 12:00:48 NA 1 Macintosh
3: 12:02:01 2 NA Windows
This looks for unique values only in IndexA, which is equivalent to Tito Sanz' answer. This approach returns the expected result for the given sample data set, but checking only one column for duplicate entries is an oversimplification IMHO and may fail with production data.
Or, looking for unique combinations of the values in three columns (which is equivalent to www's answer):
unique(setDT(df), by = 2:4) # very terse
unique(setDT(df), by = c("IndexA", "IndexB", "Value")) # explicitly named cols
TimeStamp IndexA IndexB Value
1: 12:00:01 1 NA Windows
2: 12:00:48 NA 1 Macintosh
3: 12:02:01 2 NA Windows
Data
library(data.table)
df <- fread(
"TimeStamp IndexA IndexB Value
12:00:01 1 NA Windows
12:00:05 1 NA Windows
12:00:13 1 NA Windows
12:00:48 NA 1 Macintosh
12:01:30 NA 1 Macintosh
12:01:45 NA 1 Macintosh
12:02:01 2 NA Windows
12:02:13 2 NA Windows")
R - filter duplicate rows based on value in column
Here is an option
df %>%
group_by(Id) %>%
filter(Col3 == "A" | n() == 1) %>%
ungroup()
## A tibble: 3 x 5
# Id Date Col1 Col2 Col3
# <int> <chr> <int> <int> <chr>
#1 1 1/1/1995 NA 1 A
#2 2 3/10/1992 0 1 B
#3 3 8/15/2002 1 1 B
This keeps rows where Col3 == "A", as well as rows in single-row groups. PS: I recommend always using ungroup() to avoid unwanted surprises downstream.