Finding ALL duplicate rows, including elements with smaller subscripts
duplicated has a fromLast argument. The "Example" section of ?duplicated shows you how to use it. Just call duplicated twice, once with fromLast=FALSE and once with fromLast=TRUE, and take the rows where either is TRUE.
Late edit: you didn't provide a reproducible example, so here's an illustration kindly contributed by @jbaums:
vec <- c("a", "b", "c", "c", "c")
vec[duplicated(vec) | duplicated(vec, fromLast=TRUE)]
## [1] "c" "c" "c"
Edit: and here's an example for the data-frame case:
df <- data.frame(rbind(c("a", "a"), c("b", "b"), c("c", "c"), c("c", "c")))
df[duplicated(df) | duplicated(df, fromLast=TRUE), ]
## X1 X2
## 3 c c
## 4 c c
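For comparison (this is not part of the original answer), the same data-frame result can be had with dplyr's group-wise filter, assuming dplyr >= 1.0 for across():

```r
library(dplyr)

df <- data.frame(rbind(c("a", "a"), c("b", "b"), c("c", "c"), c("c", "c")))

# Group by every column and keep only groups with more than one row,
# i.e. fully duplicated rows (all copies retained)
dup_rows <- df %>%
  group_by(across(everything())) %>%
  filter(n() > 1) %>%
  ungroup()

dup_rows
## A tibble with the two "c"/"c" rows
```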
Find All Unique rows based on single column and exclude all duplicate rows
As you probably realised, unique and duplicated don't do quite what you need, because they essentially retain all distinct values and merely collapse "multiple copies" of those values. For your first question, you can group_by the column you're interested in, and then keep just those groups (via filter) that have more than one row:
mtcars %>%
group_by(mpg) %>%
filter(length(mpg) > 1) %>%
ungroup()
This example selects all rows for which the mpg value is duplicated. It works because, when applied to groups, dplyr verbs such as filter operate on each group individually, so length(mpg) in the code above returns the length of each group's mpg vector separately.
To invert the logic, it’s enough to invert the filtering condition:
mtcars %>%
group_by(mpg) %>%
filter(length(mpg) == 1) %>%
ungroup()
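As a side note (a sketch, not part of the original answer), dplyr's n() is the idiomatic way to ask for the group size, and the two filters above partition the data between them:

```r
library(dplyr)

# n() returns the current group's row count, equivalent to length(mpg) here
dup_mpg  <- mtcars %>% group_by(mpg) %>% filter(n() > 1) %>% ungroup()
uniq_mpg <- mtcars %>% group_by(mpg) %>% filter(n() == 1) %>% ungroup()

# Every row lands in exactly one of the two sets
nrow(dup_mpg) + nrow(uniq_mpg) == nrow(mtcars)
## [1] TRUE
```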
Remove *all* duplicate rows, unless there's a similar row
One option is to group by 'V1', get the row indices of groups that have more than one unique 'V2' value, and then take the unique rows:
unique(dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1])
# V1 V2
#1: 2 5
#2: 2 6
#3: 2 7
Or, as @r2evans mentioned:
unique(dt[, .SD[(uniqueN(V2) > 1)], by = "V1"])
NOTE: the OP's dataset is a data.table, so data.table methods are the natural way of doing it. If a tidyverse option is needed, one comparable to the data.table approach above is:
library(dplyr)
dt %>%
group_by(V1) %>%
filter(n_distinct(V2) > 1) %>%
distinct()
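The OP's input data isn't reproduced above. To try either approach, here's a hypothetical data.table consistent with the printed output (the values are an assumption, not the OP's actual data):

```r
library(data.table)

# Hypothetical input: group V1 == 1 has only one distinct V2 value,
# group V1 == 2 has several, with one duplicated row
dt <- data.table(V1 = c(1, 1, 2, 2, 2, 2),
                 V2 = c(3, 3, 5, 6, 7, 7))

# Keep rows from groups with more than one distinct V2, then drop duplicates
unique(dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1])
#    V1 V2
# 1:  2  5
# 2:  2  6
# 3:  2  7
```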
Looking to remove both rows if duplicated in a column using dplyr
Here's one way using dplyr:
df %>%
group_by(id) %>%
filter(n() == 1) %>%
ungroup()
# A tibble: 5 x 2
id award_amount
<chr> <dbl>
1 1-2 3000
2 1-4 5881515
3 1-5 155555
4 1-9 750000
5 1-22 3500000
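A base-R counterpart of the same idea, using data shaped like the question's (the original data frame isn't shown, so these values are hypothetical):

```r
# Hypothetical data: "1-1" is duplicated, the rest appear once
df <- data.frame(
  id           = c("1-1", "1-1", "1-2", "1-4", "1-5", "1-9", "1-22"),
  award_amount = c(100, 100, 3000, 5881515, 155555, 750000, 3500000)
)

# Drop every row whose id occurs more than once (both copies removed)
keep <- !(df$id %in% df$id[duplicated(df$id)])
df[keep, ]
```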