Finding All Duplicate Rows, Including "Elements With Smaller Subscripts"

Finding ALL duplicate rows, including elements with smaller subscripts

duplicated has a fromLast argument. The "Example" section of ?duplicated shows you how to use it: just call duplicated twice, once with fromLast = FALSE and once with fromLast = TRUE, and keep the rows where either result is TRUE.


Some late edit:
You didn't provide a reproducible example, so here's an illustration kindly contributed by @jbaums:

vec <- c("a", "b", "c", "c", "c")
vec[duplicated(vec) | duplicated(vec, fromLast = TRUE)]
## [1] "c" "c" "c"
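
An equivalent base R idiom (my addition, not part of the original answer) tests membership in the set of values that duplicated flags:

vec[vec %in% vec[duplicated(vec)]]
## [1] "c" "c" "c"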

Edit: And an example for the case of a data frame:

df <- data.frame(rbind(c("a", "a"), c("b", "b"), c("c", "c"), c("c", "c")))
df[duplicated(df) | duplicated(df, fromLast = TRUE), ]
##   X1 X2
## 3  c  c
## 4  c  c
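
If you prefer dplyr for the data-frame case, a comparable sketch (my addition; it assumes dplyr >= 1.0 for across()) groups by every column and keeps the groups with more than one row:

library(dplyr)
df %>%
  group_by(across(everything())) %>%
  filter(n() > 1) %>%
  ungroup()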

Find all unique rows based on a single column and exclude all duplicate rows

As you probably realised, unique and duplicated don’t quite do what you need, because they retain all distinct values and merely collapse “multiple copies” of those values.

For your first question, you can group_by the column you’re interested in, and then retain, via filter, just those groups that have more than one row:

library(dplyr)

mtcars %>%
  group_by(mpg) %>%
  filter(length(mpg) > 1) %>%
  ungroup()

This example selects all rows whose mpg value appears more than once. It works because dplyr verbs such as filter operate on each group individually, so length(mpg) in the code above returns the length of the mpg vector within each group separately, not in the whole data frame.
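
Incidentally, length(mpg) here just counts the rows in each group; dplyr's n() helper does the same thing and is the more idiomatic spelling, so the following sketch is equivalent:

mtcars %>%
  group_by(mpg) %>%
  filter(n() > 1) %>%
  ungroup()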

To invert the logic, it’s enough to invert the filtering condition:

mtcars %>%
  group_by(mpg) %>%
  filter(length(mpg) == 1) %>%
  ungroup()

Remove *all* duplicate rows, unless there's a similar row

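The question's dataset isn't reproduced on this page, so here is a hypothetical data.table (my construction, not the OP's data) that is consistent with the output shown below:

library(data.table)
# hypothetical data: V1 = 1 has only one distinct V2 value,
# V1 = 2 has three distinct V2 values (with one duplicate row)
dt <- data.table(V1 = c(1, 1, 2, 2, 2, 2),
                 V2 = c(3, 3, 5, 5, 6, 7))
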
One option is to group by 'V1', get the row indices (.I) of the groups that have more than one unique element in 'V2', and then take the unique rows:

unique(dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1])
#    V1 V2
# 1:  2  5
# 2:  2  6
# 3:  2  7

Or, as @r2evans mentioned:

unique(dt[, .SD[(uniqueN(V2) > 1)], by = "V1"])

NOTE: the OP's dataset is a data.table, so data.table methods are the natural way to handle it.


If we need a tidyverse option, one comparable to the data.table approach above is:

library(dplyr)

dt %>%
  group_by(V1) %>%
  filter(n_distinct(V2) > 1) %>%
  distinct()
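
With the hypothetical dt constructed above, this pipeline returns the same three rows, (2, 5), (2, 6) and (2, 7), as the data.table versions; n_distinct() is dplyr's counterpart of data.table's uniqueN().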

Looking to remove both rows if duplicated in a column using dplyr

Here's one way using dplyr:

library(dplyr)

df %>%
  group_by(id) %>%
  filter(n() == 1) %>%
  ungroup()

# A tibble: 5 x 2
  id    award_amount
  <chr>        <dbl>
1 1-2           3000
2 1-4        5881515
3 1-5         155555
4 1-9         750000
5 1-22       3500000
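
For reference, a base R equivalent of the same keep-only-singletons logic (a sketch, assuming the same df with an id column) uses ave() to count how often each id occurs:

# keep rows whose id occurs exactly once in the data
df[ave(seq_along(df$id), df$id, FUN = length) == 1, ]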

