Remove Duplicated Rows Using Dplyr

Remove duplicated rows using dplyr

Note: dplyr now contains the distinct function for this purpose.

Original answer below:


library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)

One approach would be to group, and then only keep the first row:

df %>% group_by(x, y) %>% filter(row_number(z) == 1)

## Source: local data frame [3 x 3]
## Groups: x, y
##
## x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4

(In dplyr 0.2 you won't need the dummy z variable and will just be
able to write row_number() == 1)
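
With dplyr 0.2 or later, that version (reusing the df defined above) would simply be:

df %>% group_by(x, y) %>% filter(row_number() == 1)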

I've also been thinking about adding a slice() function that would
work like:

df %>% group_by(x, y) %>% slice(from = 1, to = 1)
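
(slice() has since been added to dplyr; it takes row positions rather than from/to arguments, so the same idea can now be written roughly as below.)

df %>% group_by(x, y) %>% slice(1)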

Or maybe a variation of unique() that would let you select which
variables to use:

df %>% unique(x, y)
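
As the note at the top says, distinct() now covers this; with the df defined above, a rough modern equivalent would be:

df %>% distinct(x, y, .keep_all = TRUE)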

Remove duplicate rows based on multiple columns using dplyr / tidyverse?

duplicated() expects "a vector or a data frame or an array" (not two separate vectors; it looks for duplication in its first argument only).
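
The question's df isn't reproduced in this excerpt; a hypothetical data frame consistent with the outputs below would be:

df <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(1, 1, 2, 2, 2, 1)
)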

df %>%
  filter(duplicated(.))
# a b
# 1 1 1
# 2 2 2

df %>%
  filter(!duplicated(.))
# a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1

If you prefer to reference a specific subset of columns, then use cbind:

df %>%
  filter(duplicated(cbind(a, b)))

As a side note, the dplyr verb for this can be distinct:

df %>%
  distinct(a, b, .keep_all = TRUE)
# a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1

though I don't know of a built-in inverse of this function (one that keeps only the duplicated rows).
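
If you do need that inverse, a sketch using group_by() and filter() (keeping every row whose a/b combination occurs more than once) would be:

df %>%
  group_by(a, b) %>%
  filter(n() > 1) %>%
  ungroup()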

Remove duplicates with distinct() in dplyr in R

My understanding is that we need to separate the distinct calls. If we use distinct(df2, mpg, hp, .keep_all = TRUE) we are asking for rows that are duplicated in both columns at the same time; this does not happen in the given data set, so everything is returned.

If we first return all rows without duplicates in hp, and then take that data and return only rows without duplicates in mpg, we get the expected result.

library(dplyr)

df <- mtcars %>% select(mpg, hp)
df2 <- slice(df, 10:20)
df3 <- distinct(df2, hp, .keep_all = TRUE)
df4 <- distinct(df3, mpg, .keep_all = TRUE)

> df4
mpg hp
1 19.2 123
2 16.4 180
3 10.4 205
4 14.7 230
5 32.4 66
6 30.4 52
7 33.9 65
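
The same two steps can also be chained into a single pipe, which is equivalent to df4 above:

df4 <- df2 %>%
  distinct(hp, .keep_all = TRUE) %>%
  distinct(mpg, .keep_all = TRUE)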

Looking to remove both rows if duplicated in a column using dplyr

Here's one way using dplyr -

df %>% 
  group_by(id) %>%
  filter(n() == 1) %>%
  ungroup()

# A tibble: 5 x 2
id award_amount
<chr> <dbl>
1 1-2 3000
2 1-4 5881515
3 1-5 155555
4 1-9 750000
5 1-22 3500000

How to remove duplicate rows in R?

You can considerably shorten your code:

df <- starwars %>%
  group_by(homeworld) %>%
  filter(!is.na(height), !is.na(homeworld), n() >= 5) %>%
  summarize(shortest_5 = mean(if_else(rank(height) > 5, NA_integer_, height), na.rm = TRUE))

df

# # A tibble: 2 x 2
# homeworld shortest_5
# <chr> <dbl>
# 1 Naboo 151.
# 2 Tatooine 153.

Note:

  • I get different results from yours; e.g. on Naboo the five shortest characters have heights 96, 157, 165, 165, and 170, and the mean of these values is 150.6.
  • You shouldn't have values for e.g. Coruscant, since there are only 3 characters from that homeworld. The only two homeworlds with at least 5 characters are Naboo and Tatooine.
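
For what it's worth, with dplyr 1.0.0 or later the "mean of the five shortest" step can also be expressed with slice_min(); a sketch using the same filtering as above (with_ties = FALSE is an assumption about how ties should be handled):

starwars %>%
  group_by(homeworld) %>%
  filter(!is.na(height), !is.na(homeworld), n() >= 5) %>%
  slice_min(height, n = 5, with_ties = FALSE) %>%
  summarize(shortest_5 = mean(height))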

Remove duplicated rows when column above a threshold in R

Using dplyr

library(dplyr)
x %>%
  filter(!duplicated(x) | Values <= 5)
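
To see what this keeps, here is a hypothetical x (the ID column and all values are made up for illustration; only Values appears in the answer above):

x <- data.frame(
  ID     = c(1, 1, 2, 2, 3),
  Values = c(10, 10, 3, 3, 8)
)

x %>%
  filter(!duplicated(x) | Values <= 5)
# The second copy of the (1, 10) row is dropped because Values > 5,
# while both (2, 3) rows are kept because Values <= 5.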

R - Identify and remove ONE instance of duplicate rows

subset(df, !duplicated(df[c('Course_ID', 'Text_ID')]))
   Course_ID Text_ID
1         33      17
3         58      17
4          5      22
5          8      22
6         42      25
8         17      26
10        35      39
11        51      39

or even

df[!duplicated(df[c('Course_ID', 'Text_ID')]), ]

If there are only the 2 columns shown, you can just use unique(df).
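
For consistency with the other answers on this page, the dplyr equivalent would be something like:

df %>% distinct(Course_ID, Text_ID, .keep_all = TRUE)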


