Remove duplicated rows using dplyr
Note: dplyr now contains the distinct function for this purpose.
Original answer below:
library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)
One approach would be to group, and then only keep the first row:
df %>% group_by(x, y) %>% filter(row_number(z) == 1)
## Source: local data frame [3 x 3]
## Groups: x, y
##
## x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4
(In dplyr 0.2 you won't need the dummy z variable and will just be able to write row_number() == 1.)
I've also been thinking about adding a slice() function that would work like:
df %>% group_by(x, y) %>% slice(from = 1, to = 1)
Or maybe a variation of unique() that would let you select which variables to use:
df %>% unique(x, y)
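As a minimal sketch of how the two ideas above can be written with current dplyr, assuming the same example data frame: distinct() with .keep_all = TRUE keeps the first row for each combination of the named columns, and slice(1) on a grouped data frame does the same per group. The names first_by_distinct and first_by_slice are illustrative only.

```r
library(dplyr)

set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)

# Keep the first row for each (x, y) combination, retaining the other columns
first_by_distinct <- df %>% distinct(x, y, .keep_all = TRUE)

# Same idea via grouping: take row 1 of each (x, y) group
first_by_slice <- df %>%
  group_by(x, y) %>%
  slice(1) %>%
  ungroup()
```

Both results contain the same rows. distinct() keeps them in order of first appearance, while the grouped version returns them ordered by group.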
Remove duplicate rows based on multiple columns using dplyr / tidyverse?
duplicated expects to operate on "a vector or a data frame or an array" (but not two vectors; it looks for duplication in its first argument only).
df %>%
filter(duplicated(.))
# a b
# 1 1 1
# 2 2 2
df %>%
filter(!duplicated(.))
# a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1
If you prefer to reference a specific subset of columns, then use cbind:
df %>%
filter(duplicated(cbind(a, b)))
As a side note, the dplyr verb for this can be distinct:
df %>%
distinct(a, b, .keep_all = TRUE)
# a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1
However, I don't know that dplyr has an inverse of this function.
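One possible way to get an "inverse" is sketched below. The data frame here is hypothetical (the original df is not shown in the answer), and the names repeats and all_dups are illustrative. The first form returns only the second and later occurrences; the second form returns every row whose (a, b) combination occurs more than once, including the first occurrence.

```r
library(dplyr)

# Hypothetical data; the df used in the answer is not shown
df <- data.frame(
  a = c(1, 1, 2, 2, 1, 2),
  b = c(1, 2, 2, 1, 1, 2)
)

# Rows that repeat an earlier row (second and later occurrences only)
repeats <- df %>%
  filter(duplicated(cbind(a, b)))

# Every row whose (a, b) combination appears more than once
all_dups <- df %>%
  group_by(a, b) %>%
  filter(n() > 1) %>%
  ungroup()
```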
remove duplicates with distinct() dplyr in R
My understanding is that we need to separate the distinct calls. If we use distinct(df2, mpg, hp, .keep_all=TRUE), we are asking for rows whose combination of mpg and hp has not already appeared. No combination repeats in the given data set, so every row is returned.
If we first keep the rows without duplicates in hp, and then take that result and keep only the rows without duplicates in mpg, we get the expected result.
library(dplyr)
df <- mtcars %>% select(mpg, hp)
df2 <- slice(df, 10:20)
df3 <- distinct(df2, hp, .keep_all = TRUE)
df4 <- distinct(df3, mpg, .keep_all = TRUE)
df4
mpg hp
1 19.2 123
2 16.4 180
3 10.4 205
4 14.7 230
5 32.4 66
6 30.4 52
7 33.9 65
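To make the difference concrete, here is a small sketch comparing the combined call with the sequential calls on the same slice of mtcars. The names combined and sequential are illustrative only.

```r
library(dplyr)

df2 <- mtcars %>%
  select(mpg, hp) %>%
  slice(10:20)

# One call: a row is dropped only if its (mpg, hp) pair already appeared
combined <- distinct(df2, mpg, hp, .keep_all = TRUE)

# Two calls: drop repeated hp values first, then repeated mpg values
sequential <- df2 %>%
  distinct(hp, .keep_all = TRUE) %>%
  distinct(mpg, .keep_all = TRUE)

nrow(combined)    # 11: every (mpg, hp) pair in this slice is unique
nrow(sequential)  # 7
```

Note that with sequential calls the result can, in general, depend on which column is deduplicated first.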
Looking to remove both rows if duplicated in a column using dplyr
Here's one way using dplyr:
df %>%
group_by(id) %>%
filter(n() == 1) %>%
ungroup()
# A tibble: 5 x 2
id award_amount
<chr> <dbl>
1 1-2 3000
2 1-4 5881515
3 1-5 155555
4 1-9 750000
5 1-22 3500000
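For comparison, the same "drop every row whose id is repeated" logic can be sketched in base R by combining duplicated() scanned from both directions. The data below is hypothetical and much smaller than the original, and the names unique_ids and base_version are illustrative.

```r
library(dplyr)

# Hypothetical data: id "1-3" appears twice and should be removed entirely
df <- data.frame(
  id = c("1-2", "1-3", "1-3", "1-4"),
  award_amount = c(3000, 100, 200, 5881515)
)

# dplyr: keep only groups that contain exactly one row
unique_ids <- df %>%
  group_by(id) %>%
  filter(n() == 1) %>%
  ungroup()

# Base R: a row is a duplicate if it repeats an earlier OR a later row
is_dup <- duplicated(df$id) | duplicated(df$id, fromLast = TRUE)
base_version <- df[!is_dup, ]
```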
How to remove duplicate rows in R?
You can considerably shorten your code:
df<-starwars %>%
group_by(homeworld) %>%
filter(!is.na(height), !is.na(homeworld), n() >=5) %>%
summarize(shortest_5 = mean(if_else(rank(height) > 5, NA_integer_, height), na.rm = TRUE))
df
# # A tibble: 2 x 2
# homeworld shortest_5
# <chr> <dbl>
# 1 Naboo 151.
# 2 Tatooine 153.
Note:
- I get different results than you, e.g. on Naboo the five shortest characters have heights 96, 157, 165, 165, 170, and the mean of these five values is 150.6.
- You shouldn't have values for e.g. Coruscant, since there are only 3 characters from that homeworld. The only two homeworlds with at least 5 characters are Naboo and Tatooine.
Remove duplicated rows when column above a threshold in R
Using dplyr:
library(dplyr)
x %>%
  filter(!duplicated(x) | Values <= 5)
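A small sketch of what this filter does, using hypothetical data (the original x is not shown; only the Values column name comes from the answer, and ID is an assumed column): a repeated row is dropped only when its Values entry is above the threshold of 5.

```r
library(dplyr)

# Hypothetical data: rows 2 and 4 are exact duplicates of rows 1 and 3
x <- data.frame(
  ID = c("a", "a", "b", "b"),
  Values = c(10, 10, 3, 3)
)

# Keep a row if it is a first occurrence, or if its Values is at most 5
result <- x %>%
  filter(!duplicated(x) | Values <= 5)
```

Here the duplicate with Values 10 is removed, while the duplicate with Values 3 is kept.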
R - Identify and remove ONE instance of duplicate rows
subset(df, !duplicated(df[c('Course_ID', 'Text_ID')]))
Course_ID Text_ID
1 33 17
3 58 17
4 5 22
5 8 22
6 42 25
8 17 26
10 35 39
11 51 39
or even
df[!duplicated(df[c('Course_ID', 'Text_ID')]), ]
If the data frame has only the two columns shown, just use unique(df).
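As a quick check of that equivalence, here is a sketch with hypothetical values (the full original data is not shown): when the data frame contains only the key columns, indexing with !duplicated() and calling unique() keep the same rows.

```r
# Hypothetical data with only the two key columns
df <- data.frame(
  Course_ID = c(33, 33, 58, 5),
  Text_ID   = c(17, 17, 17, 22)
)

# Keep the first occurrence of each (Course_ID, Text_ID) pair
by_duplicated <- df[!duplicated(df[c("Course_ID", "Text_ID")]), ]

# With no other columns present, unique() gives the same result
by_unique <- unique(df)
```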