Select unique values with 'select' function in 'dplyr' library
Since dplyr 0.3, this is easily done with the distinct() function.
Here is an example:
distinct_df = df %>% distinct(field1)
You can get a vector of the distinct values with:
distinct_vector = distinct_df$field1
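You can also select a subset of columns in the same pipeline as the distinct() call, which makes the result cleaner to look at when you examine the data frame with head/tail/glimpse. (In current dplyr versions, distinct(field1) already returns only field1 unless .keep_all = TRUE is set.)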
distinct_df = df %>% distinct(field1) %>% select(field1)
distinct_vector = distinct_df$field1
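As a small worked example of the steps above, the sketch below builds a hypothetical data frame (the column names field1 and field2 and their values are made up for illustration) and uses pull(), dplyr's helper for extracting a single column, in place of the $ step.

```r
library(dplyr)

# Hypothetical data: 'field1' contains repeated values.
df <- data.frame(field1 = c("a", "b", "a", "c", "b"),
                 field2 = 1:5,
                 stringsAsFactors = FALSE)

# distinct() keeps the first occurrence of each value, in order of appearance;
# pull() then extracts the column as a plain vector.
distinct_vector <- df %>%
  distinct(field1) %>%
  pull(field1)

distinct_vector
```

The result is an ordinary character vector, so it can be passed directly to functions such as length() or %in%.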
Get unique values with distinct function in dplyr
One thing that you could do is to count instances of combinations of ID and subject_result2.
new_df <- df %>%
  group_by(ID, subject_result2) %>%
  summarise(id = n()) %>%
  distinct() %>%
  select(-id)
new_df
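If only the unique combinations are needed, a shorter route may be to pass both columns to distinct(); count() gives the per-combination tallies when those are wanted too. This is a sketch on hypothetical data (the values below are invented; only the column names ID and subject_result2 come from the answer).

```r
library(dplyr)

# Hypothetical data with repeated (ID, subject_result2) pairs.
df <- data.frame(ID = c(1, 1, 1, 2, 2),
                 subject_result2 = c("pass", "pass", "fail", "pass", "pass"),
                 stringsAsFactors = FALSE)

# One row per unique combination, in order of first appearance.
combos <- df %>%
  distinct(ID, subject_result2)

# Same combinations plus how often each occurs (column 'n'), sorted by the keys.
counts <- df %>%
  count(ID, subject_result2)

combos
counts
```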
R dplyr - Filter unique row in each group with dplyr
dat %>%
  mutate(rn = row_number()) %>%
  arrange(flag) %>%
  group_by(id, col2, col3) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(rn) %>%
  select(-rn)
# # A tibble: 4 x 5
#      id col2  col3   flag   val
#   <int> <chr> <chr> <int> <int>
# 1     1 a     q        NA    NA
# 2     1 a     w         1    NA
# 3     1 b     r        NA    NA
# 4     2 c     q         1     5
If your data instead holds strings, with empty strings in place of NA (the question doesn't make this clear), then:
dat %>%
  # this is just to transform my number-based 'flag'/'val' to strings, you don't need this
  mutate(across(c(flag, val), ~ if_else(is.na(.), "", as.character(.)))) %>%
  # pick up here
  mutate(rn = row_number()) %>%
  arrange(!nzchar(flag)) %>% # this is the only difference from above
  group_by(id, col2, col3) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(rn) %>%
  select(-rn)
# # A tibble: 4 x 5
#      id col2  col3  flag  val
#   <int> <chr> <chr> <chr> <chr>
# 1     1 a     q     ""    ""
# 2     1 a     w     "1"   ""
# 3     1 b     r     ""    ""
# 4     2 c     q     "1"   "5"
The use of rn is merely to ensure that the original row order is preserved across the filtering. If order is not an issue (perhaps it's inferred some other way), then you can remove the mutate(rn = row_number()) step and the trailing arrange(rn) %>% select(-rn).
Data
dat <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), col2 = c("a", "a", "b", "c", "c", "c"), col3 = c("q", "w", "r", "q", "q", "q"), flag = c(NA, 1L, NA, 1L, NA, 1L), val = c(NA, NA, NA, 5L, NA, 6L)), class = "data.frame", row.names = c(NA, -6L))
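For reference, the order-agnostic variant described above (without the rn bookkeeping) could look like the sketch below. It rebuilds the same sample data with data.frame() so that it runs on its own; the name res is just a placeholder.

```r
library(dplyr)

# Same sample data as above, rebuilt so the sketch is self-contained.
dat <- data.frame(id   = c(1L, 1L, 1L, 2L, 2L, 2L),
                  col2 = c("a", "a", "b", "c", "c", "c"),
                  col3 = c("q", "w", "r", "q", "q", "q"),
                  flag = c(NA, 1L, NA, 1L, NA, 1L),
                  val  = c(NA, NA, NA, 5L, NA, 6L),
                  stringsAsFactors = FALSE)

res <- dat %>%
  arrange(flag) %>%               # NA flags sort last, so flagged rows come first
  group_by(id, col2, col3) %>%
  slice(1) %>%                    # keep the first row of each group
  ungroup()

res
```

Rows come back in group order rather than in their original order, which is the trade-off for dropping rn.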
R: How do I choose which row dplyr::distinct() keeps based on a value in another variable?
Arranging alphabetically works in the stated simple case, but you can instead add a protocol_preference variable that orders the values by which one you'd prefer to be selected if Y isn't available, and that selects "Y" even when it doesn't happen to be the last protocol value when sorted alphabetically.
Building on @davechilders' answer and @Nathan Werth's idea of creating a factor based on an "order of importance" vector:
order_of_importance <- c("Y", "Z", "X")
df2 %>%
  mutate(protocol = factor(protocol, order_of_importance)) %>%
  arrange(id, protocol) %>%
  distinct(id, .keep_all = TRUE)
Or, if you just want to select 'Y' and have no preference for what is selected when 'Y' isn't available, you can do:
df %>%
  arrange(id, desc(protocol == 'Y')) %>%
  distinct(id, .keep_all = TRUE)
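To see the factor-ordering approach in action, here is a minimal sketch on hypothetical data (the ids and protocol values below are invented for illustration; kept is a placeholder name).

```r
library(dplyr)

# Hypothetical data: several protocol rows per id.
df2 <- data.frame(id = c(1, 1, 2, 2, 3),
                  protocol = c("X", "Y", "Z", "X", "X"),
                  stringsAsFactors = FALSE)

order_of_importance <- c("Y", "Z", "X")

kept <- df2 %>%
  # Factor levels define the sort order: "Y" first, then "Z", then "X".
  mutate(protocol = factor(protocol, levels = order_of_importance)) %>%
  arrange(id, protocol) %>%
  # distinct() keeps the first row per id, i.e. the most preferred protocol.
  distinct(id, .keep_all = TRUE)

kept
```

Here id 1 keeps "Y", id 2 falls back to "Z", and id 3 keeps its only value, "X".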
R dplyr, distinct, unique combination of variables, with maximum value of third
We can add an arrange statement before the distinct:
library(dplyr)
dt1 %>%
  arrange(var2, var3, desc(var4)) %>%
  distinct(var2, var3, .keep_all = TRUE)
-output
# A tibble: 2 x 4
  var1  var2  var3   var4
  <chr> <chr> <chr> <dbl>
1 num2  A     B        10
2 num5  A     C        19
Or another option is slice_max:
dt1 %>%
  group_by(var2, var3) %>%
  mutate(var4new = first(var4)) %>%
  slice_max(order_by = var4, n = 1) %>%
  ungroup()
-output
# A tibble: 2 x 5
  var1  var2  var3   var4 var4new
  <chr> <chr> <chr> <dbl>   <dbl>
1 num2  A     B        10       5
2 num5  A     C        19       3
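One difference between the two options is worth checking on your own data: slice_max() keeps tied rows by default (with_ties = TRUE), whereas arrange() followed by distinct() always returns one row per group. The sketch below uses hypothetical data with a deliberate tie to illustrate this; the values are invented.

```r
library(dplyr)

# Hypothetical data: group (A, B) has two rows tied at the maximum var4.
dt1 <- data.frame(var1 = c("num1", "num2", "num3", "num4"),
                  var2 = "A",
                  var3 = c("B", "B", "C", "C"),
                  var4 = c(10, 10, 3, 19),
                  stringsAsFactors = FALSE)

# Default behaviour: both tied rows in group (A, B) are returned.
with_ties <- dt1 %>%
  group_by(var2, var3) %>%
  slice_max(order_by = var4, n = 1) %>%
  ungroup()

# with_ties = FALSE returns exactly one row per group.
no_ties <- dt1 %>%
  group_by(var2, var3) %>%
  slice_max(order_by = var4, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties)
nrow(no_ties)
```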
How to select columns with equal or more than 2 unique values while ignoring NA and blank?
Use n_distinct, which also has an na.rm argument. The _if/_at/_all variants are superseded in favor of across/where. Empty strings ('') can be detected with nzchar, which returns TRUE only for non-empty strings. So subset the elements of each column with nzchar, apply n_distinct column-wise to build the condition, select only the columns that satisfy it, and then get the names:
library(dplyr)
df %>%
  select(where(~ n_distinct(.x[nzchar(.x)], na.rm = TRUE) > 1)) %>%
  names
-output
[1] "ID" "color" "owner"
Another option is to convert the "" to NA with na_if, which may be slightly more compact:
df %>%
  select(where(~ n_distinct(na_if(.x, ""), na.rm = TRUE) > 1)) %>%
  names
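Both variants can be tried on a small hypothetical data frame. The sketch below assumes all columns are character (the column status and all values are invented for illustration), so that comparing against "" is type-safe in both versions.

```r
library(dplyr)

# Hypothetical all-character data with blanks and NAs.
df <- data.frame(ID     = c("1", "2", "3", "4"),
                 color  = c("red", "", "blue", NA),
                 status = c("ok", "", NA, "ok"),
                 stringsAsFactors = FALSE)

# Variant 1: drop empty strings with nzchar(), then ignore NA via na.rm.
keep_nzchar <- df %>%
  select(where(~ n_distinct(.x[nzchar(.x)], na.rm = TRUE) > 1)) %>%
  names()

# Variant 2: turn "" into NA first, then ignore NA via na.rm.
keep_na_if <- df %>%
  select(where(~ n_distinct(na_if(.x, ""), na.rm = TRUE) > 1)) %>%
  names()

keep_nzchar
keep_na_if
```

The status column is dropped in both cases because, once blanks and NA are ignored, it holds only the single value "ok".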
How to use dplyr to find unique entries in the previous rows
Because you are trying to calculate the number of distinct ids across groups, we first need to define a boolean column that lets us sum only the unique values.
Secondly, you want to include the dates missing from your original df in your expected output, so we also need to perform a right_join with the full sequence of dates. I assume here that your dates column is already of class Date. The join produces NA values, which we replace with 0.
Finally, we calculate the cumsum for both unique_ids and sum_values.
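library(dplyr)
df %>%
  mutate(unique_ids = !duplicated(ids)) %>%   # TRUE only at the first occurrence of each id
  group_by(dates) %>%
  summarise(unique_ids = sum(unique_ids),
            sum_values = sum(values)) %>%
  # add the dates that are missing from df
  right_join(data.frame(dates = seq(min(df$dates),
                                    max(df$dates),
                                    by = 1)),
             by = "dates") %>%
  arrange(dates) %>%                          # chronological order before accumulating
  mutate(across(-dates, ~ replace(.x, is.na(.x), 0))) %>%
  mutate(across(-dates, cumsum))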
# dates unique_ids sum_values
# <date> <dbl> <dbl>
#1 2011-10-01 2 36
#2 2011-10-02 3 38
#3 2011-10-03 4 43
#4 2011-10-04 4 43
#5 2011-10-05 4 53
#6 2011-10-06 5 58
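The question's data is not reproduced here, so the following is a small self-contained sketch of the same steps on hypothetical data (dates, ids and values are invented). It uses across(), the current dplyr idiom for applying a function to several columns, and sorts by date before accumulating.

```r
library(dplyr)

# Hypothetical data: 2011-10-03 is missing, and id "a" appears twice.
df <- data.frame(dates  = as.Date(c("2011-10-01", "2011-10-01",
                                    "2011-10-02", "2011-10-04")),
                 ids    = c("a", "b", "a", "c"),
                 values = c(10, 5, 2, 3),
                 stringsAsFactors = FALSE)

# Full daily sequence between the first and last date.
all_dates <- data.frame(dates = seq(min(df$dates), max(df$dates), by = 1))

res <- df %>%
  mutate(unique_ids = !duplicated(ids)) %>%        # first sighting of each id
  group_by(dates) %>%
  summarise(unique_ids = sum(unique_ids),
            sum_values = sum(values)) %>%
  right_join(all_dates, by = "dates") %>%          # bring in the missing dates
  arrange(dates) %>%                               # chronological order
  mutate(across(-dates, ~ replace(.x, is.na(.x), 0))) %>%
  mutate(across(-dates, cumsum))                   # running totals

res
```

On this data the running count of unique ids is 2, 2, 2, 3, and the running sum of values is 15, 17, 17, 20.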
Find unique entries in otherwise identical rows
A data.table alternative. Coerce the data frame to a data.table (setDT). Melt the data to long format (melt(df, id.vars = "ID")).
Within each group defined by 'ID' and 'variable' (corresponding to the columns in the wide format) (by = .(ID, variable)), count the number of unique values (uniqueN(value)) and check whether it equals the number of rows in the subgroup (== .N). If so (if), select the entire subgroup (.SD).
Finally, reshape the data back to wide format (dcast).
library(data.table)
setDT(df)
d = melt(df, id.vars = "ID")
dcast(d[ , if(uniqueN(value) == .N) .SD, by = .(ID, variable)], ID + rowid(ID, variable) ~ variable)
# ID ID_1 x2 x3 x5
# 1: 1 1 <NA> 7 x
# 2: 1 2 <NA> 10 p
# 3: 3 1 c 9 z
# 4: 3 2 d 11 q
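The original input is not shown, so here is a sketch of the same melt / filter / dcast steps on a small hypothetical table (column names x1 to x3 and all values are invented; every value column is character so that melt() does not need to coerce types). The intermediate name kept is a placeholder.

```r
library(data.table)

# Hypothetical data: within each ID, x1 is identical across rows,
# x2 differs only for ID 3, and x3 differs for both IDs.
df <- data.table(ID = c(1, 1, 3, 3),
                 x1 = c("a", "a", "b", "b"),
                 x2 = c("k", "k", "c", "d"),
                 x3 = c("7", "10", "9", "11"))

# Long format: one row per (ID, variable, value).
d <- melt(df, id.vars = "ID")

# Keep a subgroup only when all of its values are unique.
kept <- d[, if (uniqueN(value) == .N) .SD, by = .(ID, variable)]

# Back to wide format; rowid() numbers the rows within each (ID, variable).
wide <- dcast(kept, ID + rowid(ID, variable) ~ variable, value.var = "value")

wide
```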