Select Unique Values with 'Select' Function in 'Dplyr' Library

In dplyr 0.3 and later, this is easily done with the distinct() function.

Here is an example:

distinct_df = df %>% distinct(field1)

You can get a vector of the distinct values with:

distinct_vector = distinct_df$field1

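You can also select a subset of columns in the same pipeline as the distinct() call, which makes the result cleaner to inspect with head/tail/glimpse. (Since dplyr 0.5.0, distinct(field1) already returns only field1 unless you pass .keep_all = TRUE, so the select() step matters mainly on older versions.)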

distinct_df = df %>% distinct(field1) %>% select(field1)
distinct_vector = distinct_df$field1
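
As a minimal sketch of the calls above, assuming a small made-up data frame (the column field2 and all values are hypothetical, not taken from the original question), pull() can serve as a pipe-friendly alternative to the $ extraction:

```r
library(dplyr)

# Hypothetical data: field1 contains duplicates, field2 is an extra column.
df <- data.frame(field1 = c("a", "b", "a", "c"), field2 = 1:4)

# One row per distinct value of field1.
distinct_df <- df %>% distinct(field1)

# The same values as a plain vector, without leaving the pipe.
distinct_vector <- df %>% distinct(field1) %>% pull(field1)

# Keep the other columns too (the first row of each field1 value wins).
first_rows <- df %>% distinct(field1, .keep_all = TRUE)
```

Whether to keep the other columns is the main design choice here: .keep_all = TRUE silently picks the first matching row, so the row order going into distinct() matters.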

Get unique values with distinct function in dplyr

One thing that you could do is to count instances of combinations of ID and subject_result2.

new_df <- df %>%
  group_by(ID, subject_result2) %>%
  summarise(id = n()) %>%
  distinct() %>%
  select(-id)

new_df
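
If only the unique combinations are needed, a shorter route may be distinct() on both columns, and count() if the number of occurrences is also of interest. The sketch below uses made-up values for ID and subject_result2; it is an alternative under those assumptions, not the original answer's method:

```r
library(dplyr)

# Hypothetical data standing in for the question's df.
df <- data.frame(
  ID = c(1, 1, 1, 2, 2),
  subject_result2 = c("pass", "pass", "fail", "pass", "pass")
)

# Unique ID / subject_result2 combinations, in order of first appearance.
combos <- df %>% distinct(ID, subject_result2)

# The same combinations together with how often each occurs (column n).
combo_counts <- df %>% count(ID, subject_result2)
```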

R dplyr - Filter unique row in each group with dplyr

dat %>%
  mutate(rn = row_number()) %>%
  arrange(flag) %>%
  group_by(id, col2, col3) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(rn) %>%
  select(-rn)
# # A tibble: 4 x 5
# id col2 col3 flag val
# <int> <chr> <chr> <int> <int>
# 1 1 a q NA NA
# 2 1 a w 1 NA
# 3 1 b r NA NA
# 4 2 c q 1 5

If your data instead holds strings, with empty strings in place of NA (the question does not make this clear), then:

dat %>%
  # this is just to transform my number-based 'flag'/'val' to strings, you don't need this
  mutate(across(c(flag, val), ~ if_else(is.na(.), "", as.character(.)))) %>%
  # pick up here
  mutate(rn = row_number()) %>%
  arrange(!nzchar(flag)) %>% # this is the only difference from above
  group_by(id, col2, col3) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(rn) %>%
  select(-rn)
# # A tibble: 4 x 5
# id col2 col3 flag val
# <int> <chr> <chr> <chr> <chr>
# 1 1 a q "" ""
# 2 1 a w "1" ""
# 3 1 b r "" ""
# 4 2 c q "1" "5"

The use of rn is merely to ensure that the order is preserved across the filtering. If order is not an issue (perhaps it's inferred some other way), then you can remove the mutate, and the trailing arrange(rn) %>% select(-rn).


Data

dat <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), col2 = c("a", "a", "b", "c", "c", "c"), col3 = c("q", "w", "r", "q", "q", "q"), flag = c(NA, 1L, NA, 1L, NA, 1L), val = c(NA, NA, NA, 5L, NA, 6L)), class = "data.frame", row.names = c(NA, -6L))
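
A possible variant that avoids the rn bookkeeping is a grouped filter(), which leaves the surviving rows in their original order. This is a sketch on the same sample data (rebuilt here so the snippet runs on its own) and covers the NA-based case only; the name res is just for illustration:

```r
library(dplyr)

# Same sample data as above, rebuilt so the snippet is self-contained.
dat <- data.frame(
  id   = c(1L, 1L, 1L, 2L, 2L, 2L),
  col2 = c("a", "a", "b", "c", "c", "c"),
  col3 = c("q", "w", "r", "q", "q", "q"),
  flag = c(NA, 1L, NA, 1L, NA, 1L),
  val  = c(NA, NA, NA, 5L, NA, 6L)
)

res <- dat %>%
  group_by(id, col2, col3) %>%
  # which.max() on a logical returns the position of the first TRUE, or 1 if
  # none is TRUE, so each group keeps its first flagged row, else its first row.
  filter(row_number() == which.max(!is.na(flag))) %>%
  ungroup()
```

For the empty-string version of the data, the condition inside which.max() would be nzchar(flag) instead of !is.na(flag).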

R: How do I choose which row dplyr::distinct() keeps based on a value in another variable?

Arranging alphabetically works in the simple case as stated. If you want more control, you can define an explicit preference ordering that says which protocol value to fall back on when "Y" isn't available, and that selects "Y" even when it doesn't happen to sort last alphabetically.

Building off @davechilders' answer and @Nathan Werth's idea of creating a factor based on an "order of importance" vector:

order_of_importance <- c("Y", "Z", "X")

df2 %>%
  mutate(protocol = factor(protocol, order_of_importance)) %>%
  arrange(id, protocol) %>%
  distinct(id, .keep_all = TRUE)

Or, if you just want to select 'Y' and have no preference for what is selected when 'Y' isn't available, you can do:

df %>%
  arrange(id, desc(protocol == 'Y')) %>%
  distinct(id, .keep_all = TRUE)
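
The preference-order idea can also be sketched with match() instead of a factor conversion, which leaves protocol as a character column. The data below is hypothetical (the original df2 is not shown), and preferred is just an illustrative name:

```r
library(dplyr)

# Hypothetical data: several protocol values per id.
df2 <- data.frame(
  id       = c(1, 1, 2, 2, 3),
  protocol = c("X", "Y", "Z", "X", "X")
)

order_of_importance <- c("Y", "Z", "X")

preferred <- df2 %>%
  # match() gives each protocol its position in the preference vector,
  # so the most preferred value sorts first within each id.
  arrange(id, match(protocol, order_of_importance)) %>%
  # distinct() keeps the first row per id, i.e. the most preferred protocol.
  distinct(id, .keep_all = TRUE)
```

One caveat with this variant: a protocol value missing from order_of_importance gets NA from match() and is sorted last by arrange().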

R dplyr, distinct, unique combination of variables, with maximum value of third

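We can add an arrange() step before distinct(), so that the row with the largest var4 comes first within each var2/var3 combination and is the one distinct() keeps: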

library(dplyr)
dt1 %>%
  arrange(var2, var3, desc(var4)) %>%
  distinct(var2, var3, .keep_all = TRUE)

-output

# A tibble: 2 x 4
var1 var2 var3 var4
<chr> <chr> <chr> <dbl>
1 num2 A B 10
2 num5 A C 19

Or another option is slice_max

dt1 %>%
  group_by(var2, var3) %>%
  mutate(var4new = first(var4)) %>%
  slice_max(order_by = var4, n = 1) %>%
  ungroup()

-output

# A tibble: 2 x 5
var1 var2 var3 var4 var4new
<chr> <chr> <chr> <dbl> <dbl>
1 num2 A B 10 5
2 num5 A C 19 3
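
One difference between the two approaches is how ties are handled: arrange() plus distinct() always returns one row per group, while slice_max() keeps all tied rows unless with_ties = FALSE is set. The sketch below uses made-up data with a deliberate tie (the original dt1 is not shown):

```r
library(dplyr)

# Hypothetical data with a tie for the maximum var4 in the A/B group.
dt1 <- data.frame(
  var1 = c("num1", "num2", "num3", "num4"),
  var2 = c("A", "A", "A", "A"),
  var3 = c("B", "B", "B", "C"),
  var4 = c(5, 10, 10, 19)
)

# Default with_ties = TRUE keeps both tied rows in the A/B group.
with_ties <- dt1 %>%
  group_by(var2, var3) %>%
  slice_max(order_by = var4, n = 1) %>%
  ungroup()

# with_ties = FALSE returns exactly one row per group.
one_per_group <- dt1 %>%
  group_by(var2, var3) %>%
  slice_max(order_by = var4, n = 1, with_ties = FALSE) %>%
  ungroup()
```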

How to select columns with equal or more than 2 unique values while ignoring NA and blank?

Use n_distinct(), which also has an na.rm argument. The _if/_at/_all variants are superseded in favor of across()/where(). Empty strings ('') can be detected with nzchar(), which returns TRUE only for non-empty strings (NA also passes nzchar() by default and is then dropped by na.rm = TRUE). So subset each column with nzchar(), apply n_distinct() column-wise, use the result as the condition in select(), and then get the names:

library(dplyr)
df %>%
  select(where(~ n_distinct(.x[nzchar(.x)], na.rm = TRUE) > 1)) %>%
  names()

-output

[1] "ID"    "color" "owner"

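Another option is to convert the "" to NA with na_if(), which may be slightly more compact. Note that from dplyr 1.1.0 on, na_if() requires its second argument to be castable to the type of the column, so as written this applies to character columns: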

df %>%
  select(where(~ n_distinct(na_if(.x, ""), na.rm = TRUE) > 1)) %>%
  names()
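
To see the nzchar() approach on concrete input, here is a small sketch with made-up data (the question's df is not shown, so the columns and values are hypothetical, as is the helper name n_real_values):

```r
library(dplyr)

# Hypothetical data: "" and NA both mean "no value".
df <- data.frame(
  ID    = 1:4,
  color = c("red", "", NA, "blue"),
  size  = c("L", "L", "", NA),
  stringsAsFactors = FALSE
)

# Number of distinct values in x, ignoring NA and empty strings.
n_real_values <- function(x) {
  n_distinct(x[nzchar(x)], na.rm = TRUE)
}

# Names of the columns with at least two real values.
keep <- df %>%
  select(where(~ n_real_values(.x) > 1)) %>%
  names()
```

Here size is dropped because "L" is its only real value, while ID and color are kept.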

How to use dplyr to find unique entries in the previous rows

Because you are trying to calculate the number of distinct ids across groups, first we'll need to define a boolean column that will allow us to sum only the unique values.

Secondly, you want to include missing dates from your original df in your expected output, so we'll also need to perform a right_join with the full sequence of dates. I assume here that your dates column is already of class Date. This will produce NA values that we replace by 0.

Finally we calculate the cumsum for both unique_ids and sum_values.

library(dplyr)

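df %>%
  mutate(unique_ids = !duplicated(ids)) %>%
  group_by(dates) %>%
  summarise(unique_ids = sum(unique_ids),
            sum_values = sum(values)) %>%
  # full sequence of dates, so that days without rows also appear
  right_join(data.frame(dates = seq(min(df$dates), max(df$dates), by = 1)),
             by = "dates") %>%
  # make sure rows are in date order before accumulating
  arrange(dates) %>%
  # replace NA by 0, then take the running total (across() replaces mutate_each()/funs())
  mutate(across(-dates, ~ cumsum(replace(.x, is.na(.x), 0))))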
# dates unique_ids sum_values
# <date> <dbl> <dbl>
#1 2011-10-01 2 36
#2 2011-10-02 3 38
#3 2011-10-03 4 43
#4 2011-10-04 4 43
#5 2011-10-05 4 53
#6 2011-10-06 5 58
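
The three steps can be tried end to end on a small made-up data set with one gap date. All values below are hypothetical, and all_dates and res are illustrative names; across() assumes dplyr 1.0.0 or later:

```r
library(dplyr)

# Hypothetical data: there are no rows for 2011-10-03.
df <- data.frame(
  dates  = as.Date(c("2011-10-01", "2011-10-01", "2011-10-02",
                     "2011-10-02", "2011-10-04")),
  ids    = c("a", "b", "a", "c", "d"),
  values = c(1, 2, 3, 4, 5)
)

# Every calendar day between the first and last observed date.
all_dates <- data.frame(dates = seq(min(df$dates), max(df$dates), by = "day"))

res <- df %>%
  mutate(unique_ids = !duplicated(ids)) %>%  # TRUE on an id's first appearance
  group_by(dates) %>%
  summarise(unique_ids = sum(unique_ids),
            sum_values = sum(values)) %>%
  right_join(all_dates, by = "dates") %>%    # missing dates come in as NA rows
  arrange(dates) %>%                         # cumsum() needs date order
  mutate(across(-dates, ~ cumsum(replace(.x, is.na(.x), 0))))
```

On the gap date both running totals simply carry over from the previous day, since the NA rows are replaced by 0 before accumulating.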

Find unique entries in otherwise identical rows

A data.table alternative. Coerce data frame to a data.table (setDT). Melt data to long format (melt(df, id.vars = "ID")).

Within each group defined by 'ID' and 'variable' (corresponding to the columns in the wide format) (by = .(ID, variable)), count number of unique values (uniqueN(value)) and check if it's equal to the number of rows in the subgroup (== .N). If so (if), select the entire subgroup (.SD).

Finally, reshape the data back to wide format (dcast).

library(data.table)
setDT(df)
d = melt(df, id.vars = "ID")
dcast(d[ , if(uniqueN(value) == .N) .SD, by = .(ID, variable)], ID + rowid(ID, variable) ~ variable)
# ID ID_1 x2 x3 x5
# 1: 1 1 <NA> 7 x
# 2: 1 2 <NA> 10 p
# 3: 3 1 c 9 z
# 4: 3 2 d 11 q
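
The melt / filter / dcast sequence can be followed on a tiny made-up table. The columns x1 and x2 and their values are hypothetical (the original df is not shown), and value.var is spelled out only to avoid dcast() having to guess it:

```r
library(data.table)

# Hypothetical data: x1 repeats within each ID, x2 does not.
df <- data.table(
  ID = c(1, 1, 3, 3),
  x1 = c("a", "a", "b", "b"),
  x2 = c("p", "q", "c", "d")
)

# Long format: one row per ID / variable / value.
d <- melt(df, id.vars = "ID")

# Keep an (ID, variable) subgroup only if all of its values differ.
kept <- d[, if (uniqueN(value) == .N) .SD, by = .(ID, variable)]

# Back to wide format; rowid() numbers the rows within each ID / variable pair.
res <- dcast(kept, ID + rowid(ID, variable) ~ variable, value.var = "value")
```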

