Programming With Dplyr Using String as Input

dplyr >= 1.0

Use a combination of double braces and the across() function:

my_summarise2 <- function(df, group_var) {
df %>% group_by(across({{ group_var }})) %>%
summarise(mpg = mean(mpg))
}

my_summarise2(mtcars, "cyl")

# A tibble: 3 x 2
# cyl mpg
# <dbl> <dbl>
# 1 4 26.7
# 2 6 19.7
# 3 8 15.1

# same result as above, passing cyl without quotes
my_summarise2(mtcars, cyl)

dplyr < 1.0

As far as I know, you could use as.name or sym (from the rlang package - I don't know if dplyr will import it eventually):

library(dplyr)
my_summarise <- function(df, var) {
var <- rlang::sym(var)
df %>%
group_by(!!var) %>%
summarise(mpg = mean(mpg))
}

or

my_summarise <- function(df, var) {
var <- as.name(var)
df %>%
group_by(!!var) %>%
summarise(mpg = mean(mpg))
}

my_summarise(mtcars, "cyl")
# # A tibble: 3 × 2
# cyl mpg
# <dbl> <dbl>
# 1 4 26.66364
# 2 6 19.74286
# 3 8 15.10000

Creating dplyr function that can tell if variable input is a string or a symbol

my_summarise <- function(df, group_var) {

group_var <- substitute(group_var)

# is.symbol/as.symbol can be used instead of is.name/as.name, or a mixture of both
if (!is.name(group_var)) group_var <- as.name(group_var)

group_var <- enquo(group_var)

df %>% group_by(!! group_var) %>%
summarise(a = mean(a))
}

You can also drop the if condition altogether, since as.name leaves a symbol unchanged:

my_summarise <- function(df, group_var) {

group_var<- as.name(substitute(group_var))

group_var <- enquo(group_var)

df %>% group_by(!! group_var) %>%
summarise(a = mean(a))
}
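
As a quick check of the string-or-symbol idea, here is a minimal sketch on a made-up data frame. The data frame toy, with columns g1 and a, is an assumption for illustration and is not taken from the original question. The sketch also drops the enquo step, since !! can unquote a symbol directly.

```r
library(dplyr)

# Same idea as above: accept either a bare column name or a string
my_summarise <- function(df, group_var) {
  group_var <- substitute(group_var)          # what the caller typed
  if (!is.name(group_var)) group_var <- as.name(group_var)

  df %>%
    group_by(!!group_var) %>%
    summarise(a = mean(a))
}

# Hypothetical toy data, only for this example
toy <- data.frame(g1 = c("x", "x", "y"), a = c(1, 3, 5))

by_symbol <- my_summarise(toy, g1)    # bare name
by_string <- my_summarise(toy, "g1")  # string
```

Note that substitute() sees what was typed at the call site, not the contents of a variable, so a call such as my_summarise(toy, v) with v <- "g1" is not covered by this approach.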

Pass a string as variable name in dplyr::filter

!! (or UQ) evaluates the variable, so mtcars %>% filter(!!var == 4) is the same as mtcars %>% filter('cyl' == 4), where the condition always evaluates to FALSE. You can verify this by printing !!var inside filter:

mtcars %>% filter({ print(!!var); (!!var) == 4 })
# [1] "cyl"
# [1] mpg cyl disp hp drat wt qsec vs am gear carb
# <0 rows> (or 0-length row.names)

To make var refer to the cyl column, first convert the string to the symbol cyl, then unquote that symbol so it is evaluated as a column:

Using rlang:

library(rlang)
var <- 'cyl'
mtcars %>% filter((!!sym(var)) == 4)

# mpg cyl disp hp drat wt qsec vs am gear carb
#1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# ...

Or use as.symbol/as.name from base R:

mtcars %>% filter((!!as.symbol(var)) == 4)

mtcars %>% filter((!!as.name(var)) == 4)
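
If you would rather not build a symbol at all, the .data pronoun (used with group_by and summarise further down) also works inside filter. A minimal sketch, assuming dplyr >= 1.0:

```r
library(dplyr)

var <- "cyl"

# .data[[var]] looks up the column whose name is stored in var
four_cyl <- mtcars %>% filter(.data[[var]] == 4)

nrow(four_cyl)
```

This keeps the string as a string, so there is no unquoting step to get wrong.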

Pass string as the argument in function to be used in other function in R

@rawr is correct. The linked answer shows passing a string containing a column name into group_by. The process is no different when passing the string into summarise:

This is the approach I typically use:

library(dplyr)
focal <- function(dataset, focal.var){
df <- dataset %>%
group_by(group1) %>%
mutate(FV_ft = mean(!!sym(focal.var)))
return(df)
}

This is an approach recommended by the programming with dplyr vignette:

library(dplyr)
focal <- function(dataset, focal.var){
df <- dataset %>%
group_by(group1) %>%
mutate(FV_ft = mean(.data[[focal.var]]))
return(df)
}
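
To see the .data version in action, here is a small sketch with invented data. The data frame toy and its height column are hypothetical; only the group1 column name comes from the code above.

```r
library(dplyr)

focal <- function(dataset, focal.var) {
  dataset %>%
    group_by(group1) %>%
    mutate(FV_ft = mean(.data[[focal.var]])) %>%  # column chosen by string
    ungroup()
}

# Hypothetical data, only for this example
toy <- data.frame(
  group1 = c("a", "a", "b"),
  height = c(1, 3, 10)
)

out <- focal(toy, "height")
out$FV_ft  # per-group mean repeated on every row of the group
```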

Group by multiple columns in dplyr, using string vector input

Since this question was posted, dplyr added scoped versions of group_by. This lets you use the same functions you would use with select, like so:

data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))

#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27

The output from your example question is as expected (see comparison to plyr above and output below):

# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
# asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
# <fctr> <fctr> <dbl>
# 1 A A 0.04095002
# 2 A B 0.24943935
# 3 A C -0.25783892
# 4 B A 0.15161805
# 5 B B 0.27189974
# 6 B C 0.20858897
# 7 C A 0.19502221
# 8 C B 0.56837548
# 9 C C -0.22682998

Note that since dplyr::summarize only strips off one layer of grouping at a time, the resulting tibble is still grouped (which can sometimes catch people by surprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
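
In dplyr >= 1.0 the scoped verbs such as group_by_at are superseded by across(). A rough equivalent of the pipeline above, assuming the same kind of character vector of column names, might look like this. The data frame toy and its columns g1, g2 and value are made up for the sketch.

```r
library(dplyr)

# Hypothetical data, only for this example
toy <- data.frame(
  g1    = c("A", "A", "B", "B"),
  g2    = c("x", "x", "x", "y"),
  value = c(1, 3, 5, 7)
)

columns <- c("g1", "g2")

res <- toy %>%
  group_by(across(all_of(columns))) %>%             # group by the named columns
  summarise(Value = mean(value), .groups = "drop")  # drop all grouping at once
```

Setting .groups = "drop" returns an ungrouped tibble, which sidesteps the leftover grouping described above.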

Using strings as arguments in custom dplyr function using non-standard evaluation

You can either use sym to turn "y" into a symbol or parse_expr to parse it into an expression, then unquote it using !!:

library(rlang)

testFun(data.frame(x = c("a", "b", "c"), y = 1:3), !!sym(myVar))

testFun(data.frame(x = c("a", "b", "c"), y = 1:3), !!parse_expr(myVar))

Result:

  x   y
1 a 0
2 b 100
3 c 200

Check my answer in this question for explanation of difference between sym and parse_expr.
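
The difference matters as soon as the string holds more than a bare column name. A small sketch follows; the string "y * 100" is only an illustration and is not the original myVar.

```r
library(rlang)

as_symbol <- sym("y * 100")         # a single symbol whose name is "y * 100"
as_call   <- parse_expr("y * 100")  # the call y * 100

is_symbol(as_symbol)  # TRUE
is_call(as_call)      # TRUE

# Only the parsed expression can be evaluated against a column named y
scaled <- eval_tidy(as_call, data = data.frame(y = 1:3))
```

So sym is the safer choice when the string is expected to be exactly one column name, while parse_expr is for strings that contain code.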

How to use tidy evaluation with column name as strings?

We can also use ensym with !!

my_summarise <- function(df, group_var) {


df %>%
group_by(!!rlang::ensym(group_var)) %>%
summarise(a = mean(a))
}

my_summarise(df, 'g1')

Or another option is group_by_at

my_summarise <- function(df, group_var) {


df %>%
group_by_at(vars(group_var)) %>%
summarise(a = mean(a))
}

my_summarise(df, 'g1')

Include column names as function input with dplyr

I've slightly updated your code to dplyr 1.0.0 and tidyr. Then you can make use of the new dplyr programming feature {{}} to specify variables that are arguments of a function.

# Example data frame
df <- data.frame("ID" = rep(1:5, each = 4), "score" = runif(20, 0, 100), "location" = rep(c("a", "b", "c", "d"), 5))
library(dplyr)
wide_fun <- function(.data, key_name, value_name) {
.data %>%
group_by(across(-{{value_name}})) %>% # group by everything other than the value column.
mutate(row_id = 1:n()) %>% ungroup() %>% # build group index
tidyr::pivot_wider(
names_from = {{key_name}},
values_from = {{value_name}}) %>% # spread
select(-row_id)
}

wide_fun(df, location, score)
#> # A tibble: 5 x 5
#> ID a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 90.8 38.9 28.7 39.0
#> 2 2 94.5 24.9 84.6 54.6
#> 3 3 61.1 97.2 12.2 57.7
#> 4 4 52.7 85.6 41.4 100.
#> 5 5 17.8 86.1 92.3 33.7

Created on 2020-09-11 by the reprex package (v0.3.0)

Edit

This function should also work with older versions of dplyr:

library(dplyr)
wide_fun_2 <- function(.data, key_name, value_name) {
.data %>%
group_by_at(vars(-!!ensym(value_name))) %>% # group by everything other than the value column.
mutate(row_id = 1:n()) %>% ungroup() %>% # build group index
tidyr::pivot_wider(
names_from = !!ensym(key_name),
values_from = !!ensym(value_name)) %>% # spread
select(-row_id)
}

df %>%
wide_fun_2(location, score)
#> # A tibble: 5 x 5
#> ID a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 72.2 81.4 52.5 48.8
#> 2 2 36.1 27.5 82.2 73.0
#> 3 3 83.9 68.2 80.9 15.7
#> 4 4 0.451 70.0 18.5 43.2
#> 5 5 82.6 68.2 22.8 63.0

If the argument only ever names a column, you are dealing with symbols rather than quosures, which is why ensym is the right tool here.

Standard evaluation in dplyr: summarise a variable given as a character string

dplyr 1.0 has changed pretty much everything about this question as well as all of the answers. See the dplyr programming vignette here:

https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html

The new way to refer to columns when their identifier is stored as a character vector is to use the .data pronoun from rlang, and then subset as you would in base R.

library(dplyr)

key <- "v3"
val <- "v2"
drp <- "v1"

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

df %>%
select(-matches(drp)) %>%
group_by(.data[[key]]) %>%
summarise(total = sum(.data[[val]], na.rm = TRUE))

#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> v3 total
#> <chr> <int>
#> 1 A 21
#> 2 B 19

If your code is in a package function, you can @importFrom rlang .data to avoid R check notes about undefined globals.
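
As a rough sketch of what that could look like in a package file: the function name sum_by and its arguments are hypothetical, and the roxygen tag is the only part taken from the text above. dplyr functions are called with the dplyr:: prefix so the snippet does not rely on library().

```r
#' Sum one column within groups of another, both given as strings
#'
#' @importFrom rlang .data
sum_by <- function(df, key, val) {
  grouped <- dplyr::group_by(df, .data[[key]])
  dplyr::summarise(grouped, total = sum(.data[[val]], na.rm = TRUE))
}

# Same data as in the example above
df <- dplyr::tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
res <- sum_by(df, "v3", "v2")
```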


