Programming with dplyr using string as input
dplyr >= 1.0
Use a combination of double braces and the across function:
my_summarise2 <- function(df, group_var) {
df %>% group_by(across({{ group_var }})) %>%
summarise(mpg = mean(mpg))
}
my_summarise2(mtcars, "cyl")
# A tibble: 3 x 2
# cyl mpg
# <dbl> <dbl>
# 1 4 26.7
# 2 6 19.7
# 3 8 15.1
# same result as above, passing cyl without quotes
my_summarise2(mtcars, cyl)
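As a minimal sketch of why this pattern is convenient, the block below checks that the quoted and unquoted calls agree, and that a character vector stored in a variable can be passed through all_of(). The variable name `cols` and the `.groups = "drop"` argument are additions for illustration and are not part of the original answer; dplyr >= 1.0 is assumed.

```r
library(dplyr)

# Same shape as my_summarise2 above, redefined so the block runs on its own.
# .groups = "drop" is an addition that returns an ungrouped result.
my_summarise2 <- function(df, group_var) {
  df %>%
    group_by(across({{ group_var }})) %>%
    summarise(mpg = mean(mpg), .groups = "drop")
}

by_string <- my_summarise2(mtcars, "cyl")  # column name as a string
by_name   <- my_summarise2(mtcars, cyl)    # column name as a bare symbol

# A character vector held in a variable (hypothetical `cols`) is wrapped in
# all_of() so that tidyselect treats its contents as column names.
cols   <- c("cyl", "am")
by_two <- my_summarise2(mtcars, all_of(cols))

all.equal(as.data.frame(by_string), as.data.frame(by_name))
```

Because across() accepts any tidyselect expression, the same function should also accept helpers such as starts_with().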
dplyr < 1.0
As far as I know, you can use as.name or sym (from the rlang package; I don't know whether dplyr will import it eventually):
library(dplyr)
my_summarise <- function(df, var) {
var <- rlang::sym(var)
df %>%
group_by(!!var) %>%
summarise(mpg = mean(mpg))
}
or
my_summarise <- function(df, var) {
var <- as.name(var)
df %>%
group_by(!!var) %>%
summarise(mpg = mean(mpg))
}
my_summarise(mtcars, "cyl")
# A tibble: 3 × 2
# cyl mpg
# <dbl> <dbl>
# 1 4 26.66364
# 2 6 19.74286
# 3 8 15.10000
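To make the two options concrete, here is a small sketch showing that as.name and rlang::sym produce the same kind of object, a symbol, and that evaluating the symbol against a data frame returns the column. The variable names `col`, `sym_base`, and `sym_rlang` are made up for the illustration.

```r
library(rlang)

col <- "cyl"

sym_base  <- as.name(col)  # base R
sym_rlang <- sym(col)      # rlang

is.name(sym_base)               # both are symbols ...
identical(sym_base, sym_rlang)  # ... and they are the same symbol

# Evaluating the symbol with the data frame as the environment
# returns the corresponding column.
identical(eval(sym_base, mtcars), mtcars$cyl)
```

Unquoting with !! inside group_by() inserts this symbol into the call, which is why the grouped column is found.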
Creating dplyr function that can tell if variable input is a string or a symbol
my_summarise <- function(df, group_var) {
group_var <- substitute(group_var)
# is.symbol()/as.symbol() can be used interchangeably with is.name()/as.name()
if (!is.name(group_var)) group_var <- as.name(group_var)
group_var <- enquo(group_var)
df %>% group_by(!! group_var) %>%
summarise(a = mean(a))
}
You can also drop the if condition altogether, because as.name leaves a symbol unchanged:
my_summarise <- function(df, group_var) {
group_var <- as.name(substitute(group_var))
group_var <- enquo(group_var)
df %>% group_by(!! group_var) %>%
summarise(a = mean(a))
}
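The detection step relies only on base R, so it can be sketched in isolation. The helper `describe_arg` below is hypothetical and exists only to show what substitute() returns for each kind of input.

```r
# Hypothetical helper: reports how the argument was written by the caller.
describe_arg <- function(x) {
  expr <- substitute(x)  # the unevaluated argument
  if (is.name(expr)) {
    "symbol"
  } else if (is.character(expr)) {
    "string"
  } else {
    "other"
  }
}

describe_arg(cyl)      # "symbol"
describe_arg("cyl")    # "string"
describe_arg(cyl + 1)  # "other"

# Caveat: a variable that holds a string is seen as a symbol named after
# the variable, not after its contents.
v <- "cyl"
describe_arg(v)        # "symbol"
```

The last call shows a limitation worth keeping in mind: with this approach, my_summarise(df, v) would try to group by a column called v rather than by cyl.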
Pass a string as variable name in dplyr::filter
!! (or UQ) unquotes the variable and inserts its value, so mtcars %>% filter(!!var == 4) is the same as mtcars %>% filter('cyl' == 4), where the condition always evaluates to FALSE. You can verify this by printing !!var inside the filter function:
mtcars %>% filter({ print(!!var); (!!var) == 4 })
# [1] "cyl"
# [1] mpg cyl disp hp drat wt qsec vs am gear carb
# <0 rows> (or 0-length row.names)
To make var refer to the cyl column, first convert var to the symbol cyl, then unquote that symbol so that it is evaluated as a column:
Using rlang:
library(rlang)
var <- 'cyl'
mtcars %>% filter((!!sym(var)) == 4)
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# ...
Or use as.symbol/as.name from base R:
mtcars %>% filter((!!as.symbol(var)) == 4)
mtcars %>% filter((!!as.name(var)) == 4)
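The following sketch puts the variants side by side and compares them with the literal filter(cyl == 4). It also includes the .data pronoun form, which the dplyr programming vignette describes for column names held in strings; that variant is an addition here and not part of the original answer.

```r
library(dplyr)
library(rlang)

var <- "cyl"

literal     <- mtcars %>% filter(cyl == 4)
via_sym     <- mtcars %>% filter((!!sym(var)) == 4)
via_name    <- mtcars %>% filter((!!as.name(var)) == 4)
via_pronoun <- mtcars %>% filter(.data[[var]] == 4)  # added alternative

# Unquoting the string itself compares the text "cyl" with 4,
# so no row satisfies the condition.
no_rows <- mtcars %>% filter(!!var == 4)

nrow(literal)
nrow(no_rows)
```

All three working variants should select the same rows as the literal call, while the last one returns an empty data frame.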
pass string as the argument in function to be used in other function in R
@rawr is correct. The linked answer shows how to pass a string containing a column name into group_by. The process is no different when passing the string into summarise or mutate:
This is the approach I typically use:
library(dplyr)
focal <- function(dataset, focal.var){
df <- dataset %>%
group_by(group1) %>%
mutate(FV_ft = mean(!!sym(focal.var)))
return(df)
}
This is an approach recommended by the programming with dplyr vignette:
library(dplyr)
focal <- function(dataset, focal.var){
df <- dataset %>%
group_by(group1) %>%
mutate(FV_ft = mean(.data[[focal.var]]))
return(df)
}
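A quick way to convince yourself that the two versions behave the same is to run both on a small data frame. The data frame `toy`, its column `x`, and the function names `focal_sym` and `focal_data` are invented for this sketch; the trailing ungroup() is also an addition.

```r
library(dplyr)

# Variant using !!sym()
focal_sym <- function(dataset, focal.var) {
  dataset %>%
    group_by(group1) %>%
    mutate(FV_ft = mean(!!sym(focal.var))) %>%
    ungroup()
}

# Variant using the .data pronoun
focal_data <- function(dataset, focal.var) {
  dataset %>%
    group_by(group1) %>%
    mutate(FV_ft = mean(.data[[focal.var]])) %>%
    ungroup()
}

# Hypothetical toy data: two groups with known means (2 and 15)
toy <- data.frame(group1 = c("a", "a", "b", "b"), x = c(1, 3, 10, 20))

res_sym  <- focal_sym(toy, "x")
res_data <- focal_data(toy, "x")
res_data$FV_ft
```

One practical difference is that .data[[focal.var]] only ever looks in the data frame, whereas a symbol created with sym() can fall back to a variable of the same name in the calling environment if the column does not exist.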
Group by multiple columns in dplyr, using string vector input
Since this question was posted, dplyr has added scoped versions of group_by, such as group_by_at (see its documentation). These let you use the same selection helpers you would use with select, like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize only strips off one layer of grouping at a time, the resulting tibble is still grouped, which can sometimes catch people by surprise later down the line. If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
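In dplyr 1.0 and later, the scoped verbs are superseded by across(), and one_of() by all_of()/any_of(). A sketch of the equivalent pipeline, assuming dplyr >= 1.0 and using a small made-up data frame `toy` so the result is deterministic, could look like this:

```r
library(dplyr)

# Hypothetical data with known group means
toy <- data.frame(
  g1    = c("A", "A", "B", "B"),
  g2    = c("x", "y", "x", "x"),
  value = c(1, 2, 3, 5)
)

columns <- c("g1", "g2")

res <- toy %>%
  group_by(across(all_of(columns))) %>%
  # .groups = "drop" removes all grouping, so no ungroup() is needed
  summarise(Value = mean(value), .groups = "drop")

is_grouped_df(res)
res$Value
```

Using .groups = "drop" is one way to avoid the leftover grouping described above; adding ungroup() after summarise() works as well.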
Using strings as arguments in custom dplyr function using non-standard evaluation
You can either use sym to turn "y" into a symbol or parse_expr to parse it into an expression, then unquote it using !!:
library(rlang)
testFun(data.frame(x = c("a", "b", "c"), y = 1:3), !!sym(myVar))
testFun(data.frame(x = c("a", "b", "c"), y = 1:3), !!parse_expr(myVar))
Result:
x y
1 a 0
2 b 100
3 c 200
See my answer to this question for an explanation of the difference between sym and parse_expr.
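As a rough illustration of that difference, the sketch below applies both functions to a plain column name and to a string that contains an operator. The object names are invented for the example.

```r
library(rlang)

# For a plain name, both functions give the same symbol.
s1 <- sym("y")
e1 <- parse_expr("y")
identical(s1, e1)

# For text containing code, they differ:
e2 <- parse_expr("y * 100")  # a call: y multiplied by 100
s2 <- sym("y * 100")         # a single symbol whose name is "y * 100"

is.call(e2)
is.symbol(s2)

# Only the parsed expression computes something when evaluated.
eval(e2, list(y = 2))
```

In short, sym is the safer choice when the string is known to be a column name, while parse_expr is needed when the string holds an arbitrary expression.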
How to use tidy evaluation with column name as strings?
We can also use ensym with !!:
my_summarise <- function(df, group_var) {
df %>%
group_by(!!rlang::ensym(group_var)) %>%
summarise(a = mean(a))
}
my_summarise(df, 'g1')
Another option is group_by_at:
my_summarise <- function(df, group_var) {
df %>%
group_by_at(vars(group_var)) %>%
summarise(a = mean(a))
}
my_summarise(df, 'g1')
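A convenient property of the ensym version is that it accepts the column either as a string or as a bare name. The sketch below checks this on a made-up data frame `toy_df`; the data and the result names are assumptions for illustration only.

```r
library(dplyr)

my_summarise <- function(df, group_var) {
  df %>%
    group_by(!!rlang::ensym(group_var)) %>%
    summarise(a = mean(a))
}

# Hypothetical data: group 1 has mean 3, group 2 has mean 10
toy_df <- data.frame(g1 = c(1, 1, 2), a = c(2, 4, 10))

by_string <- my_summarise(toy_df, "g1")  # quoted
by_name   <- my_summarise(toy_df, g1)    # unquoted

by_string$a
```

Both calls should return the same two-row summary.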
Include column names as function input with dplyr
I've slightly updated your code to use dplyr 1.0.0 and tidyr. You can then use the new dplyr programming feature {{ }} to refer to variables that are passed as function arguments.
# Example data frame
df <- data.frame("ID" = rep(1:5, each = 4), "score" = runif(20, 0, 100), "location" = rep(c("a", "b", "c", "d"), 5))
library(dplyr)
wide_fun <- function(.data, key_name, value_name) {
.data %>%
group_by(across(-{{value_name}})) %>% # group by everything other than the value column.
mutate(row_id = 1:n()) %>% ungroup() %>% # build group index
tidyr::pivot_wider(
names_from = {{key_name}},
values_from = {{value_name}}) %>% # spread
select(-row_id)
}
wide_fun(df, location, score)
#> # A tibble: 5 x 5
#> ID a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 90.8 38.9 28.7 39.0
#> 2 2 94.5 24.9 84.6 54.6
#> 3 3 61.1 97.2 12.2 57.7
#> 4 4 52.7 85.6 41.4 100.
#> 5 5 17.8 86.1 92.3 33.7
Created on 2020-09-11 by the reprex package (v0.3.0)
Edit
This function should also work with older versions of dplyr:
library(dplyr)
wide_fun_2 <- function(.data, key_name, value_name) {
.data %>%
group_by_at(vars(-!!ensym(value_name))) %>% # group by everything other than the value column.
mutate(row_id = 1:n()) %>% ungroup() %>% # build group index
tidyr::pivot_wider(
names_from = !!ensym(key_name),
values_from = !!ensym(value_name)) %>% # spread
select(-row_id)
}
df %>%
wide_fun_2(location, score)
#> # A tibble: 5 x 5
#> ID a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 72.2 81.4 52.5 48.8
#> 2 2 36.1 27.5 82.2 73.0
#> 3 3 83.9 68.2 80.9 15.7
#> 4 4 0.451 70.0 18.5 43.2
#> 5 5 82.6 68.2 22.8 63.0
If the argument only ever names a column, you are dealing with symbols rather than quosures, so ensym is the appropriate tool.
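To illustrate the distinction, here is a small sketch contrasting ensym with enquo. The wrapper functions `capture_sym` and `capture_quo` are hypothetical and only expose what each function captures.

```r
library(rlang)

capture_sym <- function(x) ensym(x)  # expects a column name
capture_quo <- function(x) enquo(x)  # accepts any expression

# ensym() returns a symbol for a bare name or a string ...
is.symbol(capture_sym(score))
is.symbol(capture_sym("score"))

# ... but rejects anything more complex than a name.
sym_result <- tryCatch(
  capture_sym(score + 1),
  error = function(e) "error"
)

# enquo() captures the whole expression together with its environment.
is_quosure(capture_quo(score + 1))
```

That stricter behavior can be useful in functions like the one above, since a caller who passes an expression instead of a column name gets an error straight away.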
standard evaluation in dplyr: summarise a variable given as a character string
dplyr 1.0 has changed pretty much everything about this question as well as all of the answers. See the dplyr programming vignette here:
https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
The new way to refer to columns when their names are stored in a character vector is to use the .data pronoun from rlang and subset it as you would a list in base R.
library(dplyr)
key <- "v3"
val <- "v2"
drp <- "v1"
df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
df %>%
select(-matches(drp)) %>%
group_by(.data[[key]]) %>%
summarise(total = sum(.data[[val]], na.rm = TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> v3 total
#> <chr> <int>
#> 1 A 21
#> 2 B 19
If your code is in a package function, you can add @importFrom rlang .data to avoid R CMD check notes about undefined global variables.
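In a package, that might look roughly like the sketch below. The function name `total_by` and its roxygen documentation are hypothetical; only the @importFrom rlang .data tag comes from the advice above.

```r
#' Sum one column within the groups of another (hypothetical helper)
#'
#' @param df A data frame.
#' @param key,val Column names, each given as a single string.
#' @importFrom rlang .data
total_by <- function(df, key, val) {
  grouped <- dplyr::group_by(df, .data[[key]])
  dplyr::summarise(grouped, total = sum(.data[[val]], na.rm = TRUE))
}

# Same values as the v2 and v3 columns in the example above
example_df <- data.frame(
  v2 = 6:10,
  v3 = c("A", "A", "A", "B", "B")
)

res <- total_by(example_df, "v3", "v2")
res$total
```

Outside a package the roxygen comments have no effect, so the function can be tried interactively as written.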