dplyr::group_by_ with character string input of several variable names
No need for interp here, just use as.formula to convert the strings to formulas:
dots = sapply(y, . %>% {as.formula(paste0('~', .))})
mtcars %>% group_by_(.dots = dots)
The reason why your interp approach doesn't work is that the expression gives you back the following:
~list(c("cyl", "gear"))
This is not what you want. You could, of course, sapply interp over y, which would be similar to using as.formula above:
dots1 = sapply(y, . %>% {interp(~var, var = .)})
But, in fact, you can also directly pass y:
mtcars %>% group_by_(.dots = y)
The dplyr vignette on non-standard evaluation goes into more detail and explains the difference between these approaches.
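The snippets above never define y. As a minimal sketch, assume it is the character vector c("cyl", "gear") implied by the interp output shown earlier. Because group_by_ has been deprecated since dplyr 0.7, the sketch groups with across(all_of()) instead and only uses as.formula to show what the string-to-formula conversion produces.

```r
library(dplyr)

# Assumption: y holds the grouping column names from the question
y <- c("cyl", "gear")

# Each string becomes a one-sided formula such as ~cyl
dots <- lapply(y, function(nm) as.formula(paste0("~", nm)))
formula_vars <- vapply(dots, all.vars, character(1))

# Grouping by the same strings without the deprecated underscore verbs
grouped <- mtcars %>%
  group_by(across(all_of(y)))

# The grouping variables match the original strings
group_vars(grouped)
```

The formulas carry exactly the names that were in y, which is why passing y directly to .dots works as well.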
Group by multiple columns in dplyr, using string vector input
Since this question was posted, dplyr added scoped versions of group_by (see the dplyr documentation). This lets you use the same functions you would use with select, like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize only strips off one layer of grouping at a time, the resulting tibble is still grouped by the first column, which can sometimes catch people by surprise later down the line. If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
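The leftover grouping can be checked directly with group_vars(). The following is a small sketch on mtcars rather than the random data above; the column choice and the names res and res_flat are illustrative only, and it assumes dplyr 1.0 or later for the .groups argument.

```r
library(dplyr)

# Illustrative grouping columns (not from the original question)
cols <- c("cyl", "gear")

res <- mtcars %>%
  group_by(across(all_of(cols))) %>%
  summarise(mean_mpg = mean(mpg), .groups = "drop_last")

# Only the last grouping layer was removed, so "cyl" remains
group_vars(res)

# ungroup() removes whatever grouping is left
res_flat <- res %>% ungroup()
group_vars(res_flat)
```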
Pass character string of column names (e.g. c(speed, dist)) to `across` function in R
You can't use substitute() or eval() on character vectors. You need to parse those character vectors into language objects. Otherwise, when you eval a string, you just get that string back. It's not like eval in other languages. One way to do the parsing is str2lang. Then you can inject that expression into across using tidy evaluation's !! operator. For example:
mtcars_2 %>%
mutate(across(.cols = !!str2lang(.$cols_to_modify),.fns = round))
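To make the difference between a string and a language object concrete, here is a small sketch using the built-in cars data. The variable names col_string and col_expr are hypothetical, and sqrt stands in for round so the effect is visible; str2lang requires R 3.6.0 or later.

```r
library(dplyr)

# A column selection stored as text (hypothetical example input)
col_string <- "c(speed, dist)"

# Evaluating a string just returns the string
string_result <- eval(col_string)

# Parsing turns it into a call that tidy evaluation can use
col_expr <- str2lang(col_string)
is.call(col_expr)

# Inject the parsed call into across() with !!
out <- cars %>%
  mutate(across(.cols = !!col_expr, .fns = sqrt))

head(out)
```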
dplyr - groupby on multiple columns using variable names
dplyr version >1.0
With more recent versions of dplyr
, you should use across
along with a tidyselect helper function. See help("language", "tidyselect")
for a list of all the helper functions. In this case if you want all columns in a character vector, use all_of()
cols <- c("mpg","hp","wt")
mtcars %>%
group_by(across(all_of(cols))) %>%
summarize(x=mean(gear))
original answer (older versions of dplyr)
If you have a vector of variable names, you should pass them to the .dots= parameter of group_by_. For example:
mtcars %>%
group_by_(.dots=c("mpg","hp","wt")) %>%
summarize(x=mean(gear))
standard evaluation in dplyr: summarise a variable given as a character string
dplyr 1.0 has changed pretty much everything about this question as well as all of the answers. See the dplyr programming vignette here:
https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
The new way to refer to columns when their identifier is stored as a character vector is to use the .data pronoun from rlang, and then subset as you would in base R.
library(dplyr)
key <- "v3"
val <- "v2"
drp <- "v1"
df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
df %>%
select(-matches(drp)) %>%
group_by(.data[[key]]) %>%
summarise(total = sum(.data[[val]], na.rm = TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> v3 total
#> <chr> <int>
#> 1 A 21
#> 2 B 19
If your code is in a package function, you can @importFrom rlang .data to avoid R check notes about undefined globals.
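As a rough sketch of what that looks like, the pipeline above can be wrapped in a function carrying a roxygen2 import tag. The function name sum_by and its argument names are hypothetical; in a real package the tag only takes effect once roxygen2 regenerates the NAMESPACE file.

```r
library(dplyr)

#' Sum one column within groups of another (hypothetical helper)
#'
#' @param df A data frame.
#' @param key Name of the grouping column, as a string.
#' @param val Name of the column to sum, as a string.
#' @importFrom rlang .data
sum_by <- function(df, key, val) {
  df %>%
    group_by(.data[[key]]) %>%
    summarise(total = sum(.data[[val]], na.rm = TRUE), .groups = "drop")
}

# Same data as in the example above
df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
out <- sum_by(df, key = "v3", val = "v2")
out
```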
Using Variable names for dplyr inside function
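sum1 <- function(df, group_var, x, y) {
  # group_var is passed as a bare column name, so capture it with enquo()
  group_var <- enquo(group_var)
  # x and y are passed as strings, so convert them to symbols
  x <- as.name(x)
  y <- as.name(y)
  df.temp <- df %>%
    group_by(!!group_var) %>%
    mutate(
      # x and y are already symbols, so unquote them directly
      sum = (!!x) + (!!y)
    )
  return(df.temp)
}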
sum1(df, g1, A.Key, B.Key)
# A tibble: 5 x 4
# Groups: g1 [5]
g1 a b sum
<dbl> <int> <int> <int>
1 1. 3 2 5
2 2. 2 1 3
3 3. 1 3 4
4 4. 4 4 8
5 5. 5 5 10
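The inputs to this call are not shown. The sketch below reconstructs them from the printed output, so the data frame and the values of A.Key and B.Key are assumptions. It also uses a hypothetical variant, sum2, that relies on the embrace operator and the .data pronoun (rlang 0.4 or later) instead of enquo and as.name.

```r
library(dplyr)

# Assumed inputs, reconstructed from the printed output
df <- tibble(
  g1 = c(1, 2, 3, 4, 5),
  a  = c(3L, 2L, 1L, 4L, 5L),
  b  = c(2L, 1L, 3L, 4L, 5L)
)
A.Key <- "a"  # assumed: name of the first column to add
B.Key <- "b"  # assumed: name of the second column to add

# Hypothetical variant: bare name for the group, strings for the summands
sum2 <- function(df, group_var, x, y) {
  df %>%
    group_by({{ group_var }}) %>%
    mutate(sum = .data[[x]] + .data[[y]])
}

res <- sum2(df, g1, A.Key, B.Key)
res
```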
Function calling variable names for group_by in dplyr - how do I vectorise this variable in the function?
@akrun's answer offers a working solution, but I think this is an ideal situation to wrap function parameters in vars(), passing the variables you want to group by as a quasi-quotation that dplyr can interpret without any explicit tidyeval code in the body of the function.
library(tidyverse)
#> -- Attaching packages ------------------------------------ tidyverse 1.2.1 --
#> v ggplot2 3.0.0 v purrr 0.2.5
#> v tibble 1.4.2 v dplyr 0.7.6
#> v tidyr 0.8.0 v stringr 1.3.1
#> v readr 1.1.1 v forcats 0.3.0
#> -- Conflicts --------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
# Create data frame for analysis
dat <- data.frame(
Type1 = c(0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0),
Type2 = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
Output = c(4, 2, 7, 5, 1, 1, 7, 8, 3, 2, 5, 4, 3, 6)
)
# using the dplyr::vars() quoting function has 3 main advantages:
# 1. It makes functions neater
mean_out <- function(.vars) {
dat %>%
# group_by will continue to work for basic selections
# group_by_at allows for full tidyselect functionality
group_by_at(.vars) %>%
summarise(mean = mean(Output))
}
# 2. It lets us harness the power of tidyselect
mean_out(vars(Type1))
#> # A tibble: 2 x 2
#> Type1 mean
#> <dbl> <dbl>
#> 1 0 3.83
#> 2 1 4.38
mean_out(vars(Type1, Type2))
#> # A tibble: 6 x 3
#> # Groups: Type1 [?]
#> Type1 Type2 mean
#> <dbl> <dbl> <dbl>
#> 1 0 1 2.33
#> 2 0 2 5
#> 3 0 3 6
#> 4 1 1 4.33
#> 5 1 2 5
#> 6 1 3 4
mean_out(vars(-Output))
#> # A tibble: 6 x 3
#> # Groups: Type1 [?]
#> Type1 Type2 mean
#> <dbl> <dbl> <dbl>
#> 1 0 1 2.33
#> 2 0 2 5
#> 3 0 3 6
#> 4 1 1 4.33
#> 5 1 2 5
#> 6 1 3 4
mean_out(vars(matches("Type")))
#> # A tibble: 6 x 3
#> # Groups: Type1 [?]
#> Type1 Type2 mean
#> <dbl> <dbl> <dbl>
#> 1 0 1 2.33
#> 2 0 2 5
#> 3 0 3 6
#> 4 1 1 4.33
#> 5 1 2 5
#> 6 1 3 4
# 3. It doesn't demand that we load rlang, since it's built into dplyr
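For dplyr 1.0 and later, where the scoped _at verbs are superseded, a similar interface can be sketched with across() and the embrace operator. The function name mean_out2 is hypothetical and this is one possible translation, not the answer's original method; it takes a tidyselect expression directly instead of a vars() call.

```r
library(dplyr)

# Same data frame as above
dat <- data.frame(
  Type1 = c(0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0),
  Type2 = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Output = c(4, 2, 7, 5, 1, 1, 7, 8, 3, 2, 5, 4, 3, 6)
)

# Hypothetical across()-based counterpart of mean_out()
mean_out2 <- function(cols) {
  dat %>%
    group_by(across({{ cols }})) %>%
    summarise(mean = mean(Output), .groups = "drop")
}

by_type1 <- mean_out2(Type1)
by_both  <- mean_out2(c(Type1, Type2))
by_match <- mean_out2(matches("Type"))
by_type1
```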