dplyr::group_by_ with character string input of several variable names

No need for interp here, just use as.formula to convert the strings to formulas:

dots = sapply(y, . %>% {as.formula(paste0('~', .))})
mtcars %>% group_by_(.dots = dots)
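
As a minimal sketch of what that first line builds, assuming y is c("cyl", "gear") (an assumption about the question's data), each string becomes its own one-sided formula. Plain lapply is used here instead of the magrittr shorthand so the snippet runs in base R:

```r
# Assumption: y holds the grouping column names from the question.
y <- c("cyl", "gear")

# Turn each string into a one-sided formula such as ~cyl.
dots <- lapply(y, function(v) as.formula(paste0("~", v)))

# Each element is a formula that refers to exactly one variable.
sapply(dots, all.vars)
```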

The reason why your interp approach doesn’t work is that the expression gives you back the following:

~list(c("cyl", "gear"))

That is not what you want. You could, of course, sapply interp over y, which would be similar to using as.formula above:

dots1 = sapply(y, . %>% {interp(~var, var = .)})

But, in fact, you can also directly pass y:

mtcars %>% group_by_(.dots = y)

The dplyr vignette on non-standard evaluation goes into more detail and explains the difference between these approaches.

Group by multiple columns in dplyr, using string vector input

Since this question was posted, dplyr has added scoped versions of group_by, such as group_by_at. These let you use the same selection helpers you would use with select, like so:

data = data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

library(dplyr)
df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

# compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27

The output from your example question is as expected (see comparison to plyr above and output below):

# A tibble: 9 x 3
# Groups:   asihckhdoydkhxiydfgfTgdsx [?]
  asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja       Value
                     <fctr>                    <fctr>       <dbl>
1                         A                         A  0.04095002
2                         A                         B  0.24943935
3                         A                         C -0.25783892
4                         B                         A  0.15161805
5                         B                         B  0.27189974
6                         B                         C  0.20858897
7                         C                         A  0.19502221
8                         C                         B  0.56837548
9                         C                         C -0.22682998

Note that since dplyr::summarize only strips off one layer of grouping at a time, the resulting tibble is still grouped by the remaining variable (which can sometimes catch people by surprise further down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
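
To make the leftover grouping concrete, here is a small sketch. The data frame and its short column names (g1, g2) are made up for illustration and are not from the question:

```r
library(dplyr)

# Hypothetical data with short column names, for illustration only.
data <- data.frame(
  g1 = rep(c("A", "B", "C"), times = 4),
  g2 = rep(c("x", "y"), each = 6),
  value = 1:12
)
columns <- c("g1", "g2")

grouped <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

group_vars(grouped)   # "g1": one layer of grouping remains

flat <- grouped %>% ungroup()
group_vars(flat)      # character(0): no grouping left
```

In dplyr 1.0 and later, summarize(..., .groups = "drop") should give the same ungrouped result in one step.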

Pass character string of column names (e.g. "c(speed, dist)") to `across` function in R

You can't use substitute() or eval() on character vectors. You need to parse those character vectors into language objects; otherwise, when you eval a string, you just get that string back. It's not like eval in other languages. One way to do the parsing is str2lang. Then you can inject that expression into across using tidy evaluation's !!. For example:

mtcars_2 %>%
  mutate(across(.cols = !!str2lang(.$cols_to_modify), .fns = round))
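
A self-contained sketch of the same idea, using the built-in cars data set instead of mtcars_2 (whose definition is not shown here). The variable names cols_string and cols_expr, and the doubling function, are made up for illustration:

```r
library(dplyr)

cols_string <- "c(speed, dist)"

# Evaluating a string just returns the string.
eval(cols_string)

# Parsing turns it into a call object that tidyselect can interpret.
cols_expr <- str2lang(cols_string)
class(cols_expr)

# Inject the parsed expression into across() with !!.
doubled <- cars %>%
  mutate(across(.cols = !!cols_expr, .fns = function(x) x * 2))
```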

dplyr - groupby on multiple columns using variable names


dplyr version >= 1.0

With more recent versions of dplyr, you should use across along with a tidyselect helper function. See help("language", "tidyselect") for a list of all the helper functions. In this case, if you want all the columns named in a character vector, use all_of():

cols <- c("mpg","hp","wt")
mtcars %>%
  group_by(across(all_of(cols))) %>%
  summarize(x=mean(gear))

original answer (older versions of dplyr)

If you have a vector of variable names, you should pass them to the .dots= parameter of group_by_. For example:

mtcars %>%
  group_by_(.dots=c("mpg","hp","wt")) %>%
  summarize(x=mean(gear))

standard evaluation in dplyr: summarise a variable given as a character string

dplyr 1.0 has changed pretty much everything about this question as well as all of the answers. See the dplyr programming vignette here:

https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html

The new way to refer to columns when their identifier is stored as a character vector is to use the .data pronoun from rlang, and then subset as you would in base R.

library(dplyr)

key <- "v3"
val <- "v2"
drp <- "v1"

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

df %>%
  select(-matches(drp)) %>%
  group_by(.data[[key]]) %>%
  summarise(total = sum(.data[[val]], na.rm = TRUE))

#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#>   v3    total
#>   <chr> <int>
#> 1 A        21
#> 2 B        19

If your code is in a package function, you can @importFrom rlang .data to avoid R check notes about undefined globals.
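
As a rough sketch of how that might look inside a package, here is a hypothetical helper, total_by (the name and its roxygen comments are illustrative, not from the vignette). It uses dplyr:: prefixes rather than the pipe so that .data is the only extra import it needs:

```r
library(dplyr)

#' Sum one column within groups of another (hypothetical helper).
#'
#' @param df A data frame.
#' @param key,val Column names given as strings.
#' @importFrom rlang .data
total_by <- function(df, key, val) {
  grouped <- dplyr::group_by(df, .data[[key]])
  dplyr::summarise(grouped, total = sum(.data[[val]], na.rm = TRUE))
}

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
res <- total_by(df, "v3", "v2")
```

Called this way, res should hold the same two totals as the pipeline above.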

Using Variable names for dplyr inside function


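sum1 <- function(df, group_var, x, y) {

  # capture the bare grouping column passed by the caller
  group_var <- enquo(group_var)

  # x and y arrive as strings; turn them into symbols
  x <- as.name(x)
  y <- as.name(y)

  df.temp <- df %>%
    group_by(!!group_var) %>%
    mutate(
      # x and y are already symbols, so they can be unquoted directly
      sum = (!!x) + (!!y)
    )

  return(df.temp)
}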

sum1(df, g1, A.Key, B.Key)
# A tibble: 5 x 4
# Groups:   g1 [5]
     g1     a     b   sum
  <dbl> <int> <int> <int>
1    1.     3     2     5
2    2.     2     1     3
3    3.     1     3     4
4    4.     4     4     8
5    5.     5     5    10
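
The call above depends on objects that are not shown. The sketch below reconstructs df from the printed output and assumes that A.Key and B.Key hold the strings "a" and "b"; that assumption is a guess based on the column names in the result, not something stated in the answer:

```r
library(dplyr)

# Reconstructed from the printed output above.
df <- tibble(
  g1 = c(1, 2, 3, 4, 5),
  a  = c(3L, 2L, 1L, 4L, 5L),
  b  = c(2L, 1L, 3L, 4L, 5L)
)

# Assumption: the keys are column names stored as strings.
A.Key <- "a"
B.Key <- "b"

sum1 <- function(df, group_var, x, y) {
  group_var <- enquo(group_var)  # bare column name from the caller
  x <- as.name(x)                # string -> symbol
  y <- as.name(y)
  df %>%
    group_by(!!group_var) %>%
    mutate(sum = (!!x) + (!!y))
}

res <- sum1(df, g1, A.Key, B.Key)
```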

Function calling variable names for group_by in dplyr - how do I vectorise this variable in the function?

@akrun's answer offers a working solution, but I think this is an ideal situation to wrap function parameters in vars(), passing the variables you want to group by as a quasi-quotation that dplyr can interpret without any explicit tidyeval code in the body of the function.

library(tidyverse)
#> -- Attaching packages ------------------------------------ tidyverse 1.2.1 --
#> v ggplot2 3.0.0 v purrr 0.2.5
#> v tibble 1.4.2 v dplyr 0.7.6
#> v tidyr 0.8.0 v stringr 1.3.1
#> v readr 1.1.1 v forcats 0.3.0
#> -- Conflicts --------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
# Create data frame for analysis
dat <- data.frame(
Type1 = c(0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0),
Type2 = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
Output = c(4, 2, 7, 5, 1, 1, 7, 8, 3, 2, 5, 4, 3, 6)
)
# using the dplyr::vars() quoting function has 3 main advantages:
# 1. It makes functions neater
mean_out <- function(.vars) {

  dat %>%

    # group_by will continue to work for basic selections
    # group_by_at allows for full tidyselect functionality
    group_by_at(.vars) %>%
    summarise(mean = mean(Output))
}
# 2. It lets us harness the power of tidyselect
mean_out(vars(Type1))
#> # A tibble: 2 x 2
#> Type1 mean
#> <dbl> <dbl>
#> 1 0 3.83
#> 2 1 4.38
mean_out(vars(Type1, Type2))
#> # A tibble: 6 x 3
#> # Groups: Type1 [?]
#> Type1 Type2 mean
#> <dbl> <dbl> <dbl>
#> 1 0 1 2.33
#> 2 0 2 5
#> 3 0 3 6
#> 4 1 1 4.33
#> 5 1 2 5
#> 6 1 3 4
mean_out(vars(-Output))
#> # A tibble: 6 x 3
#> # Groups: Type1 [?]
#> Type1 Type2 mean
#> <dbl> <dbl> <dbl>
#> 1 0 1 2.33
#> 2 0 2 5
#> 3 0 3 6
#> 4 1 1 4.33
#> 5 1 2 5
#> 6 1 3 4
mean_out(vars(matches("Type")))
#> # A tibble: 6 x 3
#> # Groups: Type1 [?]
#> Type1 Type2 mean
#> <dbl> <dbl> <dbl>
#> 1 0 1 2.33
#> 2 0 2 5
#> 3 0 3 6
#> 4 1 1 4.33
#> 5 1 2 5
#> 6 1 3 4
# 3. It doesn't demand that we load rlang, since it's built into dplyr
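
For comparison, a similar wrapper can be sketched with across(), which needs dplyr 1.0 or later rather than the 0.7.6 used in the session above. The function name mean_out2 and its explicit data argument are made up for this illustration:

```r
library(dplyr)

dat <- data.frame(
  Type1 = c(0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0),
  Type2 = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Output = c(4, 2, 7, 5, 1, 1, 7, 8, 3, 2, 5, 4, 3, 6)
)

# Hypothetical variant of mean_out(): the data is an explicit argument
# and the grouping columns are given as a tidyselect expression.
mean_out2 <- function(data, cols) {
  data %>%
    group_by(across({{ cols }})) %>%
    summarise(mean = mean(Output), .groups = "drop")
}

by_type1 <- mean_out2(dat, Type1)
by_both  <- mean_out2(dat, c(Type1, Type2))
by_match <- mean_out2(dat, matches("Type"))
```

Here the caller no longer wraps the selection in vars(); the embracing operator {{ }} forwards it to across() instead.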

