Dplyr - Groupby on Multiple Columns Using Variable Names

dplyr version >= 1.0

With recent versions of dplyr, use across() together with a tidyselect helper function. See help("language", "tidyselect") for a list of all the helpers. In this case, to select every column named in a character vector, use all_of():

library(dplyr)

cols <- c("mpg", "hp", "wt")
mtcars %>%
  group_by(across(all_of(cols))) %>%
  summarize(x = mean(gear))
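As a small sketch of how the choice of helper matters (the column vectors and the result names below are made up for illustration): all_of() is strict and fails when a name is missing from the data, while any_of() quietly skips missing names.

```r
library(dplyr)

# Hypothetical grouping columns, chosen only for this illustration
cols <- c("cyl", "gear")

# all_of() requires every name in `cols` to exist in the data
res <- mtcars %>%
  group_by(across(all_of(cols))) %>%
  summarize(mean_mpg = mean(mpg), .groups = "drop")

# A vector that contains a name that is not a column of mtcars
cols_maybe <- c("cyl", "not_a_column")

# any_of() ignores the missing name and groups by cyl only
res_any <- mtcars %>%
  group_by(across(any_of(cols_maybe))) %>%
  summarize(mean_mpg = mean(mpg), .groups = "drop")

# all_of() raises an error for the same vector; capture it instead of stopping
strict <- tryCatch(
  mtcars %>% group_by(across(all_of(cols_maybe))),
  error = function(e) "error"
)
```

Which helper fits depends on whether a missing column should be treated as a bug (all_of()) or as something to tolerate (any_of()).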

Original answer (older versions of dplyr)

If you have a vector of variable names, pass them to the .dots= parameter of group_by_(). Note that group_by_() has been deprecated since dplyr 0.7.0, so prefer the across() approach above in new code. For example:

mtcars %>%
  group_by_(.dots = c("mpg", "hp", "wt")) %>%
  summarize(x = mean(gear))

Dynamic variable names in a dplyr function across multiple columns

We can use the .names argument of across() to rename the summarised columns:

mean_fun_multicols <- function(data, group_cols, summary_cols) {
  data %>%
    group_by(across({{ group_cols }})) %>%
    summarise(across({{ summary_cols }},
                     ~ mean(., na.rm = TRUE), .names = "mean_{.col}"),
              .groups = "drop")
}

Testing:

mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt))
# A tibble: 8 × 4
cyl gear mean_mpg mean_wt
<dbl> <dbl> <dbl> <dbl>
1 4 3 21.5 2.46
2 4 4 26.9 2.38
3 4 5 28.2 1.83
4 6 3 19.8 3.34
5 6 4 19.8 3.09
6 6 5 19.7 2.77
7 8 3 15.0 4.10
8 8 5 15.4 3.37

NOTE: In the tidyverse, := is mainly used when assigning to a single column whose name is generated dynamically.
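For the single-column case, a minimal sketch might look like the following. The function name mean_fun_onecol is hypothetical and not part of the original answer; it assumes dplyr >= 1.0, where a glue-style string on the left of := can embed the {{ }} argument.

```r
library(dplyr)

# Hypothetical helper: one grouping column, one summary column
mean_fun_onecol <- function(data, group_col, summary_col) {
  data %>%
    group_by({{ group_col }}) %>%
    # The string on the left of := is a glue template; {{ summary_col }}
    # is replaced by the name of the column the caller passed in
    summarise("mean_{{ summary_col }}" := mean({{ summary_col }}, na.rm = TRUE),
              .groups = "drop")
}

res <- mean_fun_onecol(mtcars, cyl, mpg)
names(res)
```

With several summary columns, the across() and .names approach shown earlier is the more natural fit, since := names exactly one output.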


If we use the OP's function instead, multiple columns are assigned to a single column name, so summarise() returns a tibble (a data frame column) rather than a regular column. We may need to unpack it with tidyr:

library(tidyr)
> mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt)) %>% str
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
grouped_df [8 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
$ cyl : num [1:8] 4 4 4 6 6 6 8 8
$ gear : num [1:8] 3 4 5 3 4 5 3 5
$ mean_c(mpg, wt): tibble [8 × 2] (S3: tbl_df/tbl/data.frame)
..$ mpg: num [1:8] 21.5 26.9 28.2 19.8 19.8 ...
..$ wt : num [1:8] 2.46 2.38 1.83 3.34 3.09 ...
- attr(*, "groups")= tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
..$ cyl : num [1:3] 4 6 8
..$ .rows: list<int> [1:3]
.. ..$ : int [1:3] 1 2 3
.. ..$ : int [1:3] 4 5 6
.. ..$ : int [1:2] 7 8
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE

> mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt)) %>%
unpack(where(is_tibble))
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 8 × 4
# Groups: cyl [3]
cyl gear mpg wt
<dbl> <dbl> <dbl> <dbl>
1 4 3 21.5 2.46
2 4 4 26.9 2.38
3 4 5 28.2 1.83
4 6 3 19.8 3.34
5 6 4 19.8 3.09
6 6 5 19.7 2.77
7 8 3 15.0 4.10
8 8 5 15.4 3.37
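To see the same effect in isolation, here is a small sketch that builds a data frame column on purpose and then flattens it. The column name stats is an arbitrary choice for this illustration.

```r
library(dplyr)
library(tidyr)

# Returning a tibble under a single name creates a packed data frame column
nested <- mtcars %>%
  group_by(cyl) %>%
  summarise(stats = tibble(mpg = mean(mpg), wt = mean(wt)), .groups = "drop")

# `stats` is itself a data frame with columns mpg and wt
is.data.frame(nested$stats)

# unpack() lifts the inner columns up to the top level
flat <- nested %>% unpack(stats)
names(flat)
```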

How to pass multiple column names as input to group_by in dplyr

You've almost got it; you just need to use the .dots argument of group_by_() to pass in your grouping variables.

library(dplyr)
library(nycflights13)  # assumed source of the flights data frame

group <- c("origin", "carrier")

flights %>%
  group_by_(.dots = group) %>%
  tally()
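Since group_by_() is deprecated in current dplyr, a rough modern equivalent can be sketched with across() and all_of(). The example below uses the built-in mtcars data instead of flights so that it runs without extra packages; the column choice is arbitrary.

```r
library(dplyr)

# Grouping columns held in a character vector
group <- c("cyl", "gear")

# Same pattern as group_by_(.dots = group) %>% tally()
counts <- mtcars %>%
  group_by(across(all_of(group))) %>%
  tally()

# count() accepts the same across() selection and skips the explicit group_by()
counts2 <- mtcars %>%
  count(across(all_of(group)))
```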

dplyr group_by - Mix variable names with and without surrounding quotes

This would be one way to do it:

library(dplyr)

group_and_summarize <- function(var) {
  test_tbl %>%
    select(Species, {{ var }}, Sepal.Length, Petal.Width) %>%
    group_by(Species, {{ var }}) %>%
    summarize(average.Sepal.Length = mean(Sepal.Length),
              average.Petal.Width = mean(Petal.Width))
}

group_and_summarize(extra_var1)
#> `summarise()` regrouping output by 'Species' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups: Species [3]
#> Species extra_var1 average.Sepal.Length average.Petal.Width
#> <fct> <chr> <dbl> <dbl>
#> 1 setosa No 4.67 0.195
#> 2 setosa Yes 5.23 0.28
#> 3 versicolor No 4.9 1
#> 4 versicolor Yes 5.96 1.33
#> 5 virginica No 4.9 1.7
#> 6 virginica Yes 6.62 2.03

Created on 2021-05-11 by the reprex package (v0.3.0)

If you want the user to pass the column names as strings instead, we can use !!! syms():

group_and_summarize <- function(vars) {
  test_tbl %>%
    select(Species, !!! syms(vars), Sepal.Length, Petal.Width) %>%
    group_by(Species, !!! syms(vars)) %>%
    summarize(average.Sepal.Length = mean(Sepal.Length),
              average.Petal.Width = mean(Petal.Width))
}

group_and_summarize(c("extra_var1", "extra_var2"))

#> `summarise()` regrouping output by 'Species', 'extra_var1' (override with `.groups` argument)
#> # A tibble: 6 x 5
#> # Groups: Species, extra_var1 [6]
#> Species extra_var1 extra_var2 average.Sepal.Length average.Petal.Width
#> <fct> <chr> <chr> <dbl> <dbl>
#> 1 setosa No What 4.67 0.195
#> 2 setosa Yes What 5.23 0.28
#> 3 versicolor No What 4.9 1
#> 4 versicolor Yes What 5.96 1.33
#> 5 virginica No What 4.9 1.7
#> 6 virginica Yes What 6.62 2.03
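As an alternative sketch for the string case, across(all_of()) can be mixed with a bare column name inside group_by(). The test_tbl below is a stand-in built from iris purely for illustration (the original test_tbl is not shown), and the function name group_and_summarize_chr is hypothetical.

```r
library(dplyr)

# Stand-in data: iris plus one made-up character column
test_tbl <- iris %>%
  mutate(extra_var1 = if_else(Sepal.Width > 3, "Yes", "No"))

# Species is written unquoted; the extra grouping columns arrive as strings
group_and_summarize_chr <- function(vars) {
  test_tbl %>%
    group_by(Species, across(all_of(vars))) %>%
    summarize(average.Sepal.Length = mean(Sepal.Length),
              average.Petal.Width = mean(Petal.Width),
              .groups = "drop")
}

res <- group_and_summarize_chr("extra_var1")
names(res)
```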

Group by multiple columns in dplyr, using string vector input

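Since this question was posted, dplyr added scoped versions of group_by(), such as group_by_at(). These let you use the same selection helpers you would use with select(), like so (in dplyr >= 1.0 these scoped verbs are superseded by across(), and one_of() by all_of() and any_of()):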

data = data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

library(dplyr)
df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27

The output from your example question is as expected (see comparison to plyr above and output below):

# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998

Note that since dplyr::summarize only strips off one layer of grouping at a time, the resulting tibble is still grouped by the remaining variables (which can sometimes catch people by surprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
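The leftover grouping can be checked with group_vars(). The sketch below uses mtcars and arbitrary column choices; .groups = "drop_last" is spelled out only to make the default behavior explicit.

```r
library(dplyr)

# Two grouping variables going in
by_two <- mtcars %>%
  group_by(cyl, gear) %>%
  summarize(mean_mpg = mean(mpg), .groups = "drop_last")

# Only the last grouping variable (gear) has been peeled off
group_vars(by_two)

# Either ungroup() afterwards ...
ungrouped <- by_two %>% ungroup()

# ... or ask summarize() to drop all grouping in the first place
dropped <- mtcars %>%
  group_by(cyl, gear) %>%
  summarize(mean_mpg = mean(mpg), .groups = "drop")
```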

Group by two column and summarize multiple columns

We can use summarise() with across() from dplyr version >= 1.0.0:

library(dplyr)
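df %>%
  group_by(State, Date) %>%
  # a lambda is used because passing extra arguments through across() is deprecated in dplyr >= 1.1.0
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')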
# A tibble: 6 x 4
# State Date Female Male
# <chr> <chr> <int> <int>
#1 Cali 05/06/2005 3 2
#2 Cali 10/06/2005 4 3
#3 NY 11/06/2005 10 5
#4 NY 12/06/2005 11 6
#5 Texas 01/01/2004 5 3
#6 Texas 02/01/2004 5 4

Or, using aggregate() from base R:

aggregate(.~ State + Date, df, sum, na.rm = TRUE)
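One difference worth keeping in mind when the data contain missing values: the formula method of aggregate() defaults to na.action = na.omit, which drops any row with an NA before summing, whereas the dplyr version only skips the individual NA cells. The toy data frame below is made up to show this and is not the data from the question.

```r
library(dplyr)

# Made-up data with one missing Male value
toy <- data.frame(
  State  = c("NY", "NY", "Texas"),
  Date   = c("d1", "d1", "d2"),
  Female = c(1L, 2L, 5L),
  Male   = c(3L, NA, 4L)
)

# dplyr: NA cells are skipped column by column
out_dplyr <- toy %>%
  group_by(State, Date) %>%
  summarise(across(c(Female, Male), ~ sum(.x, na.rm = TRUE)), .groups = "drop")

# base R default: the whole second row is removed before summing
out_base <- aggregate(. ~ State + Date, toy, sum, na.rm = TRUE)

# base R with na.pass: rows are kept and na.rm handles the NA inside sum()
out_pass <- aggregate(. ~ State + Date, toy, sum, na.rm = TRUE,
                      na.action = na.pass)
```

With complete data, such as the data frame used in this answer, both approaches give the same sums.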

data

df <-  structure(list(State = c("Texas", "Texas", "Texas", "Cali", "Cali", 
"Cali", "Cali", "NY", "NY"), Female = c(2L, 3L, 5L, 1L, 2L, 3L,
1L, 10L, 11L), Male = c(2L, 1L, 4L, 1L, 1L, 1L, 2L, 5L, 6L),
Date = c("01/01/2004", "01/01/2004", "02/01/2004", "05/06/2005",
"05/06/2005", "10/06/2005", "10/06/2005", "11/06/2005", "12/06/2005"
)), class = "data.frame", row.names = c(NA, -9L))

specify variable names when grouping

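What about using across() to select the columns? The example below randomly drops one of the two Sepal columns and groups by whichever ones remain: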

iris[, -(rbinom(1, 1, .5) + 1)] %>%
  group_by(across(starts_with('Sepal')))


# A tibble: 150 x 4
# Groups: Sepal.Length [35]
Sepal.Length Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <fct>
1 5.1 1.4 0.2 setosa
2 4.9 1.4 0.2 setosa
3 4.7 1.3 0.2 setosa
4 4.6 1.5 0.2 setosa
5 5 1.4 0.2 setosa
6 5.4 1.7 0.4 setosa
7 4.6 1.4 0.3 setosa
8 5 1.5 0.2 setosa
9 4.4 1.4 0.2 setosa
10 4.9 1.5 0.1 setosa
# … with 140 more rows
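A deterministic version of the same idea, as a sketch: group_vars() shows which columns the starts_with() selection actually picked up, with and without Sepal.Width present.

```r
library(dplyr)

# Both Sepal columns present: both become grouping variables
both <- iris %>%
  group_by(across(starts_with("Sepal")))
group_vars(both)

# Sepal.Width (column 2) removed: only Sepal.Length is left to group by
one <- iris[, -2] %>%
  group_by(across(starts_with("Sepal")))
group_vars(one)
```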

