dplyr - groupby on multiple columns using variable names
dplyr version >= 1.0
With more recent versions of dplyr, you should use across() along with a tidyselect helper function. See help("language", "tidyselect") for a list of all the helper functions. In this case, if you want all the columns named in a character vector, use all_of():
cols <- c("mpg","hp","wt")
mtcars %>%
group_by(across(all_of(cols))) %>%
summarize(x=mean(gear))
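The same pattern can be wrapped in a function that takes the grouping columns as a character vector. This is only a minimal sketch: mean_by and its argument names are hypothetical and chosen for illustration, and it assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical helper: group `data` by the columns named in `group_cols`
# (a character vector) and average the column named by the string `value_col`.
mean_by <- function(data, group_cols, value_col) {
  data %>%
    group_by(across(all_of(group_cols))) %>%
    summarise(x = mean(.data[[value_col]]), .groups = "drop")
}

res <- mean_by(mtcars, c("cyl", "am"), "mpg")
res
```

all_of() raises an error if any of the named columns is missing, which is usually what you want when the names come from user input.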
Original answer (older versions of dplyr)
If you have a vector of variable names, you should pass them to the .dots= parameter of group_by_(). Note that group_by_() is deprecated in current versions of dplyr. For example:
mtcars %>%
group_by_(.dots=c("mpg","hp","wt")) %>%
summarize(x=mean(gear))
Dynamic variables names in dplyr function across multiple columns
We could use the .names argument of across() to rename the summarised columns:
mean_fun_multicols <- function(data, group_cols, summary_cols) {
data %>%
group_by(across({{group_cols}})) %>%
summarise(across({{ summary_cols }},
~ mean(., na.rm = TRUE), .names = "mean_{.col}"), .groups = "drop")
}
Testing:
mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt))
# A tibble: 8 × 4
cyl gear mean_mpg mean_wt
<dbl> <dbl> <dbl> <dbl>
1 4 3 21.5 2.46
2 4 4 26.9 2.38
3 4 5 28.2 1.83
4 6 3 19.8 3.34
5 6 4 19.8 3.09
6 6 5 19.7 2.77
7 8 3 15.0 4.10
8 8 5 15.4 3.37
NOTE: The := operator is mainly used when a single column is being created in the tidyverse.
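To illustrate the single-column case mentioned in the note, here is a minimal sketch. The function name mean_fun_onecol is hypothetical, and the glue-style name "mean_{{ summary_col }}" assumes a reasonably recent rlang (0.4.3 or later).

```r
library(dplyr)

# Hypothetical helper for the single-column case: `:=` lets the name of the
# new column be built from the column passed as `summary_col`.
mean_fun_onecol <- function(data, group_cols, summary_col) {
  data %>%
    group_by(across({{ group_cols }})) %>%
    summarise("mean_{{ summary_col }}" := mean({{ summary_col }}, na.rm = TRUE),
              .groups = "drop")
}

res <- mean_fun_onecol(mtcars, c(cyl, gear), mpg)
names(res)
```

With several summary columns there is no single name to put on the left of :=, which is why the across() version uses .names instead.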
If we use the OP's function, we are assigning multiple columns to a single column, so it returns a tibble column instead of a normal column. We may need to unpack it:
library(tidyr)
> mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt)) %>% str
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
grouped_df [8 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
$ cyl : num [1:8] 4 4 4 6 6 6 8 8
$ gear : num [1:8] 3 4 5 3 4 5 3 5
$ mean_c(mpg, wt): tibble [8 × 2] (S3: tbl_df/tbl/data.frame)
..$ mpg: num [1:8] 21.5 26.9 28.2 19.8 19.8 ...
..$ wt : num [1:8] 2.46 2.38 1.83 3.34 3.09 ...
- attr(*, "groups")= tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
..$ cyl : num [1:3] 4 6 8
..$ .rows: list<int> [1:3]
.. ..$ : int [1:3] 1 2 3
.. ..$ : int [1:3] 4 5 6
.. ..$ : int [1:2] 7 8
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
> mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt)) %>%
unpack(where(is_tibble))
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 8 × 4
# Groups: cyl [3]
cyl gear mpg wt
<dbl> <dbl> <dbl> <dbl>
1 4 3 21.5 2.46
2 4 4 26.9 2.38
3 4 5 28.2 1.83
4 6 3 19.8 3.34
5 6 4 19.8 3.09
6 6 5 19.7 2.77
7 8 3 15.0 4.10
8 8 5 15.4 3.37
How to pass multiple column names as input to group_by in dplyr
You've almost got it; you just need to use the .dots argument to pass in your grouping variables. (Note that group_by_() is deprecated in current versions of dplyr.)
group <- c("origin","carrier")
flights %>%
group_by_(.dots = group) %>%
tally()
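For dplyr 1.0 and later, the same idea can be written without the deprecated group_by_(). The sketch below uses mtcars instead of flights so that it runs without the nycflights13 package; that substitution and the column names are assumptions made for illustration.

```r
library(dplyr)

# Grouping columns supplied as a character vector
group <- c("cyl", "gear")

counts <- mtcars %>%
  group_by(across(all_of(group))) %>%
  tally()

counts
```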
dplyr group_by - Mix variable names with and without surrounding quotes
This would be one way to do it:
library(dplyr)
group_and_summarize <- function(var) {
test_tbl %>%
select(Species, {{var}}, Sepal.Length, Petal.Width) %>%
group_by(Species, {{var}}) %>%
summarize(average.Sepal.Length = mean(Sepal.Length),
average.Petal.Width = mean(Petal.Width))
}
group_and_summarize(extra_var1)
#> `summarise()` regrouping output by 'Species' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups: Species [3]
#> Species extra_var1 average.Sepal.Length average.Petal.Width
#> <fct> <chr> <dbl> <dbl>
#> 1 setosa No 4.67 0.195
#> 2 setosa Yes 5.23 0.28
#> 3 versicolor No 4.9 1
#> 4 versicolor Yes 5.96 1.33
#> 5 virginica No 4.9 1.7
#> 6 virginica Yes 6.62 2.03
Created on 2021-05-11 by the reprex package (v0.3.0)
If you want the user to enter strings, then we can use !!! syms():
group_and_summarize <- function(vars) {
test_tbl %>%
select(Species, !!! syms(vars), Sepal.Length, Petal.Width) %>%
group_by(Species, !!! syms(vars)) %>%
summarize(average.Sepal.Length = mean(Sepal.Length),
average.Petal.Width = mean(Petal.Width))
}
group_and_summarize(c("extra_var1", "extra_var2"))
#> `summarise()` regrouping output by 'Species', 'extra_var1' (override with `.groups` argument)
#> # A tibble: 6 x 5
#> # Groups: Species, extra_var1 [6]
#> Species extra_var1 extra_var2 average.Sepal.Length average.Petal.Width
#> <fct> <chr> <chr> <dbl> <dbl>
#> 1 setosa No What 4.67 0.195
#> 2 setosa Yes What 5.23 0.28
#> 3 versicolor No What 4.9 1
#> 4 versicolor Yes What 5.96 1.33
#> 5 virginica No What 4.9 1.7
#> 6 virginica Yes What 6.62 2.03
Created on 2021-05-11 by the reprex package (v0.3.0)
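An alternative to !!! syms() for string input is across(all_of()), which also works inside group_by() in dplyr 1.0 and later. The sketch below is not the answer's own code: test_tbl is rebuilt here as a hypothetical stand-in (iris plus one made-up extra_var1 column), since the original data is not shown.

```r
library(dplyr)

# Hypothetical stand-in for test_tbl: iris plus one extra grouping column.
test_tbl <- iris %>%
  mutate(extra_var1 = if_else(Sepal.Width > 3, "Yes", "No"))

group_and_summarize <- function(vars) {
  test_tbl %>%
    # Species is always a grouping column; `vars` adds more by name
    group_by(Species, across(all_of(vars))) %>%
    summarize(average.Sepal.Length = mean(Sepal.Length),
              average.Petal.Width = mean(Petal.Width),
              .groups = "drop")
}

res <- group_and_summarize("extra_var1")
res
```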
Group by multiple columns in dplyr, using string vector input
Since this question was posted, dplyr added scoped versions of group_by(). (In dplyr 1.0 and later, these scoped verbs are superseded by across().) They let you use the same selection helpers you would use with select(), like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize() only strips off one layer of grouping at a time, you've still got some grouping going on in the resultant tibble (which can sometimes catch people by surprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
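The leftover grouping is easy to check. The following sketch uses mtcars rather than the random data above so the result is reproducible, and it assumes dplyr 1.0 or later for the .groups argument.

```r
library(dplyr)

# summarise() drops only the last grouping level, so `cyl` remains
grouped <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(x = mean(mpg))

# Either ungroup() afterwards or ask summarise() to drop all groups
flat_a <- grouped %>% ungroup()
flat_b <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(x = mean(mpg), .groups = "drop")

group_vars(grouped)      # remaining grouping columns
is_grouped_df(flat_a)
is_grouped_df(flat_b)
```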
Group by two column and summarize multiple columns
We can use summarise() with across() from dplyr version >= 1.0.0:
library(dplyr)
df %>%
group_by(State, Date) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')
# A tibble: 6 x 4
# State Date Female Male
# <chr> <chr> <int> <int>
#1 Cali 05/06/2005 3 2
#2 Cali 10/06/2005 4 3
#3 NY 11/06/2005 10 5
#4 NY 12/06/2005 11 6
#5 Texas 01/01/2004 5 3
#6 Texas 02/01/2004 5 4
Or using aggregate() from base R:
aggregate(.~ State + Date, df, sum, na.rm = TRUE)
Data:
df <- structure(list(State = c("Texas", "Texas", "Texas", "Cali", "Cali",
"Cali", "Cali", "NY", "NY"), Female = c(2L, 3L, 5L, 1L, 2L, 3L,
1L, 10L, 11L), Male = c(2L, 1L, 4L, 1L, 1L, 1L, 2L, 5L, 6L),
Date = c("01/01/2004", "01/01/2004", "02/01/2004", "05/06/2005",
"05/06/2005", "10/06/2005", "10/06/2005", "11/06/2005", "12/06/2005"
)), class = "data.frame", row.names = c(NA, -9L))
specify variable names when grouping
What about using across() to select the columns?
iris[, -(rbinom(1, 1, .5) + 1) ] %>%
group_by(across(starts_with('Sepal')))
# A tibble: 150 x 4
# Groups: Sepal.Length [35]
Sepal.Length Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <fct>
1 5.1 1.4 0.2 setosa
2 4.9 1.4 0.2 setosa
3 4.7 1.3 0.2 setosa
4 4.6 1.5 0.2 setosa
5 5 1.4 0.2 setosa
6 5.4 1.7 0.4 setosa
7 4.6 1.4 0.3 setosa
8 5 1.5 0.2 setosa
9 4.4 1.4 0.2 setosa
10 4.9 1.5 0.1 setosa
# … with 140 more rows