Group by multiple columns in dplyr, using string vector input
Since this question was posted, dplyr added scoped versions of group_by (see ?group_by_at). These let you use the same selection helpers you would use with select, like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize only strips off one layer of grouping at a time, the resulting tibble is still grouped (which can sometimes catch people by surprise further down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
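A minimal sketch of the leftover grouping described above, using mtcars instead of the question's data (the column choice is only illustrative). It assumes dplyr >= 1.0.0, where across() and the .groups argument of summarize() are available; .groups = "drop" is shown as an alternative to a trailing ungroup().

```r
library(dplyr)

group_cols <- c("cyl", "gear")  # illustrative column names, not from the question

# summarize() drops only the last grouping variable, so `cyl` remains
still_grouped <- mtcars %>%
  group_by(across(all_of(group_cols))) %>%
  summarize(mean_mpg = mean(mpg))
group_vars(still_grouped)

# Either of these returns an ungrouped tibble
flat1 <- still_grouped %>% ungroup()
flat2 <- mtcars %>%
  group_by(across(all_of(group_cols))) %>%
  summarize(mean_mpg = mean(mpg), .groups = "drop")
group_vars(flat2)
```

Checking group_vars() after a summarize is a cheap way to confirm what grouping, if any, is still attached.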
dplyr group_by vector of column names?
We can use group_by with across from dplyr version >= 1.0.0:
library(dplyr)
mtcars %>%
group_by(across(all_of(c('mpg', 'cyl')))) %>%
tally() %>%
head(2)
# A tibble: 2 x 3
# Groups: mpg [2]
# mpg cyl n
# <dbl> <dbl> <int>
#1 10.4 8 2
#2 13.3 8 1
With older versions, use group_by_at:
mtcars %>%
group_by_at(c('mpg', 'cyl')) %>%
tally() %>%
head(2)
# A tibble: 2 x 3
# Groups: mpg [2]
# mpg cyl n
# <dbl> <dbl> <int>
#1 10.4 8 2
#2 13.3 8 1
Selecting and grouping multiple columns in dtplyr vs dplyr
It seems that the group_by method for dtplyr (group_by.dtplyr_step) is causing the issue.
> methods('group_by')
[1] group_by.data.frame* group_by.data.table* group_by.dtplyr_step*
Not sure if it is a bug or not.
> traceback()
...
6: group_by.dtplyr_step(., across(all_of(.x))) ###
5: group_by(., across(all_of(.x)))
4: filter(., n() > 1)
3: airq %>% select(all_of(.x)) %>% group_by(across(all_of(.x))) %>%
filter(n() > 1)
2: .f(.x[[i]], ...)
1: map(columnpairs, ~airq %>% select(all_of(.x)) %>% group_by(across(all_of(.x))) %>%
filter(n() > 1))
Here are two methods that work:
- Using the superseded group_by_at
- Converting the names to symbols with syms and then splicing them in with !!!
Using group_by_at
library(dtplyr)
library(purrr)
library(dplyr)
map(columnpairs, ~ airq %>%
select(all_of(.x)) %>%
group_by_at(all_of(.x)) %>%
filter(n() > 1))
$V1
Source: local data table [105 x 2]
Groups: Wind, Month
Call:
_DT2 <- `_DT1`[, .(Wind, Month)]
`_DT2`[`_DT2`[, .I[.N > 1], by = .(Wind, Month)]$V1]
Wind Month
<dbl> <int>
1 7.4 5
2 7.4 5
3 8 5
4 8 5
5 11.5 5
6 11.5 5
# … with 99 more rows
...
Converting to symbols and evaluating
map(columnpairs, ~ airq %>%
select(all_of(.x)) %>%
group_by(!!! rlang::syms(.x)) %>%
filter(n() > 1))
$V1
Source: local data table [105 x 2]
Groups: Wind, Month
Call:
_DT20 <- `_DT1`[, .(Wind, Month)]
`_DT20`[`_DT20`[, .I[.N > 1], by = .(Wind, Month)]$V1]
Wind Month
<dbl> <int>
1 7.4 5
2 7.4 5
3 8 5
4 8 5
5 11.5 5
6 11.5 5
# … with 99 more rows
# Use as.data.table()/as.data.frame()/as_tibble() to access results
$V2
...
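The symbol-splicing idea can also be tried on a plain data frame, without dtplyr. The sketch below is only an illustration: it uses the built-in airquality data directly, and `pairs` is a hypothetical stand-in for the question's `columnpairs`, which is not shown here.

```r
library(dplyr)
library(purrr)

# Hypothetical column pairs; the question's `columnpairs` is not shown here
pairs <- list(V1 = c("Wind", "Month"), V2 = c("Temp", "Month"))

dupes <- map(pairs, ~ airquality %>%
  select(all_of(.x)) %>%
  group_by(!!!rlang::syms(.x)) %>%  # strings -> symbols -> spliced into group_by()
  filter(n() > 1) %>%               # keep only combinations that occur more than once
  ungroup())

names(dupes)
```

rlang::syms() turns each string into a symbol, and !!! splices the resulting list into group_by() as if the column names had been typed by hand.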
dplyr - groupby on multiple columns using variable names
dplyr version >1.0
With more recent versions of dplyr, you should use across along with a tidyselect helper function. See help("language", "tidyselect") for a list of all the helper functions. In this case, if you want all columns named in a character vector, use all_of():
cols <- c("mpg","hp","wt")
mtcars %>%
group_by(across(all_of(cols))) %>%
summarize(x=mean(gear))
original answer (older versions of dplyr)
If you have a vector of variable names, you should pass them to the .dots= parameter of group_by_. For example:
mtcars %>%
group_by_(.dots=c("mpg","hp","wt")) %>%
summarize(x=mean(gear))
correlation of a vector across all column in R (dplyr)
Since cor() requires x and y to have the same length, you cannot group rows together; otherwise, the groups will not have 4 elements to match the 4 values in y.
Prepare data and library
library(dplyr)
gdf <-
tibble(g = c(1, 1, 2, 3), v1 = 10:13, v2 = 20:23)
y <- rnorm(4)
[1] 0.59390132 0.91897737 0.78213630 0.07456498
mutate()
If you want to keep v1 and v2 in the output, use the .names argument to name the new columns. {.col} refers to the name of the column that across is acting on.
gdf %>% mutate(across(v1:v2, ~ cor(.x,y), .names = "{.col}_cor"))
# A tibble: 4 x 5
g v1 v2 v1_cor v2_cor
<dbl> <int> <int> <dbl> <dbl>
1 1 10 20 -0.591 -0.591
2 1 11 21 -0.591 -0.591
3 2 12 22 -0.591 -0.591
4 3 13 23 -0.591 -0.591
summarise()
If you only want the cor() output in the results, you can use summarise:
gdf %>% summarize(across(v1:v2, ~ cor(.x,y)))
# A tibble: 1 x 2
v1 v2
<dbl> <dbl>
1 -0.591 -0.591
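The two correlations in the output above are identical because v2 is just v1 shifted by 10, and correlation is unaffected by adding a constant. A small sketch to check this, with an arbitrary seed added only so the run is reproducible (the seed is an assumption, not part of the original answer):

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the sketch is reproducible
gdf <- tibble(g = c(1, 1, 2, 3), v1 = 10:13, v2 = 20:23)
y <- rnorm(4)

# One correlation per column, computed over all 4 rows (no grouping)
res <- gdf %>% summarize(across(v1:v2, ~ cor(.x, y)))

# v2 is v1 + 10, and correlation ignores a constant shift
all.equal(res$v1, res$v2)
```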
Dynamic variables names in dplyr function across multiple columns
We could use .names in across to rename the output columns:
mean_fun_multicols <- function(data, group_cols, summary_cols) {
data %>%
group_by(across({{group_cols}})) %>%
summarise(across({{ summary_cols }},
~ mean(., na.rm = TRUE), .names = "mean_{.col}"), .groups = "drop")
}
-testing
mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt))
# A tibble: 8 × 4
cyl gear mean_mpg mean_wt
<dbl> <dbl> <dbl> <dbl>
1 4 3 21.5 2.46
2 4 4 26.9 2.38
3 4 5 28.2 1.83
4 6 3 19.8 3.34
5 6 4 19.8 3.09
6 6 5 19.7 2.77
7 8 3 15.0 4.10
8 8 5 15.4 3.37
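If the column names arrive as character vectors rather than bare names, one possible variant wraps them in all_of() instead of using {{ }}. This is a sketch under that assumption; `mean_fun_by_names` is a hypothetical name, not part of the original answer.

```r
library(dplyr)

# Hypothetical variant: both arguments are character vectors of column names
mean_fun_by_names <- function(data, group_cols, summary_cols) {
  data %>%
    group_by(across(all_of(group_cols))) %>%
    summarise(across(all_of(summary_cols),
                     ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"),
              .groups = "drop")
}

out <- mean_fun_by_names(mtcars, c("cyl", "gear"), c("mpg", "wt"))
names(out)
```

all_of() raises an error if any named column is missing, which tends to surface typos in the name vectors early.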
NOTE: In the tidyverse, := is mainly used when a single column is being created. If we use the OP's function, we are assigning multiple columns to a single column, and this returns a tibble column instead of a normal column. We may need to unpack it:
library(tidyr)
> mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt)) %>% str
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
grouped_df [8 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
$ cyl : num [1:8] 4 4 4 6 6 6 8 8
$ gear : num [1:8] 3 4 5 3 4 5 3 5
$ mean_c(mpg, wt): tibble [8 × 2] (S3: tbl_df/tbl/data.frame)
..$ mpg: num [1:8] 21.5 26.9 28.2 19.8 19.8 ...
..$ wt : num [1:8] 2.46 2.38 1.83 3.34 3.09 ...
- attr(*, "groups")= tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
..$ cyl : num [1:3] 4 6 8
..$ .rows: list<int> [1:3]
.. ..$ : int [1:3] 1 2 3
.. ..$ : int [1:3] 4 5 6
.. ..$ : int [1:2] 7 8
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
> mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt)) %>%
unpack(where(is_tibble))
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 8 × 4
# Groups: cyl [3]
cyl gear mpg wt
<dbl> <dbl> <dbl> <dbl>
1 4 3 21.5 2.46
2 4 4 26.9 2.38
3 4 5 28.2 1.83
4 6 3 19.8 3.34
5 6 4 19.8 3.09
6 6 5 19.7 2.77
7 8 3 15.0 4.10
8 8 5 15.4 3.37
dplyr group by colnames described as vector of strings
You can use group_by_at, where you can pass a character vector of column names as group variables:
mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
Or you can move the column selection inside group_by_at using vars and column select helper functions:
mtcars %>%
filter(disp < 160) %>%
group_by_at(vars(matches('[a-z]{3,}$'))) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
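With dplyr >= 1.0.0, the same regex-based selection can be written with across() in place of group_by_at(vars(...)). A minimal sketch, assuming that version; .groups = "drop" is added here so the result comes back ungrouped.

```r
library(dplyr)

by_regex <- mtcars %>%
  filter(disp < 160) %>%
  # group by every column whose name ends in three or more letters
  group_by(across(matches("[a-z]{3,}$"))) %>%
  summarise(n = n(), .groups = "drop")

names(by_regex)
```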
How can I filter multiple columns with dplyr using string matching for the column name?
You could do that using filter_at with ends_with.
library(dplyr)
nyc_crashes %>%
# Select columns that end with KILLED or INJURED
filter_at(vars(c(ends_with("KILLED"),ends_with("INJURED"))),
# Keep rows where any of these variables is >= 1
any_vars(. >= 1))
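In dplyr >= 1.0.4, if_any() inside filter() covers the same case as filter_at() with any_vars(). The sketch below uses a small made-up tibble as a stand-in for nyc_crashes; its column names are assumptions chosen only to match the KILLED / INJURED suffixes.

```r
library(dplyr)

# Toy stand-in for nyc_crashes; the column names here are assumptions
crashes <- tibble(
  id = 1:4,
  PERSONS_KILLED = c(0, 1, 0, 0),
  PERSONS_INJURED = c(0, 0, 2, 0),
  CYCLISTS_INJURED = c(0, 0, 0, 0)
)

# Keep rows where any KILLED or INJURED column is >= 1
hits <- crashes %>%
  filter(if_any(c(ends_with("KILLED"), ends_with("INJURED")), ~ .x >= 1))

hits$id
```

Swapping if_any() for if_all() would instead keep only rows where every selected column meets the condition.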