Group by multiple columns in dplyr, using string vector input

Since this question was posted, dplyr has added scoped versions of group_by() (see ?group_by_at). These let you use the same selection helpers you would use with select(), like so:

data = data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace = TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace = TRUE),
  value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

library(dplyr)
df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

# compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value = mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
##   27

The output for the example in your question is as expected (see the comparison to plyr above and the output below):

# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998

Note that since dplyr::summarize only strips off one layer of grouping at a time, you've still got some grouping going on in the resulting tibble (which can sometimes catch people by surprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
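
For example, a minimal sketch of the same pipeline with the grouping dropped at the end:

df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value)) %>%
  ungroup()   # drop the remaining group so later verbs behave as expected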

dplyr group_by vector of column names?

We can use group_by() with across() from dplyr version >= 1.0.0:

library(dplyr)
mtcars %>%
  group_by(across(all_of(c('mpg', 'cyl')))) %>%
  tally() %>%
  head(2)
# A tibble: 2 x 3
# Groups: mpg [2]
# mpg cyl n
# <dbl> <dbl> <int>
#1 10.4 8 2
#2 13.3 8 1

With older versions, use group_by_at:

mtcars %>%
  group_by_at(c('mpg', 'cyl')) %>%
  tally() %>%
  head(2)
# A tibble: 2 x 3
# Groups: mpg [2]
# mpg cyl n
# <dbl> <dbl> <int>
#1 10.4 8 2
#2 13.3 8 1

Selecting and grouping multiple columns in dtplyr vs dplyr

It seems that the method for group_by with dtplyr (group_by.dtplyr_step) is creating the issue.

> methods('group_by')
[1] group_by.data.frame* group_by.data.table* group_by.dtplyr_step*

Not sure if it is a bug or not.

> traceback()
...
6: group_by.dtplyr_step(., across(all_of(.x))) ###
5: group_by(., across(all_of(.x)))
4: filter(., n() > 1)
3: airq %>% select(all_of(.x)) %>% group_by(across(all_of(.x))) %>%
filter(n() > 1)
2: .f(.x[[i]], ...)
1: map(columnpairs, ~airq %>% select(all_of(.x)) %>% group_by(across(all_of(.x))) %>%
filter(n() > 1))

Here are two methods that work:

  1. Using the superseded group_by_at
  2. Converting the strings to symbols with rlang::syms() and splicing them with !!!

Using group_by_at

library(dtplyr)
library(purrr)
library(dplyr)
map(columnpairs, ~ airq %>%
      select(all_of(.x)) %>%
      group_by_at(all_of(.x)) %>%
      filter(n() > 1))
$V1
Source: local data table [105 x 2]
Groups: Wind, Month
Call:
_DT2 <- `_DT1`[, .(Wind, Month)]
`_DT2`[`_DT2`[, .I[.N > 1], by = .(Wind, Month)]$V1]

Wind Month
<dbl> <int>
1 7.4 5
2 7.4 5
3 8 5
4 8 5
5 11.5 5
6 11.5 5
# … with 99 more rows
...

Converting to symbols and evaluating with !!!

map(columnpairs, ~ airq %>%
      select(all_of(.x)) %>%
      group_by(!!! rlang::syms(.x)) %>%
      filter(n() > 1))
$V1
Source: local data table [105 x 2]
Groups: Wind, Month
Call:
_DT20 <- `_DT1`[, .(Wind, Month)]
`_DT20`[`_DT20`[, .I[.N > 1], by = .(Wind, Month)]$V1]

Wind Month
<dbl> <int>
1 7.4 5
2 7.4 5
3 8 5
4 8 5
5 11.5 5
6 11.5 5
# … with 99 more rows

# Use as.data.table()/as.data.frame()/as_tibble() to access results

$V2
...

dplyr - groupby on multiple columns using variable names

dplyr version >= 1.0

With more recent versions of dplyr, you should use across() along with a tidyselect helper function. See help("language", "tidyselect") for a list of all the helper functions. In this case, since the column names are stored in a character vector, use all_of():

cols <- c("mpg", "hp", "wt")
mtcars %>%
  group_by(across(all_of(cols))) %>%
  summarize(x = mean(gear))

original answer (older versions of dplyr)

If you have a vector of variable names, you should pass them to the .dots= parameter of group_by_. For example:

mtcars %>%
  group_by_(.dots = c("mpg", "hp", "wt")) %>%
  summarize(x = mean(gear))

correlation of a vector across all columns in R (dplyr)

Since cor() requires x and y to have the same length, you cannot group the rows first; each group would then have fewer than the 4 elements needed to match the 4 values in y.

Prepare data and library

library(dplyr)

gdf <- tibble(g = c(1, 1, 2, 3), v1 = 10:13, v2 = 20:23)

y <- rnorm(4)
# [1] 0.59390132 0.91897737 0.78213630 0.07456498
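
As a quick illustration of the point above, grouping by g first would pass each group's values (length 1 or 2 here) to cor() against the length-4 y, so cor() would error on incompatible dimensions; the grouped call below is only a sketch of what not to do:

# gdf %>%
#   group_by(g) %>%
#   summarise(across(v1:v2, ~ cor(.x, y)))   # errors: groups are shorter than y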

mutate()

If you want to keep v1 and v2 in the output, use the .names argument to indicate the names of the new columns. {.col} refers to the column name that across is acting on.

gdf %>% mutate(across(v1:v2, ~ cor(.x,y), .names = "{.col}_cor"))

# A tibble: 4 x 5
g v1 v2 v1_cor v2_cor
<dbl> <int> <int> <dbl> <dbl>
1 1 10 20 -0.591 -0.591
2 1 11 21 -0.591 -0.591
3 2 12 22 -0.591 -0.591
4 3 13 23 -0.591 -0.591

summarise()

If you only want the cor() output in the results, you can use summarise

gdf %>% summarize(across(v1:v2, ~ cor(.x,y)))

# A tibble: 1 x 2
v1 v2
<dbl> <dbl>
1 -0.591 -0.591

Dynamic variables names in dplyr function across multiple columns

We could use the .names argument in across() to name the new columns:

mean_fun_multicols <- function(data, group_cols, summary_cols) {
  data %>%
    group_by(across({{ group_cols }})) %>%
    summarise(across({{ summary_cols }},
                     ~ mean(., na.rm = TRUE), .names = "mean_{.col}"),
              .groups = "drop")
}

Testing:

mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt))
# A tibble: 8 × 4
cyl gear mean_mpg mean_wt
<dbl> <dbl> <dbl> <dbl>
1 4 3 21.5 2.46
2 4 4 26.9 2.38
3 4 5 28.2 1.83
4 6 3 19.8 3.34
5 6 4 19.8 3.09
6 6 5 19.7 2.77
7 8 3 15.0 4.10
8 8 5 15.4 3.37

NOTE: In the tidyverse, := is mainly used when assigning to a single, dynamically named column.
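
For contrast, a minimal sketch of that single-column case, where := with a glue-style name works as expected (col is just an illustrative name, not from the question):

col <- "mpg"
mtcars %>%
  summarise("mean_{col}" := mean(.data[[col]]))
# returns a one-row data frame with a single column named mean_mpg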


If we use the OP's original version of the function, multiple summary columns get assigned to a single column name, so the result contains a packed tibble column instead of separate columns. We may need to unpack it:

library(tidyr)
library(tibble)  # for is_tibble(), used below in unpack(where(is_tibble))
> mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt)) %>% str
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
grouped_df [8 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
$ cyl : num [1:8] 4 4 4 6 6 6 8 8
$ gear : num [1:8] 3 4 5 3 4 5 3 5
$ mean_c(mpg, wt): tibble [8 × 2] (S3: tbl_df/tbl/data.frame)
..$ mpg: num [1:8] 21.5 26.9 28.2 19.8 19.8 ...
..$ wt : num [1:8] 2.46 2.38 1.83 3.34 3.09 ...
- attr(*, "groups")= tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
..$ cyl : num [1:3] 4 6 8
..$ .rows: list<int> [1:3]
.. ..$ : int [1:3] 1 2 3
.. ..$ : int [1:3] 4 5 6
.. ..$ : int [1:2] 7 8
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE

> mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt)) %>%
+     unpack(where(is_tibble))
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 8 × 4
# Groups: cyl [3]
cyl gear mpg wt
<dbl> <dbl> <dbl> <dbl>
1 4 3 21.5 2.46
2 4 4 26.9 2.38
3 4 5 28.2 1.83
4 6 3 19.8 3.34
5 6 4 19.8 3.09
6 6 5 19.7 2.77
7 8 3 15.0 4.10
8 8 5 15.4 3.37

dplyr group by colnames described as vector of strings

You can use group_by_at, where you can pass a character vector of column names as group variables:
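
The question's cols object isn't shown in this excerpt; a plausible definition, inferred from the output below, would be:

cols <- c("mpg", "cyl", "disp", "drat", "qsec", "gear", "carb")  # assumed grouping columns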

mtcars %>%
  filter(disp < 160) %>%
  group_by_at(cols) %>%
  summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...

Or you can move the column selection inside group_by_at using vars and column select helper functions:

mtcars %>%
  filter(disp < 160) %>%
  group_by_at(vars(matches('[a-z]{3,}$'))) %>%
  summarise(n = n())

# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...

How can I filter multiple columns with dplyr using string matching for the column name?

You could do that using filter_at with ends_with.

library(dplyr)
nyc_crashes %>%
  # Select columns that end with KILLED or INJURED
  filter_at(vars(c(ends_with("KILLED"), ends_with("INJURED"))),
            # Keep rows where any of these variables is >= 1
            any_vars(. >= 1))
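
Note that filter_at() and any_vars() are superseded in current dplyr; a minimal sketch of the equivalent using if_any(), assuming the same nyc_crashes data, would be:

library(dplyr)
nyc_crashes %>%
  # keep rows where any column ending in KILLED or INJURED is >= 1
  filter(if_any(c(ends_with("KILLED"), ends_with("INJURED")), ~ .x >= 1))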

