Dplyr Group by Colnames Described as Vector of Strings

dplyr group by colnames described as vector of strings

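You can use group_by_at, which accepts a character vector of column names as grouping variables:

library(dplyr)

# cols is not defined in the original snippet; this definition is an assumption that matches the output below
cols <- grep('[a-z]{3,}$', colnames(mtcars), value = TRUE)

mtcars %>%
  filter(disp < 160) %>%
  group_by_at(cols) %>%
  summarise(n = n())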
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...

Or you can move the column selection inside group_by_at using vars and column select helper functions:

mtcars %>%
  filter(disp < 160) %>%
  group_by_at(vars(matches('[a-z]{3,}$'))) %>%
  summarise(n = n())

# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
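
In dplyr 1.0.0 and later, the same helper-based selection can be written with across() instead of group_by_at(vars(...)). A minimal sketch, assuming dplyr >= 1.0.0 is installed; the object names grouped and counts are only illustrative:

```r
library(dplyr)

# Group by every column whose name ends in three or more lowercase letters
grouped <- mtcars %>%
  filter(disp < 160) %>%
  group_by(across(matches('[a-z]{3,}$')))

# Inspect which columns ended up as grouping variables
group_vars(grouped)

# Count rows per group and drop the grouping afterwards
counts <- grouped %>% summarise(n = n(), .groups = "drop")
```

Calling group_vars() before summarising is a quick way to confirm that the regular expression picked up the columns you expected.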

dplyr group_by vector of column names?

With dplyr version >= 1.0.0, we can use group_by together with across:

library(dplyr)
mtcars %>%
  group_by(across(all_of(c('mpg', 'cyl')))) %>%
  tally() %>%
  head(2)
# A tibble: 2 x 3
# Groups: mpg [2]
# mpg cyl n
# <dbl> <dbl> <int>
#1 10.4 8 2
#2 13.3 8 1

With older versions, use group_by_at:

mtcars %>%
  group_by_at(c('mpg', 'cyl')) %>%
  tally() %>%
  head(2)
# A tibble: 2 x 3
# Groups: mpg [2]
# mpg cyl n
# <dbl> <dbl> <int>
#1 10.4 8 2
#2 13.3 8 1
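
If the vector of names might contain columns that are absent from the data, the choice of selection helper matters. A minimal sketch, assuming dplyr >= 1.0.0; the vector wanted and the name "not_a_column" are made up for illustration:

```r
library(dplyr)

# Hypothetical grouping columns; "not_a_column" is deliberately absent from mtcars
wanted <- c("mpg", "cyl", "not_a_column")

# any_of() silently skips names that do not exist in the data
grouped <- mtcars %>% group_by(across(any_of(wanted)))
group_vars(grouped)

# all_of() is strict: a missing name raises an error, caught here for demonstration
strict <- tryCatch(
  mtcars %>% group_by(across(all_of(wanted))),
  error = function(e) "error"
)
```

all_of() is usually the safer default, since a typo in a column name fails loudly instead of quietly changing the grouping.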

Group by multiple columns in dplyr, using string vector input

Since this question was posted, dplyr added scoped versions of group_by. These let you use the same selection helpers you would use with select, like so:

data = data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

library(dplyr)
df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27

The output from your example question is as expected (see comparison to plyr above and output below):

# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998

Note that since dplyr::summarize only strips off one layer of grouping at a time, the resulting tibble is still grouped by the remaining variables, which can sometimes catch people by surprise further down the line. If you want to be safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
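
The leftover grouping can be checked directly with group_vars(). A minimal sketch, assuming dplyr >= 1.0.0 (where summarise() gained the .groups argument); by_cols and the result names are illustrative:

```r
library(dplyr)

by_cols <- c("cyl", "gear")  # illustrative grouping columns

# Default: summarise() drops only the last grouping level, so "cyl" remains
partly_grouped <- mtcars %>%
  group_by(across(all_of(by_cols))) %>%
  summarise(mean_mpg = mean(mpg))
group_vars(partly_grouped)

# .groups = "drop" returns an ungrouped tibble in one step
ungrouped <- mtcars %>%
  group_by(across(all_of(by_cols))) %>%
  summarise(mean_mpg = mean(mpg), .groups = "drop")
group_vars(ungrouped)
```

Both approaches give the same summary values; .groups = "drop" simply saves the separate ungroup() call.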

Pass column names as strings to group_by and summarize

For this you can now use the _at versions of the verbs:

df %>%
  group_by_at(cols2group) %>%
  summarize_at(.vars = col2summarize, .funs = min)

Edit (2021-06-09):

Please see Ronak Shah's answer, which uses across() in place of the _at verbs:

summarize(across(all_of(col2summarize), min))

This is now the preferred option.
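
Put together on a built-in dataset, the across() form of a grouped summary might look like the following. This is a sketch assuming dplyr >= 1.0.0; the column choices for cols2group and col2summarize are illustrative, not taken from the original question:

```r
library(dplyr)

cols2group    <- c("cyl", "am")    # illustrative grouping columns
col2summarize <- c("mpg", "disp")  # illustrative columns to take the minimum of

res <- mtcars %>%
  group_by(across(all_of(cols2group))) %>%                         # group by names given as strings
  summarize(across(all_of(col2summarize), min), .groups = "drop")  # one row per group, minimum of each column
```

Note that summarize() collapses each group to a single row, whereas mutate() with the same across() call would keep every row and repeat the group minimum.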

How to use vector of column names as input into dplyr::group_by()?

You need to use the unquote-splice operator !!!:

aggregate <- function(df, by) {
  df %>% group_by(!!!syms(by)) %>% summarize(a = mean(a))
}

group_key <- c("g1", "g2")

aggregate(df, by = group_key)
# A tibble: 4 x 3
# Groups: g1 [2]
# g1 g2 a
# <dbl> <dbl> <dbl>
#1 1 1 1
#2 1 2 4
#3 2 1 2.5
#4 2 2 5
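
For a self-contained comparison, the splice approach can be set next to the across(all_of()) approach. This is only a sketch: the data frame below is made up (the original question's data is not shown here), and the function names mean_by_syms and mean_by_across are hypothetical, chosen to avoid masking stats::aggregate:

```r
library(dplyr)

# Made-up data for illustration
df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a  = c(1, 4, 2, 5, 3)
)

# Variant 1: convert the strings to symbols and splice them into group_by()
mean_by_syms <- function(df, by) {
  df %>% group_by(!!!syms(by)) %>% summarize(a = mean(a), .groups = "drop")
}

# Variant 2: select the grouping columns with across(all_of())
mean_by_across <- function(df, by) {
  df %>% group_by(across(all_of(by))) %>% summarize(a = mean(a), .groups = "drop")
}

group_key  <- c("g1", "g2")
res_syms   <- mean_by_syms(df, group_key)
res_across <- mean_by_across(df, group_key)
```

Both variants should produce the same group means; the across() form avoids tidy-evaluation operators, which some readers find easier to follow.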

Use vector of columns in custom dplyr function

You don't necessarily need the function, as you can just mutate across the columns and get sums for each category.

library(tidyverse)

dat %>%
  group_by(category) %>%
  mutate(across(ends_with("take"), .fns = list(count = ~sum(. == "yes"))))

Or, if you have a long list, you can store the column names in a character vector and pass it to across via all_of:

vars <- c("intake", "outtake", "pretake")

dat %>%
  group_by(category) %>%
  mutate(across(all_of(vars), .fns = list(count = ~sum(. == "yes"))))

Output

  category intake outtake pretake intake_count outtake_count pretake_count
<chr> <fct> <fct> <fct> <int> <int> <int>
1 a no yes no 0 2 0
2 b no yes yes 0 1 2
3 c no yes no 1 1 0
4 d no yes yes 1 1 2
5 e no yes no 1 1 0
6 f no yes yes 1 1 2
7 g no yes no 1 1 0
8 h no yes yes 1 1 2
9 i no yes no 1 1 0
10 j no yes yes 1 1 2
11 a no yes no 0 2 0
12 b no no yes 0 1 2
13 c yes no no 1 1 0
14 d yes no yes 1 1 2
15 e yes no no 1 1 0
16 f yes no yes 1 1 2
17 g yes no no 1 1 0
18 h yes no yes 1 1 2
19 i yes no no 1 1 0
20 j yes no yes 1 1 2

group_by by a vector of characters using tidy evaluation semantics

There is a group_by_at variant of group_by:

library(dplyr)
group_by <- c('cyl', 'vs')
mtcars %>% group_by_at(group_by) %>% summarise(gear = mean(gear))

The above is a simplified version of the more general form:

mtcars %>% group_by_at(vars(one_of(group_by))) %>% summarise(gear = mean(gear))

Inside vars you can use any of dplyr's ways of selecting variables:

mtcars %>%
  group_by_at(vars(
    one_of(group_by)   # columns from the predefined set
    ,starts_with("a")  # add those starting with "a"
    ,-hp               # but omit this one
    ,vs                # always include this one
    ,contains("_gr_")  # and those containing the string "_gr_"
  )) %>%
  summarise(gear = mean(gear))

How to group according to position in a vector using dplyr

Here is an idea,

library(dplyr)

mywords %>%
  group_by(grp = rep(seq(n()/10), each = 10)) %>%
  count(TheTerms)

which gives,

# A tibble: 4,500 x 3
# Groups: grp [1,000]
grp TheTerms n
<int> <fctr> <int>
1 1 DD 3
2 1 HG 4
3 1 POS 3
4 2 DD 1
5 2 HG 1
6 2 KKL 3
7 2 NNTD 4
8 2 POS 1
9 3 HG 1
10 3 KKL 3
# ... with 4,490 more rows
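
The rep(seq(n()/10), each = 10) construction relies on the row count being a multiple of 10. A variant based on integer division also handles an incomplete final block. This is a sketch with made-up data standing in for mywords:

```r
library(dplyr)

# Made-up stand-in for the original data: 25 terms, so the last block is incomplete
mywords <- tibble(TheTerms = rep(c("DD", "HG", "POS", "KKL", "NNTD"), times = 5))

block_counts <- mywords %>%
  # integer division assigns rows 1-10 to block 1, rows 11-20 to block 2, and so on
  group_by(grp = (row_number() - 1) %/% 10 + 1) %>%
  count(TheTerms)
```

Here the 25 rows fall into three blocks, the last of which holds only five rows.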

How to check if a vector contained in list column of a data frame with dplyr

Do you want to check for any value in wanted_status or all of them? The expected output suggests all.

library(dplyr)

wanted_status <- c("x+", "y-")

dat %>%
  group_by(cell) %>%
  summarise(contained = if(all(wanted_status %in% status)) 'in' else 'out')

# cell contained
# <chr> <chr>
#1 A in
#2 B out
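
To see how the "any" and "all" interpretations differ, both checks can be computed side by side. The data below is made up for illustration and only assumed to resemble the question's cell and status columns:

```r
library(dplyr)

# Made-up data: cell A has both wanted statuses, cell B only one of them
dat <- tibble(
  cell   = c("A", "A", "A", "B", "B"),
  status = c("x+", "y-", "z+", "x+", "z-")
)
wanted_status <- c("x+", "y-")

res <- dat %>%
  group_by(cell) %>%
  summarise(
    has_all = all(wanted_status %in% status),  # every wanted status is present
    has_any = any(wanted_status %in% status)   # at least one wanted status is present
  )
```

With this data, cell B passes the "any" check but fails the "all" check, which is the case where the two interpretations diverge.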

