Apply a Summarise Condition to a Range of Columns When Using Dplyr Group_By

Apply a summarise condition to a range of columns when using dplyr group_by?

dplyr 1.0.0 and later provide the across() function, which does what you are asking for. From its documentation:

Basic usage

across() has two primary arguments:

  • The first argument, .cols, selects the columns you want to operate on.
    It uses tidy selection (like select()) so you can pick variables by
    position, name, and type.
  • The second argument, .fns, is a function or list of functions to apply to
    each column. This can also be a purrr style formula (or list of formulas)
    like ~ .x / 2. (This argument is optional, and you can omit it if you just want
    to get the underlying data; you'll see that technique used in
    vignette("rowwise").)

# across() requires dplyr 1.0.0 or later, which is available on CRAN
# install.packages("dplyr")
library(dplyr, warn.conflicts = FALSE)
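
As a minimal sketch of the two arguments described above, the following applies the purrr-style formula ~ .x / 2 to every numeric column of the built-in iris data. The name halved is only illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

# .cols = where(is.numeric) selects the four measurement columns;
# .fns  = ~ .x / 2 halves each of them. Species (a factor) is untouched.
halved <- iris %>%
  mutate(across(where(is.numeric), ~ .x / 2))

head(halved, 3)
```
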

Control how the names are created with the .names argument which takes a glue spec:

iris %>%
  group_by(Species) %>%
  summarise(
    across(Sepal.Width:Petal.Width, ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"),
    across(Sepal.Length, ~ max(.x, na.rm = TRUE), .names = "max_{.col}")
  )
#> # A tibble: 3 x 5
#> Species mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 1.46 0.246 5.8
#> 2 versicolor 2.77 4.26 1.33 7
#> 3 virginica 2.97 5.55 2.03 7.9

Using multiple functions

my_func <- list(
  mean = ~ mean(., na.rm = TRUE),
  max = ~ max(., na.rm = TRUE)
)

iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), my_func, .names = "{.fn}.{.col}"))
#> # A tibble: 3 x 9
#> Species mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 5.8 3.43 4.4
#> 2 versicolor 5.94 7 2.77 3.4
#> 3 virginica 6.59 7.9 2.97 3.8
#> mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1.46 1.9 0.246 0.6
#> 2 4.26 5.1 1.33 1.8
#> 3 5.55 6.9 2.03 2.5

Created on 2020-03-06 by the reprex package (v0.3.0)

Applying group_by and summarise on data while keeping all the columns' info

Here are two options using a) filter and b) slice from dplyr. In this case no group has duplicated minimum values in column c, so a) and b) give the same result. If there were duplicated minima, approach a) would return every row tied for the minimum in each group, while b) would return only one row (the first minimum) per group.

a)

> data %>% group_by(b) %>% filter(c == min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
# a b c d
#1 1 a 1.2 small
#2 4 b 1.7 larg
#3 6 c 3.1 med
#4 10 d 2.2 med

Or similarly

> data %>% group_by(b) %>% filter(min_rank(c) == 1L)
#Source: local data frame [4 x 4]
#Groups: b
#
# a b c d
#1 1 a 1.2 small
#2 4 b 1.7 larg
#3 6 c 3.1 med
#4 10 d 2.2 med

b)

> data %>% group_by(b) %>% slice(which.min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
# a b c d
#1 1 a 1.2 small
#2 4 b 1.7 larg
#3 6 c 3.1 med
#4 10 d 2.2 med
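
The difference between a) and b) only shows up with ties, so here is a small sketch on a made-up data frame (ties is hypothetical, not the asker's data) in which group "a" has two rows sharing the minimum of c. It also shows slice_min(), which in dplyr 1.0.0 and later exposes the same choice through its with_ties argument.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: group "a" has two rows tied for the minimum of c.
ties <- tibble(
  b = c("a", "a", "a", "b", "b"),
  c = c(1.2, 1.2, 3.0, 1.7, 2.5)
)

# filter() keeps every row equal to the group minimum (both tied rows).
by_filter <- ties %>% group_by(b) %>% filter(c == min(c))

# slice(which.min()) keeps only the first minimum in each group.
by_slice <- ties %>% group_by(b) %>% slice(which.min(c))

# slice_min() with with_ties = FALSE also keeps one row per group.
by_slice_min <- ties %>% group_by(b) %>% slice_min(c, n = 1, with_ties = FALSE)

nrow(by_filter)     # 3
nrow(by_slice)      # 2
nrow(by_slice_min)  # 2
```
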

Dplyr: using summarise across to take mean of columns only if row value > 0

What you want is:

library(tidyverse)
df %>%
  group_by(Clusters) %>%
  summarize(across(everything(), ~ mean(.[. > 0])))

~mean(. > 0) tests whether each element is greater than 0, which yields TRUE/FALSE, and then takes the mean of the underlying 1/0 values, i.e. the proportion of positive elements. Instead, you want to subset each column to its positive values first, which the usual [] indexing does.
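
To make the distinction concrete, here is a small sketch on invented values (x and df_demo are hypothetical). Note that a group with no positive values yields NaN, because the mean of an empty vector is NaN.

```r
library(dplyr, warn.conflicts = FALSE)

x <- c(0, 2, 4, 0)
prop_positive <- mean(x > 0)     # mean of TRUE/FALSE: 0.5
mean_positive <- mean(x[x > 0])  # mean of 2 and 4: 3

# Hypothetical grouped data with the same structure as the question.
df_demo <- tibble(
  Clusters = c(1, 1, 2, 2),
  v1       = c(0, 2, 4, 0),
  v2       = c(1, 3, 0, 0)
)

res <- df_demo %>%
  group_by(Clusters) %>%
  summarise(across(everything(), ~ mean(.x[.x > 0])))

res
```
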

Summarize all group values and a conditional subset in the same call

Writing up @hadley's comment as an answer

df_sqlite %>%
  group_by(ID) %>%
  mutate(Bfoo = if (A == "foo") B else 0) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(Bfoo)) %>%
  collect()
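
The pipeline above runs against a database table, where (as far as I understand) the if expression is translated to SQL rather than evaluated by R. On an ordinary in-memory data frame, if is not vectorised, so a rough equivalent is to subset inside sum(). The data frame df_local below is invented for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical in-memory data with the same column names as the question.
df_local <- tibble(
  ID = c(1, 1, 2, 2),
  A  = c("foo", "bar", "foo", "foo"),
  B  = c(10, 20, 30, 40)
)

res <- df_local %>%
  group_by(ID) %>%
  summarise(
    sumB    = sum(B),             # all values in the group
    sumBfoo = sum(B[A == "foo"])  # only rows where A is "foo"
  )

res
```
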

How to summarize across multiple columns with condition on another (grouped) column with dplyr?

Use a second across to get the values of columns a:c in the row where j is at its minimum.

library(dplyr)

myDF %>%
  group_by(i) %>%
  summarize(across(where(is.numeric), median, .names = "med_{.col}"),
            across(a:c, ~ .[which.min(j)], .names = "best_{.col}"))

# i med_j med_a med_b med_c best_a best_b best_c
#* <int> <dbl> <int> <int> <int> <int> <int> <int>
#1 1 0.217 4 7 4 7 7 4
#2 2 0.689 6 6 6 8 6 8
#3 3 -0.213 5 2 7 9 1 7

To do it in the same across statement:

myDF %>%
  group_by(i) %>%
  summarize(across(where(is.numeric),
                   list(med = median,
                        best = ~ .[which.min(j)]),
                   .names = "{.fn}_{.col}"))
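
Since myDF itself is not shown, here is a self-contained sketch of the same idea on a tiny invented data frame (myDF_demo is hypothetical and has only one value column, a). The formula can refer to j because across() evaluates it inside the current group.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-in for myDF: i is the group, j the score to minimise.
myDF_demo <- tibble(
  i = c(1L, 1L, 2L, 2L),
  j = c(0.5, -0.2, 1.0, 2.0),
  a = c(4L, 7L, 6L, 8L)
)

res <- myDF_demo %>%
  group_by(i) %>%
  summarise(across(a,
                   list(med  = median,
                        best = ~ .x[which.min(j)]),  # value of a where j is smallest
                   .names = "{.fn}_{.col}"))

res
```
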

How to use group_by with mean and sum in dplyr?

If I understood correctly, this might help you

#Libraries

library(tidyverse)
library(lubridate)

#Data

df <-
tibble::tribble(
~Year, ~School.Name, ~Student.Score1, ~Student.Score2,
2019L, "ISD 1", 1L, NA,
2020L, "ISD 4", 4L, 2L,
2020L, "ISD 3", NA, 3L,
2018L, "ISD 1", 4L, NA,
2019L, "ISD 4", 2L, 5L,
2020L, "ISD 4", 3L, 2L,
2019L, "ISD 3", NA, 1L,
2018L, "ISD 1", 2L, 4L
)

#How to

df %>%
  group_by(Year, School.Name) %>%
  summarise(
    n = n(),
    across(.cols = contains(".Score"), .fns = function(x) mean(x, na.rm = TRUE))
  )

# A tibble: 6 x 5
# Groups: Year [3]
Year School.Name n Student.Score1 Student.Score2
<int> <chr> <int> <dbl> <dbl>
1 2018 ISD 1 2 3 4
2 2019 ISD 1 1 1 NaN
3 2019 ISD 3 1 NaN 1
4 2019 ISD 4 1 2 5
5 2020 ISD 3 1 NaN 3
6 2020 ISD 4 2 3.5 2

Using dplyr summarise with conditions

We can keep all(Status) as the second argument in summarise (or give it a different column name), so that Test is still computed from the original Status column. The logic returns a single TRUE/FALSE per group, depending on whether every value of 'Status' is TRUE, so a plain if/else works:

df %>%
  group_by(ID) %>%
  summarise(Test = if (all(Status)) first(Price[Status]) else first(Price[!Status]),
            Status = all(Status))
# A tibble: 3 x 3
# ID Test Status
# <dbl> <dbl> <lgl>
#1 1 5 FALSE
#2 2 0 TRUE
#3 3 7 FALSE

NOTE: It is better not to use ifelse when its arguments have unequal lengths, because the result always has the length of the test.
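
A short sketch of why the lengths matter, using invented vectors (status and price are hypothetical): with a length-1 test, ifelse() silently returns only the first element of the chosen branch, while if/else returns the branch unchanged.

```r
status <- c(TRUE, TRUE, TRUE)
price  <- c(5, 6, 7)

# The test all(status) has length 1, so the result has length 1 as well.
from_ifelse <- ifelse(all(status), price, 0)

# if/else returns whichever branch is selected, at its full length.
from_if <- if (all(status)) price else 0

from_ifelse  # 5
from_if      # 5 6 7
```
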

Applying group_by and summarise(sum) but keep a large number of additional columns

We can create a column with mutate and then apply distinct

library(dplyr)
df %>%
  group_by(location) %>%
  mutate(count = sum(count)) %>%
  select(-date) %>%
  distinct(location, important_1, important_30, .keep_all = TRUE)

If there are many such columns, we can also build their names programmatically, convert them to symbols with syms, and splice them into the call with !!!

df %>%
  group_by(location) %>%
  mutate(count = sum(count)) %>%
  select(-date) %>%
  distinct(location, !!!rlang::syms(names(.)[startsWith(names(.), 'important')]), .keep_all = TRUE)
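
With dplyr 1.0.0 or later, a possible alternative to the syms approach is to select the columns with across() inside distinct(). The sketch below assumes a made-up data frame (df_demo) with the column names used above; it is one way to do this, not necessarily the original answerer's preference.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data with the column names from the question.
df_demo <- tibble(
  location     = c("x", "x", "y"),
  date         = as.Date(c("2020-01-01", "2020-01-02", "2020-01-01")),
  count        = c(1, 2, 5),
  important_1  = c("p", "p", "q"),
  important_30 = c("u", "u", "v")
)

res <- df_demo %>%
  group_by(location) %>%
  mutate(count = sum(count)) %>%  # per-location total, repeated on each row
  ungroup() %>%
  select(-date) %>%
  # across() picks every column whose name starts with "important"
  distinct(location, across(starts_with("important")), .keep_all = TRUE)

res
```
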

