Apply a summarise condition to a range of columns when using dplyr group_by?
dplyr 1.0.0 introduces the across() function, which does what you are asking for.
Basic usage
across() has two primary arguments:
- The first argument, .cols, selects the columns you want to operate on. It uses tidy selection (like select()) so you can pick variables by position, name, and type.
- The second argument, .fns, is a function or list of functions to apply to each column. This can also be a purrr-style formula (or list of formulas) like ~ .x / 2. (This argument is optional, and you can omit it if you just want to get the underlying data; you'll see that technique used in vignette("rowwise").)
# across() requires dplyr >= 1.0.0. If your installed version is older, update it,
# for example from the development version on GitHub:
# install.packages("devtools")
# devtools::install_github("tidyverse/dplyr")
library(dplyr, warn.conflicts = FALSE)
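As a minimal sketch of the two arguments described above, the following uses the built-in iris data. The column choices and the object names by_name and halved are only for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# .cols selects columns by name; .fns (here mean) is applied to each of them.
by_name <- iris %>%
  group_by(Species) %>%
  summarise(across(c(Sepal.Length, Sepal.Width), mean), .groups = "drop")

# .cols selects columns by type; .fns is a purrr-style formula.
halved <- iris %>%
  mutate(across(where(is.numeric), ~ .x / 2))
```

In the first call the grouping column Species is kept automatically and is never passed to .fns.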
Control how the names are created with the .names argument, which takes a glue spec:
iris %>%
  group_by(Species) %>%
  summarise(
    # {.col} is the documented placeholder for the column name
    across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"),
    across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{.col}")
  )
#> # A tibble: 3 x 5
#>   Species    mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct>                 <dbl>            <dbl>            <dbl>            <dbl>
#> 1 setosa                 3.43             1.46            0.246              5.8
#> 2 versicolor             2.77             4.26            1.33               7
#> 3 virginica              2.97             5.55            2.03               7.9
Using multiple functions
# Named list: the names become the {.fn} part of the output column names
my_func <- list(
  mean = ~ mean(., na.rm = TRUE),
  max = ~ max(., na.rm = TRUE)
)
iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), my_func, .names = "{.fn}.{.col}"))
#> # A tibble: 3 x 9
#>   Species    mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct>                  <dbl>            <dbl>            <dbl>           <dbl>
#> 1 setosa                  5.01              5.8             3.43             4.4
#> 2 versicolor              5.94              7               2.77             3.4
#> 3 virginica               6.59              7.9             2.97             3.8
#>   mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> *             <dbl>            <dbl>            <dbl>           <dbl>
#> 1              1.46              1.9            0.246             0.6
#> 2              4.26              5.1            1.33              1.8
#> 3              5.55              6.9            2.03              2.5
Created on 2020-03-06 by the reprex package (v0.3.0)
Applying group_by and summarise on data while keeping all the columns' info
Here are two options using a) filter and b) slice from dplyr. In this case there are no duplicated minimum values in column c for any of the groups, so the results of a) and b) are the same. If there were duplicated minima, approach a) would return every row that ties for the minimum in its group, while b) would return only one row (the first minimum) per group.
a)
> data %>% group_by(b) %>% filter(c == min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#    a b   c     d
#1   1 a 1.2 small
#2   4 b 1.7  larg
#3   6 c 3.1   med
#4  10 d 2.2   med
Or similarly
> data %>% group_by(b) %>% filter(min_rank(c) == 1L)
#Source: local data frame [4 x 4]
#Groups: b
#
#    a b   c     d
#1   1 a 1.2 small
#2   4 b 1.7  larg
#3   6 c 3.1   med
#4  10 d 2.2   med
b)
> data %>% group_by(b) %>% slice(which.min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#    a b   c     d
#1   1 a 1.2 small
#2   4 b 1.7  larg
#3   6 c 3.1   med
#4  10 d 2.2   med
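The difference between a) and b) only shows up when a group has tied minima. A small sketch follows; the data frame ties and its values are made up for illustration and are not the asker's data.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: group "a" contains its minimum value of c twice.
ties <- data.frame(
  b = c("a", "a", "a", "b", "b"),
  c = c(1.2, 1.2, 3.0, 1.7, 2.5)
)

# a) filter() keeps every row that equals the group minimum.
with_filter <- ties %>% group_by(b) %>% filter(c == min(c)) %>% ungroup()

# b) slice(which.min()) keeps only the first minimum in each group.
with_slice <- ties %>% group_by(b) %>% slice(which.min(c)) %>% ungroup()

nrow(with_filter)  # 3: both tied rows of group "a" plus one row of group "b"
nrow(with_slice)   # 2: exactly one row per group
```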
Dplyr: using summarise across to take mean of columns only if row value 0
What you want is:
library(tidyverse)
df %>%
  group_by(Clusters) %>%
  summarize(across(everything(), ~ mean(.[. > 0])))
~mean(. > 0) checks whether each element is greater than 0, so it returns TRUE/FALSE and then gives you the mean of the underlying 0/1 values (the share of positive entries). Instead, you want to filter each column first, which can be done with the usual [] subsetting.
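A base R sketch of the difference, using a made-up vector x:

```r
x <- c(0, 0, 2, 4)

# Logical values are coerced to 0/1, so this is the share of positive entries.
share_positive <- mean(x > 0)        # 0.5

# Subset first, then average only the positive entries.
mean_of_positive <- mean(x[x > 0])   # 3

# If a column has no positive entries, the subset is empty and the mean is NaN.
no_positive <- mean(c(0, 0)[c(0, 0) > 0])
```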
Summarize all group values and a conditional subset in the same call
Writing up @hadley's comment as an answer
# On a database backend, dbplyr translates if/else into a SQL CASE WHEN
df_sqlite %>%
  group_by(ID) %>%
  mutate(Bfoo = if (A == "foo") B else 0) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(Bfoo)) %>%
  collect()
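On a local data frame, if is not vectorised, so the same idea is usually written with subsetting (or if_else()) instead. A sketch with invented data; df_local and its values are hypothetical, and nothing here is assumed about how a database backend would translate the subsetting.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical local stand-in for the database table.
df_local <- data.frame(
  ID = c(1, 1, 2, 2),
  A  = c("foo", "bar", "foo", "foo"),
  B  = c(10, 20, 1, 2),
  stringsAsFactors = FALSE
)

res <- df_local %>%
  group_by(ID) %>%
  summarise(
    sumB    = sum(B),              # total over the whole group
    sumBfoo = sum(B[A == "foo"]),  # total over the rows where A is "foo"
    .groups = "drop"
  )
```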
How to summarize across multiple columns with condition on another (grouped) column with dplyr?
Use another across() to get the corresponding values in columns a:c where j is at its minimum.
library(dplyr)
myDF %>%
  group_by(i) %>%
  summarize(across(where(is.numeric), median, .names = "med_{.col}"),
            # pick the value in each of a:c from the row where j is smallest
            across(a:c, ~ .[which.min(j)], .names = "best_{.col}"))
#      i  med_j med_a med_b med_c best_a best_b best_c
#* <int>  <dbl> <int> <int> <int>  <int>  <int>  <int>
#1     1  0.217     4     7     4      7      7      4
#2     2  0.689     6     6     6      8      6      8
#3     3 -0.213     5     2     7      9      1      7
To do it in the same across() statement:
myDF %>%
  group_by(i) %>%
  summarize(across(where(is.numeric),
                   list(med = median,
                        best = ~ .[which.min(j)]),
                   .names = "{.fn}_{.col}"))
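Since myDF is not shown in the answer, here is a self-contained sketch of the same technique on a made-up data frame. The values are hypothetical and only one value column (a) is used to keep it short.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-in for myDF: i is the group, j the score to minimise.
myDF <- data.frame(
  i = c(1L, 1L, 1L, 2L, 2L, 2L),
  j = c(0.5, -1.0, 0.2, 2.0, 1.5, 3.0),
  a = c(4L, 7L, 1L, 6L, 8L, 2L)
)

res <- myDF %>%
  group_by(i) %>%
  summarise(
    across(a,
           list(med = median, best = ~ .x[which.min(j)]),
           .names = "{.fn}_{.col}"),
    .groups = "drop"
  )
# med_a is the group median of a; best_a is the value of a in the row
# where j is smallest within the group.
```

The lambda can refer to j because across() evaluates it inside the grouped data, so j is the current group's column.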
How to use group_by with mean and sum in dplyr?
If I understood correctly, this might help you
#Libraries
library(tidyverse)
#Data
df <-
tibble::tribble(
~Year, ~School.Name, ~Student.Score1, ~Student.Score2,
2019L, "ISD 1", 1L, NA,
2020L, "ISD 4", 4L, 2L,
2020L, "ISD 3", NA, 3L,
2018L, "ISD 1", 4L, NA,
2019L, "ISD 4", 2L, 5L,
2020L, "ISD 4", 3L, 2L,
2019L, "ISD 3", NA, 1L,
2018L, "ISD 1", 2L, 4L
)
#How to
df %>%
  group_by(Year, School.Name) %>%
  summarise(
    n = n(),
    across(.cols = contains(".Score"), .fns = function(x) mean(x, na.rm = TRUE))
  )
# A tibble: 6 x 5
# Groups:   Year [3]
#    Year School.Name     n Student.Score1 Student.Score2
#   <int> <chr>       <int>          <dbl>          <dbl>
# 1  2018 ISD 1           2            3                4
# 2  2019 ISD 1           1            1              NaN
# 3  2019 ISD 3           1          NaN                1
# 4  2019 ISD 4           1            2                5
# 5  2020 ISD 3           1          NaN                3
# 6  2020 ISD 4           2            3.5              2
Using dplyr summarise with conditions
We could keep all(Status) as the second argument in summarise (or change the column name), so that Price[Status] in the first argument still refers to the original column and not the summarised value. It can also be done with if/else, as the logic returns a single TRUE/FALSE based on whether all of the 'Status' values are TRUE or not
df %>%
  group_by(ID) %>%
  summarise(Test = if (all(Status)) first(Price[Status]) else first(Price[!Status]),
            Status = all(Status))
# A tibble: 3 x 3
#     ID  Test Status
#  <dbl> <dbl> <lgl>
#1     1     5 FALSE
#2     2     0 TRUE
#3     3     7 FALSE
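NOTE: It is better not to use ifelse when its arguments have unequal lengths. Its result always has the length of the test, so a length-one condition returns only a single element.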
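A base R sketch of why the lengths matter, using made-up values:

```r
cond   <- TRUE          # a single TRUE/FALSE, like all(Status)
prices <- c(5, 7, 9)    # hypothetical values

# ifelse() returns a result as long as the test, so only the first price survives.
from_ifelse <- ifelse(cond, prices, 0)    # 5

# if/else returns the chosen branch unchanged.
from_if <- if (cond) prices else 0        # 5 7 9
```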
Applying group_by and summarise(sum) but keep a large number of additional columns
We can create a column with mutate and then apply distinct:
library(dplyr)
df %>%
  group_by(location) %>%
  mutate(count = sum(count)) %>%
  select(-date) %>%
  distinct(location, important_1, important_30, .keep_all = TRUE)
If there are multiple column names, we can also use syms to convert them to symbols and evaluate them with !!!:
df %>%
  group_by(location) %>%
  mutate(count = sum(count)) %>%
  select(-date) %>%
  distinct(location, !!! rlang::syms(names(.)[startsWith(names(.), 'important')]), .keep_all = TRUE)
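Because df is not shown in the answer, here is a sketch of the mutate-then-distinct pattern on invented data. The column names follow the answer, the values are hypothetical, and the ungroup() call is an addition so the result is a plain tibble.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: two dated rows for location "A", one for "B".
df <- data.frame(
  location     = c("A", "A", "B"),
  date         = as.Date(c("2020-01-01", "2020-01-02", "2020-01-01")),
  count        = c(1, 2, 5),
  important_1  = c("x", "x", "y"),
  important_30 = c("p", "p", "q"),
  stringsAsFactors = FALSE
)

out <- df %>%
  group_by(location) %>%
  mutate(count = sum(count)) %>%   # every row now carries its group total
  ungroup() %>%
  select(-date) %>%
  distinct(location, important_1, important_30, .keep_all = TRUE)
```

After mutate() the rows within a location differ only by date, so dropping date and keeping distinct rows leaves one row per location with the summed count and all other columns intact.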