Using Variable Column Names in Dplyr Summarise

How to use summarise from dplyr with dynamic column names?

1) Use !!sym(...) like this:

colOfInterest <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarize(avg = mean(!!sym(colOfInterest))) %>%
ungroup

giving:

# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59

2) A second approach is:

colOfInterest <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarize(avg = mean(.data[[colOfInterest]])) %>%
ungroup

Of course this is straight forward in base R:

aggregate(list(avg = iris[[colOfInterest]]), iris["Species"], mean)

using variable column names in dplyr summarise

You can use base::get:

df %>% summarise(mean(get(x[1])) - mean(get(x[2])))

# # A tibble: 2 x 2
# c `mean(a) - mean(b)`
# <dbl> <dbl>
# 1 1 -1
# 2 2 -1

get will search in current environment by default.

As the error message says, mean expects a logical or numeric object, as.name returns a name:

class(as.name("a")) # [1] "name"

You could evaluate your name, that would work as well :

df %>% summarise(mean(eval(as.name(x[1]))) - mean(eval(as.name(x[2]))))
# # A tibble: 2 x 2
# c `mean(eval(as.name(x[1]))) - mean(eval(as.name(x[2])))`
# <dbl> <dbl>
# 1 1 -1
# 2 2 -1

Dplyr Summarise Groups as Column Names

There's lots of ways to go about it, but I would simplify it by pivoting to a longer data frame initially, and then grouping by var and group. Then you can just pivot wider to get the final result you want. Note that I used summarize(across()) which replaces the deprecated summarize_all(), even though with a single column could've just manually specified Mean = ... and Sum = ....

set.seed(123)

test_df %>%
pivot_longer(
var1:var2,
names_to = "var"
) %>%
group_by(Group, var) %>%
summarize(
across(
everything(),
list(Mean = mean, Sum = sum),
.names = "{.fn}"
),
.groups = "drop"
) %>%
pivot_wider(
names_from = "Group",
values_from = c(Mean, Sum),
names_glue = "{Group}_{.value}"
)
#> # A tibble: 2 × 7
#> var A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
#> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
#> 1 var1 1 2.5 3.2 1 10 16
#> 2 var2 5 4.5 4.4 5 18 22

Variables to summarise data in dplyr and R statistics: Refer to column names stored as strings with the `.data` pronoun:

using the outline in the comments by Luis. Translating into my function:

summariseData <- function(df, column_to_summerise, target, kpi_target)
{
column_to_summerise <- enquo(column_to_summerise)

calc_df <- df %>%
group_by(Date_Received) %>%
dplyr:: summarise(med=median(!!column_to_summerise, na.rm = TRUE),
per95=quantile(!!column_to_summerise, probs = kpi_target, na.rm = TRUE),
In_Target = sum(!!column_to_summerise <= target, na.rm = TRUE),
Out_Target = sum(!!column_to_summerise > target, na.rm = TRUE),
Total_Data = n())
return(calc_df)
}

Key point is to use the enquo() function prior to using !!

From (https://dplyr.tidyverse.org/articles/programming.html) By
analogy to strings, we don’t want "", instead we want some function
that turns an argument into a string. That’s the job of enquo().
enquo() uses some dark magic to look at the argument, see what the
user typed, and return that value as a quosure

In dplyr (and in tidyeval in general) you use !! to say that you want
to unquote an input so that it’s evaluated

R: How to summarize and group by variables as column names

We can use across from the new version of dplyr

library(dplyr)
df %>%
group_by(across(colums_to_group)) %>%
summarise(across(all_of(columns_to_sum), sum, na.rm = TRUE), .groups = 'drop')
# A tibble: 2 x 3
# A B C
# <chr> <int> <int>
#1 X 6 21
#2 Y 9 19

In the previous version, we could use group_by_at along with summarise_at

df %>%
group_by_at(colums_to_group) %>%
summarise_at(vars(columns_to_sum), sum, na.rm = TRUE)

R dplyr how to select variables by column number rather than column name with summarise

Making use of the .data pronoun from rlang you could write a custom function which takes a dataframe, the names of two variables and some additional grouping variables and computes your desired summary table like so:

library(dplyr)
library(Hmisc)

summary_table <- function(.data, x, y, ...) {
.data %>%
group_by(...) %>% # Group species
summarise(n = n(), # number of records
WtMn = wtd.mean(.data[[x]], .data[[y]]), # weighted mean
WtSd = sqrt(wtd.var(.data[[x]], .data[[y]])), # weighted SD
WtCV = WtMn/WtSd, # weighted CV
Minm = min(.data[[x]]), # minumum
Wp05 = wtd.quantile(.data[[x]], .data[[y]] , 0.05), # p05
Wp50 = wtd.quantile(.data[[x]], .data[[y]] , 0.50), # p50
Wp95 = wtd.quantile(.data[[x]], .data[[y]] , 0.95), # p95
Wp975 = wtd.quantile(.data[[x]], .data[[y]] , 0.975), # p975
Wp99 = wtd.quantile(.data[[x]], .data[[y]] , 0.99), # p99
Maxm = max(.data[[x]]) # maximum
)
}

summary_table(iris, "Sepal.Length", "Petal.Width", Species)
#> # A tibble: 3 x 12
#> Species n WtMn WtSd WtCV Minm Wp05 Wp50 Wp95 Wp975 Wp99 Maxm
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 50 5.05 0.356 14.2 4.3 4.61 5.06 5.62 5.70 5.72 5.8
#> 2 versicolor 50 5.98 0.508 11.8 4.9 5.13 6 6.80 6.97 7 7
#> 3 virginica 50 6.61 0.626 10.6 4.9 5.8 6.5 7.7 7.7 7.9 7.9

summary_table(iris, "Sepal.Width", "Petal.Width", Species)
#> # A tibble: 3 x 12
#> Species n WtMn WtSd WtCV Minm Wp05 Wp50 Wp95 Wp975 Wp99 Maxm
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 50 3.47 0.399 8.69 2.3 3.06 3.46 4.27 4.4 4.4 4.4
#> 2 versicolor 50 2.80 0.310 9.04 2 2.3 2.86 3.20 3.37 3.4 3.4
#> 3 virginica 50 3.00 0.320 9.38 2.2 2.5 3 3.6 3.8 3.8 3.8

Dynamic variables names in dplyr function across multiple columns

We could use .names in across to rename

mean_fun_multicols <- function(data, group_cols, summary_cols) {
data %>%
group_by(across({{group_cols}})) %>%
summarise(across({{ summary_cols }},
~ mean(., na.rm = TRUE), .names = "mean_{.col}"), .groups = "drop")
}

-testing

mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt))
# A tibble: 8 × 4
cyl gear mean_mpg mean_wt
<dbl> <dbl> <dbl> <dbl>
1 4 3 21.5 2.46
2 4 4 26.9 2.38
3 4 5 28.2 1.83
4 6 3 19.8 3.34
5 6 4 19.8 3.09
6 6 5 19.7 2.77
7 8 3 15.0 4.10
8 8 5 15.4 3.37

NOTE: The := is mainly used when there is a single column in tidyverse


If we use the OP's function, we are assigning multiple columns to a single column and this returns a tibble instead of a normal column. We may need to unpack

library(tidyr)
> mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt)) %>% str
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
grouped_df [8 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
$ cyl : num [1:8] 4 4 4 6 6 6 8 8
$ gear : num [1:8] 3 4 5 3 4 5 3 5
$ mean_c(mpg, wt): tibble [8 × 2] (S3: tbl_df/tbl/data.frame)
..$ mpg: num [1:8] 21.5 26.9 28.2 19.8 19.8 ...
..$ wt : num [1:8] 2.46 2.38 1.83 3.34 3.09 ...
- attr(*, "groups")= tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
..$ cyl : num [1:3] 4 6 8
..$ .rows: list<int> [1:3]
.. ..$ : int [1:3] 1 2 3
.. ..$ : int [1:3] 4 5 6
.. ..$ : int [1:2] 7 8
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE

> mean_fun_multicols(mtcars, c(cyl, gear), c(mpg, wt)) %>%
unpack(where(is_tibble))
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 8 × 4
# Groups: cyl [3]
cyl gear mpg wt
<dbl> <dbl> <dbl> <dbl>
1 4 3 21.5 2.46
2 4 4 26.9 2.38
3 4 5 28.2 1.83
4 6 3 19.8 3.34
5 6 4 19.8 3.09
6 6 5 19.7 2.77
7 8 3 15.0 4.10
8 8 5 15.4 3.37

standard evaluation in dplyr: summarise a variable given as a character string

dplyr 1.0 has changed pretty much everything about this question as well as all of the answers. See the dplyr programming vignette here:

https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html

The new way to refer to columns when their identifier is stored as a character vector is to use the .data pronoun from rlang, and then subset as you would in base R.

library(dplyr)

key <- "v3"
val <- "v2"
drp <- "v1"

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

df %>%
select(-matches(drp)) %>%
group_by(.data[[key]]) %>%
summarise(total = sum(.data[[val]], na.rm = TRUE))

#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> v3 total
#> <chr> <int>
#> 1 A 21
#> 2 B 19

If your code is in a package function, you can @importFrom rlang .data to avoid R check notes about undefined globals.



Related Topics



Leave a reply



Submit