Variable Results with Dplyr Summarise, Depending on Output Variable Naming

Variable results with dplyr summarise, depending on output variable naming

The transformations you specify in summarize are performed in the order they appear, that means if you change variable values, then those new values appear for the subsequent columns (this is different from the base function tranform()). When you do

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

The glucose=mean(glucose, na.rm=TRUE) part has changed the value of the glucose variable such that when you calculate the glucose.sd=sd(glucose, na.rm=TRUE) part, the sd() does not see the original glucose values, it see the new value that is the mean of the original values. If you re-order the columns, it will work.

df %>% group_by(time) %>%
  summarise(glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)), 
        glucose=mean(glucose, na.rm=TRUE))

If you are wondering why this is the default behavior, this is because it is often nice to create a column and then use that column value later in the transformations. For example, with mutate()

df %>% group_by(time) %>%
  mutate(glucose_sq = glucose^2,
        glucose_sq_plus2 = glucose_sq+2)

Using dplyr to summarize and keep the same variable name

A possible solution is to skip the mutate steps and use transmute for the first mutate/select-step and directly calculate the desired variables from the original variables without creating an intermediate variable for the second mutate-step:

df %>% 
  transmute(Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total) %>% 
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count/sum(Count)),
                        Weighted_Avg_Total = sum((Count/sum(Count))*Total)))

which gives:

  Group Count_Dist Weighted_Avg_Total
1     A 0.09345794            50.0000
2     B 0.14018692           300.0000
3     C 0.11214953           600.0000
4     D 0.18691589           400.0000
5     E 0.46728972          1000.0000
6   All 1.00000000           656.0748

Another possible solution is to alter the order in which the new variables are calculated in dplyr and then use select to get the column-order back into what you originally wanted:

df %>% 
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Weighted_Avg_Total = Total) %>% 
  bind_rows(df %>%
              mutate(Count_Dist = Count/sum(Count)) %>%
              summarize(Group = "All",
                        Weighted_Avg_Total = sum(Count_Dist*Total),
                        Count_Dist = sum(Count_Dist)) %>% 
              select(Group, Count_Dist, Weighted_Avg_Total))

If you want to include the Count-column as well, you could do (based on my comment from below):

df %>% 
  transmute(Group = Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total, Count) %>% 
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count/sum(Count)),
                        Weighted_Avg_Total = sum((Count/sum(Count))*Total),
                        Count = sum(Count)))

standard evaluation in dplyr: summarise a variable given as a character string

dplyr 1.0 has changed pretty much everything about this question as well as all of the answers. See the dplyr programming vignette here:

https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html

The new way to refer to columns when their identifier is stored as a character vector is to use the .data pronoun from rlang, and then subset as you would in base R.

library(dplyr)

key <- "v3"
val <- "v2"
drp <- "v1"

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

df %>% 
    select(-matches(drp)) %>% 
    group_by(.data[[key]]) %>% 
    summarise(total = sum(.data[[val]], na.rm = TRUE))

#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#>   v3    total
#>   <chr> <int>
#> 1 A        21
#> 2 B        19

If your code is in a package function, you can @importFrom rlang .data to avoid R check notes about undefined globals.

Variables to summarise data in dplyr and R statistics: Refer to column names stored as strings with the `.data` pronoun:

using the outline in the comments by Luis. Translating into my function:

summariseData <- function(df, column_to_summerise, target, kpi_target)
{
  column_to_summerise <- enquo(column_to_summerise)

  calc_df <- df %>% 
    group_by(Date_Received) %>% 
    dplyr:: summarise(med=median(!!column_to_summerise, na.rm = TRUE), 
                      per95=quantile(!!column_to_summerise, probs = kpi_target, na.rm = TRUE), 
                      In_Target = sum(!!column_to_summerise <= target, na.rm = TRUE), 
                      Out_Target = sum(!!column_to_summerise > target, na.rm = TRUE), 
                      Total_Data = n()) 
  return(calc_df)
}

Key point is to use the enquo() function prior to using !!

From (https://dplyr.tidyverse.org/articles/programming.html) By
analogy to strings, we don’t want "", instead we want some function
that turns an argument into a string. That’s the job of enquo().
enquo() uses some dark magic to look at the argument, see what the
user typed, and return that value as a quosure

In dplyr (and in tidyeval in general) you use !! to say that you want
to unquote an input so that it’s evaluated

Dplyr Summarise Groups as Column Names

There's lots of ways to go about it, but I would simplify it by pivoting to a longer data frame initially, and then grouping by var and group. Then you can just pivot wider to get the final result you want. Note that I used summarize(across()) which replaces the deprecated summarize_all(), even though with a single column could've just manually specified Mean = ... and Sum = ....

set.seed(123)

test_df %>%
  pivot_longer(
    var1:var2,
    names_to = "var"
  ) %>%
  group_by(Group, var) %>%
  summarize(
    across(
      everything(),
      list(Mean = mean, Sum = sum),
      .names = "{.fn}"
    ),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = "Group",
    values_from = c(Mean, Sum),
    names_glue = "{Group}_{.value}"
  )
#> # A tibble: 2 × 7
#>   var   A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
#>   <chr>  <dbl>  <dbl>  <dbl> <int> <int> <int>
#> 1 var1       1    2.5    3.2     1    10    16
#> 2 var2       5    4.5    4.4     5    18    22

Dynamic naming for the output of pipe with group by, mutate, select

We can use assign here

assign( paste(source_name, "constant", sep="_"), df %>% select(a,b,c) )

dplyr summarize: create variables from named vector

You can also try this with do():

toy_df %>%
  group_by(group) %>%
  do(res = toy_fn(.$value))

R:dplyr summarise data by group with nth() call with variable n calculated during aggregation

Let's add variable z and n in summarise part. Those variables are defined as below.

df %>% 
  mutate(rn = row_number()) %>%
  group_by(grp = (cumsum(d)-1)%/% 100 + 1) %>%
  summarise(x=mean(x, na.rm = TRUE), 
            d=sum(d, na.rm=T), ,i.start=first(rn), 
            i.end=last(rn), 
            z = round(first(rn)+(last(rn)-first(rn))/2-1),
            n = n())

     grp      x     d i.start i.end     z     n
   <dbl>  <dbl> <dbl>   <int> <int> <dbl> <int>
 1     1 0.0746  89.0       1     4     2     4
 2     2 0.0759  97.6       5     8     6     4
 3     3 0.0535 105.        9    12    10     4
 4     4 0.0650 106.       13    16    14     4
 5     5 0.0860  98.2      17    20    18     4
 6     6 0.0626  84.4      21    23    21     3
 7     7 0.0479 112.       24    27    24     4
 8     8 0.0394  83.5      28    30    28     3
 9     9 0.0706 110.       31    34    32     4
10    10 0.0575 112.       35    38    36     4
11    11 0.0647  83.0      39    41    39     3
12    12 0.0659 108.       42    45    42     4
13    13 0.0854 111.       46    49    46     4
14    14 0.0204  27.9      50    50    49     1

In dataframe above, n indicates sample size of each groups separated by grp. However, as you state group_by(grp), when you call nth(y, z), YOU WILL CALL Z-TH VALUE BY GROUP.

It means that for 5th group, although there exists only 4 values, you call 18th value of y. So it prints NA.

To get this easy, the most simple way I think is use n().

df %>% 
  mutate(rn = row_number()) %>%
  group_by(grp = (cumsum(d)-1)%/% 100 + 1) %>%
  summarise(x=mean(x, na.rm = TRUE), 
            d=sum(d, na.rm=T), ,i.start=first(rn), 
            i.end=last(rn), 
            y=nth(y, round(n()/2)))

     grp      x     d i.start i.end     y
   <dbl>  <dbl> <dbl>   <int> <int> <dbl>
 1     1 0.0746  89.0       1     4  19.8
 2     2 0.0759  97.6       5     8  19.8
 3     3 0.0535 105.        9    12  19.8
 4     4 0.0650 106.       13    16  19.8
 5     5 0.0860  98.2      17    20  19.8
 6     6 0.0626  84.4      21    23  19.8
 7     7 0.0479 112.       24    27  19.8
 8     8 0.0394  83.5      28    30  19.8
 9     9 0.0706 110.       31    34  19.8
10    10 0.0575 112.       35    38  19.8
11    11 0.0647  83.0      39    41  19.8
12    12 0.0659 108.       42    45  19.8
13    13 0.0854 111.       46    49  19.8
14    14 0.0204  27.9      50    50  NA

You'll call floor(n/2)th y, which means y that locates middle of each group. Note that you can also try floor(n/2)+1.

R: How to summarize and group by variables as column names

We can use across from the new version of dplyr

library(dplyr)
df %>%
    group_by(across(colums_to_group)) %>% 
    summarise(across(all_of(columns_to_sum), sum, na.rm = TRUE), .groups = 'drop')
# A tibble: 2 x 3
#  A         B     C
#  <chr> <int> <int>
#1 X         6    21
#2 Y         9    19

In the previous version, we could use group_by_at along with summarise_at

df %>%
    group_by_at(colums_to_group) %>%
    summarise_at(vars(columns_to_sum), sum, na.rm = TRUE)

Is there a way to input dplyr::summarise variables?

Use := notation to assign column names

library(dplyr)

get_summarised<- function(df, col){
  df %>% summarise({{col}} := mean({{ col }}))
}

get_summarised(mtcars, mpg)
#       mpg
#1 20.09062

Variable Results with Dplyr Summarise, Depending on Output Variable Naming