Variable results with dplyr summarise, depending on output variable naming
The transformations you specify in summarize
are performed in the order they appear, that means if you change variable values, then those new values appear for the subsequent columns (this is different from the base function tranform()
). When you do
df %>% group_by(time) %>%
summarise(glucose=mean(glucose, na.rm=TRUE),
glucose.sd=sd(glucose, na.rm=TRUE),
n=sum(!is.na(glucose)))
The glucose=mean(glucose, na.rm=TRUE)
part has changed the value of the glucose
variable such that when you calculate the glucose.sd=sd(glucose, na.rm=TRUE)
part, the sd()
does not see the original glucose values, it see the new value that is the mean of the original values. If you re-order the columns, it will work.
df %>% group_by(time) %>%
summarise(glucose.sd=sd(glucose, na.rm=TRUE),
n=sum(!is.na(glucose)),
glucose=mean(glucose, na.rm=TRUE))
If you are wondering why this is the default behavior, this is because it is often nice to create a column and then use that column value later in the transformations. For example, with mutate()
df %>% group_by(time) %>%
mutate(glucose_sq = glucose^2,
glucose_sq_plus2 = glucose_sq+2)
Using dplyr to summarize and keep the same variable name
A possible solution is to skip the mutate
steps and use transmute
for the first mutate
/select
-step and directly calculate the desired variables from the original variables without creating an intermediate variable for the second mutate
-step:
df %>%
transmute(Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total) %>%
bind_rows(df %>%
summarize(Group = "All",
Count_Dist = sum(Count/sum(Count)),
Weighted_Avg_Total = sum((Count/sum(Count))*Total)))
which gives:
Group Count_Dist Weighted_Avg_Total
1 A 0.09345794 50.0000
2 B 0.14018692 300.0000
3 C 0.11214953 600.0000
4 D 0.18691589 400.0000
5 E 0.46728972 1000.0000
6 All 1.00000000 656.0748
Another possible solution is to alter the order in which the new variables are calculated in dplyr
and then use select
to get the column-order back into what you originally wanted:
df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
select(Group, Count_Dist, Weighted_Avg_Total = Total) %>%
bind_rows(df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
summarize(Group = "All",
Weighted_Avg_Total = sum(Count_Dist*Total),
Count_Dist = sum(Count_Dist)) %>%
select(Group, Count_Dist, Weighted_Avg_Total))
If you want to include the Count
-column as well, you could do (based on my comment from below):
df %>%
transmute(Group = Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total, Count) %>%
bind_rows(df %>%
summarize(Group = "All",
Count_Dist = sum(Count/sum(Count)),
Weighted_Avg_Total = sum((Count/sum(Count))*Total),
Count = sum(Count)))
standard evaluation in dplyr: summarise a variable given as a character string
dplyr
1.0 has changed pretty much everything about this question as well as all of the answers. See the dplyr
programming vignette here:
https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
The new way to refer to columns when their identifier is stored as a character vector is to use the .data
pronoun from rlang
, and then subset as you would in base R.
library(dplyr)
key <- "v3"
val <- "v2"
drp <- "v1"
df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
df %>%
select(-matches(drp)) %>%
group_by(.data[[key]]) %>%
summarise(total = sum(.data[[val]], na.rm = TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> v3 total
#> <chr> <int>
#> 1 A 21
#> 2 B 19
If your code is in a package function, you can @importFrom rlang .data
to avoid R check notes about undefined globals.
Variables to summarise data in dplyr and R statistics: Refer to column names stored as strings with the `.data` pronoun:
using the outline in the comments by Luis. Translating into my function:
summariseData <- function(df, column_to_summerise, target, kpi_target)
{
column_to_summerise <- enquo(column_to_summerise)
calc_df <- df %>%
group_by(Date_Received) %>%
dplyr:: summarise(med=median(!!column_to_summerise, na.rm = TRUE),
per95=quantile(!!column_to_summerise, probs = kpi_target, na.rm = TRUE),
In_Target = sum(!!column_to_summerise <= target, na.rm = TRUE),
Out_Target = sum(!!column_to_summerise > target, na.rm = TRUE),
Total_Data = n())
return(calc_df)
}
Key point is to use the enquo() function prior to using !!
From (https://dplyr.tidyverse.org/articles/programming.html) By
analogy to strings, we don’t want "", instead we want some function
that turns an argument into a string. That’s the job of enquo().
enquo() uses some dark magic to look at the argument, see what the
user typed, and return that value as a quosureIn dplyr (and in tidyeval in general) you use !! to say that you want
to unquote an input so that it’s evaluated
Dplyr Summarise Groups as Column Names
There's lots of ways to go about it, but I would simplify it by pivoting to a longer data frame initially, and then grouping by var
and group
. Then you can just pivot wider to get the final result you want. Note that I used summarize(across())
which replaces the deprecated summarize_all()
, even though with a single column could've just manually specified Mean = ...
and Sum = ...
.
set.seed(123)
test_df %>%
pivot_longer(
var1:var2,
names_to = "var"
) %>%
group_by(Group, var) %>%
summarize(
across(
everything(),
list(Mean = mean, Sum = sum),
.names = "{.fn}"
),
.groups = "drop"
) %>%
pivot_wider(
names_from = "Group",
values_from = c(Mean, Sum),
names_glue = "{Group}_{.value}"
)
#> # A tibble: 2 × 7
#> var A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
#> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
#> 1 var1 1 2.5 3.2 1 10 16
#> 2 var2 5 4.5 4.4 5 18 22
Dynamic naming for the output of pipe with group by, mutate, select
We can use assign
here
assign( paste(source_name, "constant", sep="_"), df %>% select(a,b,c) )
dplyr summarize: create variables from named vector
You can also try this with do()
:
toy_df %>%
group_by(group) %>%
do(res = toy_fn(.$value))
R:dplyr summarise data by group with nth() call with variable n calculated during aggregation
Let's add variable z
and n
in summarise
part. Those variables are defined as below.
df %>%
mutate(rn = row_number()) %>%
group_by(grp = (cumsum(d)-1)%/% 100 + 1) %>%
summarise(x=mean(x, na.rm = TRUE),
d=sum(d, na.rm=T), ,i.start=first(rn),
i.end=last(rn),
z = round(first(rn)+(last(rn)-first(rn))/2-1),
n = n())
grp x d i.start i.end z n
<dbl> <dbl> <dbl> <int> <int> <dbl> <int>
1 1 0.0746 89.0 1 4 2 4
2 2 0.0759 97.6 5 8 6 4
3 3 0.0535 105. 9 12 10 4
4 4 0.0650 106. 13 16 14 4
5 5 0.0860 98.2 17 20 18 4
6 6 0.0626 84.4 21 23 21 3
7 7 0.0479 112. 24 27 24 4
8 8 0.0394 83.5 28 30 28 3
9 9 0.0706 110. 31 34 32 4
10 10 0.0575 112. 35 38 36 4
11 11 0.0647 83.0 39 41 39 3
12 12 0.0659 108. 42 45 42 4
13 13 0.0854 111. 46 49 46 4
14 14 0.0204 27.9 50 50 49 1
In dataframe above, n
indicates sample size of each groups separated by grp
. However, as you state group_by(grp)
, when you call nth(y, z)
, YOU WILL CALL Z-TH VALUE BY GROUP.
It means that for 5th group, although there exists only 4 values, you call 18th value of y
. So it prints NA
.
To get this easy, the most simple way I think is use n()
.
df %>%
mutate(rn = row_number()) %>%
group_by(grp = (cumsum(d)-1)%/% 100 + 1) %>%
summarise(x=mean(x, na.rm = TRUE),
d=sum(d, na.rm=T), ,i.start=first(rn),
i.end=last(rn),
y=nth(y, round(n()/2)))
grp x d i.start i.end y
<dbl> <dbl> <dbl> <int> <int> <dbl>
1 1 0.0746 89.0 1 4 19.8
2 2 0.0759 97.6 5 8 19.8
3 3 0.0535 105. 9 12 19.8
4 4 0.0650 106. 13 16 19.8
5 5 0.0860 98.2 17 20 19.8
6 6 0.0626 84.4 21 23 19.8
7 7 0.0479 112. 24 27 19.8
8 8 0.0394 83.5 28 30 19.8
9 9 0.0706 110. 31 34 19.8
10 10 0.0575 112. 35 38 19.8
11 11 0.0647 83.0 39 41 19.8
12 12 0.0659 108. 42 45 19.8
13 13 0.0854 111. 46 49 19.8
14 14 0.0204 27.9 50 50 NA
You'll call floor(n/2)
th y
, which means y
that locates middle of each group. Note that you can also try floor(n/2)+1
.
R: How to summarize and group by variables as column names
We can use across
from the new version of dplyr
library(dplyr)
df %>%
group_by(across(colums_to_group)) %>%
summarise(across(all_of(columns_to_sum), sum, na.rm = TRUE), .groups = 'drop')
# A tibble: 2 x 3
# A B C
# <chr> <int> <int>
#1 X 6 21
#2 Y 9 19
In the previous version, we could use group_by_at
along with summarise_at
df %>%
group_by_at(colums_to_group) %>%
summarise_at(vars(columns_to_sum), sum, na.rm = TRUE)
Is there a way to input dplyr::summarise variables?
Use :=
notation to assign column names
library(dplyr)
get_summarised<- function(df, col){
df %>% summarise({{col}} := mean({{ col }}))
}
get_summarised(mtcars, mpg)
# mpg
#1 20.09062
Related Topics
Changing Line Color in Ggplot Based on Slope
In R Data.Frame, Promote Rownames to Actual Column
Adding an Image to a Datatable in R
Shiny Error in Match.Arg(Position):'Arg' Must Be Null or a Character Vector
R Read Abbreviated Month Form a Date That Is Not in English
How to Use User Input to Obtain a Data.Frame from My Environment in Shiny
Scale Value Inside of Aes_String()
How to Use Geom_Rect with Discrete Axis Values
Adding a New Column to Matrix Error
R - Calculate Test Mse Given a Trained Model from a Training Set and a Test Set
Labelling Points with Ggplot2 and Directlabels
How to Select Dropdown Box Using Rselenium
Creating "Word" Cloud of Phrases, Not Individual Words in R