How to interpret dplyr message `summarise()` regrouping output by 'x' (override with `.groups` argument)?
It is just a friendly informational message, not a warning. By default, if there is any grouping before the summarise, it drops one grouping variable, i.e. the last one specified in the group_by. If there is only one grouping variable, there won't be any grouping attribute after the summarise. If there is more than one (here there are two), the grouping is reduced by one, i.e. the data would have 'year' as its grouping attribute. As a reproducible example:
library(dplyr)
mtcars %>%
  group_by(am) %>%
  summarise(mpg = sum(mpg))
#`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
# am mpg
#* <dbl> <dbl>
#1 0 326.
#2 1 317.
The message says that it is ungrouping, i.e. when there is a single group_by variable, it drops that grouping after the summarise:
mtcars %>%
  group_by(am, vs) %>%
  summarise(mpg = sum(mpg))
#`summarise()` regrouping output by 'am' (override with `.groups` argument)
# A tibble: 4 x 3
# Groups: am [2]
# am vs mpg
# <dbl> <dbl> <dbl>
#1 0 0 181.
#2 0 1 145.
#3 1 0 118.
#4 1 1 199.
Here, it drops the last grouping variable ('vs') and regroups by 'am'.
If we check ?summarise, there is a .groups argument which by default is "drop_last"; the other options are "drop", "keep", and "rowwise".
.groups - Grouping structure of the result.
"drop_last": dropping the last level of grouping. This was the only supported option before version 1.0.0.
"drop": All levels of grouping are dropped.
"keep": Same grouping structure as .data.
"rowwise": Each row is its own group.
When .groups is not specified, you either get "drop_last" when all the results are size 1, or "keep" if the size varies. In addition, a message informs you of that choice, unless the option "dplyr.summarise.inform" is set to FALSE.
i.e. if we specify .groups in summarise, we don't get the message. With 'drop', the group attributes are also removed:
mtcars %>%
  group_by(am) %>%
  summarise(mpg = sum(mpg), .groups = 'drop')
# A tibble: 2 x 2
# am mpg
#* <dbl> <dbl>
#1 0 326.
#2 1 317.
mtcars %>%
  group_by(am, vs) %>%
  summarise(mpg = sum(mpg), .groups = 'drop')
# A tibble: 4 x 3
# am vs mpg
#* <dbl> <dbl> <dbl>
#1 0 0 181.
#2 0 1 145.
#3 1 0 118.
#4 1 1 199.
mtcars %>%
  group_by(am, vs) %>%
  summarise(mpg = sum(mpg), .groups = 'drop') %>%
  str()
#tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
# $ am : num [1:4] 0 0 1 1
# $ vs : num [1:4] 0 1 0 1
# $ mpg: num [1:4] 181 145 118 199
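To see what each .groups value leaves behind, here is a small sketch. It is not part of the original answer; it reuses the mtcars example and inspects the result with group_vars(), so treat it as an illustration rather than a reference:

```r
library(dplyr)

by_am_vs <- mtcars %>% group_by(am, vs)

# Grouping variables that remain after summarise() under each .groups value
left_drop_last <- by_am_vs %>%
  summarise(mpg = sum(mpg), .groups = "drop_last") %>%
  group_vars()   # "am"

left_drop <- by_am_vs %>%
  summarise(mpg = sum(mpg), .groups = "drop") %>%
  group_vars()   # character(0)

left_keep <- by_am_vs %>%
  summarise(mpg = sum(mpg), .groups = "keep") %>%
  group_vars()   # "am" "vs"

# "rowwise" returns a rowwise data frame instead of a grouped one
rw <- by_am_vs %>% summarise(mpg = sum(mpg), .groups = "rowwise")
inherits(rw, "rowwise_df")
```

Because .groups is given explicitly in every call, none of these print the regrouping message.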
Previously, this message was not issued, and it could lead to situations where the OP does a mutate or something else assuming there is no grouping, resulting in unexpected output. Now, the message gives the user an indication that a grouping attribute may still be present.
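A minimal sketch of that pitfall, assuming we want each row's share of total mpg (the share column and the variable names are made up for illustration):

```r
library(dplyr)

by_am_vs <- mtcars %>% group_by(am, vs)

# Default behaviour ("drop_last"): the result is still grouped by am,
# so sum(mpg) inside mutate() is computed within each am group
grouped_share <- by_am_vs %>%
  summarise(mpg = sum(mpg), .groups = "drop_last") %>%
  mutate(share = mpg / sum(mpg))

# "drop": no grouping is left, so sum(mpg) runs over all four rows
overall_share <- by_am_vs %>%
  summarise(mpg = sum(mpg), .groups = "drop") %>%
  mutate(share = mpg / sum(mpg))

sum(grouped_share$share)  # shares add up to 1 within each am group
sum(overall_share$share)  # shares add up to 1 overall
```

The two share columns differ even though the summarised mpg values are the same, which is exactly the kind of surprise the message is meant to flag.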
NOTE: The .groups argument is currently marked experimental in its lifecycle, so its behaviour could be modified in future releases.
Depending on whether we need further transformations of the data based on the same grouping variable, we can select among the different options in .groups.
How can I override using `.groups` argument to add the Date Variable to my Summary function?
We don't need the data$ prefix inside tidyverse functions. Using the daily_activity$ prefix extracts the full column from the data instead of the values for each 'Id' group. It is not clear which function should be applied to ActivityDate. If we need to return the min or first, apply that.
library(dplyr)
daily_activity %>%
  group_by(Id) %>%
  summarise(Date_exersized = min(ActivityDate), .groups = 'drop')
Instead, if we want to create a new column, use mutate instead of summarise. The message is a friendly one. We can specify .groups with one of its values, e.g. 'drop' removes the group attribute. By default, it removes the last grouping variable.
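As a rough illustration of the summarise versus mutate choice, here is a sketch on a made-up stand-in for daily_activity (the real data is not shown in the question, so the four rows below are purely hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for daily_activity
daily_activity <- tibble(
  Id = c(1, 1, 2, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13",
                           "2016-04-15", "2016-04-14"))
)

# summarise(): one row per Id, holding the earliest date
first_dates <- daily_activity %>%
  group_by(Id) %>%
  summarise(Date_exersized = min(ActivityDate), .groups = "drop")

# mutate(): all rows kept, earliest date repeated within each Id;
# mutate() has no .groups argument, so ungroup() explicitly
with_first <- daily_activity %>%
  group_by(Id) %>%
  mutate(Date_exersized = min(ActivityDate)) %>%
  ungroup()
```

first_dates has two rows and with_first keeps all four, so the choice depends on whether the per-Id date should replace or accompany the original rows.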
Stop warnings with summarise
You can set a global option to not display this message:
options(dplyr.summarise.inform = FALSE)
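If silencing the message for the whole session feels too broad, one possible pattern is to save the previous setting and restore it afterwards. This is only a sketch using base R's options(), which returns the old values when setting new ones:

```r
library(dplyr)

# options() returns the previous value(s), so they can be restored later
old <- options(dplyr.summarise.inform = FALSE)

res <- mtcars %>%
  group_by(am, vs) %>%
  summarise(mpg = sum(mpg))   # no message while the option is FALSE

# Put the option back to whatever it was before
options(old)
```

Note that only the message is suppressed here: res is still grouped by am, since the default grouping behaviour is unchanged.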
Eliminate the ungroup... message from tidyverse package
We can specify the .groups argument in summarise with one of its options if we want to avoid getting the message. Also, to extract a column as a vector, the tidyverse has pull:
library(dplyr)
hsb %>%
  dplyr::select(sch.id) %>%
  group_by(sch.id) %>%
  summarise(n = n(), .groups = 'drop') %>%
  pull(n)
Another option is to bypass the group_by/summarise altogether and use count:
hsb %>%
  count(sch.id) %>%
  pull(n)
Or with tally
hsb %>%
group_by(sch.id) %>%
tally()
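To check that the three routes agree, here is a sketch on a tiny made-up hsb (the real data set is not shown, so the sch.id values below are hypothetical). Note that tally() returns a tibble, so pull(n) is added to get a vector:

```r
library(dplyr)

# Hypothetical stand-in for hsb: six students in three schools
hsb <- tibble(sch.id = c(1, 1, 1, 2, 2, 3))

n_summarise <- hsb %>%
  group_by(sch.id) %>%
  summarise(n = n(), .groups = "drop") %>%
  pull(n)

n_count <- hsb %>%
  count(sch.id) %>%
  pull(n)

n_tally <- hsb %>%
  group_by(sch.id) %>%
  tally() %>%
  pull(n)

n_summarise  # 3 2 1
```

All three vectors hold the same per-school counts.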
Why does summarize( ) behave differently in different machines?
Check your R and package versions by running:
sessionInfo()
You probably have different versions of software across machines. In your case it seems to be a newer version of the dplyr package. Try this on each of your machines:
packageVersion("dplyr")
#> [1] ‘1.0.0’
The message about regrouping output is only for your information, to make clear what happens to your data frame during summarizing. It is neither a warning nor an error.
For more information about grouping see also:
- How to interpret dplyr message `summarise()` regrouping output by 'x' (override with `.groups` argument)?
- dplyr::summarise()
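If the same script has to run on machines with older and newer dplyr, one possible approach is to branch on the installed version. This is a sketch, relying on the documentation's statement that "drop_last" was the only behaviour before 1.0.0:

```r
library(dplyr)

# .groups is available from dplyr 1.0.0 onwards
has_groups_arg <- packageVersion("dplyr") >= "1.0.0"

if (has_groups_arg) {
  res <- mtcars %>%
    group_by(am, vs) %>%
    summarise(mpg = sum(mpg), .groups = "drop")
} else {
  # Older dplyr: drop the remaining grouping by hand
  res <- mtcars %>%
    group_by(am, vs) %>%
    summarise(mpg = sum(mpg)) %>%
    ungroup()
}
```

Either branch yields an ungrouped result, and neither prints the message.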
Difference between .groups argument and ungroup() in dplyr?
This is a special behavior/capability of summarize. When you group data by multiple variables, summarize by default drops the last grouping variable and keeps the remaining ones (here, the first) as groups in the output data frame.
library(wec)
library(dplyr)
data(PUMS)
PUMS %>%
  group_by(race, education.cat) %>%
  summarise(hi = mean(wage))
# # A tibble: 8 × 3
# # Groups: race [4]
# race education.cat hi
# <fct> <fct> <dbl>
# 1 Hispanic High school 35149.
# 2 Hispanic Degree 52344.
# 3 Black High school 30552.
# 4 Black Degree 48243.
# 5 Asian High school 35350
# 6 Asian Degree 78213.
# 7 White High school 38532.
# 8 White Degree 69135.
Notice that the above data frame is still grouped by race (4 groups). If you use the .groups = "drop" argument in summarize, the output numbers are identical but the data frame has no groups.
PUMS %>%
  group_by(race, education.cat) %>%
  summarise(hi = mean(wage), .groups = "drop")
# # A tibble: 8 × 3
# race education.cat hi
# <fct> <fct> <dbl>
# 1 Hispanic High school 35149.
# 2 Hispanic Degree 52344.
# 3 Black High school 30552.
# 4 Black Degree 48243.
# 5 Asian High school 35350
# 6 Asian Degree 78213.
# 7 White High school 38532.
# 8 White Degree 69135.
The mutate function in the first of your examples does not have a built-in .groups argument, so you have to add an extra ungroup() step if you wish to remove the grouping afterwards.
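A short sketch of the comparison, using mtcars instead of PUMS so that it runs without the wec package (the variable names are made up for illustration):

```r
library(dplyr)

# summarise(): .groups = "drop" and a trailing ungroup() end up in the same place
via_groups <- mtcars %>%
  group_by(am, vs) %>%
  summarise(hi = mean(mpg), .groups = "drop")

via_ungroup <- mtcars %>%
  group_by(am, vs) %>%
  summarise(hi = mean(mpg), .groups = "drop_last") %>%
  ungroup()

# mutate(): grouping is kept until ungroup() is called explicitly
still_grouped <- mtcars %>%
  group_by(am, vs) %>%
  mutate(hi = mean(mpg))

is_grouped_df(via_groups)     # FALSE
is_grouped_df(via_ungroup)    # FALSE
is_grouped_df(still_grouped)  # TRUE
```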
How to inject weight into list of dplyr summarise name-value pairs?
Here is one option to interpolate the 'weights' into the expressions passed in ... by converting the multiple expressions into a single string and parsing it for evaluation:
weighted_summarise <- function(data, weights, ...) {
  weights <- rlang::as_string(rlang::ensym(weights))
  v1 <- purrr::map_chr(rlang::enexprs(...),
    ~ stringr::str_replace(rlang::as_label(.x), "\\(",
        function(x) stringr::str_c("(", weights, "*")))
  eval(rlang::parse_expr(stringr::str_c("data %>%
    summarise(", stringr::str_c(names(v1), v1, sep = "=",
    collapse = ", "), ")")))
}
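To make the string rewriting step easier to follow, here is a small sketch of what happens to a single expression. It isolates the str_replace call from the function above, with the weights name hard-coded for illustration:

```r
library(rlang)
library(stringr)

# as_label() turns the captured expression into text
label <- as_label(quote(sum(b)))

# str_replace() rewrites only the FIRST "(" it finds
rewritten <- str_replace(label, "\\(", "(weights*")
rewritten
# "sum(weights*b)"

# With nested calls, the weight is applied to the outermost argument
nested <- str_replace(as_label(quote(mean(abs(d)))), "\\(", "(weights*")
nested
# "mean(weights*abs(d))"
```

Since the approach works on text, it assumes each expression has the shape f(x) with the weighted quantity as the first argument; other shapes would need a different rewrite.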
Testing:
> data %>%
weighted_summarise(weights, a = sum(b), c = mean(d))
# A tibble: 1 × 2
a c
<dbl> <dbl>
1 -2.95 1.13
# testing with the original summarise code outside the function
> data %>%
dplyr::summarise(a = sum(weights * b), c = mean(weights * d))
# A tibble: 1 × 2
a c
<dbl> <dbl>
1 -2.95 1.13
Data:
data <- structure(list(b = c(-0.545880758366027, 0.536585304107612, 0.419623148618683,
-0.583627199210279, 0.847460017311944, 0.266021979364892, 0.444585270360416,
-0.466495123565759, -0.848370043948898, 0.00231194241576697),
d = c(-1.31690812429962, 0.598269112694685, -0.7622143703459,
-1.42909030324076, 0.332244449013422, -0.469060687608488,
-0.334986793584065, 1.53625215550584, 0.609994533253692,
0.51633569843567), weights = 1:10), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L))