dplyr summarise: Equivalent of .drop=FALSE to keep groups with zero length in output

Since dplyr 0.8, group_by() has gained the .drop argument, which does just what you asked for:

library(dplyr)

df = data.frame(a = rep(1:3, 4), b = rep(1:2, 6))
df$b = factor(df$b, levels = 1:3)

df %>%
    group_by(b, .drop = FALSE) %>%
    summarise(count_a = length(a))

#> # A tibble: 3 x 2
#> b count_a
#> <fct> <int>
#> 1 1 6
#> 2 2 6
#> 3 3 0
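The same zero appears if n() is used instead of length(a) (a small variant of the example above, assuming the same df):

df %>%
    group_by(b, .drop = FALSE) %>%
    summarise(count_a = n())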

One additional note to go with @Moody_Mudskipper's answer: Using .drop=FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See examples below:

library(dplyr)
data(iris)

# Add an additional level to Species
iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))

# Species is a factor and empty groups are included in the output
iris %>% group_by(Species, .drop=FALSE) %>% tally

#> Species n
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
#> 4 empty_level 0

# Add character column
iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))

# Empty groups involving combinations of Species and group2 are not included in output
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally

#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 versicolor A 25
#> 4 versicolor B 25
#> 5 virginica B 25
#> 6 virginica C 25
#> 7 empty_level <NA> 0

# Turn group2 into a factor
iris$group2 = factor(iris$group2)

# Now all possible combinations of Species and group2 are included in the output,
# whether present in the data or not
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally

#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 setosa C 0
#> 4 versicolor A 25
#> 5 versicolor B 25
#> 6 versicolor C 0
#> 7 virginica A 0
#> 8 virginica B 25
#> 9 virginica C 25
#> 10 empty_level A 0
#> 11 empty_level B 0
#> 12 empty_level C 0

Created on 2019-03-13 by the reprex package (v0.2.1)
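As a general pattern (not part of the original answer, and needing dplyr >= 1.0): instead of converting each character grouping column individually, all character columns can be turned into factors in a single step, giving the same full set of combinations:

iris %>%
    mutate(across(where(is.character), as.factor)) %>%  # convert any character grouping columns to factors
    group_by(Species, group2, .drop = FALSE) %>%
    tally()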

Why does dplyr group_by not respect .drop=FALSE

Your .drop = FALSE is correct, but when you use length(), the empty groups have no underlying data at all, so what length() returns for them will be odd. Try count() instead:

set.seed(100)
# IndexName must be a factor for .drop = FALSE to keep the empty combinations
IdxData = data.frame(MktDate = sample(1:3, 10, replace = TRUE),
                     IndexName = factor(sample(LETTERS[1:3], 10, replace = TRUE)))

IdxData %>% count(MktDate, IndexName, .drop = FALSE)
# A tibble: 9 x 3
MktDate IndexName n
<int> <fct> <int>
1 1 A 0
2 1 B 0
3 1 C 1
4 2 A 1
5 2 B 1
6 2 C 4
7 3 A 0
8 3 B 2
9 3 C 1

Or, if you need the name "CountSecurity" (thanks to @arg0naut91):

IdxData %>%
    count(MktDate, IndexName, .drop = FALSE, name = "CountSecurity")

dplyr::count omits unrepresented levels

You can use tidyr::complete to complete the missing factor levels; this also gives you the option to specify how to fill (default is NA).
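The question's df isn't shown; a minimal data frame that would reproduce the output below (condition as a factor with unused levels D to G) is assumed here:

df = data.frame(condition = factor(c("A", "A", "A", "B", "B", "C"), levels = LETTERS[1:7]))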

library(dplyr)
library(tidyr)
df %>% count(condition) %>% complete(condition, fill = list(n = 0))
## A tibble: 7 x 2
# condition n
# <fct> <dbl>
#1 A 3.
#2 B 2.
#3 C 1.
#4 D 0.
#5 E 0.
#6 F 0.
#7 G 0.
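Alternatively, since condition is a factor carrying the unused levels, count()'s own .drop argument (as in the answers above) covers the same case and keeps n as an integer:

df %>% count(condition, .drop = FALSE)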

dplyr-summarise, keep the original group name

You can use which.min to get the index of the minimum value of Sepal.Length; this index can then be used to subset the corresponding subgroup value.

library(dplyr)

iris %>%
    mutate(subgroup = rep(c('A','B'), 75)) %>%
    group_by(Species) %>%
    summarise(SLmin = min(Sepal.Length),
              subgroup = subgroup[which.min(Sepal.Length)])

# Species SLmin subgroup
# <fct> <dbl> <chr>
#1 setosa 4.3 B
#2 versicolor 4.9 B
#3 virginica 4.9 A

An alternative is to slice the row with the minimum Sepal.Length for each Species, and then select only those columns that we need in the final output.

iris %>%
    mutate(subgroup = rep(c('A','B'), 75)) %>%
    group_by(Species) %>%
    slice(which.min(Sepal.Length))

summarise() giving empty output

If a row-wise mean is needed, either use rowMeans() (see the sketch at the end of this answer) or rowwise(). If both plyr and dplyr are loaded, the masking of summarise can be resolved with ::, directing the call to the specific package:

library(dplyr)
data %>%
    rowwise %>%
    dplyr::summarise(id, ave.score = mean(c(Q1, Q2, Q3), na.rm = TRUE),
                     .groups = 'drop')

Another option is with c_across():

data %>%
    rowwise %>%
    dplyr::summarise(id, ave.score = mean(c_across(starts_with('Q')),
                                          na.rm = TRUE), .groups = 'drop')
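The rowMeans() route mentioned at the top could look like this (a sketch, assuming the same data with an id column and question columns named Q1, Q2, Q3):

data %>%
    dplyr::mutate(ave.score = rowMeans(across(starts_with('Q')), na.rm = TRUE)) %>%  # row-wise mean over the Q columns
    dplyr::select(id, ave.score)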

Summarise after using pivot_wider

You haven't shared enough data, but you can try:

library(dplyr)
library(tidyr)

bookings %>%
    group_by(property_id, for_business) %>%
    summarize(avg_review_score = mean(review_score, na.rm = TRUE)) %>%
    ungroup %>%
    mutate(for_business = c("tourist", "business")[for_business + 1]) %>%
    pivot_wider(names_from = for_business, values_from = avg_review_score) %>%
    mutate(diff = business - tourist) %>%
    summarize(avg_diff = mean(diff, na.rm = TRUE))
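Since the real bookings table isn't shown, a toy table with the assumed columns (property_id, for_business coded as 0/1, review_score) would let the pipeline above run:

bookings = data.frame(
    property_id  = rep(1:2, each = 4),
    for_business = rep(c(0, 1), 4),
    review_score = c(7, 9, 8, 10, 6, 8, 7, 9)
)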

How to interpret dplyr message `summarise()` regrouping output by 'x' (override with `.groups` argument)?

It is just a friendly message. By default, if there is any grouping before the summarise, it drops one grouping variable, i.e. the last one specified in the group_by. If there is only one grouping variable, there won't be any grouping attribute left after the summarise; if there is more than one, here two, the grouping is reduced by one, i.e. the data would keep 'year' as its grouping attribute. As a reproducible example:

library(dplyr)
mtcars %>%
    group_by(am) %>%
    summarise(mpg = sum(mpg))
#`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
# am mpg
#* <dbl> <dbl>
#1 0 326.
#2 1 317.

The message says that it is ungrouping, i.e. when there is a single grouping variable, that grouping is dropped after the summarise:

mtcars %>%
    group_by(am, vs) %>%
    summarise(mpg = sum(mpg))
#`summarise()` regrouping output by 'am' (override with `.groups` argument)
# A tibble: 4 x 3
# Groups: am [2]
# am vs mpg
# <dbl> <dbl> <dbl>
#1 0 0 181.
#2 0 1 145.
#3 1 0 118.
#4 1 1 199.

Here, it drops the last grouping variable ('vs') and regroups by 'am'.

If we check ?summarise, there is a .groups argument which by default is "drop_last"; the other options are "drop", "keep", and "rowwise":

.groups - Grouping structure of the result.

"drop_last": dropping the last level of grouping. This was the only supported option before version 1.0.0.

"drop": All levels of grouping are dropped.

"keep": Same grouping structure as .data.

"rowwise": Each row is it's own group.

When .groups is not specified, you either get "drop_last" when all the results are size 1, or "keep" if the size varies. In addition, a message informs you of that choice, unless the option "dplyr.summarise.inform" is set to FALSE.

I.e. if we specify .groups in the summarise, we don't get the message; with "drop", the grouping attributes are removed:

mtcars %>%
    group_by(am) %>%
    summarise(mpg = sum(mpg), .groups = 'drop')
# A tibble: 2 x 2
# am mpg
#* <dbl> <dbl>
#1 0 326.
#2 1 317.


mtcars %>%
    group_by(am, vs) %>%
    summarise(mpg = sum(mpg), .groups = 'drop')
# A tibble: 4 x 3
# am vs mpg
#* <dbl> <dbl> <dbl>
#1 0 0 181.
#2 0 1 145.
#3 1 0 118.
#4 1 1 199.


mtcars %>%
    group_by(am, vs) %>%
    summarise(mpg = sum(mpg), .groups = 'drop') %>%
    str
#tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
# $ am : num [1:4] 0 0 1 1
# $ vs : num [1:4] 0 1 0 1
# $ mpg: num [1:4] 181 145 118 199

Previously, this message was not issued, and that could lead to situations where the OP does a mutate or something else assuming there is no grouping, resulting in unexpected output. Now, the message gives the user an indication that there is still a grouping attribute to be careful about.

NOTE: .groups is currently experimental in its lifecycle, so the behaviour could be modified in future releases.

Depending on whether we need any further transformation of the data based on the same grouping variables (or not), we can pick the appropriate option in .groups.
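For instance (a small sketch using mtcars, not from the original answer): keeping the default "drop_last" grouping lets a follow-up mutate() work within each 'am' group:

mtcars %>%
    group_by(am, vs) %>%
    summarise(mpg = sum(mpg), .groups = 'drop_last') %>%
    mutate(prop = mpg / sum(mpg))  # sum(mpg) is computed within each 'am'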

Summarise but keep length variable (dplyr)

To get the proportion of respondents who chose an option when that variable is binary, you can take the mean. To do this with your test data, you can use sapply:
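The test data from the question isn't shown; a 0/1 data frame along these lines is assumed here:

test = data.frame(CompanyA = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
                  CompanyB = rep(1, 10),
                  CompanyC = c(1, 1, 1, 1, 0, 1, 1, 1, 0, 1))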

sapply(test, mean)
CompanyA CompanyB CompanyC
0.5 1.0 0.8

If you wanted to do this in a more complicated fashion (say your data is not binary encoded, but is stored as 1 and 2 instead), you could do that with the following:

library(dplyr)
library(tidyr)

test %>%
    gather(key = 'Company') %>%
    group_by(Company) %>%
    summarise(proportion = sum(value == 1) / n())

# A tibble: 3 x 2
Company proportion
<chr> <dbl>
1 CompanyA 0.5
2 CompanyB 1
3 CompanyC 0.8

Including missing values in summarise output

If there is at most one observation per 'seq_num' for each 'id', then it is possible to coerce the zero-length cases to NA by subsetting with [1]:
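(The dat from the question isn't shown; a minimal version consistent with the output below would be:)

dat = data.frame(id = c(1, 1, 2, 2, 3),
                 seq_num = c(0, 1, 0, 1, 0),
                 time = c(4, 5, 6, 7, 9))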

library(dplyr)
dat %>%
    group_by(id) %>%
    summarise(seq_0_time = time[seq_num == 0][1],
              seq_1_time = time[seq_num == 1][1], .groups = 'drop')

-output

# A tibble: 3 × 3
id seq_0_time seq_1_time
<dbl> <dbl> <dbl>
1 1 4 5
2 2 6 7
3 3 9 NA

The point is just that a length-0 result can be padded to length 1 (filled with NA) by indexing past its end. Similarly, this can be used to produce 2, 3, etc. NAs by specifying indices that did not occur:

> with(dat, time[seq_num==1 & id == 3])
numeric(0)
> with(dat, time[seq_num==1 & id == 3][1])
[1] NA
> numeric(0)
numeric(0)
> numeric(0)[1]
[1] NA
> numeric(0)[1:2]
[1] NA NA

Or using length<-

> `length<-`(numeric(0), 3)
[1] NA NA NA

