How to Add Rows with 0 Counts to Summarised Output

Proper idiom for adding zero count rows in tidyr/dplyr

Since dplyr 0.8 you can do it by setting the parameter .drop = FALSE in group_by:

X.tidy <- X.raw %>% group_by(x, y, .drop = FALSE) %>% summarise(count=sum(z))
X.tidy
# # A tibble: 4 x 3
# # Groups: x [2]
# x y count
# <fct> <fct> <int>
# 1 A i 1
# 2 A ii 5
# 3 B i 15
# 4 B ii 0

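This keeps a group for every level of each factor column, including levels that never occur in the data. It only applies to factors, so if your grouping columns are character vectors you will want to convert them to factors first (thanks to Pake for the note).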
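As a minimal sketch of that conversion, assuming a hypothetical X.raw whose x and y columns are character vectors (the original data is not shown here), the columns can be turned into factors before grouping:

```r
library(dplyr)

# Hypothetical input: x and y are character columns, and the
# combination B/ii never occurs.
X.raw <- data.frame(
  x = c("A", "A", "B"),
  y = c("i", "ii", "i"),
  z = c(1, 5, 15),
  stringsAsFactors = FALSE
)

X.tidy <- X.raw %>%
  # convert every character column to a factor so .drop = FALSE can see its levels
  mutate(across(where(is.character), factor)) %>%
  group_by(x, y, .drop = FALSE) %>%
  summarise(count = sum(z), .groups = "drop")

X.tidy
```

The empty B/ii group gets a count of 0 because sum() of an empty vector is 0.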

Inserting rows into data frame when values missing in category

Option 1

Thanks to @Frank for the better solution, using tidyr:

library(tidyr)
complete(df, day, product, fill = list(sales = 0))

Using this approach, you no longer need to worry about selecting product names, etc.

Which gives you:

  day product      sales
1   a       1 0.52042809
2   b       1 0.00000000
3   c       1 0.46373882
4   a       2 0.11155348
5   b       2 0.04937618
6   c       2 0.26433153
7   a       3 0.69100939
8   b       3 0.90596172
9   c       3 0.00000000


Option 2

You can also do this with spread and gather from the tidyr package (plus dplyr for the pipe):

library(dplyr)
library(tidyr)

df %>%
  spread(product, sales, fill = 0) %>%
  gather(`1`:`3`, key = "product", value = "sales")

Which gives the same result.

This works by using spread to create a wide data frame, with each product as its own column. The argument fill = 0 will cause all empty cells to be filled with a 0 (the default is NA).

Next, gather converts the wide data frame back into the original long format. Its first argument selects the product columns (here `1`:`3`, with backticks because the names are not syntactic), and key and value are set to the original column names.
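In tidyr 1.0.0 and later, spread and gather are superseded by pivot_wider and pivot_longer. A sketch of the same round trip with the newer functions, assuming a hypothetical df with day, product and sales columns (the values below are made up for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical data: no row for day "b"/product 1 or day "c"/product 3
df <- data.frame(
  day     = c("a", "c", "a", "b", "c", "a", "b"),
  product = c(1, 1, 2, 2, 2, 3, 3),
  sales   = c(0.52, 0.46, 0.11, 0.05, 0.26, 0.69, 0.91)
)

filled <- df %>%
  # one column per product; missing day/product cells become 0
  pivot_wider(names_from = product, values_from = sales,
              values_fill = list(sales = 0)) %>%
  # back to long format; every column except day is a product column
  pivot_longer(cols = -day, names_to = "product", values_to = "sales")

filled
```

Note that product comes back as a character column, because it has passed through the column names.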

I would suggest option 1, but option 2 might still prove useful in certain circumstances.


Both options only fill in days that have at least one sale recorded for some product. If entire days are missing from the data, I suggest you look into the padr package to add them, and then use the tidyr approach above to do the rest.
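If adding another package is not an option, one alternative (not the padr approach mentioned above) is to hand complete an explicit sequence of dates. This is a minimal sketch that assumes a hypothetical data frame named daily with a Date column called date:

```r
library(tidyr)

# Hypothetical data: 2021-01-02 has no rows at all
daily <- data.frame(
  date    = as.Date(c("2021-01-01", "2021-01-03", "2021-01-03")),
  product = c("p1", "p1", "p2"),
  sales   = c(5, 2, 7)
)

filled <- complete(
  daily,
  # every calendar day between the first and last observed date
  date = seq(min(date), max(date), by = "day"),
  product,
  fill = list(sales = 0)
)

filled
```

This only covers the range between the first and last observed dates; a fixed reporting window would need explicit start and end dates in seq().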

Summarize in dplyr and insert 0 for categories with no values

Using tidyr::complete

library(dplyr)
library(tidyr)
set.seed(2021)

age <- floor(runif(35, min = 20, max = 25))

dat <- data.frame(age)

incomplete_data <- dat %>%
  mutate(education = sample(c("Low", "Mid-level", "High"),
                            size = nrow(dat), prob = c(0.55, 0.2, 0.25), replace = TRUE)) %>%
  group_by(age, education) %>%
  summarise(n = n(), .groups = "drop")

Incomplete data

# A tibble: 11 x 3
age education n
* <dbl> <chr> <int>
1 20 High 1
2 20 Low 2
3 21 Low 3
4 21 Mid-level 2
5 22 High 2
6 22 Low 4
7 23 Low 4
8 23 Mid-level 2
9 24 High 1
10 24 Low 10
11 24 Mid-level 4

Using complete function

complete_data <- incomplete_data %>%
  complete(age, education, fill = list(n = 0))

Output

# A tibble: 15 x 3
age education n
<dbl> <chr> <dbl>
1 20 High 1
2 20 Low 2
3 20 Mid-level 0
4 21 High 0
5 21 Low 3
6 21 Mid-level 2
7 22 High 2
8 22 Low 4
9 22 Mid-level 0
10 23 High 0
11 23 Low 4
12 23 Mid-level 2
13 24 High 1
14 24 Low 10
15 24 Mid-level 4

dplyr summarise: Equivalent of .drop=FALSE to keep groups with zero length in output

Since dplyr 0.8 group_by gained the .drop argument that does just what you asked for:

df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
df$b = factor(df$b, levels=1:3)

df %>%
  group_by(b, .drop=FALSE) %>%
  summarise(count_a=length(a))

#> # A tibble: 3 x 2
#> b count_a
#> <fct> <int>
#> 1 1 6
#> 2 2 6
#> 3 3 0

One additional note to go with @Moody_Mudskipper's answer: Using .drop=FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See examples below:

library(dplyr)
data(iris)

# Add an additional level to Species
iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))

# Species is a factor and empty groups are included in the output
iris %>% group_by(Species, .drop=FALSE) %>% tally

#> Species n
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
#> 4 empty_level 0

# Add character column
iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))

# Empty groups involving combinations of Species and group2 are not included in output
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally

#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 versicolor A 25
#> 4 versicolor B 25
#> 5 virginica B 25
#> 6 virginica C 25
#> 7 empty_level <NA> 0

# Turn group2 into a factor
iris$group2 = factor(iris$group2)

# Now all possible combinations of Species and group2 are included in the output,
# whether present in the data or not
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally

#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 setosa C 0
#> 4 versicolor A 25
#> 5 versicolor B 25
#> 6 versicolor C 0
#> 7 virginica A 0
#> 8 virginica B 25
#> 9 virginica C 25
#> 10 empty_level A 0
#> 11 empty_level B 0
#> 12 empty_level C 0

Created on 2019-03-13 by the reprex package (v0.2.1)

How to count rows with NA values across a selection of columns and include 0 count?

You can do:

library(tidyverse)
df %>%
  mutate(missing = apply(across(num_range('Var', 2:4)), 1, function(x) any(is.na(x)))) %>%
  group_by(ID) %>%
  summarize(n = sum(missing))

# A tibble: 3 x 2
ID n
<chr> <int>
1 AL01 2
2 AL02 1
3 AL03 0
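The row-wise apply can also be replaced with a vectorised rowSums over is.na. The sketch below assumes a hypothetical df with columns ID and Var1 to Var4, chosen so that it reproduces the counts shown above:

```r
library(dplyr)

# Hypothetical data; Var1 is deliberately excluded from the NA check
df <- data.frame(
  ID   = c("AL01", "AL01", "AL01", "AL02", "AL02", "AL03"),
  Var1 = c(NA, 1, 1, 1, 1, NA),
  Var2 = c(1, NA, 1, 1, 1, 1),
  Var3 = c(1, 1, 1, NA, 1, 1),
  Var4 = c(1, NA, NA, NA, 1, 1)
)

# TRUE for rows with at least one NA in Var2..Var4
df$missing <- rowSums(is.na(df[paste0("Var", 2:4)])) > 0

counts <- df %>%
  group_by(ID) %>%
  summarise(n = sum(missing))

counts
```

Because every ID still has rows in the data, IDs with no missing values appear with n equal to 0 without any extra step.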

R group by show count of all factor levels even when zero dplyr

We can convert 'ID' to factor with levels specified and just use table

table(factor(dat$ID, levels = letters))

Or using the same with tidyverse

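library(tidyverse)
dat %>%
  mutate(ID = factor(ID, levels = letters)) %>%
  # .drop = FALSE keeps unused levels as empty groups, so n() is 0 for them;
  # complete(ID) would add an NA row per missing level, which n() counts as 1
  group_by(ID, .drop = FALSE) %>%
  summarise(no_rows = n())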
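A quick self-contained check of the factor-with-levels idea, assuming a hypothetical dat with a single character column ID. It uses count(), which is shorthand for grouping and then counting rows and accepts the same .drop argument as group_by:

```r
library(dplyr)

# Hypothetical data: only "a" and "c" occur
dat <- data.frame(ID = c("a", "a", "c"), stringsAsFactors = FALSE)

res <- dat %>%
  mutate(ID = factor(ID, levels = letters)) %>%
  # keep all 26 levels, including those with no rows
  count(ID, .drop = FALSE, name = "no_rows")

head(res, 3)
```

Every letter that does not occur in dat$ID shows up with no_rows equal to 0.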

Make the value as 0, if rows not available in Kusto

Instead of summarize you need to use make-series which will fill the gaps with a default value for you.

exceptions
| where name == 'my_scheduler' and timestamp > ago(30d)
// shift timestamps by 19800 s (5 h 30 min) before bucketing into days
| extend local_time = timestamp + 19800s
| make-series count() default = 0 on local_time step 1d

You might want to add from and to to make-series in order for it to also fill gaps at the beginning and the end of the 30d period.

Add rows to grouped data with dplyr?

Without dplyr it can be done like this:

as.data.frame(xtabs(Demand ~ Week + Article, data))

giving:

       Week Article Freq
1  2013-W01   10004 1215
2  2013-W02   10004  900
3  2013-W03   10004  774
4  2013-W04   10004 1170
5  2013-W01   10006    0
6  2013-W02   10006    0
7  2013-W03   10006    0
8  2013-W04   10006    5
9  2013-W01   10007    2
10 2013-W02   10007    0
11 2013-W03   10007    0
12 2013-W04   10007    0

and this can be rewritten as a magrittr or dplyr pipeline like this:

data %>% xtabs(formula = Demand ~ Week + Article) %>% as.data.frame()

The as.data.frame() at the end could be omitted if a wide form solution were desired.
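To make the wide and long forms concrete, here is a sketch with a hypothetical data frame rebuilt from the non-zero rows of the output above (the original input is not shown, so its exact column types are an assumption):

```r
# Hypothetical input: only the week/article pairs with recorded demand
data <- data.frame(
  Week    = c("2013-W01", "2013-W02", "2013-W03", "2013-W04", "2013-W04", "2013-W01"),
  Article = c("10004", "10004", "10004", "10004", "10006", "10007"),
  Demand  = c(1215, 900, 774, 1170, 5, 2)
)

# Wide form: a Week x Article contingency table, with 0 for absent pairs
wide <- xtabs(Demand ~ Week + Article, data)

# Long form: one row per Week/Article pair, with the sum in Freq
long <- as.data.frame(wide)

wide
long
```

xtabs sums Demand within each cell, so pairs that never occur end up as 0 in both forms.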

How to use R dplyr's summarize to count the number of rows that match a criteria?

You can use sum on logical vectors - it will automatically convert them into numeric values (TRUE being equal to 1 and FALSE being equal to 0), so you need only do:

test %>%
  group_by(location) %>%
  summarize(total_score = sum(score),
            n_outliers = sum(more_than_300))
#> # A tibble: 2 x 3
#> location total_score n_outliers
#> <chr> <dbl> <int>
#> 1 away 927 2
#> 2 home 552 0

Or, if these are your only 3 columns, an equivalent would be:

test %>%
  group_by(location) %>%
  summarize(across(everything(), sum))

In fact, you don't need to make the more_than_300 column - it would suffice to do:

test %>%
  group_by(location) %>%
  summarize(total_score = sum(score),
            n_outliers = sum(score > 300))

