Proper idiom for adding zero count rows in tidyr/dplyr
Since dplyr 0.8 you can do it by setting the parameter .drop = FALSE in group_by:
X.tidy <- X.raw %>% group_by(x, y, .drop = FALSE) %>% summarise(count=sum(z))
X.tidy
# # A tibble: 4 x 3
# # Groups: x [2]
# x y count
# <fct> <fct> <int>
# 1 A i 1
# 2 A ii 5
# 3 B i 15
# 4 B ii 0
This keeps a group for every level of each factor column, including levels with no rows. Character columns are not expanded this way, so convert them to factors first if you need their empty combinations too (thanks to Pake for the note).
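A minimal sketch of that conversion, assuming a made-up data frame X.chr (not the X.raw from the question) whose grouping columns are character vectors. The columns are turned into factors before grouping, so the combination that never occurs still shows up with a count of 0.

```r
library(dplyr)

# Hypothetical data: x and y are character columns; the (B, ii) pair never occurs.
X.chr <- data.frame(
  x = c("A", "A", "B"),
  y = c("i", "ii", "i"),
  z = c(1L, 5L, 15L),
  stringsAsFactors = FALSE
)

X.counts <- X.chr %>%
  # Convert the grouping columns to factors so .drop = FALSE can see their levels
  mutate(across(c(x, y), factor)) %>%
  group_by(x, y, .drop = FALSE) %>%
  summarise(count = sum(z), .groups = "drop")

X.counts
```

The across() helper needs dplyr 1.0 or later; on older versions mutate_at() or explicit factor() calls would play the same role.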
Inserting rows into data frame when values missing in category
Option 1
Thanks to @Frank for the better solution, using tidyr:
library(tidyr)
complete(df, day, product, fill = list(sales = 0))
Using this approach, you no longer need to worry about selecting product names, etc.
Which gives you:
day product sales
1 a 1 0.52042809
2 b 1 0.00000000
3 c 1 0.46373882
4 a 2 0.11155348
5 b 2 0.04937618
6 c 2 0.26433153
7 a 3 0.69100939
8 b 3 0.90596172
9 c 3 0.00000000
Option 2
You can do this using the tidyr package (and dplyr):
df %>%
spread(product, sales, fill = 0) %>%
gather(`1`:`3`, key = "product", value = "sales")
Which gives the same result.
This works by using spread to create a wide data frame, with each product as its own column. The argument fill = 0 will cause all empty cells to be filled with a 0 (the default is NA).
Next, gather converts the 'wide' data frame back into the original 'long' data frame. The first argument selects the product columns (in this case `1`:`3`, with backticks because the column names are numbers). We then set the key and value to the original column names.
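In current tidyr, spread and gather are superseded by pivot_wider and pivot_longer. The sketch below shows the same wide-then-long round trip with those functions; the data frame sales_long and its values are invented for illustration and are not the data from the question.

```r
library(tidyr)

# Hypothetical long data: (b, 1) and (c, 3) have no recorded sales.
sales_long <- data.frame(
  day = c("a", "c", "a", "b", "c", "a", "b"),
  product = c(1, 1, 2, 2, 2, 3, 3),
  sales = c(0.52, 0.46, 0.11, 0.05, 0.26, 0.69, 0.91)
)

# Widen: one column per product, empty cells filled with 0
sales_wide <- pivot_wider(sales_long,
                          names_from = product,
                          values_from = sales,
                          values_fill = 0)

# Lengthen again: every day/product pair is now present
sales_filled <- pivot_longer(sales_wide,
                             cols = -day,
                             names_to = "product",
                             values_to = "sales")
sales_filled
```

Note that product comes back as a character column here, because it is rebuilt from column names.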
I would suggest option 1, but option 2 might still prove useful in certain circumstances.
Both options should work for all days on which you have at least one sale recorded. If there are missing days, I suggest you look into the package padr to fill them in, and then use the tidyr approach above to do the rest.
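If the day column is numeric, one possible alternative to padr is tidyr's own full_seq inside complete. This is only a sketch under that assumption: the data frame sales_df and its numeric day column are made up here and differ from the data in the question.

```r
library(tidyr)

# Hypothetical data: day 2 has no sales at all, for any product.
sales_df <- data.frame(
  day = c(1, 1, 3, 3),
  product = c("a", "b", "a", "c"),
  sales = c(0.5, 0.2, 0.7, 0.9)
)

# full_seq() expands day to every value from min to max in steps of 1,
# so the missing day is created before the day/product pairs are completed.
filled <- complete(sales_df,
                   day = full_seq(day, period = 1),
                   product,
                   fill = list(sales = 0))
filled
```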
Summarize in dplyr and insert 0 for categories with no values
Using tidyr::complete
library(dplyr)
library(tidyr)
set.seed(2021)
age <- floor(runif(35, min = 20, max = 25))
dat <- data.frame(age)
incomplete_data <- dat %>%
mutate(education = sample(c("Low", "Mid-level", "High"),
size = nrow(dat), prob = c(0.55, 0.2, 0.25), replace = TRUE)) %>%
group_by(age, education) %>%
summarise(n = n(), .groups = "drop")
incomplete_data
# A tibble: 11 x 3
age education n
* <dbl> <chr> <int>
1 20 High 1
2 20 Low 2
3 21 Low 3
4 21 Mid-level 2
5 22 High 2
6 22 Low 4
7 23 Low 4
8 23 Mid-level 2
9 24 High 1
10 24 Low 10
11 24 Mid-level 4
Using the complete function
complete_data <- incomplete_data %>%
complete(age, education, fill = list(n = 0))
Output
# A tibble: 15 x 3
age education n
<dbl> <chr> <dbl>
1 20 High 1
2 20 Low 2
3 20 Mid-level 0
4 21 High 0
5 21 Low 3
6 21 Mid-level 2
7 22 High 2
8 22 Low 4
9 22 Mid-level 0
10 23 High 0
11 23 Low 4
12 23 Mid-level 2
13 24 High 1
14 24 Low 10
15 24 Mid-level 4
dplyr summarise: Equivalent of .drop=FALSE to keep groups with zero length in output
Since dplyr 0.8, group_by has the .drop argument, which does just what you asked for:
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
df$b = factor(df$b, levels=1:3)
df %>%
group_by(b, .drop=FALSE) %>%
summarise(count_a=length(a))
#> # A tibble: 3 x 2
#> b count_a
#> <fct> <int>
#> 1 1 6
#> 2 2 6
#> 3 3 0
One additional note to go with @Moody_Mudskipper's answer: Using .drop=FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See examples below:
library(dplyr)
data(iris)
# Add an additional level to Species
iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))
# Species is a factor and empty groups are included in the output
iris %>% group_by(Species, .drop=FALSE) %>% tally
#> Species n
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
#> 4 empty_level 0
# Add character column
iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))
# Empty groups involving combinations of Species and group2 are not included in output
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 versicolor A 25
#> 4 versicolor B 25
#> 5 virginica B 25
#> 6 virginica C 25
#> 7 empty_level <NA> 0
# Turn group2 into a factor
iris$group2 = factor(iris$group2)
# Now all possible combinations of Species and group2 are included in the output,
# whether present in the data or not
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 setosa C 0
#> 4 versicolor A 25
#> 5 versicolor B 25
#> 6 versicolor C 0
#> 7 virginica A 0
#> 8 virginica B 25
#> 9 virginica C 25
#> 10 empty_level A 0
#> 11 empty_level B 0
#> 12 empty_level C 0
Created on 2019-03-13 by the reprex package (v0.2.1)
How to count rows with NA values across a selection of columns and include 0 count?
You can do:
library(tidyverse)
df %>%
mutate(missing = apply(across(num_range('Var', 2:4)), 1, function(x) any(is.na(x)))) %>%
group_by(ID) %>%
summarize(n = sum(missing))
# A tibble: 3 x 2
ID n
<chr> <int>
1 AL01 2
2 AL02 1
3 AL03 0
R group by show count of all factor levels even when zero dplyr
We can convert 'ID' to a factor with the levels specified and just use table:
table(factor(dat$ID, levels = letters))
Or using the same with tidyverse
library(tidyverse)
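dat %>%
  mutate(ID = factor(ID, levels = letters)) %>%
  # .drop = FALSE keeps unused levels as empty groups, so n() is 0 for them
  # (complete(ID) would add a placeholder row per missing level and count it as 1)
  group_by(ID, .drop = FALSE) %>%
  summarise(no_rows = n())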
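To make the factor-plus-table idea concrete, here is a small base R sketch. The data frame dat below is invented for illustration (the original dat is not shown), and the column name no_rows is borrowed from the pipeline above.

```r
# Hypothetical data: only three of the 26 possible IDs occur.
dat <- data.frame(ID = c("a", "a", "c", "e"), stringsAsFactors = FALSE)

# table() reports every factor level, including those with no rows
counts <- table(factor(dat$ID, levels = letters))

# Convert to a data frame with one row per level
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_df) <- c("ID", "no_rows")
head(counts_df)
```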
Make the value as 0, if rows not available in Kusto
Instead of summarize you need to use make-series, which will fill the gaps with a default value for you.
exceptions
| where name == 'my_scheduler' and timestamp > ago(30d)
| extend day = split(tostring(timestamp + 19800s), 'T')[0]
| make-series count() on tolong(x) step 1
You might want to add from and to to make-series so that it also fills gaps at the beginning and the end of the 30d period.
Add rows to grouped data with dplyr?
Without dplyr it can be done like this:
as.data.frame(xtabs(Demand ~ Week + Article, data))
giving:
Week Article Freq
1 2013-W01 10004 1215
2 2013-W02 10004 900
3 2013-W03 10004 774
4 2013-W04 10004 1170
5 2013-W01 10006 0
6 2013-W02 10006 0
7 2013-W03 10006 0
8 2013-W04 10006 5
9 2013-W01 10007 2
10 2013-W02 10007 0
11 2013-W03 10007 0
12 2013-W04 10007 0
and this can be rewritten as a magrittr or dplyr pipeline like this:
data %>% xtabs(formula = Demand ~ Week + Article) %>% as.data.frame()
The as.data.frame() at the end could be omitted if a wide form solution were desired.
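A small base R sketch of the difference between the two forms, using an invented data frame demand (the original data is not shown, only its output). The xtabs result is already a Week by Article table with 0 for absent pairs; as.data.frame turns it into the long form.

```r
# Hypothetical data: article 10006 has no demand in 2013-W01.
demand <- data.frame(
  Week = c("2013-W01", "2013-W02", "2013-W02"),
  Article = c("10004", "10004", "10006"),
  Demand = c(1215, 900, 5)
)

# Wide form: a contingency table with one row per week, one column per article
wide <- xtabs(Demand ~ Week + Article, demand)

# Long form: one row per Week/Article pair, sums in the Freq column
long <- as.data.frame(wide)

wide
long
```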
How to use R dplyr's summarize to count the number of rows that match a criteria?
You can use sum on logical vectors - it will automatically convert them into numeric values (TRUE being equal to 1 and FALSE being equal to 0), so you need only do:
test %>%
group_by(location) %>%
summarize(total_score = sum(score),
n_outliers = sum(more_than_300))
#> # A tibble: 2 x 3
#> location total_score n_outliers
#> <chr> <dbl> <int>
#> 1 away 927 2
#> 2 home 552 0
Or, if these are your only 3 columns, an equivalent would be:
test %>%
group_by(location) %>%
summarize(across(everything(), sum))
In fact, you don't need to make the more_than_300 column - it would suffice to do:
test %>%
group_by(location) %>%
summarize(total_score = sum(score),
n_outliers = sum(score > 300))
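The same trick works outside dplyr. The base R sketch below uses made-up scores (the original test data is not shown) and counts, per location, how many values exceed 300 by summing a logical vector.

```r
# Hypothetical scores; only the column names follow the example above.
test <- data.frame(
  location = c("home", "home", "away", "away", "away"),
  score = c(252, 300, 301, 326, 300)
)

# Comparison yields TRUE/FALSE; sum() treats TRUE as 1 and FALSE as 0
is_outlier <- test$score > 300

# Count the TRUE values within each location
n_outliers <- tapply(is_outlier, test$location, sum)
n_outliers
```

Because the comparison is strict, a score of exactly 300 is not counted.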