Complete column with group_by and complete
Using complete from the tidyr package should work; see the tidyr documentation for details.
What probably happened is that you did not remove the grouping: complete then tries to add every combination of YEAR and Region within each group, but all of those combinations are already part of the grouping. So first remove the grouping, then call complete:
datasetALL %>%
  group_by(YEAR, Region) %>%
  summarise(count_number = n()) %>%
  ungroup() %>%
  complete(YEAR, Region, fill = list(count_number = 1))
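As a runnable sketch, with a hypothetical datasetALL in which one YEAR/Region combination never occurs:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the 2021/"B" combination is absent
datasetALL <- tibble(
  YEAR   = c(2020, 2020, 2021),
  Region = c("A", "B", "A")
)

result <- datasetALL %>%
  group_by(YEAR, Region) %>%
  summarise(count_number = n()) %>%
  ungroup() %>%   # without this, complete() runs per group and adds nothing
  complete(YEAR, Region, fill = list(count_number = 1))

result
# 4 rows: every YEAR x Region combination, with count_number = 1 for 2021/"B"
```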
Complete a data.frame with new values by group
You can complete the missing observations per id:
library(dplyr)
df %>% group_by(id) %>% tidyr::complete(year = min(year):max(year), semester)
# id year semester
# <dbl> <dbl> <dbl>
# 1 1 2000 1
# 2 1 2000 2
# 3 1 2001 1
# 4 1 2001 2
# 5 2 1999 1
# 6 2 1999 2
# 7 2 2000 1
# 8 2 2000 2
# 9 2 2001 1
#10 2 2001 2
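The input df is not shown in the question; a hypothetical version consistent with the output above would be:

```r
library(dplyr)
library(tidyr)

# Hypothetical input: id 1 observed in 2000-2001, id 2 in 1999 and 2001
df <- tibble(
  id       = c(1, 1, 2, 2),
  year     = c(2000, 2001, 1999, 2001),
  semester = c(1, 2, 1, 2)
)

out <- df %>%
  group_by(id) %>%
  tidyr::complete(year = min(year):max(year), semester)

out
# 10 rows: 2 years x 2 semesters for id 1, 3 years x 2 semesters for id 2
```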
Using tidyr::complete with group_by
You could do it using complete and group_by, but you have to wrap the call in a do statement:
df %>%
  group_by(ID) %>%
  do(complete(., Col1, Col2, fill = list(ID = .$ID[1])))  # .$ID[1]: fill needs a length-1 value
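A self-contained sketch with hypothetical data (using `.$ID[1]` so the fill value is a single scalar, as `replace_na()` requires):

```r
library(dplyr)
library(tidyr)

# Hypothetical data: ID 1 is missing the (a, y) and (b, x) combinations
df <- tibble(
  ID   = c(1, 1, 2),
  Col1 = c("a", "b", "a"),
  Col2 = c("x", "y", "x")
)

out <- df %>%
  group_by(ID) %>%
  do(complete(., Col1, Col2, fill = list(ID = .$ID[1])))

out
# ID 1 gains the two missing Col1/Col2 combinations; ID 2 stays a single row
```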
Does tidyr complete() use the dplyr group_by() function?
As the reference page you link states:
Turns implicit missing values into explicit missing values. This is a wrapper around expand(), dplyr::left_join() and replace_na() that's useful for completing missing combinations of data.
So the three operations that are used do not include group_by(), and indeed, from a logical standpoint, there is no need for a grouping operation in complete().
Finally, as @Matt states, you can also use is_grouped_df(). This will simply confirm that the data frame is not grouped.
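For instance, a quick sketch:

```r
library(dplyr)

df <- tibble(x = 1:4, g = c("a", "a", "b", "b"))

is_grouped_df(df)               # FALSE: plain tibble
is_grouped_df(group_by(df, g))  # TRUE: grouping attribute is present
```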
Unable to use tidyselect `everything()` in combination with `group_by()` and `fill()`
You can do:
df %>%
  group_by(x1) %>%
  fill(-x1, .direction = "updown")
x1 x2 x3
<chr> <dbl> <dbl>
1 A 8 3
2 A 8 6
3 A 8 5
4 B 5 9
5 B 5 1
6 B 5 9
This behavior is documented in the tidyr documentation (also see the comment from @Gregor):
You can supply bare variable names, select all variables between x and z with x:z, exclude y with -y.
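The original df is not shown in the question; a hypothetical one with scattered NAs illustrates the call:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: NAs scattered through x2 and x3
df <- tibble(
  x1 = c("A", "A", "B", "B"),
  x2 = c(NA, 8, 5, NA),
  x3 = c(3, NA, NA, 9)
)

out <- df %>%
  group_by(x1) %>%
  fill(-x1, .direction = "updown")  # fill every column except the grouping one

out
# all NAs in x2 and x3 are filled within each x1 group
```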
data.table equivalent of tidyr::complete with group_by
Based on some initial benchmarking, the data.table approach seems to be faster:
library(data.table)
setDT(df)[, .(date1 = seq(min(date1), length.out = 3, by = 'month'), date2 = date2[1]), id]
Benchmarks
df <- tibble(   # tibble() replaces the deprecated data_frame()
  id = rep(1:3000, each = 2),
  date1 = rep(as.Date(c("2013-01-01", "2013-02-01", "2015-04-01", "2015-05-01")),
              length.out = 6000),
  date2 = rep(as.Date(c("2012-12-09", "2012-12-09", "2015-03-10", "2015-03-10")),
              length.out = 6000))
system.time({
df %>%
group_by(id) %>%
complete(date1 = seq.Date(from = min(date1),
length.out = 3, by = "month"), date2 = date2[1])
})
# user  system elapsed
# 64.05   21.27   86.05
system.time({
setDT(df)[, .(date1 = seq(min(date1), length.out = 3, by = 'month'), date2 = date2[1]), id]
})
# user  system elapsed
#  0.14    0.00    0.14
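To sanity-check that the two approaches agree, here is a small sketch on hypothetical two-id data:

```r
library(dplyr)
library(tidyr)
library(data.table)

# Hypothetical miniature of the benchmark data: one starting row per id
df <- tibble(
  id = c(1, 2),
  date1 = as.Date(c("2013-01-01", "2015-04-01")),
  date2 = as.Date(c("2012-12-09", "2015-03-10"))
)

dplyr_res <- df %>%
  group_by(id) %>%
  complete(date1 = seq.Date(from = min(date1), length.out = 3, by = "month"),
           date2 = date2[1]) %>%
  ungroup()

dt_res <- as.data.table(df)[, .(date1 = seq(min(date1), length.out = 3, by = "month"),
                                date2 = date2[1]), id]

nrow(dplyr_res)  # 6: three monthly dates per id
nrow(dt_res)     # 6
```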
R: complete/expand a dataset with a new column added
It's pretty messy using a for loop, and it will place GOOOOD at fairly random positions:
library(dplyr)
library(tidyr)   # expand()
library(tibble)  # rownames_to_column()

comp_dummy <- original %>%
  group_by(ID) %>%
  expand(A = A, a = 1:A, B = B, b = 1:B)

original <- original %>%
  group_by(ID, A, B, b) %>%
  summarise(n = n())

vec <- rep(NA_character_, nrow(comp_dummy))
for (i in 1:nrow(original)) {
  x <- original[i, ]
  y <- comp_dummy %>%
    rownames_to_column("row") %>%
    filter(ID == x$ID, A == x$A, B == x$B, b == x$b) %>%
    pull(row)
  z <- sample(y, x$n, replace = FALSE) %>% as.numeric()
  vec[z] <- "GOOOOD"
}
comp_dummy$detail <- vec
comp_dummy
ID A a B b detail
<chr> <dbl> <int> <dbl> <int> <chr>
1 John 3 1 4 1 NA
2 John 3 1 4 2 GOOOOD
3 John 3 1 4 3 NA
4 John 3 1 4 4 NA
5 John 3 2 4 1 NA
6 John 3 2 4 2 NA
7 John 3 2 4 3 NA
8 John 3 2 4 4 NA
9 John 3 3 4 1 NA
10 John 3 3 4 2 GOOOOD
11 John 3 3 4 3 GOOOOD
12 John 3 3 4 4 NA
13 Steve 1 1 2 1 NA
14 Steve 1 1 2 2 GOOOOD
Completing a sequence of integers by group with tidyverse in R
You could do something like:
df %>%
group_by(Group) %>%
mutate(newseq = seq_along(Group) + (first(na.omit(Seq)) - sum(cumall(is.na(Seq)))) - 1) %>%
ungroup()
Or
df %>%
group_by(Group) %>%
mutate(newseq = seq(first(na.omit(Seq)) - sum(cumall(is.na(Seq))), length.out = n())) %>%
ungroup()
Or
df %>%
group_by(Group) %>%
mutate(newseq = 0:(n() - 1) + (first(na.omit(Seq)) - sum(cumall(is.na(Seq))))) %>%
ungroup()
All of these do the same thing: they start the sequence at the first non-NA value minus the number of leading NAs, so the existing non-NA entries keep their original values.
Output
Group Seq newseq
<int> <int> <dbl>
1 1 NA 3
2 1 NA 4
3 1 NA 5
4 1 6 6
5 1 7 7
6 1 8 8
7 1 NA 9
8 1 10 10
9 1 11 11
10 1 NA 12
# ... with 35 more rows
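As a check, group 1 from the output can be reproduced with hypothetical input data:

```r
library(dplyr)

# Hypothetical input consistent with the output above (group 1 only)
df <- tibble(
  Group = rep(1L, 10),
  Seq   = c(NA, NA, NA, 6L, 7L, 8L, NA, 10L, 11L, NA)
)

out <- df %>%
  group_by(Group) %>%
  mutate(newseq = seq_along(Group) + (first(na.omit(Seq)) - sum(cumall(is.na(Seq)))) - 1) %>%
  ungroup()

out$newseq
# 3 4 5 6 7 8 9 10 11 12, matching the output shown above
```

Here first(na.omit(Seq)) is 6 and there are 3 leading NAs, so the sequence starts at 6 - 3 = 3.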