Complete Column with Group_By and Complete


Using complete() from the tidyr package should work; see the tidyr documentation for details.

What probably happened is that you did not remove the grouping. complete() then tries to add every combination of YEAR and Region within each group, but each group already contains exactly that one combination, so nothing changes. First remove the grouping with ungroup(), then call complete().

datasetALL %>% 
  group_by(YEAR, Region) %>%
  summarise(count_number = n()) %>%
  ungroup() %>%
  complete(YEAR, Region, fill = list(count_number = 0))
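To see the difference, here is a minimal sketch with hypothetical data; note that the .groups = "drop" argument of summarise() (available in dplyr >= 1.0.0) is an alternative to a separate ungroup() call:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the (2020, "B") combination never occurs.
datasetALL <- tibble(
  YEAR   = c(2019, 2019, 2020),
  Region = c("A", "B", "A")
)

datasetALL %>%
  group_by(YEAR, Region) %>%
  summarise(count_number = n(), .groups = "drop") %>%  # drops the grouping
  complete(YEAR, Region, fill = list(count_number = 0))
# complete() now runs on the ungrouped result and adds a (2020, "B") row.
```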

Complete a data.frame with new values by group

You can complete the missing observations per id:

library(dplyr)

df %>% group_by(id) %>% tidyr::complete(year = min(year):max(year), semester)

#       id  year semester
#    <dbl> <dbl>    <dbl>
#  1     1  2000        1
#  2     1  2000        2
#  3     1  2001        1
#  4     1  2001        2
#  5     2  1999        1
#  6     2  1999        2
#  7     2  2000        1
#  8     2  2000        2
#  9     2  2001        1
# 10     2  2001        2
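For reference, a hypothetical input that reproduces this output (the original df is not shown in the question):

```r
library(dplyr)
library(tidyr)

# Hypothetical data: each id is missing some year/semester combinations.
df <- tibble(
  id       = c(1, 1, 2, 2),
  year     = c(2000, 2001, 1999, 2001),
  semester = c(1, 2, 1, 2)
)

df %>% group_by(id) %>% complete(year = min(year):max(year), semester)
# id 1 expands to 2000:2001 x {1, 2}; id 2 expands to 1999:2001 x {1, 2}.
```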

Using tidyr::complete with group_by

You could do it using complete and group_by, but you have to use a do statement:

df %>% 
  group_by(ID) %>%
  do(complete(., Col1, Col2, fill = list(ID = .$ID)))
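As a side note, do() has been superseded in more recent dplyr versions. Assuming the same hypothetical columns ID, Col1 and Col2, the same idea can be sketched with group_modify(), which hands each group to complete() without its grouping column and re-attaches ID afterwards:

```r
library(dplyr)
library(tidyr)

df %>%
  group_by(ID) %>%
  group_modify(~ complete(.x, Col1, Col2))
# No fill for ID is needed: group_modify() restores the grouping column itself.
```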

Does tidyr complete() use the dplyr group_by() function?

As the reference page you link states:

Turns implicit missing values into explicit missing values. This is a wrapper around expand(), dplyr::left_join() and replace_na() that's useful for completing missing combinations of data.

So the three operations that are used do not include group_by(), and indeed from a logical standpoint there is no need for a grouping operation in complete().

Finally, as @Matt states:

You can also use is_grouped_df()

This will simply confirm that the data frame is not grouped.
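For example, a quick sketch of is_grouped_df() on a built-in dataset:

```r
library(dplyr)

gdf <- mtcars %>% group_by(cyl)
is_grouped_df(gdf)            # TRUE
is_grouped_df(ungroup(gdf))   # FALSE
```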

Unable to use tidyselect `everything()` in combination with `group_by()` and `fill()`

You can do:

df %>%
  group_by(x1) %>%
  fill(-x1, .direction = "updown")

  x1       x2    x3
  <chr> <dbl> <dbl>
1 A         8     3
2 A         8     6
3 A         8     5
4 B         5     9
5 B         5     1
6 B         5     9

This behavior is documented in the tidyr documentation (see also the comment from @Gregor):

You can supply bare variable names, select all variables between x and
z with x:z, exclude y with -y.

data.table equivalent of tidyr::complete with group_by

Based on some initial benchmarking, the data.table approach seems to be faster:

library(data.table)
setDT(df)[, .(date1 = seq(min(date1), length.out = 3, by = 'month'), date2 = date2[1]), id]
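To unpack the idiom: the j expression builds the completed rows per id, and length-1 results such as date2[1] are recycled to the length of the generated date sequence. A self-contained sketch on hypothetical data:

```r
library(data.table)

# Hypothetical single-id example: two observed months, one date2 value.
dt <- data.table(
  id    = c(1L, 1L),
  date1 = as.Date(c("2013-01-01", "2013-02-01")),
  date2 = as.Date("2012-12-09")
)

dt[, .(date1 = seq(min(date1), length.out = 3, by = "month"),
       date2 = date2[1]), by = id]
# Three rows for id 1 (2013-01-01, 2013-02-01, 2013-03-01),
# each paired with date2 = 2012-12-09.
```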

Benchmarks

# data_frame() is deprecated; tibble() is the drop-in replacement
df <- tibble(
  id = rep(1:3000, each = 2),
  date1 = rep(as.Date(c("2013-01-01", "2013-02-01", "2015-04-01", "2015-05-01")),
              length.out = 6000),
  date2 = rep(as.Date(c("2012-12-09", "2012-12-09", "2015-03-10", "2015-03-10")),
              length.out = 6000))

system.time({
  df %>%
    group_by(id) %>%
    complete(date1 = seq.Date(from = min(date1),
                              length.out = 3, by = "month"),
             date2 = date2[1])
})
#  user  system elapsed
# 64.05   21.27   86.05

system.time({
  setDT(df)[, .(date1 = seq(min(date1), length.out = 3, by = 'month'), date2 = date2[1]), id]
})
#  user  system elapsed
#  0.14    0.00    0.14

R: complete/expand a dataset with a new column added

This approach is fairly messy, since it relies on a for loop, and it assigns the "GOOOOD" values to random positions:

library(dplyr)
library(tidyr)
library(tibble)   # for rownames_to_column()

comp_dummy <- original %>%
  group_by(ID) %>%
  expand(A = A, a = 1:A, B = B, b = 1:B)

original <- original %>%
  group_by(ID, A, B, b) %>%
  summarise(n = n())

vec <- rep(NA_character_, nrow(comp_dummy))

for (i in 1:nrow(original)) {
  x <- original[i, ]

  # row numbers in comp_dummy matching this (ID, A, B, b) combination
  y <- comp_dummy %>%
    rownames_to_column(., "row") %>%
    filter(ID == x$ID, A == x$A, B == x$B, b == x$b) %>%
    pull(row)
  # pick n of those rows at random
  z <- sample(y, x$n, replace = FALSE) %>% as.numeric()
  print(z)
  vec[z] <- "GOOOOD"
}

comp_dummy$detail <- vec
comp_dummy

   ID        A     a     B     b detail
   <chr> <dbl> <int> <dbl> <int> <chr>
 1 John      3     1     4     1 NA
 2 John      3     1     4     2 GOOOOD
 3 John      3     1     4     3 NA
 4 John      3     1     4     4 NA
 5 John      3     2     4     1 NA
 6 John      3     2     4     2 NA
 7 John      3     2     4     3 NA
 8 John      3     2     4     4 NA
 9 John      3     3     4     1 NA
10 John      3     3     4     2 GOOOOD
11 John      3     3     4     3 GOOOOD
12 John      3     3     4     4 NA
13 Steve     1     1     2     1 NA
14 Steve     1     1     2     2 GOOOOD

Completing a sequence of integers by group with tidyverse in R

You could do something like:

df %>% 
  group_by(Group) %>%
  mutate(newseq = seq_along(Group) + (first(na.omit(Seq)) - sum(cumall(is.na(Seq)))) - 1) %>%
  ungroup()

Or

df %>% 
  group_by(Group) %>%
  mutate(newseq = seq(first(na.omit(Seq)) - sum(cumall(is.na(Seq))), length.out = n())) %>%
  ungroup()

Or

df %>% 
  group_by(Group) %>%
  mutate(newseq = 0:(n() - 1) + (first(na.omit(Seq)) - sum(cumall(is.na(Seq))))) %>%
  ungroup()

All these do the same thing: shift the start of the sequence by the difference of the first non-NA value and the number of NAs before it.
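To make the offset concrete, here is a sketch on a toy vector modelled on the example data. cumall(is.na(Seq)) is TRUE only for the leading run of NAs, so its sum counts the NAs before the first observed value:

```r
library(dplyr)

Seq <- c(NA, NA, NA, 6, 7, 8, NA, 10)

sum(cumall(is.na(Seq)))   # 3: number of leading NAs
first(na.omit(Seq))       # 6: first observed value
# Start of the reconstructed sequence: 6 - 3 = 3, so newseq is 3, 4, 5, ...
```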

Output

   Group   Seq newseq
   <int> <int>  <dbl>
 1     1    NA      3
 2     1    NA      4
 3     1    NA      5
 4     1     6      6
 5     1     7      7
 6     1     8      8
 7     1    NA      9
 8     1    10     10
 9     1    11     11
10     1    NA     12
# ... with 35 more rows

