Filling Missing Dates in a Grouped Time Series - a Tidyverse-Way

Filling missing dates in a grouped time series - a tidyverse-way?

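tidyr has some great tools for these sorts of problems. Take a look at `complete()`: combined with `nesting()`, it adds a row for every date in the range for each `d1`/`d2` pair that already exists in the data.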

library(dplyr)
library(tidyr)
library(lubridate)

want <- df.missing %>% 
  ungroup() %>%
  complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"))

want %>% filter(d1 == "A" & d2 == 5) 

#> # A tibble: 10 x 5
#>        d1    d2       date         v1        v2
#>    <fctr> <dbl>     <date>      <dbl>     <dbl>
#>  1      A     5 2017-01-01         NA        NA
#>  2      A     5 2017-01-02 0.21879954 0.1335497
#>  3      A     5 2017-01-03 0.32977018 0.9802127
#>  4      A     5 2017-01-04 0.23902573 0.1206089
#>  5      A     5 2017-01-05 0.19617465 0.7378315
#>  6      A     5 2017-01-06 0.13373890 0.9493668
#>  7      A     5 2017-01-07 0.48613541 0.3392834
#>  8      A     5 2017-01-08 0.35698708 0.3696965
#>  9      A     5 2017-01-09 0.08498474 0.8354756
#> 10      A     5 2017-01-10         NA        NA
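
Since `df.missing` is not shown here, the following is a minimal sketch with a made-up stand-in (the column values are assumptions, not the original data). It illustrates what `nesting()` changes: only the `d1`/`d2` pairs that occur in the data are expanded, rather than every possible combination.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for df.missing: two groups, each with gaps in its dates
df.missing <- tibble(
  d1   = c("A", "A", "A", "B", "B"),
  d2   = c(5, 5, 5, 6, 6),
  date = as.Date(c("2017-01-01", "2017-01-02", "2017-01-04",
                   "2017-01-01", "2017-01-04")),
  v1   = c(0.1, 0.2, 0.3, 0.4, 0.5)
)

# nesting(d1, d2) keeps only the (d1, d2) pairs present in the data
nested <- df.missing %>%
  complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"))

# Without nesting(), every d1 value is crossed with every d2 value
crossed <- df.missing %>%
  complete(d1, d2, date = seq(min(date), max(date), by = "day"))

nrow(nested)   # 2 existing pairs x 4 days
nrow(crossed)  # 2 d1 values x 2 d2 values x 4 days
```

Here `min(date)` and `max(date)` are taken over the whole ungrouped table, so every pair gets the same date range.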

Fill missing dates by group

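`tidyr::complete()` turns implicit missing values into explicit ones by adding a row for every missing combination of the columns you give it.

Pass `id` and `date` as the columns (the `...` argument) to expand: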

library(tidyverse)

complete(dat, id, date)


# A tibble: 16 x 3
      id date       value
   <dbl> <date>     <dbl>
 1  1.00 2017-01-01  30.0
 2  1.00 2017-02-01  30.0
 3  1.00 2017-03-01  NA  
 4  1.00 2017-04-01  25.0
 5  2.00 2017-01-01  NA  
 6  2.00 2017-02-01  25.0
 7  2.00 2017-03-01  NA  
 8  2.00 2017-04-01  NA  
 9  3.00 2017-01-01  25.0
10  3.00 2017-02-01  25.0
11  3.00 2017-03-01  25.0
12  3.00 2017-04-01  NA  
13  4.00 2017-01-01  20.0
14  4.00 2017-02-01  20.0
15  4.00 2017-03-01  NA  
16  4.00 2017-04-01  20.0
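
The input `dat` is not shown, so here is a sketch with a possible input reconstructed from the non-NA rows of the output above (an assumption). It also shows the `fill` argument, in case zeros are preferred over `NA`.

```r
library(tidyr)

# Hypothetical input: the 10 observed rows implied by the output above
dat <- tibble::tibble(
  id = c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
  date = as.Date(c("2017-01-01", "2017-02-01", "2017-04-01",
                   "2017-02-01",
                   "2017-01-01", "2017-02-01", "2017-03-01",
                   "2017-01-01", "2017-02-01", "2017-04-01")),
  value = c(30, 30, 25, 25, 25, 25, 25, 20, 20, 20)
)

# Every id is paired with every date that occurs anywhere in the data
full <- complete(dat, id, date)

# Same expansion, but the new rows get 0 instead of NA
zeros <- complete(dat, id, date, fill = list(value = 0))
```

Note that this form only uses dates that appear somewhere in the data. If a month were missing for every `id`, it would not be added; an explicit `seq()` would be needed for that.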

Append missing dates for each combination excluding result column

An option would be to use `group_by()` and take the min and max from the whole 'Date' column (`.$Date` refers to the full data passed through the pipe) instead of the min and max of each group:

library(dplyr)
library(tidyr)
sample_data %>% 
   group_by(A, B, C) %>% 
   complete(Date = seq.Date(min(.$Date), max(.$Date), by="month"))
# A tibble: 14 x 5
# Groups:   A, B, C [2]
#   A         B C     Date       Result
#   <chr> <dbl> <chr> <date>      <dbl>
# 1 AL      123 12    2014-01-01     NA
# 2 AL      123 12    2014-02-01  12345
# 3 AL      123 12    2014-03-01     NA
# 4 AL      123 12    2014-04-01  12349
# 5 AL      123 12    2014-05-01     NA
# 6 AL      123 12    2014-06-01  12977
# 7 AL      123 12    2014-07-01     NA
# 8 AZ      123 12    2014-01-01  23435
# 9 AZ      123 12    2014-02-01     NA
#10 AZ      123 12    2014-03-01     NA
#11 AZ      123 12    2014-04-01 453454
#12 AZ      123 12    2014-05-01     NA
#13 AZ      123 12    2014-06-01     NA
#14 AZ      123 12    2014-07-01 123976
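
To make the difference concrete, here is a small sketch comparing the per-group range with the global range. The input is a reconstruction from the non-NA rows of the output above (an assumption, since `sample_data` is not shown).

```r
library(dplyr)
library(tidyr)

# Hypothetical input, reconstructed from the non-NA rows of the output above
sample_data <- tibble(
  A = rep(c("AL", "AZ"), each = 3),
  B = 123,
  C = "12",
  Date = as.Date(c("2014-02-01", "2014-04-01", "2014-06-01",
                   "2014-01-01", "2014-04-01", "2014-07-01")),
  Result = c(12345, 12349, 12977, 23435, 453454, 123976)
)

# Per-group range: each group is padded only between its own first and last month
per_group <- sample_data %>%
  group_by(A, B, C) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "month"))

# Global range: .$Date is the whole column, so every group spans Jan to Jul
global <- sample_data %>%
  group_by(A, B, C) %>%
  complete(Date = seq.Date(min(.$Date), max(.$Date), by = "month"))

nrow(per_group)  # AL: Feb-Jun, AZ: Jan-Jul
nrow(global)     # both groups: Jan-Jul
```

The `.` placeholder is a feature of the magrittr pipe (`%>%`), so this relies on using that pipe rather than the native `|>`.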

How to fill in missing dates by minute by group in R

library(dplyr); library(padr)
df %>%
  pad(group = 'Group', interval = 'min') %>%   # Explicitly fill by 1 min
  fill_by_value(Value)

#pad applied on the interval: min
#              Datetime Group Value
#1  2019-01-01 00:00:00     1     5
#2  2019-01-01 00:01:00     1     1
#3  2019-01-01 00:02:00     1     2
#4  2019-01-01 00:03:00     1     1
#5  2019-01-01 00:04:00     1     1
#6  2019-01-01 00:00:00     2     4
#7  2019-01-01 00:01:00     2     0    # added
#8  2019-01-01 00:02:00     2     2
#9  2019-01-01 00:03:00     2     2
#10 2019-01-01 00:00:00     3     2
#11 2019-01-01 00:01:00     3     0    # added
#12 2019-01-01 00:02:00     3     1

Data

df <- read.table(
  header = TRUE,
  stringsAsFactors = FALSE, sep = "|",
  text = "Datetime            | Group | Value
2019-01-01 00:00:00 |  1    |  5  
2019-01-01 00:00:00 |  2    |  4    
2019-01-01 00:00:00 |  3    |  2     
2019-01-01 00:01:00 |  1    |  1  
2019-01-01 00:02:00 |  1    |  2     
2019-01-01 00:02:00 |  2    |  2    
2019-01-01 00:02:00 |  3    |  1    
2019-01-01 00:03:00 |  1    |  1    
2019-01-01 00:03:00 |  2    |  2    
2019-01-01 00:04:00 |  1    |  1"
) 
df$Datetime <- lubridate::ymd_hms(df$Datetime)
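
For comparison, the same padding can probably be done without padr. The sketch below swaps in `tidyr::complete()` on a grouped data frame (a different technique from the answer above) and rebuilds the same ten rows as a tibble so that it is self-contained; the UTC time zone is an assumption.

```r
library(dplyr)
library(tidyr)

# Same ten observations as in the Data block, rebuilt as minute offsets from t0
t0 <- as.POSIXct("2019-01-01 00:00:00", tz = "UTC")
df <- tibble(
  Datetime = t0 + 60 * c(0, 0, 0, 1, 2, 2, 2, 3, 3, 4),
  Group    = c(1, 2, 3, 1, 1, 2, 3, 1, 2, 1),
  Value    = c(5, 4, 2, 1, 2, 2, 1, 1, 2, 1)
)

# Within each group, add every minute between that group's first and last
# timestamp, and set Value to 0 on the rows that were added
filled <- df %>%
  group_by(Group) %>%
  complete(Datetime = seq(min(Datetime), max(Datetime), by = "min"),
           fill = list(Value = 0)) %>%
  ungroup()
```

As in the padr output, each group is only padded within its own time span, so group 3 still ends at 00:02.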

Adding missing dates in time series data

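Use `tidyr::complete()` with an explicit daily sequence, and set the value for the added rows through the `fill` argument: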

library(dplyr)

df %>%
  mutate(Date = as.Date(Date, "%B %d, %Y")) %>%
  tidyr::complete(Date = seq(as.Date('2008-01-01'), as.Date('2020-03-31'), 
                           by = 'day'), fill = list(Val = 1)) %>%
  mutate(Date = format(Date, "%B %d, %Y"))


# A tibble: 4,475 x 2
#   Date               Val
#   <chr>            <dbl>
# 1 January 01, 2008     1
# 2 January 02, 2008     1
# 3 January 03, 2008     1
# 4 January 04, 2008     1
# 5 January 05, 2008    26
# 6 January 06, 2008     1
# 7 January 07, 2008     1
# 8 January 08, 2008     1
# 9 January 09, 2008     1
#10 January 10, 2008     1
# … with 4,465 more rows

Data

df <- structure(list(Date = c("September 16, 2012", "September 19, 2014", 
"January 05, 2008", "June 07, 2017", "December 15, 2019", "May 28, 2020"
), Val = c(32L, 33L, 26L, 2L, 3L, 18L)), class = "data.frame", 
row.names = c(NA, -6L))
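
One detail worth noting in the data above: "May 28, 2020" lies outside the requested range, yet it still ends up in the result, because `complete()` keeps existing rows that are not part of the expansion. A minimal sketch with made-up ISO dates (chosen to avoid locale-dependent month names):

```r
library(tidyr)

# Hypothetical two-row input; the second date falls after the requested range
df <- tibble::tibble(
  Date = as.Date(c("2020-03-29", "2020-05-28")),
  Val  = c(3, 18)
)

out <- complete(
  df,
  Date = seq(as.Date("2020-03-28"), as.Date("2020-03-31"), by = "day"),
  fill = list(Val = 1)
)

# The four days of the sequence are present, plus the out-of-range row
out
```

If rows outside the range are not wanted, they would need to be filtered out separately.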

How do you populate missing dates for lag?

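`dplyr::summarise()` can be used to add rows for the next 7 days for each combination of name and date. (From dplyr 1.1.0 onward, returning several rows per group from `summarise()` is deprecated in favour of `reframe()`.)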

library(tidyverse)

ndays=7

df.filled = df %>% 
  mutate(date = as.Date(date)) %>% 
  arrange(name, date) %>% 
  group_by(name, date, hits) %>%
  summarise(date = date + 0:ndays, 
            hits = c(hits, rep(0, ndays))) %>% 
  ungroup()

df.filled %>% filter(name=="Joe") %>% print(n=Inf)
#> # A tibble: 48 × 3
#>    name  date        hits
#>    <chr> <date>     <dbl>
#>  1 Joe   2004-02-01     5
#>  2 Joe   2004-02-02     0
#>  3 Joe   2004-02-03     0
#>  4 Joe   2004-02-04     0
#>  5 Joe   2004-02-05     0
#>  6 Joe   2004-02-06     0
#>  7 Joe   2004-02-07     0
#>  8 Joe   2004-02-08     0
#>  9 Joe   2004-03-05     4
#> 10 Joe   2004-03-06     0
#> 11 Joe   2004-03-07     0
#> 12 Joe   2004-03-08     0
#> 13 Joe   2004-03-09     0
#> 14 Joe   2004-03-10     0
#> 15 Joe   2004-03-11     0
#> 16 Joe   2004-03-12     0
#> 17 Joe   2004-08-09    10
#> 18 Joe   2004-08-10     0
#> 19 Joe   2004-08-11     0
#> 20 Joe   2004-08-12     0
#> 21 Joe   2004-08-13     0
#> 22 Joe   2004-08-14     0
#> 23 Joe   2004-08-15     0
#> 24 Joe   2004-08-16     0
#> 25 Joe   2004-08-13     9
#> 26 Joe   2004-08-14     0
#> 27 Joe   2004-08-15     0
#> 28 Joe   2004-08-16     0
#> 29 Joe   2004-08-17     0
#> 30 Joe   2004-08-18     0
#> 31 Joe   2004-08-19     0
#> 32 Joe   2004-08-20     0
#> 33 Joe   2004-10-20    15
#> 34 Joe   2004-10-21     0
#> 35 Joe   2004-10-22     0
#> 36 Joe   2004-10-23     0
#> 37 Joe   2004-10-24     0
#> 38 Joe   2004-10-25     0
#> 39 Joe   2004-10-26     0
#> 40 Joe   2004-10-27     0
#> 41 Joe   2004-11-02     1
#> 42 Joe   2004-11-03     0
#> 43 Joe   2004-11-04     0
#> 44 Joe   2004-11-05     0
#> 45 Joe   2004-11-06     0
#> 46 Joe   2004-11-07     0
#> 47 Joe   2004-11-08     0
#> 48 Joe   2004-11-09     0

Note, however, that the code above can produce repeated dates if a given name has two dates that are less than 7 days apart (see Joe's rows for 2004-08-13 to 2004-08-16 above). It's therefore probably safer to fill in every date from the first to the last + 7 days for each name, and then join that back to the original data to populate the dates that have non-zero hits:

df$date = as.Date(df$date)

df.filled2 = df %>% 
  group_by(name) %>% 
  summarise(date = seq(min(date), max(date)+7,"1 day")) %>% 
  left_join(df) %>%
  mutate(hits=replace_na(hits, 0))

df.filled2 %>% filter(name=="Joe") %>% print(n=Inf)
#> # A tibble: 283 × 3
#> # Groups:   name [1]
#>     name  date        hits
#>     <chr> <date>     <dbl>
#>   1 Joe   2004-02-01     5
#>   2 Joe   2004-02-02     0
#>   3 Joe   2004-02-03     0
#>   4 Joe   2004-02-04     0
#>   5 Joe   2004-02-05     0
#>   6 Joe   2004-02-06     0
#>   7 Joe   2004-02-07     0
#>   8 Joe   2004-02-08     0
#>   9 Joe   2004-02-09     0
#>  10 Joe   2004-02-10     0
#>  11 Joe   2004-02-11     0
#>  12 Joe   2004-02-12     0
#>  13 Joe   2004-02-13     0
#>  14 Joe   2004-02-14     0
#>  15 Joe   2004-02-15     0
#>  16 Joe   2004-02-16     0
#>  17 Joe   2004-02-17     0
#>  18 Joe   2004-02-18     0
#>  19 Joe   2004-02-19     0
#>  20 Joe   2004-02-20     0
#>  21 Joe   2004-02-21     0
#>  22 Joe   2004-02-22     0
#>  23 Joe   2004-02-23     0
#>  24 Joe   2004-02-24     0
#>  25 Joe   2004-02-25     0
#>  26 Joe   2004-02-26     0
#>  27 Joe   2004-02-27     0
#>  28 Joe   2004-02-28     0
#>  29 Joe   2004-02-29     0
#>  30 Joe   2004-03-01     0
#>  31 Joe   2004-03-02     0
#>  32 Joe   2004-03-03     0
#>  33 Joe   2004-03-04     0
#>  34 Joe   2004-03-05     4
#>  35 Joe   2004-03-06     0
#>  36 Joe   2004-03-07     0
#>  37 Joe   2004-03-08     0
#>  38 Joe   2004-03-09     0
#>  39 Joe   2004-03-10     0
#>  40 Joe   2004-03-11     0
#>  41 Joe   2004-03-12     0
#>  42 Joe   2004-03-13     0
#>  43 Joe   2004-03-14     0
#>  44 Joe   2004-03-15     0
#>  45 Joe   2004-03-16     0
#>  46 Joe   2004-03-17     0
#>  47 Joe   2004-03-18     0
#>  48 Joe   2004-03-19     0
#>  49 Joe   2004-03-20     0
#>  50 Joe   2004-03-21     0
#>  51 Joe   2004-03-22     0
#>  52 Joe   2004-03-23     0
#>  53 Joe   2004-03-24     0
#>  54 Joe   2004-03-25     0
#>  55 Joe   2004-03-26     0
#>  56 Joe   2004-03-27     0
#>  57 Joe   2004-03-28     0
#>  58 Joe   2004-03-29     0
#>  59 Joe   2004-03-30     0
#>  60 Joe   2004-03-31     0
#>  61 Joe   2004-04-01     0
#>  62 Joe   2004-04-02     0
#>  63 Joe   2004-04-03     0
#>  64 Joe   2004-04-04     0
#>  65 Joe   2004-04-05     0
#>  66 Joe   2004-04-06     0
#>  67 Joe   2004-04-07     0
#>  68 Joe   2004-04-08     0
#>  69 Joe   2004-04-09     0
#>  70 Joe   2004-04-10     0
#>  71 Joe   2004-04-11     0
#>  72 Joe   2004-04-12     0
#>  73 Joe   2004-04-13     0
#>  74 Joe   2004-04-14     0
#>  75 Joe   2004-04-15     0
#>  76 Joe   2004-04-16     0
#>  77 Joe   2004-04-17     0
#>  78 Joe   2004-04-18     0
#>  79 Joe   2004-04-19     0
#>  80 Joe   2004-04-20     0
#>  81 Joe   2004-04-21     0
#>  82 Joe   2004-04-22     0
#>  83 Joe   2004-04-23     0
#>  84 Joe   2004-04-24     0
#>  85 Joe   2004-04-25     0
#>  86 Joe   2004-04-26     0
#>  87 Joe   2004-04-27     0
#>  88 Joe   2004-04-28     0
#>  89 Joe   2004-04-29     0
#>  90 Joe   2004-04-30     0
#>  91 Joe   2004-05-01     0
#>  92 Joe   2004-05-02     0
#>  93 Joe   2004-05-03     0
#>  94 Joe   2004-05-04     0
#>  95 Joe   2004-05-05     0
#>  96 Joe   2004-05-06     0
#>  97 Joe   2004-05-07     0
#>  98 Joe   2004-05-08     0
#>  99 Joe   2004-05-09     0
#> 100 Joe   2004-05-10     0
#> 101 Joe   2004-05-11     0
#> 102 Joe   2004-05-12     0
#> 103 Joe   2004-05-13     0
#> 104 Joe   2004-05-14     0
#> 105 Joe   2004-05-15     0
#> 106 Joe   2004-05-16     0
#> 107 Joe   2004-05-17     0
#> 108 Joe   2004-05-18     0
#> 109 Joe   2004-05-19     0
#> 110 Joe   2004-05-20     0
#> 111 Joe   2004-05-21     0
#> 112 Joe   2004-05-22     0
#> 113 Joe   2004-05-23     0
#> 114 Joe   2004-05-24     0
#> 115 Joe   2004-05-25     0
#> 116 Joe   2004-05-26     0
#> 117 Joe   2004-05-27     0
#> 118 Joe   2004-05-28     0
#> 119 Joe   2004-05-29     0
#> 120 Joe   2004-05-30     0
#> 121 Joe   2004-05-31     0
#> 122 Joe   2004-06-01     0
#> 123 Joe   2004-06-02     0
#> 124 Joe   2004-06-03     0
#> 125 Joe   2004-06-04     0
#> 126 Joe   2004-06-05     0
#> 127 Joe   2004-06-06     0
#> 128 Joe   2004-06-07     0
#> 129 Joe   2004-06-08     0
#> 130 Joe   2004-06-09     0
#> 131 Joe   2004-06-10     0
#> 132 Joe   2004-06-11     0
#> 133 Joe   2004-06-12     0
#> 134 Joe   2004-06-13     0
#> 135 Joe   2004-06-14     0
#> 136 Joe   2004-06-15     0
#> 137 Joe   2004-06-16     0
#> 138 Joe   2004-06-17     0
#> 139 Joe   2004-06-18     0
#> 140 Joe   2004-06-19     0
#> 141 Joe   2004-06-20     0
#> 142 Joe   2004-06-21     0
#> 143 Joe   2004-06-22     0
#> 144 Joe   2004-06-23     0
#> 145 Joe   2004-06-24     0
#> 146 Joe   2004-06-25     0
#> 147 Joe   2004-06-26     0
#> 148 Joe   2004-06-27     0
#> 149 Joe   2004-06-28     0
#> 150 Joe   2004-06-29     0
#> 151 Joe   2004-06-30     0
#> 152 Joe   2004-07-01     0
#> 153 Joe   2004-07-02     0
#> 154 Joe   2004-07-03     0
#> 155 Joe   2004-07-04     0
#> 156 Joe   2004-07-05     0
#> 157 Joe   2004-07-06     0
#> 158 Joe   2004-07-07     0
#> 159 Joe   2004-07-08     0
#> 160 Joe   2004-07-09     0
#> 161 Joe   2004-07-10     0
#> 162 Joe   2004-07-11     0
#> 163 Joe   2004-07-12     0
#> 164 Joe   2004-07-13     0
#> 165 Joe   2004-07-14     0
#> 166 Joe   2004-07-15     0
#> 167 Joe   2004-07-16     0
#> 168 Joe   2004-07-17     0
#> 169 Joe   2004-07-18     0
#> 170 Joe   2004-07-19     0
#> 171 Joe   2004-07-20     0
#> 172 Joe   2004-07-21     0
#> 173 Joe   2004-07-22     0
#> 174 Joe   2004-07-23     0
#> 175 Joe   2004-07-24     0
#> 176 Joe   2004-07-25     0
#> 177 Joe   2004-07-26     0
#> 178 Joe   2004-07-27     0
#> 179 Joe   2004-07-28     0
#> 180 Joe   2004-07-29     0
#> 181 Joe   2004-07-30     0
#> 182 Joe   2004-07-31     0
#> 183 Joe   2004-08-01     0
#> 184 Joe   2004-08-02     0
#> 185 Joe   2004-08-03     0
#> 186 Joe   2004-08-04     0
#> 187 Joe   2004-08-05     0
#> 188 Joe   2004-08-06     0
#> 189 Joe   2004-08-07     0
#> 190 Joe   2004-08-08     0
#> 191 Joe   2004-08-09    10
#> 192 Joe   2004-08-10     0
#> 193 Joe   2004-08-11     0
#> 194 Joe   2004-08-12     0
#> 195 Joe   2004-08-13     9
#> 196 Joe   2004-08-14     0
#> 197 Joe   2004-08-15     0
#> 198 Joe   2004-08-16     0
#> 199 Joe   2004-08-17     0
#> 200 Joe   2004-08-18     0
#> 201 Joe   2004-08-19     0
#> 202 Joe   2004-08-20     0
#> 203 Joe   2004-08-21     0
#> 204 Joe   2004-08-22     0
#> 205 Joe   2004-08-23     0
#> 206 Joe   2004-08-24     0
#> 207 Joe   2004-08-25     0
#> 208 Joe   2004-08-26     0
#> 209 Joe   2004-08-27     0
#> 210 Joe   2004-08-28     0
#> 211 Joe   2004-08-29     0
#> 212 Joe   2004-08-30     0
#> 213 Joe   2004-08-31     0
#> 214 Joe   2004-09-01     0
#> 215 Joe   2004-09-02     0
#> 216 Joe   2004-09-03     0
#> 217 Joe   2004-09-04     0
#> 218 Joe   2004-09-05     0
#> 219 Joe   2004-09-06     0
#> 220 Joe   2004-09-07     0
#> 221 Joe   2004-09-08     0
#> 222 Joe   2004-09-09     0
#> 223 Joe   2004-09-10     0
#> 224 Joe   2004-09-11     0
#> 225 Joe   2004-09-12     0
#> 226 Joe   2004-09-13     0
#> 227 Joe   2004-09-14     0
#> 228 Joe   2004-09-15     0
#> 229 Joe   2004-09-16     0
#> 230 Joe   2004-09-17     0
#> 231 Joe   2004-09-18     0
#> 232 Joe   2004-09-19     0
#> 233 Joe   2004-09-20     0
#> 234 Joe   2004-09-21     0
#> 235 Joe   2004-09-22     0
#> 236 Joe   2004-09-23     0
#> 237 Joe   2004-09-24     0
#> 238 Joe   2004-09-25     0
#> 239 Joe   2004-09-26     0
#> 240 Joe   2004-09-27     0
#> 241 Joe   2004-09-28     0
#> 242 Joe   2004-09-29     0
#> 243 Joe   2004-09-30     0
#> 244 Joe   2004-10-01     0
#> 245 Joe   2004-10-02     0
#> 246 Joe   2004-10-03     0
#> 247 Joe   2004-10-04     0
#> 248 Joe   2004-10-05     0
#> 249 Joe   2004-10-06     0
#> 250 Joe   2004-10-07     0
#> 251 Joe   2004-10-08     0
#> 252 Joe   2004-10-09     0
#> 253 Joe   2004-10-10     0
#> 254 Joe   2004-10-11     0
#> 255 Joe   2004-10-12     0
#> 256 Joe   2004-10-13     0
#> 257 Joe   2004-10-14     0
#> 258 Joe   2004-10-15     0
#> 259 Joe   2004-10-16     0
#> 260 Joe   2004-10-17     0
#> 261 Joe   2004-10-18     0
#> 262 Joe   2004-10-19     0
#> 263 Joe   2004-10-20    15
#> 264 Joe   2004-10-21     0
#> 265 Joe   2004-10-22     0
#> 266 Joe   2004-10-23     0
#> 267 Joe   2004-10-24     0
#> 268 Joe   2004-10-25     0
#> 269 Joe   2004-10-26     0
#> 270 Joe   2004-10-27     0
#> 271 Joe   2004-10-28     0
#> 272 Joe   2004-10-29     0
#> 273 Joe   2004-10-30     0
#> 274 Joe   2004-10-31     0
#> 275 Joe   2004-11-01     0
#> 276 Joe   2004-11-02     1
#> 277 Joe   2004-11-03     0
#> 278 Joe   2004-11-04     0
#> 279 Joe   2004-11-05     0
#> 280 Joe   2004-11-06     0
#> 281 Joe   2004-11-07     0
#> 282 Joe   2004-11-08     0
#> 283 Joe   2004-11-09     0

The second approach will in general result in many more rows of data. If you want to keep a maximum of 7 rows after any date with non-zero hits, you can do the following:

df.filled2 = df.filled2 %>% 
  group_by(name) %>% 
  mutate(test=cumsum(hits > 0)) %>% 
  group_by(name, test) %>% 
  slice(1:8) %>%
  ungroup() %>% 
  select(-test)
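
The trimming step can be tried on a small example. The sketch below uses a made-up `df` (the original data is not shown) and, for brevity, swaps the `summarise()` + `left_join()` step for `tidyr::complete()`; the `cumsum()`/`slice()` logic is the same as above.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two hits only 4 days apart, then one a month later
df <- tibble(
  name = "Joe",
  date = as.Date(c("2004-02-01", "2004-02-05", "2004-03-05")),
  hits = c(5, 4, 10)
)

filled <- df %>%
  group_by(name) %>%
  # every day from the first date to the last date + 7, zero hits on new rows
  complete(date = seq(min(date), max(date) + 7, by = "day"),
           fill = list(hits = 0)) %>%
  # run counter: increases by one at each day with non-zero hits
  mutate(run = cumsum(hits > 0)) %>%
  group_by(name, run) %>%
  # keep the hit day plus at most 7 following days in each run
  slice(1:8) %>%
  ungroup() %>%
  select(-run)
```

The first run is cut short by the second hit, so it contributes 4 rows rather than 8, and no date appears twice.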

Is there a way to fill in missing dates with 0s using dplyr?

In sparklyr, you must use Spark functions. This is a job for coalesce. First you have to fill out all the pairs of ids and dates you expect to see, so maybe something like:

all_id <- old_data %>% distinct(id) %>% mutate(common=0)
all_date <- old_data %>% distinct(date) %>% mutate(common=0)
all_both <- all_id %>% full_join(all_date,by='common')
data <- old_data %>%
  right_join(all_both %>% select(-common),by=c('id','date')) %>%
  mutate(value = coalesce(value, 0))

I have assumed you have all the dates and ids you care about in your old data, though that might not be the case.
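
The logic can be checked on a local tibble before running it against Spark. This is only a local sketch with made-up data: it builds the id/date grid with `tidyr::crossing()` instead of the `common`-column join, which is an assumption that holds for local data frames and may not carry over to a Spark table.

```r
library(dplyr)
library(tidyr)

# Hypothetical local data: id 2 has no row for 2017-01-02
old_data <- tibble(
  id    = c(1, 1, 2),
  date  = as.Date(c("2017-01-01", "2017-01-02", "2017-01-01")),
  value = c(10, 20, 30)
)

# All id/date pairs we expect to see (crossing() de-duplicates its inputs)
all_both <- crossing(id = old_data$id, date = old_data$date)

# Keep every expected pair, then replace the missing values with 0
data <- old_data %>%
  right_join(all_both, by = c("id", "date")) %>%
  mutate(value = coalesce(value, 0)) %>%
  arrange(id, date)
```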

Insert rows for missing dates/times

I think the easiest thing is to convert the timestamp to a date-time first, as already described, then convert to zoo and merge with an empty zoo series indexed on the full sequence of times:

library(zoo)

df$timestamp <- as.POSIXct(df$timestamp, format = "%m/%d/%y %H:%M")

df1.zoo <- zoo(df[, -1], df[, 1]) # use the timestamp column as the index

df2 <- merge(df1.zoo, zoo(, seq(start(df1.zoo), end(df1.zoo), by = "min")), all = TRUE)

The start and end come from the original series (`df1.zoo`), and `by` sets the step you need, e.g. `"min"` for your example. `all = TRUE` keeps every index value, so the timestamps that were missing appear with NA values.
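
A self-contained sketch of the same idea, using a made-up three-row data frame (the column names, values, and the UTC time zone are assumptions):

```r
library(zoo)

# Hypothetical data with gaps at 00:02 and 00:03
df <- data.frame(
  timestamp = c("01/01/19 00:00", "01/01/19 00:01", "01/01/19 00:04"),
  value     = c(5, 1, 2)
)
df$timestamp <- as.POSIXct(df$timestamp, format = "%m/%d/%y %H:%M", tz = "UTC")

# Series indexed by timestamp
z <- zoo(df$value, df$timestamp)

# Empty series whose index is every minute between the first and last timestamp
grid <- zoo(, seq(start(z), end(z), by = "min"))

# Merging on the union of both indexes inserts NA at the missing minutes
z_full <- merge(z, grid, all = TRUE)
```

From here, functions such as `na.fill()` or `na.locf()` from zoo can be used if the gaps should hold a value other than `NA`.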
