Filling Missing Dates in a Grouped Time Series - a Tidyverse-Way

tidyr has some great tools for these sorts of problems. Take a look at complete(), which expands a data frame to include every combination of the columns you name.



library(dplyr)
library(tidyr)
library(lubridate)

want <- df.missing %>%
  ungroup() %>%
  complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"))

want %>% filter(d1 == "A" & d2 == 5)

#> # A tibble: 10 x 5
#> d1 d2 date v1 v2
#> <fctr> <dbl> <date> <dbl> <dbl>
#> 1 A 5 2017-01-01 NA NA
#> 2 A 5 2017-01-02 0.21879954 0.1335497
#> 3 A 5 2017-01-03 0.32977018 0.9802127
#> 4 A 5 2017-01-04 0.23902573 0.1206089
#> 5 A 5 2017-01-05 0.19617465 0.7378315
#> 6 A 5 2017-01-06 0.13373890 0.9493668
#> 7 A 5 2017-01-07 0.48613541 0.3392834
#> 8 A 5 2017-01-08 0.35698708 0.3696965
#> 9 A 5 2017-01-09 0.08498474 0.8354756
#> 10 A 5 2017-01-10 NA NA
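
As a minimal, self-contained sketch of the same call (the original df.missing is not shown, so df_toy and its values below are made up for illustration): nesting(d1, d2) restricts the expansion to the d1/d2 pairs that actually occur in the data, and the date sequence is then crossed with each of those pairs.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two (d1, d2) pairs, each missing some days
df_toy <- tibble(
  d1   = c("A", "A", "B", "B"),
  d2   = c(5, 5, 6, 6),
  date = as.Date(c("2017-01-02", "2017-01-04", "2017-01-01", "2017-01-03")),
  v1   = c(0.2, 0.3, 0.4, 0.5)
)

# One row per existing (d1, d2) pair and per day in the overall date range
filled <- df_toy %>%
  complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"))

nrow(filled)           # 2 pairs x 4 days
sum(is.na(filled$v1))  # rows that were added by complete()
```

Without nesting(), d1 and d2 would be crossed with each other as well, which would also create pairs such as ("A", 6) that never occur in the data.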

Fill missing dates by group


tidyr::complete() turns implicit missing values into explicit rows.

Pass id and date as the columns (...) to expand:

library(tidyverse)

complete(dat, id, date)


# A tibble: 16 x 3
id date value
<dbl> <date> <dbl>
1 1.00 2017-01-01 30.0
2 1.00 2017-02-01 30.0
3 1.00 2017-03-01 NA
4 1.00 2017-04-01 25.0
5 2.00 2017-01-01 NA
6 2.00 2017-02-01 25.0
7 2.00 2017-03-01 NA
8 2.00 2017-04-01 NA
9 3.00 2017-01-01 25.0
10 3.00 2017-02-01 25.0
11 3.00 2017-03-01 25.0
12 3.00 2017-04-01 NA
13 4.00 2017-01-01 20.0
14 4.00 2017-02-01 20.0
15 4.00 2017-03-01 NA
16 4.00 2017-04-01 20.0
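
The input dat is not shown above. A hypothetical version, inferred from the non-NA rows of the printed result, makes the call reproducible. The fill argument of complete() can then replace the new NA values with a constant.

```r
library(tidyr)

# Hypothetical input, inferred from the non-NA rows of the output above
dat <- data.frame(
  id    = c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
  date  = as.Date(c("2017-01-01", "2017-02-01", "2017-04-01",
                    "2017-02-01",
                    "2017-01-01", "2017-02-01", "2017-03-01",
                    "2017-01-01", "2017-02-01", "2017-04-01")),
  value = c(30, 30, 25, 25, 25, 25, 25, 20, 20, 20)
)

# Every id crossed with every date that appears anywhere in the data
completed <- complete(dat, id, date)

# Same expansion, but the new rows get value = 0 instead of NA
completed_zero <- complete(dat, id, date, fill = list(value = 0))
```

Note that complete(dat, id, date) only uses dates that occur somewhere in the data. A month that is missing for every id would need an explicit sequence, for example date = seq(min(date), max(date), by = "month").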

Append missing dates for each combination excluding result column

One option is to group_by() and build the sequence from the min and max of the whole 'Date' column (.$Date) instead of the min and max of each group:

library(dplyr)
library(tidyr)
sample_data %>%
  group_by(A, B, C) %>%
  complete(Date = seq.Date(min(.$Date), max(.$Date), by = "month"))
# A tibble: 14 x 5
# Groups: A, B, C [2]
# A B C Date Result
# <chr> <dbl> <chr> <date> <dbl>
# 1 AL 123 12 2014-01-01 NA
# 2 AL 123 12 2014-02-01 12345
# 3 AL 123 12 2014-03-01 NA
# 4 AL 123 12 2014-04-01 12349
# 5 AL 123 12 2014-05-01 NA
# 6 AL 123 12 2014-06-01 12977
# 7 AL 123 12 2014-07-01 NA
# 8 AZ 123 12 2014-01-01 23435
# 9 AZ 123 12 2014-02-01 NA
#10 AZ 123 12 2014-03-01 NA
#11 AZ 123 12 2014-04-01 453454
#12 AZ 123 12 2014-05-01 NA
#13 AZ 123 12 2014-06-01 NA
#14 AZ 123 12 2014-07-01 123976
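
To make the difference between the two ranges concrete, here is a small sketch on made-up data (sample_toy and its values are assumptions, not the original sample_data). It computes the global month sequence once up front, which has the same effect as .$Date, and compares it with a per-group range.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: AL spans Feb-Apr, AZ spans Jan-May
sample_toy <- tibble(
  A      = c("AL", "AL", "AZ", "AZ"),
  Date   = as.Date(c("2014-02-01", "2014-04-01", "2014-01-01", "2014-05-01")),
  Result = c(1, 2, 3, 4)
)

# Month sequence over the whole column, independent of any grouping
all_months <- seq(min(sample_toy$Date), max(sample_toy$Date), by = "month")

# Every group gets the full Jan-May range
global_range <- sample_toy %>%
  group_by(A) %>%
  complete(Date = all_months) %>%
  ungroup()

# Each group is only filled between its own first and last month
per_group <- sample_toy %>%
  group_by(A) %>%
  complete(Date = seq(min(Date), max(Date), by = "month")) %>%
  ungroup()

nrow(global_range)  # 2 groups x 5 months
nrow(per_group)     # 3 months for AL + 5 months for AZ
```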

How to fill in missing dates by minute by group in R


library(dplyr)
library(padr)

df %>%
  pad(group = 'Group', interval = 'min') %>% # Explicitly fill by 1 min
  fill_by_value(Value)                       # Replace the new NAs in Value with 0 (the default)

#pad applied on the interval: min
# Datetime Group Value
#1 2019-01-01 00:00:00 1 5
#2 2019-01-01 00:01:00 1 1
#3 2019-01-01 00:02:00 1 2
#4 2019-01-01 00:03:00 1 1
#5 2019-01-01 00:04:00 1 1
#6 2019-01-01 00:00:00 2 4
#7 2019-01-01 00:01:00 2 0 # added
#8 2019-01-01 00:02:00 2 2
#9 2019-01-01 00:03:00 2 2
#10 2019-01-01 00:00:00 3 2
#11 2019-01-01 00:01:00 3 0 # added
#12 2019-01-01 00:02:00 3 1

Data

df <- read.table(
  header = TRUE,
  stringsAsFactors = FALSE, sep = "|",
  text = "Datetime | Group | Value
2019-01-01 00:00:00 | 1 | 5
2019-01-01 00:00:00 | 2 | 4
2019-01-01 00:00:00 | 3 | 2
2019-01-01 00:01:00 | 1 | 1
2019-01-01 00:02:00 | 1 | 2
2019-01-01 00:02:00 | 2 | 2
2019-01-01 00:02:00 | 3 | 1
2019-01-01 00:03:00 | 1 | 1
2019-01-01 00:03:00 | 2 | 2
2019-01-01 00:04:00 | 1 | 1"
)
df$Datetime <- lubridate::ymd_hms(df$Datetime)
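
If padr is not available, the same per-group, per-minute filling can be sketched with tidyr::complete(). This is an alternative technique, not what pad() does internally; df_min below rebuilds the data above directly with as.POSIXct() in UTC, which is an assumption about the intended time zone.

```r
library(dplyr)
library(tidyr)

# Same observations as in the Data block, built without read.table()
base_time <- as.POSIXct("2019-01-01 00:00:00", tz = "UTC")
df_min <- tibble(
  Datetime = base_time + 60 * c(0, 0, 0, 1, 2, 2, 2, 3, 3, 4),
  Group    = c(1, 2, 3, 1, 1, 2, 3, 1, 2, 1),
  Value    = c(5, 4, 2, 1, 2, 2, 1, 1, 2, 1)
)

# Fill each group minute by minute between its own first and last timestamp,
# setting Value to 0 on the rows that are added
filled_min <- df_min %>%
  group_by(Group) %>%
  complete(Datetime = seq(min(Datetime), max(Datetime), by = "min"),
           fill = list(Value = 0)) %>%
  ungroup()

nrow(filled_min)  # 10 original rows + 2 added rows
```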

Adding missing dates in time series data

Use tidyr::complete():

library(dplyr)

df %>%
  mutate(Date = as.Date(Date, "%B %d, %Y")) %>%
  tidyr::complete(Date = seq(as.Date('2008-01-01'), as.Date('2020-03-31'),
                             by = 'day'), fill = list(Val = 1)) %>%
  mutate(Date = format(Date, "%B %d, %Y"))


# A tibble: 4,475 x 2
# Date Val
# <chr> <dbl>
# 1 January 01, 2008 1
# 2 January 02, 2008 1
# 3 January 03, 2008 1
# 4 January 04, 2008 1
# 5 January 05, 2008 26
# 6 January 06, 2008 1
# 7 January 07, 2008 1
# 8 January 08, 2008 1
# 9 January 09, 2008 1
#10 January 10, 2008 1
# … with 4,465 more rows

data

df <- structure(list(Date = c("September 16, 2012", "September 19, 2014", 
"January 05, 2008", "June 07, 2017", "December 15, 2019", "May 28, 2020"
), Val = c(32L, 33L, 26L, 2L, 3L, 18L)), class = "data.frame",
row.names = c(NA, -6L))

How do you populate missing dates for lag?

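dplyr::summarise() can be used to add rows for the next 7 days for each combination of name and date. (From dplyr 1.1.0 onward, reframe() is the recommended verb for summaries that return more than one row per group.)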

library(tidyverse)

ndays <- 7

df.filled <- df %>%
  mutate(date = as.Date(date)) %>%
  arrange(name, date) %>%
  group_by(name, date, hits) %>%
  summarise(date = date + 0:ndays,
            hits = c(hits, rep(0, ndays))) %>%
  ungroup()

df.filled %>% filter(name=="Joe") %>% print(n=Inf)
#> # A tibble: 48 × 3
#> name date hits
#> <chr> <date> <dbl>
#> 1 Joe 2004-02-01 5
#> 2 Joe 2004-02-02 0
#> 3 Joe 2004-02-03 0
#> 4 Joe 2004-02-04 0
#> 5 Joe 2004-02-05 0
#> 6 Joe 2004-02-06 0
#> 7 Joe 2004-02-07 0
#> 8 Joe 2004-02-08 0
#> 9 Joe 2004-03-05 4
#> 10 Joe 2004-03-06 0
#> 11 Joe 2004-03-07 0
#> 12 Joe 2004-03-08 0
#> 13 Joe 2004-03-09 0
#> 14 Joe 2004-03-10 0
#> 15 Joe 2004-03-11 0
#> 16 Joe 2004-03-12 0
#> 17 Joe 2004-08-09 10
#> 18 Joe 2004-08-10 0
#> 19 Joe 2004-08-11 0
#> 20 Joe 2004-08-12 0
#> 21 Joe 2004-08-13 0
#> 22 Joe 2004-08-14 0
#> 23 Joe 2004-08-15 0
#> 24 Joe 2004-08-16 0
#> 25 Joe 2004-08-13 9
#> 26 Joe 2004-08-14 0
#> 27 Joe 2004-08-15 0
#> 28 Joe 2004-08-16 0
#> 29 Joe 2004-08-17 0
#> 30 Joe 2004-08-18 0
#> 31 Joe 2004-08-19 0
#> 32 Joe 2004-08-20 0
#> 33 Joe 2004-10-20 15
#> 34 Joe 2004-10-21 0
#> 35 Joe 2004-10-22 0
#> 36 Joe 2004-10-23 0
#> 37 Joe 2004-10-24 0
#> 38 Joe 2004-10-25 0
#> 39 Joe 2004-10-26 0
#> 40 Joe 2004-10-27 0
#> 41 Joe 2004-11-02 1
#> 42 Joe 2004-11-03 0
#> 43 Joe 2004-11-04 0
#> 44 Joe 2004-11-05 0
#> 45 Joe 2004-11-06 0
#> 46 Joe 2004-11-07 0
#> 47 Joe 2004-11-08 0
#> 48 Joe 2004-11-09 0

Note, however, that the code above can produce repeated dates when a given name has two dates less than 7 days apart (see rows 21-28 above, where 2004-08-13 through 2004-08-16 appear twice for Joe). It is probably safer to fill in every date from the first to the last + 7 days for each name, then join that back to the original data to populate the dates that have non-zero hits:

df$date <- as.Date(df$date)

df.filled2 <- df %>%
  group_by(name) %>%
  summarise(date = seq(min(date), max(date) + 7, "1 day")) %>%
  left_join(df) %>%
  mutate(hits = replace_na(hits, 0))

df.filled2 %>% filter(name=="Joe") %>% print(n=Inf)
#> # A tibble: 283 × 3
#> # Groups: name [1]
#> name date hits
#> <chr> <date> <dbl>
#> 1 Joe 2004-02-01 5
#> 2 Joe 2004-02-02 0
#> 3 Joe 2004-02-03 0
#> 4 Joe 2004-02-04 0
#> 5 Joe 2004-02-05 0
#> 6 Joe 2004-02-06 0
#> 7 Joe 2004-02-07 0
#> 8 Joe 2004-02-08 0
#> 9 Joe 2004-02-09 0
#> 10 Joe 2004-02-10 0
#> 11 Joe 2004-02-11 0
#> 12 Joe 2004-02-12 0
#> 13 Joe 2004-02-13 0
#> 14 Joe 2004-02-14 0
#> 15 Joe 2004-02-15 0
#> 16 Joe 2004-02-16 0
#> 17 Joe 2004-02-17 0
#> 18 Joe 2004-02-18 0
#> 19 Joe 2004-02-19 0
#> 20 Joe 2004-02-20 0
#> 21 Joe 2004-02-21 0
#> 22 Joe 2004-02-22 0
#> 23 Joe 2004-02-23 0
#> 24 Joe 2004-02-24 0
#> 25 Joe 2004-02-25 0
#> 26 Joe 2004-02-26 0
#> 27 Joe 2004-02-27 0
#> 28 Joe 2004-02-28 0
#> 29 Joe 2004-02-29 0
#> 30 Joe 2004-03-01 0
#> 31 Joe 2004-03-02 0
#> 32 Joe 2004-03-03 0
#> 33 Joe 2004-03-04 0
#> 34 Joe 2004-03-05 4
#> 35 Joe 2004-03-06 0
#> 36 Joe 2004-03-07 0
#> 37 Joe 2004-03-08 0
#> 38 Joe 2004-03-09 0
#> 39 Joe 2004-03-10 0
#> 40 Joe 2004-03-11 0
#> 41 Joe 2004-03-12 0
#> 42 Joe 2004-03-13 0
#> 43 Joe 2004-03-14 0
#> 44 Joe 2004-03-15 0
#> 45 Joe 2004-03-16 0
#> 46 Joe 2004-03-17 0
#> 47 Joe 2004-03-18 0
#> 48 Joe 2004-03-19 0
#> 49 Joe 2004-03-20 0
#> 50 Joe 2004-03-21 0
#> 51 Joe 2004-03-22 0
#> 52 Joe 2004-03-23 0
#> 53 Joe 2004-03-24 0
#> 54 Joe 2004-03-25 0
#> 55 Joe 2004-03-26 0
#> 56 Joe 2004-03-27 0
#> 57 Joe 2004-03-28 0
#> 58 Joe 2004-03-29 0
#> 59 Joe 2004-03-30 0
#> 60 Joe 2004-03-31 0
#> 61 Joe 2004-04-01 0
#> 62 Joe 2004-04-02 0
#> 63 Joe 2004-04-03 0
#> 64 Joe 2004-04-04 0
#> 65 Joe 2004-04-05 0
#> 66 Joe 2004-04-06 0
#> 67 Joe 2004-04-07 0
#> 68 Joe 2004-04-08 0
#> 69 Joe 2004-04-09 0
#> 70 Joe 2004-04-10 0
#> 71 Joe 2004-04-11 0
#> 72 Joe 2004-04-12 0
#> 73 Joe 2004-04-13 0
#> 74 Joe 2004-04-14 0
#> 75 Joe 2004-04-15 0
#> 76 Joe 2004-04-16 0
#> 77 Joe 2004-04-17 0
#> 78 Joe 2004-04-18 0
#> 79 Joe 2004-04-19 0
#> 80 Joe 2004-04-20 0
#> 81 Joe 2004-04-21 0
#> 82 Joe 2004-04-22 0
#> 83 Joe 2004-04-23 0
#> 84 Joe 2004-04-24 0
#> 85 Joe 2004-04-25 0
#> 86 Joe 2004-04-26 0
#> 87 Joe 2004-04-27 0
#> 88 Joe 2004-04-28 0
#> 89 Joe 2004-04-29 0
#> 90 Joe 2004-04-30 0
#> 91 Joe 2004-05-01 0
#> 92 Joe 2004-05-02 0
#> 93 Joe 2004-05-03 0
#> 94 Joe 2004-05-04 0
#> 95 Joe 2004-05-05 0
#> 96 Joe 2004-05-06 0
#> 97 Joe 2004-05-07 0
#> 98 Joe 2004-05-08 0
#> 99 Joe 2004-05-09 0
#> 100 Joe 2004-05-10 0
#> 101 Joe 2004-05-11 0
#> 102 Joe 2004-05-12 0
#> 103 Joe 2004-05-13 0
#> 104 Joe 2004-05-14 0
#> 105 Joe 2004-05-15 0
#> 106 Joe 2004-05-16 0
#> 107 Joe 2004-05-17 0
#> 108 Joe 2004-05-18 0
#> 109 Joe 2004-05-19 0
#> 110 Joe 2004-05-20 0
#> 111 Joe 2004-05-21 0
#> 112 Joe 2004-05-22 0
#> 113 Joe 2004-05-23 0
#> 114 Joe 2004-05-24 0
#> 115 Joe 2004-05-25 0
#> 116 Joe 2004-05-26 0
#> 117 Joe 2004-05-27 0
#> 118 Joe 2004-05-28 0
#> 119 Joe 2004-05-29 0
#> 120 Joe 2004-05-30 0
#> 121 Joe 2004-05-31 0
#> 122 Joe 2004-06-01 0
#> 123 Joe 2004-06-02 0
#> 124 Joe 2004-06-03 0
#> 125 Joe 2004-06-04 0
#> 126 Joe 2004-06-05 0
#> 127 Joe 2004-06-06 0
#> 128 Joe 2004-06-07 0
#> 129 Joe 2004-06-08 0
#> 130 Joe 2004-06-09 0
#> 131 Joe 2004-06-10 0
#> 132 Joe 2004-06-11 0
#> 133 Joe 2004-06-12 0
#> 134 Joe 2004-06-13 0
#> 135 Joe 2004-06-14 0
#> 136 Joe 2004-06-15 0
#> 137 Joe 2004-06-16 0
#> 138 Joe 2004-06-17 0
#> 139 Joe 2004-06-18 0
#> 140 Joe 2004-06-19 0
#> 141 Joe 2004-06-20 0
#> 142 Joe 2004-06-21 0
#> 143 Joe 2004-06-22 0
#> 144 Joe 2004-06-23 0
#> 145 Joe 2004-06-24 0
#> 146 Joe 2004-06-25 0
#> 147 Joe 2004-06-26 0
#> 148 Joe 2004-06-27 0
#> 149 Joe 2004-06-28 0
#> 150 Joe 2004-06-29 0
#> 151 Joe 2004-06-30 0
#> 152 Joe 2004-07-01 0
#> 153 Joe 2004-07-02 0
#> 154 Joe 2004-07-03 0
#> 155 Joe 2004-07-04 0
#> 156 Joe 2004-07-05 0
#> 157 Joe 2004-07-06 0
#> 158 Joe 2004-07-07 0
#> 159 Joe 2004-07-08 0
#> 160 Joe 2004-07-09 0
#> 161 Joe 2004-07-10 0
#> 162 Joe 2004-07-11 0
#> 163 Joe 2004-07-12 0
#> 164 Joe 2004-07-13 0
#> 165 Joe 2004-07-14 0
#> 166 Joe 2004-07-15 0
#> 167 Joe 2004-07-16 0
#> 168 Joe 2004-07-17 0
#> 169 Joe 2004-07-18 0
#> 170 Joe 2004-07-19 0
#> 171 Joe 2004-07-20 0
#> 172 Joe 2004-07-21 0
#> 173 Joe 2004-07-22 0
#> 174 Joe 2004-07-23 0
#> 175 Joe 2004-07-24 0
#> 176 Joe 2004-07-25 0
#> 177 Joe 2004-07-26 0
#> 178 Joe 2004-07-27 0
#> 179 Joe 2004-07-28 0
#> 180 Joe 2004-07-29 0
#> 181 Joe 2004-07-30 0
#> 182 Joe 2004-07-31 0
#> 183 Joe 2004-08-01 0
#> 184 Joe 2004-08-02 0
#> 185 Joe 2004-08-03 0
#> 186 Joe 2004-08-04 0
#> 187 Joe 2004-08-05 0
#> 188 Joe 2004-08-06 0
#> 189 Joe 2004-08-07 0
#> 190 Joe 2004-08-08 0
#> 191 Joe 2004-08-09 10
#> 192 Joe 2004-08-10 0
#> 193 Joe 2004-08-11 0
#> 194 Joe 2004-08-12 0
#> 195 Joe 2004-08-13 9
#> 196 Joe 2004-08-14 0
#> 197 Joe 2004-08-15 0
#> 198 Joe 2004-08-16 0
#> 199 Joe 2004-08-17 0
#> 200 Joe 2004-08-18 0
#> 201 Joe 2004-08-19 0
#> 202 Joe 2004-08-20 0
#> 203 Joe 2004-08-21 0
#> 204 Joe 2004-08-22 0
#> 205 Joe 2004-08-23 0
#> 206 Joe 2004-08-24 0
#> 207 Joe 2004-08-25 0
#> 208 Joe 2004-08-26 0
#> 209 Joe 2004-08-27 0
#> 210 Joe 2004-08-28 0
#> 211 Joe 2004-08-29 0
#> 212 Joe 2004-08-30 0
#> 213 Joe 2004-08-31 0
#> 214 Joe 2004-09-01 0
#> 215 Joe 2004-09-02 0
#> 216 Joe 2004-09-03 0
#> 217 Joe 2004-09-04 0
#> 218 Joe 2004-09-05 0
#> 219 Joe 2004-09-06 0
#> 220 Joe 2004-09-07 0
#> 221 Joe 2004-09-08 0
#> 222 Joe 2004-09-09 0
#> 223 Joe 2004-09-10 0
#> 224 Joe 2004-09-11 0
#> 225 Joe 2004-09-12 0
#> 226 Joe 2004-09-13 0
#> 227 Joe 2004-09-14 0
#> 228 Joe 2004-09-15 0
#> 229 Joe 2004-09-16 0
#> 230 Joe 2004-09-17 0
#> 231 Joe 2004-09-18 0
#> 232 Joe 2004-09-19 0
#> 233 Joe 2004-09-20 0
#> 234 Joe 2004-09-21 0
#> 235 Joe 2004-09-22 0
#> 236 Joe 2004-09-23 0
#> 237 Joe 2004-09-24 0
#> 238 Joe 2004-09-25 0
#> 239 Joe 2004-09-26 0
#> 240 Joe 2004-09-27 0
#> 241 Joe 2004-09-28 0
#> 242 Joe 2004-09-29 0
#> 243 Joe 2004-09-30 0
#> 244 Joe 2004-10-01 0
#> 245 Joe 2004-10-02 0
#> 246 Joe 2004-10-03 0
#> 247 Joe 2004-10-04 0
#> 248 Joe 2004-10-05 0
#> 249 Joe 2004-10-06 0
#> 250 Joe 2004-10-07 0
#> 251 Joe 2004-10-08 0
#> 252 Joe 2004-10-09 0
#> 253 Joe 2004-10-10 0
#> 254 Joe 2004-10-11 0
#> 255 Joe 2004-10-12 0
#> 256 Joe 2004-10-13 0
#> 257 Joe 2004-10-14 0
#> 258 Joe 2004-10-15 0
#> 259 Joe 2004-10-16 0
#> 260 Joe 2004-10-17 0
#> 261 Joe 2004-10-18 0
#> 262 Joe 2004-10-19 0
#> 263 Joe 2004-10-20 15
#> 264 Joe 2004-10-21 0
#> 265 Joe 2004-10-22 0
#> 266 Joe 2004-10-23 0
#> 267 Joe 2004-10-24 0
#> 268 Joe 2004-10-25 0
#> 269 Joe 2004-10-26 0
#> 270 Joe 2004-10-27 0
#> 271 Joe 2004-10-28 0
#> 272 Joe 2004-10-29 0
#> 273 Joe 2004-10-30 0
#> 274 Joe 2004-10-31 0
#> 275 Joe 2004-11-01 0
#> 276 Joe 2004-11-02 1
#> 277 Joe 2004-11-03 0
#> 278 Joe 2004-11-04 0
#> 279 Joe 2004-11-05 0
#> 280 Joe 2004-11-06 0
#> 281 Joe 2004-11-07 0
#> 282 Joe 2004-11-08 0
#> 283 Joe 2004-11-09 0

The second approach will in general result in many more rows of data. If you want to keep a maximum of 7 rows after any date with non-zero hits, you can do the following:

df.filled2 <- df.filled2 %>%
  group_by(name) %>%
  mutate(test = cumsum(hits > 0)) %>%  # run id: increases at every non-zero hit
  group_by(name, test) %>%
  slice(1:8) %>%                       # the hit itself plus at most 7 following rows
  ungroup() %>%
  select(-test)
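
The cumsum(hits > 0) trick is easier to see on a tiny example. The sketch below uses made-up data (toy is hypothetical): each non-zero hit starts a new run, and slice(1:8) keeps the hit plus at most the 7 rows that follow it.

```r
library(dplyr)

# Hypothetical series: a hit on day 1, nine empty days, a hit on day 11, one empty day
toy <- tibble(
  name = "Joe",
  date = as.Date("2004-01-01") + 0:11,
  hits = c(5, rep(0, 9), 3, 0)
)

trimmed <- toy %>%
  group_by(name) %>%
  mutate(run = cumsum(hits > 0)) %>%  # 1 for days 1-10, 2 for days 11-12
  group_by(name, run) %>%
  slice(1:8) %>%                      # indices beyond the group size are ignored
  ungroup() %>%
  select(-run)

nrow(trimmed)  # 8 rows from the first run + 2 rows from the second
```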

Is there a way to fill in missing dates with 0s using dplyr?

In sparklyr, you must use Spark functions, and this is a job for coalesce. First build every pair of id and date you expect to see, for example:

# A constant column lets full_join() act as a cross join
all_id <- old_data %>% distinct(id) %>% mutate(common = 0)
all_date <- old_data %>% distinct(date) %>% mutate(common = 0)
all_both <- all_id %>% full_join(all_date, by = 'common')
data <- old_data %>%
  right_join(all_both %>% select(-common), by = c('id', 'date')) %>%
  mutate(value = coalesce(value, 0))  # missing pairs get 0

I have assumed you have all the dates and ids you care about in your old data, though that might not be the case.
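
The same idea can be tried on a local data frame before running it against Spark. The sketch below is plain dplyr/tidyr, not sparklyr: old_local is made-up data, and tidyr::expand_grid() stands in for the cross join, which is an assumption that only holds for in-memory data.

```r
library(dplyr)
library(tidyr)

# Hypothetical local data: id 2 has no row for 2017-01-02
old_local <- tibble(
  id    = c(1, 1, 2),
  date  = as.Date(c("2017-01-01", "2017-01-02", "2017-01-01")),
  value = c(10, 20, 30)
)

# All id/date pairs that should exist
all_pairs <- expand_grid(
  id   = unique(old_local$id),
  date = unique(old_local$date)
)

# Keep every expected pair and replace the missing values with 0
filled_local <- all_pairs %>%
  left_join(old_local, by = c("id", "date")) %>%
  mutate(value = coalesce(value, 0))
```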

Insert rows for missing dates/times

I think the easiest thing is to convert the timestamp column to a date-time first, as already described, then convert to zoo and merge with an empty series on a regular grid:

library(zoo)

df$timestamp <- as.POSIXct(df$timestamp, format = "%m/%d/%y %H:%M")

df1.zoo <- zoo(df[, -1], df[, 1]) # set date to Index

df2 <- merge(df1.zoo, zoo(, seq(start(df1.zoo), end(df1.zoo), by = "min")), all = TRUE)

start() and end() are taken from df1.zoo (the original data), and by sets the step of the grid, here "min" as the example requires. all = TRUE keeps every timestamp of the grid, so the timestamps missing from the data appear with NA values.
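
A self-contained sketch of the same merge, on made-up data (the timestamps and values below are assumptions, since the original df is not shown):

```r
library(zoo)

# Hypothetical series with minutes 00:02 and 00:03 missing
stamps <- as.POSIXct(c("2019-01-01 00:00:00",
                       "2019-01-01 00:01:00",
                       "2019-01-01 00:04:00"), tz = "UTC")
z <- zoo(c(1.5, 2.5, 3.5), stamps)

# Zero-width zoo series that only carries the regular one-minute index
grid <- zoo(, seq(start(z), end(z), by = "min"))

# Outer merge: every grid timestamp is kept, missing ones become NA
z_filled <- merge(z, grid, all = TRUE)

length(index(z_filled))          # number of timestamps after the merge
sum(is.na(coredata(z_filled)))   # number of inserted gaps
```

From here, zoo helpers such as na.locf() or na.approx() can replace the NA values if carrying forward or interpolating is appropriate for the data.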


