Efficient way to Fill Time-Series per group
It appears that data.table is much faster than the tidyverse option here. Merely translating the above into data.table (compliments of @Frank) completed the operation in a little under 3 minutes.
library(data.table)
mDT = setDT(d1)[, .(grp = seq(min(grp), max(grp), by = "hour")), by = source]
new_D <- d1[mDT, on = names(mDT)]
new_D <- new_D[, cnt := replace(cnt, is.na(cnt), 0)] #If needed
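For anyone doing the same in pandas, the per-group expand-and-join can be sketched as follows. The toy data is assumed, not from the question; the column names source/grp/cnt mirror the snippet above.

```python
import pandas as pd

# Toy data mirroring d1: hourly timestamps with gaps per source.
d1 = pd.DataFrame({
    "source": ["a", "a", "b"],
    "grp": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 03:00",
                           "2020-01-01 01:00"]),
    "cnt": [1, 2, 5],
})

# For each source, reindex onto the full hourly range between its min
# and max timestamp -- the pandas analogue of the data.table cross join.
def fill_hours(g):
    full = pd.date_range(g["grp"].min(), g["grp"].max(), freq="h")
    return g.set_index("grp").reindex(full).rename_axis("grp")

out = (d1.groupby("source", group_keys=True)[["grp", "cnt"]]
         .apply(fill_hours)
         .reset_index())
out["cnt"] = out["cnt"].fillna(0)
```

The final fillna(0) mirrors the replace(cnt, is.na(cnt), 0) step; drop it if the NAs should be kept.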
Fill in missing rows for dates by group
We can do a join. Create the sequence of 'time' from min to max by '1 month', expand the dataset grouped by 'customer_id' and 'account_id', and join on those columns together with 'time'.
ts1 <- seq(min(my_data$time), max(my_data$time), by = "1 month")
my_data[my_data[, .(time = ts1), .(customer_id, account_id)],
on = .(customer_id, account_id, time)]
# customer_id account_id time tenor variable_x
# 1: 1 11 2017-01-01 1 87
# 2: 1 11 2017-02-01 NA NA
# 3: 1 11 2017-03-01 NA NA
# 4: 1 11 2017-04-01 NA NA
# 5: 1 11 2017-05-01 2 90
# 6: 1 11 2017-06-01 3 100
# 7: 2 55 2017-01-01 NA NA
# 8: 2 55 2017-02-01 1 120
# 9: 2 55 2017-03-01 NA NA
#10: 2 55 2017-04-01 2 130
#11: 2 55 2017-05-01 3 150
#12: 2 55 2017-06-01 4 12
#13: 3 38 2017-01-01 1 13
#14: 3 38 2017-02-01 NA NA
#15: 3 38 2017-03-01 NA NA
#16: 3 38 2017-04-01 2 15
#17: 3 38 2017-05-01 3 14
#18: 3 38 2017-06-01 NA NA
Or using tidyverse
library(tidyverse)
distinct(my_data, customer_id, account_id) %>%
mutate(time = list(ts1)) %>%
unnest(time) %>%
left_join(my_data)
Or with complete from tidyr
my_data %>%
complete(nesting(customer_id, account_id), time = ts1)
Filling missing dates in a grouped time series - a tidyverse-way?
tidyr has some great tools for these sorts of problems. Take a look at complete.
library(dplyr)
library(tidyr)
library(lubridate)
want <- df.missing %>%
ungroup() %>%
complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"))
want %>% filter(d1 == "A" & d2 == 5)
#> # A tibble: 10 x 5
#> d1 d2 date v1 v2
#> <fctr> <dbl> <date> <dbl> <dbl>
#> 1 A 5 2017-01-01 NA NA
#> 2 A 5 2017-01-02 0.21879954 0.1335497
#> 3 A 5 2017-01-03 0.32977018 0.9802127
#> 4 A 5 2017-01-04 0.23902573 0.1206089
#> 5 A 5 2017-01-05 0.19617465 0.7378315
#> 6 A 5 2017-01-06 0.13373890 0.9493668
#> 7 A 5 2017-01-07 0.48613541 0.3392834
#> 8 A 5 2017-01-08 0.35698708 0.3696965
#> 9 A 5 2017-01-09 0.08498474 0.8354756
#> 10 A 5 2017-01-10 NA NA
Fill gaps in time series pandas dataframe in specific time intervall
In case anyone is interested, I think I found a solution:
# Group the time series by date (using the FillDate column) and then apply
# the reindex_by_date function to generate the index for each date in the
# given time frame and fill missing time stamps
df_process = df_process.groupby('FillDate').apply(reindex_by_date, intervall=update_interval)
# drop the helper index "FillDate"
df_process = df_process.reset_index('FillDate', drop=True).drop(columns='FillDate')
# since we reindexed each date separately, a value can be missing at the
# border of the index (e.g. the first or last entry) and end up as NaN;
# we fix this here by forward filling (taking the last value from the previous day)
df_process_out = df_process.ffill()
# Helper function for closing data gaps with pandas groupby and resample
def reindex_by_date(df, intervall):
    start_range = df.index.date.min().strftime('%Y-%m-%d') + " 09:30:00"
    end_range = df.index.date.max().strftime('%Y-%m-%d') + " 16:00:00"
    dates = pd.date_range(start_range, end_range, freq=intervall)
    return df.reindex(dates).ffill()
Comments are very welcome; if someone has a more efficient solution, I would be very interested. Thanks.
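In case a self-contained illustration helps, here is a minimal runnable sketch of the same reindex-per-trading-day approach. The toy prices, column name, and 30-minute interval are assumptions, not taken from the question.

```python
import pandas as pd

# Toy intraday data: two trading days, each with sparse observations.
idx = pd.to_datetime(["2021-03-01 09:30", "2021-03-01 10:30",
                      "2021-03-02 09:30", "2021-03-02 11:30"])
df = pd.DataFrame({"price": [10.0, 11.0, 12.0, 13.0]}, index=idx)
df["FillDate"] = df.index.date

def reindex_by_date(day_df, interval):
    # Build the full 09:30-16:00 grid for this calendar day and
    # forward-fill observations onto it.
    day = day_df.index.date.min().strftime("%Y-%m-%d")
    grid = pd.date_range(day + " 09:30:00", day + " 16:00:00", freq=interval)
    return day_df.reindex(grid).ffill()

out = (df.groupby("FillDate")[["price"]]
         .apply(reindex_by_date, interval="30min")
         .reset_index("FillDate", drop=True))
```

Each day expands to 14 half-hour rows (09:30 through 16:00), with gaps forward-filled within the day.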
Filling missing dates by group
You can use data.table from R. Assuming that the 'time' column is of 'Date' class,
library(data.table) # v1.9.5+
DT1 <- setDT(df1)[, list(time=seq(min(time), max(time), by ='day')),
by =.(shop_id, item_id)]
setkeyv(df1, names(df1)[1:3])[DT1][is.na(value), value:=0]
# shop_id item_id time value
#1: 150 1 2015-07-10 3
#2: 150 1 2015-07-11 5
#3: 150 1 2015-07-12 0
#4: 150 1 2015-07-13 2
#5: 150 2 2015-07-10 15
#6: 150 2 2015-07-11 0
#7: 150 2 2015-07-12 12
In the devel version, you can also do this without setting the 'key' (see the data.table GitHub page for instructions on installing the devel version).
df1[DT1, on =c('shop_id', 'item_id', 'time')][is.na(value), value:=0]
# shop_id item_id time value
#1: 150 1 2015-07-10 3
#2: 150 1 2015-07-11 5
#3: 150 1 2015-07-12 0
#4: 150 1 2015-07-13 2
#5: 150 2 2015-07-10 15
#6: 150 2 2015-07-11 0
#7: 150 2 2015-07-12 12
Or as @Arun suggested, a more efficient option would be
DT1[, value := 0L][df1, value := i.value, on = c('shop_id', 'item_id', 'time')]
DT1
Efficiently fill NAs by group
This is the code I have used: your code vs. akrun's vs. mine. zoo is sometimes not the fastest, but it is the cleanest. Anyway, you can test it.
UPDATE:
Tested with more data (100,000 rows), Process 03 (subset and merge) wins by far.
Last UPDATE:
Function comparison with rbenchmark:
library(dplyr)
library(tidyr)
library(base)
library(data.table)
library(zoo)
library(rbenchmark)
# data.frame of 10,000 individuals with 10 observations each
data <- data.frame(group = rep(1:10000, each = 10), value = NA)
data$value[seq(5, 5000, 10)] <- rnorm(50) # the first 500 individuals get a value at the fifth observation (rnorm(50), recycled); the others don't
#Process01
P01 <- function (data){
data01 <- data %>%
group_by(group) %>% #by group
fill(value) %>% #default direction down
fill(value, .direction = "up") #also fill NAs upwards
return(data01)
}
#Process02
P02 <- function (data){
data02 <- setDT(data)[, value := na.locf(na.locf(value, na.rm = FALSE),
fromLast = TRUE), group]
return(data02)
}
#Process03
P03 <- function (data){
dataU <- subset(unique(data), !is.na(value)) # keep only groups that have a value
dataM <- merge(data, dataU, by = "group", all=T) #merge tables
data03 <- data.frame(group=dataM$group, value = dataM$value.y) #idem shape of data
return(data03)
}
benchmark("P01_dplyr" = {data01 <- P01(data)},
"P02_zoo" = {data02 <- P02(data)},
"P03_data.table" = {data03 <- P03(data)},
replications = 10,
columns = c("test", "replications", "elapsed")
)
Results with 10,000 groups, 10 replications, on an i5-7400:
test replications elapsed
1 P01_dplyr 10 257.78
2 P02_zoo 10 10.35
3 P03_data.table 10 0.09
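For comparison, the same fill-down-then-up within groups can be sketched in pandas (toy data; the column names are assumptions mirroring the R example):

```python
import pandas as pd
import numpy as np

# Three groups of three observations; only some groups have a value.
df = pd.DataFrame({
    "group": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "value": [np.nan, 7.0, np.nan,
              np.nan, np.nan, np.nan,
              5.0, np.nan, np.nan],
})

# Fill NAs downwards then upwards within each group, like
# fill(value) + fill(value, .direction = "up") or the nested na.locf above.
df["value"] = df.groupby("group")["value"].transform(lambda s: s.ffill().bfill())
```

Groups with no observed value at all (group 2 here) stay NaN, just as in the R versions.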
How can I fill in missing rows for monthly time series data?
You can use tidyr::complete for this, but you have the additional wrinkle that the dates fall on different days in each month. First, then, you need to make a column to count months on, which we can do with the day(x) <- setter from lubridate.
Here's an example using the provided data, truncated to 2014 for conciseness. Note that you should use seq.Date to specify the full range of dates that you want included in the month column, and you will also have NAs in the date column (you can replace them with the first of the month if you want).
library(tidyverse)
library(lubridate)
tbl <- structure(list(date = structure(c(16076, 16129, 16143, 16184, 16276, 16303, 16343, 16374, 16400, 16417), tzone = "UTC", tclass = "Date", class = "Date"), AverageTemp = c(18.3387983091787, 18.41875, 19.7, 20.2508333333333, 17.6387426900585, 18.1770897832817, 17.5400297619048, 17.7547246376812, 17.246412037037, 17.0846153846154)), row.names = c(NA, -10L), class = "data.frame")
tbl %>%
mutate(month = date %>% `day<-`(1)) %>%
complete(month = seq.Date(min(month), max(month), by = "month"))
#> # A tibble: 12 x 3
#> month date AverageTemp
#> <date> <date> <dbl>
#> 1 2014-01-01 2014-01-06 18.3
#> 2 2014-02-01 2014-02-28 18.4
#> 3 2014-03-01 2014-03-14 19.7
#> 4 2014-04-01 2014-04-24 20.3
#> 5 2014-05-01 NA NA
#> 6 2014-06-01 NA NA
#> 7 2014-07-01 2014-07-25 17.6
#> 8 2014-08-01 2014-08-21 18.2
#> 9 2014-09-01 2014-09-30 17.5
#> 10 2014-10-01 2014-10-31 17.8
#> 11 2014-11-01 2014-11-26 17.2
#> 12 2014-12-01 2014-12-13 17.1
As an alternative, you can instead just get the year and month components and use complete
on the combination of the two:
tbl %>%
mutate(year = year(date), month = month(date)) %>%
complete(year = min(year):max(year), month = 1:12)
#> # A tibble: 12 x 4
#> year month date AverageTemp
#> <dbl> <dbl> <date> <dbl>
#> 1 2014 1 2014-01-06 18.3
#> 2 2014 2 2014-02-28 18.4
#> 3 2014 3 2014-03-14 19.7
#> 4 2014 4 2014-04-24 20.3
#> 5 2014 5 NA NA
#> 6 2014 6 NA NA
#> 7 2014 7 2014-07-25 17.6
#> 8 2014 8 2014-08-21 18.2
#> 9 2014 9 2014-09-30 17.5
#> 10 2014 10 2014-10-31 17.8
#> 11 2014 11 2014-11-26 17.2
#> 12 2014 12 2014-12-13 17.1
Created on 2019-03-20 by the reprex package (v0.2.1)
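The same month-completion idea can be sketched in pandas with a monthly PeriodIndex (toy values approximating the table above; names are assumptions):

```python
import pandas as pd

# Observations on arbitrary days within some months; Mar, May, Jun missing.
tbl = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-06", "2014-02-28",
                            "2014-04-24", "2014-07-25"]),
    "AverageTemp": [18.3, 18.4, 20.3, 17.6],
})

# Count on months, then reindex over the full monthly range --
# analogous to complete(month = seq.Date(..., by = "month")).
tbl["month"] = tbl["date"].dt.to_period("M")
full = pd.period_range(tbl["month"].min(), tbl["month"].max(), freq="M")
out = tbl.set_index("month").reindex(full).rename_axis("month").reset_index()
```

Missing months get NaT in the date column and NaN in the value column, matching the complete() output.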
Pandas filling missing dates and values within group
Initial DataFrame:
dt user val
0 2016-01-01 a 1
1 2016-01-02 a 33
2 2016-01-05 b 2
3 2016-01-06 b 1
First, convert the dates to datetime:
x['dt'] = pd.to_datetime(x['dt'])
Then, generate the dates and unique users:
dates = x.set_index('dt').resample('D').asfreq().index
>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06'],
dtype='datetime64[ns]', name='dt', freq='D')
users = x['user'].unique()
>> array(['a', 'b'], dtype=object)
This will allow you to create a MultiIndex:
idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])
>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=['dt', 'user'])
You can use that to reindex your DataFrame:
x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()
Out:
dt user val
0 2016-01-01 a 1
1 2016-01-01 b 0
2 2016-01-02 a 33
3 2016-01-02 b 0
4 2016-01-03 a 0
5 2016-01-03 b 0
6 2016-01-04 a 0
7 2016-01-04 b 0
8 2016-01-05 a 0
9 2016-01-05 b 2
10 2016-01-06 a 0
11 2016-01-06 b 1
which can then be sorted by user:
x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by='user')
Out:
dt user val
0 2016-01-01 a 1
2 2016-01-02 a 33
4 2016-01-03 a 0
6 2016-01-04 a 0
8 2016-01-05 a 0
10 2016-01-06 a 0
1 2016-01-01 b 0
3 2016-01-02 b 0
5 2016-01-03 b 0
7 2016-01-04 b 0
9 2016-01-05 b 2
11 2016-01-06 b 1