How to Fill The Gaps

How to fill the gaps?

You have two issues to resolve. The first is how to fill the gaps. The second is how to populate the Count field for the missing records.

Issue 1: This can be resolved either by using a Dates Lookup table or by creating a recursive common table expression (CTE). I would recommend the Dates Lookup table if that is an option. If you cannot create such a table, you will need something like this:

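-- Find the first and last dates present in the table
WITH CTE AS (
  SELECT MAX(dt) maxdate, MIN(dt) mindate
  FROM yourtable
  ),
-- Start at mindate and add one day per recursion step until maxdate is reached
RecursiveCTE AS (
  SELECT mindate dtfield
  FROM CTE
  UNION ALL
  SELECT DATEADD(day, 1, dtfield)
  FROM RecursiveCTE R
    JOIN CTE T
      ON R.dtfield < T.maxdate
  )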

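That should give you a list of dates starting with the MIN date in your table and ending with the MAX. If this is SQL Server (as DATEADD and TOP suggest), note that a recursive CTE allows at most 100 recursion levels by default and raises an error beyond that, so for a longer date range you would add OPTION (MAXRECURSION 0) to the final query.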

Issue 2: Here is where a correlated subquery would come in handy (as much as I generally stay away from them) to get the last cnt from your original table:

SELECT r.dtfield, 
   (SELECT TOP 1 cnt
    FROM yourtable 
    WHERE dt <= r.dtfield 
    ORDER BY dt DESC) cnt
FROM RecursiveCTE r
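
For readers who want to see the same two steps outside SQL, here is a minimal Python sketch. It is not the query above, only an illustration of the logic: build every date from the minimum to the maximum, then carry the last known count forward. The function name `fill_gaps` and the sample rows are hypothetical.

```python
from datetime import date, timedelta


def fill_gaps(rows):
    """Return one (date, cnt) pair per day between the earliest and latest date.

    rows: list of (date, cnt) tuples, possibly with missing days.
    A missing day reuses the cnt of the most recent earlier date.
    """
    known = dict(rows)
    current, last = min(known), max(known)
    filled = []
    last_cnt = None
    while current <= last:
        # Use this day's own value if present, otherwise keep the previous one
        last_cnt = known.get(current, last_cnt)
        filled.append((current, last_cnt))
        current += timedelta(days=1)
    return filled


# Hypothetical sample data with a two-day gap (Jan 3 and Jan 4 are missing)
rows = [(date(2013, 1, 1), 5), (date(2013, 1, 2), 7), (date(2013, 1, 5), 3)]
for day, cnt in fill_gaps(rows):
    print(day, cnt)
```

The loop plays the role of the recursive CTE, and `known.get(current, last_cnt)` plays the role of the correlated subquery that picks the latest cnt on or before each date.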

How to fill gaps associating same value to specific factor? R

A sample of your data would be of help. My best guess is that you can get what you want using the function fill from tidyr (also using functions in dplyr in the example):

library(tidyr)
library(dplyr)

df <- tibble(Sample_n=rep(1:3, each = 3), Station = rep(letters[1:3], each = 3),
             lat = c(NA, 50, NA, 40, NA, NA, NA, 55, NA),
             lon = c(NA, 150, NA, 140, NA, NA, NA, 155, NA))

df

# A tibble: 9 x 4
  Sample_n Station   lat   lon
     <int> <chr>   <dbl> <dbl>
1        1 a          NA    NA
2        1 a          50   150
3        1 a          NA    NA
4        2 b          40   140
5        2 b          NA    NA
6        2 b          NA    NA
7        3 c          NA    NA
8        3 c          55   155
9        3 c          NA    NA

df %>% group_by(Sample_n, Station) %>% 
  fill(lat, lon, .direction="updown")

# A tibble: 9 x 4
# Groups:   Sample_n, Station [3]
  Sample_n Station   lat   lon
     <int> <chr>   <dbl> <dbl>
1        1 a          50   150
2        1 a          50   150
3        1 a          50   150
4        2 b          40   140
5        2 b          40   140
6        2 b          40   140
7        3 c          55   155
8        3 c          55   155
9        3 c          55   155
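
If you work in Python rather than R, a rough pandas counterpart might look like the sketch below. It assumes the same toy data as above and mimics `.direction="updown"` by filling upward first (`bfill`) and then downward (`ffill`) within each group. This is an illustration, not part of the tidyr answer.

```python
import numpy as np
import pandas as pd

# Same toy data as the tibble above
df = pd.DataFrame({
    "Sample_n": np.repeat([1, 2, 3], 3),
    "Station": np.repeat(list("abc"), 3),
    "lat": [np.nan, 50, np.nan, 40, np.nan, np.nan, np.nan, 55, np.nan],
    "lon": [np.nan, 150, np.nan, 140, np.nan, np.nan, np.nan, 155, np.nan],
})

# Within each (Sample_n, Station) group, fill upward first, then downward
filled = df.copy()
filled[["lat", "lon"]] = (
    df.groupby(["Sample_n", "Station"])[["lat", "lon"]]
      .transform(lambda col: col.bfill().ffill())
)
print(filled)
```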

How to fill data gaps only when extremities have the same value, and limited to a maximum of occurrences?

We can use boolean masking and cumsum to identify the blocks of NaN values that start and end with the same value, then group the column on these blocks and forward fill with a limit of 4:

s = df['col']
m = s.notna()
s.mask(s[m] != s[m].shift(-1)).groupby(m.cumsum()).ffill(limit=4).fillna(s)

0     NaN
1     NaN
2     NaN
3     5.0
4     5.0
5     5.0
6     5.0
7     5.0
8     6.0
9     NaN
10    NaN
11    NaN
12    NaN
13    NaN
14    NaN
15    5.0
16    5.0
17    NaN
18    NaN
19    6.0
20    6.0
21    NaN
22    NaN
23    NaN
24    NaN
25    5.0
26    5.0
27    5.0
28    5.0
29    5.0
30    NaN
31    NaN
32    NaN
33    5.0
34    NaN
35    NaN
Name: col, dtype: float64
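
To make the one-liner easier to follow, here is the same idea spelled out step by step on a small, hypothetical series. The intermediate names (`known`, `mismatch`, `anchors`) are mine, and the explicit `reindex` is only there to make the index alignment visible; treat it as a sketch of the technique rather than a replacement for the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first gap sits between 5 and 5, the second between 5 and 6
s = pd.Series([5, np.nan, np.nan, 5, np.nan, 6, np.nan, 6, np.nan], name="col")

m = s.notna()                        # True where a value is present
known = s[m]                         # only the non-NaN values
mismatch = known != known.shift(-1)  # True where the next known value differs

# Hide values whose gap must not be filled, so they cannot be carried forward
anchors = s.mask(mismatch.reindex(s.index, fill_value=False))

# Each known value starts a new block; forward fill inside a block, at most 4 steps
blocks = m.cumsum()
result = anchors.groupby(blocks).ffill(limit=4).fillna(s)
print(result.tolist())
```

Here the NaNs at positions 1, 2 and 6 are filled because the values on both sides match, while positions 4 and 8 stay NaN. The final `fillna(s)` restores the values that were hidden in the masking step.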

Google BigQuery SQL: How to fill in gaps in a table with dates?

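Consider the query below. It generates a row for every date between each customer's first and last date, then fills each row with the most recent non-null subscription on or before that date: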

with temp as (
  select customer, dates from (    
    select customer, min(dates) min_date, max(dates) max_date
    from `project.dataset.table`
    group by customer
  ), unnest(generate_date_array(min_date, max_date)) dates
)
select customer, dates, 
  first_value(subscription ignore nulls) over win as subscription
from temp a
left join `project.dataset.table` b
using(customer, dates)
window win as (partition by customer order by dates desc rows between current row and unbounded following)
# order by dates, customer
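
As a cross-check of what the query does, the same logic can be sketched in pandas. The column names follow the query; the sample rows are hypothetical, and this is an illustration rather than BigQuery code.

```python
import pandas as pd

# Hypothetical sample: customer "a" has no rows for Jan 2 and Jan 3
df = pd.DataFrame({
    "customer": ["a", "a", "b", "b"],
    "dates": pd.to_datetime(["2021-01-01", "2021-01-04", "2021-01-02", "2021-01-03"]),
    "subscription": ["basic", "pro", "pro", "basic"],
})

pieces = []
for customer, group in df.groupby("customer"):
    # Every date between this customer's first and last date
    full_range = pd.date_range(group["dates"].min(), group["dates"].max(), freq="D")
    piece = (
        group.set_index("dates")[["subscription"]]
             .reindex(full_range)   # adds the missing dates with NaN
             .ffill()               # carries the last known value forward
             .rename_axis("dates")
             .reset_index()
    )
    piece.insert(0, "customer", customer)
    pieces.append(piece)

filled = pd.concat(pieces, ignore_index=True)
print(filled)
```

The per-customer `date_range` corresponds to `generate_date_array(min_date, max_date)`, and `ffill` corresponds to the `first_value(... ignore nulls)` window.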

Applied to the sample data in your question, the output is:

[Sample image: query output]

Is there a data.table way of filling in gaps of time periods?

Just data.table:

If no key is set, then

data2 <- data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)]
data2
#          Date  Card           A
#        <char> <num>       <num>
# 1: 2020-01-01     1  1.37095845
# 2: 2020-01-01     2 -0.56469817
# 3: 2020-01-01     3  0.36312841
# 4: 2020-02-01     1  0.63286260
# 5: 2020-02-01     2          NA
# 6: 2020-02-01     3  0.40426832
# 7: 2020-03-01     1 -0.10612452
# 8: 2020-03-01     2  1.51152200
# 9: 2020-03-01     3 -0.09465904

(updated/simplified, thanks to @sindri_baldur!)

If a key is set, then you can use @Frank's method:

data2 <- data[ do.call(CJ, c(mget(key(data)), unique = TRUE)), ]

And from here, you can use nafill as desired, perhaps

data2[, A := nafill(A, type = "locf"), by = .(Card)]
#          Date  Card           A
#        <char> <num>       <num>
# 1: 2020-01-01     1  1.37095845
# 2: 2020-01-01     2 -0.56469817
# 3: 2020-01-01     3  0.36312841
# 4: 2020-02-01     1  0.63286260
# 5: 2020-02-01     2 -0.56469817
# 6: 2020-02-01     3  0.40426832
# 7: 2020-03-01     1 -0.10612452
# 8: 2020-03-01     2  1.51152200
# 9: 2020-03-01     3 -0.09465904

(How to fill is based on your knowledge of the context of the data; it might just as easily be by=.(Date), or some form of imputation.)
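
For comparison, a pandas sketch of the same expand-then-fill pattern is shown below. The data values are hypothetical. `MultiIndex.from_product` stands in for `CJ(..., unique = TRUE)` and a grouped `ffill` stands in for `nafill(type = "locf")`. It is only an analogy, not data.table code.

```python
import pandas as pd

# Hypothetical data: Card 2 has no row for 2020-02-01
data = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-01",
             "2020-02-01", "2020-02-01",
             "2020-03-01", "2020-03-01", "2020-03-01"],
    "Card": [1, 2, 3, 1, 3, 1, 2, 3],
    "A": [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0],
})

# All unique (Date, Card) combinations
full_index = pd.MultiIndex.from_product(
    [data["Date"].unique(), data["Card"].unique()], names=["Date", "Card"]
)

# Reindexing onto the full grid adds the missing rows with NaN in A
data2 = data.set_index(["Date", "Card"]).reindex(full_index).reset_index()

# Last observation carried forward within each Card
data2["A"] = data2.groupby("Card")["A"].ffill()
print(data2)
```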

Update: the above does an expansion of possible combinations, which might fill outside of a particular Card's span, in which case one might see:

data <- data[-1,]
data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)]
#          Date  Card           A
#        <char> <num>       <num>
# 1: 2020-01-01     1          NA
# 2: 2020-01-01     2 -0.42225588
# 3: 2020-01-01     3 -0.12235017
# 4: 2020-02-01     1  0.18819303
# 5: 2020-02-01     2          NA
# 6: 2020-02-01     3  0.11916096
# 7: 2020-03-01     1 -0.02509255
# 8: 2020-03-01     2  0.10807273
# 9: 2020-03-01     3 -0.48543524

I think there are two approaches to this:

Doing the above code and then removing leading (and trailing) NAs per group:

data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)
  ][, .SD[ !is.na(A) | !seq_len(.N) %in% c(1, .N),], by = Card]
#     Card       Date           A
#    <num>     <char>       <num>
# 1:     1 2020-02-01  0.18819303
# 2:     1 2020-03-01 -0.02509255
# 3:     2 2020-01-01 -0.42225588
# 4:     2 2020-02-01          NA
# 5:     2 2020-03-01  0.10807273
# 6:     3 2020-01-01 -0.12235017
# 7:     3 2020-02-01  0.11916096
# 8:     3 2020-03-01 -0.48543524

Completely different approach (assuming Date-class, not strictly required above):

data[,Date := as.Date(Date)]
data[data[, .(Date = do.call(seq, c(as.list(range(Date)), by = "month"))), 
          by = .(Card)],
     on = .(Date, Card)]
#          Date  Card           A
#        <Date> <num>       <num>
# 1: 2020-01-01     2 -0.42225588
# 2: 2020-02-01     2          NA
# 3: 2020-03-01     2  0.10807273
# 4: 2020-01-01     3 -0.12235017
# 5: 2020-02-01     3  0.11916096
# 6: 2020-03-01     3 -0.48543524
# 7: 2020-02-01     1  0.18819303
# 8: 2020-03-01     1 -0.02509255

Google sheets: How to add a helper row to fill gaps in a list of customer ID row

use:

=ARRAYFORMULA(IF(B2:B="",, VLOOKUP(ROW(A2:A), IF(A2:A<>"", {ROW(A2:A), A2:A}), 2, 1)))

update (for locales that use semicolons as argument separators):

=ARRAYFORMULA(IF(B2:B="";; VLOOKUP(ROW(A2:A); IF(A2:A<>""; {ROW(A2:A)\ A2:A}); 2; 1)))

or:

=SCAN(; A2:INDEX(A:A; MAX(ROW(B:B)*(B:B<>""))); LAMBDA(a; b; IF(b=""; a; b)))

see: stackoverflow.com/questions/73767719
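
The fill-down rule inside the SCAN formula can be illustrated with a short Python sketch. The column values are hypothetical, and `itertools.accumulate` is used here only as an analogy for SCAN.

```python
from itertools import accumulate

# Hypothetical column A: an ID appears once, followed by blank cells
column_a = ["C1", "", "", "C2", "", "C3", ""]

# Keep the running value `a` when the cell `b` is blank, otherwise switch to `b`,
# the same rule as LAMBDA(a; b; IF(b=""; a; b))
filled = list(accumulate(column_a, lambda a, b: a if b == "" else b))
print(filled)  # ['C1', 'C1', 'C1', 'C2', 'C2', 'C3', 'C3']
```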
