Fill in the Date Gaps with Date Table

Fill In The Date Gaps With Date Table

The problem is that you need all customers for all dates. When you do the left outer join, you are getting NULL for the customer field.

The following sets up a driver table by cross joining the customer names and dates:

SELECT driver.customer, driver.fulldate, o.amount 
FROM (select d.fulldate, customer
from datetable d cross join
(select customer
from orders
where year(orderdate) in (2012)
) o
where d.calendaryear IN ( 2012 )
) driver LEFT OUTER JOIN
orders o
ON driver.fulldate = o.orderdate and
driver.customer = o.customer;

Note that this version assumes that calendaryear is the same as year(orderdate).

Google BigQuery SQL: How to fill in gaps in a table with dates?

Consider below

with temp as (
select customer, dates from (
select customer, min(dates) min_date, max(dates) max_date
from `project.dataset.table`
group by customer
), unnest(generate_date_array(min_date, max_date)) dates
)
select customer, dates,
first_value(subscription ignore nulls) over win as subscription
from temp a
left join `project.dataset.table` b
using(customer, dates)
window win as (partition by customer order by dates desc rows between current row and unbounded following)
# order by dates, customer

If to apply to sample data in y our question - output is

Sample Image

How to fill in missing dates

Here is a query that would work. Start by cross joining all combinations of dates and users (add filters as needed), then left join the users table and calculate quota using the last_value() function (note that if you are using Snowflake, you must specify "rows between unbounded preceding and current row" as documented here):

with all_dates_users as (
--all combinations of dates and users
select date, user
from dates
cross join (select distinct user_email as user from users)
),
joined as (
--left join users table to the previous
select DU.date, DU.user, U.sent_at, U.user_email, U.score, U.quota
from all_dates_users DU
left join users U on U.sent_at = DU.date and U.user_email = DU.user
)
--calculate quota as previous quota using last_value() function
select date, user, nvl(score, 0) as score, last_value(quota) ignore nulls over (partition by user order by date desc rows between unbounded preceding and current row) as quota
from joined
order by date desc;

Fill missing gaps in data using a date column

Because SQL Server does not support IGNORE NULLS in LAG() this is a bit tricky. I would go for a recursive subquery of the form:

with cte as (
select price, date, dateadd(day, -1, lead(date) over (order by date)) as last_date
from t
union all
select price, dateadd(day, 1, date), last_date
from cte
where date < last_date
)
select price, date
from cte
order by date;

Here is a db<>fiddle.

In SQL Server 2008, you can replace the lead() with:

with cte as (
select price, date,
(select min(date)
from t t2
where t2.date > t.date
) as last_date
from t
union all
select price, dateadd(day, 1, date), last_date
from cte
where date < last_date
)
select price, date
from cte
order by date;

Fill gaps in SQL Server dates ranges

I don't think this is that complicated. If you expand the periods into individual dates and do a left join, then this becomes a gaps-and-islands problem:

with dates as (
select periodid, periodstart as dte, periodend
from periods
union all
select periodid, dateadd(day, 1, dte), periodend
from dates
where dte < periodend
)
select userid, activityid, min(dte), max(dte)
from (select d.dte, d.periodid, u.userid, a.activityid,
row_number() over (partition by u.userid, a.activityid order by d.dte) as seqnum
from dates d cross join
(select distinct userid from activities) u left join
activities a
on a.userid = u.userid and
a.activitystart <= d.dte and a.activityend >= d.dte
) da
group by userid, activityid, periodid, dateadd(day, -seqnum, dte)
order by userid, min(dte);

Here is a db<>fiddle.

Note: This produces results for all users and all periods -- which seems reasonable given your description. It is pretty simple to modify to filter out users with no activity during a given period.

Also, this does not go to the end of the month. Instead, it includes the complete periods. I don't see why months would play into this -- except to confuse matters -- consider if two periods have days in the same month, for instance.

View to fill date gaps but only if no records for a given date

You can use a table-valued function, which will allow you to use parameters, so there is no need to hard-code them.

The table returned includes an extra column Ins; it is 0 for the original data and 1 for the inserted.

Rextester demo

If you prefer a view you can easily extract the logic.

create function myFun(@start date, @end date) 
returns @result table (MyName varchar(10), MyDate date, Ins bit) as
begin
;
with dates as (
SELECT DATEADD(DAY, nbr - 1, @start) dt
FROM ( SELECT ROW_NUMBER() OVER ( ORDER BY c.object_id ) AS Nbr
FROM sys.columns c
) nbrs
WHERE nbr - 1 <= DATEDIFF(DAY, @start, @end)
), last_avail as (
select dates.dt, max(t.MyDate) prev_dt
from dates join t on dates.dt >= t.MyDate
group by dates.dt
), empty as (
select * from last_avail where dt <> prev_dt
)
insert @result (MyName, MyDate, Ins)
select MyName, MyDate, 0 from t
union all
select t.MyName, x.dt, 1
from empty x join t on x.prev_dt = t.MyDate;

return;
end

Usage example:

declare @start date, @end date;

set @start = '2017-04-08';
set @end = '2017-04-11';
select * from dbo.myFun(@start, @end) order by MyDate desc, MyName asc;

Attribution

The Common Table Expression which plays the role of a calendar table has been found here.

Fill all gaps between starting and ending dates with dplyr

library(dplyr)
user_df %>%
arrange(start_date) %>%
group_by(person_id) %>%
mutate(nextstart = lead(start_date)) %>%
filter(end_date < nextstart) %>%
mutate(start_date = end_date, end_date = nextstart, unemployed = 1L) %>%
select(-nextstart) %>%
bind_rows(mutate(user_df, unemployed = 0L)) %>%
arrange(person_id, start_date) %>%
ungroup()
# # A tibble: 14 x 6
# person_id job_number job_type start_date end_date unemployed
# <int> <int> <chr> <chr> <chr> <int>
# 1 1 1 B 2012-11-01 2014-01-01 0
# 2 1 1 B 2014-01-01 2016-02-01 1
# 3 1 2 A 2016-02-01 2016-10-01 0
# 4 1 2 A 2016-10-01 2016-12-01 1
# 5 1 3 A 2016-12-01 2020-01-01 0
# 6 1 4 B 2020-01-01 2021-01-01 0
# 7 2 1 A 2011-03-01 2012-08-01 0
# 8 2 1 A 2012-08-01 2013-01-01 1
# 9 2 2 B 2013-01-01 2020-01-01 0
# 10 2 3 A 2020-01-01 2021-01-01 0
# 11 2 4 B 2021-01-01 2021-01-17 0
# 12 3 1 A 2005-03-01 2011-03-01 0
# 13 3 1 A 2011-03-01 2012-01-01 1
# 14 3 2 B 2012-01-01 2014-01-01 0

Technically, this is comparing by the alphabetic sort of dates; in this case, its effect is the same (the format is good for that) though it'll be slightly less efficient (integer/numeric sorting is faster than alphabetic sorting).

This works by first creating and then capture just the unemployed periods of time,

user_df %>%
arrange(start_date) %>%
group_by(person_id) %>%
mutate(nextstart = lead(start_date)) %>%
filter(end_date < nextstart)
# # A tibble: 4 x 6
# # Groups: person_id [3]
# person_id job_number job_type start_date end_date nextstart
# <int> <int> <chr> <chr> <chr> <chr>
# 1 3 1 A 2005-03-01 2011-03-01 2012-01-01
# 2 2 1 A 2011-03-01 2012-08-01 2013-01-01
# 3 1 1 B 2012-11-01 2014-01-01 2016-02-01
# 4 1 2 A 2016-02-01 2016-10-01 2016-12-01

then shifting the variables, then adding unemployed, and then finally returning it to the original dataset. In this case, I added unemployed to the original mid-bind_rows; where to do this is mostly preference.



Related Topics



Leave a reply



Submit