Combine Consecutive Date Ranges

Combine consecutive date ranges

The odd-looking date '32121231' in the query is just a very large date that stands in for your "no-end-date" scenario. I have assumed you won't have many date ranges per employee, so I've used a simple recursive common table expression (CTE) to combine the ranges.

To make it run faster, the starting anchor query keeps only those dates that will not link up to a prior range (per employee). The rest is just tree-walking the date ranges and growing the range. The final GROUP BY keeps only the largest date range built up per starting ANCHOR (employmentid, startdate) combination.


MS SQL Server 2008 Schema Setup:

create table Tbl (
employmentid int,
startdate datetime,
enddate datetime);

insert Tbl values
(5, '2007-12-03', '2011-08-26'),
(5, '2013-05-02', null),
(30, '2006-10-02', '2011-01-16'),
(30, '2011-01-17', '2012-08-12'),
(30, '2012-08-13', null),
(66, '2007-09-24', null);

/*
-- expected outcome
EmploymentId StartDate EndDate
5 2007-12-03 2011-08-26
5 2013-05-02 NULL
30 2006-10-02 NULL
66 2007-09-24 NULL
*/

Query 1:

;with cte as (
select a.employmentid, a.startdate, a.enddate
from Tbl a
left join Tbl b on a.employmentid=b.employmentid and a.startdate-1=b.enddate
where b.employmentid is null
union all
select a.employmentid, a.startdate, b.enddate
from cte a
join Tbl b on a.employmentid=b.employmentid and b.startdate-1=a.enddate
)
select employmentid,
startdate,
nullif(max(isnull(enddate,'32121231')),'32121231') enddate
from cte
group by employmentid, startdate
order by employmentid

Results:

| EMPLOYMENTID | STARTDATE                        | ENDDATE                       |
-----------------------------------------------------------------------------------
| 5            | December, 03 2007 00:00:00+0000  | August, 26 2011 00:00:00+0000 |
| 5            | May, 02 2013 00:00:00+0000       | (null)                        |
| 30           | October, 02 2006 00:00:00+0000   | (null)                        |
| 66           | September, 24 2007 00:00:00+0000 | (null)                        |
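
For readers who find the recursion hard to follow, here is a minimal Python sketch of the same anchor-and-extend idea. It is not part of the original answer: combine_ranges and OPEN_END are names made up for illustration, and date.max plays the role of the '32121231' sentinel. It assumes, like the query, that ranges within one employee never overlap.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)
OPEN_END = date.max  # hypothetical stand-in for "no end date"

def combine_ranges(rows):
    """rows: iterable of (employmentid, startdate, enddate or None)."""
    by_emp = {}
    for emp, start, end in rows:
        by_emp.setdefault(emp, []).append((start, OPEN_END if end is None else end))

    result = []
    for emp, ranges in sorted(by_emp.items()):
        ends = {end for _, end in ranges}
        end_by_start = dict(ranges)
        for start, end in sorted(ranges):
            # Anchor test: skip ranges that link up to a prior range.
            if start - ONE_DAY in ends:
                continue
            # Walk forward while another range starts the day after the current end.
            while end != OPEN_END and end + ONE_DAY in end_by_start:
                end = end_by_start[end + ONE_DAY]
            result.append((emp, start, None if end == OPEN_END else end))
    return result

# Same rows as the schema setup above.
rows = [
    (5, date(2007, 12, 3), date(2011, 8, 26)),
    (5, date(2013, 5, 2), None),
    (30, date(2006, 10, 2), date(2011, 1, 16)),
    (30, date(2011, 1, 17), date(2012, 8, 12)),
    (30, date(2012, 8, 13), None),
    (66, date(2007, 9, 24), None),
]
for row in combine_ranges(rows):
    print(row)
```

Because each anchor is extended directly to its final end date, no GROUP BY step is needed here; the SQL needs it only because the recursion emits every intermediate range as well.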

Merge consecutive and overlapping date ranges

  • within each customer, groupby("cust", as_index=False), look for rows whose dates overlap or directly follow the previous row
  • cumsum() sums the resulting booleans to generate a group number for each run of overlapping dates
  • finally, a simple min/max aggregation within each group

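import pandas as pd

# Sort up front so that shift() inside each customer group compares a row
# with the range that actually precedes it.
df.sort_values(["cust", "start_date", "end_date"]).groupby("cust", as_index=False).apply(
    lambda d: d.groupby(
        # A new group starts wherever a row neither overlaps nor adjoins the previous end date.
        ["cust", (~(d["start_date"] <= (d["end_date"].shift() + pd.Timedelta(days=1)))).cumsum()],
        as_index=False
    )
    .agg({"start_date": "min", "end_date": "max"})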
).reset_index(drop=True)

   cust     start_date           end_date
0  CUST123  2021-01-01 00:00:00  2021-01-31 00:00:00
1  CUST123  2021-02-02 00:00:00  2021-02-28 00:00:00
2  CUST456  2021-01-05 00:00:00  2021-01-31 00:00:00
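
One caveat worth sketching: the pandas snippet compares each row only with the end date of the row immediately before it, so a short range nested inside a longer one can split a group. The plain-Python sketch below shows the same flag-and-cumulative-sum idea with a running maximum of end dates instead. merge_overlapping and the sample ranges are made up for illustration and are not from the original answer.

```python
from datetime import date, timedelta
from itertools import accumulate

def merge_overlapping(ranges, gap=timedelta(days=1)):
    """ranges: list of (start, end) dates for ONE customer."""
    ordered = sorted(ranges)
    if not ordered:
        return []
    # Latest end date seen so far, row by row.
    running_end = list(accumulate((end for _, end in ordered), max))
    # True where a row starts a new group: it begins more than `gap`
    # after everything seen before it.
    new_group = [True] + [
        ordered[i][0] > running_end[i - 1] + gap for i in range(1, len(ordered))
    ]
    # Cumulative sum of the flags gives a group number per row.
    group_ids = list(accumulate(int(flag) for flag in new_group))
    merged = {}
    for gid, (start, end) in zip(group_ids, ordered):
        lo, hi = merged.get(gid, (start, end))
        merged[gid] = (min(lo, start), max(hi, end))
    return [merged[gid] for gid in sorted(merged)]

# Hypothetical data: the second range is nested inside the first.
example = [
    (date(2021, 1, 1), date(2021, 1, 31)),
    (date(2021, 1, 5), date(2021, 1, 10)),
    (date(2021, 1, 20), date(2021, 2, 5)),
    (date(2021, 2, 10), date(2021, 2, 12)),
]
print(merge_overlapping(example))
```

In pandas the equivalent change would be to compare against a per-customer cummax() of end_date rather than a plain shift(); whether that matters depends on whether nested ranges can occur in your data.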

Merging records with consecutive dates in SQL

This is built from CTEs, so it all needs to be executed as one statement, but I'll explain as I go.

First I'll set the parameters for the date range we are interested in:

DECLARE @StartDate DateTime; SET @StartDate = '2011-11-01';  
DECLARE @EndDate DateTime; SET @EndDate = '2011-11-30';

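Then I'll turn them into a list of dates using a recursive CTE. (SQL Server stops a recursive CTE after 100 levels by default, so a range much longer than three months needs OPTION (MAXRECURSION n) on the final statement.)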

WITH 
ValidDates ( ValidDate ) AS
(
SELECT @StartDate
UNION ALL
SELECT DateAdd(day, 1, ValidDate)
FROM ValidDates
WHERE ValidDate < @EndDate
),

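By joining that with the ranges in the original records I get a list of individual days of absence.

Using a combination of row_number and datediff I can group consecutive dates: within a run of consecutive days the day offset and the row number both go up by one per row, so their difference stays constant and can serve as a group key. This assumes that there are no duplicates.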

    DaysAbsent AS 
(
SELECT
A.RecordID
, A.EmpID
, A.AbsCode
, DateDiff(Day, @StartDate, D.ValidDate)
- row_number()
over (partition by A.EmpID, A.AbsCode
order by D.ValidDate) AS DayGroup
, D.ValidDate AS AbsentDay
FROM
dbo.Absence A
INNER JOIN
ValidDates D
ON D.ValidDate >= DateFrom
and D.ValidDate <= DateTo
)

Now it's a simple select with min and max to turn it back into ranges.

SELECT 
EmpID
, AbsCode
, MIN(AbsentDay) AS DateFrom
, MAX(AbsentDay) AS DateTo
FROM
DaysAbsent
GROUP BY
EmpID
, AbsCode
, DayGroup

The DayGroup isn't needed in the output but is needed for the grouping; otherwise non-consecutive groups will be collapsed into one.
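
To see why DayGroup works, here is a small Python sketch of the same arithmetic on a list of dates. It is an illustration only: consecutive_runs is a hypothetical helper, and the sample days are made up. It assumes the days belong to one EmpID/AbsCode combination.

```python
from datetime import date
from itertools import groupby

def consecutive_runs(days):
    """Collapse distinct dates into (first, last) runs of consecutive days."""
    ordered = sorted(set(days))  # the SQL assumes no duplicates; set() enforces it here
    if not ordered:
        return []
    origin = ordered[0]
    runs = []
    # (day offset - row number) is constant within a run of consecutive days,
    # which mirrors DateDiff(...) - row_number() in the DaysAbsent CTE.
    numbered = enumerate(ordered, start=1)
    for _, group in groupby(numbered, key=lambda p: (p[1] - origin).days - p[0]):
        members = [day for _, day in group]
        runs.append((members[0], members[-1]))
    return runs

# Hypothetical absence days in November 2011.
absent = [date(2011, 11, d) for d in (1, 2, 3, 7, 8, 15)]
print(consecutive_runs(absent))
```

For these days the offsets are 0, 1, 2, 6, 7, 14 and the row numbers 1 to 6, so the keys come out as -1, -1, -1, 2, 2, 8: three groups, one per run.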

How to merge consecutive dates into one date?

This is a type of gaps-and-islands problem. You can use lag() and a cumulative sum to identify the groupings. Then aggregate:

select col1, col2, min(col3), max(col4)
from (select t.*,
sum(case when prev_col4 = col3 then 0 else 1 end) over
(partition by col1, col2
order by col3
rows between unbounded preceding and current row
) as grp
from (select t.*,
lag(col4) over (partition by col1, col2 order by col3) as prev_col4
from t
) t
) t
group by col1, col2, grp;
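
The two window functions can be mimicked step by step in Python, which may make the grp column easier to picture. This is only a sketch for a single (col1, col2) partition; assign_groups and merge are hypothetical names, and the sample rows are made up.

```python
from datetime import date
from itertools import accumulate

def assign_groups(rows):
    """rows: list of (col3, col4) pairs, i.e. range start and end, for one partition."""
    ordered = sorted(rows)
    # lag(col4): the previous row's end, None for the first row.
    prev_ends = [None] + [end for _, end in ordered[:-1]]
    # 0 when the row continues the previous range, 1 when it starts a new one.
    flags = [0 if prev == start else 1 for (start, _), prev in zip(ordered, prev_ends)]
    # Running sum of the flags, like sum(...) over (... rows unbounded preceding).
    grp = list(accumulate(flags))
    return list(zip(grp, ordered))

def merge(rows):
    merged = {}
    for g, (start, end) in assign_groups(rows):
        lo, hi = merged.get(g, (start, end))
        merged[g] = (min(lo, start), max(hi, end))
    return [merged[g] for g in sorted(merged)]

# Hypothetical rows: the second starts exactly where the first ends.
sample = [
    (date(2020, 1, 1), date(2020, 1, 5)),
    (date(2020, 1, 5), date(2020, 1, 9)),
    (date(2020, 1, 12), date(2020, 1, 15)),
]
print(assign_groups(sample))
print(merge(sample))
```

Note that the test is prev_col4 = col3, as in the query: a range counts as a continuation only when it starts on the very date the previous one ends. If your ranges instead start the day after, compare against the previous end plus one day.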

Joining together consecutive date validity intervals

This is a gaps-and-islands problem. There are various ways to approach it; this uses lead and lag analytic functions:

select distinct product,
case when start_date is null then lag(start_date)
over (partition by product order by rn) else start_date end as start_date,
case when end_date is null then lead(end_date)
over (partition by product order by rn) else end_date end as end_date
from (
select product, start_date, end_date, rn
from (
select t.product,
case when lag(end_date)
over (partition by product order by start_date) is null
or lag(end_date)
over (partition by product order by start_date) != start_date - 1
then start_date end as start_date,
case when lead(start_date)
over (partition by product order by start_date) is null
or lead(start_date)
over (partition by product order by start_date) != end_date + 1
then end_date end as end_date,
row_number() over (partition by product order by start_date) as rn
from t
)
where start_date is not null or end_date is not null
)
order by start_date, product;

PRODUCT START_DATE END_DATE
------- ---------- ---------
A 01-JUL-13 30-SEP-13
B 01-OCT-13 30-NOV-13
A 01-DEC-13 31-MAR-14

The innermost query looks at the preceding and following records for the product, and only retains the start and/or end time if the records are not contiguous:

select t.product,
case when lag(end_date)
over (partition by product order by start_date) is null
or lag(end_date)
over (partition by product order by start_date) != start_date - 1
then start_date end as start_date,
case when lead(start_date)
over (partition by product order by start_date) is null
or lead(start_date)
over (partition by product order by start_date) != end_date + 1
then end_date end as end_date
from t;

PRODUCT START_DATE END_DATE
------- ---------- ---------
A 01-JUL-13
A
A 30-SEP-13
A 01-DEC-13
A
A
A 31-MAR-14
B 01-OCT-13
B 30-NOV-13

The next level of select removes those which are mid-period, where both dates were blanked by the inner query, which gives:

PRODUCT START_DATE END_DATE
------- ---------- ---------
A 01-JUL-13
A 30-SEP-13
A 01-DEC-13
A 31-MAR-14
B 01-OCT-13
B 30-NOV-13

The outer query then collapses those adjacent pairs. I've used the easy route of creating duplicates and then eliminating them with distinct. You could do it other ways, such as putting both values into one row of each pair, leaving both values null in the other row, and then eliminating the null rows with another layer of select, but I think distinct is OK here.

If your real-world use case has times, not just dates, then you'll need to adjust the comparison in the inner query: rather than +/- 1, use an interval of 1 second perhaps, or 1/86400 if you prefer. The right step depends on the precision of your values.
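
As a rough illustration of both the boundary-marking approach and the step adjustment, here is a Python sketch for a single product. join_intervals and its step parameter are hypothetical, the monthly sample rows are made up (the original input table is not shown here), and non-overlapping intervals are assumed.

```python
from datetime import date, datetime, timedelta

def join_intervals(intervals, step=timedelta(days=1)):
    """intervals: non-overlapping (start, end) pairs for ONE product."""
    ordered = sorted(intervals)
    starts, ends = [], []
    for i, (start, end) in enumerate(ordered):
        prev_end = ordered[i - 1][1] if i > 0 else None
        next_start = ordered[i + 1][0] if i + 1 < len(ordered) else None
        # Keep the start only if the previous interval is not contiguous (the lag test).
        if prev_end is None or prev_end != start - step:
            starts.append(start)
        # Keep the end only if the next interval is not contiguous (the lead test).
        if next_start is None or next_start != end + step:
            ends.append(end)
    # Each island contributes exactly one kept start and one kept end, in order.
    return list(zip(starts, ends))

# Hypothetical monthly validity rows for one product.
monthly = [
    (date(2013, 7, 1), date(2013, 7, 31)),
    (date(2013, 8, 1), date(2013, 8, 31)),
    (date(2013, 9, 1), date(2013, 9, 30)),
    (date(2013, 12, 1), date(2013, 12, 31)),
    (date(2014, 1, 1), date(2014, 1, 31)),
]
print(join_intervals(monthly))

# With timestamps, pass a step that matches their precision.
timed = [
    (datetime(2013, 7, 1, 10, 0, 0), datetime(2013, 7, 1, 10, 59, 59)),
    (datetime(2013, 7, 1, 11, 0, 0), datetime(2013, 7, 1, 11, 30, 0)),
]
print(join_intervals(timed, step=timedelta(seconds=1)))
```

With the default one-day step the two timestamp intervals would stay separate, which is the mismatch the paragraph above warns about.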

Find non consecutive date ranges

You can detect gaps with LAG() and mark them. Then it's easy to keep only the rows that follow a gap. For example:

select *
from (
select *,
case when dateadd(day, -1, start_date) >
lag(end_date) over(partition by client_id order by start_date)
then 1 else 0 end as i
from t
) x
where i = 1

Or simpler...

select *
from (
select *,
lag(end_date) over(partition by client_id order by start_date) as prev_end
from t
) x
where dateadd(day, -1, start_date) > prev_end
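
The same check is easy to reproduce outside the database. The sketch below is an illustration for a single client_id; rows_after_gap and the sample ranges are made up, not taken from the original answer.

```python
from datetime import date, timedelta

def rows_after_gap(ranges, step=timedelta(days=1)):
    """Return the (start, end) ranges that begin more than `step` after the previous end."""
    ordered = sorted(ranges)
    flagged = []
    # Pair each row with its predecessor, like lag(end_date) over (order by start_date).
    for prev, cur in zip(ordered, ordered[1:]):
        # Same test as dateadd(day, -1, start_date) > prev_end.
        if cur[0] - step > prev[1]:
            flagged.append(cur)
    return flagged

# Hypothetical ranges: the third one starts after a four-day gap.
client_ranges = [
    (date(2021, 1, 1), date(2021, 1, 10)),
    (date(2021, 1, 11), date(2021, 1, 20)),
    (date(2021, 1, 25), date(2021, 1, 31)),
]
print(rows_after_gap(client_ranges))
```

As in the SQL, the first range of a client is never returned, because it has no previous end date to compare with.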

