Combine Consecutive Date Ranges

The odd-looking date '32121231' in the query is just an arbitrarily large date used to handle your "no-end-date" scenario. I have assumed you won't have many date ranges per employee, so I've used a simple recursive common table expression to combine the ranges.

To make it run faster, the starting anchor query keeps only those dates that will not link up to a prior range (per employee). The rest is just tree-walking the date ranges and growing the range. The final GROUP BY keeps only the largest date range built up per starting ANCHOR (employmentid, startdate) combination.

MS SQL Server 2008 Schema Setup:

create table Tbl (
  employmentid int,
  startdate datetime,
  enddate datetime);

insert Tbl values
(5, '2007-12-03', '2011-08-26'),
(5, '2013-05-02', null),
(30, '2006-10-02', '2011-01-16'),
(30, '2011-01-17', '2012-08-12'),
(30, '2012-08-13', null),
(66, '2007-09-24', null);

/*
-- expected outcome
EmploymentId StartDate   EndDate
5            2007-12-03  2011-08-26
5            2013-05-02  NULL
30           2006-10-02  NULL
66           2007-09-24  NULL
*/

Query 1:

;with cte as (
   select a.employmentid, a.startdate, a.enddate
     from Tbl a
left join Tbl b on a.employmentid=b.employmentid and a.startdate-1=b.enddate
    where b.employmentid is null
    union all
   select a.employmentid, a.startdate, b.enddate
     from cte a
     join Tbl b on a.employmentid=b.employmentid and b.startdate-1=a.enddate
)
   select employmentid,
          startdate,
          nullif(max(isnull(enddate,'32121231')),'32121231') enddate
     from cte
 group by employmentid, startdate
 order by employmentid

Results:

| EMPLOYMENTID |                        STARTDATE |                       ENDDATE |
-----------------------------------------------------------------------------------
|            5 |  December, 03 2007 00:00:00+0000 | August, 26 2011 00:00:00+0000 |
|            5 |       May, 02 2013 00:00:00+0000 |                        (null) |
|           30 |   October, 02 2006 00:00:00+0000 |                        (null) |
|           66 | September, 24 2007 00:00:00+0000 |                        (null) |
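
For readers who want to follow the logic outside SQL, here is a minimal Python sketch of the same idea: find the anchor ranges, then walk forward while a follow-on range exists. The function name combine_consecutive and the FAR_FUTURE sentinel are illustrative assumptions rather than part of the answer above; the sample rows mirror the schema setup.

```python
from datetime import date, timedelta

FAR_FUTURE = date.max  # assumed stand-in for an open-ended (None) end date

def combine_consecutive(rows):
    """rows: iterable of (employmentid, startdate, enddate-or-None) tuples."""
    one_day = timedelta(days=1)
    # Index ranges by (employee, start date) so a follow-on range can be looked up.
    by_start = {(emp, start): (end or FAR_FUTURE) for emp, start, end in rows}
    # End dates per employee, used to recognise ranges that continue a prior one.
    ends = {(emp, end) for (emp, _), end in by_start.items()}
    result = []
    for (emp, start), end in sorted(by_start.items()):
        # Anchor test: skip ranges that start the day after another range ends.
        if (emp, start - one_day) in ends:
            continue
        # Walk the chain, growing the end date while a follow-on range exists.
        while end != FAR_FUTURE and (emp, end + one_day) in by_start:
            end = by_start[(emp, end + one_day)]
        result.append((emp, start, None if end == FAR_FUTURE else end))
    return result

rows = [
    (5, date(2007, 12, 3), date(2011, 8, 26)),
    (5, date(2013, 5, 2), None),
    (30, date(2006, 10, 2), date(2011, 1, 16)),
    (30, date(2011, 1, 17), date(2012, 8, 12)),
    (30, date(2012, 8, 13), None),
    (66, date(2007, 9, 24), None),
]
for row in combine_consecutive(rows):
    print(row)
```

Unlike the recursive CTE, this walks each chain once from its anchor, so no final grouping step is needed to discard the partial ranges.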

Merge consecutive and overlapping date ranges

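Within each customer (groupby("cust", as_index=False)), compare each row's start date with the previous row's end date plus one day to detect overlapping or consecutive ranges.
Negating that test and applying cumsum() to the booleans gives a group number that increases each time a disconnected range starts.
Finally, take the min start date and the max end date within each group.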


import pandas as pd

# Sort up front so shift() compares each row with the previous range of the same customer.
df.sort_values(["cust", "start_date", "end_date"]).groupby("cust", as_index=False).apply(
    lambda d: d.groupby(
        ["cust", (~(d["start_date"] <= (d["end_date"].shift() + pd.Timedelta(days=1)))).cumsum()],
        as_index=False
    )
    .agg({"start_date": "min", "end_date": "max"})
).reset_index(drop=True)

	cust	start_date	end_date
0	CUST123	2021-01-01 00:00:00	2021-01-31 00:00:00
1	CUST123	2021-02-02 00:00:00	2021-02-28 00:00:00
2	CUST456	2021-01-05 00:00:00	2021-01-31 00:00:00
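
The same merge can be sketched without pandas as a sort-and-sweep. This is an illustration under assumptions, not the answer's method: the function name merge_ranges, the gap parameter and the sample rows are all hypothetical. It compares each start date against the latest end date seen so far, rather than only the previous row's end date.

```python
from collections import defaultdict
from datetime import date, timedelta

def merge_ranges(rows, gap=timedelta(days=1)):
    """rows: iterable of (cust, start_date, end_date); returns merged rows."""
    by_cust = defaultdict(list)
    for cust, start, end in rows:
        by_cust[cust].append((start, end))
    merged = []
    for cust in sorted(by_cust):
        current = None
        for start, end in sorted(by_cust[cust]):
            if current and start <= current[2] + gap:
                # Overlaps or directly follows: keep the later of the two end dates.
                current[2] = max(current[2], end)
            else:
                current = [cust, start, end]
                merged.append(current)
    return [tuple(r) for r in merged]

# Hypothetical sample data
rows = [
    ("CUST123", date(2021, 1, 1), date(2021, 1, 31)),
    ("CUST123", date(2021, 1, 5), date(2021, 1, 10)),   # nested in the first range
    ("CUST123", date(2021, 1, 20), date(2021, 1, 25)),  # also nested
    ("CUST123", date(2021, 2, 2), date(2021, 2, 28)),   # starts two days after 31 Jan
    ("CUST456", date(2021, 1, 5), date(2021, 1, 15)),
    ("CUST456", date(2021, 1, 16), date(2021, 1, 31)),  # consecutive
]
for row in merge_ranges(rows):
    print(row)
```

Tracking the running maximum end date matters when a short range is nested inside a longer one: a comparison against only the previous row's end date would split the 20 Jan range above into its own group.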

Merging records with consecutive dates in SQL

This is built from CTEs, so it all needs to be executed as one statement, but I'll explain as I go.

First I'll set the parameters for the date range we are interested in:

DECLARE @StartDate DateTime; SET @StartDate = '2011-11-01';  
DECLARE @EndDate DateTime; SET @EndDate = '2011-11-30';

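Then I'll turn them into a list of dates using a recursive CTE. (SQL Server stops at 100 recursion levels by default, so a range much longer than 100 days needs OPTION (MAXRECURSION n) on the final SELECT.)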

WITH 
    ValidDates ( ValidDate ) AS 
        (
            SELECT @StartDate 
                UNION ALL
            SELECT DateAdd(day, 1, ValidDate) 
                FROM ValidDates 
                WHERE ValidDate < @EndDate
        ),

By joining that with the ranges in the original records I get a list of individual days of absence.

Using a combination of row_number and datediff I can group consecutive dates. This assumes that there are no duplicates.

    DaysAbsent AS 
        (
            SELECT 
                  A.RecordID
                , A.EmpID
                , A.AbsCode
                , DateDiff(Day, @StartDate, D.ValidDate) 
                    - row_number() 
                        over (partition by A.EmpID, A.AbsCode  
                            order by D.ValidDate) AS DayGroup
                , D.ValidDate AS AbsentDay
            FROM 
                dbo.Absence A
                    INNER JOIN  
                ValidDates D
                    ON D.ValidDate >= DateFrom 
                       and  D.ValidDate <= DateTo 
        )

Now it's a simple select with min and max to turn it back into ranges.

SELECT 
      EmpID
    , AbsCode
    , MIN(AbsentDay) AS DateFrom
    , MAX(AbsentDay) AS DateTo
FROM
    DaysAbsent
GROUP BY
      EmpID
    , AbsCode
    , DayGroup

The DayGroup isn't needed in the output but is needed for the grouping; otherwise non-consecutive groups would be collapsed into one.
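
To see why the day number minus the row number identifies a run of consecutive days, here is a small Python sketch of the same trick. The function name merge_absences and the sample rows are assumptions for illustration; it also omits the @StartDate/@EndDate window and, unlike the SQL, removes duplicate days by collecting them in a set.

```python
from datetime import date, timedelta
from itertools import groupby

def merge_absences(absences):
    """absences: iterable of (emp_id, abs_code, date_from, date_to) tuples."""
    # Expand every range into individual days; the set drops duplicate days.
    days = {
        (emp, code, start + timedelta(days=n))
        for emp, code, start, end in absences
        for n in range((end - start).days + 1)
    }
    merged = []
    for (emp, code), group in groupby(sorted(days), key=lambda d: d[:2]):
        # Day ordinal minus row number is constant within a run of consecutive days.
        numbered = enumerate(day for _, _, day in group)
        for _, run in groupby(numbered, key=lambda p: p[1].toordinal() - p[0]):
            run_days = [day for _, day in run]
            merged.append((emp, code, run_days[0], run_days[-1]))
    return merged

# Hypothetical sample data
absences = [
    (1, "SICK", date(2011, 11, 1), date(2011, 11, 3)),
    (1, "SICK", date(2011, 11, 4), date(2011, 11, 5)),    # continues the first range
    (1, "SICK", date(2011, 11, 10), date(2011, 11, 11)),  # separate run
    (2, "HOL", date(2011, 11, 7), date(2011, 11, 7)),
]
for row in merge_absences(absences):
    print(row)
```

Each time a day is skipped, the ordinal jumps by more than the row number does, so the difference changes and a new group begins.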

How to merge consecutive date into one date?

This is a type of gaps-and-islands problem. You can use lag() and a cumulative sum to identify the groupings. Then aggregate:

select col1, col2, min(col3), max(col4)
from (select t.*,
             sum(case when prev_col4 = col3 then 0 else 1 end) over
                 (partition by col1, col2
                  order by col3
                  rows between unbounded preceding and current row
                 ) as grp
      from (select t.*,
                   lag(col4) over (partition by col1, col2 order by col3) as prev_col4
            from t
            ) t
     ) t
group by col1, col2, grp;
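
The lag-plus-running-sum pattern can be mirrored step by step in Python. This is a rough sketch for a single (col1, col2) partition; the function name merge_touching and the sample values are hypothetical, and like the query it treats a row as a continuation only when the previous end equals its start.

```python
from datetime import date
from itertools import accumulate

def merge_touching(rows):
    """rows: (start, end) pairs for one partition."""
    rows = sorted(rows)
    # lag(col4): the previous row's end, None for the first row.
    prev_ends = [None] + [end for _, end in rows[:-1]]
    # 0 when the row continues the previous one, 1 when it starts a new island.
    flags = [0 if prev == start else 1 for prev, (start, _) in zip(prev_ends, rows)]
    # The running sum of the flags is the group number (grp).
    groups = accumulate(flags)
    merged = {}
    for grp, (start, end) in zip(groups, rows):
        lo, hi = merged.get(grp, (start, end))
        merged[grp] = (min(lo, start), max(hi, end))
    return list(merged.values())

# Hypothetical sample data
sample = [
    (date(2020, 1, 1), date(2020, 2, 1)),
    (date(2020, 2, 1), date(2020, 3, 1)),  # starts where the previous row ends
    (date(2020, 6, 1), date(2020, 7, 1)),  # gap, so a new group
]
print(merge_touching(sample))
```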

Joining together consecutive date validity intervals

This is a gaps-and-islands problem. There are various ways to approach it; this uses lead and lag analytic functions:

select distinct product,
  case when start_date is null then lag(start_date)
    over (partition by product order by rn) else start_date end as start_date,
  case when end_date is null then lead(end_date)
    over (partition by product order by rn) else end_date end as end_date
from (
  select product, start_date, end_date, rn
  from (
    select t.product,
      case when lag(end_date)
          over (partition by product order by start_date) is null
        or lag(end_date)
          over (partition by product order by start_date) != start_date - 1
        then start_date end as start_date,
      case when lead(start_date)
          over (partition by product order by start_date) is null
        or lead(start_date)
          over (partition by product order by start_date) != end_date + 1
        then end_date end as end_date,
      row_number() over (partition by product order by start_date) as rn
    from t
  )
  where start_date is not null or end_date is not null
)
order by start_date, product;

PRODUCT START_DATE END_DATE
------- ---------- ---------
A       01-JUL-13  30-SEP-13 
B       01-OCT-13  30-NOV-13 
A       01-DEC-13  31-MAR-14

The innermost query looks at the preceding and following records for the product, and only retains the start and/or end time if the records are not contiguous:

select t.product,
  case when lag(end_date)
      over (partition by product order by start_date) is null
    or lag(end_date)
      over (partition by product order by start_date) != start_date - 1
    then start_date end as start_date,
  case when lead(start_date)
      over (partition by product order by start_date) is null
    or lead(start_date)
      over (partition by product order by start_date) != end_date + 1
    then end_date end as end_date
from t;

PRODUCT START_DATE END_DATE
------- ---------- ---------
A       01-JUL-13            
A                            
A                  30-SEP-13 
A       01-DEC-13            
A                            
A                            
A                  31-MAR-14 
B       01-OCT-13            
B                  30-NOV-13

The next level of select removes those which are mid-period, where both dates were blanked by the inner query, which gives:

PRODUCT START_DATE END_DATE
------- ---------- ---------
A       01-JUL-13            
A                  30-SEP-13 
A       01-DEC-13            
A                  31-MAR-14 
B       01-OCT-13            
B                  30-NOV-13

The outer query then collapses those adjacent pairs. I've taken the easy route of creating duplicates and then eliminating them with distinct. There are other ways: for example, you could put both values into one row of each pair, leave both values null in the other row, and then eliminate the null rows with another layer of select. I think distinct is OK here.

If your real-world use case has times, not just dates, then you'll need to adjust the comparison in the inner query: rather than +/- 1, use an interval of 1 second perhaps, or 1/86400 if you prefer. The right step depends on the precision of your values.
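
As a small illustration of how the contiguity step changes with precision, here is a Python sketch. The helper name is_contiguous and its step parameter are assumptions made for this example.

```python
from datetime import datetime, timedelta

def is_contiguous(prev_end, next_start, step=timedelta(seconds=1)):
    """True when next_start falls exactly one `step` after prev_end."""
    return next_start - prev_end == step

end_of_september = datetime(2013, 9, 30, 23, 59, 59)
start_of_october = datetime(2013, 10, 1, 0, 0, 0)

# With one-second precision the two periods are contiguous.
print(is_contiguous(end_of_september, start_of_october))
# With a one-day step the same pair is not treated as contiguous.
print(is_contiguous(end_of_september, start_of_october, step=timedelta(days=1)))
```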

Find non consecutive date ranges

You can detect gaps with LAG() and mark them. Then, it's easy to filter out the rows. For example:

select *
from (
  select *,
    case when dateadd(day, -1, start_date) >
       lag(end_date) over(partition by client_id order by start_date) 
    then 1 else 0 end as i
  from t
) x
where i = 1

Or simpler...

select *
from (
  select *,
    lag(end_date) over(partition by client_id order by start_date) as prev_end
  from t
) x
where dateadd(day, -1, start_date) > prev_end
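
The same gap test can be sketched in Python for one client's ranges. The function name ranges_after_gap and the sample data are hypothetical. As in the SQL, the first range has no predecessor and is therefore never reported.

```python
from datetime import date, timedelta

def ranges_after_gap(ranges):
    """Return the (start, end) pairs that begin more than one day after the previous end."""
    ordered = sorted(ranges)
    return [
        cur
        for prev, cur in zip(ordered, ordered[1:])
        # Same test as dateadd(day, -1, start_date) > prev_end.
        if cur[0] - timedelta(days=1) > prev[1]
    ]

# Hypothetical sample data for a single client
client_ranges = [
    (date(2021, 1, 1), date(2021, 1, 10)),
    (date(2021, 1, 11), date(2021, 1, 20)),  # consecutive, not reported
    (date(2021, 1, 25), date(2021, 1, 31)),  # follows a gap, reported
]
print(ranges_after_gap(client_ranges))
```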
