SQL Query to Collapse Duplicate Values by Date Range

SQL Query to Collapse Duplicate Values By Date Range

I got it to work as follows. It is heavy on analytic functions and is Oracle specific.

select distinct id, value,
decode(startMonth, null,
  lag(startMonth) over(partition by id, value order by startMonth, endMonth),  --if start is null, it's an end so take from the row before
startMonth) startMonth,

  decode(endMonth, null,
  lead(endMonth) over(partition by id, value order by startMonth, endMonth),  --if end is null, it's an start so take from the row after
endMonth) endMonth    

from (
select id, value, startMonth, endMonth from(
select id, value, 
decode(month+1, lead(month) over(partition by id,value order by month), null, month)     
startMonth, --get the beginning month for each interval
decode(month-1, lag(month) over(partition by id,value order by month), null, month)     
endMonth --get the end month for each interval from Tbl
) a 
where startMonth is not null or endMonth is not null --remain with start and ends only
)b

It might be possible to simplify some of the inner queries somewhat

The inner query checks if the month is a first/last month of the interval as follows: if the month + 1 == the next month (lag) for that grouping, then since there is a next month, this month is obviously not the end month. Otherwise, it is the last month of the interval. The same concept is used to check for the first month.

The outer query first filters out all rows that are not either start or end months (where startMonth is not null or endMonth is not null).
Then, each row is either a start month or an end month (or both), determined by whether start or end is not null). If the month is a start month, get the corresponding end month by getting the next (lead) endMonth for that id,value ordered by endMonth, and if it is an endMonth get the startMonth by looking for the previous startMonth (lag)

Complex SQL query - collapsing date ranges

Here is a standard way to do collapsing time ranges in SQL using analytic functions with an example.

Your table:

SQL> create table mytable (startdate,enddate)
  2  as
  3  select date '2000-01-01', date '2001-02-01' from dual union all
  4  select date '2001-02-02', date '2001-04-01' from dual union all
  5  select date '2004-04-01', date '2006-06-02' from dual union all
  6  select date '2006-06-03', date '2010-09-01' from dual union all
  7  select date '2011-08-01', date '2012-09-01' from dual
  8  /

Table created.

The query:

SQL> select min(startdate) startdate
  2       , max(enddate)   enddate
  3    from ( select startdate
  4                , enddate
  5                , max(rn) over (order by startdate) maxrn
  6             from ( select startdate
  7                         , enddate
  8                         , case lag(enddate) over (order by startdate)
  9                           when startdate-1 then
 10                             null
 11                           else
 12                             rownum
 13                           end rn
 14                      from mytable
 15                  )
 16         )
 17   group by maxrn
 18   order by startdate
 19  /

STARTDATE           ENDDATE
------------------- -------------------
01-01-2000 00:00:00 01-04-2001 00:00:00
01-04-2004 00:00:00 01-09-2010 00:00:00
01-08-2011 00:00:00 01-09-2012 00:00:00

3 rows selected.

It works in three phases:

assign a unique rownum to only those records that are the start of a group
give the records that are not the start of a group the same number as the start of the group (using analytic function MAX with a sliding window)
aggregate by group number

And the real beauty of this query is that only one TABLE ACCESS FULL is needed:

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'))
  2  /

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------------------------
SQL_ID  8v1suw8j53tqz, child number 0
-------------------------------------
select min(startdate) startdate      , max(enddate)   enddate   from ( select startdate               , enddate
             , max(rn) over (order by startdate) maxrn            from ( select startdate
, enddate                        , case lag(enddate) over (order by startdate)                          when
startdate-1 then                            null                          else                            rownum
                         end rn                     from mytable                 )        )  group by maxrn
order by startdate

Plan hash value: 2933657513

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------
|   1 |  SORT ORDER BY           |         |      1 |      5 |      3 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   2 |   HASH GROUP BY          |         |      1 |      5 |      3 |00:00:00.01 |       3 |       |       |          |
|   3 |    VIEW                  |         |      1 |      5 |      5 |00:00:00.01 |       3 |       |       |          |
|   4 |     WINDOW BUFFER        |         |      1 |      5 |      5 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   5 |      VIEW                |         |      1 |      5 |      5 |00:00:00.01 |       3 |       |       |          |
|   6 |       WINDOW SORT        |         |      1 |      5 |      5 |00:00:00.01 |       3 |  2048 |  2048 | 2048  (0)|
|   7 |        COUNT             |         |      1 |        |      5 |00:00:00.01 |       3 |       |       |          |
|   8 |         TABLE ACCESS FULL| MYTABLE |      1 |      5 |      5 |00:00:00.01 |       3 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------

24 rows selected.

Regards,

Rob.

Remove duplicate data in 'date-ranged' rows

EDITED, following comments. Try:

;with cte as (
select m1.ID, m1.StartDate, m1.EndDate, m1.a, m1.b, m1.c, m1.d, m1.e, m1.f
from sampledata m1
where not exists
(select null from sampledata m0
 where m1.ID = m0.ID and 
       m1.a = m0.a and 
       m1.b = m0.b and 
       m1.c = m0.c and 
       m1.d = m0.d and 
       m1.e = m0.e and 
       m1.f = m0.f and 
       dateadd(day, -1, m1.StartDate) = m0.EndDate)
union all
select m1.ID, m1.StartDate, m2.EndDate, m1.a, m1.b, m1.c, m1.d, m1.e, m1.f
from cte m1
join sampledata m2 
       on m1.ID = m2.ID and 
          m1.a = m2.a and 
          m1.b = m2.b and 
          m1.c = m2.c and 
          m1.d = m2.d and 
          m1.e = m2.e and 
          m1.f = m2.f and 
          dateadd(day, 1, m1.EndDate) = m2.StartDate)
select ID, StartDate, max(EndDate) EndDate, a, b, c, d, e, f
from cte 
group by ID, StartDate, a, b, c, d, e, f
OPTION (MAXRECURSION 32767)

Combining split date ranges in a SQL query

The following code should work. I've made a few assumptions as follows: there are no overlaps of date ranges, there are no NULL values in any of the fields, and the start date for a given row is always less than the end date. If your data doesn't fit these criteria, you'll need to adjust this method, but it should point you in the right direction.

You can use subqueries instead of the views, but that can be cumbersome so I used the views to make the code clearer.

CREATE VIEW dbo.StudentStartDates
AS
    SELECT
        S.StudentID,
        S.StartDate,
        S.Field1,
        S.Field2
    FROM
        dbo.Students S
    LEFT OUTER JOIN dbo.Students PREV ON
        PREV.StudentID = S.StudentID AND
        PREV.Field1 = S.Field1 AND
        PREV.Field2 = S.Field2 AND
        PREV.EndDate = DATEADD(dy, -1, S.StartDate)
    WHERE PREV.StudentID IS NULL
GO

CREATE VIEW dbo.StudentEndDates
AS
    SELECT
        S.StudentID,
        S.EndDate,
        S.Field1,
        S.Field2
    FROM
        dbo.Students S
    LEFT OUTER JOIN dbo.Students NEXT ON
        NEXT.StudentID = S.StudentID AND
        NEXT.Field1 = S.Field1 AND
        NEXT.Field2 = S.Field2 AND
        NEXT.StartDate = DATEADD(dy, 1, S.EndDate)
    WHERE NEXT.StudentID IS NULL
GO

SELECT
    SD.StudentID,
    SD.StartDate,
    ED.EndDate,
    SD.Field1,
    SD.Field2
FROM
    dbo.StudentStartDates SD
INNER JOIN dbo.StudentEndDates ED ON
    ED.StudentID = SD.StudentID AND
    ED.Field1 = SD.Field1 AND
    ED.Field2 = SD.Field2 AND
    ED.EndDate > SD.StartDate AND
    NOT EXISTS (SELECT * FROM dbo.StudentEndDates ED2 WHERE ED2.StudentID = SD.StudentID AND ED2.Field1 = SD.Field1 AND ED2.Field2 = SD.Field2 AND ED2.EndDate < ED.EndDate AND ED2.EndDate > SD.StartDate)
GO

Collapsing date ranges in SQL (Netezza)

You can use a left join to see if something is connected to the previous record. If there is no connection, then you have the beginning of a "continuous period". Then, a cumulative sum assigns a grouping, which can be used for aggregation.

That is how this query works:

select visitid, unit,
       min(admissiondate) as admissiondate,
       max(dischargedate) as dischargedate
from (select t.*,
             sum(case when tprev.visitid is null then 1 else 0 end) over
                 (partition by t.visitid, t.unit order by t.admissiondate 
                 ) as grp
      from t left join
           t tprev
           on t.visitid = tprev.visitid and t.unit = tprev.unit and
              t.admissiondate = tprev.dischargedate
     ) t
group by grp, visitid, unit;

Note: This assumes that the new admission date is exactly the same as the previous discharge date. Of course, you can introduce non-equality logic if you want to check that the admission occurred within, say 10 seconds or 5 minutes of the discharge.

Detect overlapping date ranges from the same table

If you already have entries for each day that should work, but if you don't the overhead is significant, and if that query is used often, if will affect performance.

If the data is in this format, you can detect overlaps using simple date arithmetic, because an overlap is simply one interval starting after a given interval, but before the given is finished, something like

select dr1.* from date_ranges dr1
inner join date_ranges dr2
on dr2.start > dr1.start -- start after dr1 is started
  and dr2.start < dr1.end -- start before dr1 is finished

If you need special handling for interval that are wholly within another interval, or you need to merge intervals, i.e.

PKey  Start       End         Type
====  =====       ===         ====
01    01/01/2010  20/01/2010  S
02    15/01/2010  31/01/2010  S

yielding

Start       End         Type
=====       ===         ====
01/01/2010  31/01/2010  S

you will need more complex calculation.

In my experience with this kind of problems, once you get how to do the calculation by hand, it's easy to transfer it into SQL :)

merge consecutive dates ranges using SQL

This is a Gaps & Islands problem. You can use the typical solution using LAG(). For example:

select
  max(client_id) as client_id,
  max(status) as status,
  min(start_date) as start_date,
  max(end_date) as end_date
from (
  select *, sum(i) over(partition by client_id order by start_date) as g
  from (
    select *,
      case when dateadd(day, -1, start_date) <> 
        lag(end_date) over(partition by client_id order by start_date) 
      then 1 else 0 end as i
    from t
  ) x
) y
group by client_id, g
order by client_id, g

Result:

 client_id  status  start_date  end_date   
 ---------- ------- ----------- ---------- 
 1          1       2022-01-01  2022-01-04 
 1          1       2022-01-12  2022-01-15 
 2          1       2022-01-03  2022-01-03 
 2          1       2022-01-05  2022-01-06

See running example at db<>fiddle.

SQL Query to Collapse Duplicate Values by Date Range