SQL Query to Collapse Duplicate Values by Date Range

SQL Query to Collapse Duplicate Values By Date Range

I got it to work as follows. It is heavy on analytic functions and is Oracle specific.

select distinct id, value,
decode(startMonth, null,
lag(startMonth) over(partition by id, value order by startMonth, endMonth), --if start is null, it's an end so take from the row before
startMonth) startMonth,

decode(endMonth, null,
lead(endMonth) over(partition by id, value order by startMonth, endMonth), --if end is null, it's an start so take from the row after
endMonth) endMonth

from (
select id, value, startMonth, endMonth from(
select id, value,
decode(month+1, lead(month) over(partition by id,value order by month), null, month)
startMonth, --get the beginning month for each interval
decode(month-1, lag(month) over(partition by id,value order by month), null, month)
endMonth --get the end month for each interval from Tbl
) a
where startMonth is not null or endMonth is not null --remain with start and ends only
)b

It might be possible to simplify some of the inner queries somewhat

The inner query checks if the month is a first/last month of the interval as follows: if the month + 1 == the next month (lag) for that grouping, then since there is a next month, this month is obviously not the end month. Otherwise, it is the last month of the interval. The same concept is used to check for the first month.

The outer query first filters out all rows that are not either start or end months (where startMonth is not null or endMonth is not null).
Then, each row is either a start month or an end month (or both), determined by whether start or end is not null). If the month is a start month, get the corresponding end month by getting the next (lead) endMonth for that id,value ordered by endMonth, and if it is an endMonth get the startMonth by looking for the previous startMonth (lag)

Complex SQL query - collapsing date ranges

Here is a standard way to do collapsing time ranges in SQL using analytic functions with an example.

Your table:

SQL> create table mytable (startdate,enddate)
2 as
3 select date '2000-01-01', date '2001-02-01' from dual union all
4 select date '2001-02-02', date '2001-04-01' from dual union all
5 select date '2004-04-01', date '2006-06-02' from dual union all
6 select date '2006-06-03', date '2010-09-01' from dual union all
7 select date '2011-08-01', date '2012-09-01' from dual
8 /

Table created.

The query:

SQL> select min(startdate) startdate
2 , max(enddate) enddate
3 from ( select startdate
4 , enddate
5 , max(rn) over (order by startdate) maxrn
6 from ( select startdate
7 , enddate
8 , case lag(enddate) over (order by startdate)
9 when startdate-1 then
10 null
11 else
12 rownum
13 end rn
14 from mytable
15 )
16 )
17 group by maxrn
18 order by startdate
19 /

STARTDATE ENDDATE
------------------- -------------------
01-01-2000 00:00:00 01-04-2001 00:00:00
01-04-2004 00:00:00 01-09-2010 00:00:00
01-08-2011 00:00:00 01-09-2012 00:00:00

3 rows selected.

It works in three phases:

  1. assign a unique rownum to only those records that are the start of a group
  2. give the records that are not the start of a group the same number as the start of the group (using analytic function MAX with a sliding window)
  3. aggregate by group number

And the real beauty of this query is that only one TABLE ACCESS FULL is needed:

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'))
2 /

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------------------------
SQL_ID 8v1suw8j53tqz, child number 0
-------------------------------------
select min(startdate) startdate , max(enddate) enddate from ( select startdate , enddate
, max(rn) over (order by startdate) maxrn from ( select startdate
, enddate , case lag(enddate) over (order by startdate) when
startdate-1 then null else rownum
end rn from mytable ) ) group by maxrn
order by startdate

Plan hash value: 2933657513

-------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------
| 1 | SORT ORDER BY | | 1 | 5 | 3 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)|
| 2 | HASH GROUP BY | | 1 | 5 | 3 |00:00:00.01 | 3 | | | |
| 3 | VIEW | | 1 | 5 | 5 |00:00:00.01 | 3 | | | |
| 4 | WINDOW BUFFER | | 1 | 5 | 5 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)|
| 5 | VIEW | | 1 | 5 | 5 |00:00:00.01 | 3 | | | |
| 6 | WINDOW SORT | | 1 | 5 | 5 |00:00:00.01 | 3 | 2048 | 2048 | 2048 (0)|
| 7 | COUNT | | 1 | | 5 |00:00:00.01 | 3 | | | |
| 8 | TABLE ACCESS FULL| MYTABLE | 1 | 5 | 5 |00:00:00.01 | 3 | | | |
-------------------------------------------------------------------------------------------------------------------------

24 rows selected.

Regards,

Rob.

Remove duplicate data in 'date-ranged' rows

EDITED, following comments. Try:

;with cte as (
select m1.ID, m1.StartDate, m1.EndDate, m1.a, m1.b, m1.c, m1.d, m1.e, m1.f
from sampledata m1
where not exists
(select null from sampledata m0
where m1.ID = m0.ID and
m1.a = m0.a and
m1.b = m0.b and
m1.c = m0.c and
m1.d = m0.d and
m1.e = m0.e and
m1.f = m0.f and
dateadd(day, -1, m1.StartDate) = m0.EndDate)
union all
select m1.ID, m1.StartDate, m2.EndDate, m1.a, m1.b, m1.c, m1.d, m1.e, m1.f
from cte m1
join sampledata m2
on m1.ID = m2.ID and
m1.a = m2.a and
m1.b = m2.b and
m1.c = m2.c and
m1.d = m2.d and
m1.e = m2.e and
m1.f = m2.f and
dateadd(day, 1, m1.EndDate) = m2.StartDate)
select ID, StartDate, max(EndDate) EndDate, a, b, c, d, e, f
from cte
group by ID, StartDate, a, b, c, d, e, f
OPTION (MAXRECURSION 32767)

Combining split date ranges in a SQL query

The following code should work. I've made a few assumptions as follows: there are no overlaps of date ranges, there are no NULL values in any of the fields, and the start date for a given row is always less than the end date. If your data doesn't fit these criteria, you'll need to adjust this method, but it should point you in the right direction.

You can use subqueries instead of the views, but that can be cumbersome so I used the views to make the code clearer.

CREATE VIEW dbo.StudentStartDates
AS
SELECT
S.StudentID,
S.StartDate,
S.Field1,
S.Field2
FROM
dbo.Students S
LEFT OUTER JOIN dbo.Students PREV ON
PREV.StudentID = S.StudentID AND
PREV.Field1 = S.Field1 AND
PREV.Field2 = S.Field2 AND
PREV.EndDate = DATEADD(dy, -1, S.StartDate)
WHERE PREV.StudentID IS NULL
GO

CREATE VIEW dbo.StudentEndDates
AS
SELECT
S.StudentID,
S.EndDate,
S.Field1,
S.Field2
FROM
dbo.Students S
LEFT OUTER JOIN dbo.Students NEXT ON
NEXT.StudentID = S.StudentID AND
NEXT.Field1 = S.Field1 AND
NEXT.Field2 = S.Field2 AND
NEXT.StartDate = DATEADD(dy, 1, S.EndDate)
WHERE NEXT.StudentID IS NULL
GO

SELECT
SD.StudentID,
SD.StartDate,
ED.EndDate,
SD.Field1,
SD.Field2
FROM
dbo.StudentStartDates SD
INNER JOIN dbo.StudentEndDates ED ON
ED.StudentID = SD.StudentID AND
ED.Field1 = SD.Field1 AND
ED.Field2 = SD.Field2 AND
ED.EndDate > SD.StartDate AND
NOT EXISTS (SELECT * FROM dbo.StudentEndDates ED2 WHERE ED2.StudentID = SD.StudentID AND ED2.Field1 = SD.Field1 AND ED2.Field2 = SD.Field2 AND ED2.EndDate < ED.EndDate AND ED2.EndDate > SD.StartDate)
GO

Collapsing date ranges in SQL (Netezza)

You can use a left join to see if something is connected to the previous record. If there is no connection, then you have the beginning of a "continuous period". Then, a cumulative sum assigns a grouping, which can be used for aggregation.

That is how this query works:

select visitid, unit,
min(admissiondate) as admissiondate,
max(dischargedate) as dischargedate
from (select t.*,
sum(case when tprev.visitid is null then 1 else 0 end) over
(partition by t.visitid, t.unit order by t.admissiondate
) as grp
from t left join
t tprev
on t.visitid = tprev.visitid and t.unit = tprev.unit and
t.admissiondate = tprev.dischargedate
) t
group by grp, visitid, unit;

Note: This assumes that the new admission date is exactly the same as the previous discharge date. Of course, you can introduce non-equality logic if you want to check that the admission occurred within, say 10 seconds or 5 minutes of the discharge.

Detect overlapping date ranges from the same table

If you already have entries for each day that should work, but if you don't the overhead is significant, and if that query is used often, if will affect performance.

If the data is in this format, you can detect overlaps using simple date arithmetic, because an overlap is simply one interval starting after a given interval, but before the given is finished, something like

select dr1.* from date_ranges dr1
inner join date_ranges dr2
on dr2.start > dr1.start -- start after dr1 is started
and dr2.start < dr1.end -- start before dr1 is finished

If you need special handling for interval that are wholly within another interval, or you need to merge intervals, i.e.

PKey  Start       End         Type
==== ===== === ====
01 01/01/2010 20/01/2010 S
02 15/01/2010 31/01/2010 S

yielding

Start       End         Type
===== === ====
01/01/2010 31/01/2010 S

you will need more complex calculation.

In my experience with this kind of problems, once you get how to do the calculation by hand, it's easy to transfer it into SQL :)

merge consecutive dates ranges using SQL

This is a Gaps & Islands problem. You can use the typical solution using LAG(). For example:

select
max(client_id) as client_id,
max(status) as status,
min(start_date) as start_date,
max(end_date) as end_date
from (
select *, sum(i) over(partition by client_id order by start_date) as g
from (
select *,
case when dateadd(day, -1, start_date) <>
lag(end_date) over(partition by client_id order by start_date)
then 1 else 0 end as i
from t
) x
) y
group by client_id, g
order by client_id, g

Result:

 client_id  status  start_date  end_date   
---------- ------- ----------- ----------
1 1 2022-01-01 2022-01-04
1 1 2022-01-12 2022-01-15
2 1 2022-01-03 2022-01-03
2 1 2022-01-05 2022-01-06

See running example at db<>fiddle.



Related Topics



Leave a reply



Submit