Find Min and Max for Subsets of Consecutive Rows - Gaps and Islands

Find min and max for subsets of consecutive rows - gaps and islands

You want to find consecutive groups of similar rows and then aggregation. I like the difference of row numbers approach:

select name, act, min(startt) as startt, max(endd) as endd
from (select i.*,
row_number() over (partition by name, act order by rn) as seqnum_na,
row_number() over (partition by name order by rn) as seqnum_n
from input i
) i
group by (seqnum_n - seqnum_na), name, act;

You can see how this works by looking at what the subquery does.

Finding the largest subsets of consecutive rows with a maximum gap size (gaps and islands)

Flag the start of group and aggregate groups

select min(t) start_t, max(t) end_t, count(*) cnt, sum(amt) total
from (
select t, amt, sum(flag) over(order by t) grp
from (
select t, amt, case when t - lag(t, 1, t-11) over(order by t) >= 10 then 1 end flag
from d
) t
) t
group by grp

Grouping subsets within ordered data in SQL

It is a gaps and islands problem, as was commented earlier, so you can check the linked articles to deep dive into the problem. But you need to group your data conditionally after finding islands (where you need to collapse only type = 'C' items.

Here's the code:

with s as (
select '1001' as order_nbr, '1' as step, 'P' as ex_type, timestamp '2021-01-01 00:00:00' as start_ts, timestamp '2021-01-01 09:00:00' as end_ts from dummy union all
select '1001' as order_nbr, '2' as step, 'C' as ex_type, timestamp '2021-01-04 03:00:00' as start_ts, timestamp '2021-01-04 06:00:00' as end_ts from dummy union all
select '1001' as order_nbr, '3' as step, 'C' as ex_type, timestamp '2021-01-03 07:00:00' as start_ts, timestamp '2021-01-03 08:00:00' as end_ts from dummy union all
select '1001' as order_nbr, '4' as step, 'C' as ex_type, timestamp '2021-01-05 10:00:00' as start_ts, timestamp '2021-01-05 15:00:00' as end_ts from dummy union all
select '1001' as order_nbr, '5' as step, 'Z' as ex_type, timestamp '2021-01-06 00:00:00' as start_ts, timestamp '2021-01-06 06:00:00' as end_ts from dummy union all
select '1001' as order_nbr, '6' as step, 'Z' as ex_type, timestamp '2021-01-06 16:00:00' as start_ts, timestamp '2021-01-06 20:00:00' as end_ts from dummy union all
select '1001' as order_nbr, '7' as step, 'C' as ex_type, timestamp '2021-01-07 08:00:00' as start_ts, timestamp '2021-01-07 09:00:00' as end_ts from dummy union all
select '1001' as order_nbr, '8' as step, 'C' as ex_type, timestamp '2021-01-07 10:00:00' as start_ts, timestamp '2021-01-07 12:00:00' as end_ts from dummy union all
select '1002' as order_nbr, '1' as step, 'P' as ex_type, timestamp '2021-01-04 08:00:00' as start_ts, timestamp '2021-01-04 16:00:00' as end_ts from dummy
)
, num as (
select
s.*
/*Find consecutive rows on ex_type field*/
, row_number() over(partition by order_nbr order by start_ts asc) as r1
, row_number() over(partition by order_nbr, ex_type order by start_ts asc) as r2
from s
)
select
order_nbr
, ex_type
, min(start_ts) as start_ts
, max(end_ts) as end_ts
, string_agg(step, ',' order by start_ts asc) as steps
from num
group by
order_nbr
, ex_type
, case
/*For C use group number, for others - use original row number not to collapse them*/
when ex_type = 'C'
then r1 - r2
else r1
end
order by
order_nbr
, start_ts asc

And here's db<>fiddle on PostgreSQL as HANA-syntactically identical platform for involved functions.

Gaps and islands on 2 columns - if column A consecutive and column B identical

There is not much to change in your query. You basically need to select name and number in the subquery and sort in the same order. Then you can group by name, number - rn in the outer query.

SELECT
min(number) first_number,
max(number) last_number,
count(*) AS no_records,
name
FROM (
SELECT c.*, @rn := @rn + 1 rn
from (
SELECT name, number
FROM `table`
WHERE cc = 1
ORDER BY name, number
LIMIT 99999999999999999
) AS c
CROSS JOIN (SELECT @rn := 0) r
) c
GROUP BY name, number - rn
ORDER BY first_number ASC, name ASC;

Result:

first_number  last_number  no_records  name
1 2 2 Apple
3 3 1 Bean
10 12 3 Hello
14 14 1 Deer
14 14 1 Door
15 15 1 Hello
17 17 1 Hello

db<>fiddle

I usually advocate against the use of session variables in this way. The reason is that such solutions depend on internal implementation, and can be broken by version updates or settings changes. For example: Once MariaDB decided to ignore the ORDER BY clause in subqueries without LIMIT. This is why I included a huge LIMIT.

I also replaced number with first_number in the outer ORDER BY clause to avoid problems with ONLY_FULL_GROUP_BY mode.

A more stable way to generate row numbers is to use an AOTO_INCREMENT column in a temporary table:

drop temporary table if exists tmp_tbl;

create temporary table tmp_tbl (
rn int unsigned auto_increment primary key,
name varchar(64) not null,
number int not null
);

insert into tmp_tbl (name, number)
select name, number
from `table`
order by name, number;

The final SELECT query is identical with the outer query above:

SELECT
min(number) first_number,
max(number) last_number,
count(*) AS no_records,
name
FROM tmp_tbl
GROUP BY name, number - rn
ORDER BY first_number ASC, name ASC;

db<>fiddle

In a more recent version (starting from MariaDB 10.2) you can use ROW_NUMBER() window function instead:

SELECT
min(number) first_number,
max(number) last_number,
count(*) AS no_records,
name
FROM (
SELECT
name,
number,
row_number() OVER (ORDER BY name, number) as rn
FROM `table`
WHERE cc = 1
) c
GROUP BY name, number - rn
ORDER BY first_number ASC, name ASC;

db<>fiddle



Related Topics



Leave a reply



Submit