Detect SQL Island Over Multiple Parameters and Conditions

Answer for updated question

SELECT *
FROM  (
   SELECT *
        , lag(val, 1, 0) OVER (PARTITION BY status ORDER BY id) last_val
        , lag(status)    OVER (PARTITION BY val ORDER BY id) last_status
   FROM   t1
   ) x
WHERE  status = 1
AND    (last_val <> val OR last_status = 0)

How?

Same as before, but this time combine two window functions. Switching on a device qualifies if:

1. The last device switched on was a different one.

2. Or the same device was switched off in its last entry. The corner case with NULL for the first row of the partition is irrelevant, because that row already qualifies under 1.
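To trace the two rules by hand, here is a minimal Python sketch of the same logic. It is not part of the original answer: the sample rows are hypothetical and the helper name qualifying_ids is made up for illustration. Each lag() is emulated with a dictionary keyed by the partition column.

```python
# Hypothetical rows (id, val, status); status 1 = switched on, 0 = switched off.
rows = [
    (1, 10, 1),  # first ON ever                         -> qualifies (rule 1)
    (2, 10, 1),  # same device, still on                 -> rejected
    (3, 10, 0),  # OFF rows never qualify in this query
    (4, 10, 1),  # same device, but it was switched off  -> qualifies (rule 2)
    (5, 20, 1),  # different device                      -> qualifies (rule 1)
    (6, 20, 1),  # same device, still on                 -> rejected
]

def qualifying_ids(rows):
    last_val_by_status = {}  # emulates lag(val, 1, 0) OVER (PARTITION BY status ORDER BY id)
    last_status_by_val = {}  # emulates lag(status) OVER (PARTITION BY val ORDER BY id)
    result = []
    for id_, val, status in sorted(rows):
        last_val = last_val_by_status.get(status, 0)  # default 0, as in the query
        last_status = last_status_by_val.get(val)     # None plays the role of NULL
        if status == 1 and (last_val != val or last_status == 0):
            result.append(id_)
        last_val_by_status[status] = val
        last_status_by_val[val] = status
    return result

print(qualifying_ids(rows))  # [1, 4, 5]
```

Row 4 is the interesting one: rule 1 alone would reject it, because the last device switched on was the same one.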


Answer for original version of question.

If I understand your task correctly, this simple query does the job:

SELECT *
FROM  (
   SELECT *
        , lag(val, 1, 0) OVER (ORDER BY id) last_on
   FROM   t1
   WHERE  status = 1
   ) x
WHERE  last_on <> val

Returns rows 1, 3, 6, 7 as requested.

How?

The subquery ignores all switch-off entries, since those are just noise according to your description. That leaves the entries where a device is switched on. Among those, an entry is disqualified only if the same device was already on (it was also the last entry switching on). Use the window function lag() for that. I provide 0 as the default to cover the special case of the first row, assuming that there is no device with val = 0.

If there is, pick another impossible number.

If no number is impossible, leave the special case as NULL with lag(val) OVER ... and in the outer query check with:

WHERE last_on IS DISTINCT FROM val
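As a rough illustration of the NULL-safe variant, here is a small Python sketch (hypothetical rows, made-up helper name first_switch_on). Python's None stands in for NULL, and the != operator already behaves like IS DISTINCT FROM here, because None != val is simply True rather than unknown.

```python
# Hypothetical rows (id, val, status); status 1 = on, 0 = off.
rows = [(1, 10, 1), (2, 10, 0), (3, 20, 1), (4, 20, 1), (5, 10, 1)]

def first_switch_on(rows):
    result = []
    last_on = None  # no previous ON row yet, like lag(val) returning NULL
    for id_, val, status in sorted(rows):
        if status != 1:
            continue        # the subquery's WHERE status = 1
        if last_on != val:  # None != val is True, like IS DISTINCT FROM
            result.append(id_)
        last_on = val
    return result

print(first_switch_on(rows))  # [1, 3, 5]
```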

statistics of sessions with gap-and-island problem

select session_id
      ,timestamp
      ,user_id
      ,start_time
      ,count(diff) over()/2 as number_of_session_with_problem
from   (
       select *
             ,case when timestamp - lag(timestamp) over(partition by session_id order by timestamp) > '00:05:00.000' then 1
                   when lead(timestamp) over(partition by session_id order by timestamp) - timestamp > '00:05:00.000' then 1
              end as diff
       from   raw_data join labels using(session_id)
       ) t
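where diff = 1

+------------+-------------------------+---------+---------------------+--------------------------------+
| session_id | timestamp               | user_id | start_time          | number_of_session_with_problem |
+------------+-------------------------+---------+---------------------+--------------------------------+
| 656        | 2016-04-01 00:16:19.687 | 9       | 2016-04-01 00:03:39 | 2                              |
| 656        | 2016-04-01 00:24:20.691 | 9       | 2016-04-01 00:03:39 | 2                              |
| 657        | 2016-04-01 00:26:51.096 | 9       | 2016-04-01 00:26:51 | 2                              |
| 657        | 2016-04-01 00:37:54.092 | 9       | 2016-04-01 00:26:51 | 2                              |
+------------+-------------------------+---------+---------------------+--------------------------------+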

Gaps And Islands: Splitting Islands Based On External Table

The straightforward method is to fetch the effective price for each row of History and then generate gaps and islands taking price into account.

It is not clear from the question what role DestinationID plays, and the sample data is of no help here.
I'll assume that we need to join and partition on both ProductID and DestinationID.

The following query returns the effective Price for each row from History.
You need to add an index to the PriceChange table

CREATE NONCLUSTERED INDEX [IX] ON [dbo].[PriceChange]
(
    [ProductId] ASC,
    [DestinationId] ASC,
    [EffectiveDate] DESC
)
INCLUDE ([Price])

for this query to work efficiently.

Query for Prices

SELECT
    History.ProductId
    ,History.DestinationId
    ,History.ScheduledDate
    ,History.Quantity
    ,A.Price
FROM
    History
    OUTER APPLY
    (
        SELECT TOP(1)
            PriceChange.Price
        FROM
            PriceChange
        WHERE
            PriceChange.ProductID = History.ProductID
            AND PriceChange.DestinationId = History.DestinationId
            AND PriceChange.EffectiveDate <= History.ScheduledDate
        ORDER BY
            PriceChange.EffectiveDate DESC
    ) AS A
ORDER BY ProductID, ScheduledDate;

For each row from History there will be one seek in this index to pick the correct price.
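To make the "one seek per row" idea concrete, here is a minimal Python sketch of the same lookup for a single (ProductId, DestinationId) pair. The PriceChange rows are assumed for illustration (the question's actual PriceChange data is not shown here), and effective_price is a made-up name. A binary search over the sorted effective dates plays the role of the index seek.

```python
from bisect import bisect_right
from datetime import date

# Assumed PriceChange rows for one (ProductId, DestinationId) pair,
# sorted by EffectiveDate. These values are illustrative only.
effective_dates = [date(2018, 1, 1), date(2018, 4, 2)]
prices = [1, 2]

def effective_price(scheduled_date):
    # Index of the last EffectiveDate <= scheduled_date, i.e. the row that
    # TOP(1) ... ORDER BY EffectiveDate DESC would pick.
    i = bisect_right(effective_dates, scheduled_date) - 1
    return prices[i] if i >= 0 else None  # None mirrors OUTER APPLY's NULL

print(effective_price(date(2018, 4, 1)))    # 1
print(effective_price(date(2018, 4, 2)))    # 2
print(effective_price(date(2017, 12, 31)))  # None
```

The last call shows why OUTER APPLY rather than CROSS APPLY matters: a History row with no earlier price change is kept, with a NULL price.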

This query returns:

Prices

+-----------+---------------+---------------+----------+-------+
| ProductId | DestinationId | ScheduledDate | Quantity | Price |
+-----------+---------------+---------------+----------+-------+
| 0 | 1000 | 2018-04-01 | 5 | 1 |
| 0 | 1000 | 2018-04-02 | 10 | 2 |
| 0 | 1000 | 2018-04-03 | 7 | 2 |
| 3 | 5000 | 2018-05-07 | 15 | 5 |
| 3 | 5000 | 2018-05-08 | 23 | 5 |
| 3 | 5000 | 2018-05-09 | 52 | 5 |
| 3 | 5000 | 2018-05-10 | 12 | 20 |
| 3 | 5000 | 2018-05-11 | 14 | 20 |
+-----------+---------------+---------------+----------+-------+

Now comes a standard gaps-and-islands step to collapse consecutive days with the same price together. I use a difference of two row number sequences here.

I've added some more rows to your sample data to see the gaps within the same ProductId.

INSERT INTO History (ProductId, DestinationId, ScheduledDate, Quantity)
VALUES
(0, 1000, '20180601', 5),
(0, 1000, '20180602', 10),
(0, 1000, '20180603', 7),
(3, 5000, '20180607', 15),
(3, 5000, '20180608', 23),
(3, 5000, '20180609', 52),
(3, 5000, '20180610', 12),
(3, 5000, '20180611', 14);

If you run this intermediate query you'll see how it works:

WITH
CTE_Prices
AS
(
    SELECT
        History.ProductId
        ,History.DestinationId
        ,History.ScheduledDate
        ,History.Quantity
        ,A.Price
    FROM
        History
        OUTER APPLY
        (
            SELECT TOP(1)
                PriceChange.Price
            FROM
                PriceChange
            WHERE
                PriceChange.ProductID = History.ProductID
                AND PriceChange.DestinationId = History.DestinationId
                AND PriceChange.EffectiveDate <= History.ScheduledDate
            ORDER BY
                PriceChange.EffectiveDate DESC
        ) AS A
)
,CTE_rn
AS
(
    SELECT
        ProductId
        ,DestinationId
        ,ScheduledDate
        ,Quantity
        ,Price
        ,ROW_NUMBER() OVER (PARTITION BY ProductId, DestinationId, Price ORDER BY ScheduledDate) AS rn1
        ,DATEDIFF(day, '20000101', ScheduledDate) AS rn2
    FROM
        CTE_Prices
)
SELECT *
    ,rn2-rn1 AS Diff
FROM CTE_rn

Intermediate result

+-----------+---------------+---------------+----------+-------+-----+------+------+
| ProductId | DestinationId | ScheduledDate | Quantity | Price | rn1 | rn2 | Diff |
+-----------+---------------+---------------+----------+-------+-----+------+------+
| 0 | 1000 | 2018-04-01 | 5 | 1 | 1 | 6665 | 6664 |
| 0 | 1000 | 2018-04-02 | 10 | 2 | 1 | 6666 | 6665 |
| 0 | 1000 | 2018-04-03 | 7 | 2 | 2 | 6667 | 6665 |
| 0 | 1000 | 2018-06-01 | 5 | 2 | 3 | 6726 | 6723 |
| 0 | 1000 | 2018-06-02 | 10 | 2 | 4 | 6727 | 6723 |
| 0 | 1000 | 2018-06-03 | 7 | 2 | 5 | 6728 | 6723 |
| 3 | 5000 | 2018-05-07 | 15 | 5 | 1 | 6701 | 6700 |
| 3 | 5000 | 2018-05-08 | 23 | 5 | 2 | 6702 | 6700 |
| 3 | 5000 | 2018-05-09 | 52 | 5 | 3 | 6703 | 6700 |
| 3 | 5000 | 2018-05-10 | 12 | 20 | 1 | 6704 | 6703 |
| 3 | 5000 | 2018-05-11 | 14 | 20 | 2 | 6705 | 6703 |
| 3 | 5000 | 2018-06-07 | 15 | 20 | 3 | 6732 | 6729 |
| 3 | 5000 | 2018-06-08 | 23 | 20 | 4 | 6733 | 6729 |
| 3 | 5000 | 2018-06-09 | 52 | 20 | 5 | 6734 | 6729 |
| 3 | 5000 | 2018-06-10 | 12 | 20 | 6 | 6735 | 6729 |
| 3 | 5000 | 2018-06-11 | 14 | 20 | 7 | 6736 | 6729 |
+-----------+---------------+---------------+----------+-------+-----+------+------+

Now simply group by the Diff to get one row per interval.

Final query

WITH
CTE_Prices
AS
(
    SELECT
        History.ProductId
        ,History.DestinationId
        ,History.ScheduledDate
        ,History.Quantity
        ,A.Price
    FROM
        History
        OUTER APPLY
        (
            SELECT TOP(1)
                PriceChange.Price
            FROM
                PriceChange
            WHERE
                PriceChange.ProductID = History.ProductID
                AND PriceChange.DestinationId = History.DestinationId
                AND PriceChange.EffectiveDate <= History.ScheduledDate
            ORDER BY
                PriceChange.EffectiveDate DESC
        ) AS A
)
,CTE_rn
AS
(
    SELECT
        ProductId
        ,DestinationId
        ,ScheduledDate
        ,Quantity
        ,Price
        ,ROW_NUMBER() OVER (PARTITION BY ProductId, DestinationId, Price ORDER BY ScheduledDate) AS rn1
        ,DATEDIFF(day, '20000101', ScheduledDate) AS rn2
    FROM
        CTE_Prices
)
SELECT
    ProductId
    ,DestinationId
    ,MIN(ScheduledDate) AS StartDate
    ,MAX(ScheduledDate) AS EndDate
    ,SUM(Quantity) AS TotalQuantity
    ,Price
FROM
    CTE_rn
GROUP BY
    ProductId
    ,DestinationId
    ,Price
    ,rn2-rn1
ORDER BY
    ProductID
    ,DestinationId
    ,StartDate
;

Final result

+-----------+---------------+------------+------------+---------------+-------+
| ProductId | DestinationId | StartDate | EndDate | TotalQuantity | Price |
+-----------+---------------+------------+------------+---------------+-------+
| 0 | 1000 | 2018-04-01 | 2018-04-01 | 5 | 1 |
| 0 | 1000 | 2018-04-02 | 2018-04-03 | 17 | 2 |
| 0 | 1000 | 2018-06-01 | 2018-06-03 | 22 | 2 |
| 3 | 5000 | 2018-05-07 | 2018-05-09 | 90 | 5 |
| 3 | 5000 | 2018-05-10 | 2018-05-11 | 26 | 20 |
| 3 | 5000 | 2018-06-07 | 2018-06-11 | 116 | 20 |
+-----------+---------------+------------+------------+---------------+-------+
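For readers who want to step through the row-number difference outside the database, here is a minimal Python sketch of the same grouping. The input rows are copied from the intermediate result above (with the price already resolved); the function name collapse is made up for illustration. Within one (ProductId, DestinationId, Price) partition, the day number minus the running row number stays constant exactly as long as the days are consecutive, so it can serve as the island key.

```python
from collections import defaultdict
from datetime import date

# (ProductId, DestinationId, ScheduledDate, Quantity, Price), copied from
# the intermediate result above.
rows = [
    (0, 1000, date(2018, 4, 1), 5, 1),
    (0, 1000, date(2018, 4, 2), 10, 2),
    (0, 1000, date(2018, 4, 3), 7, 2),
    (0, 1000, date(2018, 6, 1), 5, 2),
    (0, 1000, date(2018, 6, 2), 10, 2),
    (0, 1000, date(2018, 6, 3), 7, 2),
    (3, 5000, date(2018, 5, 7), 15, 5),
    (3, 5000, date(2018, 5, 8), 23, 5),
    (3, 5000, date(2018, 5, 9), 52, 5),
    (3, 5000, date(2018, 5, 10), 12, 20),
    (3, 5000, date(2018, 5, 11), 14, 20),
    (3, 5000, date(2018, 6, 7), 15, 20),
    (3, 5000, date(2018, 6, 8), 23, 20),
    (3, 5000, date(2018, 6, 9), 52, 20),
    (3, 5000, date(2018, 6, 10), 12, 20),
    (3, 5000, date(2018, 6, 11), 14, 20),
]

def collapse(rows):
    epoch = date(2000, 1, 1)
    rn1 = defaultdict(int)  # running ROW_NUMBER per (product, destination, price)
    islands = {}            # island key -> [start, end, total quantity]
    for product, dest, day, qty, price in sorted(rows):
        part = (product, dest, price)
        rn1[part] += 1
        rn2 = (day - epoch).days             # DATEDIFF(day, '20000101', ScheduledDate)
        key = part + (rn2 - rn1[part],)      # constant within one island
        if key in islands:
            islands[key][1] = day            # rows arrive in date order
            islands[key][2] += qty
        else:
            islands[key] = [day, day, qty]
    # Output columns: ProductId, DestinationId, StartDate, EndDate, TotalQuantity, Price
    return sorted(
        (k[0], k[1], start, end, total, k[2])
        for k, (start, end, total) in islands.items()
    )

for island in collapse(rows):
    print(island)
```

Running it yields six tuples that correspond to the six rows of the final result.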

In the same column, compare each value with previous multiple values with condition

This would have been a good spot to use window functions with a date range specification. Alas, SQL Server does not support that (yet?).

The simplest approach might be exists and a correlated subquery:

select t.*
from mytable t
where exists (
    select 1
    from mytable t1
    where
        t1.visit_id = t.visit_id
        and t1.collection_time >= dateadd(day, -2, t.collection_time)
        and t1.collection_time < t.collection_time
        and t1.value < t.value - 3
)

Or you can use cross apply:

select t.*
from mytable t
cross apply (
    select min(t1.value) as min_value
    from mytable t1
    where
        t1.visit_id = t.visit_id
        and t1.collection_time >= dateadd(day, -2, t.collection_time)
        and t1.collection_time < t.collection_time
) t1
where t1.min_value < t.value - 3
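Both queries apply the same predicate: a row qualifies if some earlier row of the same visit, at most two days older, has a value more than 3 below it. A minimal Python sketch of that predicate follows, using hypothetical rows and a made-up helper name (rows_with_jump). It is quadratic, like the naive reading of the correlated subquery, and only meant to make the condition easy to check.

```python
from datetime import datetime, timedelta

# Hypothetical rows: (visit_id, collection_time, value).
rows = [
    (1, datetime(2021, 1, 1, 8), 10),
    (1, datetime(2021, 1, 2, 8), 14),  # 10 < 14 - 3, one day earlier -> match
    (1, datetime(2021, 1, 5, 8), 20),  # earlier rows are more than 2 days old
    (2, datetime(2021, 1, 1, 8), 5),
    (2, datetime(2021, 1, 2, 8), 7),   # 5 is not < 7 - 3
]

def rows_with_jump(rows, window=timedelta(days=2), delta=3):
    out = []
    for visit, when, value in rows:
        # Same conditions as the WHERE clause of the correlated subquery.
        if any(
            v == visit and when - window <= t < when and x < value - delta
            for v, t, x in rows
        ):
            out.append((visit, when, value))
    return out

print(rows_with_jump(rows))
```

Only the second row of visit 1 is returned.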

Gap-and-island for more than time threshold

The problem is that both tables have a user_id column, so an unqualified user_id is ambiguous. You need to say explicitly which one you want by qualifying it with the table name (and aliasing them if you select both).

select session_id
      ,timestamp
      ,user_id
      ,start_time
      ,count(diff) over() as number_of_sessions_with_problem
from   (
       select session_id
             ,timestamp
             ,labels.user_id
             ,start_time
             ,case when lead(timestamp) over(partition by session_id order by timestamp) - timestamp > '00:05:00.000' then 1 end as diff
       from   raw_data join labels using(session_id)
       ) t
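where diff = 1

+------------+-------------------------+---------+---------------------+---------------------------------+
| session_id | timestamp               | user_id | start_time          | number_of_sessions_with_problem |
+------------+-------------------------+---------+---------------------+---------------------------------+
| 656        | 2016-04-01 00:16:19.687 | 9       | 2016-04-01 00:03:39 | 2                               |
| 657        | 2016-04-01 00:26:51.096 | 9       | 2016-04-01 00:26:51 | 2                               |
+------------+-------------------------+---------+---------------------+---------------------------------+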
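The lead() comparison can be sketched in Python as follows. The events are hypothetical and the helper name rows_before_gap is made up. Pairing each timestamp with its successor inside one session is what lead(timestamp) over(partition by session_id order by timestamp) does.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Hypothetical (session_id, timestamp) events.
events = [
    (1, datetime(2016, 4, 1, 0, 0, 0)),
    (1, datetime(2016, 4, 1, 0, 2, 0)),
    (1, datetime(2016, 4, 1, 0, 9, 0)),  # 7 minutes after the previous event
    (2, datetime(2016, 4, 1, 0, 0, 0)),
    (2, datetime(2016, 4, 1, 0, 4, 0)),  # 4 minutes: below the threshold
]

def rows_before_gap(events, threshold=timedelta(minutes=5)):
    flagged = []
    # groupby needs its input sorted by the grouping key (session_id).
    for session_id, group in groupby(sorted(events), key=lambda e: e[0]):
        stamps = [ts for _, ts in group]
        # zip pairs each timestamp with its successor, like lead(timestamp).
        for current, nxt in zip(stamps, stamps[1:]):
            if nxt - current > threshold:
                flagged.append((session_id, current))
    return flagged

print(rows_before_gap(events))
```

Only the 00:02:00 event of session 1 is flagged, because it is the last row before a gap of more than five minutes. The last row of each session has no successor and is never flagged, just as lead() returns NULL there.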

Jump SQL gap over specific condition & proper lead() usage

Query with window functions

SELECT *
FROM  (
   SELECT *
        , lag(val, 1, 0)    OVER (PARTITION BY status ORDER BY id) AS last_val
        , lag(status, 1, 0) OVER w2 AS last_status
        , lag(next_id)      OVER w2 AS next_id_of_last_status
   FROM  (
      SELECT *, lead(id) OVER (PARTITION BY status ORDER BY id) AS next_id
      FROM   t1
      ) AS t
   WINDOW w2 AS (PARTITION BY val ORDER BY id)
   ) x
WHERE (last_val <> val OR last_status <> status)
AND   (status = 1
       OR last_status = 1
          AND ((next_id_of_last_status > id) OR next_id_of_last_status IS NULL)
      )
ORDER  BY id;

In addition to what we already had, we need valid OFF switches.

An OFF switch is valid if the device was switched ON before (last_status = 1) and the next ON operation after that comes after the OFF switch in question (next_id_of_last_status > id).

We have to provide for the special case that this was the last ON operation, so we check for NULL in addition (OR next_id_of_last_status IS NULL).

The next_id_of_last_status comes from the same window that we take last_status from. Therefore I introduced additional syntax for explicit window declaration, so I don't have to repeat myself:

WINDOW w2 AS (PARTITION BY val ORDER BY id)

And we need to get the next id for the last status in a subquery earlier (subquery t).

If you've understood all that, you shouldn't have a problem slapping lead() on top of this query to get to your final destination. :)

PL/pgSQL function

Once it gets this complex, it's time to switch to procedural processing.

This comparatively simple plpgsql function nukes the performance of the complex window function query, for the simple reason that it has to scan the whole table only once.

CREATE OR REPLACE FUNCTION valid_t1 (OUT t t1)  -- row variable of table type
  RETURNS SETOF t1
  LANGUAGE plpgsql AS
$func$
DECLARE
   _last_on int := -1;  -- init with impossible value
BEGIN
   FOR t IN
      SELECT * FROM t1 ORDER BY id
   LOOP
      IF t.status = 1 THEN
         IF _last_on <> t.val THEN
            RETURN NEXT;
            _last_on := t.val;
         END IF;
      ELSE
         IF _last_on = t.val THEN
            RETURN NEXT;
            _last_on := -1;
         END IF;
      END IF;
   END LOOP;
END
$func$;

Call:

SELECT * FROM valid_t1();

SQL - LAG to get previous value if condition using multiple previous columns satisfied

I see no answer here that uses window functions and a single scan of the table. We can do this query in a single scan as follows:

Let us assume you have the AwayTeam in another column.

If you don't have this yet and want to parse it out of EventData, you could use:

SUBSTRING(EventData, CHARINDEX(' vs ', EventData) + 4, LEN(EventData))

I urge you to follow proper normalization and create this as a proper column in your table.

Our algorithm runs like this:

  1. Multiply out (unpivot) the two teams as separate rows, using CROSS APPLY
  2. Calculate the previous Metrics using LAG, partitioning by the merged Team column
  3. Filter back down the doubled up rows, so that we only get a single row for each of our original ones

SELECT id, HomeTeam, AwayTeam, Metric, Prev1, Prev2, Prev3
FROM (
    SELECT *
        ,Prev1 = LAG(Metric, 1) OVER (PARTITION BY v.Team ORDER BY id)
        ,Prev2 = LAG(Metric, 2) OVER (PARTITION BY v.Team ORDER BY id)
        ,Prev3 = LAG(Metric, 3) OVER (PARTITION BY v.Team ORDER BY id)
        -- more of these ......
    FROM test_table
    CROSS APPLY (VALUES (HomeTeam, 1),(AwayTeam, 0)) AS v(Team,IsHome)
) AS t
WHERE IsHome = 1
-- ORDER BY id --if necessary

Importantly, we can do this without the use of multiple different sorts, partitions or ordering, and without the use of a self-join. Just a single scan.

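Result:

+----+----------+----------+--------+--------+--------+--------+
| id | HomeTeam | AwayTeam | Metric | Prev1  | Prev2  | Prev3  |
+----+----------+----------+--------+--------+--------+--------+
| 1  | Team A   | Team B   | 5      | (null) | (null) | (null) |
| 2  | Team A   | Team B   | 7      | 5      | (null) | (null) |
| 3  | Team C   | Team D   | 6      | (null) | (null) | (null) |
| 4  | Team Z   | Team A   | 8      | (null) | (null) | (null) |
| 5  | Team A   | Team B   | 9      | 8      | 7      | 5      |
| 6  | Team C   | Team D   | 3      | 6      | (null) | (null) |
| 7  | Team C   | Team D   | 1      | 3      | 6      | (null) |
| 8  | Team E   | Team F   | 2      | (null) | (null) | (null) |
+----+----------+----------+--------+--------+--------+--------+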
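The unpivot-then-LAG idea can be followed step by step in a minimal Python sketch. The games are taken from the result table above; the helper name previous_metrics is made up for illustration. Each game is recorded in the history of both teams (the unpivot), but only the home team's history is read back (the IsHome = 1 filter).

```python
from collections import defaultdict

# (id, HomeTeam, AwayTeam, Metric), taken from the result table above.
games = [
    (1, "Team A", "Team B", 5),
    (2, "Team A", "Team B", 7),
    (3, "Team C", "Team D", 6),
    (4, "Team Z", "Team A", 8),
    (5, "Team A", "Team B", 9),
    (6, "Team C", "Team D", 3),
    (7, "Team C", "Team D", 1),
    (8, "Team E", "Team F", 2),
]

def previous_metrics(games, depth=3):
    history = defaultdict(list)  # team -> metrics of its earlier games
    out = []
    for id_, home, away, metric in sorted(games):
        past = history[home]
        # Most recent first; pad with None where LAG would return NULL.
        prev = (past[::-1] + [None] * depth)[:depth]
        out.append((id_, home, away, metric, *prev))
        # Record the game for both teams (the unpivot step).
        for team in {home, away}:
            history[team].append(metric)
    return out

for row in previous_metrics(games):
    print(row)
```

Row 5 shows why the away rows are needed: Team A's Prev1 of 8 comes from game 4, where it played away.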

group consecutive time intervals in sql

You can use variables just fine in PL/pgSQL.

I would solve this with a table function.

Assuming the table is called stock, my code would look like this:

CREATE OR REPLACE FUNCTION combine_periods() RETURNS SETOF stock
   LANGUAGE plpgsql STABLE AS
$$DECLARE
   s stock;
   period stock;
BEGIN
   FOR s IN
      SELECT stock_name, action, start_date, end_date
      FROM stock
      ORDER BY stock_name, action, start_date
   LOOP
      /* is this a new period? */
      IF period IS NOT NULL AND
         (period.stock_name <> s.stock_name
            OR period.action <> s.action
            OR period.end_date <> s.start_date)
      THEN
         /* new period, output last period */
         RETURN NEXT period;
         period := NULL;
      ELSE
         IF period IS NOT NULL
         THEN
            /* period continues, update end_date */
            period.end_date := s.end_date;
         END IF;
      END IF;

      /* remember the beginning of a new period */
      IF period IS NULL
      THEN
         period := s;
      END IF;
   END LOOP;

   /* output the last period */
   IF period IS NOT NULL
   THEN
      RETURN NEXT period;
   END IF;

   RETURN;
END;$$;

And I would call it like this:

test=> SELECT * FROM combine_periods();
┌────────────┬─────────┬────────────┬──────────┐
│ stock_name │ action │ start_date │ end_date │
├────────────┼─────────┼────────────┼──────────┤
│ google │ falling │ 3 │ 4 │
│ google │ growing │ 1 │ 3 │
│ google │ growing │ 4 │ 5 │
│ yahoo │ growing │ 1 │ 2 │
└────────────┴─────────┴────────────┴──────────┘
(4 rows)
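The same single-pass merge can be sketched in Python. The input rows below are hypothetical (the question's actual stock table is not shown here); they were chosen so that the merged periods match the output above. The function name mirrors the PL/pgSQL one only for readability.

```python
# Hypothetical input rows: (stock_name, action, start_date, end_date).
stock = [
    ("google", "growing", 1, 2),
    ("google", "growing", 2, 3),
    ("google", "falling", 3, 4),
    ("google", "growing", 4, 5),
    ("yahoo", "growing", 1, 2),
]

def combine_periods(rows):
    merged = []
    # Same ordering as the function: stock_name, action, start_date.
    for name, action, start, end in sorted(rows):
        last = merged[-1] if merged else None
        if last and last[0] == name and last[1] == action and last[3] == start:
            # The row starts where the open period ends: extend it.
            merged[-1] = (name, action, last[2], end)
        else:
            # Different stock or action, or a gap: start a new period.
            merged.append((name, action, start, end))
    return merged

for period in combine_periods(stock):
    print(period)
```

As in the function, a period is only extended when the next row starts exactly where the current period ends.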

