Detect SQL Island Over Multiple Parameters and Conditions

Answer for updated question

SELECT *
FROM  (
   SELECT *
         ,lag(val, 1, 0) OVER (PARTITION BY status ORDER BY id) last_val
         ,lag(status) OVER (PARTITION BY val ORDER BY id) last_status
   FROM   t1
   ) x
WHERE  status = 1
AND    (last_val <> val OR last_status = 0)

How?

Same as before (see the answer for the original version below), but this time combine two window functions. Switching on a device qualifies if:

1. the last device switched on was a different one.

2. or the same device has been switched off in its last entry. The corner case with NULL for the first row of the partition is irrelevant, because then the row already qualified in 1.
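
To make the two conditions concrete, here is a minimal Python sketch of the same filter. The sample rows and the function name qualifying_ids are invented for illustration (the original table is not shown here), and None stands in for SQL NULL.

```python
# Hypothetical sample rows: (id, val, status); status 1 = switched on, 0 = switched off.
rows = [
    (1, 10, 1),
    (2, 10, 1),  # same device is already on: should be skipped
    (3, 20, 1),
    (4, 10, 0),
    (5, 10, 1),  # last device switched on was 20: condition 1
    (6, 10, 0),
    (7, 10, 1),  # last ON was also 10, but it was switched off since: condition 2
]

def qualifying_ids(rows):
    last_val_by_status = {}  # emulates lag(val, 1, 0) OVER (PARTITION BY status ORDER BY id)
    last_status_by_val = {}  # emulates lag(status)    OVER (PARTITION BY val ORDER BY id)
    result = []
    for id_, val, status in sorted(rows):
        last_val = last_val_by_status.get(status, 0)  # 0 is the lag() default
        last_status = last_status_by_val.get(val)     # None plays the role of NULL
        if status == 1 and (last_val != val or last_status == 0):
            result.append(id_)
        last_val_by_status[status] = val
        last_status_by_val[val] = status
    return result

print(qualifying_ids(rows))  # [1, 3, 5, 7]
```

Each dictionary holds the most recent value per partition, which is all that lag() with offset 1 needs.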

Answer for original version of question.

If I understand your task correctly, this simple query does the job:

SELECT *
FROM  (
   SELECT *
         ,lag(val, 1, 0) OVER (ORDER BY id) last_on
   FROM   t1
   WHERE  status = 1
   ) x
WHERE  last_on <> val

Returns rows 1, 3, 6, 7 as requested.

How?

The subquery ignores all switch-off entries, as those are just noise according to your description. That leaves the entries where a device is switched on. Among those, an entry is disqualified only if the same device was already on, i.e. the previous switch-on entry was for the same device. The window function lag() provides that previous value. I pass 0 as the default to cover the special case of the first row, assuming that there is no device with val = 0.

If there is, pick another impossible number.

If no number is impossible, leave the special case as NULL with lag(val) OVER ... and in the outer query check with:

WHERE last_on IS DISTINCT FROM val
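
The NULL-safe variant can be sketched in Python as follows. The rows and the name first_switch_ons are hypothetical; None takes the place of NULL, and Python's != already treats None as different from any number, which is the behaviour IS DISTINCT FROM gives in SQL.

```python
# Hypothetical sample rows: (id, val, status); status 1 = switched on, 0 = switched off.
rows = [
    (1, 10, 1),
    (2, 10, 1),  # device 10 was the last one switched on: disqualified
    (3, 20, 1),
    (4, 20, 0),  # switch-off entries are ignored entirely
    (5, 20, 1),  # device 20 was still the last one switched on: disqualified
    (6, 10, 1),
]

def first_switch_ons(rows):
    result = []
    last_on = None  # no default value needed; None marks "no previous row"
    for id_, val, status in sorted(rows):
        if status != 1:
            continue  # WHERE status = 1 in the subquery
        if last_on != val:  # like: last_on IS DISTINCT FROM val
            result.append(id_)
        last_on = val
    return result

print(first_switch_ons(rows))  # [1, 3, 6]
```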

statistics of sessions with gap-and-island problem

select session_id     
      ,timestamp    
      ,user_id  
      ,start_time   
      ,count(diff) over()/2 as number_of_session_with_problem
from  (
       select *
              ,case when timestamp - lag(timestamp) over(partition by session_id order by timestamp) > '00:05:00.000' then 1
                    when lead(timestamp) over(partition by session_id order by timestamp) - timestamp > '00:05:00.000' then 1
               end as diff
       from   raw_data join labels using(session_id)
      ) t
where diff = 1

session_id	timestamp	user_id	start_time	number_of_session_with_problem
656	2016-04-01 00:16:19.687	9	2016-04-01 00:03:39	2
656	2016-04-01 00:24:20.691	9	2016-04-01 00:03:39	2
657	2016-04-01 00:26:51.096	9	2016-04-01 00:26:51	2
657	2016-04-01 00:37:54.092	9	2016-04-01 00:26:51	2
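
For readers who want to trace the lag()/lead() comparison by hand, here is a rough Python sketch of the same idea. The four flagged timestamps are taken from the result above; the remaining rows, the session 658, and the name rows_around_gaps are invented for illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby

THRESHOLD = timedelta(minutes=5)

# (session_id, timestamp). Rows marked "invented" are assumptions, not source data.
events = [
    (656, datetime(2016, 4, 1, 0, 14, 0)),           # invented
    (656, datetime(2016, 4, 1, 0, 16, 19, 687000)),
    (656, datetime(2016, 4, 1, 0, 24, 20, 691000)),
    (657, datetime(2016, 4, 1, 0, 26, 51, 96000)),
    (657, datetime(2016, 4, 1, 0, 37, 54, 92000)),
    (658, datetime(2016, 4, 1, 0, 40, 0)),           # invented
    (658, datetime(2016, 4, 1, 0, 42, 0)),           # invented
]

def rows_around_gaps(events):
    """Return rows that sit directly before or after a gap longer than THRESHOLD."""
    flagged = []
    for session_id, group in groupby(sorted(events), key=lambda e: e[0]):
        stamps = [ts for _, ts in group]
        for i, ts in enumerate(stamps):
            gap_before = i > 0 and ts - stamps[i - 1] > THRESHOLD               # lag()
            gap_after = i + 1 < len(stamps) and stamps[i + 1] - ts > THRESHOLD  # lead()
            if gap_before or gap_after:
                flagged.append((session_id, ts))
    return flagged

flagged = rows_around_gaps(events)
sessions_with_problem = len({session_id for session_id, _ in flagged})
print(len(flagged), sessions_with_problem)  # 4 2
```

The sketch counts distinct session ids instead of halving the row count. Halving gives the same number only as long as every affected session has exactly one gap.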

Gaps And Islands: Splitting Islands Based On External Table

The straightforward method is to fetch the effective price for each row of History and then generate gaps and islands taking price into account.

It is not clear from the question what the role of DestinationID is. Sample data is of no help here.
I'll assume that we need to join and partition on both ProductID and DestinationID.

The following query returns effective Price for each row from History.
You need to add an index to the PriceChange table

CREATE NONCLUSTERED INDEX [IX] ON [dbo].[PriceChange]
(
    [ProductId] ASC,
    [DestinationId] ASC,
    [EffectiveDate] DESC
)
INCLUDE ([Price])

for this query to work efficiently.

Query for Prices

SELECT
    History.ProductId
    ,History.DestinationId
    ,History.ScheduledDate
    ,History.Quantity
    ,A.Price
FROM
    History
    OUTER APPLY
    (
        SELECT TOP(1)
            PriceChange.Price
        FROM
            PriceChange
        WHERE
            PriceChange.ProductID = History.ProductID
            AND PriceChange.DestinationId = History.DestinationId
            AND PriceChange.EffectiveDate <= History.ScheduledDate
        ORDER BY
            PriceChange.EffectiveDate DESC
    ) AS A
ORDER BY ProductID, ScheduledDate;

For each row from History there will be one seek in this index to pick the correct price.

This query returns:

Prices

+-----------+---------------+---------------+----------+-------+
| ProductId | DestinationId | ScheduledDate | Quantity | Price |
+-----------+---------------+---------------+----------+-------+
|         0 |          1000 | 2018-04-01    |        5 |     1 |
|         0 |          1000 | 2018-04-02    |       10 |     2 |
|         0 |          1000 | 2018-04-03    |        7 |     2 |
|         3 |          5000 | 2018-05-07    |       15 |     5 |
|         3 |          5000 | 2018-05-08    |       23 |     5 |
|         3 |          5000 | 2018-05-09    |       52 |     5 |
|         3 |          5000 | 2018-05-10    |       12 |    20 |
|         3 |          5000 | 2018-05-11    |       14 |    20 |
+-----------+---------------+---------------+----------+-------+
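
The "latest price change on or before the scheduled date" lookup can be sketched in Python with a binary search, which is roughly what one index seek per History row amounts to. The PriceChange rows below are hypothetical: the question's actual data is not reproduced here, so the effective dates were chosen only to be consistent with the result table above. The name effective_price is also an assumption.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical PriceChange data: (product_id, destination_id) -> list of
# (effective_date, price), sorted by effective_date ascending.
price_changes = {
    (0, 1000): [(date(2018, 3, 1), 1), (date(2018, 4, 2), 2)],
    (3, 5000): [(date(2018, 5, 1), 5), (date(2018, 5, 10), 20)],
}

def effective_price(product_id, destination_id, scheduled_date):
    """Price of the most recent change with effective_date <= scheduled_date."""
    changes = price_changes.get((product_id, destination_id), [])
    dates = [effective for effective, _ in changes]
    pos = bisect_right(dates, scheduled_date)  # number of changes on or before the date
    if pos == 0:
        return None  # no matching row: OUTER APPLY would yield NULL here
    return changes[pos - 1][1]

print(effective_price(0, 1000, date(2018, 4, 1)))   # 1
print(effective_price(3, 5000, date(2018, 5, 10)))  # 20
```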

Now a standard gaps-and-islands step to collapse consecutive days with the same price together. I use a difference of two row number sequences here.

I've added some more rows to your sample data to see the gaps within the same ProductId.

INSERT INTO History (ProductId, DestinationId, ScheduledDate, Quantity)
VALUES
  (0, 1000, '20180601', 5),
  (0, 1000, '20180602', 10),
  (0, 1000, '20180603', 7),
  (3, 5000, '20180607', 15),
  (3, 5000, '20180608', 23),
  (3, 5000, '20180609', 52),
  (3, 5000, '20180610', 12),
  (3, 5000, '20180611', 14);

If you run this intermediate query you'll see how it works:

WITH
CTE_Prices
AS
(
    SELECT
        History.ProductId
        ,History.DestinationId
        ,History.ScheduledDate
        ,History.Quantity
        ,A.Price
    FROM
        History
        OUTER APPLY
        (
            SELECT TOP(1)
                PriceChange.Price
            FROM
                PriceChange
            WHERE
                PriceChange.ProductID = History.ProductID
                AND PriceChange.DestinationId = History.DestinationId
                AND PriceChange.EffectiveDate <= History.ScheduledDate
            ORDER BY
                PriceChange.EffectiveDate DESC
        ) AS A
)
,CTE_rn
AS
(
    SELECT
        ProductId
        ,DestinationId
        ,ScheduledDate
        ,Quantity
        ,Price
        ,ROW_NUMBER() OVER (PARTITION BY ProductId, DestinationId, Price ORDER BY ScheduledDate) AS rn1
        ,DATEDIFF(day, '20000101', ScheduledDate) AS rn2
    FROM
        CTE_Prices
)
SELECT *
    ,rn2-rn1 AS Diff
FROM CTE_rn

Intermediate result

+-----------+---------------+---------------+----------+-------+-----+------+------+
| ProductId | DestinationId | ScheduledDate | Quantity | Price | rn1 | rn2  | Diff |
+-----------+---------------+---------------+----------+-------+-----+------+------+
|         0 |          1000 | 2018-04-01    |        5 |     1 |   1 | 6665 | 6664 |
|         0 |          1000 | 2018-04-02    |       10 |     2 |   1 | 6666 | 6665 |
|         0 |          1000 | 2018-04-03    |        7 |     2 |   2 | 6667 | 6665 |
|         0 |          1000 | 2018-06-01    |        5 |     2 |   3 | 6726 | 6723 |
|         0 |          1000 | 2018-06-02    |       10 |     2 |   4 | 6727 | 6723 |
|         0 |          1000 | 2018-06-03    |        7 |     2 |   5 | 6728 | 6723 |
|         3 |          5000 | 2018-05-07    |       15 |     5 |   1 | 6701 | 6700 |
|         3 |          5000 | 2018-05-08    |       23 |     5 |   2 | 6702 | 6700 |
|         3 |          5000 | 2018-05-09    |       52 |     5 |   3 | 6703 | 6700 |
|         3 |          5000 | 2018-05-10    |       12 |    20 |   1 | 6704 | 6703 |
|         3 |          5000 | 2018-05-11    |       14 |    20 |   2 | 6705 | 6703 |
|         3 |          5000 | 2018-06-07    |       15 |    20 |   3 | 6732 | 6729 |
|         3 |          5000 | 2018-06-08    |       23 |    20 |   4 | 6733 | 6729 |
|         3 |          5000 | 2018-06-09    |       52 |    20 |   5 | 6734 | 6729 |
|         3 |          5000 | 2018-06-10    |       12 |    20 |   6 | 6735 | 6729 |
|         3 |          5000 | 2018-06-11    |       14 |    20 |   7 | 6736 | 6729 |
+-----------+---------------+---------------+----------+-------+-----+------+------+

Now simply group by the Diff to get one row per interval.

Final query

WITH
CTE_Prices
AS
(
    SELECT
        History.ProductId
        ,History.DestinationId
        ,History.ScheduledDate
        ,History.Quantity
        ,A.Price
    FROM
        History
        OUTER APPLY
        (
            SELECT TOP(1)
                PriceChange.Price
            FROM
                PriceChange
            WHERE
                PriceChange.ProductID = History.ProductID
                AND PriceChange.DestinationId = History.DestinationId
                AND PriceChange.EffectiveDate <= History.ScheduledDate
            ORDER BY
                PriceChange.EffectiveDate DESC
        ) AS A
)
,CTE_rn
AS
(
    SELECT
        ProductId
        ,DestinationId
        ,ScheduledDate
        ,Quantity
        ,Price
        ,ROW_NUMBER() OVER (PARTITION BY ProductId, DestinationId, Price ORDER BY ScheduledDate) AS rn1
        ,DATEDIFF(day, '20000101', ScheduledDate) AS rn2
    FROM
        CTE_Prices
)
SELECT
    ProductId
    ,DestinationId
    ,MIN(ScheduledDate) AS StartDate
    ,MAX(ScheduledDate) AS EndDate
    ,SUM(Quantity) AS TotalQuantity
    ,Price
FROM
    CTE_rn
GROUP BY
    ProductId
    ,DestinationId
    ,Price
    ,rn2-rn1
ORDER BY
    ProductID
    ,DestinationId
    ,StartDate
;

Final result

+-----------+---------------+------------+------------+---------------+-------+
| ProductId | DestinationId | StartDate  |  EndDate   | TotalQuantity | Price |
+-----------+---------------+------------+------------+---------------+-------+
|         0 |          1000 | 2018-04-01 | 2018-04-01 |             5 |     1 |
|         0 |          1000 | 2018-04-02 | 2018-04-03 |            17 |     2 |
|         0 |          1000 | 2018-06-01 | 2018-06-03 |            22 |     2 |
|         3 |          5000 | 2018-05-07 | 2018-05-09 |            90 |     5 |
|         3 |          5000 | 2018-05-10 | 2018-05-11 |            26 |    20 |
|         3 |          5000 | 2018-06-07 | 2018-06-11 |           116 |    20 |
+-----------+---------------+------------+------------+---------------+-------+
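
The row-number-difference trick can also be checked outside the database. The following Python sketch starts from the priced rows shown in the intermediate result (so the price lookup is assumed to be done already) and reproduces the grouping; the name collapse_islands is made up for this example.

```python
from collections import defaultdict
from datetime import date

# (ProductId, DestinationId, ScheduledDate, Quantity, Price), as in the intermediate result.
priced_rows = [
    (0, 1000, date(2018, 4, 1), 5, 1),
    (0, 1000, date(2018, 4, 2), 10, 2),
    (0, 1000, date(2018, 4, 3), 7, 2),
    (0, 1000, date(2018, 6, 1), 5, 2),
    (0, 1000, date(2018, 6, 2), 10, 2),
    (0, 1000, date(2018, 6, 3), 7, 2),
    (3, 5000, date(2018, 5, 7), 15, 5),
    (3, 5000, date(2018, 5, 8), 23, 5),
    (3, 5000, date(2018, 5, 9), 52, 5),
    (3, 5000, date(2018, 5, 10), 12, 20),
    (3, 5000, date(2018, 5, 11), 14, 20),
    (3, 5000, date(2018, 6, 7), 15, 20),
    (3, 5000, date(2018, 6, 8), 23, 20),
    (3, 5000, date(2018, 6, 9), 52, 20),
    (3, 5000, date(2018, 6, 10), 12, 20),
    (3, 5000, date(2018, 6, 11), 14, 20),
]

def collapse_islands(rows):
    rn1 = defaultdict(int)  # ROW_NUMBER() per (product, destination, price)
    islands = {}            # island key -> (start, end, total quantity)
    for product, dest, day, qty, price in sorted(rows, key=lambda r: (r[0], r[1], r[4], r[2])):
        partition = (product, dest, price)
        rn1[partition] += 1
        rn2 = (day - date(2000, 1, 1)).days  # DATEDIFF(day, '20000101', ScheduledDate)
        key = partition + (rn2 - rn1[partition],)  # constant within a run of consecutive days
        start, end, total = islands.get(key, (day, day, 0))
        islands[key] = (min(start, day), max(end, day), total + qty)
    result = [(k[0], k[1], start, end, total, k[2]) for k, (start, end, total) in islands.items()]
    return sorted(result, key=lambda r: (r[0], r[1], r[2]))

for island in collapse_islands(priced_rows):
    print(island)
```

Within one partition, both sequences grow by one per consecutive day, so their difference stays constant until a day is skipped.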

In the same column, compare each value with previous multiple values with condition

This would have been a good spot to use window functions with a date range specification. Alas, SQL Server does not support that (yet?).

The simplest approach might be exists and a correlated subquery:

select t.*
from mytable t
where exists (
    select 1
    from mytable t1
    where 
        t1.visit_id = t.visit_id 
        and t1.collection_time >= dateadd(day, -2, t.collection_time)
        and t1.collection_time <  t.collection_time
        and t1.value < t.value - 3
)

Or you can use cross apply:

select t.*
from mytable t
cross apply (
    select min(t1.value) as min_value
    from mytable t1
    where 
        t1.visit_id = t.visit_id 
        and t1.collection_time >= dateadd(day, -2, t.collection_time)
        and t1.collection_time <  t.collection_time
) t1
where t1.min_value < t.value - 3
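
The look-back condition can be illustrated with a small Python sketch that follows the cross apply version: for each row, take the minimum value of the same visit over the preceding two days and compare. The sample rows and the name rises_over_recent_low are invented for illustration, since the question's data is not shown here.

```python
from datetime import datetime, timedelta

# Hypothetical rows: (visit_id, collection_time, value)
rows = [
    (1, datetime(2021, 1, 1, 8), 10),
    (1, datetime(2021, 1, 2, 8), 14),   # 10 is in the window and 10 < 14 - 3
    (1, datetime(2021, 1, 5, 8), 20),   # nothing in the preceding two days
    (1, datetime(2021, 1, 6, 8), 22),   # only 20 in the window, and 20 is not < 19
    (1, datetime(2021, 1, 6, 20), 25),  # 20 and 22 in the window, and 20 < 22
    (2, datetime(2021, 1, 1, 9), 1),
    (2, datetime(2021, 1, 2, 9), 3),    # 1 is not < 0
]

def rises_over_recent_low(rows, window=timedelta(days=2), margin=3):
    result = []
    for visit_id, ts, value in rows:
        # Same visit, within [ts - window, ts): lower bound inclusive, current row excluded.
        recent = [v for vid, t, v in rows if vid == visit_id and ts - window <= t < ts]
        if recent and min(recent) < value - margin:
            result.append((visit_id, ts, value))
    return result

for row in rises_over_recent_low(rows):
    print(row)
```

This is a quadratic scan, fine for a demonstration only. In the database, an index on (visit_id, collection_time) would be the usual way to keep the correlated lookup cheap.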

Gap-and-island for more than time threshold

The problem is that user_id exists in both tables, so an unqualified reference is ambiguous. You need to pick one explicitly by qualifying it with the table name (or an alias).

select  session_id    
       ,timestamp   
       ,user_id 
       ,start_time  
       ,count(diff) over() as number_of_sessions_with_problem
from   (
       select session_id     
             ,timestamp    
             ,labels.user_id  
             ,start_time   
             ,case when lead(timestamp) over(partition by session_id order by timestamp)-timestamp > '00:05:00.000' then 1 end as diff
       from   raw_data join labels using(session_id)
       ) t
where  diff = 1

session_id	timestamp	user_id	start_time	number_of_sessions_with_problem
656	2016-04-01 00:16:19.687	9	2016-04-01 00:03:39	2
657	2016-04-01 00:26:51.096	9	2016-04-01 00:26:51	2

Jump SQL gap over specific condition & proper lead() usage

Query with window functions

SELECT *
FROM  (
   SELECT *
        , lag(val, 1, 0)    OVER (PARTITION BY status ORDER BY id) AS last_val
        , lag(status, 1, 0) OVER w2 AS last_status
        , lag(next_id)      OVER w2 AS next_id_of_last_status
   FROM  (
      SELECT *, lead(id) OVER (PARTITION BY status ORDER BY id) AS next_id
      FROM   t1
      ) AS t
   WINDOW w2 AS (PARTITION BY val ORDER BY id)
  ) x
WHERE (last_val <> val OR last_status <> status)
AND   (status = 1 
       OR last_status = 1
          AND ((next_id_of_last_status > id) OR next_id_of_last_status IS NULL)
      )
ORDER  BY id;

In addition to what we already had, we need valid OFF switches.

An OFF switch is valid if the device was switched ON before (last_status = 1) and the next ON operation after that comes after the OFF switch in question (next_id_of_last_status > id).

We have to provide for the special case that this was the last ON operation, so we check for NULL in addition (OR next_id_of_last_status IS NULL).

The next_id_of_last_status comes from the same window that we take last_status from. Therefore I introduced additional syntax for explicit window declaration, so I don't have to repeat myself:

WINDOW w2 AS (PARTITION BY val ORDER BY id)

And we need to get the next id for the last status in a subquery earlier (subquery t).

If you've understood all that, you shouldn't have a problem slapping lead() on top of this query to get to your final destination. :)

PL/pgSQL function

Once it gets this complex, it's time to switch to procedural processing.

This comparatively simple plpgsql function easily outperforms the complex window function query, for the simple reason that it has to scan the whole table only once.

CREATE OR REPLACE FUNCTION valid_t1 (OUT t t1)  -- row variable of table type
  RETURNS SETOF t1
  LANGUAGE plpgsql AS
$func$
DECLARE
   _last_on int := -1;  -- init with impossible value
BEGIN
   FOR t IN
      SELECT * FROM t1 ORDER BY id
   LOOP
      IF t.status = 1 THEN
         IF _last_on <> t.val THEN
            RETURN NEXT;
            _last_on := t.val;
         END IF;
      ELSE
         IF _last_on = t.val THEN
            RETURN NEXT;
            _last_on := -1;
         END IF;
      END IF;
   END LOOP;
END
$func$;

Call:

SELECT * FROM valid_t1();
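
For readers without a PostgreSQL instance at hand, the same single-pass state machine can be sketched in Python. The rows and the name valid_switches are hypothetical, and None replaces the "impossible value" -1 used in the function above.

```python
# Hypothetical sample rows: (id, val, status); status 1 = ON, 0 = OFF.
rows = [
    (1, 10, 1),  # ON 10: valid
    (2, 10, 1),  # ON 10 again: ignored
    (3, 20, 0),  # OFF 20, but 20 is not the device that is on: ignored
    (4, 10, 0),  # OFF 10: valid
    (5, 10, 0),  # OFF 10 again: ignored
    (6, 20, 1),  # ON 20: valid
    (7, 10, 1),  # ON 10 (different from last ON): valid
    (8, 20, 0),  # OFF 20, but the last ON was 10: ignored
    (9, 10, 0),  # OFF 10: valid
]

def valid_switches(rows):
    last_on = None  # device of the last valid ON, or None if it was switched off since
    valid = []
    for id_, val, status in sorted(rows):  # ORDER BY id
        if status == 1:
            if last_on != val:
                valid.append(id_)
                last_on = val
        else:
            if last_on == val:
                valid.append(id_)
                last_on = None
    return valid

print(valid_switches(rows))  # [1, 4, 6, 7, 9]
```

As in the PL/pgSQL version, only one variable of state is carried along, so the input is read exactly once.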

SQL - LAG to get previous value if condition using multiple previous columns satisfied

I see no answer here that uses window functions and a single scan of the table. We can do this query in a single scan as follows:

Let us assume you have the AwayTeam in another column.

If you don't have this yet and want to parse it out of EventData, you could use:

SUBSTRING(EventData, CHARINDEX(' vs ', EventData) + 4, LEN(EventData))

I urge you to follow proper normalization and create this as a proper column in your table.

Our algorithm runs like this:

1. Multiply out (unpivot) the two teams as separate rows, using CROSS APPLY

2. Calculate the previous Metrics using LAG, partitioning by the merged Team column

3. Filter back down the doubled up rows, so that we only get a single row for each of our original ones

SELECT id, HomeTeam, AwayTeam, Metric, Prev1, Prev2, Prev3
FROM (

  SELECT *
    ,Prev1 = LAG(Metric, 1) OVER (PARTITION BY v.Team ORDER BY id)
    ,Prev2 = LAG(Metric, 2) OVER (PARTITION BY v.Team ORDER BY id)
    ,Prev3 = LAG(Metric, 3) OVER (PARTITION BY v.Team ORDER BY id)
    -- more of these ......
  FROM test_table
  CROSS APPLY (VALUES (HomeTeam, 1),(AwayTeam, 0)) AS v(Team,IsHome)
) AS t

WHERE IsHome = 1
-- ORDER BY id  --if necessary

Importantly, we can do this without the use of multiple different sorts, partitions or ordering, and without the use of a self-join. Just a single scan.

Result:

id	HomeTeam	AwayTeam	Metric	Prev1	Prev2	Prev3
1	Team A	Team B	5	(null)	(null)	(null)
2	Team A	Team B	7	5	(null)	(null)
3	Team C	Team D	6	(null)	(null)	(null)
4	Team Z	Team A	8	(null)	(null)	(null)
5	Team A	Team B	9	8	7	5
6	Team C	Team D	3	6	(null)	(null)
7	Team C	Team D	1	3	6	(null)
8	Team E	Team F	2	(null)	(null)	(null)
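
The unpivot-then-LAG idea can be traced in a few lines of Python, using the rows from the result above. The name previous_metrics is made up for this sketch, and it assumes a team never plays against itself in a single row.

```python
from collections import defaultdict

# (id, HomeTeam, AwayTeam, Metric), taken from the result table above.
matches = [
    (1, "Team A", "Team B", 5),
    (2, "Team A", "Team B", 7),
    (3, "Team C", "Team D", 6),
    (4, "Team Z", "Team A", 8),
    (5, "Team A", "Team B", 9),
    (6, "Team C", "Team D", 3),
    (7, "Team C", "Team D", 1),
    (8, "Team E", "Team F", 2),
]

def previous_metrics(matches, depth=3):
    history = defaultdict(list)  # team -> metrics of earlier rows, oldest first
    result = []
    for id_, home, away, metric in sorted(matches):  # ORDER BY id
        prev = history[home][::-1][:depth]     # LAG 1..depth for the home team
        prev += [None] * (depth - len(prev))   # pad with None where LAG gives NULL
        result.append((id_, home, away, metric, *prev))  # keep only the IsHome = 1 row
        for team in (home, away):              # "unpivot": the row counts for both teams
            history[team].append(metric)
    return result

for row in previous_metrics(matches):
    print(row)
```

Row 5 picks up the metric 8 from row 4 because Team A appears there as the away team, which is the reason for unpivoting both columns before applying LAG.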

group consecutive time intervals in sql

You can use variables just fine in PL/pgSQL.

I would solve this with a table function.

Assuming the table is called stock, my code would look like this:

CREATE OR REPLACE FUNCTION combine_periods() RETURNS SETOF stock
   LANGUAGE plpgsql STABLE AS
$$DECLARE
   s stock;
   period stock;
BEGIN
   FOR s IN
      SELECT stock_name, action, start_date, end_date
      FROM stock
      ORDER BY stock_name, action, start_date
   LOOP
      /* is this a new period? */
      IF period IS NOT NULL AND
         (period.stock_name <> s.stock_name
            OR period.action <> s.action
            OR period.end_date <> s.start_date)
      THEN
         /* new period, output last period */
         RETURN NEXT period;
         period := NULL;
      ELSE
         IF period IS NOT NULL
         THEN
            /* period continues, update end_date */
            period.end_date := s.end_date;
         END IF;
      END IF;

      /* remember the beginning of a new period */
      IF period IS NULL
      THEN
         period := s;
      END IF;
   END LOOP;

   /* output the last period */
   IF period IS NOT NULL
   THEN
      RETURN NEXT period;
   END IF;

   RETURN;
END;$$;

And I would call it like this:

test=> SELECT * FROM combine_periods();
┌────────────┬─────────┬────────────┬──────────┐
│ stock_name │ action  │ start_date │ end_date │
├────────────┼─────────┼────────────┼──────────┤
│ google     │ falling │          3 │        4 │
│ google     │ growing │          1 │        3 │
│ google     │ growing │          4 │        5 │
│ yahoo      │ growing │          1 │        2 │
└────────────┴─────────┴────────────┴──────────┘
(4 rows)
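
The merging logic of the function can be sketched in Python as follows. The input rows are an assumption: the original table contents are not shown here, so they were chosen only to be consistent with the output above. Dates are plain integers, as in that output.

```python
# Hypothetical input: (stock_name, action, start_date, end_date)
stock = [
    ("google", "growing", 1, 2),
    ("google", "growing", 2, 3),
    ("google", "falling", 3, 4),
    ("google", "growing", 4, 5),
    ("yahoo", "growing", 1, 2),
]

def combine_periods(rows):
    merged = []
    # Same ordering as in the function: stock_name, action, start_date
    for name, action, start, end in sorted(rows, key=lambda r: (r[0], r[1], r[2])):
        last = merged[-1] if merged else None
        if last and last[0] == name and last[1] == action and last[3] == start:
            last[3] = end  # period continues, update end_date
        else:
            merged.append([name, action, start, end])  # a new period begins
    return [tuple(period) for period in merged]

for period in combine_periods(stock):
    print(period)
# ('google', 'falling', 3, 4)
# ('google', 'growing', 1, 3)
# ('google', 'growing', 4, 5)
# ('yahoo', 'growing', 1, 2)
```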
