Conditional Lead/Lag Function Postgresql

How to select the previous date with a condition

In this case, perhaps a subquery would be simpler than window functions:

SELECT *,(
 SELECT max(date) FROM topups q2 
 WHERE id_user = q.id_user AND 
       q2.promo_id = 'Y' AND 
       q.date > q2.date)
FROM topups q ORDER BY date DESC;

 id_user |    date    | promo_id |    max     
---------+------------+----------+------------
       1 | 2017-06-20 | N        | 2017-04-20
       1 | 2017-05-20 | N        | 2017-04-20
       1 | 2017-04-20 | Y        | 2017-03-20
       1 | 2017-03-20 | Y        | 
       1 | 2017-02-20 | N        | 
(5 rows)

Demo: db<>fiddle

Postgres : lag and lead with special conditions

No need for analytic functions, just sub-selects. Something like the following (untested) should work:

select
  id_data,
  val_no3,
  id_prev,
  id_next,
  (select val_no2 from b where id_data = x.id_prev) as val_prev,
  (select val_no2 from b where id_data = x.id_next) as val_next
from
  b x
order by
  id_data;

PostgreSQL combine LAG and LEAD to query n previous and following rows

I am not quiet sure if I got your use case correctly: You want to check if one given annotation is in one of the 5 rows (2 preceding, current, 2 following). Correct?

It is possible to define a window like BETWEEN 2 PRECEDING AND 2 FOLLOWING
LEAD or LAG only give one value, in this case the one value after or before the current row - if a window supports it; no matter how many rows your window contains. But you want to check in any of these five rows.

One way to achieve this:

demo: db<>fiddle

SELECT *
FROM (
    SELECT token, annotation, lemma,
        unnest(array_agg(annotation) OVER w) as surrounded_annos      -- 2
    FROM tokens
    WINDOW w AS (                                                     -- 1
        ORDER BY line, tno ASC
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    )
    ORDER BY line, tno ASC
) AS "window"
WHERE
    lemma LIKE '...'
    AND "window".surrounded_annos LIKE '...'
;

defining the window as explained above
1. aggregate all annotations in these five rows (if possible) with array_agg which gives an array
2. unnest expands this array into one row for every element since IMHO there is no way to search array elements with LIKE. This gives you this result (which can be filtered in the next step):

Result subquery:

token     annotation        lemma     surrounded_annos
This      DEM.PROX          this      DEM.PROX
This      DEM.PROX          this      VB.COP.3SG.PRES
This      DEM.PROX          this      ART.INDEF
is        VB.COP.3SG.PRES   be        DEM.PROX
is        VB.COP.3SG.PRES   be        VB.COP.3SG.PRES
is        VB.COP.3SG.PRES   be        ART.INDEF
is        VB.COP.3SG.PRES   be        NN.INAN
an        ART.INDEF         a         DEM.PROX
an        ART.INDEF         a         VB.COP.3SG.PRES
an        ART.INDEF         a         ART.INDEF
an        ART.INDEF         a         NN.INAN
example   NN.INAN           example   VB.COP.3SG.PRES
example   NN.INAN           example   ART.INDEF
example   NN.INAN           example   NN.

Using lag and lead function for multiple partitions

You are pretty close. A few small issues may be all that's holding you up. For one, you can't reference the alias last_month_amount within the same query where you assign it. My preference is to pull the first part into a CTE and then do the calculations in the main query.

Secondly, you may also be seeing unusual behavior due to not handling nulls consistently. Since lag may return null if you are looking at the first row in the partition, delta should address this possibility. It appears you're aware delta may be null (from the logic in your event case statement), but delta is currently coming up null for the first row, not the last, so I think you have the 'stop' case backwards.

You also are missing a FROM; perhaps you were abbreviating or simplifying your code.

Finally, the way "month" is used in this post is confusing, in part because it's an odd name for a DATE column. When you say "the same month", it's unclear whether you mean the same date or the same month of the year. If you potentially have distinct rows with the same month (date), seller, and product, then you will need a secondary column in your order by or you will get unpredictable ordering. Without knowing any of this I'm going to assume you've thought this through and that order by month is good enough.

It seems like you may also want event to be 'new' if last month's amount was zero, since the previous row would be 'stop'. I've made that change, but adjust accordingly if my assumption is wrong.

Here's the same code with those changes:

with rows_in_context AS (
  select month 
  , seller
  , product
  , amount
  , lag(amount) over (partition by seller,product order by month) as last_month_amount
  from some_table
)
select *
  , coalesce(amount,0) - coalesce(last_month_amount,0) as delta 
  , case when COALESCE(last_month_amount,0) = 0 then 'new' 
         when COALESCE(amount,0) = 0 then 'stop'
         when delta > 0 then 'increase' 
         when delta < 0 then 'decrease'
    end as event 
from rows_in_context
order by month, seller, product;

Using LAG function with a specific condition

I think that you want a conditional lead() with option ignore nulls:

select 
    id,
    date from_date,
    case when status = 'ON' 
        then lead(case when status = 'OFF' then date end ignore nulls) 
            over(partition by id order by date)
    end to_date,
    status
from mytable

To better match the expected results, we apply the function to any status other than 'OFF', and withdraw 1 day from the result:

select 
    id,
    date from_date,
    case when status <> 'OFF' 
        then lead(case when status = 'OFF' then date end ignore nulls) 
            over(partition by id order by date) - 1
    end to_date,
    status
from mytable

If you also want the next 'ON' date for 'OFF' rows:

select 
    id,
    date from_date,
    case when status <> 'OFF' 
        then lead(case when status = 'OFF' then date end ignore nulls) 
            over(partition by id order by date) - 1
        else lead(case when status = 'ON' then date end ignore nulls) 
            over(partition by id order by date) - 1
    end to_date,
    status
from mytable

Demo on DB Fiddlde

First query:


   ID | FROM_DATE | TO_DATE   | STATUS
----: | :-------- | :-------- | :-----
15643 | 10-MAR-20 | 20-MAR-20 | ON    
15643 | 15-MAR-20 | null      | test  
15643 | 20-MAR-20 | null      | OFF

Second query:


   ID | FROM_DATE | TO_DATE   | STATUS
----: | :-------- | :-------- | :-----
15643 | 10-MAR-20 | 19-MAR-20 | ON    
15643 | 15-MAR-20 | 19-MAR-20 | test  
15643 | 20-MAR-20 | null      | OFF

Third query (you can't see the difference with the second query since there is no 'OFF' row with a next 'ON'):


   ID | FROM_DATE | TO_DATE   | STATUS
----: | :-------- | :-------- | :-----
15643 | 10-MAR-20 | 19-MAR-20 | ON    
15643 | 15-MAR-20 | 19-MAR-20 | test  
15643 | 20-MAR-20 | null      | OFF

Conditional LEAD/LAG with no sequence guarantee

You have the right expression for prior_path already. You just need to wrap it in a conditional expression.

As for prior_event, it is a bit more complicated indeed. I would suggest the following approach:

for events, we can just use lag()
for pages, an option is to use some gaps-and-islands technique: first define groups with a conditional sum that increments every time an event is met, and then use first_value():

This should do what you want:

select  
    t.*,
    case when type = 'page'
        then lag(path) over(partition by sessionid, type  order by hit)
    end prior_path,
    case type 
        when 'page'
            then first_value(event) over(partition by sessionid, grp order by hit)
        when 'event' 
            then lag(event) over(partition by sessionid order by hit)
        end prior_event
from (
    select 
        t.*,
        sum(case when type = 'event' then 1 else 0 end) 
            over(partition by sessionid order by hit) grp
    from mytable t
) t

Demo on DB Fiddle (due to the lack of hive fiddle in the wild, I used Postgres - but this would work on Hive as well):


sessionid | hit | type  | path                                  | event | grp | prior_path                         | prior_event
--------: | --: | :---- | :------------------------------------ | :---- | --: | :--------------------------------- | :----------
     1001 |   1 | event | www.stackoverflow.com                 | hover |   1 | null                               | null       
     1001 |   2 | page  | www.stackoverflow.com                 | null  |   1 | null                               | hover      
     1001 |   3 | event | www.stackoverflow.com                 | load  |   2 | null                               | null       
     1001 |   4 | event | www.stackoverflow.com                 | blur  |   3 | null                               | load       
     1001 |   5 | event | www.stackoverflow.com                 | click |   4 | null                               | blur       
     1001 |   6 | page  | www.stackoverflow.com/post/10         | null  |   4 | www.stackoverflow.com              | click      
     1001 |   7 | event | www.stackoverflow.com/post/10#details | offer |   5 | null                               | null       
     1001 |   8 | page  | www.stackoverflow.com/post/confirm    | null  |   5 | www.stackoverflow.com/post/10      | offer      
     1001 |   9 | page  | www.stackoverflow.com/questions/10    | null  |   5 | www.stackoverflow.com/post/confirm | offer      
     1001 |  10 | event | www.stackoverflow.com/questions/10    | exit  |   6 | null                               | null

Using LAG() and PARTITION BY to return rows IF within 10 days of date

This answer is based on Gordon Linoff's idea,
but with a few tweaks:

FILTER is not implemented for pure window functions like lead() or lag() in Postgresql 11 (yet). So use WHERE fruit_bought='orange' as a condition on the entire inner SELECT.
To guarantee selection of the row with the last date, use LEAD(date, 1, '-infinity'). This makes the default value for next_date equal to a -infinity timestamp. Thus date >= next_date - interval '10 day' will be TRUE for the last date.
Let's call rows within 10 days of each other a cluster. To select only rows from the last cluster,
compute a cumulative sum which counts how many times cond is FALSE (since FALSE values separate clusters):
```
SUM(CASE WHEN cond IS TRUE THEN 0 ELSE 1 END) OVER (ORDER BY date DESC) AS cluster_num
```
and select only rows for which cluster_num equals 0. Since we ORDER BY date DESC, the 0th cluster is the last cluster.

SELECT *
FROM (
    SELECT *, SUM(CASE WHEN cond IS TRUE THEN 0 ELSE 1 END) OVER (ORDER BY date DESC) AS cluster_num
    FROM (
        SELECT *, date >= next_date - interval '10 day' AS cond
        FROM (
            SELECT id, fruit_bought, date, 
                LEAD(date, 1, '-infinity') 
                OVER (PARTITION BY fruit_bought ORDER BY date) AS next_date 
            FROM fruits 
            WHERE fruit_bought='orange'
            -- restrict date here to specify an "initial date"
            AND date <= '2018-04-01'  
        ) t1
    ) t2
) t3
WHERE cond AND cluster_num = 0
ORDER BY date ASC

yields

| id | fruit_bought |       date |  next_date | cond | cluster_num |
|----+--------------+------------+------------+------+-------------|
|  3 | orange       | 2018-03-07 | 2018-03-15 | t    |           0 |
|  4 | orange       | 2018-03-15 | 2018-03-20 | t    |           0 |
|  6 | orange       | 2018-03-20 |  -infinity | t    |           0 |

Setup:

CREATE TABLE fruits (
    fruitid INT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    id INT,
    fruit_bought TEXT,
    quantity INT,
    date DATE);

INSERT INTO fruits (id, fruit_bought, quantity, date)
VALUES (1,'orange',100,'2018-01-10')
, (2,'apple',50,'2018-02-05')
, (3,'orange',75,'2018-03-07')
, (4,'orange',200,'2018-03-15')
, (5,'apple',10,'2018-03-17')
, (6,'orange',20,'2018-03-20')
, (7,'orange',20,'2018-01-09');

LAG with condition

Here is one approach using analytic functions:

WITH cte AS (
    SELECT *, COUNT(CASE WHEN event = 1 THEN 1 END) OVER
                  (PARTITION BY customer_id ORDER BY ts) cnt
    FROM yourTable
)

SELECT ts, customer_id, event,
       MAX(CASE WHEN event = 1 THEN ts END) OVER
           (PARTITION BY customer_id, cnt) AS desired_result
FROM cte
ORDER BY customer_id, ts;

screen capture from demo link below

Demo

We can articulate your problem by saying that your want the desired_result column to contain the most recent timestamp value when the event was 1. The count (cnt) in the CTE above computes a pseudo group of records for each time the event is 1. Then we simply do a conditional aggregation over customer and pseudo group to find the timestamp value.

Jump SQL gap over specific condition & proper lead() usage

Query with window functions

SELECT *
FROM  (
   SELECT *
         ,lag(val, 1, 0)    OVER (PARTITION BY status ORDER BY id) AS last_val
         ,lag(status, 1, 0) OVER w2 AS last_status
         ,lag(next_id)      OVER w2 AS next_id_of_last_status
   FROM  (
      SELECT *, lead(id) OVER (PARTITION BY status ORDER BY id) AS next_id
      FROM   t1
      ) AS t
   WINDOW w2 AS (PARTITION BY val ORDER BY id)
  ) x
WHERE (last_val <> val OR last_status <> status)
AND   (status = 1 
       OR last_status = 1
          AND ((next_id_of_last_status > id) OR next_id_of_last_status IS NULL)
      )
ORDER  BY id

In addition to what we already had, we need valid OFF switches.

An OFF switch if valid if the device was switched ON before (last_status = 1) and the next ON operation after that comes after the OFF switch in question (next_id_of_last_status > id).

We have to provide for the special case that there is was the last ON operation, so we check for NULL in addition (OR next_id_of_last_status IS NULL).

The next_id_of_last_status comes from the same window that we take last_status from. Therefore I introduced additional syntax for explicit window declaration, so I don't have to repeat myself:

WINDOW w2 AS (PARTITION BY val ORDER BY id)

And we need to get the next id for the last status in a subquery earlier (subquery t).

If you've understood all that, you shouldn't have a problem slapping lead() on top of this query to get to your final destination. :)

PL/pgSQL function

Once it gets this complex, it's time to switch to procedural processing.

This comparatively simple plpgsql function nukes the performance of the complex window function query, for the simple reason that it has to scan the whole table only once.

CREATE OR REPLACE FUNCTION valid_t1 (OUT t t1)  -- row variable of table type
  RETURNS SETOF t1 LANGUAGE plpgsql AS
$func$
DECLARE
   _last_on int := -1;  -- init with impossible value
BEGIN

FOR t IN
   SELECT * FROM t1 ORDER BY id
LOOP
   IF t.status = 1 THEN
      IF _last_on <> t.val THEN
         RETURN NEXT;
         _last_on := t.val;
      END IF;
   ELSE
      IF _last_on = t.val THEN
         RETURN NEXT;
         _last_on := -1;
      END IF;
   END IF;
END LOOP;

END
$func$;

Call:

SELECT * FROM valid_t1();