Postgresql Window Function: Partition by Comparison

PostgreSQL window function: partition by comparison

Using several different window functions and two subqueries, this should work decently fast:

WITH events(id, event, ts) AS (
VALUES
(1, 12, '2014-03-19 08:00:00'::timestamp)
,(2, 12, '2014-03-19 08:30:00')
,(3, 13, '2014-03-19 09:00:00')
,(4, 13, '2014-03-19 09:30:00')
,(5, 12, '2014-03-19 10:00:00')
)
SELECT first_value(pre_id) OVER (PARTITION BY grp ORDER BY ts) AS pre_id
, id, ts
, first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM (
SELECT *, count(step) OVER w AS grp
FROM (
SELECT id, ts
, NULLIF(lag(event) OVER w, event) AS step
, lag(id) OVER w AS pre_id
, lead(id) OVER w AS post_id
FROM events
WINDOW w AS (ORDER BY ts)
) sub1
WINDOW w AS (ORDER BY ts)
) sub2
ORDER BY ts;

Using ts as name for the timestamp column.

Assuming ts to be unique - and indexed (a unique constraint does that automatically).

In a test with a real life table with 50k rows it only needed a single index scan. So, should be decently fast even with big tables. In comparison, your query with join / distinct did not finish after a minute (as expected).

Even an optimized version, dealing with one cross join at a time (the left join with hardly a limiting condition is effectively a limited cross join) did not finish after a minute.

For best performance with a big table, tune your memory settings, in particular for work_mem (for big sort operations). Consider setting it (much) higher for your session temporarily if you can spare the RAM. Read more here and here.

How?

  1. In subquery sub1 look at the event from the previous row and only keep that if it has changed, thus marking the first element of a new group. At the same time, get the id of the previous and the next row (pre_id, post_id).

  2. In subquery sub2, count() only counts non-null values. The resulting grp marks peers in blocks of consecutive same events.

  3. In the final SELECT, take the first pre_id and the last post_id per group for each row to arrive at the desired result.

    Actually, this should be even faster in the outer SELECT:

     last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
    RANGE BETWEEN UNBOUNDED PRECEDING
    AND UNBOUNDED FOLLOWING) AS post_id

    ... since the sort order of the window agrees with the window for pre_id, so only a single sort is needed. A quick test seems to confirm it. More about this frame definition.

SQL Fiddle.

PostgreSQL Window Function ordering

partition over x the ordering is not working the way I expect

It is working perfectly fine. When you partition by x first 1 and last 1 are in the same group.

Window Functions:

The PARTITION BY list within OVER specifies dividing the rows into groups, or partitions, that share the same values of the PARTITION BY expression(s). For each row, the window function is computed across the rows that fall into the same partition as the current row.

To get result you want you could use (classic example of gaps and islands problem):

SELECT *, ROW_NUMBER() OVER (ORDER BY y) -
ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) + 1 AS group_id
FROM tab
ORDER BY group_id

LiveDemo

Output:

╔═══╦═══╦══════════╗
║ x ║ y ║ group_id ║
╠═══╬═══╬══════════╣
║ 1 ║ 1 ║ 1 ║
║ 2 ║ 2 ║ 2 ║
║ 2 ║ 3 ║ 2 ║
║ 1 ║ 4 ║ 3 ║
╚═══╩═══╩══════════╝

Warning:

This solution is not general.

EDIT:

More general solution is to utilize LAG to get previous value and windowed SUM:

WITH cte AS
(
SELECT t1.x, t1.y, LAG(x) OVER(ORDER BY y) AS x_prev
FROM tab t1
)
SELECT x,y, SUM( CASE WHEN x = COALESCE(x_prev,x) THEN 0 ELSE 1 END)
OVER(ORDER BY y) + 1 AS group_id
FROM cte
ORDER BY group_id;

LiveDemo2

PostgreSQL - how do I know which partition I'm in when using window functions?

We can try using DENSE_RANK and ROW_NUMBER:

WITH yourTable AS (
SELECT 'A' AS col1, 1 AS col2 UNION ALL
SELECT 'A', 2 UNION ALL
SELECT 'A', 3 UNION ALL
SELECT 'B', 1 UNION ALL
SELECT 'B', 2 UNION ALL
SELECT 'B', 3
)

SELECT *,
DENSE_RANK() OVER (ORDER BY col1) partition1,
ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2) partition2
FROM yourTable
ORDER BY
col1, col2;

Sample Image

Demo

The general pattern here, for an arbitrary number of partitions, would be to use DENSE_RANK for the first N-1 partitions, and then ROW_NUMBER at the last partition. ROW_NUMBER would ensure that the last partition has 1,2,3,... as the sequence.

Can I use one PARTITION definition for multiple window function calls?

You can try to use WINDOW Clause, The optional WINDOW clause has the general form

WINDOW window_name AS ( window_definition ) [, ...]

Then use OVER window_name for your aggregate function, it might be more elegant

 select id,weekly,
sum(totalsteps) over w as total_steps,
sum(totaldistance) over w as total_distance,
sum(veryactiveminutes) over w as total_veryactive,
sum(fairlyactiveminutes) over w as total_fairlyactive,
sum(lightlyactiveminutes) over w as total_lightlyactive,
sum(totalsteps) over w as total_steps,
sum(totaldistance) over w as total_distance,
sum(veryactivedistance) over w as total_veryactivedistance,
sum(moderatelyactivedistance) over w as total_moderatelyactivedistance,
sum(lightactivedistance) over w as total_lightactivedistance,
sum(sedentaryactivedistance) over w as total_sedentaryactivedistance,
sum(calories) over w as total_calories,
sum(totalminutesover w as leep) over w as total_over w as leep,
sum(totaltimeinbed) over w as total_inbed
from (
select *, date_trunc('week', activitydate) as weekly
from activitysleep_merged
) WINDOW w AS ( PARTITION BY id, weekly );

more detail we can refer WINDOW Clause

SQLfiddle

Partition By with Order By Clause in PostgreSQL

Update

This behavior is documented here:

4.2.8. Window Function Calls


[..]
The default framing option is RANGE UNBOUNDED PRECEDING, which is
the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With
ORDER BY, this sets the frame to be all rows from the partition
start up through the current row's last ORDER BY peer. Without
ORDER BY, this means all rows of the partition are included in the
window frame, since all rows become peers of the current row.

That means:

In absence of a frame_clauseRANGE UNBOUNDED PRECEDING is used by default. That includes:

  • All rows "preceding" the current row according to the ORDER BY clause
  • The current row
  • All rows which have the same values in the ORDER BY columns as the current row

In absence of an ORDER BY clause – ORDER BY NULL is assumed (though I'm guessing again). Thus the frame will include all rows from the partition, because the values in the ORDER BY column(s) are the same (which is always NULL) in every row.

Original answer:

Disclaimer: The following is more a guess than a qualified answer. I didn't find any documentation, which can confirm what I write. At the same time I don't think that currently given answers correctly explain the behavior.

The reason for the diffrence in the results is not directly the ORDER BY clause, since a + b + c is the same as c + b + a. The reason is (and that is my guess) that the ORDER BY clause implicitly defines the frame_clause as

rows between unbounded preceding and current row

Try the following query:

select *
, sum(val) over (partition by user_id) as res
, sum(val) over (partition by user_id order by ts) as res_order_by
, sum(val) over (
partition by user_id
order by ts
rows between unbounded preceding and current row
) as res_order_by_unbounded_preceding
, sum(val) over (
partition by user_id
-- order by ts
rows between unbounded preceding and current row
) as res_preceding
, sum(val) over (
partition by user_id
-- order by ts
rows between current row and unbounded following
) as res_following
, sum(val) over (
partition by user_id
order by ts
rows between unbounded preceding and unbounded following
) as res_orderby_preceding_following

from example_table;

db<>fiddle

You will see, that you can get a cumulative sum without an ORDER BY clause aswell as get a "full" sum with the ORDER BY clause.



Related Topics



Leave a reply



Submit