Window Functions or Common Table Expressions: Count Previous Rows Within Range

I don't think you can do this cheaply with a plain query, CTEs, or window functions: their frame definition is static, but you need a dynamic frame (depending on column values).

Generally, you'll have to define the lower and upper bounds of your window carefully: the following queries exclude the current row and include the lower border.

There is still a minor difference: the function includes previous peers of the current row, while the correlated subquery excludes them ...
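
For illustration, consider why a static frame cannot do the job: a ROWS frame always spans a fixed number of rows, no matter how much time they cover. A minimal sketch:

SELECT id, ts
     , count(*) OVER (ORDER BY ts ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS ct
FROM   test;

This counts (up to) the 3 previous rows regardless of whether they fall within the last hour. A frame based on the value of ts is not possible with a ROWS clause.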

Test case

Using ts instead of the reserved word date as the column name.

CREATE TABLE test (
   id bigint
 , ts timestamp
);

ROM - Roman's query

Use CTEs, aggregate timestamps into an array, unnest, count ...

While correct, performance deteriorates drastically with more than a handful of rows. There are a couple of performance killers here. See below.
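
For reference, a rough sketch of that approach - reconstructed from the description and the points below, not Roman's exact SQL:

WITH cte1 AS (   -- collect all timestamps up to the current row
   SELECT id, ts, array_agg(ts) OVER (ORDER BY ts) AS dates
   FROM   test
   )
, cte2 AS (      -- unnest the collected timestamps
   SELECT id, ts, unnest(dates) AS d
   FROM   cte1
   )
SELECT id, ts
     , array_length(array_agg(d), 1) - 1 AS ct  -- re-aggregate, count, subtract self
FROM   cte2
WHERE  d >= ts - interval '1h'
GROUP  BY id, ts
ORDER  BY ts;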

ARR - count array elements

I took Roman's query and tried to streamline it a bit:

  • Remove the 2nd CTE, which is not necessary.
  • Transform the 1st CTE into a subquery, which is faster.
  • Count directly with count() instead of re-aggregating into an array and counting with array_length().

But array handling is expensive, and performance still deteriorates badly with more rows.

SELECT id, ts
     , (SELECT count(*)::int - 1
        FROM   unnest(dates) x
        WHERE  x >= sub.ts - interval '1h') AS ct
FROM  (
   SELECT id, ts
        , array_agg(ts) OVER (ORDER BY ts) AS dates
   FROM   test
   ) sub;

COR - correlated subquery

You could solve it with a simple correlated subquery. A lot faster, but still ...

SELECT id, ts
     , (SELECT count(*)
        FROM   test t1
        WHERE  t1.ts >= t.ts - interval '1h'
        AND    t1.ts <  t.ts) AS ct
FROM   test t
ORDER  BY ts;

FNC - Function

Loop over rows in chronological order with a row_number() in a plpgsql function and combine that with a cursor over the same query, spanning the desired time frame. Then we can just subtract row numbers:

CREATE OR REPLACE FUNCTION running_window_ct(_intv interval = '1 hour')
  RETURNS TABLE (id bigint, ts timestamp, ct int)
  LANGUAGE plpgsql AS
$func$
DECLARE
   cur CURSOR FOR              -- runs ahead of the main loop by _intv
   SELECT t.ts + _intv AS ts1, row_number() OVER (ORDER BY t.ts) AS rn
   FROM   test t ORDER BY t.ts;
   rec  record;
   rn   int;

BEGIN
   OPEN cur;
   FETCH cur INTO rec;
   ct := -1;  -- init

   FOR id, ts, rn IN
      SELECT t.id, t.ts, row_number() OVER (ORDER BY t.ts)
      FROM   test t ORDER BY t.ts
   LOOP
      IF rec.ts1 >= ts THEN    -- cursor row still inside the time frame
         ct := ct + 1;
      ELSE                     -- advance cursor to the first row inside the frame
         LOOP
            FETCH cur INTO rec;
            EXIT WHEN rec.ts1 >= ts;
         END LOOP;
         ct := rn - rec.rn;    -- count = difference between row numbers
      END IF;

      RETURN NEXT;
   END LOOP;
END
$func$;

Call with default interval of one hour:

SELECT * FROM running_window_ct();

Or with any interval:

SELECT * FROM running_window_ct('2 hour - 3 second');


Benchmark

With the table from above, I ran a quick benchmark on my old test server (PostgreSQL 9.1.9 on Debian).

-- TRUNCATE test;
INSERT INTO test
SELECT g, '2013-08-08'::timestamp
        + g * interval '5 min'
        + random() * 300 * interval '1 min'  -- halfway realistic values
FROM   generate_series(1, 10000) g;

CREATE INDEX test_ts_idx ON test (ts);
ANALYZE test;  -- temp table needs manual analyze

I varied the row count in generate_series() for each run and took the best of 5 with EXPLAIN ANALYZE.

rows      ROM           ARR          COR         FNC
100       27.656 ms     7.834 ms     5.488 ms    1.115 ms
1000      2116.029 ms   189.679 ms   65.802 ms   8.466 ms
5000      51347 ms !!   3167 ms      333 ms      42 ms
100000    DNF           DNF          6760 ms     828 ms

(DNF: did not finish.)

The function is the clear victor. It is fastest by an order of magnitude and scales best.

Array handling cannot compete.

Rolling count of rows within time interval

Sounds like an application for window functions. But, sadly, that's not the case: window frames can only be based on row counts, not on actual column values. (Postgres 11 later added RANGE frames with a value offset; see the answer on moving sums further down.)

A simple query with LEFT JOIN can do the job:

SELECT t0.order_id
     , count(t1.time_created) AS count_within_3_sec
FROM   tbl t0
LEFT   JOIN tbl t1 ON t1.time_created BETWEEN t0.time_created - interval '3 sec'
                                          AND t0.time_created
GROUP  BY 1
ORDER  BY 1;


This does not work with the type time as in your minimal demo, since the BETWEEN range does not wrap around midnight. I suppose it's reasonable to assume timestamp or timestamptz.
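
To illustrate - the subtraction itself wraps around, which breaks the BETWEEN range near midnight:

SELECT time '00:00:01' - interval '3 sec';  -- 23:59:58, wrapped past midnight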

Since you include each row itself in the count, an INNER JOIN would work, too. (LEFT JOIN is still more reliable in the face of possible NULL values.)
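
A sketch of that variant, for illustration:

SELECT t0.order_id
     , count(*) AS count_within_3_sec
FROM   tbl t0
JOIN   tbl t1 ON t1.time_created BETWEEN t0.time_created - interval '3 sec'
                                     AND t0.time_created
GROUP  BY 1
ORDER  BY 1;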

Or use a LATERAL subquery and you don't need to aggregate on the outer query level:

SELECT t0.order_id
     , t1.count_within_3_sec
FROM   tbl t0
LEFT   JOIN LATERAL (
   SELECT count(*) AS count_within_3_sec
   FROM   tbl t1
   WHERE  t1.time_created BETWEEN t0.time_created - interval '3 sec'
                              AND t0.time_created
   ) t1 ON true
ORDER  BY 1;

Related:

  • Rolling sum / count / average over date interval

For big tables and many rows in the time frame, a procedural solution that walks through the table once will perform better. Like:

  • Window Functions or Common Table Expressions: count previous rows within range
  • Alternatives to broken PL/ruby: convert a warehouse journal table
  • GROUP BY and aggregate sequential numeric values

Count over rows in previous time range partitioned by a specific column


DB design

First off, your table should be normalized: industry should be a small foreign key column (typically integer) referencing industry_id of an industry table. Maybe you have that already and only simplified for the sake of the question. Your actual table definition would go a long way.

Since rows with an indicator are rare but highly interesting, create a (possibly "covering") partial index to make any solution faster:

CREATE INDEX tbl_indicator_idx ON tbl (industry, day)
WHERE indicator <> 0;

Equality first, range last: the column checked with an equality condition (industry) goes first in the multicolumn index, the range column (day) last.

This assumes that indicator is defined NOT NULL. If industry were an integer, this index would be perfectly efficient.

Query

This query identifies rows to be reset:

WITH x AS (               -- only rows with indicator
   SELECT DISTINCT industry, day
   FROM   tbl t
   WHERE  indicator <> 0
   )
SELECT industry, day
FROM  (
   SELECT i.industry, d.day, x.day IS NOT NULL AS incident
        , count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
                             ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
   FROM  (
      SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
      FROM  (
         SELECT generate_series(min(day), max(day), interval '1d')::date AS day
         FROM   x
         ) d
      WHERE  extract('ISODOW' FROM d.day) < 6   -- exclude weekends
      ) d
   CROSS  JOIN (SELECT DISTINCT industry FROM x) i
   LEFT   JOIN x USING (industry, day)
   ) sub
WHERE  incident
AND    ct > 1
ORDER  BY 1, 2;


ISODOW as extract() parameter is convenient to exclude weekends.
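
For illustration: ISODOW numbers the days of the week Monday (1) through Sunday (7), so < 6 keeps only weekdays:

SELECT extract('isodow' FROM date '2013-08-10');  -- 6 = Saturday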

Integrate this in your UPDATE:

WITH x AS (               -- only rows with indicator
   SELECT DISTINCT industry, day
   FROM   tbl t
   WHERE  indicator <> 0
   )
UPDATE tbl t
SET    indicator = 0
FROM  (
   SELECT i.industry, d.day, x.day IS NOT NULL AS incident
        , count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
                             ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
   FROM  (
      SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
      FROM  (
         SELECT generate_series(min(day), max(day), interval '1d')::date AS day
         FROM   x
         ) d
      WHERE  extract('isodow' FROM d.day) < 6
      ) d
   CROSS  JOIN (SELECT DISTINCT industry FROM x) i
   LEFT   JOIN x USING (industry, day)
   ) u
WHERE  u.incident
AND    u.ct > 1
AND    t.industry = u.industry
AND    t.day = u.day;

This should be substantially faster than your solution with correlated subqueries and a function call for every row. Even though that solution is based on my own previous answer, it's not a perfect fit for this case.

Postgres - Update running count whenever row meets a certain condition

Here is one option using analytic functions:

WITH cte AS (
   SELECT *, CASE WHEN SUM(price * quantity) OVER (ORDER BY id) = 0
                  THEN 1 ELSE 0 END AS price_sum
   FROM   yourTable
   )
, cte2 AS (
   SELECT *, LAG(price_sum, 1, 0) OVER (ORDER BY id) AS price_sum_lag
   FROM   cte
   )
SELECT id, price, quantity
     , 1 + SUM(price_sum_lag) OVER (ORDER BY id) AS cumulative_total
FROM   cte2
ORDER  BY id;


You may try running each CTE in succession to see how the logic is working.
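
For example, the first CTE can be run on its own (against the same yourTable as above) to inspect the intermediate flag it produces:

SELECT *, CASE WHEN SUM(price * quantity) OVER (ORDER BY id) = 0
               THEN 1 ELSE 0 END AS price_sum
FROM   yourTable
ORDER  BY id;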

Count number of years since last deduction

Assuming the current year is the most recent year, you would use aggregation:

select account, max(fy),
       sum(case when fy = max_fy then deductions end) as this_year_deduction,
       max(fy) - max(case when deduction < 0 then fy end) as years_since_deduction
from (select t.*, max(fy) over (partition by account) as max_fy
      from t
     ) t
group by account;

Note: I assume the third column is the most recent deduction. The query uses a window function to extract that.

Calculate moving sum/count by time condition using window function and filter PostgreSQL

You want a window function with a window frame definition using range (available in Postgres 11 or later):

select t1.*,
       sum(quantity) over (order by time
                           range between interval '29 day' preceding and current row
                          )
from t1;

EDIT:

If you have data for all dates, you can use rows:

select t1.*,
sum(quantity) over (order by time
rows between 29 preceding and current row
)
from t1 ;

EDIT II:

If you need to deal with missing days in older versions of Postgres that do not support range, then expanding the data is probably the simplest method:

select t1.*,
       sum(quantity) over (order by d.dte  -- order by the generated date, so filler rows sort in place
                           rows between 29 preceding and current row
                          )
from (select generate_series(min(t1.time), max(t1.time), interval '1 day') as dte
      from t1
     ) d left join
     t1
     on d.dte = t1.time;

You may want to filter out the additional rows:

select t1.*
from (select t1.*,
             sum(quantity) over (order by d.dte
                                 rows between 29 preceding and current row
                                ) as running_sum
      from (select generate_series(min(t1.time), max(t1.time), interval '1 day') as dte
            from t1
           ) d left join
           t1
           on d.dte = t1.time
     ) t1
where t1.time is not null;

How to improve SQL query performance containing partially common subqueries


How can I reuse the common portion of the subquery ...?

Use conditional aggregates in a single LATERAL subquery:

SELECT t1.*, t2.positive, t2.negative
FROM   tableA t1
CROSS  JOIN LATERAL (
   SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
        , COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
   FROM   tableA t2
   WHERE  t2.sys_timestamp >  t1.sys_timestamp
   AND    t2.sys_timestamp <= t1.sys_timestamp + 1000
   ) t2;

It can be a CROSS JOIN because the subquery always returns a row. See:

  • JOIN (SELECT ... ) ue ON 1=1?
  • What is the difference between LATERAL JOIN and a subquery in PostgreSQL?

Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame. See:

  • Aggregate columns with additional (distinct) filters

event_count should probably be integer or bigint. See:

  • PostgreSQL using UUID vs Text as primary key
  • Is there any difference in saving same value in different integer types?

sys_timestamp should probably be timestamp or timestamptz. See:

  • Ignoring time zones altogether in Rails and PostgreSQL

An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.
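
A sketch of both (the index names are my own):

CREATE INDEX tablea_sys_timestamp_idx ON tableA (sys_timestamp);

-- multicolumn variant, enabling index-only scans on a well-vacuumed table:
CREATE INDEX tablea_sys_timestamp_event_count_idx ON tableA (sys_timestamp, event_count);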

Depending on the exact data distribution (most importantly, how much the time frames overlap) and other db characteristics, a tailored procedural solution may be faster yet. It can be done in any client-side language, but a server-side plpgsql solution is superior because it saves all the round trips to the DB server, type conversions, etc. See:

  • Window Functions or Common Table Expressions: count previous rows within range
  • What are the pros and cons of performing calculations in sql vs. in your application

How to perform a filtered query in a PostgreSQL window partition?

What you have in mind is simply not possible using the frame definition of a window function. (You were beginning to suspect as much.) The RANGE or ROWS clauses count distinct values or rows and have no concept of the meaning of values.

You want to count all rows that fall in a certain period of time and need to go about this differently. You could run a correlated subquery or a LATERAL subquery to count for each and every row, but that's expensive.

A smarter way would be to run through two cursors in parallel and keep a running count. I implemented exactly that for a very similar question:

  • Window Functions or Common Table Expressions: count previous rows within range

Scales much better. I added detailed benchmarks over there.


