Window Functions or Common Table Expressions: count previous rows within range
I don't think you can do this cheaply with a plain query, CTEs or window functions: their frame definition is static, but you need a dynamic frame (depending on column values).
Generally, you'll have to define the lower and upper bounds of your window carefully: the following queries exclude the current row and include the lower border.
There is still a minor difference: the function includes previous peers of the current row, while the correlated subquery excludes them ...
Test case
Using ts instead of the reserved word date as column name.
CREATE TABLE test (
id bigint
, ts timestamp
);
ROM - Roman's query
Use CTEs, aggregate timestamps into an array, unnest, count ...
While correct, performance deteriorates drastically with more than a handful of rows. There are a couple of performance killers here. See below.
ARR - count array elements
I took Roman's query and tried to streamline it a bit:
- Remove the 2nd CTE, which is not necessary.
- Transform the 1st CTE into a subquery, which is faster.
- Use count() directly instead of re-aggregating into an array and counting with array_length().
But array handling is expensive, and performance still deteriorates badly with more rows.
SELECT id, ts
, (SELECT count(*)::int - 1
FROM unnest(dates) x
WHERE x >= sub.ts - interval '1h') AS ct
FROM (
SELECT id, ts
, array_agg(ts) OVER(ORDER BY ts) AS dates
FROM test
) sub;
COR - correlated subquery
You could solve it with a simple correlated subquery. A lot faster, but still ...
SELECT id, ts
, (SELECT count(*)
FROM test t1
WHERE t1.ts >= t.ts - interval '1h'
AND t1.ts < t.ts) AS ct
FROM test t
ORDER BY ts;
FNC - Function
Loop over rows in chronological order with row_number() in a plpgsql function, and combine that with a cursor over the same query spanning the desired time frame. Then we can just subtract row numbers:
CREATE OR REPLACE FUNCTION running_window_ct(_intv interval = '1 hour')
RETURNS TABLE (id bigint, ts timestamp, ct int)
LANGUAGE plpgsql AS
$func$
DECLARE
cur CURSOR FOR
SELECT t.ts + _intv AS ts1, row_number() OVER (ORDER BY t.ts) AS rn
FROM test t ORDER BY t.ts;
rec record;
rn int;
BEGIN
OPEN cur;
FETCH cur INTO rec;
ct := -1; -- init
FOR id, ts, rn IN
SELECT t.id, t.ts, row_number() OVER (ORDER BY t.ts)
FROM test t ORDER BY t.ts
LOOP
IF rec.ts1 >= ts THEN
ct := ct + 1;
ELSE
LOOP
FETCH cur INTO rec;
EXIT WHEN rec.ts1 >= ts;
END LOOP;
ct := rn - rec.rn;
END IF;
RETURN NEXT;
END LOOP;
END
$func$;
Call with default interval of one hour:
SELECT * FROM running_window_ct();
Or with any interval:
SELECT * FROM running_window_ct('2 hour - 3 second');
db<>fiddle here
Old sqlfiddle
Benchmark
With the table from above I ran a quick benchmark on my old test server (PostgreSQL 9.1.9 on Debian).
-- TRUNCATE test;
INSERT INTO test
SELECT g, '2013-08-08'::timestamp
+ g * interval '5 min'
+ random() * 300 * interval '1 min' -- halfway realistic values
FROM generate_series(1, 10000) g;
CREATE INDEX test_ts_idx ON test (ts);
ANALYZE test; -- temp table needs manual analyze
I varied the number of generated rows (the generate_series() limit above) for each run and took the best of 5 with EXPLAIN ANALYZE.
100 rows
ROM: 27.656 ms
ARR: 7.834 ms
COR: 5.488 ms
FNC: 1.115 ms
1000 rows
ROM: 2116.029 ms
ARR: 189.679 ms
COR: 65.802 ms
FNC: 8.466 ms
5000 rows
ROM: 51347 ms !!
ARR: 3167 ms
COR: 333 ms
FNC: 42 ms
100000 rows
ROM: DNF
ARR: DNF
COR: 6760 ms
FNC: 828 ms
The function is the clear victor. It is fastest by an order of magnitude and scales best.
Array handling cannot compete.
Rolling count of rows within time interval
Sounds like an application for window functions. But, sadly, that's not the case. Window frames can only be based on row counts, not on actual column values.
A simple query with LEFT JOIN
can do the job:
SELECT t0.order_id
, count(t1.time_created) AS count_within_3_sec
FROM tbl t0
LEFT JOIN tbl t1 ON t1.time_created BETWEEN t0.time_created - interval '3 sec'
AND t0.time_created
GROUP BY 1
ORDER BY 1;
db<>fiddle here
Does not work with time like in your minimal demo, as time does not wrap around midnight. I suppose it's reasonable to assume timestamp or timestamptz.
Since you include each row itself in the count, an INNER JOIN
would work, too. (LEFT JOIN
is still more reliable in the face of possible NULL values.)
Or use a LATERAL
subquery and you don't need to aggregate on the outer query level:
SELECT t0.order_id
, t1.count_within_3_sec
FROM tbl t0
LEFT JOIN LATERAL (
SELECT count(*) AS count_within_3_sec
FROM tbl t1
WHERE t1.time_created BETWEEN t0.time_created - interval '3 sec'
AND t0.time_created
) t1 ON true
ORDER BY 1;
Related:
- Rolling sum / count / average over date interval
For big tables and many rows in the time frame, a procedural solution that walks through the table once will perform better. Like:
- Window Functions or Common Table Expressions: count previous rows within range
- Alternatives to broken PL/ruby: convert a warehouse journal table
- GROUP BY and aggregate sequential numeric values
Count over rows in previous time range partitioned by a specific column
DB design
First off, your table should be normalized: industry should be a small foreign key column (typically integer) referencing industry_id of an industry table. Maybe you have that already and only simplified for the sake of the question. Your actual table definition would go a long way.
Since rows with an indicator are rare but highly interesting, create a (possibly "covering") partial index to make any solution faster:
CREATE INDEX tbl_indicator_idx ON tbl (industry, day)
WHERE indicator <> 0;
Equality first, range last.
Assuming that indicator
is defined NOT NULL
. If industry
was an integer
, this index would be perfectly efficient.
Query
This query identifies rows to be reset:
WITH x AS ( -- only with indicator
SELECT DISTINCT industry, day
FROM tbl t
WHERE indicator <> 0
)
SELECT industry, day
FROM (
SELECT i.industry, d.day, x.day IS NOT NULL AS incident
, count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
FROM (
SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
FROM (
SELECT generate_series(min(day), max(day), interval '1d')::date AS day
FROM x
) d
WHERE extract('ISODOW' FROM d.day) < 6
) d
CROSS JOIN (SELECT DISTINCT industry FROM x) i
LEFT JOIN x USING (industry, day)
) sub
WHERE incident
AND ct > 1
ORDER BY 1, 2;
SQL Fiddle.
ISODOW as extract() parameter is convenient for excluding weekends.
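To pin down the mechanics, here is a small Python sketch of the same idea (helper names are mine): number the weekdays consecutively, then flag an incident when another incident for the same industry falls within the 3 preceding business days. It assumes incident days are themselves weekdays:

```python
from datetime import date, timedelta

def rows_to_reset(incidents):
    """incidents: set of (industry, day) pairs with a non-zero indicator.
    Returns pairs preceded by another incident for the same industry
    within the previous 3 business days (ROWS BETWEEN 3 PRECEDING)."""
    days = [d for _, d in incidents]
    # number all weekdays spanning the data consecutively (day_nr)
    d, grid, nr = min(days), {}, 0
    while d <= max(days):
        if d.isoweekday() < 6:  # extract('isodow') < 6
            grid[d] = nr
            nr += 1
        d += timedelta(days=1)
    out = []
    for ind, day in sorted(incidents):
        # any earlier incident within 3 business days? (ct > 1)
        if any((ind, d2) in incidents
               for d2, n in grid.items()
               if 0 < grid[day] - n <= 3):
            out.append((ind, day))
    return out
```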
Integrate this in your UPDATE
:
WITH x AS ( -- only with indicator
SELECT DISTINCT industry, day
FROM tbl t
WHERE indicator <> 0
)
UPDATE tbl t
SET indicator = 0
FROM (
SELECT i.industry, d.day, x.day IS NOT NULL AS incident
, count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
FROM (
SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
FROM (
SELECT generate_series(min(day), max(day), interval '1d')::date AS day
FROM x
) d
WHERE extract('isodow' FROM d.day) < 6
) d
CROSS JOIN (SELECT DISTINCT industry FROM x) i
LEFT JOIN x USING (industry, day)
) u
WHERE u.incident
AND u.ct > 1
AND t.industry = u.industry
AND t.day = u.day;
This should be substantially faster than your solution with correlated subqueries and a function call for every row. Even if that's based on my own previous answer, it's not perfect for this case.
Postgres - Update running count whenever row meets a certain condition
Here is one option using analytic functions:
WITH cte AS (
SELECT *, CASE WHEN SUM(price*quantity) OVER (ORDER BY id) = 0 THEN 1 ELSE 0 END AS price_sum
FROM yourTable
),
cte2 AS (
SELECT *, LAG(price_sum, 1, 0) OVER (ORDER BY id) price_sum_lag
FROM cte
)
SELECT id, price, quantity, 1 + SUM(price_sum_lag) OVER (ORDER BY id) cumulative_total
FROM cte2
ORDER BY id;
Demo
You may try running each CTE in succession to see how the logic is working.
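If it helps to trace that logic, here is a small Python re-implementation (my naming; rows given as (price, quantity) tuples ordered by id) of what the two CTEs compute:

```python
def cumulative_totals(rows):
    """rows: (price, quantity) tuples ordered by id.
    Re-implements the two CTEs: flag rows where the running sum of
    price*quantity returns to zero, then start a new counter on the
    row *after* each flagged row (that is what the LAG achieves)."""
    out, running, counter = [], 0, 1
    flag_prev = 0  # LAG(price_sum, 1, 0): flag of the previous row
    for price, qty in rows:
        counter += flag_prev  # 1 + SUM(price_sum_lag) OVER (ORDER BY id)
        out.append(counter)
        running += price * qty
        flag_prev = 1 if running == 0 else 0
    return out
```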
Count number of years since last deduction
Assuming the current year is the most recent year, you would use aggregation:
select account, max(fy),
sum(case when fy = max_fy then deduction end) as this_year_deduction,
max(fy) - max(case when deduction < 0 then fy end) as years_since_deduction
from (select t.*, max(fy) over (partition by account) as max_fy
from t
) t
group by account;
Note: I assume the third column is the most recent deduction. The query uses a window function to extract that.
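The same aggregation is easy to verify procedurally. A Python sketch (my naming; rows given as (account, fy, deduction) tuples), returning per account the latest fiscal year, that year's deduction, and years since the last negative deduction:

```python
from collections import defaultdict

def years_since_deduction(rows):
    """rows: (account, fy, deduction) tuples.
    Per account: latest fiscal year, its total deduction, and years
    elapsed since the last negative deduction (None if there is none)."""
    by_acct = defaultdict(list)
    for acct, fy, ded in rows:
        by_acct[acct].append((fy, ded))
    result = {}
    for acct, recs in by_acct.items():
        max_fy = max(fy for fy, _ in recs)
        this_year = sum(d for fy, d in recs if fy == max_fy)
        neg_years = [fy for fy, d in recs if d < 0]
        result[acct] = (max_fy, this_year,
                        max_fy - max(neg_years) if neg_years else None)
    return result
```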
Calculate moving sum/count by time condition using window function and filter PostgreSQL
You want a window function with a window frame definition using range
:
select t1.*,
sum(quantity) over (order by time
range between interval '29 day' preceding and current row
)
from t1 ;
EDIT:
If you have data for all dates, you can use rows
:
select t1.*,
sum(quantity) over (order by time
rows between 29 preceding and current row
)
from t1 ;
EDIT II:
If you need to deal with missing days in older versions of Postgres that do not support range
, then expanding the data is probably the simplest method:
select t1.*,
sum(quantity) over (order by time
rows between 29 preceding and current row
)
from (select generate_series(min(t1.time), max(t1.time), interval '1 day') as dte
from t1
) d left join
t1
on d.dte = t1.time;
You may want to filter out the additional rows:
select t1.*
from (select t1.*,
sum(quantity) over (order by time
rows between 29 preceding and current row
) as running_sum
from (select generate_series(min(t1.time), max(t1.time), interval '1 day') as dte
from t1
) d left join
t1
on d.dte = t1.time
) t1
where t1.time is not null;
How to improve SQL query performance containing partially common subqueries
How can I reuse the common portion of the subquery ...?
Use conditional aggregates in a single LATERAL
subquery:
SELECT t1.*, t2.positive, t2.negative
FROM tableA t1
CROSS JOIN LATERAL (
SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
, COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000
) t2;
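For reference, the per-row semantics of that LATERAL subquery can be pinned down in a few lines of Python (my naming; quadratic brute force, purely illustrative):

```python
def pos_neg_counts(rows, window=1000, delta=10):
    """rows: (sys_timestamp, event_count) tuples.
    For each row, look at rows with timestamp in (t, t + window] and
    count those at least `delta` above / below its event_count,
    mirroring the two FILTERed aggregates in the LATERAL subquery."""
    out = []
    for t, e in rows:
        later = [e2 for t2, e2 in rows if t < t2 <= t + window]
        out.append((sum(e2 >= e + delta for e2 in later),   # positive
                    sum(e2 <= e - delta for e2 in later)))  # negative
    return out
```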
It can be a CROSS JOIN
because the subquery always returns a row. See:
- JOIN (SELECT ... ) ue ON 1=1?
- What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Use conditional aggregates with the FILTER
clause to base multiple aggregates on the same time frame. See:
- Aggregate columns with additional (distinct) filters
event_count
should probably be integer
or bigint
. See:
- PostgreSQL using UUID vs Text as primary key
- Is there any difference in saving same value in different integer types?
sys_timestamp
should probably be timestamp
or timestamptz
. See:
- Ignoring time zones altogether in Rails and PostgreSQL
An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.
Depending on exact data distribution (most importantly, how much the time frames overlap) and other db characteristics, a tailored procedural solution may be faster yet. It can be done in any client-side language, but a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc. See:
- Window Functions or Common Table Expressions: count previous rows within range
- What are the pros and cons of performing calculations in sql vs. in your application
How to perform a filtered query in a PostgreSQL window partition?
What you have in mind is simply not possible using the frame definition of a window function. (You were beginning to suspect as much.) The RANGE or ROWS clauses count distinct values or rows and have no concept of the meaning of values.
You want to count all rows that fall in a certain period of time and need to go about this differently. You could run a correlated subquery or a LATERAL
subquery to count for each and every row, but that's expensive.
A smarter way would be to run through two cursors in parallel and keep a running count. I implemented exactly that for a very similar question:
- Window Functions or Common Table Expressions: count previous rows within range
Scales much better. I added detailed benchmarks over there.