Best way to count rows by arbitrary time intervals
Luckily, you are using PostgreSQL. The window function generate_series()
is your friend.
Test case
Given the following test table (which you should have provided):
CREATE TABLE event(event_id serial, ts timestamp);
INSERT INTO event (ts)
SELECT generate_series(timestamp '2018-05-01'
, timestamp '2018-05-08'
, interval '7 min') + random() * interval '7 min';
One event for every 7 minutes (plus 0 to 7 minutes, randomly).
Basic solution
This query counts events for any arbitrary time interval. 17 minutes in the example:
WITH grid AS (
SELECT start_time
, lead(start_time, 1, 'infinity') OVER (ORDER BY start_time) AS end_time
FROM (
SELECT generate_series(min(ts), max(ts), interval '17 min') AS start_time
FROM event
) sub
)
SELECT start_time, count(e.ts) AS events
FROM grid g
LEFT JOIN event e ON e.ts >= g.start_time
AND e.ts < g.end_time
GROUP BY start_time
ORDER BY start_time;
The query retrieves minimum and maximum ts
from the base table to cover the complete time range. You can use an arbitrary time range instead.
Provide any time interval as needed.
Produces one row for every time slot. If no event happened during that interval, the count is 0
.
Be sure to handle upper and lower bound correctly. See:
- Unexpected results from SQL query with BETWEEN timestamps
The window function lead()
has an often overlooked feature: it can provide a default for when no leading row exists. Providing 'infinity'
in the example. Else the last interval would be cut off with an upper bound NULL
.
Minimal equivalent
The above query uses a CTE and lead()
and verbose syntax. Elegant and maybe easier to understand, but a bit more expensive. Here is a shorter, faster, minimal version:
SELECT start_time, count(e.ts) AS events
FROM (SELECT generate_series(min(ts), max(ts), interval '17 min') FROM event) g(start_time)
LEFT JOIN event e ON e.ts >= g.start_time
AND e.ts < g.start_time + interval '17 min'
GROUP BY 1
ORDER BY 1;
Example for "every 15 minutes in the past week"`
Formatted with to_char()
.
SELECT to_char(start_time, 'YYYY-MM-DD HH24:MI'), count(e.ts) AS events
FROM generate_series(date_trunc('day', localtimestamp - interval '7 days')
, localtimestamp
, interval '15 min') g(start_time)
LEFT JOIN event e ON e.ts >= g.start_time
AND e.ts < g.start_time + interval '15 min'
GROUP BY start_time
ORDER BY start_time;
Still ORDER BY
and GROUP BY
on the underlying timestamp value, not on the formatted string. That's faster and more reliable.
db<>fiddle here
Related answer producing a running count over the time frame:
- PostgreSQL: running count of rows for a query 'by minute'
How to create an efficient query which will count of the records by a specific time interval?
Your query seems complicated. You only need to generate the sequence of times and then use left join
to bring them together . . . and aggregate:
select g.ts, g.ts + interval '4 second', count(ds.id)
from (select generate_series(min(starttime), max(strttime), interval '4 second') as ts
from data_store
) g left join
data_store ds
on ds.starttime >= g.ts and ds.starttime < g.ts + interval '4 second'
group by g.ts
order by g.ts;
Note: If you want the interval to begin on an exact second (and not have some strange number of milliseconds 999 times out of 1000), then use date_trunc()
.
EDIT:
It might be worth seeing if a correlated subquery is faster:
select gs.ts,
(select count(*)
from data_store ds
where ds.starttime >= g.ts and ds.starttime < g.ts + interval '4 second'
) as cnt
from (select generate_series(min(starttime), max(strttime), interval '4 second') as ts
from data_store
) g;
How can I aggregate a count of rows by time intervals in PostgreSQL?
You can use generate_series()
to generate the timestamps. Then unnest, filter and aggregate:
select gs.ts, count(i.time) as num_item1
from generate_series('2020-12-06 11:30:00.000'::timestamp, '2020-12-06 11:55:00.000', interval '5 minute') gs(ts) left join
(items i join lateral
unnest(i.items) item
on item = 'item1'
)
on i.time >= gs.ts and i.time < gs.ts + interval '5 minute'
group by gs.ts
order by 1;
Here is a db<>fiddle.
Column of counts for time intervals
Looking at just the data you provided, this does the trick:
SELECT v.id,
i.interval,
coalesce((CASE WHEN sub.cnt < 3 THEN sub.cnt ELSE 0 END), 0) AS count
FROM (VALUES('a'), ('b'), ('c')) v(id)
CROSS JOIN generate_series(1, 7) i(interval)
LEFT JOIN (
SELECT id, ((date - '2015-10-30')/7 + 1)::int AS interval, count(*) AS cnt
FROM my_table
GROUP BY 1, 2) sub USING (id, interval)
ORDER BY 2, 1;
A few words of explanation:
- You have three
id
values which are here recreated with aVALUES
clause. If you have many more or don't know beforehand which id's to enumerate, you can always replace theVALUES
clause with a sub-query. - You provide a specific date range over 7 weeks. Since you might have weeks where a certain
id
is not present you need to generate a series of theinterval
values andCROSS JOIN
that to theid
values above. This yields the 21 rows you are looking for. - Then you calculate the occurrences of
id
s inintervals
. You can subtract adate
from anotherdate
which will give you the number of days in between. So subtract the date of the row from the earliest date, divide that by 7 to get the interval period, add 1 to make theinterval
1-based and convert to integer. You can then convert counts of > 2 to 0 andNULL
to 0 with a combination ofCASE
andcoalesce()
. - The query outputs the
interval
too, otherwise you will have no clue what the data refers to. Optionally, you can turn this into a column which shows the date range of the interval.
More flexible solution
If you have more id
s and a larger date range, you can use the below version which first determines the distinct id
s and the date range. Note that the interval
is now 0-based to make calculations easier. Not that it matters much because instead of the interval number, the corresponding date range is displayed.
WITH mi AS (
SELECT min(date) AS min, ((max(date) - min(date))/7)::int AS intv FROM my_table)
SELECT v.id,
to_char((mi.min + i.intv * 7)::timestamp, 'YYYY-mm-dd') || ' - ' ||
to_char((mi.min + i.intv * 7 + 6)::timestamp, 'YYYY-mm-dd') AS period,
coalesce((CASE WHEN sub.cnt < 3 THEN sub.cnt ELSE 0 END), 0) AS count
FROM mi,
(SELECT DISTINCT id FROM my_table) v
CROSS JOIN LATERAL generate_series(0, mi.intv) i(intv)
LEFT JOIN LATERAL (
SELECT id, ((date - mi.min)/7)::int AS intv, count(*) AS cnt
FROM my_table
GROUP BY 1, 2) sub USING (id, intv)
ORDER BY 2, 1;
SQLFiddle with both solutions.
Related Topics
Return Multiple Fields as a Record in Postgresql with Pl/Pgsql
Detect Consecutive Dates Ranges Using SQL
How to Interpret Precision and Scale of a Number in a Database
Basic Recursive Query on SQLite3
What Datatype to Use When Storing Latitude and Longitude Data in SQL Databases
Find Closest Date in SQL Server
Performance Issue in Using Select *
How to Kill a Running Select Statement
How to Get Second Largest or Third Largest Entry from a Table
SQL - Select First 10 Rows Only
Disable Rails SQL Logging in Console
Hierarchical Data in Linq - Options and Performance
Select Data from Date Range Between Two Dates
Fastest Check If Row Exists in Postgresql
Ssis Package Not Wanting to Fetch Metadata of Temporary Table