Best Way to Count Rows by Arbitrary Time Intervals

Best way to count rows by arbitrary time intervals

Luckily, you are using PostgreSQL. The window function generate_series() is your friend.

Test case

Given the following test table (which you should have provided):

CREATE TABLE event(event_id serial, ts timestamp);
INSERT INTO event (ts)
SELECT generate_series(timestamp '2018-05-01'
, timestamp '2018-05-08'
, interval '7 min') + random() * interval '7 min';

One event for every 7 minutes (plus 0 to 7 minutes, randomly).

Basic solution

This query counts events for any arbitrary time interval. 17 minutes in the example:

WITH grid AS (
SELECT start_time
, lead(start_time, 1, 'infinity') OVER (ORDER BY start_time) AS end_time
FROM (
SELECT generate_series(min(ts), max(ts), interval '17 min') AS start_time
FROM event
) sub
)
SELECT start_time, count(e.ts) AS events
FROM grid g
LEFT JOIN event e ON e.ts >= g.start_time
AND e.ts < g.end_time
GROUP BY start_time
ORDER BY start_time;

The query retrieves minimum and maximum ts from the base table to cover the complete time range. You can use an arbitrary time range instead.

Provide any time interval as needed.

Produces one row for every time slot. If no event happened during that interval, the count is 0.

Be sure to handle upper and lower bound correctly. See:

  • Unexpected results from SQL query with BETWEEN timestamps

The window function lead() has an often overlooked feature: it can provide a default for when no leading row exists. Providing 'infinity' in the example. Else the last interval would be cut off with an upper bound NULL.

Minimal equivalent

The above query uses a CTE and lead() and verbose syntax. Elegant and maybe easier to understand, but a bit more expensive. Here is a shorter, faster, minimal version:

SELECT start_time, count(e.ts) AS events
FROM (SELECT generate_series(min(ts), max(ts), interval '17 min') FROM event) g(start_time)
LEFT JOIN event e ON e.ts >= g.start_time
AND e.ts < g.start_time + interval '17 min'
GROUP BY 1
ORDER BY 1;

Example for "every 15 minutes in the past week"`

Formatted with to_char().

SELECT to_char(start_time, 'YYYY-MM-DD HH24:MI'), count(e.ts) AS events
FROM generate_series(date_trunc('day', localtimestamp - interval '7 days')
, localtimestamp
, interval '15 min'
) g(start_time)
LEFT JOIN event e ON e.ts >= g.start_time
AND e.ts < g.start_time + interval '15 min'
GROUP BY start_time
ORDER BY start_time;

Still ORDER BY and GROUP BY on the underlying timestamp value, not on the formatted string. That's faster and more reliable.

db<>fiddle here

Related answer producing a running count over the time frame:

  • PostgreSQL: running count of rows for a query 'by minute'

How to create an efficient query which will count of the records by a specific time interval?

Your query seems complicated. You only need to generate the sequence of times and then use left join to bring them together . . . and aggregate:

select g.ts,  g.ts + interval '4 second', count(ds.id)
from (select generate_series(min(starttime), max(strttime), interval '4 second') as ts
from data_store
) g left join
data_store ds
on ds.starttime >= g.ts and ds.starttime < g.ts + interval '4 second'
group by g.ts
order by g.ts;

Note: If you want the interval to begin on an exact second (and not have some strange number of milliseconds 999 times out of 1000), then use date_trunc().

EDIT:

It might be worth seeing if a correlated subquery is faster:

select gs.ts,
(select count(*)
from data_store ds
where ds.starttime >= g.ts and ds.starttime < g.ts + interval '4 second'
) as cnt
from (select generate_series(min(starttime), max(strttime), interval '4 second') as ts
from data_store
) g;

How can I aggregate a count of rows by time intervals in PostgreSQL?

You can use generate_series() to generate the timestamps. Then unnest, filter and aggregate:

select gs.ts, count(i.time) as num_item1
from generate_series('2020-12-06 11:30:00.000'::timestamp, '2020-12-06 11:55:00.000', interval '5 minute') gs(ts) left join
(items i join lateral
unnest(i.items) item
on item = 'item1'
)
on i.time >= gs.ts and i.time < gs.ts + interval '5 minute'
group by gs.ts
order by 1;

Here is a db<>fiddle.

Column of counts for time intervals

Looking at just the data you provided, this does the trick:

SELECT v.id, 
i.interval,
coalesce((CASE WHEN sub.cnt < 3 THEN sub.cnt ELSE 0 END), 0) AS count
FROM (VALUES('a'), ('b'), ('c')) v(id)
CROSS JOIN generate_series(1, 7) i(interval)
LEFT JOIN (
SELECT id, ((date - '2015-10-30')/7 + 1)::int AS interval, count(*) AS cnt
FROM my_table
GROUP BY 1, 2) sub USING (id, interval)
ORDER BY 2, 1;

A few words of explanation:

  • You have three id values which are here recreated with a VALUES clause. If you have many more or don't know beforehand which id's to enumerate, you can always replace the VALUES clause with a sub-query.
  • You provide a specific date range over 7 weeks. Since you might have weeks where a certain id is not present you need to generate a series of the interval values and CROSS JOIN that to the id values above. This yields the 21 rows you are looking for.
  • Then you calculate the occurrences of ids in intervals. You can subtract a date from another date which will give you the number of days in between. So subtract the date of the row from the earliest date, divide that by 7 to get the interval period, add 1 to make the interval 1-based and convert to integer. You can then convert counts of > 2 to 0 and NULL to 0 with a combination of CASE and coalesce().
  • The query outputs the interval too, otherwise you will have no clue what the data refers to. Optionally, you can turn this into a column which shows the date range of the interval.

More flexible solution

If you have more ids and a larger date range, you can use the below version which first determines the distinct ids and the date range. Note that the interval is now 0-based to make calculations easier. Not that it matters much because instead of the interval number, the corresponding date range is displayed.

WITH mi AS (
SELECT min(date) AS min, ((max(date) - min(date))/7)::int AS intv FROM my_table)
SELECT v.id,
to_char((mi.min + i.intv * 7)::timestamp, 'YYYY-mm-dd') || ' - ' ||
to_char((mi.min + i.intv * 7 + 6)::timestamp, 'YYYY-mm-dd') AS period,
coalesce((CASE WHEN sub.cnt < 3 THEN sub.cnt ELSE 0 END), 0) AS count
FROM mi,
(SELECT DISTINCT id FROM my_table) v
CROSS JOIN LATERAL generate_series(0, mi.intv) i(intv)
LEFT JOIN LATERAL (
SELECT id, ((date - mi.min)/7)::int AS intv, count(*) AS cnt
FROM my_table
GROUP BY 1, 2) sub USING (id, intv)
ORDER BY 2, 1;

SQLFiddle with both solutions.



Related Topics



Leave a reply



Submit