Best Way to Count Rows by Arbitrary Time Intervals

Best way to count rows by arbitrary time intervals

Luckily, you are using PostgreSQL. The window function generate_series() is your friend.

Test case

Given the following test table (which you should have provided):

CREATE TABLE event(event_id serial, ts timestamp);
INSERT INTO event (ts)
SELECT generate_series(timestamp '2018-05-01'
                     , timestamp '2018-05-08'
                     , interval '7 min') + random() * interval '7 min';

One event for every 7 minutes (plus 0 to 7 minutes, randomly).

Basic solution

This query counts events for any arbitrary time interval. 17 minutes in the example:

WITH grid AS (
   SELECT start_time
        , lead(start_time, 1, 'infinity') OVER (ORDER BY start_time) AS end_time
   FROM  (
      SELECT generate_series(min(ts), max(ts), interval '17 min') AS start_time
      FROM   event
      ) sub
   )
SELECT start_time, count(e.ts) AS events
FROM   grid       g
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.end_time
GROUP  BY start_time
ORDER  BY start_time;

The query retrieves minimum and maximum ts from the base table to cover the complete time range. You can use an arbitrary time range instead.

Provide any time interval as needed.

Produces one row for every time slot. If no event happened during that interval, the count is 0.

Be sure to handle upper and lower bound correctly. See:

Unexpected results from SQL query with BETWEEN timestamps

The window function lead() has an often overlooked feature: it can provide a default for when no leading row exists. Providing 'infinity' in the example. Else the last interval would be cut off with an upper bound NULL.

Minimal equivalent

The above query uses a CTE and lead() and verbose syntax. Elegant and maybe easier to understand, but a bit more expensive. Here is a shorter, faster, minimal version:

SELECT start_time, count(e.ts) AS events
FROM  (SELECT generate_series(min(ts), max(ts), interval '17 min') FROM event) g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + interval '17 min'
GROUP  BY 1
ORDER  BY 1;

Example for "every 15 minutes in the past week"`

Formatted with to_char().

SELECT to_char(start_time, 'YYYY-MM-DD HH24:MI'), count(e.ts) AS events
FROM   generate_series(date_trunc('day', localtimestamp - interval '7 days')
                     , localtimestamp
                     , interval '15 min') g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + interval '15 min'
GROUP  BY start_time
ORDER  BY start_time;

Still ORDER BY and GROUP BY on the underlying timestamp value, not on the formatted string. That's faster and more reliable.

db<>fiddle here

Related answer producing a running count over the time frame:

PostgreSQL: running count of rows for a query 'by minute'

How to create an efficient query which will count of the records by a specific time interval?

Your query seems complicated. You only need to generate the sequence of times and then use left join to bring them together . . . and aggregate:

select g.ts,  g.ts + interval '4 second', count(ds.id)
from (select generate_series(min(starttime), max(strttime), interval '4 second') as ts
      from data_store
     ) g left join
     data_store ds
     on ds.starttime >= g.ts and ds.starttime < g.ts + interval '4 second'
group by g.ts
order by g.ts;

Note: If you want the interval to begin on an exact second (and not have some strange number of milliseconds 999 times out of 1000), then use date_trunc().

EDIT:

It might be worth seeing if a correlated subquery is faster:

select gs.ts,
       (select count(*)
        from data_store ds
        where ds.starttime >= g.ts and ds.starttime < g.ts + interval '4 second'
       ) as cnt
from (select generate_series(min(starttime), max(strttime), interval '4 second') as ts
      from data_store
     ) g;

How can I aggregate a count of rows by time intervals in PostgreSQL?

You can use generate_series() to generate the timestamps. Then unnest, filter and aggregate:

select gs.ts, count(i.time) as num_item1
from generate_series('2020-12-06 11:30:00.000'::timestamp, '2020-12-06 11:55:00.000', interval '5 minute') gs(ts) left join
     (items i join lateral
      unnest(i.items) item
      on item = 'item1'
     )
     on i.time >= gs.ts and i.time < gs.ts + interval '5 minute'
group by gs.ts
order by 1;

Here is a db<>fiddle.

Column of counts for time intervals

Looking at just the data you provided, this does the trick:

SELECT v.id, 
       i.interval, 
       coalesce((CASE WHEN sub.cnt < 3 THEN sub.cnt ELSE 0 END), 0) AS count
FROM (VALUES('a'), ('b'), ('c')) v(id)
CROSS JOIN generate_series(1, 7) i(interval)
LEFT JOIN (
  SELECT id, ((date - '2015-10-30')/7 + 1)::int AS interval, count(*) AS cnt
  FROM my_table
  GROUP BY 1, 2) sub USING (id, interval)
ORDER BY 2, 1;

A few words of explanation:

You have three id values which are here recreated with a VALUES clause. If you have many more or don't know beforehand which id's to enumerate, you can always replace the VALUES clause with a sub-query.
You provide a specific date range over 7 weeks. Since you might have weeks where a certain id is not present you need to generate a series of the interval values and CROSS JOIN that to the id values above. This yields the 21 rows you are looking for.
Then you calculate the occurrences of ids in intervals. You can subtract a date from another date which will give you the number of days in between. So subtract the date of the row from the earliest date, divide that by 7 to get the interval period, add 1 to make the interval 1-based and convert to integer. You can then convert counts of > 2 to 0 and NULL to 0 with a combination of CASE and coalesce().
The query outputs the interval too, otherwise you will have no clue what the data refers to. Optionally, you can turn this into a column which shows the date range of the interval.

Best Way to Count Rows by Arbitrary Time Intervals