Retrieve Aggregates for Arbitrary Time Intervals

Aggregate function over a given time interval

CREATE TABLE tt (time TIMESTAMP, value NUMBER);

INSERT INTO tt (time, value) VALUES ('06-JUN-12 12.40.00.000000000 PM', 2);
INSERT INTO tt (time, value) VALUES ('06-JUN-12 12.41.35.000000000 PM', 3);
INSERT INTO tt (time, value) VALUES ('06-JUN-12 12.43.22.000000000 PM', 4);
INSERT INTO tt (time, value) VALUES ('06-JUN-12 12.47.55.000000000 PM', 5);
INSERT INTO tt (time, value) VALUES ('06-JUN-12 12.52.00.000000000 PM', 2);
INSERT INTO tt (time, value) VALUES ('06-JUN-12 12.54.59.000000000 PM', 3);
INSERT INTO tt (time, value) VALUES ('06-JUN-12 12.56.01.000000000 PM', 4);

WITH tmin AS (
  SELECT MIN(time) t FROM tt
), tmax AS (
  SELECT MAX(time) t FROM tt
)
SELECT ranges.inf, ranges.sup, AVG(tt.value)
FROM (
  SELECT 5 * (level - 1) * (1/24/60) + tmin.t AS inf,
         5 *  level      * (1/24/60) + tmin.t AS sup
  FROM tmin, tmax
  CONNECT BY (5 * (level - 1) * (1/24/60) + tmin.t) < tmax.t
) ranges
JOIN tt ON tt.time BETWEEN ranges.inf AND ranges.sup
GROUP BY ranges.inf, ranges.sup
ORDER BY ranges.inf;

fiddle: http://sqlfiddle.com/#!4/9e314/11

edit: beaten by Justin, as usual... :-)
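
Not from the original answer, but worth spelling out: 1/24/60 is one minute expressed as a fraction of a day, so 5*(level-1)*(1/24/60) offsets the minimum timestamp by multiples of five minutes. A sketch of the same technique with the bucket width pulled into a named subquery (params and bounds are names introduced here, not from the answer):

-- Sketch only: same row-generator grid, but the bucket width lives in one place.
WITH params AS (
  SELECT 5 AS bucket_minutes FROM dual       -- width of each bucket, in minutes
), bounds AS (
  SELECT MIN(time) AS tmin, MAX(time) AS tmax FROM tt
)
SELECT ranges.inf, ranges.sup, AVG(tt.value) AS avg_value
FROM (
  SELECT bounds.tmin + (LEVEL - 1) * params.bucket_minutes / 24 / 60 AS inf,
         bounds.tmin +  LEVEL      * params.bucket_minutes / 24 / 60 AS sup
  FROM bounds CROSS JOIN params
  CONNECT BY bounds.tmin + (LEVEL - 1) * params.bucket_minutes / 24 / 60 < bounds.tmax
) ranges
JOIN tt ON tt.time BETWEEN ranges.inf AND ranges.sup
GROUP BY ranges.inf, ranges.sup
ORDER BY ranges.inf;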

Best way to count rows by arbitrary time intervals

Luckily, you are using PostgreSQL. The set-returning function generate_series() is your friend.

Test case

Given the following test table (which you should have provided):

CREATE TABLE event(event_id serial, ts timestamp);
INSERT INTO event (ts)
SELECT generate_series(timestamp '2018-05-01'
                     , timestamp '2018-05-08'
                     , interval '7 min') + random() * interval '7 min';

One event for every 7 minutes (plus 0 to 7 minutes, randomly).

Basic solution

This query counts events for any arbitrary time interval. 17 minutes in the example:

WITH grid AS (
   SELECT start_time
        , lead(start_time, 1, 'infinity') OVER (ORDER BY start_time) AS end_time
   FROM  (
      SELECT generate_series(min(ts), max(ts), interval '17 min') AS start_time
      FROM   event
      ) sub
   )
SELECT start_time, count(e.ts) AS events
FROM   grid g
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.end_time
GROUP  BY start_time
ORDER  BY start_time;

The query retrieves minimum and maximum ts from the base table to cover the complete time range. You can use an arbitrary time range instead.

Provide any time interval as needed.
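
If you want a fixed window rather than the observed min/max, plug in constant boundaries. A minimal sketch (the boundary timestamps below are arbitrary example values, not from the original answer):

SELECT start_time, count(e.ts) AS events
FROM   generate_series(timestamp '2018-05-01'
                     , timestamp '2018-05-08'
                     , interval '17 min') g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + interval '17 min'
GROUP  BY start_time
ORDER  BY start_time;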

Produces one row for every time slot. If no event happened during that interval, the count is 0.

Be sure to handle upper and lower bound correctly. See:

  • Unexpected results from SQL query with BETWEEN timestamps

The window function lead() has an often overlooked feature: it can provide a default for when no leading row exists. The example provides 'infinity'; otherwise the last interval would be cut off with an upper bound of NULL.
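
To see just that feature in isolation, a minimal standalone demo (the values are made up):

SELECT ts
     , lead(ts, 1, 'infinity') OVER (ORDER BY ts) AS end_time
FROM  (VALUES (timestamp '2018-05-01 00:00')
            , (timestamp '2018-05-01 00:17')
            , (timestamp '2018-05-01 00:34')) t(ts);

The last row gets end_time = 'infinity' instead of NULL.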

Minimal equivalent

The above query uses a CTE, lead(), and verbose syntax. Elegant and maybe easier to understand, but a bit more expensive. Here is a shorter, faster, minimal version:

SELECT start_time, count(e.ts) AS events
FROM  (SELECT generate_series(min(ts), max(ts), interval '17 min') FROM event) g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + interval '17 min'
GROUP  BY 1
ORDER  BY 1;
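
If the event table is large, an index on the timestamp column can help both the min()/max() lookup and the per-slot range conditions (whether the planner actually uses it depends on data distribution). A sketch, with an index name chosen here:

CREATE INDEX event_ts_idx ON event (ts);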

Example for "every 15 minutes in the past week"`

Formatted with to_char().

SELECT to_char(start_time, 'YYYY-MM-DD HH24:MI'), count(e.ts) AS events
FROM   generate_series(date_trunc('day', localtimestamp - interval '7 days')
                     , localtimestamp
                     , interval '15 min') g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + interval '15 min'
GROUP  BY start_time
ORDER  BY start_time;

Still ORDER BY and GROUP BY on the underlying timestamp value, not on the formatted string. That's faster and more reliable.

db<>fiddle here

Related answer producing a running count over the time frame:

  • PostgreSQL: running count of rows for a query 'by minute'

SQL to group time intervals by arbitrary time period

Based on Timekiller's advice, I have come up with the following query:

-- Since there's a problem with declaring variables in PostgreSQL,
-- we will be using aliases for the arguments required by the script.

-- First some configuration:
-- group_period = 3600 -- group by 1 hour (= 3600 seconds)
-- min_time = 1440226301 -- Sat, 22 Aug 2015 06:51:41 GMT
-- max_time = 1450926301 -- Thu, 24 Dec 2015 03:05:01 GMT

-- Calculate the number of started periods in the given interval in advance.
-- period_count = CEIL((max_time - min_time) / group_period)

SET TIME ZONE UTC;
BEGIN TRANSACTION;

-- Create a temporary table and fill it with all time periods.
CREATE TEMP TABLE periods (period_start TIMESTAMP)
ON COMMIT DROP;
INSERT INTO periods (period_start)
SELECT to_timestamp(min_time + group_period * coefficient)
FROM generate_series(0, period_count) as coefficient;

-- Group data by the time periods.
-- Note that we don't require exact overlap of intervals:
-- A. [period_start, period_start + group_period]
-- B. [time_start, time_start + duration]
-- This would yield the best possible result but it would also slow
-- down the query significantly because of the part B.
-- We require only: period_start <= time_start <= period_start + group_period
SELECT period_start,
       COUNT(measurements.*) AS count_measurements,
       SUM(count_event1)     AS sum_event1,
       SUM(count_event2)     AS sum_event2
FROM periods
LEFT JOIN measurements
       ON time_start BETWEEN period_start AND (period_start + group_period)
GROUP BY period_start;

COMMIT TRANSACTION;

It does exactly what I was going for, so mission accomplished. However, I would still appreciate it if anybody could give me some feedback on the performance of this query under the following conditions:

  • I expect the measurements table to have about 500-800 million rows.
  • The time_start column is the primary key and has a unique btree index on it.
  • I have no guarantees about min_time and max_time. I only know that the group period will be chosen so that 500 <= period_count <= 2000.
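
As a side note on building the period grid itself: generate_series() can also produce the timestamps directly, which avoids the manual period_count arithmetic. A sketch only, reusing the example epoch values and the 3600-second group period from the comments above:

-- Sketch: equivalent fill of the periods table without computing period_count.
INSERT INTO periods (period_start)
SELECT generate_series(to_timestamp(1440226301)   -- min_time
                     , to_timestamp(1450926301)   -- max_time
                     , interval '3600 seconds');  -- group_period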

How can I aggregate a count of rows by time intervals in PostgreSQL?

You can use generate_series() to generate the timestamps. Then unnest, filter and aggregate:

select gs.ts, count(i.time) as num_item1
from generate_series('2020-12-06 11:30:00.000'::timestamp, '2020-12-06 11:55:00.000', interval '5 minute') gs(ts) left join
     (items i join lateral
      unnest(i.items) item
      on item = 'item1'
     )
     on i.time >= gs.ts and i.time < gs.ts + interval '5 minute'
group by gs.ts
order by 1;

Here is a db<>fiddle.
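
The query assumes a table shaped roughly like the following (a guess at the schema, since the question's DDL is not shown); the lateral unnest expands each row's items array so that only rows containing 'item1' survive the join:

-- Hypothetical schema and sample data matching the column references above.
CREATE TABLE items (
    time  timestamp,
    items text[]
);
INSERT INTO items (time, items) VALUES
    ('2020-12-06 11:32:00', '{item1,item2}'),
    ('2020-12-06 11:47:00', '{item2}');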

Aggregate (count) occurences of values over arbitrary timeframe

We can read it using read.csv, convert the first column to a date-time binned into 6-minute intervals, and add a dummy column of 1's. Then re-read it using read.zoo, splitting on the type and aggregating on the dummy column:

# test data

Lines <- 'date,type
"Sep 22, 2011 12:54:53.081240000","2"
"Sep 22, 2011 12:54:53.083493000","2"
"Sep 22, 2011 12:54:53.084025000","2"
"Sep 22, 2011 12:54:53.086493000","2"
"Sep 22, 2011 12:54:53.081240000","3"
"Sep 22, 2011 12:54:53.083493000","3"
"Sep 22, 2011 12:54:53.084025000","3"
"Sep 22, 2011 12:54:53.086493000","4"'

library(zoo)
library(chron)

# convert to chron and bin into 6 minute bins using trunc
# Also add a dummy column of 1's
# and remove any leading space (removing space not needed if there is none)

DF <- read.csv(textConnection(Lines), as.is = TRUE)
fmt <- '%b %d, %Y %H:%M:%S'
DF <- transform(DF, dummy = 1,
date = trunc(as.chron(sub("^ *", "", date), format = fmt), "00:06:00"))

# split and aggregate

z <- read.zoo(DF, split = 2, aggregate = length)

With the above test data the solution looks like this:

> z
                    2 3 4
(09/22/11 12:54:00) 4 3 1

Note that the above has been done in wide form since that form constitutes a time series whereas the long form does not. There is one column for each type. In our test data we had types 2, 3 and 4 so there are three columns.

(We have used chron here since its trunc method fits well with binning into 6-minute groups. chron does not support time zones, which can be an advantage since you can't make one of the many possible time zone errors; but if you want POSIXct anyway, convert it at the end, e.g. time(z) <- as.POSIXct(paste(as.Date.dates(time(z)), times(time(z)) %% 1)). This expression is shown in a table in one of the R News 4/1 articles, except we used as.Date.dates instead of just as.Date to work around a bug that seems to have been introduced since then. We could also use time(z) <- as.POSIXct(time(z)), but that would result in a different time zone.)

EDIT:

The original solution binned into dates, but I noticed afterwards that you wish to bin into 6-minute periods, so the solution was revised.

EDIT:

Revised based on comment.


