Aggregating (x,y) Coordinate Point Clouds in PostgreSQL

Use the often overlooked built-in function width_bucket() in combination with your aggregation:

If your coordinates run from, say, 0 to 2000 and you want to consolidate everything within a distance of 5 into single points, I would lay out a grid with a cell size of 10 (5*2) like this:

SELECT device_id
, width_bucket(pos_x, 0, 2000, 2000/10) * 10 AS pos_x
, width_bucket(pos_y, 0, 2000, 2000/10) * 10 AS pos_y
, count(*) AS ct -- or any other aggregate
FROM tbl
GROUP BY 1,2,3
ORDER BY 1,2,3;

To minimize the error, you could GROUP BY the grid as demonstrated, but save the actual average coordinates:

SELECT device_id
, avg(pos_x)::int AS pos_x -- save actual averages to minimize error
, avg(pos_y)::int AS pos_y -- cast if you need to
, count(*) AS ct -- or any other aggregate
FROM tbl
GROUP BY
device_id
, width_bucket(pos_x, 0, 2000, 2000/10) * 10 -- aggregate by grid
, width_bucket(pos_y, 0, 2000, 2000/10) * 10
ORDER BY 1,2,3;

A sqlfiddle demonstrates both side by side.

Well, this particular case could be simpler:

...
GROUP BY
device_id
, (pos_x / 10) * 10 -- truncates last digit of an integer
, (pos_y / 10) * 10
...

But that's just because the demo grid size of 10 conveniently matches the decimal system. Try the same with a grid size of 17 or something ...
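
For instance, a minimal sketch of the same grouping with a grid size of 17, where the upper bound is extended to 2040 (= 120 * 17) so the range divides into whole cells:

SELECT device_id
, width_bucket(pos_x, 0, 2040, 120) * 17 AS pos_x -- 120 cells, each 17 wide
, width_bucket(pos_y, 0, 2040, 120) * 17 AS pos_y
, count(*) AS ct
FROM tbl
GROUP BY 1,2,3
ORDER BY 1,2,3;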



Expand to timestamps

You can expand this approach to cover date and timestamp values by converting them to the unix epoch (number of seconds since '1970-01-01') with extract().

SELECT extract(epoch FROM '2012-10-01 21:06:38+02'::timestamptz);

When you are done, convert the result back to timestamp with time zone:

SELECT timestamptz 'epoch' + 1349118398 * interval '1s';

Or simply to_timestamp():

SELECT to_timestamp(1349118398);

Aggregating connected sets of nodes / edges

A recursive query is the way to go:

with recursive tree as (
  select node, parent, length, node as root_id  -- remember the root node in each row
  from network
  where parent is null  -- start with the root nodes
  union all
  select c.node, c.parent, c.length, p.root_id  -- carry root_id down to every child
  from network c
  join tree p on p.node = c.parent
)
select root_id, array_agg(node) as edges_in_group, sum(length) as total_length
from tree
group by root_id;

The important thing is to keep the id of the root node in each recursion, so that you can group by that id in the final result.
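
A minimal sketch of a matching network table to try this on (table and column names taken from the query above; the sample rows are made up and form two separate trees):

CREATE TABLE network (
  node   int PRIMARY KEY
, parent int REFERENCES network(node) -- NULL marks a root node
, length numeric
);

INSERT INTO network (node, parent, length) VALUES
  (1, NULL, 0), (2, 1, 5), (3, 1, 4) -- first tree
, (10, NULL, 0), (11, 10, 7);        -- second tree

The query then returns one row per root_id, listing all connected nodes and the summed length.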

Performing Multiple Avg Queries in PostgreSQL


Only populated cells

Use the built-in function width_bucket() to get only grid cells with one or more matching rows in entries:

For a grid of 100 x 100 cells in the outer frame of box(point(_lat_start, _long_start), point(_lat_end, _long_end)):

SELECT width_bucket(gps_lat , _lat_start , _lat_end , 100) AS grid_lat
, width_bucket(gps_long, _long_start, _long_end, 100) AS grid_long
, avg(value) AS avg_val
FROM entries
WHERE point(gps_lat, gps_long) <@ box(point(_lat_start, _long_start)
, point(_lat_end , _long_end))
GROUP BY 1,2
ORDER BY 1,2;

<@ is the "contained in" operator for geometric types.

It's easy to wrap this into a function and parameterize the outer box and the number of grid cells.
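
A minimal sketch of such a function (the function name grid_avg and the parameter defaults are my own; assuming float8 coordinates, with a cast on the average for a stable return type):

CREATE FUNCTION grid_avg(_lat_start float8, _lat_end float8
                       , _long_start float8, _long_end float8
                       , _cells int DEFAULT 100)
  RETURNS TABLE (grid_lat int, grid_long int, avg_val numeric)
  LANGUAGE sql STABLE AS
$func$
SELECT width_bucket(gps_lat , _lat_start , _lat_end , _cells)
     , width_bucket(gps_long, _long_start, _long_end, _cells)
     , avg(value)::numeric
FROM   entries
WHERE  point(gps_lat, gps_long) <@ box(point(_lat_start, _long_start)
                                     , point(_lat_end , _long_end))
GROUP  BY 1,2
ORDER  BY 1,2
$func$;

Call it like SELECT * FROM grid_avg(50, 60, 10, 20); to get the default 100 x 100 grid.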

A multicolumn GiST expression index will help performance if only a small fraction of rows lies within the outer box. You'll need to install the btree_gist module first, once per database:

  • PostgreSQL EXCLUDE USING error: Data type integer has no default operator class
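
The module is installed once per database with:

CREATE EXTENSION btree_gist;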

Then:

CREATE INDEX entries_point_idx ON entries
USING gist (point(gps_lat, gps_long), value);

Adding value to the index only makes sense if you can get an index-only scan out of this in Postgres 9.2+.

If you are reading large parts of the table anyway, you don't need an index and it might be cheaper to run simple BETWEEN x AND y checks in the WHERE clause.

This is assuming a flat earth (which may be good enough for your purpose). If you want to be precise, you will have to dig deeper into PostGIS.

All cells in the grid

To get all cells use LEFT JOIN to a pre-generated grid like you already tried:

SELECT grid_lat, grid_long, g.avg_val  -- or use COALESCE
FROM generate_series(1,100) grid_lat
CROSS JOIN generate_series(1,100) grid_long
LEFT JOIN (<query from above>) g USING (grid_lat, grid_long)
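
If empty cells should report 0 instead of NULL, as the comment hints, a sketch of the COALESCE variant:

SELECT grid_lat, grid_long, COALESCE(g.avg_val, 0) AS avg_val
FROM generate_series(1,100) grid_lat
CROSS JOIN generate_series(1,100) grid_long
LEFT JOIN (<query from above>) g USING (grid_lat, grid_long);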

Related:

  • Aggregating (x,y) coordinate point clouds in PostgreSQL

calculating average with grouping based on time intervals


Simple and fast solution for this particular example:

SELECT date_trunc('minute', ts) AS minute
, sum(speed)/6 AS avg_speed
FROM speed_table AS t
WHERE ts >= '2014-06-21 0:0'
AND ts < '2014-06-22 0:0' -- exclude dangling corner case
AND condition2 = 'something'
GROUP BY 1
ORDER BY 1;

You need to factor in missing rows as "0 speed". Since a minute has 6 samples (one every 10 seconds), just sum and divide by 6. Missing rows evaluate to 0 implicitly.

This returns no row for minutes with no rows at all. If you need avg_speed = 0 for those missing minutes as well, use the general query below.

General query for arbitrary intervals

Works for any interval listed in the manual for date_trunc():

SELECT date_trunc('minute', g.ts) AS ts_start
, avg(COALESCE(speed, 0)) AS avg_speed
FROM (SELECT generate_series('2014-06-21 0:0'::timestamp
, '2014-06-22 0:0'::timestamp
, '10 sec'::interval) AS ts) g
LEFT JOIN speed_table t USING (ts)
WHERE (t.condition2 = 'something' OR
t.condition2 IS NULL) -- depends on actual condition!
AND g.ts <> '2014-06-22 0:0'::timestamp -- exclude dangling corner case
GROUP BY 1
ORDER BY 1;

The problematic part is the additional unknown condition. You would need to define that. And decide whether missing rows supplied by generate_series should pass the test or not (which can be tricky!).

I let them pass in my example (and all other rows with NULL values).

Compare:

PostgreSQL: running count of rows for a query 'by minute'

Arbitrary intervals:

Truncate timestamp to arbitrary intervals

For completely arbitrary intervals consider @Clodoaldo's math based on epoch values or use the often overlooked function width_bucket(). Example:

Aggregating (x,y) coordinate point clouds in PostgreSQL

PostgreSQL - Get count of items in a table grouped by a datetime column for N intervals

Generate the times you want and then use left join and aggregation:

select gs.ts, count(u.id)
from generate_series('2019-01-01T00:00:00'::timestamp,
'2019-01-01T12:00:00'::timestamp,
interval '12 hour'
) gs(ts) left join
users u
on u.created_at >= gs.ts and
u.created_at < gs.ts + interval '12 hour'
group by 1
order by 1;

EDIT:

If you want to specify the number of rows, you can use something similar:

select v.ts, count(u.id)
from generate_series(1, 10, 1) as gs(n) cross join lateral
     (values ('2019-01-01T00:00:00'::timestamp + (gs.n - 1) * interval '12 hour')
     ) v(ts) left join
     users u
     on u.created_at >= v.ts and
        u.created_at < v.ts + interval '12 hour'
group by 1
order by 1;

Multiple averages over evenly spaced intervals

I suggest the handy function width_bucket():

To get the average for each time segment ("bin"):

SELECT width_bucket(extract(epoch FROM t.the_date)
, x.min_epoch, x.max_epoch, x.bins) AS bin
, avg(value) AS bin_avg
FROM tbl t
, (SELECT extract(epoch FROM min(the_date)) AS min_epoch
, extract(epoch FROM max(the_date)) AS max_epoch
, 10 AS bins
FROM tbl t
) x
GROUP BY 1;

To get the "running average" over the (step-by-step) growing time interval:

SELECT bin, round(sum(bin_sum) OVER w / sum(bin_ct) OVER w, 2) AS running_avg
FROM (
SELECT width_bucket(extract(epoch FROM t.the_date)
, x.min_epoch, x.max_epoch, x.bins) AS bin
, sum(value) AS bin_sum
, count(*) AS bin_ct
FROM tbl t
, (SELECT extract(epoch FROM min(the_date)) AS min_epoch
, extract(epoch FROM max(the_date)) AS max_epoch
, 10 AS bins
FROM tbl t
) x
GROUP BY 1
) sub
WINDOW w AS (ORDER BY bin)
ORDER BY 1;

I use the_date instead of date as column name, avoiding reserved words as identifiers.

Since width_bucket() is currently only implemented for double precision and numeric, I extract epoch values from the_date. Details here:

Aggregating (x,y) coordinate point clouds in PostgreSQL

SQL Query timestamp date field and group all results by date

Simplest / fastest way to extract a date from a Unix epoch:

SELECT to_timestamp(my_epoch_ts)::date;

Example:

SELECT to_timestamp(1349118398)::date

Result:

2012-10-01

That's making use of the to_timestamp() function, followed by a cast to date: ::date.

More details and links at the end of this recent related answer.

Truncate timestamp to arbitrary intervals

Consider this demo to bring timestamps down to a resolution of 15 minutes and aggregate resulting dupes:

WITH tbl(id, ts) AS ( VALUES
(1::int, '2012-10-04 00:00:00'::timestamp)
,(2, '2012-10-04 18:23:01')
,(3, '2012-10-04 18:30:00')
,(4, '2012-10-04 18:52:33')
,(5, '2012-10-04 18:55:01')
,(6, '2012-10-04 18:59:59')
,(7, '2012-10-05 11:01:01')
)
SELECT to_timestamp((extract(epoch FROM ts)::bigint / 900)*900)::timestamp
AS lower_bound
, to_timestamp(avg(extract(epoch FROM ts)))::timestamp AS avg_ts
, count(*) AS ct
FROM tbl
GROUP BY 1
ORDER BY 1;

Result:

     lower_bound     |       avg_ts        | ct
---------------------+---------------------+----
 2012-10-04 00:00:00 | 2012-10-04 00:00:00 |  1
 2012-10-04 18:15:00 | 2012-10-04 18:23:01 |  1
 2012-10-04 18:30:00 | 2012-10-04 18:30:00 |  1
 2012-10-04 18:45:00 | 2012-10-04 18:55:51 |  3
 2012-10-05 11:00:00 | 2012-10-05 11:01:01 |  1

The trick is to extract a unix epoch like @Michael already posted. Integer division lumps the values together in buckets of the chosen resolution, because the remainder is simply truncated.

I divide by 900, because 15 minutes = 900 seconds.

Multiply by the same number to get the resulting lower_bound.
Convert the unix epoch back to a timestamp with to_timestamp().
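
A worked example for the 18:45 bucket above, step by step (output shown for a UTC session):

SELECT extract(epoch FROM timestamp '2012-10-04 18:52:33')::bigint;  -- 1349376753
SELECT 1349376753 / 900;          -- 1499307 (integer division drops the remainder of 453)
SELECT 1499307 * 900;             -- 1349376300
SELECT to_timestamp(1349376300);  -- 2012-10-04 18:45:00+00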

This works great for intervals that can be represented without fractional digits in the decimal system. For even more versatility use the often overlooked function width_bucket() like I demonstrate in this recent, closely related answer. More explanation, links and an sqlfiddle demo over there.


