Aggregating (x,y) coordinate point clouds in PostgreSQL
Use the often overlooked built-in function width_bucket()
in combination with your aggregation:
If your coordinates run from, say, 0 to 2000 and you want to consolidate everything within a distance of 5 into single points, lay out a grid with a step of 10 (5*2) like this:
SELECT device_id
, width_bucket(pos_x, 0, 2000, 2000/10) * 10 AS pos_x
, width_bucket(pos_y, 0, 2000, 2000/10) * 10 AS pos_y
, count(*) AS ct -- or any other aggregate
FROM tbl
GROUP BY 1,2,3
ORDER BY 1,2,3;
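width_bucket(value, low, high, count) assigns equal-width bin numbers from 1 to count (0 below the range, count + 1 at or above the upper bound). A small Python model of the grid arithmetic above (a sketch, not the actual Postgres implementation):

```python
def width_bucket(x, low, high, count):
    # Mirrors PostgreSQL's 4-argument width_bucket() for equal-width bins:
    # 0 below the range, count + 1 at or above the upper bound.
    if x < low:
        return 0
    if x >= high:
        return count + 1
    return int((x - low) * count / (high - low)) + 1

# Snap coordinates to a grid of 10, as in the query above:
snapped = [width_bucket(x, 0, 2000, 200) * 10 for x in (0, 9, 10, 17, 1999)]
print(snapped)  # [10, 10, 20, 20, 2000]
```

Note that bucket_number * 10 labels each cell by its upper edge; for the aggregation all that matters is that every point of a cell maps to the same value.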
To minimize the error, GROUP BY the grid as demonstrated, but save the actual average coordinates:
SELECT device_id
, avg(pos_x)::int AS pos_x -- save actual averages to minimize error
, avg(pos_y)::int AS pos_y -- cast if you need to
, count(*) AS ct -- or any other aggregate
FROM tbl
GROUP BY
device_id
, width_bucket(pos_x, 0, 2000, 2000/10) * 10 -- aggregate by grid
, width_bucket(pos_y, 0, 2000, 2000/10) * 10
ORDER BY 1,2,3;
An sqlfiddle demo shows both variants side by side.
Well, this particular case could be simpler:
...
GROUP BY
device_id
, (pos_x / 10) * 10 -- truncates last digit of an integer
, (pos_y / 10) * 10
...
But that's just because the demo grid size of 10
conveniently matches the decimal system. Try the same with a grid size of 17
or something ...
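For integer coordinates, plain integer division does generalize to any grid size, not just powers of ten - a quick arithmetic sketch:

```python
def snap(x, grid):
    # Truncating integer division snaps x down to the nearest multiple of grid
    # (for non-negative x; Postgres integer division truncates toward zero,
    # Python's // floors, so they agree only for non-negative values).
    return (x // grid) * grid

print([snap(x, 17) for x in (0, 16, 17, 33, 34)])  # [0, 0, 17, 17, 34]
```

width_bucket() remains the more general tool, since it also handles fractional values and ranges that do not start at 0.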
Expand to timestamps
You can expand this approach to cover date
and timestamp
values by converting them to unix epoch (number of seconds since '1970-01-01') with extract().
SELECT extract(epoch FROM '2012-10-01 21:06:38+02'::timestamptz);
When you are done, convert the result back to timestamp with time zone:
SELECT timestamptz 'epoch' + 1349118398 * interval '1s';
Or simply use to_timestamp():
SELECT to_timestamp(1349118398);
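The epoch round-trip is easy to sanity-check outside the database - a Python sketch of the same conversion:

```python
from datetime import datetime, timedelta, timezone

# The timestamptz from the example above, at UTC offset +02.
ts = datetime(2012, 10, 1, 21, 6, 38, tzinfo=timezone(timedelta(hours=2)))

# Equivalent of extract(epoch FROM ...):
epoch = int(ts.timestamp())
print(epoch)  # 1349118398

# Equivalent of to_timestamp(), rendered in UTC:
back = datetime.fromtimestamp(epoch, tz=timezone.utc)
print(back.isoformat())  # 2012-10-01T19:06:38+00:00
```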
Aggregating connected sets of nodes / edges
A recursive query is the way to go:
with recursive tree as (
select node, parent, length, node as root_id
from network
where parent is null
union all
select c.node, c.parent, c.length, p.root_id
from network c
join tree p on p.node = c.parent
)
select root_id, array_agg(node) as edges_in_group, sum(length) as total_length
from tree
group by root_id;
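The traversal the recursive CTE performs can be modeled procedurally. A minimal Python sketch over a made-up edge list (all names hypothetical):

```python
from collections import defaultdict

# Hypothetical network rows: (node, parent, length); parent None marks a root.
network = [
    (1, None, 0), (2, 1, 5), (3, 1, 7),
    (4, None, 0), (5, 4, 3),
]

parent = {n: p for n, p, _ in network}
length = {n: l for n, _, l in network}

def root_of(n):
    # Walk up the parent chain, as the recursion carries root_id down.
    while parent[n] is not None:
        n = parent[n]
    return n

groups = defaultdict(list)          # root_id -> nodes, like array_agg(node)
for n, _, _ in network:
    groups[root_of(n)].append(n)

totals = {r: sum(length[n] for n in ns) for r, ns in groups.items()}
print(totals)  # {1: 12, 4: 3}
```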
The important thing is to keep the id of the root node in each recursion, so that you can group by that id in the final result.
Performing Multiple Avg Queries in PostgreSQL
Only populated cells
Use the built-in function width_bucket() to get only grid cells with one or more matching rows in entries:
For a grid of 100 x 100 cells in the outer frame of box(point(_lat_start, _long_start), point(_lat_end, _long_end)):
SELECT width_bucket(gps_lat , _lat_start , _lat_end , 100) AS grid_lat
, width_bucket(gps_long, _long_start, _long_end, 100) AS grid_long
, avg(value) AS avg_val
FROM entries
WHERE point(gps_lat, gps_long) <@ box(point(_lat_start, _long_start)
, point(_lat_end , _long_end))
GROUP BY 1,2
ORDER BY 1,2;
<@ is the "contained in" operator for geometric types.
It's easy to wrap this into a function and parameterize the outer box and the number of grid cells.
A multicolumn GiST expression index will help performance if only a small fraction of rows lies within the outer box. You'll need to install the btree_gist module first, once per database:
- PostgreSQL EXCLUDE USING error: Data type integer has no default operator class
Then:
CREATE INDEX entries_point_idx ON entries
USING gist (point(gps_lat, gps_long), value);
Adding value to the index only makes sense if you can get an index-only scan out of it (Postgres 9.2+).
If you are reading large parts of the table anyway, you don't need an index; it might be cheaper to run simple BETWEEN x AND y checks in the WHERE clause.
This is assuming a flat earth (which may be good enough for your purpose). If you want to be precise, you will have to dig deeper into PostGIS.
All cells in the grid
To get all cells use LEFT JOIN
to a pre-generated grid like you already tried:
SELECT grid_lat, grid_long, g.avg_val -- or use COALESCE
FROM generate_series(1,100) grid_lat
CROSS JOIN generate_series(1,100) grid_long
LEFT JOIN (<query from above>) g USING (grid_lat, grid_long)
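The effect of the CROSS JOIN / LEFT JOIN combination can be sketched in Python: build every cell, then fill in the populated ones (sample values made up):

```python
# Aggregate results for a few populated cells: (grid_lat, grid_long) -> avg_val.
populated = {(3, 7): 42.5, (10, 10): 17.0}

# CROSS JOIN of two generate_series() calls: every cell of the 100x100 grid.
# LEFT JOIN: cells absent from the aggregate get None (NULL).
grid = {(la, lo): populated.get((la, lo))
        for la in range(1, 101)
        for lo in range(1, 101)}

print(len(grid))      # 10000
print(grid[(3, 7)])   # 42.5
print(grid[(1, 1)])   # None
```

COALESCE in the query corresponds to replacing None with a default here.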
Related:
- Aggregating (x,y) coordinate point clouds in PostgreSQL
calculating average with grouping based on time intervals
Simple and fast solution for this particular example:
SELECT date_trunc('minute', ts) AS minute
, sum(speed)/6 AS avg_speed
FROM speed_table AS t
WHERE ts >= '2014-06-21 0:0'
AND ts < '2014-06-22 0:0' -- exclude dangling corner case
AND condition2 = 'something'
GROUP BY 1
ORDER BY 1;
You need to factor in missing rows as "0 speed". Since a minute has 6 samples (one every 10 seconds), just sum and divide by 6; missing rows then evaluate to 0 implicitly.
However, this returns no row at all for minutes without any samples; avg_speed for such missing result rows is effectively 0.
General query for arbitrary intervals
Works for any interval listed in the manual for date_trunc():
SELECT date_trunc('minute', g.ts) AS ts_start
, avg(COALESCE(speed, 0)) AS avg_speed
FROM (SELECT generate_series('2014-06-21 0:0'::timestamp
, '2014-06-22 0:0'::timestamp
, '10 sec'::interval) AS ts) g
LEFT JOIN speed_table t USING (ts)
WHERE (t.condition2 = 'something' OR
t.condition2 IS NULL) -- depends on actual condition!
AND g.ts <> '2014-06-22 0:0'::timestamp -- exclude dangling corner case
GROUP BY 1
ORDER BY 1;
The problematic part is the additional unknown condition. You would need to define that. And decide whether missing rows supplied by generate_series should pass the test or not (which can be tricky!).
I let them pass in my example (along with all other rows holding a NULL value).
Compare:
PostgreSQL: running count of rows for a query 'by minute'
Arbitrary intervals:
Truncate timestamp to arbitrary intervals
For completely arbitrary intervals consider @Clodoaldo's math based on epoch values, or use the often overlooked function width_bucket(). Example:
Aggregating (x,y) coordinate point clouds in PostgreSQL
PostgreSQL - Get count of items in a table grouped by a datetime column for N intervals
Generate the times you want and then use left join
and aggregation:
select gs.ts, count(u.id)
from generate_series('2019-01-01T00:00:00'::timestamp,
'2019-01-01T12:00:00'::timestamp,
interval '12 hour'
) gs(ts) left join
users u
on u.created_at >= gs.ts and
u.created_at < gs.ts + interval '12 hour'
group by 1
order by 1;
EDIT:
If you want to specify the number of result rows instead, you can replace the FROM clause with something similar:
from generate_series(1, 10, 1) as gs(n) cross join lateral
(values ('2019-01-01T00:00:00'::timestamp + (gs.n - 1) * interval '12 hour')
) v(ts) left join
users u
on u.created_at >= v.ts and
u.created_at < v.ts + interval '12 hour'
Multiple averages over evenly spaced intervals
I suggest the handy function width_bucket():
To get the average for each time segment ("bin"):
SELECT width_bucket(extract(epoch FROM t.the_date)
, x.min_epoch, x.max_epoch, x.bins) AS bin
, avg(value) AS bin_avg
FROM tbl t
, (SELECT extract(epoch FROM min(the_date)) AS min_epoch
, extract(epoch FROM max(the_date)) AS max_epoch
, 10 AS bins
FROM tbl t
) x
GROUP BY 1;
To get the "running average" over the (step-by-step) growing time interval:
SELECT bin, round(sum(bin_sum) OVER w /sum(bin_ct) OVER w, 2) AS running_avg
FROM (
SELECT width_bucket(extract(epoch FROM t.the_date)
, x.min_epoch, x.max_epoch, x.bins) AS bin
, sum(value) AS bin_sum
, count(*) AS bin_ct
FROM tbl t
, (SELECT extract(epoch FROM min(the_date)) AS min_epoch
, extract(epoch FROM max(the_date)) AS max_epoch
, 10 AS bins
FROM tbl t
) x
GROUP BY 1
) sub
WINDOW w AS (ORDER BY bin)
ORDER BY 1;
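The window computation boils down to cumulative sums - a Python sketch with made-up per-bin aggregates:

```python
# Per-bin aggregates as the subquery would produce them:
# (bin, bin_sum, bin_ct), numbers made up for illustration.
bins = [(1, 10.0, 2), (2, 30.0, 3), (3, 20.0, 5)]

# Running average = cumulative bin_sum / cumulative bin_ct,
# like sum(...) OVER (ORDER BY bin) in the query above.
running = []
cum_sum = cum_ct = 0.0
for b, s, ct in bins:
    cum_sum += s
    cum_ct += ct
    running.append((b, round(cum_sum / cum_ct, 2)))

print(running)  # [(1, 5.0), (2, 8.0), (3, 6.0)]
```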
Using the_date instead of date as column name, to avoid reserved words as identifiers.
Since width_bucket() is currently only implemented for double precision and numeric, I extract epoch values from the_date. Details here:
Aggregating (x,y) coordinate point clouds in PostgreSQL
SQL Query timestamp date field and group all results by date
Simplest / fastest way to extract a date from a Unix epoch:
SELECT to_timestamp(my_epoch_ts)::date;
Example:
SELECT to_timestamp(1349118398);
Result:
2012-10-01
That's making use of the to_timestamp() function, followed by a cast to date (::date).
More details and links at the end of this recent related answer.
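The same conversion in Python (pinned to UTC here; Postgres casts in the session time zone):

```python
from datetime import datetime, timezone

# Equivalent of to_timestamp(1349118398)::date, evaluated in UTC.
d = datetime.fromtimestamp(1349118398, tz=timezone.utc).date()
print(d)  # 2012-10-01
```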
Truncate timestamp to arbitrary intervals
Consider this demo to bring timestamps down to a resolution of 15 minutes and aggregate resulting dupes:
WITH tbl(id, ts) AS ( VALUES
(1::int, '2012-10-04 00:00:00'::timestamp)
,(2, '2012-10-04 18:23:01')
,(3, '2012-10-04 18:30:00')
,(4, '2012-10-04 18:52:33')
,(5, '2012-10-04 18:55:01')
,(6, '2012-10-04 18:59:59')
,(7, '2012-10-05 11:01:01')
)
SELECT to_timestamp((extract(epoch FROM ts)::bigint / 900)*900)::timestamp
AS lower_bound
, to_timestamp(avg(extract(epoch FROM ts)))::timestamp AS avg_ts
, count(*) AS ct
FROM tbl
GROUP BY 1
ORDER BY 1;
Result:
lower_bound | avg_ts | ct
---------------------+---------------------+----
2012-10-04 00:00:00 | 2012-10-04 00:00:00 | 1
2012-10-04 18:15:00 | 2012-10-04 18:23:01 | 1
2012-10-04 18:30:00 | 2012-10-04 18:30:00 | 1
2012-10-04 18:45:00 | 2012-10-04 18:55:51 | 3
2012-10-05 11:00:00 | 2012-10-05 11:01:01 | 1
The trick is to extract a unix epoch like @Michael already posted. Integer division lumps them together in buckets of the chosen resolution, because fractional digits are truncated.
I divide by 900, because 15 minutes = 900 seconds.
Multiply by the same number to get the resulting lower_bound. Convert the unix epoch back to a timestamp with to_timestamp().
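The truncation arithmetic is easy to verify outside the database - a Python sketch (pinned to UTC):

```python
from datetime import datetime, timezone

def trunc_15min(epoch):
    # Integer division by 900 (15 min = 900 s) truncates the fraction;
    # multiplying back yields the lower bound of the bucket.
    return (epoch // 900) * 900

ts = datetime(2012, 10, 4, 18, 52, 33, tzinfo=timezone.utc)
lower = trunc_15min(int(ts.timestamp()))
print(datetime.fromtimestamp(lower, tz=timezone.utc).isoformat())
# 2012-10-04T18:45:00+00:00
```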
This works great for intervals that can be represented without fractional digits in the decimal system. For even more versatility use the often overlooked function width_bucket()
like I demonstrate in this recent, closely related answer. More explanation, links and an sqlfiddle demo over there.