SQL: Finding the closest Lat/Lon record on Google BigQuery
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(STRUCT<id_a INT64, id_b STRING>(a.id, b.id) ORDER BY ST_DISTANCE(a.point, b.point) LIMIT 1)[OFFSET(0)]
FROM (SELECT id, ST_GEOGPOINT(lon, lat) point FROM `project.dataset.table_a`) a
CROSS JOIN (SELECT id, ST_GEOGPOINT(lon, lat) point FROM `project.dataset.table_b`) b
GROUP BY a.id
you can test, play with it using dummy data from your question as
#standardSQL
WITH `project.dataset.table_a` AS (
SELECT 1 id, 32.95 lat, 65.567 lon UNION ALL
SELECT 2, 33.95, 65.566
), `project.dataset.table_b` AS (
SELECT 'a' id, 32.96 lat, 65.566 lon UNION ALL
SELECT 'b', 33.96, 65.566
)
SELECT AS VALUE ARRAY_AGG(STRUCT<id_a INT64, id_b STRING>(a.id, b.id) ORDER BY ST_DISTANCE(a.point, b.point) LIMIT 1)[OFFSET(0)]
FROM (SELECT id, ST_GEOGPOINT(lon, lat) point FROM `project.dataset.table_a`) a
CROSS JOIN (SELECT id, ST_GEOGPOINT(lon, lat) point FROM `project.dataset.table_b`) b
GROUP BY a.id
with result
Row id_a id_b
1 1 a
2 2 b
Finding the closest geographic points between two tables BigQuery
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE
ARRAY_AGG(STRUCT(id, address, parameter1, parameter2) ORDER BY ST_DISTANCE(a.point, b.point) LIMIT 1)[OFFSET(0)]
FROM (SELECT *, ST_GEOGPOINT(longitude, latitude) point FROM `project.dataset.tableA`) a,
(SELECT *, ST_GEOGPOINT(longitude, latitude) point FROM `project.dataset.tableB`) b
GROUP BY id
If to apply to sample data from your question
WITH `project.dataset.tableA` AS (
SELECT 1 id, 39.79 latitude, 86.03 longitude, '123 Vine St' address UNION ALL
SELECT 2, 39.89, 84.01, '123 Oak St'
), `project.dataset.tableB` AS (
SELECT 39.74 latitude, 86.33 longitude, [1, 2, 3] parameter1, [.1, .2, .3] parameter2 UNION ALL
SELECT 39.81, 83.90, [4, 5, 6], [.4, .5, .6]
)
output is
In sql, what is the most efficient way to find nearest pairs of points for two tables
A simple way to find the nearest point is to aggregate sorting by distance, and take the first element, something like:
SELECT
table_a.idenfitier,
ARRAY_AGG(table_b.identifier
ORDER BY ST_Distance(table_a.geog, table_b.geog)
LIMIT 1)
[ORDINAL(1)]
FROM ...
GROUP BY table_a.idenfitier
But as you correctly mentioned, cross product is very slow, so if you can - restrict the join to a limited radius using ST_DWithin
.
Couple blog posts describe the ideas how to do this in BigQuery in details
https://mentin.medium.com/nearest-neighbor-in-bigquery-gis-7d50ebd5d63
https://mentin.medium.com/nearest-neighbor-using-bq-scripting-373241f5b2f5
Reverse- geocoding: How to determine the city closest to a (lat,lon) with BigQuery SQL?
This is the best performing query we've worked out so far:
WITH a AS (
# a table with points around the world
SELECT * FROM UNNEST([ST_GEOGPOINT(-70, -33), ST_GEOGPOINT(-122,37), ST_GEOGPOINT(151,-33)]) my_point
), b AS (
# any table with cities world locations
SELECT *, ST_GEOGPOINT(lon,lat) latlon_geo
FROM `fh-bigquery.geocode.201806_geolite2_latlon_redux`
)
SELECT my_point, city_name, subdivision_1_name, country_name, continent_name
FROM (
SELECT loc.*, my_point
FROM (
SELECT ST_ASTEXT(my_point) my_point, ANY_VALUE(my_point) geop
, ARRAY_AGG( # get the closest city
STRUCT(city_name, subdivision_1_name, country_name, continent_name)
ORDER BY ST_DISTANCE(my_point, b.latlon_geo) LIMIT 1
)[SAFE_OFFSET(0)] loc
FROM a, b
WHERE ST_DWITHIN(my_point, b.latlon_geo, 100000) # filter to only close cities
GROUP BY my_point
)
)
GROUP BY 1,2,3,4,5
HAVERSINE distance in BigQuery?
2018 update: BigQuery now supports native geo functions.
ST_DISTANCE: Returns the shortest distance in meters between two non-empty GEOGRAPHYs.
Distance between NY and Seattle:
#standardSQL
WITH geopoints AS (
SELECT ST_GEOGPOINT(lon,lat) p, name, state
FROM `bigquery-public-data.noaa_gsod.stations`
)
SELECT ST_DISTANCE(
(SELECT p FROM geopoints WHERE name='PORT AUTH DOWNTN MANHATTAN WA'),
(SELECT p FROM geopoints WHERE name='SEATTLE')
)
3866381.55
- https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions
Legacy SQL solution (standard pending):
SELECT lat, lon, name,
(111.045 * DEGREES(ACOS(COS(RADIANS(40.73943)) * COS(RADIANS(lat)) * COS(RADIANS(-73.99585) - RADIANS(lon)) + SIN(RADIANS(40.73943)) * SIN(RADIANS(lat))))) AS distance
FROM [bigquery-public-data:noaa_gsod.stations]
HAVING distance>0
ORDER BY distance
LIMIT 4
(based on http://www.plumislandmedia.net/mysql/haversine-mysql-nearest-loc/)
Get latitude/longitude from BigQuery Geography Point
Use the Geography functions ST_X
and ST_Y
.
SELECT
ST_X(locationCoords) as longitude,
ST_Y(locationCoords) as latitude
FROM
dataTable
Geography Functions DOC
BigQuery: join all columns of a row nearest to the time of current row from a different table
Here's an alternative answer - in this case I created the physical tables to make sure this runs when tables are not ephemeral out of a WITH
statement:
CREATE TABLE temp.a AS (
SELECT * FROM UNNEST(
[STRUCT(TIMESTAMP('2018-01-02 20:01:00') AS time, 'monkey' AS animal)
,STRUCT('2018-03-04 10:10:10', 'lion')
,STRUCT('2018-07-04 10:10:10', 'donkey')
])
)
;
CREATE TABLE temp.b AS (
SELECT * FROM UNNEST(
[STRUCT(TIMESTAMP('2017-01-02 10:01:00') AS time, 'one' AS festival)
,STRUCT('2019-03-04 10:10:10', 'two')
,STRUCT('2018-07-04 10:10:10', 'three')
,STRUCT('2018-03-05 10:10:10', 'four')
])
)
;
SELECT b.*,
ARRAY_AGG(a
ORDER BY ABS(TIMESTAMP_DIFF(b.time, a.time, SECOND))
LIMIT 1)[OFFSET(0)] closest
FROM temp.b
JOIN temp.a
ON ABS(timestamp_diff(b.time,a.time, DAY)) < 360*2
GROUP BY 1, 2
The clause ON ABS(timestamp_diff(b.time,a.time, DAY)) < 360*2
is a sanity check to avoid explosive joins. Bring 360*2
to a way lower limit if your data is more dense than the presented here.
Optimize Distance Calculation in BigQuery
BigQuery can very efficiently do spatial join, matching items across two tables that are within specific distance of each other. But you need to know that specific distance, or try several ones until you update all the points.
This post discusses it in more details:
https://medium.com/@mentin/nearest-neighbor-in-bigquery-gis-7d50ebd5d63
You can automate it using BigQuery scripting, here is an idea, although it discussed a slightly different problem, a nearest geometry to a single point:
https://medium.com/@mentin/nearest-neighbor-using-bq-scripting-373241f5b2f5
Related Topics
Query JSONb Column Containing Array of JSON Objects
The Difference Between Comma and Join in SQL
Stop Access from Using Wrong Identity When Appending to Linked Table on SQL Server
SQL Access How to Return Between Dates
Handling Unicode Characters That Aren't Displayed Correctly in SQL Query
How to Display Number Value in Words
Oracle - Select Count on a Subquery
Sqlite Multiple Autoincrement Columns
How to Return Last Inserted (Auto Incremented) Row Id in Hsql
Ora-00904 Invalid Identifier on Decode Alias
How to Fill Missing Dates and Values in Partitioned Data
How to Execute Dynamic SQL in Teradata
SQL Count of Open Orders Each Day Between Two Dates
Turning Arbitrarily Many Rows into Columns in Postgresql
How Does 'In' Clause Works in Oracle
How to Find Which Columns Don't Have Any Data (All Values Are Null)