Sql: Finding the Closest Lat/Lon Record on Google Bigquery

SQL: Finding the closest Lat/Lon record on Google BigQuery

Below is for BigQuery Standard SQL

#standardSQL
SELECT AS VALUE ARRAY_AGG(STRUCT<id_a INT64, id_b STRING>(a.id, b.id) ORDER BY ST_DISTANCE(a.point, b.point) LIMIT 1)[OFFSET(0)]
FROM (SELECT id, ST_GEOGPOINT(lon, lat) point FROM `project.dataset.table_a`) a
CROSS JOIN (SELECT id, ST_GEOGPOINT(lon, lat) point FROM `project.dataset.table_b`) b
GROUP BY a.id

you can test, play with it using dummy data from your question as

#standardSQL
WITH `project.dataset.table_a` AS (
SELECT 1 id, 32.95 lat, 65.567 lon UNION ALL
SELECT 2, 33.95, 65.566
), `project.dataset.table_b` AS (
SELECT 'a' id, 32.96 lat, 65.566 lon UNION ALL
SELECT 'b', 33.96, 65.566
)
SELECT AS VALUE ARRAY_AGG(STRUCT<id_a INT64, id_b STRING>(a.id, b.id) ORDER BY ST_DISTANCE(a.point, b.point) LIMIT 1)[OFFSET(0)]
FROM (SELECT id, ST_GEOGPOINT(lon, lat) point FROM `project.dataset.table_a`) a
CROSS JOIN (SELECT id, ST_GEOGPOINT(lon, lat) point FROM `project.dataset.table_b`) b
GROUP BY a.id

with result

Row id_a    id_b     
1 1 a
2 2 b

Finding the closest geographic points between two tables BigQuery

Below is for BigQuery Standard SQL

#standardSQL
SELECT AS VALUE
ARRAY_AGG(STRUCT(id, address, parameter1, parameter2) ORDER BY ST_DISTANCE(a.point, b.point) LIMIT 1)[OFFSET(0)]
FROM (SELECT *, ST_GEOGPOINT(longitude, latitude) point FROM `project.dataset.tableA`) a,
(SELECT *, ST_GEOGPOINT(longitude, latitude) point FROM `project.dataset.tableB`) b
GROUP BY id

If to apply to sample data from your question

WITH `project.dataset.tableA` AS (
SELECT 1 id, 39.79 latitude, 86.03 longitude, '123 Vine St' address UNION ALL
SELECT 2, 39.89, 84.01, '123 Oak St'
), `project.dataset.tableB` AS (
SELECT 39.74 latitude, 86.33 longitude, [1, 2, 3] parameter1, [.1, .2, .3] parameter2 UNION ALL
SELECT 39.81, 83.90, [4, 5, 6], [.4, .5, .6]
)

output is

Sample Image

In sql, what is the most efficient way to find nearest pairs of points for two tables

A simple way to find the nearest point is to aggregate sorting by distance, and take the first element, something like:

SELECT 
table_a.idenfitier,
ARRAY_AGG(table_b.identifier
ORDER BY ST_Distance(table_a.geog, table_b.geog)
LIMIT 1)
[ORDINAL(1)]
FROM ...
GROUP BY table_a.idenfitier

But as you correctly mentioned, cross product is very slow, so if you can - restrict the join to a limited radius using ST_DWithin.
Couple blog posts describe the ideas how to do this in BigQuery in details

https://mentin.medium.com/nearest-neighbor-in-bigquery-gis-7d50ebd5d63

https://mentin.medium.com/nearest-neighbor-using-bq-scripting-373241f5b2f5

Reverse- geocoding: How to determine the city closest to a (lat,lon) with BigQuery SQL?

This is the best performing query we've worked out so far:

WITH a AS (
# a table with points around the world
SELECT * FROM UNNEST([ST_GEOGPOINT(-70, -33), ST_GEOGPOINT(-122,37), ST_GEOGPOINT(151,-33)]) my_point
), b AS (
# any table with cities world locations
SELECT *, ST_GEOGPOINT(lon,lat) latlon_geo
FROM `fh-bigquery.geocode.201806_geolite2_latlon_redux`
)

SELECT my_point, city_name, subdivision_1_name, country_name, continent_name
FROM (
SELECT loc.*, my_point
FROM (
SELECT ST_ASTEXT(my_point) my_point, ANY_VALUE(my_point) geop
, ARRAY_AGG( # get the closest city
STRUCT(city_name, subdivision_1_name, country_name, continent_name)
ORDER BY ST_DISTANCE(my_point, b.latlon_geo) LIMIT 1
)[SAFE_OFFSET(0)] loc
FROM a, b
WHERE ST_DWITHIN(my_point, b.latlon_geo, 100000) # filter to only close cities
GROUP BY my_point
)
)
GROUP BY 1,2,3,4,5

Sample Image

HAVERSINE distance in BigQuery?

2018 update: BigQuery now supports native geo functions.

ST_DISTANCE: Returns the shortest distance in meters between two non-empty GEOGRAPHYs.

Distance between NY and Seattle:

#standardSQL
WITH geopoints AS (
SELECT ST_GEOGPOINT(lon,lat) p, name, state
FROM `bigquery-public-data.noaa_gsod.stations`
)

SELECT ST_DISTANCE(
(SELECT p FROM geopoints WHERE name='PORT AUTH DOWNTN MANHATTAN WA'),
(SELECT p FROM geopoints WHERE name='SEATTLE')
)

3866381.55
  • https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions

Legacy SQL solution (standard pending):

SELECT lat, lon, name,
(111.045 * DEGREES(ACOS(COS(RADIANS(40.73943)) * COS(RADIANS(lat)) * COS(RADIANS(-73.99585) - RADIANS(lon)) + SIN(RADIANS(40.73943)) * SIN(RADIANS(lat))))) AS distance
FROM [bigquery-public-data:noaa_gsod.stations]
HAVING distance>0
ORDER BY distance
LIMIT 4

Sample Image

(based on http://www.plumislandmedia.net/mysql/haversine-mysql-nearest-loc/)

Get latitude/longitude from BigQuery Geography Point

Use the Geography functions ST_X and ST_Y.

SELECT 
ST_X(locationCoords) as longitude,
ST_Y(locationCoords) as latitude
FROM
dataTable

Geography Functions DOC

BigQuery: join all columns of a row nearest to the time of current row from a different table

Here's an alternative answer - in this case I created the physical tables to make sure this runs when tables are not ephemeral out of a WITH statement:

CREATE TABLE temp.a AS (
SELECT * FROM UNNEST(
[STRUCT(TIMESTAMP('2018-01-02 20:01:00') AS time, 'monkey' AS animal)
,STRUCT('2018-03-04 10:10:10', 'lion')
,STRUCT('2018-07-04 10:10:10', 'donkey')
])
)
;
CREATE TABLE temp.b AS (
SELECT * FROM UNNEST(
[STRUCT(TIMESTAMP('2017-01-02 10:01:00') AS time, 'one' AS festival)
,STRUCT('2019-03-04 10:10:10', 'two')
,STRUCT('2018-07-04 10:10:10', 'three')
,STRUCT('2018-03-05 10:10:10', 'four')
])
)
;

SELECT b.*,
ARRAY_AGG(a
ORDER BY ABS(TIMESTAMP_DIFF(b.time, a.time, SECOND))
LIMIT 1)[OFFSET(0)] closest
FROM temp.b
JOIN temp.a
ON ABS(timestamp_diff(b.time,a.time, DAY)) < 360*2
GROUP BY 1, 2

The clause ON ABS(timestamp_diff(b.time,a.time, DAY)) < 360*2 is a sanity check to avoid explosive joins. Bring 360*2 to a way lower limit if your data is more dense than the presented here.

Sample Image

Optimize Distance Calculation in BigQuery

BigQuery can very efficiently do spatial join, matching items across two tables that are within specific distance of each other. But you need to know that specific distance, or try several ones until you update all the points.

This post discusses it in more details:
https://medium.com/@mentin/nearest-neighbor-in-bigquery-gis-7d50ebd5d63

You can automate it using BigQuery scripting, here is an idea, although it discussed a slightly different problem, a nearest geometry to a single point:
https://medium.com/@mentin/nearest-neighbor-using-bq-scripting-373241f5b2f5



Related Topics



Leave a reply



Submit