Find Closest Points (Lat/Lon) from One Data Set to a Second Data Set

Find closest points (lat / lon) from one data set to a second data set

Here is solution using a single loop and vectorizing the distance calculation (converted to km).

The code is using base R's rank function to order/sort the list of calculated distances.

The indexes and the calculated distances of the 3 shortest values are store back in data frame A.

library(geosphere)

A = data.frame(longitude = c(-2.3954998, -2.0650243, -2.0650542), latitude = c(55.32043, 55.59062, 55.60859))
B = data.frame(longitude = c(-2.4252843, -2.0650542, -2.0650243), latitude = c(55.15858, 55.60859, 55.59062))

for(i in 1:nrow(A)){
  #calucate distance against all of B
  distances<-geosphere::distGeo(A[i,], B)/1000
  #rank the calculated distances
  ranking<-rank(distances, ties.method = "first")

  #find the 3 shortest and store the indexes of B back in A
  A$shortest[i]<-which(ranking ==1) #Same as which.min()
  A$shorter[i]<-which(ranking==2)
  A$short[i]<-which(ranking ==3)

  #store the distances back in A
  A$shortestD[i]<-distances[A$shortest[i]] #Same as min()
  A$shorterD[i]<-distances[A$shorter[i]]
  A$shortD[i]<-distances[A$short[i]]
}
A

  longitude latitude shortest shorter short shortestD  shorterD   shortD
1 -2.395500 55.32043        1       3     2  18.11777 36.633310 38.28952
2 -2.065024 55.59062        3       2     1   0.00000  2.000682 53.24607
3 -2.065054 55.60859        2       3     1   0.00000  2.000682 55.05710

As M Viking pointed out, for the geosphere package the data must be arranged Lon then Lat.

Finding closest coordinates between two large data sets

I wrote up an answer referencing this thread. The function is modified to take care of reporting the distance and avoid hard-coding. Please note that it calculates Euclidean distance.

library(data.table)
#Euclidean distance 
mydist <- function(a, b, df1, x, y){

          dt <- data.table(sqrt((df1[[x]]-a)^2 + (df1[[y]]-b)^2))

          return(data.table(Closest.V1  = which.min(dt$V1),
                            Distance    = dt[which.min(dt$V1)]))
           }

setDT(df1)[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON, setDT(df2), 
                        "HIGH_PRCN_LAT", "HIGH_PRCN_LON"), 
                         by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]

  #     id HIGH_PRCN_LAT HIGH_PRCN_LON Closest.V1 Distance.V1
  # 1:   1      52.88144     -2.873778          5   0.7990743
  # 2:   2      57.80945     -2.234544          8   2.1676868
  # 3:   4      34.02335     -3.098445         10   1.4758202
  # 4:   5      63.80879     -2.439163          3   4.2415854
  # 5:   6      53.68881     -7.396112          2   3.6445416
  # 6:   7      63.44628     -5.162345          3   2.3577811
  # 7:   8      21.60755     -8.633113          9   8.2123762
  # 8:   9      78.32444      3.813290          7  11.4936496
  # 9:  10      66.85533     -3.994326          1   1.9296370
  # 10:  3      51.62354     -8.906553          2   3.2180026

You can use RANN::nn2, but you need to make sure to use the right syntax. Following works!

as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))

#    nn.idx   nn.dists
# 1       5  0.7990743
# 2       8  2.1676868
# 3      10  1.4758202
# 4       3  4.2415854
# 5       2  3.6445416
# 6       3  2.3577811
# 7       9  8.2123762
# 8       7 11.4936496
# 9       1  1.9296370
# 10      2  3.2180026

Find nearest point in second datset for every point in first dataset

c(df1[[x]],df1[[y]]) should be changed to df1[, c(y,x), with=FALSE] because geosphere::distGeo / raster::pointDistance expects a 2-column matrix of points. Also, it expects the points to be supplied as (longitude, latitude) - hence the (x,y) should be flipped to (y,x).

Using raster::pointDistance:

# Using raster::pointDistance
library(data.table)
library(raster)
library(geosphere)

setDT(df1)
setDT(df2)

mydist <- function(a, b, df1, x, y) {
    dt <- data.table(pointDistance(c(b,a), df1[, c(y,x), with=FALSE], lonlat=TRUE))
    
    return (
        data.table(Closest.V1 = which.min(dt$V1),
        Distance = dt[which.min(dt$V1)])
    )
}

df1[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON,
    df2, "HIGH_PRCN_LAT", "HIGH_PRCN_LON"),
    by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]

Sample Image

Using geosphere::distGeo (same result as above):

# Using geosphere::distGeo
mydist <- function(a, b, df1, x, y) {
    dt <- data.table(distGeo(c(b,a), df1[, c(y,x), with=FALSE]))
    
    return (
        data.table(Closest.V1 = which.min(dt$V1),
        Distance = dt[which.min(dt$V1)])
    )
}

df1[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON,
    df2, "HIGH_PRCN_LAT", "HIGH_PRCN_LON"),
    by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]

Find nearest points of latitude and longitude from different data sets with different length

Here is an other possible solution:

library(rgeos)
set1sp <- SpatialPoints(set1)
set2sp <- SpatialPoints(set2)
set1$nearest_in_set2 <- apply(gDistance(set1sp, set2sp, byid=TRUE), 1, which.min)

head(set1)
       lon      lat nearest_in_set2
## 1 13.67111 48.39167              10
## 2 12.86695 48.14806              10
## 3 15.94223 48.72111              10
## 4 11.09974 47.18917               1
## 5 12.95834 47.05444               1
## 6 14.20389 47.12917               1

Finding closest point by comparing two data frames in R

I think you just need to iterate over each within AllStations and return the nearest of the ContaminantStations that is within a radius.

func <- function(stations, constations, radius = 250000) {
  if (!NROW(stations) || !NROW(constations)) return()
  if (length(radius) == 1 && NROW(constations) > 1) {
    radius <- rep(radius, NROW(constations))
  } else if (length(radius) != NROW(constations)) {
    stop("'radius' must be length 1 or the same as the number of rows in 'constations'")
  }
  out <- integer(NROW(stations))
  for (i in seq_len(NROW(stations))) {
    dists <- geosphere::distHaversine(stations[i,], constations)
    out[i] <- if (any(dists <= radius)) which.min(dists) else 0L
  }
  return(out)
}

This returns a integer vector, indicating the contaminant station that is closest. If none are within the radius, then it returns 0. This is safely used as a row-index on the original frame.

Each argument must include only two columns, with the first column being longitude. (I make no assumptions of the column names in the function.) radius is in meters, consistent with the geosphere package assumptions.

ind <- func(AllStations[,c("long_coor", "lat_coor")], ContaminantStations[,c("longitude", "latitude")],
            radius = 230000)
ind
# [1] 0 6 6 6 0 0 6 6

These are indices on the ContaminantStations rows, where non-zero means that that contaminant station is the closest to the specific row of AllStations.

We can identify which contaminant station is closest with this (there are many ways to do this, including tidyverse and other techniques ... this is just a start).

AllStations$ClosestContaminantStation <- NA_character_
AllStations$ClosestContaminantStation[ind > 0] <- ContaminantStations$station[ind]
AllStations
#     site_no lat_coor long_coor ClosestContaminantStation
# 1  02110500 33.91267 -78.71502                      <NA>
# 2  02110550 33.85083 -78.89722            USGS-021473426
# 3  02110701 33.86100 -79.04115            USGS-021473426
# 4  02110704 33.83295 -79.04365            USGS-021473426
# 5  02110760 33.74073 -78.86669                      <NA>
# 6  02110777 33.85156 -78.65585                      <NA>
# 7 021108044 33.65017 -79.12310            USGS-021473426
# 8  02110815 33.44461 -79.17393            USGS-021473426

A vis of your data for perspective:

ggplot of station locations

An alternative to this approach would be to return the distance and index of the closest contaminant station, regardless of the radius, allowing you to filter later.

func2 <- function(stations, constations, radius = 250000) {
  if (!NROW(stations) || !NROW(constations)) return()
  if (length(radius) == 1 && NROW(constations) > 1) {
    radius <- rep(radius, NROW(constations))
  } else if (length(radius) != NROW(constations)) {
    stop("'radius' must be length 1 or the same as the number of rows in 'constations'")
  }
  out <- data.frame(ind = integer(NROW(stations)), dist = numeric(NROW(stations)))
  for (i in seq_len(NROW(stations))) {
    dists <- geosphere::distHaversine(stations[i,], constations)
    out$ind[i] <- which.min(dists)
    out$dist[i] <- min(dists)
  }
  return(out)
}

Demonstration, including bringing the contaminant station into the same frame.

AllStations2 <- cbind(
  AllStations,
  func2(AllStations[,c("long_coor", "lat_coor")], ContaminantStations[,c("longitude", "latitude")])
)
AllStations2
#     site_no lat_coor long_coor ind     dist
# 1  02110500 33.91267 -78.71502   1 241971.5
# 2  02110550 33.85083 -78.89722   6 227650.6
# 3  02110701 33.86100 -79.04115   6 214397.8
# 4  02110704 33.83295 -79.04365   6 214847.7
# 5  02110760 33.74073 -78.86669   6 233190.8
# 6  02110777 33.85156 -78.65585   6 249519.7
# 7 021108044 33.65017 -79.12310   6 213299.3
# 8  02110815 33.44461 -79.17393   6 217378.9
AllStations3 <- cbind(
  AllStations2,
  ContaminantStations[AllStations2$ind,]
)
AllStations3
#       site_no lat_coor long_coor ind     dist        station latitude longitude
# 1    02110500 33.91267 -78.71502   1 241971.5  USGS-02146110 34.88928 -81.06869
# 6    02110550 33.85083 -78.89722   6 227650.6 USGS-021473426 34.24320 -81.31954
# 6.1  02110701 33.86100 -79.04115   6 214397.8 USGS-021473426 34.24320 -81.31954
# 6.2  02110704 33.83295 -79.04365   6 214847.7 USGS-021473426 34.24320 -81.31954
# 6.3  02110760 33.74073 -78.86669   6 233190.8 USGS-021473426 34.24320 -81.31954
# 6.4  02110777 33.85156 -78.65585   6 249519.7 USGS-021473426 34.24320 -81.31954
# 6.5 021108044 33.65017 -79.12310   6 213299.3 USGS-021473426 34.24320 -81.31954
# 6.6  02110815 33.44461 -79.17393   6 217378.9 USGS-021473426 34.24320 -81.31954

From here, you can choose your radius at will:

subset(AllStations3, dist < 230000)
#       site_no lat_coor long_coor ind     dist        station latitude longitude
# 6    02110550 33.85083 -78.89722   6 227650.6 USGS-021473426  34.2432 -81.31954
# 6.1  02110701 33.86100 -79.04115   6 214397.8 USGS-021473426  34.2432 -81.31954
# 6.2  02110704 33.83295 -79.04365   6 214847.7 USGS-021473426  34.2432 -81.31954
# 6.5 021108044 33.65017 -79.12310   6 213299.3 USGS-021473426  34.2432 -81.31954
# 6.6  02110815 33.44461 -79.17393   6 217378.9 USGS-021473426  34.2432 -81.31954

Find nearest point in other dataframe (WITH A LOT OF DATA)

I guess a Ball Tree is an appropriate structure for this task.

You can use the scikit-learn implementation, see the code below for an example adapted to your case :

import numpy as np
import geopandas as gpd
from shapely.geometry import Point
from sklearn.neighbors import BallTree

## Create the two GeoDataFrame to replicate your dataset
appart = gpd.GeoDataFrame({
        'geometry': Point(a, b),
        'x': a,
        'y': b,
    } for a, b in zip(np.random.rand(100000), np.random.rand(100000))
])

pharma = gpd.GeoDataFrame([{
        'geometry': Point(a, b),
        'x': a,
        'y': b,
    } for a, b in zip(np.random.rand(3000), np.random.rand(3000))
])

# Create a BallTree 
tree = BallTree(pharma[['x', 'y']].values, leaf_size=2)

# Query the BallTree on each feature from 'appart' to find the distance
# to the nearest 'pharma' and its id
appart['distance_nearest'], appart['id_nearest'] = tree.query(
    appart[['x', 'y']].values, # The input array for the query
    k=1, # The number of nearest neighbors
)

With this method you can solve your problem pretty quickly (the above example, on my computer, took less than a second to find the index of the nearest point, out of 3000 points, on an input dataset of 100000 points).

By default the query method of BallTree is returning the distance to the nearest neighbor and its id.
If you want you can disable returning the distance of this nearest neighbor by setting the return_distance parameter to False.
If you really only care about the distance, you can save only this value :

appart['distance_nearest'], _ = tree.query(appart[['x', 'y']].values, k=1)

Find Closest Points (Lat/Lon) from One Data Set to a Second Data Set