Find closest points (lat / lon) from one data set to a second data set
Here is solution using a single loop and vectorizing the distance calculation (converted to km).
The code is using base R's rank
function to order/sort the list of calculated distances.
The indexes and the calculated distances of the 3 shortest values are store back in data frame A.
library(geosphere)
A = data.frame(longitude = c(-2.3954998, -2.0650243, -2.0650542), latitude = c(55.32043, 55.59062, 55.60859))
B = data.frame(longitude = c(-2.4252843, -2.0650542, -2.0650243), latitude = c(55.15858, 55.60859, 55.59062))
for(i in 1:nrow(A)){
#calucate distance against all of B
distances<-geosphere::distGeo(A[i,], B)/1000
#rank the calculated distances
ranking<-rank(distances, ties.method = "first")
#find the 3 shortest and store the indexes of B back in A
A$shortest[i]<-which(ranking ==1) #Same as which.min()
A$shorter[i]<-which(ranking==2)
A$short[i]<-which(ranking ==3)
#store the distances back in A
A$shortestD[i]<-distances[A$shortest[i]] #Same as min()
A$shorterD[i]<-distances[A$shorter[i]]
A$shortD[i]<-distances[A$short[i]]
}
A
longitude latitude shortest shorter short shortestD shorterD shortD
1 -2.395500 55.32043 1 3 2 18.11777 36.633310 38.28952
2 -2.065024 55.59062 3 2 1 0.00000 2.000682 53.24607
3 -2.065054 55.60859 2 3 1 0.00000 2.000682 55.05710
As M Viking pointed out, for the geosphere package the data must be arranged Lon then Lat.
Finding closest coordinates between two large data sets
I wrote up an answer referencing this thread. The function is modified to take care of reporting the distance and avoid hard-coding. Please note that it calculates Euclidean distance.
library(data.table)
#Euclidean distance
mydist <- function(a, b, df1, x, y){
dt <- data.table(sqrt((df1[[x]]-a)^2 + (df1[[y]]-b)^2))
return(data.table(Closest.V1 = which.min(dt$V1),
Distance = dt[which.min(dt$V1)]))
}
setDT(df1)[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON, setDT(df2),
"HIGH_PRCN_LAT", "HIGH_PRCN_LON"),
by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]
# id HIGH_PRCN_LAT HIGH_PRCN_LON Closest.V1 Distance.V1
# 1: 1 52.88144 -2.873778 5 0.7990743
# 2: 2 57.80945 -2.234544 8 2.1676868
# 3: 4 34.02335 -3.098445 10 1.4758202
# 4: 5 63.80879 -2.439163 3 4.2415854
# 5: 6 53.68881 -7.396112 2 3.6445416
# 6: 7 63.44628 -5.162345 3 2.3577811
# 7: 8 21.60755 -8.633113 9 8.2123762
# 8: 9 78.32444 3.813290 7 11.4936496
# 9: 10 66.85533 -3.994326 1 1.9296370
# 10: 3 51.62354 -8.906553 2 3.2180026
You can use RANN::nn2
, but you need to make sure to use the right syntax. Following works!
as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))
# nn.idx nn.dists
# 1 5 0.7990743
# 2 8 2.1676868
# 3 10 1.4758202
# 4 3 4.2415854
# 5 2 3.6445416
# 6 3 2.3577811
# 7 9 8.2123762
# 8 7 11.4936496
# 9 1 1.9296370
# 10 2 3.2180026
Find nearest point in second datset for every point in first dataset
c(df1[[x]],df1[[y]])
should be changed to df1[, c(y,x), with=FALSE]
because geosphere::distGeo
/ raster::pointDistance
expects a 2-column matrix of points. Also, it expects the points to be supplied as (longitude, latitude) - hence the (x,y) should be flipped to (y,x).
Using raster::pointDistance
:
# Using raster::pointDistance
library(data.table)
library(raster)
library(geosphere)
setDT(df1)
setDT(df2)
mydist <- function(a, b, df1, x, y) {
dt <- data.table(pointDistance(c(b,a), df1[, c(y,x), with=FALSE], lonlat=TRUE))
return (
data.table(Closest.V1 = which.min(dt$V1),
Distance = dt[which.min(dt$V1)])
)
}
df1[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON,
df2, "HIGH_PRCN_LAT", "HIGH_PRCN_LON"),
by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]
Using geosphere::distGeo
(same result as above):
# Using geosphere::distGeo
mydist <- function(a, b, df1, x, y) {
dt <- data.table(distGeo(c(b,a), df1[, c(y,x), with=FALSE]))
return (
data.table(Closest.V1 = which.min(dt$V1),
Distance = dt[which.min(dt$V1)])
)
}
df1[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON,
df2, "HIGH_PRCN_LAT", "HIGH_PRCN_LON"),
by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]
Find nearest points of latitude and longitude from different data sets with different length
Here is an other possible solution:
library(rgeos)
set1sp <- SpatialPoints(set1)
set2sp <- SpatialPoints(set2)
set1$nearest_in_set2 <- apply(gDistance(set1sp, set2sp, byid=TRUE), 1, which.min)
head(set1)
lon lat nearest_in_set2
## 1 13.67111 48.39167 10
## 2 12.86695 48.14806 10
## 3 15.94223 48.72111 10
## 4 11.09974 47.18917 1
## 5 12.95834 47.05444 1
## 6 14.20389 47.12917 1
Finding closest point by comparing two data frames in R
I think you just need to iterate over each within AllStations
and return the nearest of the ContaminantStations
that is within a radius.
func <- function(stations, constations, radius = 250000) {
if (!NROW(stations) || !NROW(constations)) return()
if (length(radius) == 1 && NROW(constations) > 1) {
radius <- rep(radius, NROW(constations))
} else if (length(radius) != NROW(constations)) {
stop("'radius' must be length 1 or the same as the number of rows in 'constations'")
}
out <- integer(NROW(stations))
for (i in seq_len(NROW(stations))) {
dists <- geosphere::distHaversine(stations[i,], constations)
out[i] <- if (any(dists <= radius)) which.min(dists) else 0L
}
return(out)
}
This returns a integer vector, indicating the contaminant station that is closest. If none are within the radius, then it returns 0
. This is safely used as a row-index on the original frame.
Each argument must include only two columns, with the first column being longitude. (I make no assumptions of the column names in the function.) radius
is in meters, consistent with the geosphere
package assumptions.
ind <- func(AllStations[,c("long_coor", "lat_coor")], ContaminantStations[,c("longitude", "latitude")],
radius = 230000)
ind
# [1] 0 6 6 6 0 0 6 6
These are indices on the ContaminantStations
rows, where non-zero means that that contaminant station is the closest to the specific row of AllStations
.
We can identify which contaminant station is closest with this (there are many ways to do this, including tidyverse and other techniques ... this is just a start).
AllStations$ClosestContaminantStation <- NA_character_
AllStations$ClosestContaminantStation[ind > 0] <- ContaminantStations$station[ind]
AllStations
# site_no lat_coor long_coor ClosestContaminantStation
# 1 02110500 33.91267 -78.71502 <NA>
# 2 02110550 33.85083 -78.89722 USGS-021473426
# 3 02110701 33.86100 -79.04115 USGS-021473426
# 4 02110704 33.83295 -79.04365 USGS-021473426
# 5 02110760 33.74073 -78.86669 <NA>
# 6 02110777 33.85156 -78.65585 <NA>
# 7 021108044 33.65017 -79.12310 USGS-021473426
# 8 02110815 33.44461 -79.17393 USGS-021473426
A vis of your data for perspective:
An alternative to this approach would be to return the distance and index of the closest contaminant station, regardless of the radius, allowing you to filter later.
func2 <- function(stations, constations, radius = 250000) {
if (!NROW(stations) || !NROW(constations)) return()
if (length(radius) == 1 && NROW(constations) > 1) {
radius <- rep(radius, NROW(constations))
} else if (length(radius) != NROW(constations)) {
stop("'radius' must be length 1 or the same as the number of rows in 'constations'")
}
out <- data.frame(ind = integer(NROW(stations)), dist = numeric(NROW(stations)))
for (i in seq_len(NROW(stations))) {
dists <- geosphere::distHaversine(stations[i,], constations)
out$ind[i] <- which.min(dists)
out$dist[i] <- min(dists)
}
return(out)
}
Demonstration, including bringing the contaminant station into the same frame.
AllStations2 <- cbind(
AllStations,
func2(AllStations[,c("long_coor", "lat_coor")], ContaminantStations[,c("longitude", "latitude")])
)
AllStations2
# site_no lat_coor long_coor ind dist
# 1 02110500 33.91267 -78.71502 1 241971.5
# 2 02110550 33.85083 -78.89722 6 227650.6
# 3 02110701 33.86100 -79.04115 6 214397.8
# 4 02110704 33.83295 -79.04365 6 214847.7
# 5 02110760 33.74073 -78.86669 6 233190.8
# 6 02110777 33.85156 -78.65585 6 249519.7
# 7 021108044 33.65017 -79.12310 6 213299.3
# 8 02110815 33.44461 -79.17393 6 217378.9
AllStations3 <- cbind(
AllStations2,
ContaminantStations[AllStations2$ind,]
)
AllStations3
# site_no lat_coor long_coor ind dist station latitude longitude
# 1 02110500 33.91267 -78.71502 1 241971.5 USGS-02146110 34.88928 -81.06869
# 6 02110550 33.85083 -78.89722 6 227650.6 USGS-021473426 34.24320 -81.31954
# 6.1 02110701 33.86100 -79.04115 6 214397.8 USGS-021473426 34.24320 -81.31954
# 6.2 02110704 33.83295 -79.04365 6 214847.7 USGS-021473426 34.24320 -81.31954
# 6.3 02110760 33.74073 -78.86669 6 233190.8 USGS-021473426 34.24320 -81.31954
# 6.4 02110777 33.85156 -78.65585 6 249519.7 USGS-021473426 34.24320 -81.31954
# 6.5 021108044 33.65017 -79.12310 6 213299.3 USGS-021473426 34.24320 -81.31954
# 6.6 02110815 33.44461 -79.17393 6 217378.9 USGS-021473426 34.24320 -81.31954
From here, you can choose your radius at will:
subset(AllStations3, dist < 230000)
# site_no lat_coor long_coor ind dist station latitude longitude
# 6 02110550 33.85083 -78.89722 6 227650.6 USGS-021473426 34.2432 -81.31954
# 6.1 02110701 33.86100 -79.04115 6 214397.8 USGS-021473426 34.2432 -81.31954
# 6.2 02110704 33.83295 -79.04365 6 214847.7 USGS-021473426 34.2432 -81.31954
# 6.5 021108044 33.65017 -79.12310 6 213299.3 USGS-021473426 34.2432 -81.31954
# 6.6 02110815 33.44461 -79.17393 6 217378.9 USGS-021473426 34.2432 -81.31954
Find nearest point in other dataframe (WITH A LOT OF DATA)
I guess a Ball Tree is an appropriate structure for this task.
You can use the scikit-learn implementation, see the code below for an example adapted to your case :
import numpy as np
import geopandas as gpd
from shapely.geometry import Point
from sklearn.neighbors import BallTree
## Create the two GeoDataFrame to replicate your dataset
appart = gpd.GeoDataFrame({
'geometry': Point(a, b),
'x': a,
'y': b,
} for a, b in zip(np.random.rand(100000), np.random.rand(100000))
])
pharma = gpd.GeoDataFrame([{
'geometry': Point(a, b),
'x': a,
'y': b,
} for a, b in zip(np.random.rand(3000), np.random.rand(3000))
])
# Create a BallTree
tree = BallTree(pharma[['x', 'y']].values, leaf_size=2)
# Query the BallTree on each feature from 'appart' to find the distance
# to the nearest 'pharma' and its id
appart['distance_nearest'], appart['id_nearest'] = tree.query(
appart[['x', 'y']].values, # The input array for the query
k=1, # The number of nearest neighbors
)
With this method you can solve your problem pretty quickly (the above example, on my computer, took less than a second to find the index of the nearest point, out of 3000 points, on an input dataset of 100000 points).
By default the query
method of BallTree
is returning the distance to the nearest neighbor and its id.
If you want you can disable returning the distance of this nearest neighbor by setting the return_distance
parameter to False
.
If you really only care about the distance, you can save only this value :
appart['distance_nearest'], _ = tree.query(appart[['x', 'y']].values, k=1)
Related Topics
Plot a Function with Several Arguments in R
Combining Grid.Table and Base Package Plots in R Figure
How to Get This Data Structure in R
R Read Abbreviated Month Form a Date That Is Not in English
Annotation_Custom with Npc Coordinates in Ggplot2
Sum Multiple Variables by Group
Add Multiple Curves/Functions to One Ggplot Through Looping
Populate Nas in a Vector Using Prior Non-Na Values
Add a Constant Value to All Rows in a Dataframe
How to Read Column Names 'As Is' from CSV File
Selecting Unique Rows in Matrix Using R
Scraping JavaScript Generated Data
How to Print on a Serie Sof Graphs Pairwise Comparisons Bars and Effect Size Value
Unexpected Date When Converting Posixct Date-Time to Date - Timezone Issue