Identify points within specified distance in R
Two approaches.
The first creates a distance matrix using earth.dist(...)
in the fossil
package, and then takes advantage of data.tables to assemble the table of results.
The second uses distHaversine(...)
in the geosphere
package to calculate distances and assemble the final colocation table in one step. The latter approach may or may not be faster, but will certainly be more memory efficient, as it never stores the full distance matrix. Also, this approach is amenable to using other distance measures in geosphere
, e.g., distVincentySphere(...)
, distVincentyEllipsoid(...)
, or distMeeus(...)
.
Note that the actual distances are slightly different, probably because earth.dist(...)
and distHaversine(...)
use slightly different estimates for the radius of the earth. Also, note that both approaches here rely on station numbers for IDs. If the stations have names, the code will have to be modified slightly.
First Approach: Using earth.dist(...)
df = read.table(header=T,text="long lat
1 -74.20139 39.82806
2 -74.20194 39.82806
3 -74.20167 39.82806
4 -74.20197 39.82824
5 -74.20150 39.82814
6 -74.26472 39.66639
7 -74.17389 39.87111
8 -74.07224 39.97353
9 -74.07978 39.94554") # your sample data
library(fossil) # for earth.dist(...)
library(data.table)
sep.ft <- 200 # critical separation (feet)
sep.km <- sep.ft*0.0003048 # critical separation (km)
m <- as.matrix(earth.dist(df)) # distance matrix in km
coloc <- data.table(which(m<sep.km, arr.ind=T)) # pairs of stations with dist<200 ft
setnames(coloc,c("row","col"),c("ST.1","ST.2")) # rename columns to reflect station IDs
coloc <- coloc[ST.1<ST.2,] # want only lower triagular part
coloc[,dist:=m[ST.1,ST.2]/0.0003048,by="ST.1,ST.2"] # append distances in feet
remove(m) # don't need distance matrix anymore...
stations <- data.table(id=as.integer(rownames(df)),df)
setkey(stations,id)
setkey(coloc,ST.1)
coloc[stations,c("long.1","lat.1"):=list(long,lat),nomatch=0]
setkey(coloc,ST.2)
coloc[stations,c("long.2","lat.2"):=list(long,lat),nomatch=0]
Produces this:
coloc
# ST.1 ST.2 dist long.1 lat.1 long.2 lat.2
# 1: 1 2 154.13436 -74.20139 39.82806 -74.20194 39.82806
# 2: 1 3 78.46840 -74.20139 39.82806 -74.20167 39.82806
# 3: 2 3 75.66596 -74.20194 39.82806 -74.20167 39.82806
# 4: 1 4 175.31180 -74.20139 39.82806 -74.20197 39.82824
# 5: 2 4 66.22069 -74.20194 39.82806 -74.20197 39.82824
# 6: 3 4 106.69018 -74.20167 39.82806 -74.20197 39.82824
# 7: 1 5 42.45634 -74.20139 39.82806 -74.20150 39.82814
# 8: 2 5 126.71608 -74.20194 39.82806 -74.20150 39.82814
# 9: 3 5 55.87449 -74.20167 39.82806 -74.20150 39.82814
# 10: 4 5 136.67612 -74.20197 39.82824 -74.20150 39.82814
Second Approach: Using distHaversine(...)
library(data.table)
library(geosphere)
sep.ft <- 200 # critical separation (feet)
stations <- data.table(id=as.integer(rownames(df)),df)
d <- function(x){ # distance between station[i] and all subsequent stations
r.ft <- 6378137*3.28084 # radius of the earth, in feet
if (x[1]==nrow(stations)) return() # don't process last row
ref <- stations[(x[1]+1):nrow(stations),]
z <- distHaversine(ref[,2:3,with=F],x[2:3], r=r.ft)
z <- data.table(ST.1=x[1], ST.2=ref$id, dist=z, long.1=x[2], lat.1=x[3], long.2=ref$long, lat.2=ref$lat)
return(z[z$dist<sep.ft,])
}
coloc.2 = do.call(rbind,apply(stations,1,d))
Produces this:
coloc.2
# ST.1 ST.2 dist long.1 lat.1 long.2 lat.2
# 1: 1 2 154.26350 -74.20139 39.82806 -74.20194 39.82806
# 2: 1 3 78.53414 -74.20139 39.82806 -74.20167 39.82806
# 3: 1 4 175.45868 -74.20139 39.82806 -74.20197 39.82824
# 4: 1 5 42.49191 -74.20139 39.82806 -74.20150 39.82814
# 5: 2 3 75.72935 -74.20194 39.82806 -74.20167 39.82806
# 6: 2 4 66.27617 -74.20194 39.82806 -74.20197 39.82824
# 7: 2 5 126.82225 -74.20194 39.82806 -74.20150 39.82814
# 8: 3 4 106.77957 -74.20167 39.82806 -74.20197 39.82824
# 9: 3 5 55.92131 -74.20167 39.82806 -74.20150 39.82814
# 10: 4 5 136.79063 -74.20197 39.82824 -74.20150 39.82814
Identify points from a second data frame within specified distance in R
You can use a for-loop to loop through each site, adding a column to the finds
dataframe with true/false (or true/NA as I've used here) for each find.
I've used a cut-off of +/- 10000 in the co-ordinates, and changed the co-ordinates in your example of the finds table so each is in range of one site so you can see how it works.
library(tidyverse)
Data:
sites <- tibble(SiteID = c("HtsOJM-IC3", "HtsCBT-BI1"),
Easting = c(464870, 438430),
Northing = c(106560, 139000))
finds <- tibble(id = c(1005083, 1005080),
objecttype = c("BROOCH", "BROOCH"),
easting = c(470870, 433430),
northing = c(103560, 142000))
Set cut_off
to the max difference between co-ordinates you want to be marked as true
cut_off <- 10000
For each row in sites
, a new column is added to finds
with a value of True
if both the easting and northing co-ordinates are in range of the site ± the value of cut_off
.
for (i in 1:nrow(sites)) {
site_ID <- sites[[i,"SiteID"]]
site_easting <- sites[[i, "Easting"]]
site_northing <- sites[[i, "Northing"]]
finds <- finds %>%
mutate(!! site_ID := if_else((between(easting, site_easting - cut_off, site_easting + cut_off) &
between(northing, site_northing - cut_off, site_northing + cut_off)),
T, NA))
}
Output:
# A tibble: 2 x 6
id objecttype easting northing `HtsOJM-IC3` `HtsCBT-BI1`
<dbl> <chr> <dbl> <dbl> <lgl> <lgl>
1 1005083 BROOCH 470870 103560 TRUE NA
2 1005080 BROOCH 433430 142000 NA TRUE
If you want to convert the separate sites columns into a single column, use pivot_wider()
:
finds %>%
pivot_longer(cols = starts_with("Hts"), names_to = "SiteID", values_drop_na = T) %>%
select(-value)
Output, note that if any find is within range of more than one site they'll now have multiple rows, 1 per site:
# A tibble: 2 x 5
id objecttype easting northing SiteID
<dbl> <chr> <dbl> <dbl> <chr>
1 1005083 BROOCH 470870 103560 HtsOJM-IC3
2 1005080 BROOCH 433430 142000 HtsCBT-BI1
Find which points are within a given distance of each point
You can also use sp::spDists
to return a distance matrix, and then find the elements of each column/row that meet your condition.
For example:
d <- read.table(text='Place X_UTM Y_UTM
1 574261.98 6140492.13
2 571251.23 6141669.26
3 570841.92 6142534.86
4 570233.75 6141212.5
5 578269.25 6140303.78
6 575067.07 6137444.36', header=TRUE)
library(sp)
i <- apply(spDists(as.matrix(d[, c('X_UTM', 'Y_UTM')])), 2,
function(x) paste(which(x < 1000 & x != 0), collapse=', '))
data.frame(Place=d$Place, Closest=i)
## Place Closest
## 1 1
## 2 2 3
## 3 3 2
## 4 4
## 5 5
## 6 6
Find the Number of Points Within a Certain Radius of a Data Point in R
When you ask a question, you should always include some example data. Like this
lat <- c(-23.8, -25.8)
lon <- c(-49.6, -44.6)
hosp <- cbind(lon, lat)
lat <- c(-22.8, -24.8, -29.1, -28, -20)
lon <- c(-46.4, -46.3, -45.3, -40, -30)
procedures <- cbind(lon, lat)
Are your data in longitude/latitude? If so, you need to use a proper method to compute distances. For example
library(geosphere)
dm <- distm(procedures, hosp)
Or
library(raster)
d <- pointDistance(procedures, hosp, lonlat=TRUE)
Both compute the distance from all procedures to all hospitals. This will fail with very large datasets, but from what you describe it should work fine.
Now you can use a threshold (here 400,000 m) to find out which procedures are within that distance of each hospital
apply(d < 400000, 2, which)
#[[1]]
#[1] 1 2
#[[2]]
#[1] 1 2 3
So procedure 1, 2 and 3 are within that distance to hospital 2
If your data are not longitude/latitude, you can use
d <- pointDistance(procedures, hosp, lonlat=FALSE)
Points within a distance from a point
Straight line (Euclidean) distances between a certain set of points vary across different projections.
As discussed here, 3857 is not a great choice for distance calculations. If we use a projection built for Northern France, where the points are located (perhaps EPSG:27561 - NTF (Paris) / Lambert Nord France), we get the expected results.
sdf <- st_as_sf(df, coords = c("x", "y"),
crs = 4326, agr = "constant")
sdf_proj <- st_transform(sdf, 3857)
sdf_France_proj <- st_transform(sdf, 27561)
Great circle distances
> st_distance(pnt, sdf, by_element = T) %>% sort()
Units: [m]
[1] 0.00000 0.00000 61.50611 101.64007 143.64481 189.43485 253.46142 267.67954 411.50933 478.11306
Mercator (3857) Euclidean distances
> st_distance(pnt_proj, sdf_proj, by_element = T) %>% sort()
Units: [m]
[1] 0.00000 0.00000 93.43522 154.41781 218.22699 287.78463 385.04306 406.64205 625.14635 726.33083
France projection (27561) Euclidean distances
> st_distance(pnt_France_proj, sdf_France_proj, by_element = T) %>% sort()
Units: [m]
[1] 0.00000 0.00000 61.50289 101.63477 143.63731 189.42497 253.44822 267.66559 411.48772 478.08794
And here are the plots with buffers:
library(gridExtra)
buffer_proj <- st_buffer(pnt_proj, 200)
buffer_France_proj <- st_buffer(pnt_France_proj, 200)
proj <- ggplot() + g
eom_sf(data = buffer_proj, fill = "blue", alpha = 0.5) +
geom_sf(data = sdf_proj, color = "red") +
ggtitle("Mercator (3857)")
France_proj <- ggplot() +
geom_sf(data = buffer_France_proj, fill = "blue", alpha = 0.5) +
geom_sf(data = sdf_proj, color = "red") +
ggtitle("France proj (27561)")
grid.arrange(proj, France_proj, ncol = 1)
Find coordinates within radius around many starting points in grid
Example data --- you were using a lon/lat example, but based on your code, I am assuming that you are using planar data.
library(raster)
r <- raster(nrows=100, ncols=100, xmn=0, xmx=100, ymn=0, ymx=100, crs="+proj=utm +zone=1 +datum=WGS84")
values(r) <- 1:ncell(r) # for display only
xygrid <- as.data.frame(r, xy = TRUE)[,1:2]
locs <- c(8025, 1550, 5075)
dn <- 2.5 # min dist
dx <- 5.5 # max dist
The simplest approach would be to use pointDistance
p <- xyFromCell(r, locs)
d <- pointDistance(xygrid, p, lonlat=FALSE)
u <- unique(which(d>dn & d<dx) %% nrow(d))
pts <- xygrid[u,]
plot(r)
points(pts)
But you will probably run out of memory with that, and it is inefficient to compute all distance. Instead, you may intersect the points with a buffer around the points of interest
b1 <- buffer(SpatialPoints(p, proj4string=crs(r)), dx)
b2 <- buffer(SpatialPoints(p, proj4string=crs(r)), dn)
b <- erase(b1, b2)
x <- intersect(SpatialPoints(xygrid, proj4string=crs(r)), b)
plot(r)
points(x, cex=.5)
points(xyFromCell(r, locs), col="red", pch="x")
With terra
it goes like this -- and works well for large datasets in version 1.1-11 that should be on CRAN this week
library(terra)
rr <- rast(r)
pp <- xyFromCell(rr, locs)
bb1 <- buffer(vect(pp), dx)
bb2 <- buffer(vect(pp), dn)
bb <- erase(bb1, bb2)
xx <- intersect(vect(as.matrix(xygrid)), bb)
You can do similar things with sf
.
Given that you have so many data points, you might want to start with removing all points that are clearly not of interest
xySel <- lapply(locs, function(i) {
xy <- xygrid[i,]
s <- xygrid[,1] > xy[,1]-dx & xygrid[,1] < xy[,1]+dx & xygrid[,2] > xy[,2]-dx & xygrid[,2] < xy[,2]+dx
xygrid[s,]
})
xySel = do.call(rbind, xySel)
dim(xySel)
# [1] 363 2
dim(xygrid)
#[1] 10000 2
And now you could run pointDistance
as above on all data (or else inside the lapply function)
You say that you need to use points, and not a raster. I have seen that idea many times, and 9 out of 10 times that is wrong. Maybe it is true in your case. For others who stumble upon this question, here are are two raster based approaches.
With the raster package you could use extract( ... ,cellnumbers=TRUE)
or ajacent
. With adjacent, you would first make a weights matrix using one of the buffers made above
buf <- disaggregate(b)[2,]
rb <- crop(r, buf)
w <- as.matrix(rasterize(buf, rb, background=NA) )
w[6,6]=0
And then use the weight matrix like this
a <- adjacent(r, locs, w, pairs=FALSE)
pts <- xyFromCell(r, a)
plot(r)
points(pts)
With terra
you could use the cells
method
d <- cells(rr, bb)
xy <- xyFromCell(rr, d[,2])
plot(rr)
points(xy, cex=.5)
lines(bb, col="red", lwd=2)
How to calculate the distance between points and a fixed location in R Studio
c(df$longitude,df$latitude)
is just one long vector.
In this case you need to pass the columns of the data frame
library(geosphere)
distGeo((citycenter, df[, c("longitude", "latitude")])
Now the dist() function will compute the distance between the city center location with every row in the data frame.
Assigning points to locations based on distance from coordinate in R
Try breaking down your data into smaller chunks so you don't overload your memory. A reproducible example would be helpful, but I think this gets the job done:
library(sp)
# X1 is a small example and X2 is a large example
X1 <- cbind(pointX = 1:109, pointY = 1:109)
Y1 <- cbind(x = 11:20, y = 11:20)
X2 <- cbind(pointX = 2e4 + sample(2e6), pointY = 2e4 + sample(2e6))
Y2 <- cbind(x = sample(2e4), y = sample(2e4))
nearWrapper = function(X, Y, nBatches = 10){
maxNumber = dim(X)[1]
batchNumbers <- split(0:maxNumber, ceiling(seq_along(0:maxNumber)/nBatches))
out <- numeric(maxNumber)
for(batch in 1:(nBatches+1)){
out[batchNumbers[[batch]]] <- apply(spDists(X[batchNumbers[[batch]],], Y), 1, which.min)
}
return(out)
}
smallOut <- nearWrapper(X1, Y1)
largeOut <- nearWrapper(X2, Y2)
If this takes too long with your data, you might also check out parallel computing (using a foreach
loop in place of the for
loop in your case).
How to pull points that are within a certain distance away in R?
The euclidean distance of each point on the grid from the target point p
can be efficiently computed with:
dist <- sqrt(rowSums(mapply(function(x,y) (x-y)^2, grid, p)))
Basically the inner mapply
call will result in a matrix of the same size as grid
but that has the squared distance of that point from the target point in that dimension; rowSums
and sqrt
efficiently then compute the euclidean distance.
In this case you are including anything with sqrt(2)
Euclidean distance from the target point:
grid[dist < 1.5,]
# Var1 Var2
# 16 1 4
# 17 2 4
# 18 3 4
# 21 1 5
# 22 2 5
# 23 3 5
# 26 1 6
# 27 2 6
# 28 3 6
The use of mapply
(operating over dimensions) and rowSums
makes this much more efficient than an approach that loops through individual points on the grid, computing the distance to the target point. To see this, consider a slightly larger example with 1000 randomly distributed points in three dimensions:
set.seed(144)
grid <- data.frame(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
p <- data.frame(x=rnorm(1), y=rnorm(1), z=rnorm(1))
lim <- 1.5
byrow <- function(grid, p, lim) grid[apply(grid, 1, function(x) sqrt(sum((x-p)^2))) < lim,]
vectorized <- function(grid, p, lim) grid[sqrt(rowSums(mapply(function(x,y) (x-y)^2, grid, p))) < lim,]
identical(byrow(grid, p, lim), vectorized(grid, p, lim))
[1] TRUE
library(microbenchmark)
# Unit: microseconds
# expr min lq mean median uq max neval
# byrow(grid, p, lim) 446792.71 473428.137 500680.0431 495824.7765 521185.093 579999.745 10
# vectorized(grid, p, lim) 855.33 881.981 954.1773 907.3805 1081.658 1108.679 10
The vectorized approach is 500 times faster than the approach that loops through the rows.
This approach can be used in cases where you have many more points (1 million in this example):
set.seed(144)
grid <- data.frame(x=rnorm(1000000), y=rnorm(1000000), z=rnorm(1000000))
p <- data.frame(x=rnorm(1), y=rnorm(1), z=rnorm(1))
lim <- 1.5
system.time(vectorized(grid, p, lim))
# user system elapsed
# 3.466 0.136 3.632
find number of points within a radius in R using lon and lat coordinates
To get what you want from nn2
, you need to:
- Specify the
searchtype
to be"radius"
. By not specifying thesearchtype
, thesearchtype
is set tostandard
and theradius
input is ignored. - Specify
k
to benrow(mydata)
becausek
is the maximum number of nearest neighbors to compute.
The manner in which you are calling nn2
will return the 100
nearest neighbors in mydata
for each point in mydata
. Consequently, you will get the same result for any radius
. For example:
library(RANN)
set.seed(123)
## simulate some data
lon = runif(100, -99.1, -99)
lat = runif(100, 33.9, 34)
## data is a 100 x 2 matrix (can also be data.frame)
mydata <- cbind(lon, lat)
radius <- 0.02 ## your radius
res <- nn2(mydata, k=nrow(mydata), searchtype="radius", radius = radius)
## prints total number of nearest neighbors (for all points) found using "radius"
print(length(which(res$nn.idx>0)))
##[1] 1224
res1 <- nn2(mydata, k=100, radius = radius)
## prints total number of nearest neighbors (for all points) found using your call
print(length(which(res1$nn.idx>0)))
##[1] 10000
radius <- 0.03 ## increase radius
res <- nn2(mydata, k=nrow(mydata), searchtype="radius", radius = radius)
## prints total number of nearest neighbors (for all points) found using "radius"
print(length(which(res$nn.idx>0)))
##[1] 2366
res1 <- nn2(mydata, k=100, radius = radius)
## prints total number of nearest neighbors (for all points) found using your call
print(length(which(res1$nn.idx>0)))
##[1] 10000
Note that according to its documentation ?nn2
, nn2
returns a list with two elements:
nn.idx
: anrow(query) x k
matrix where each row contains the row indices of thek
nearest neighbors inmydata
to the point at that row in the collection of query points inquery
. In both of our calls,query=mydata
. When called withsearchtype="radius"
, if there arem < k
neighbors within the given radius, thenk - m
of those index values will be set to0
. Since the set of query points is the same asmydata
, the index to the point itself will be included.nn.dist
: anrow(query) x k
matrix where each element contains the Euclidean distances for the corresponding nearest neighbor innn.idx
. Here, if the corresponding element innn.idx
is0
, then the value innn.dist
is set to 1.340781e+154.
With your call, you get 100
nearest neighbors for each point in mydata
, hence length(which(res1$nn.idx>0))==10000
no matter what the radius
is in the example.
Finally, you should note that because nn2
returns two nrow(mydata) x nrow(mydata)
in your case, it can very easily overwhelm your memory if you have a lot of points.
Updated to specifically produce the result of getting the count of neighbors within a given radius.
To compute the number of neighbors within a radius
of each point in the data, call nn2
as such
res <- nn2(mydata, k=nrow(mydata), searchtype="radius", radius = radius)
Then, do this:
count <- rowSums(res$nn.idx > 0) - 1
Notes:
- Since
query=mydata
andk=nrow(mydata)
the resultingres$nn.idx
will benrow(mydata) x nrow(mydata)
as explained above. - Each row
i
ofres$nn.idx
corresponds to rowi
inquery
, which is thei
-th point inquery=mydata
. Call thisi
-th pointp[i]
. - Each row
i
ofres$nn.idx
contains the row indices of the neighbors ofp[i]
AND zeroes to fill that row in the matrix (because not all points inmydata
will be within theradius
ofp[i]
). - Therefore, the number of neighbors of
p[i]
can be found by finding those values in the row that are greater than0
and then counting them. This is whatrowSums(res$nn.idx > 0)
does for each row ofres$nn.idx
. - Lastly, because
query=mydata
, the point being queried is in the count itself. That is, a point is the nearest neighbor to itself. Therefore subtract1
from these counts to exclude that.
The resulting count
will be a vector of the counts. The i
-th element is the number of neighbors within the radius
to the i
-th point in query=mydata
.
Hope this is clear.
Related Topics
Increase the API Limit in Ggmap's Geocode Function (In R)
R Scoping: Disallow Global Variables in Function
Connecting Points with Lines in Ggplot2 in R
Remove Fill Around Legend Key in Ggplot
R Shiny Checkboxgroupinput - Select All Checkboxes by Click
Cast Function Argument as a Character String
Adding Lists Names as Plot Titles in Lapply Call in R
Group by Columns and Summarize a Column into a List
Adding S4 Dispatch to Base R S3 Generic
How to Change Knitr Options Mid Chunk
How to Plot Logit and Probit in Ggplot2
Reading Excel File: How to Find the Start Cell in Messy Spreadsheets
In R, What Does "Loaded via a Namespace (And Not Attached)" Mean
R Table Function: How to Sum Instead of Counting
Ggplot2: Geom_Text Resize with the Plot and Force/Fit Text Within Geom_Bar