Identify Points Within Specified Distance in R

Identify points within specified distance in R

Two approaches.

The first creates a distance matrix using earth.dist(...) in the fossil package, and then takes advantage of data.tables to assemble the table of results.

The second uses distHaversine(...) in the geosphere package to calculate distances and assemble the final colocation table in one step. The latter approach may or may not be faster, but will certainly be more memory efficient, as it never stores the full distance matrix. Also, this approach is amenable to using other distance measures in geosphere, e.g., distVincentySphere(...), distVincentyEllipsoid(...), or distMeeus(...).

Note that the actual distances are slightly different, probably because earth.dist(...) and distHaversine(...) use slightly different estimates for the radius of the earth. Also, note that both approaches here rely on station numbers for IDs. If the stations have names, the code will have to be modified slightly.

First Approach: Using earth.dist(...)

df = read.table(header=T,text="long lat
1 -74.20139 39.82806
2 -74.20194 39.82806
3 -74.20167 39.82806
4 -74.20197 39.82824
5 -74.20150 39.82814
6 -74.26472 39.66639
7 -74.17389 39.87111
8 -74.07224 39.97353
9 -74.07978 39.94554") # your sample data
library(fossil) # for earth.dist(...)
library(data.table)
sep.ft <- 200 # critical separation (feet)
sep.km <- sep.ft*0.0003048 # critical separation (km)
m <- as.matrix(earth.dist(df)) # distance matrix in km
coloc <- data.table(which(m<sep.km, arr.ind=T)) # pairs of stations with dist<200 ft
setnames(coloc,c("row","col"),c("ST.1","ST.2")) # rename columns to reflect station IDs
coloc <- coloc[ST.1<ST.2,] # want only lower triagular part
coloc[,dist:=m[ST.1,ST.2]/0.0003048,by="ST.1,ST.2"] # append distances in feet
remove(m) # don't need distance matrix anymore...
stations <- data.table(id=as.integer(rownames(df)),df)
setkey(stations,id)
setkey(coloc,ST.1)
coloc[stations,c("long.1","lat.1"):=list(long,lat),nomatch=0]
setkey(coloc,ST.2)
coloc[stations,c("long.2","lat.2"):=list(long,lat),nomatch=0]

Produces this:

coloc
# ST.1 ST.2 dist long.1 lat.1 long.2 lat.2
# 1: 1 2 154.13436 -74.20139 39.82806 -74.20194 39.82806
# 2: 1 3 78.46840 -74.20139 39.82806 -74.20167 39.82806
# 3: 2 3 75.66596 -74.20194 39.82806 -74.20167 39.82806
# 4: 1 4 175.31180 -74.20139 39.82806 -74.20197 39.82824
# 5: 2 4 66.22069 -74.20194 39.82806 -74.20197 39.82824
# 6: 3 4 106.69018 -74.20167 39.82806 -74.20197 39.82824
# 7: 1 5 42.45634 -74.20139 39.82806 -74.20150 39.82814
# 8: 2 5 126.71608 -74.20194 39.82806 -74.20150 39.82814
# 9: 3 5 55.87449 -74.20167 39.82806 -74.20150 39.82814
# 10: 4 5 136.67612 -74.20197 39.82824 -74.20150 39.82814

Second Approach: Using distHaversine(...)

library(data.table)
library(geosphere)
sep.ft <- 200 # critical separation (feet)
stations <- data.table(id=as.integer(rownames(df)),df)

d <- function(x){ # distance between station[i] and all subsequent stations
r.ft <- 6378137*3.28084 # radius of the earth, in feet
if (x[1]==nrow(stations)) return() # don't process last row
ref <- stations[(x[1]+1):nrow(stations),]
z <- distHaversine(ref[,2:3,with=F],x[2:3], r=r.ft)
z <- data.table(ST.1=x[1], ST.2=ref$id, dist=z, long.1=x[2], lat.1=x[3], long.2=ref$long, lat.2=ref$lat)
return(z[z$dist<sep.ft,])
}
coloc.2 = do.call(rbind,apply(stations,1,d))

Produces this:

coloc.2
# ST.1 ST.2 dist long.1 lat.1 long.2 lat.2
# 1: 1 2 154.26350 -74.20139 39.82806 -74.20194 39.82806
# 2: 1 3 78.53414 -74.20139 39.82806 -74.20167 39.82806
# 3: 1 4 175.45868 -74.20139 39.82806 -74.20197 39.82824
# 4: 1 5 42.49191 -74.20139 39.82806 -74.20150 39.82814
# 5: 2 3 75.72935 -74.20194 39.82806 -74.20167 39.82806
# 6: 2 4 66.27617 -74.20194 39.82806 -74.20197 39.82824
# 7: 2 5 126.82225 -74.20194 39.82806 -74.20150 39.82814
# 8: 3 4 106.77957 -74.20167 39.82806 -74.20197 39.82824
# 9: 3 5 55.92131 -74.20167 39.82806 -74.20150 39.82814
# 10: 4 5 136.79063 -74.20197 39.82824 -74.20150 39.82814

Identify points from a second data frame within specified distance in R

You can use a for-loop to loop through each site, adding a column to the finds dataframe with true/false (or true/NA as I've used here) for each find.

I've used a cut-off of +/- 10000 in the co-ordinates, and changed the co-ordinates in your example of the finds table so each is in range of one site so you can see how it works.

library(tidyverse)

Data:

sites <- tibble(SiteID = c("HtsOJM-IC3", "HtsCBT-BI1"), 
Easting = c(464870, 438430),
Northing = c(106560, 139000))

finds <- tibble(id = c(1005083, 1005080),
objecttype = c("BROOCH", "BROOCH"),
easting = c(470870, 433430),
northing = c(103560, 142000))

Set cut_off to the max difference between co-ordinates you want to be marked as true

cut_off <- 10000

For each row in sites, a new column is added to finds with a value of True if both the easting and northing co-ordinates are in range of the site ± the value of cut_off.

for (i in 1:nrow(sites)) {
site_ID <- sites[[i,"SiteID"]]
site_easting <- sites[[i, "Easting"]]
site_northing <- sites[[i, "Northing"]]

finds <- finds %>%
mutate(!! site_ID := if_else((between(easting, site_easting - cut_off, site_easting + cut_off) &
between(northing, site_northing - cut_off, site_northing + cut_off)),
T, NA))
}

Output:

# A tibble: 2 x 6
id objecttype easting northing `HtsOJM-IC3` `HtsCBT-BI1`
<dbl> <chr> <dbl> <dbl> <lgl> <lgl>
1 1005083 BROOCH 470870 103560 TRUE NA
2 1005080 BROOCH 433430 142000 NA TRUE

If you want to convert the separate sites columns into a single column, use pivot_wider():

finds %>% 
pivot_longer(cols = starts_with("Hts"), names_to = "SiteID", values_drop_na = T) %>%
select(-value)

Output, note that if any find is within range of more than one site they'll now have multiple rows, 1 per site:

# A tibble: 2 x 5
id objecttype easting northing SiteID
<dbl> <chr> <dbl> <dbl> <chr>
1 1005083 BROOCH 470870 103560 HtsOJM-IC3
2 1005080 BROOCH 433430 142000 HtsCBT-BI1

Find which points are within a given distance of each point

You can also use sp::spDists to return a distance matrix, and then find the elements of each column/row that meet your condition.

For example:

d <- read.table(text='Place   X_UTM        Y_UTM
1 574261.98 6140492.13
2 571251.23 6141669.26
3 570841.92 6142534.86
4 570233.75 6141212.5
5 578269.25 6140303.78
6 575067.07 6137444.36', header=TRUE)
library(sp) 
i <- apply(spDists(as.matrix(d[, c('X_UTM', 'Y_UTM')])), 2,
function(x) paste(which(x < 1000 & x != 0), collapse=', '))

data.frame(Place=d$Place, Closest=i)

## Place Closest
## 1 1
## 2 2 3
## 3 3 2
## 4 4
## 5 5
## 6 6

Find the Number of Points Within a Certain Radius of a Data Point in R

When you ask a question, you should always include some example data. Like this

lat <- c(-23.8, -25.8)
lon <- c(-49.6, -44.6)
hosp <- cbind(lon, lat)

lat <- c(-22.8, -24.8, -29.1, -28, -20)
lon <- c(-46.4, -46.3, -45.3, -40, -30)
procedures <- cbind(lon, lat)

Are your data in longitude/latitude? If so, you need to use a proper method to compute distances. For example

 library(geosphere)
dm <- distm(procedures, hosp)

Or

 library(raster)
d <- pointDistance(procedures, hosp, lonlat=TRUE)

Both compute the distance from all procedures to all hospitals. This will fail with very large datasets, but from what you describe it should work fine.
Now you can use a threshold (here 400,000 m) to find out which procedures are within that distance of each hospital

apply(d < 400000, 2, which)
#[[1]]
#[1] 1 2

#[[2]]
#[1] 1 2 3

So procedure 1, 2 and 3 are within that distance to hospital 2

If your data are not longitude/latitude, you can use

 d <- pointDistance(procedures, hosp, lonlat=FALSE)

Points within a distance from a point

Straight line (Euclidean) distances between a certain set of points vary across different projections.

As discussed here, 3857 is not a great choice for distance calculations. If we use a projection built for Northern France, where the points are located (perhaps EPSG:27561 - NTF (Paris) / Lambert Nord France), we get the expected results.

sdf <- st_as_sf(df, coords = c("x", "y"), 
crs = 4326, agr = "constant")
sdf_proj <- st_transform(sdf, 3857)
sdf_France_proj <- st_transform(sdf, 27561)

Great circle distances

> st_distance(pnt, sdf, by_element = T) %>% sort()
Units: [m]
[1] 0.00000 0.00000 61.50611 101.64007 143.64481 189.43485 253.46142 267.67954 411.50933 478.11306

Mercator (3857) Euclidean distances

> st_distance(pnt_proj, sdf_proj, by_element = T) %>% sort()
Units: [m]
[1] 0.00000 0.00000 93.43522 154.41781 218.22699 287.78463 385.04306 406.64205 625.14635 726.33083

France projection (27561) Euclidean distances

> st_distance(pnt_France_proj, sdf_France_proj, by_element = T) %>% sort()
Units: [m]
[1] 0.00000 0.00000 61.50289 101.63477 143.63731 189.42497 253.44822 267.66559 411.48772 478.08794

And here are the plots with buffers:

library(gridExtra)

buffer_proj <- st_buffer(pnt_proj, 200)
buffer_France_proj <- st_buffer(pnt_France_proj, 200)

proj <- ggplot() + g
eom_sf(data = buffer_proj, fill = "blue", alpha = 0.5) +
geom_sf(data = sdf_proj, color = "red") +
ggtitle("Mercator (3857)")

France_proj <- ggplot() +
geom_sf(data = buffer_France_proj, fill = "blue", alpha = 0.5) +
geom_sf(data = sdf_proj, color = "red") +
ggtitle("France proj (27561)")

grid.arrange(proj, France_proj, ncol = 1)

Sample Image

Find coordinates within radius around many starting points in grid

Example data --- you were using a lon/lat example, but based on your code, I am assuming that you are using planar data.

library(raster)   
r <- raster(nrows=100, ncols=100, xmn=0, xmx=100, ymn=0, ymx=100, crs="+proj=utm +zone=1 +datum=WGS84")
values(r) <- 1:ncell(r) # for display only
xygrid <- as.data.frame(r, xy = TRUE)[,1:2]
locs <- c(8025, 1550, 5075)

dn <- 2.5 # min dist
dx <- 5.5 # max dist

The simplest approach would be to use pointDistance

p <- xyFromCell(r, locs)
d <- pointDistance(xygrid, p, lonlat=FALSE)
u <- unique(which(d>dn & d<dx) %% nrow(d))
pts <- xygrid[u,]

plot(r)
points(pts)

But you will probably run out of memory with that, and it is inefficient to compute all distance. Instead, you may intersect the points with a buffer around the points of interest

b1 <- buffer(SpatialPoints(p, proj4string=crs(r)), dx)
b2 <- buffer(SpatialPoints(p, proj4string=crs(r)), dn)
b <- erase(b1, b2)
x <- intersect(SpatialPoints(xygrid, proj4string=crs(r)), b)

plot(r)
points(x, cex=.5)
points(xyFromCell(r, locs), col="red", pch="x")

With terra it goes like this -- and works well for large datasets in version 1.1-11 that should be on CRAN this week

library(terra)
rr <- rast(r)
pp <- xyFromCell(rr, locs)
bb1 <- buffer(vect(pp), dx)
bb2 <- buffer(vect(pp), dn)
bb <- erase(bb1, bb2)
xx <- intersect(vect(as.matrix(xygrid)), bb)

You can do similar things with sf.

Given that you have so many data points, you might want to start with removing all points that are clearly not of interest

xySel <- lapply(locs, function(i) {
xy <- xygrid[i,]
s <- xygrid[,1] > xy[,1]-dx & xygrid[,1] < xy[,1]+dx & xygrid[,2] > xy[,2]-dx & xygrid[,2] < xy[,2]+dx
xygrid[s,]
})

xySel = do.call(rbind, xySel)
dim(xySel)
# [1] 363 2
dim(xygrid)
#[1] 10000 2

And now you could run pointDistance as above on all data (or else inside the lapply function)

You say that you need to use points, and not a raster. I have seen that idea many times, and 9 out of 10 times that is wrong. Maybe it is true in your case. For others who stumble upon this question, here are are two raster based approaches.

With the raster package you could use extract( ... ,cellnumbers=TRUE) or ajacent. With adjacent, you would first make a weights matrix using one of the buffers made above

buf <- disaggregate(b)[2,]
rb <- crop(r, buf)
w <- as.matrix(rasterize(buf, rb, background=NA) )
w[6,6]=0

And then use the weight matrix like this

a <- adjacent(r, locs, w, pairs=FALSE)

pts <- xyFromCell(r, a)
plot(r)
points(pts)

With terra you could use the cells method

d <- cells(rr, bb)
xy <- xyFromCell(rr, d[,2])
plot(rr)
points(xy, cex=.5)
lines(bb, col="red", lwd=2)

How to calculate the distance between points and a fixed location in R Studio

c(df$longitude,df$latitude) is just one long vector.

In this case you need to pass the columns of the data frame

library(geosphere)    
distGeo((citycenter, df[, c("longitude", "latitude")])

Now the dist() function will compute the distance between the city center location with every row in the data frame.

Assigning points to locations based on distance from coordinate in R

Try breaking down your data into smaller chunks so you don't overload your memory. A reproducible example would be helpful, but I think this gets the job done:

library(sp)
# X1 is a small example and X2 is a large example
X1 <- cbind(pointX = 1:109, pointY = 1:109)
Y1 <- cbind(x = 11:20, y = 11:20)

X2 <- cbind(pointX = 2e4 + sample(2e6), pointY = 2e4 + sample(2e6))
Y2 <- cbind(x = sample(2e4), y = sample(2e4))

nearWrapper = function(X, Y, nBatches = 10){
maxNumber = dim(X)[1]
batchNumbers <- split(0:maxNumber, ceiling(seq_along(0:maxNumber)/nBatches))
out <- numeric(maxNumber)
for(batch in 1:(nBatches+1)){
out[batchNumbers[[batch]]] <- apply(spDists(X[batchNumbers[[batch]],], Y), 1, which.min)
}
return(out)
}
smallOut <- nearWrapper(X1, Y1)
largeOut <- nearWrapper(X2, Y2)

If this takes too long with your data, you might also check out parallel computing (using a foreach loop in place of the for loop in your case).

How to pull points that are within a certain distance away in R?

The euclidean distance of each point on the grid from the target point p can be efficiently computed with:

dist <- sqrt(rowSums(mapply(function(x,y) (x-y)^2, grid, p)))

Basically the inner mapply call will result in a matrix of the same size as grid but that has the squared distance of that point from the target point in that dimension; rowSums and sqrt efficiently then compute the euclidean distance.

In this case you are including anything with sqrt(2) Euclidean distance from the target point:

grid[dist < 1.5,]
# Var1 Var2
# 16 1 4
# 17 2 4
# 18 3 4
# 21 1 5
# 22 2 5
# 23 3 5
# 26 1 6
# 27 2 6
# 28 3 6

The use of mapply (operating over dimensions) and rowSums makes this much more efficient than an approach that loops through individual points on the grid, computing the distance to the target point. To see this, consider a slightly larger example with 1000 randomly distributed points in three dimensions:

set.seed(144)
grid <- data.frame(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
p <- data.frame(x=rnorm(1), y=rnorm(1), z=rnorm(1))
lim <- 1.5
byrow <- function(grid, p, lim) grid[apply(grid, 1, function(x) sqrt(sum((x-p)^2))) < lim,]
vectorized <- function(grid, p, lim) grid[sqrt(rowSums(mapply(function(x,y) (x-y)^2, grid, p))) < lim,]
identical(byrow(grid, p, lim), vectorized(grid, p, lim))
[1] TRUE
library(microbenchmark)
# Unit: microseconds
# expr min lq mean median uq max neval
# byrow(grid, p, lim) 446792.71 473428.137 500680.0431 495824.7765 521185.093 579999.745 10
# vectorized(grid, p, lim) 855.33 881.981 954.1773 907.3805 1081.658 1108.679 10

The vectorized approach is 500 times faster than the approach that loops through the rows.

This approach can be used in cases where you have many more points (1 million in this example):

set.seed(144)
grid <- data.frame(x=rnorm(1000000), y=rnorm(1000000), z=rnorm(1000000))
p <- data.frame(x=rnorm(1), y=rnorm(1), z=rnorm(1))
lim <- 1.5
system.time(vectorized(grid, p, lim))
# user system elapsed
# 3.466 0.136 3.632

find number of points within a radius in R using lon and lat coordinates

To get what you want from nn2, you need to:

  1. Specify the searchtype to be "radius". By not specifying the searchtype, the searchtype is set to standard and the radius input is ignored.
  2. Specify k to be nrow(mydata) because k is the maximum number of nearest neighbors to compute.

The manner in which you are calling nn2 will return the 100 nearest neighbors in mydata for each point in mydata. Consequently, you will get the same result for any radius. For example:

library(RANN)
set.seed(123)

## simulate some data
lon = runif(100, -99.1, -99)
lat = runif(100, 33.9, 34)

## data is a 100 x 2 matrix (can also be data.frame)
mydata <- cbind(lon, lat)

radius <- 0.02 ## your radius
res <- nn2(mydata, k=nrow(mydata), searchtype="radius", radius = radius)
## prints total number of nearest neighbors (for all points) found using "radius"
print(length(which(res$nn.idx>0)))
##[1] 1224
res1 <- nn2(mydata, k=100, radius = radius)
## prints total number of nearest neighbors (for all points) found using your call
print(length(which(res1$nn.idx>0)))
##[1] 10000

radius <- 0.03 ## increase radius
res <- nn2(mydata, k=nrow(mydata), searchtype="radius", radius = radius)
## prints total number of nearest neighbors (for all points) found using "radius"
print(length(which(res$nn.idx>0)))
##[1] 2366
res1 <- nn2(mydata, k=100, radius = radius)
## prints total number of nearest neighbors (for all points) found using your call
print(length(which(res1$nn.idx>0)))
##[1] 10000

Note that according to its documentation ?nn2, nn2 returns a list with two elements:

  1. nn.idx: a nrow(query) x k matrix where each row contains the row indices of the k nearest neighbors in mydata to the point at that row in the collection of query points in query. In both of our calls, query=mydata. When called with searchtype="radius", if there are m < k neighbors within the given radius, then k - m of those index values will be set to 0. Since the set of query points is the same as mydata, the index to the point itself will be included.

  2. nn.dist: a nrow(query) x k matrix where each element contains the Euclidean distances for the corresponding nearest neighbor in nn.idx. Here, if the corresponding element in nn.idx is 0, then the value in nn.dist is set to 1.340781e+154.

With your call, you get 100 nearest neighbors for each point in mydata, hence length(which(res1$nn.idx>0))==10000 no matter what the radius is in the example.

Finally, you should note that because nn2 returns two nrow(mydata) x nrow(mydata) in your case, it can very easily overwhelm your memory if you have a lot of points.


Updated to specifically produce the result of getting the count of neighbors within a given radius.

To compute the number of neighbors within a radius of each point in the data, call nn2 as such

res <- nn2(mydata, k=nrow(mydata), searchtype="radius", radius = radius)

Then, do this:

count <- rowSums(res$nn.idx > 0) - 1

Notes:

  1. Since query=mydata and k=nrow(mydata) the resulting res$nn.idx will be nrow(mydata) x nrow(mydata) as explained above.
  2. Each row i of res$nn.idx corresponds to row i in query, which is the i-th point in query=mydata. Call this i-th point p[i].
  3. Each row i of res$nn.idx contains the row indices of the neighbors of p[i] AND zeroes to fill that row in the matrix (because not all points in mydata will be within the radius of p[i]).
  4. Therefore, the number of neighbors of p[i] can be found by finding those values in the row that are greater than 0 and then counting them. This is what rowSums(res$nn.idx > 0) does for each row of res$nn.idx.
  5. Lastly, because query=mydata, the point being queried is in the count itself. That is, a point is the nearest neighbor to itself. Therefore subtract 1 from these counts to exclude that.

The resulting count will be a vector of the counts. The i-th element is the number of neighbors within the radius to the i-th point in query=mydata.

Hope this is clear.



Related Topics



Leave a reply



Submit