How to efficiently calculate distance between pair of coordinates using data.table :=
I wrote my own version of geosphere::distHaversine so that it would fit more naturally into a data.table := call, and it might be of use here.
dt.haversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137){
  # convert degrees to radians
  radians <- pi/180
  lat_to <- lat_to * radians
  lat_from <- lat_from * radians
  lon_to <- lon_to * radians
  lon_from <- lon_from * radians
  dLat <- (lat_to - lat_from)
  dLon <- (lon_to - lon_from)
  # haversine term, then great-circle distance in the units of r (metres by default)
  a <- (sin(dLat/2)^2) + (cos(lat_from) * cos(lat_to)) * (sin(dLon/2)^2)
  return(2 * atan2(sqrt(a), sqrt(1 - a)) * r)
}
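As a quick sanity check of the function above, the sketch below repeats the definition so it runs on its own and compares one degree of longitude along the equator with the arc length r * pi / 180. The sample coordinates are made up for illustration; inside a data.table the same call would sit on the right-hand side of :=.

```r
# Same definition as above, repeated so this snippet is self-contained.
dt.haversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137) {
  radians <- pi / 180
  lat_to <- lat_to * radians
  lat_from <- lat_from * radians
  lon_to <- lon_to * radians
  lon_from <- lon_from * radians
  dLat <- lat_to - lat_from
  dLon <- lon_to - lon_from
  a <- sin(dLat / 2)^2 + cos(lat_from) * cos(lat_to) * sin(dLon / 2)^2
  2 * atan2(sqrt(a), sqrt(1 - a)) * r
}

# Illustrative points (not from the original data): one degree of
# longitude along the equator, and a trip of zero length.
lat_from <- c(0, 51.5)
lon_from <- c(0, -0.1)
lat_to   <- c(0, 51.5)
lon_to   <- c(1, -0.1)

# The function is vectorised, so it returns one distance per row.
d <- dt.haversine(lat_from, lon_from, lat_to, lon_to)

# One degree along the equator should equal r * pi / 180 metres.
expected <- 6378137 * pi / 180
print(d)
print(all.equal(d[1], expected))
```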
Update 18/07/2019
You can also write a C++ version through Rcpp.
#include <Rcpp.h>
using namespace Rcpp;

// convert the haversine term d into a distance in metres
double inverseHaversine(double d){
  return 2 * atan2(sqrt(d), sqrt(1 - d)) * 6378137.0;
}

// inputs are in radians
double distanceHaversine(double latf, double lonf, double latt, double lont,
                         double tolerance){
  double d;
  double dlat = latt - latf;
  double dlon = lont - lonf;
  d = (sin(dlat * 0.5) * sin(dlat * 0.5)) + (cos(latf) * cos(latt)) * (sin(dlon * 0.5) * sin(dlon * 0.5));
  // clamp values that exceed 1 (up to the given tolerance)
  if(d > 1 && d <= tolerance){
    d = 1;
  }
  return inverseHaversine(d);
}

double toRadians(double deg){
  return deg * 0.01745329251; // PI / 180;
}

// [[Rcpp::export]]
Rcpp::NumericVector rcpp_distance_haversine(Rcpp::NumericVector latFrom, Rcpp::NumericVector lonFrom,
                                            Rcpp::NumericVector latTo, Rcpp::NumericVector lonTo,
                                            double tolerance) {
  int n = latFrom.size();
  NumericVector distance(n);
  double latf;
  double latt;
  double lonf;
  double lont;
  double dist = 0;
  // one distance per pair of input coordinates
  for(int i = 0; i < n; i++){
    latf = toRadians(latFrom[i]);
    lonf = toRadians(lonFrom[i]);
    latt = toRadians(latTo[i]);
    lont = toRadians(lonTo[i]);
    dist = distanceHaversine(latf, lonf, latt, lont, tolerance);
    distance[i] = dist;
  }
  return distance;
}
Save this file somewhere and use Rcpp::sourceCpp("distance_calcs.cpp") to load the functions into your R session.
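The tolerance argument in the C++ version appears to guard against the haversine term d creeping marginally above 1 through floating-point error, which would make sqrt(1 - d) undefined. Here is a minimal R sketch of the same idea, using a hand-made value rather than real coordinates and pmin() as a stand-in for the tolerance check:

```r
r <- 6378137

# Hypothetical value: a haversine term that rounding has pushed just past 1,
# as could happen for near-antipodal points.
a <- 1 + .Machine$double.eps

# Without a guard, sqrt(1 - a) is the square root of a negative number,
# so the result is NaN (R also emits a warning, suppressed here).
unguarded <- suppressWarnings(2 * atan2(sqrt(a), sqrt(1 - a)) * r)

# Clamping the term to 1 yields half the circumference, pi * r.
a_clamped <- pmin(a, 1)
guarded <- 2 * atan2(sqrt(a_clamped), sqrt(1 - a_clamped)) * r

print(unguarded)
print(guarded)
```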
Here are some benchmarks of how they perform against the original geosphere::distHaversine and geosphere::distGeo. I've made the objects 85k rows just so it's more meaningful.
dt <- rbindlist(list(odmatrix, odmatrix, odmatrix, odmatrix, odmatrix, odmatrix))
dt <- rbindlist(list(dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt))
dt1 <- copy(dt); dt2 <- copy(dt); dt3 <- copy(dt); dt4 <- copy(dt)
library(microbenchmark)
microbenchmark(
  rcpp = {
    dt4[, dist := rcpp_distance_haversine(lat_orig, long_orig, lat_dest, long_dest, tolerance = 10000000000.0)]
  },
  dtHaversine = {
    dt1[, dist := dt.haversine(lat_orig, long_orig, lat_dest, long_dest)]
  },
  haversine = {
    dt2[ , dist := distHaversine(matrix(c(long_orig, lat_orig), ncol = 2),
                                 matrix(c(long_dest, lat_dest), ncol = 2))]
  },
  geo = {
    dt3[ , dist := distGeo(matrix(c(long_orig, lat_orig), ncol = 2),
                           matrix(c(long_dest, lat_dest), ncol = 2))]
  },
  times = 5
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rcpp 5.622847 5.683959 6.208954 5.925277 6.036025 7.776664 5
# dtHaversine 9.024500 12.413380 12.335681 12.992920 13.590566 13.657037 5
# haversine 30.911136 33.628153 52.503700 36.038927 40.791089 121.149197 5
# geo 83.646104 83.971163 88.694377 89.548176 90.569327 95.737117 5
Naturally, due to the way the distances are calculated in the two different techniques (geo & haversine), the results will differ slightly.
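A single pair of points gives a feel for the size of that difference. The sketch below assumes the geosphere package is installed and uses made-up coordinates one degree of latitude apart on the prime meridian; distHaversine treats the Earth as a sphere, while distGeo uses the WGS84 ellipsoid by default.

```r
library(geosphere)

# Illustrative points, given as c(longitude, latitude).
p1 <- c(0, 0)
p2 <- c(0, 1)

d_sphere    <- distHaversine(p1, p2)  # great-circle distance on a sphere
d_ellipsoid <- distGeo(p1, p2)        # geodesic distance on the ellipsoid

# Both results are in metres; the last element is the gap between them.
print(c(haversine = d_sphere, geo = d_ellipsoid,
        difference = d_sphere - d_ellipsoid))
```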
How to efficiently calculate distance between GPS points in one dataset and GPS points in another data set using data.table
Consider a slicing method: first cut by close latitudes and close longitudes, in this case 0.5 degrees of latitude and 0.5 degrees of longitude (which is still about a 60 km disc). We can use data.table's superb support of rolling joins.
The following takes a few milliseconds for 20,000 entries and only a few seconds for 2M entries.
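To make the rolling-join idea concrete, here is a small sketch on toy data (the stops and pts tables and their values are invented for illustration). In this sketch, which leaves rollends at its default, a positive roll carries each value of the keyed table forward by at most that amount, so each point picks up at most one stop: the closest one at or below its latitude, provided it lies within 0.5 degrees.

```r
library(data.table)

# Toy lookup table of stops and a toy table of GPS points.
stops <- data.table(lat = c(10.0, 10.4, 12.0), stop = c("a", "b", "c"))
pts   <- data.table(lat = c(10.3, 11.0), id = 1:2)
setkey(stops, lat)

# For each row of pts, find the stop whose latitude can be rolled forward
# onto the point's latitude by no more than 0.5; drop unmatched points.
matched <- stops[pts, on = "lat", roll = 0.5, nomatch = 0L]

# Point 1 (lat 10.3) matches stop "a" (gap 0.3).
# Point 2 (lat 11.0) is 0.6 above stop "b", so it is dropped.
print(matched)
```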
library(data.table)
library(hutils)
setDT(gpsdata)
setDT(busdata.data)
gps_orig <- copy(gpsdata)
busdata.orig <- copy(busdata.data)
setkey(gpsdata, lat)
# Just to take note of the originals
gpsdata[, gps_lat := lat + 0]
gpsdata[, gps_lon := lon + 0]
busdata.data[, lat := latitude_bustops + 0]
busdata.data[, lon := longitude_bustops + 0]
setkey(busdata.data, lat)
gpsID_by_lat <-
  gpsdata[, .(id), keyby = "lat"]
By_latitude <-
  busdata.data[gpsdata,
               on = "lat",
               # within 0.5 degrees of latitude
               roll = 0.5,
               # +/-
               rollends = c(TRUE, TRUE),
               # and remove those beyond 0.5 degrees
               nomatch=0L] %>%
  .[, .(id_lat = id,
        name_lat = name,
        bus_lat = latitude_bustops,
        bus_lon = longitude_bustops,
        gps_lat,
        gps_lon),
    keyby = .(lon = gps_lon)]
setkey(busdata.data, lon)
By_latlon <-
  busdata.data[By_latitude,
               on = c("name==name_lat", "lon"),
               # within 0.5 degrees of longitude
               roll = 0.5,
               # +/-
               rollends = c(TRUE, TRUE),
               # and remove those beyond 0.5 degrees
               nomatch=0L]
# haversine_distance() comes from the hutils package loaded above
By_latlon[, distance := haversine_distance(lat1 = gps_lat,
                                           lon1 = gps_lon,
                                           lat2 = bus_lat,
                                           lon2 = bus_lon)]
By_latlon[distance < 0.2]
calculate distance between each pair of coordinates in wide dataframe
The problem you're having is that apply(...) coerces the first argument to a matrix. By definition, a matrix must have all elements of the same data type. Since one of the columns in dat (dat$subcounty) is char, apply(...) coerces everything to char. In your test dataset, everything was numeric, so you didn't have this problem.
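The coercion is easy to reproduce on a toy data frame (the column names and values here are invented for illustration): as soon as one column is character, the rows that apply() hands to the function are character vectors.

```r
dat <- data.frame(
  subcounty = c("north", "south"),
  lon = c(36.8, 36.9),
  lat = c(-1.3, -1.2),
  stringsAsFactors = FALSE
)

# apply() converts the data frame to a matrix first, so with a character
# column present every cell becomes a character string.
row_types_all <- apply(dat, 1, function(row) class(row))

# Restricting to the numeric columns keeps the rows numeric.
row_types_num <- apply(dat[, c("lon", "lat")], 1, function(row) class(row))

print(row_types_all)
print(row_types_num)
```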
This should work:
dat$dist.km <- sapply(1:nrow(dat), function(i)
  spDistsN1(as.matrix(dat[i,3:4]), as.matrix(dat[i,5:6]), longlat=TRUE))
Calculating the distance between two long/lat points in the same data.frame
This is easily solved with the distGeo function (similar to your functions above) from the geosphere package:
library(geosphere)
#calculate distances in meters
df$distance<-distGeo(df[,c("lon1", "lat1")], df[,c("lon2", "lat2")])
#remove columns
df[, -c(3:6)]
customer_id id distance
1 353808874 8474 498.2442
2 69516747 8107 668.4088
3 357032052 1617436 366.9541
4 307735090 7698 531.0785
5 307767260 1617491 343.3051
Efficiently Calculate Distance using geosphere package
I have tried this.
dt[, distance := distHaversine(matrix(c(pickup_longitude, pickup_latitude), ncol = 2),
matrix(c(dropoff_longitude, dropoff_latitude), ncol = 2))]
This worked perfectly fine.
How to calculate distance between 2 coordinates below a certain threshold in R?
Generating the whole distance matrix at once would consume a lot of RAM, while looping over each combination of unique zipcodes would be very time consuming. Let's find a compromise.
I suggest chunking the zipcode data.frame into pieces of (for example) 100 rows (with the help of the chunk function from package bit), then calculating distances between 44336 and 100 points, filtering according to the target distance threshold and then moving on to the next data chunk. In my example I convert the zipcode data into a data.table to gain some speed and save RAM.
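The chunk boundaries themselves are simple to derive. The sketch below uses base R as a stand-in for bit::chunk (the helper name chunk_ranges is made up here) and shows how 44336 rows break into pieces of 100.

```r
# Hypothetical helper: start and end row indices for consecutive chunks.
chunk_ranges <- function(n, size) {
  starts <- seq(1, n, by = size)
  ends <- pmin(starts + size - 1, n)  # the last chunk may be shorter
  data.frame(start = starts, end = ends)
}

ranges <- chunk_ranges(44336, 100)

print(nrow(ranges))     # number of chunks
print(tail(ranges, 1))  # the final, shorter chunk
```

Each row of ranges plays the role of one ridx in the lapply() call: the rows from start to end are cross-joined with the full table, filtered, and then discarded before the next chunk is processed.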
library(zipcode)
library(data.table)
library(magrittr)
library(geosphere)
data(zipcode)
setDT(zipcode)
zipcode[, dum := NA] # we'll need it for full outer join
Just for information - that's the approximate size of each piece of data in RAM.
merge(zipcode, zipcode[1:100], by = "dum", allow.cartesian = T) %>%
object.size() %>% print(unit = "Mb")
# 358.2 Mb
The code itself.
lapply(bit::chunk(1, nrow(zipcode), 1e2), function(ridx) {
  merge(zipcode, zipcode[ridx[1]:ridx[2]], by = "dum", allow.cartesian = TRUE)[
    , dist := distGeo(matrix(c(longitude.x, latitude.x), ncol = 2),
                      matrix(c(longitude.y, latitude.y), ncol = 2))/1609.34 # meters to miles
  ][dist <= 5 # necessary distance threshold
  ][, dum := NULL]
}) %>% rbindlist -> zip_nearby_dt
zip_nearby_dt # not the whole! for first 10 chunks only
zip.x city.x state.x latitude.x longitude.x zip.y city.y state.y latitude.y longitude.y dist
1: 00210 Portsmouth NH 43.00590 -71.01320 00210 Portsmouth NH 43.00590 -71.01320 0.000000
2: 00210 Portsmouth NH 43.00590 -71.01320 00211 Portsmouth NH 43.00590 -71.01320 0.000000
3: 00210 Portsmouth NH 43.00590 -71.01320 00212 Portsmouth NH 43.00590 -71.01320 0.000000
4: 00210 Portsmouth NH 43.00590 -71.01320 00213 Portsmouth NH 43.00590 -71.01320 0.000000
5: 00210 Portsmouth NH 43.00590 -71.01320 00214 Portsmouth NH 43.00590 -71.01320 0.000000
---
15252: 02906 Providence RI 41.83635 -71.39427 02771 Seekonk MA 41.84345 -71.32343 3.688747
15253: 02912 Providence RI 41.82674 -71.39770 02771 Seekonk MA 41.84345 -71.32343 4.003095
15254: 02914 East Providence RI 41.81240 -71.36834 02771 Seekonk MA 41.84345 -71.32343 3.156966
15255: 02916 Rumford RI 41.84325 -71.35391 02769 Rehoboth MA 41.83507 -71.26115 4.820599
15256: 02916 Rumford RI 41.84325 -71.35391 02771 Seekonk MA 41.84345 -71.32343 1.573050
On my machine it took 1.7 minutes to process 10 chunks, so the whole processing may take 70-80 minutes: not fast, but it may be acceptable. We can increase the chunk size to 200 or 300 rows depending on the available RAM; this will shorten the processing time by a factor of 2 or 3 respectively.
The drawback of this solution is that the resulting data.table contains "duplicated" rows: there are both distances from point A to point B and from B to A. This may need some additional filtering.
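One possible way to do that filtering, sketched on a toy table with the same zip.x / zip.y column names (the rows are invented): keeping only rows where zip.x sorts before zip.y drops both the self-pairs and the mirrored B-to-A rows. Whether self-pairs should be dropped is an assumption here; use <= to keep them.

```r
pairs <- data.frame(
  zip.x = c("00210", "00210", "00211", "02906", "02771"),
  zip.y = c("00210", "00211", "00210", "02771", "02906"),
  dist  = c(0, 0, 0, 3.69, 3.69),
  stringsAsFactors = FALSE
)

# Keep each unordered pair once: only the row whose first zip sorts
# before its second zip survives.
unique_pairs <- pairs[pairs$zip.x < pairs$zip.y, ]

print(unique_pairs)
```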