Using the Geosphere Distm Function on a Data.Table to Calculate Distances

Efficiently Calculate Distance using geosphere package

I have tried this.

dt[, distance := distHaversine(matrix(c(pickup_longitude, pickup_latitude), ncol = 2),
                        matrix(c(dropoff_longitude, dropoff_latitude), ncol = 2))]

This worked perfectly fine.

Use the geosphere::distm in a more efficient way?

If you want a smaller matrix ,consider splitting data with a grid based on longitude and/or latitude. For example, this produces two new columns with labels for a 5 x 5 grid.

#converting your example data to a tibble.
locations_data<-tibble::as_tibble(locations_data)
#create a numeric grid spanning the extent of your latitude and longitude
locations_data$long_quant<-findInterval(locations_data$longitude, quantile(locations_data$longitude,probs = seq(0,1,.2)), rightmost.closed=TRUE)
locations_data$lat_quant<-findInterval(locations_data$latitude, quantile(locations_data$latitude,probs = seq(0,1,.2)), rightmost.closed=TRUE)

You could then create multiple smaller matrices using a subset of locations_data.

Calculating distance with geosphere::distGeo() within dyplr::mutate()

I changed the code to

geo <- geo %>%
  mutate(dist_km = distGEO(
    p1 = cbind(long_pseudonym, lat_pseudonym),
    p2 = cbind(long_pseudonym2, lat_pseudonym2)) / 1000
  )

and it works very well.
For 240 million rows it took less than 8 minutes.
The trick is the cbind command before distGeo.

How to efficiently calculate distance between pair of coordinates using data.table :=

I wrote my own version of geosphere::distHaversine so that it would more naturally fit into a data.table := call, and it might be of use here

dt.haversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137){
  radians <- pi/180
  lat_to <- lat_to * radians
  lat_from <- lat_from * radians
  lon_to <- lon_to * radians
  lon_from <- lon_from * radians
  dLat <- (lat_to - lat_from)
  dLon <- (lon_to - lon_from)
  a <- (sin(dLat/2)^2) + (cos(lat_from) * cos(lat_to)) * (sin(dLon/2)^2)
  return(2 * atan2(sqrt(a), sqrt(1 - a)) * r)
}

Update 18/07/2019

You can also write a C++ version through Rcpp.

#include <Rcpp.h>
using namespace Rcpp;

double inverseHaversine(double d){
  return 2 * atan2(sqrt(d), sqrt(1 - d)) * 6378137.0;
}

double distanceHaversine(double latf, double lonf, double latt, double lont,
                         double tolerance){
  double d;
  double dlat = latt - latf;
  double dlon =  lont - lonf;
  
  d = (sin(dlat * 0.5) * sin(dlat * 0.5)) + (cos(latf) * cos(latt)) * (sin(dlon * 0.5) * sin(dlon * 0.5));
  if(d > 1 && d <= tolerance){
    d = 1;
  }
  return inverseHaversine(d);
}

double toRadians(double deg){
  return deg * 0.01745329251;  // PI / 180;
}

// [[Rcpp::export]]
Rcpp::NumericVector rcpp_distance_haversine(Rcpp::NumericVector latFrom, Rcpp::NumericVector lonFrom, 
                        Rcpp::NumericVector latTo, Rcpp::NumericVector lonTo,
                        double tolerance) {
  
  int n = latFrom.size();
  NumericVector distance(n);
  
  double latf;
  double latt;
  double lonf;
  double lont;
  double dist = 0;
  
  for(int i = 0; i < n; i++){
    
    latf = toRadians(latFrom[i]);
    lonf = toRadians(lonFrom[i]);
    latt = toRadians(latTo[i]);
    lont = toRadians(lonTo[i]);
    dist = distanceHaversine(latf, lonf, latt, lont, tolerance);
    
    distance[i] = dist;
  }
  return distance;
}

Save this file somewhere and use Rcpp::sourceCpp("distance_calcs.cpp") to load the functions into your R session.

Here are some benchmarks on how they performs against the original geosphere::distHaversine, and geosphere::distGeo

I've made the objects 85k rows just so it's more meaningful

dt <- rbindlist(list(odmatrix, odmatrix, odmatrix, odmatrix, odmatrix, odmatrix))
dt <- rbindlist(list(dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt, dt))

dt1 <- copy(dt); dt2 <- copy(dt); dt3 <- copy(dt); dt4 <- copy(dt)

library(microbenchmark)

microbenchmark(
  
  rcpp = {
    dt4[, dist := rcpp_distance_haversine(lat_orig, long_orig, lat_dest, long_dest, tolerance = 10000000000.0)]
  },
  
  dtHaversine = {
    dt1[, dist := dt.haversine(lat_orig, long_orig, lat_dest, long_dest)]
  }   ,
  
  haversine = {
    dt2[ , dist := distHaversine(matrix(c(long_orig, lat_orig), ncol = 2), 
                                 matrix(c(long_dest, lat_dest), ncol = 2))]
  },
  
  geo = {
    dt3[ , dist := distGeo(matrix(c(long_orig, lat_orig), ncol = 2), 
                           matrix(c(long_dest, lat_dest), ncol = 2))]
  },
  times = 5
)

# Unit: milliseconds
#       expr       min        lq      mean    median        uq        max neval
#        rcpp  5.622847  5.683959  6.208954  5.925277  6.036025   7.776664     5
# dtHaversine  9.024500 12.413380 12.335681 12.992920 13.590566  13.657037     5
#   haversine 30.911136 33.628153 52.503700 36.038927 40.791089 121.149197     5
#         geo 83.646104 83.971163 88.694377 89.548176 90.569327  95.737117     5

Naturally, due to the way the distances are calculated in the two different techniques (geo & haversine), the results will differ slightly.

Geographical distance by group - Applying a function on each pair of rows

My initial idea was to look at the source code of distHaversine and replicate it in a function that I would use with proxy.
That would work like this (note that lon is expected to be the first column):

library(geosphere)
library(dplyr)
library(proxy)

df1 <- data.frame(province = as.integer(c(1, 1, 1, 2, 2, 2)),
                  house = as.integer(c(1, 2, 3, 4, 5, 6)),
                  lat = c(-76.6, -76.5, -76.4, -75.4, -80.9, -85.7), 
                  lon = c(39.2, 39.1, 39.3, 60.8, 53.3, 40.2))

custom_haversine <- function(x, y) {
  toRad <- pi / 180

  diff <- (y - x) * toRad
  dLon <- diff[1L]
  dLat <- diff[2L]

  a <- sin(dLat / 2) ^ 2 + cos(x[2L] * toRad) * cos(y[2L] * toRad) * sin(dLon / 2) ^ 2
  a <- min(a, 1)
  # return
  2 * atan2(sqrt(a), sqrt(1 - a)) * 6378137
}

pr_DB$set_entry(FUN=custom_haversine, names="haversine", loop=TRUE, distance=TRUE)

average_dist <- df1 %>%
  select(-house) %>%
  group_by(province) %>%
  group_map(~ data.frame(avg=mean(proxy::dist(.x[ , c("lon", "lat")], method="haversine"))))

However, if you're expecting millions of rows per province,
proxy probably won't be able to allocate the intermediate (lower triangular of the) matrices.
So I ported the code to C++ and added multi-threading as a bonus:

EDIT: turns out the s2d helper was far from optimal,
this version now uses the formulas given here.

EDIT2: I just found out about RcppThread,
and it can be used to detect user interrupt.

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel,RcppThread)]]

#include <cstddef> // size_t
#include <math.h> // sin, cos, sqrt, atan2, pow
#include <vector>

#include <RcppThread.h>
#include <Rcpp.h>
#include <RcppParallel.h>

using namespace std;
using namespace Rcpp;
using namespace RcppParallel;

// single to double indices for lower triangular of matrices without diagonal
void s2d(const size_t id, const size_t nrow, size_t& i, size_t& j) {
  j = nrow - 2 - static_cast<size_t>(sqrt(-8 * id + 4 * nrow * (nrow - 1) - 7) / 2 - 0.5);
  i = id + j + 1 - nrow * (nrow - 1) / 2 + (nrow - j) * ((nrow - j) - 1) / 2;
}

class HaversineCalculator : public Worker
{
public:
  HaversineCalculator(const NumericVector& lon,
                      const NumericVector& lat,
                      double& avg,
                      const int n)
    : lon_(lon)
    , lat_(lat)
    , avg_(avg)
    , n_(n)
    , cos_lat_(lon.length())
  {
    // terms for distance calculation
    for (size_t i = 0; i < cos_lat_.size(); i++) {
      cos_lat_[i] = cos(lat_[i] * 3.1415926535897 / 180);
    }
  }

  void operator()(size_t begin, size_t end) {
    // for Kahan summation
    double sum = 0;
    double c = 0;

    double to_rad = 3.1415926535897 / 180;

    size_t i, j;
    for (size_t ind = begin; ind < end; ind++) {
      if (RcppThread::isInterrupted(ind % static_cast<int>(1e5) == 0)) return;

      s2d(ind, lon_.length(), i, j);

      // haversine distance
      double d_lon = (lon_[j] - lon_[i]) * to_rad;
      double d_lat = (lat_[j] - lat_[i]) * to_rad;
      double d_hav = pow(sin(d_lat / 2), 2) + cos_lat_[i] * cos_lat_[j] * pow(sin(d_lon / 2), 2);
      if (d_hav > 1) d_hav = 1;
      d_hav = 2 * atan2(sqrt(d_hav), sqrt(1 - d_hav)) * 6378137;

      // the average part
      d_hav /= n_;

      // Kahan sum step
      double y = d_hav - c;
      double t = sum + y;
      c = (t - sum) - y;
      sum = t;
    }

    mutex_.lock();
    avg_ += sum;
    mutex_.unlock();
  }

private:
  const RVector<double> lon_;
  const RVector<double> lat_;
  double& avg_;
  const int n_;
  tthread::mutex mutex_;
  vector<double> cos_lat_;
};

// [[Rcpp::export]]
double avg_haversine(const DataFrame& input, const int nthreads) {
  NumericVector lon = input["lon"];
  NumericVector lat = input["lat"];

  double avg = 0;
  int size = lon.length() * (lon.length() - 1) / 2;
  HaversineCalculator hc(lon, lat, avg, size);

  int grain = size / nthreads / 10;
  RcppParallel::parallelFor(0, size, hc, grain);
  RcppThread::checkUserInterrupt();

  return avg;
}

This code won't allocate any intermediate matrix,
it will simply calculate the distance for each pair of what would be the lower triangular and accumulate the values for an average in the end.
See here for the Kahan summation part.

If you save that code in, say, haversine.cpp,
then you can do the following:

library(dplyr)
library(Rcpp)
library(RcppParallel)
library(RcppThread)

sourceCpp("haversine.cpp")

df1 %>%
  group_by(province) %>%
  group_map(~ data.frame(avg=avg_haversine(.x, parallel::detectCores())))
# A tibble: 2 x 2
# Groups:   province [2]
  province     avg
     <int>   <dbl>
1        1  15379.
2        2 793612.

Here's a sanity check too:

pr_DB$set_entry(FUN=geosphere::distHaversine, names="distHaversine", loop=TRUE, distance=TRUE)

df1 %>%
  select(-house) %>%
  group_by(province) %>%
  group_map(~ data.frame(avg=mean(proxy::dist(.x[ , c("lon", "lat")], method="distHaversine"))))

A word of caution though:

df <- data.frame(lon=runif(1e3, -90, 90), lat=runif(1e3, -90, 90))

system.time(proxy::dist(df, method="distHaversine"))
   user  system elapsed 
 34.353   0.005  34.394

system.time(proxy::dist(df, method="haversine"))
   user  system elapsed 
  0.789   0.020   0.809

system.time(avg_haversine(df, 4L))
   user  system elapsed 
  0.054   0.000   0.014

df <- data.frame(lon=runif(1e5, -90, 90), lat=runif(1e5, -90, 90))

system.time(avg_haversine(df, 4L))
   user  system elapsed 
 73.861   0.238  19.670

You'll probably have to wait quite a while if you have millions of rows...

I should also mention that it's not possible to detect user interrupt inside the threads created through RcppParallel,
so if you start the calculation you should either wait until it finishes,
or restart R/RStudio entirely.
See EDIT2 above.

Regarding complexity

Depending on your actual data and how many cores your computer has,
you may very well end up waiting days for the calculation to finish.
This problem has quadratic complexity
(per province, so to speak).
This line:

int size = lon.length() * (lon.length() - 1) / 2;

signifies the amount of (haversine) distance calculations that must be performed.
So if the number of rows increases by a factor of n,
the number of calculations increases by a factor of n^2 / 2, roughly speaking.

There is no way to optimize this;
you can't calculate the average of N numbers without actually computing each number first,
and you'll have a hard time finding something faster than multi-threaded C++ code,
so you'll either have to wait it out,
or throw more cores at the problem,
either with a single machine or with many machines working together.
Otherwise you can't solve this problem.