Calculate Euclidean distance matrix using a big.matrix object
Here is a way using RcppArmadillo. Much of this is very similar to the RcppGallery example. This will return a big.matrix with the associated pairwise (by row) Euclidean distances. I like to wrap my big.matrix functions in a wrapper function to create a cleaner syntax (i.e. avoid the @address and other initializations).
Note: as we are using bigmemory (and are therefore concerned with RAM usage), this example returns an (N-1) x (N-1) matrix holding only the lower-triangular elements. You could modify this, but this is what I threw together.
euc_dist.cpp
// To enable the functionality provided by Armadillo's various macros,
// simply include them before you include the RcppArmadillo headers.
#define ARMA_NO_DEBUG
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo, BH, bigmemory)]]
using namespace Rcpp;
using namespace arma;
// The following header file provides the definitions for the BigMatrix
// object
#include <bigmemory/BigMatrix.h>
// C++11 plugin
// [[Rcpp::plugins(cpp11)]]
template <typename T>
void BigArmaEuclidean(const Mat<T>& inBigMat, Mat<T> outBigMat) {
  int W = inBigMat.n_rows;
  // Entry (j-1, i) of the output holds the distance between input rows i and j (i < j).
  for(int i = 0; i < W - 1; i++){
    for(int j=i+1; j < W; j++){
      outBigMat(j-1,i) = sqrt(sum(pow((inBigMat.row(i) - inBigMat.row(j)),2)));
    }
  }
}
// [[Rcpp::export]]
void BigArmaEuc(SEXP pInBigMat, SEXP pOutBigMat) {
  // First we tell Rcpp that the object we've been given is an external
  // pointer.
  XPtr<BigMatrix> xpMat(pInBigMat);
  XPtr<BigMatrix> xpOutMat(pOutBigMat);
  int type = xpMat->matrix_type();
  switch(type) {
    case 1:
      BigArmaEuclidean(
        arma::Mat<char>((char *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
        arma::Mat<char>((char *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
      );
      return;
    case 2:
      BigArmaEuclidean(
        arma::Mat<short>((short *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
        arma::Mat<short>((short *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
      );
      return;
    case 4:
      BigArmaEuclidean(
        arma::Mat<int>((int *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
        arma::Mat<int>((int *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
      );
      return;
    case 8:
      BigArmaEuclidean(
        arma::Mat<double>((double *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
        arma::Mat<double>((double *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
      );
      return;
    default:
      // We should never get here, but it resolves compiler warnings.
      throw Rcpp::exception("Undefined type for provided big.matrix");
  }
}
My little wrapper
bigMatrixEuc <- function(bigMat){
  zeros <- big.matrix(nrow = nrow(bigMat)-1,
                      ncol = nrow(bigMat)-1,
                      init = 0,
                      type = typeof(bigMat))
  BigArmaEuc(bigMat@address, zeros@address)
  return(zeros)
}
The test
library(Rcpp)
sourceCpp("euc_dist.cpp")
library(bigmemory)
set.seed(123)
mat <- matrix(rnorm(16), 4)
bm <- as.big.matrix(mat)
# Call new euclidean function
bm_out <- bigMatrixEuc(bm)[]
# pull out the matrix elements for our purposes
distMat <- as.matrix(dist(mat))
distMat[upper.tri(distMat, diag=TRUE)] <- 0
distMat <- distMat[2:4, 1:3]
# check if identical
all.equal(bm_out, distMat, check.attributes = FALSE)
[1] TRUE
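To make the output layout of the answer above easier to picture, here is a minimal Python sketch of the same indexing, using only the standard library. The function name `lower_triangle_distances` and the sample points are hypothetical and are not part of the original answer; the only thing taken from it is the rule that entry (j-1, i) stores the distance between rows i and j for i < j.

```python
import math

def lower_triangle_distances(rows):
    """Return an (n-1) x (n-1) nested list in which entry [j-1][i] is the
    Euclidean distance between rows i and j (i < j); other entries stay 0."""
    n = len(rows)
    out = [[0.0] * (n - 1) for _ in range(n - 1)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            out[j - 1][i] = math.dist(rows[i], rows[j])
    return out

# Hypothetical sample data: three points on a line through the origin.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
tri = lower_triangle_distances(points)
```

With three input rows the result is 2 x 2: the first column holds the distances from row 0 to rows 1 and 2, and the upper-right entry is left at zero, which is why the R test above zeroes the upper triangle of `dist()` and then drops the first row and last column before comparing.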
Fast way to compute distance matrix in R for large matrix
Perhaps try the distances package: https://cran.r-project.org/web/packages/distances/distances.pdf
install.packages("distances")
library("distances")
set.seed(123)
M <- matrix(rnorm(39900*1990),nrow = 39900,ncol = 1990)
d <- distances(M)
Memory-efficient method to create dist object from distance matrix
My current solution is to calculate the dist object directly from lat and lon vectors, without generating the intermediate distance matrix at all. On large matrices, this is several hundred times faster than the "conventional" pipeline of geosphere::distm() followed by stats::as.dist(), and it uses only as much memory as is required to store the final dist object.
The following Rcpp source is based on using the functions from here to calculate the haversine distance in C++, together with an adaptation of @Alexis's method to iterate through the lower-triangle elements in C++.
#include <Rcpp.h>
using namespace Rcpp;
double distanceHaversine(double latf, double lonf, double latt, double lont, double tolerance){
  double d;
  double dlat = latt - latf;
  double dlon = lont - lonf;
  d = (sin(dlat * 0.5) * sin(dlat * 0.5)) + (cos(latf) * cos(latt)) * (sin(dlon * 0.5) * sin(dlon * 0.5));
  if(d > 1 && d <= tolerance){
    d = 1;
  }
  return 2 * atan2(sqrt(d), sqrt(1 - d)) * 6378137.0;
}
double toRadians(double deg){
  return deg * 0.01745329251; // PI / 180;
}
//-----------------------------------------------------------
// [[Rcpp::export]]
NumericVector calc_dist(Rcpp::NumericVector lat,
                        Rcpp::NumericVector lon,
                        double tolerance = 10000000000.0) {
  std::size_t nlat = lat.size();
  std::size_t nlon = lon.size();
  if (nlat != nlon) throw std::range_error("lat and lon different lengths");
  if (nlat < 2) throw std::range_error("Need at least 2 points");
  std::size_t size = nlat * (nlat - 1) / 2;
  NumericVector ans(size);
  std::size_t k = 0;
  double latf;
  double latt;
  double lonf;
  double lont;
  for (std::size_t j = 0; j < (nlat-1); j++) {
    for (std::size_t i = j + 1; i < nlat; i++) {
      latf = toRadians(lat[i]);
      lonf = toRadians(lon[i]);
      latt = toRadians(lat[j]);
      lont = toRadians(lon[j]);
      ans[k++] = distanceHaversine(latf, lonf, latt, lont, tolerance);
    }
  }
  return ans;
}
/*** R
as_dist = function(lat, lon, tolerance = 10000000000.0) {
  dd = calc_dist(lat, lon, tolerance)
  attr(dd, "class") = "dist"
  attr(dd, "Size") = length(lat)
  attr(dd, "call") = match.call()
  attr(dd, "Diag") = FALSE
  attr(dd, "Upper") = FALSE
  return(dd)
}
*/
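The order in which `calc_dist` fills its result vector is what makes the result usable as a dist object: the outer loop walks columns j, the inner loop walks rows i > j, so the pairs come out as (1,0), (2,0), ..., (n-1,0), (2,1), and so on. The Python sketch below mirrors that loop with the standard library only. The names `haversine_m`, `condensed_distances` and `condensed_index` are hypothetical, and unlike the Rcpp code it clamps the haversine term to 1 unconditionally instead of taking a `tolerance` argument.

```python
import math

EARTH_RADIUS_M = 6378137.0  # same radius constant as the Rcpp source above

def haversine_m(lat1, lon1, lat2, lon2):
    """Haversine distance in metres between two points given in degrees."""
    phi1, lam1, phi2, lam2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    a = min(1.0, a)  # assumption: always clamp rounding overshoot above 1
    return 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a)) * EARTH_RADIUS_M

def condensed_distances(lat, lon):
    """Pairwise distances in column-major lower-triangle order."""
    n = len(lat)
    if n != len(lon):
        raise ValueError("lat and lon different lengths")
    if n < 2:
        raise ValueError("Need at least 2 points")
    return [haversine_m(lat[i], lon[i], lat[j], lon[j])
            for j in range(n - 1) for i in range(j + 1, n)]

def condensed_index(i, j, n):
    """0-based position of the pair (i, j), i > j, in the condensed vector."""
    return n * j - j * (j + 1) // 2 + (i - j - 1)

# Hypothetical sample: three points on the equator, 90 degrees apart.
d = condensed_distances([0.0, 0.0, 0.0], [0.0, 90.0, 180.0])
```

`condensed_index` is included only to show how a single pair can be looked up without ever expanding the vector into a full matrix.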
How to create a Large Distance Matrix?
At this point R cannot allocate the number of megabytes of RAM reported in the error message. Your computer is using its memory elsewhere, and that many megabytes simply are not available for your process to continue. You have several options: get a machine with more RAM, close other programs, or do your distance calculations in smaller batches. Try a smaller n, and once it works, repeat the process until you have your whole matrix of distances.
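The batching idea can be sketched as follows, in standard-library Python. This is only an illustration under assumptions: the generator name `distance_blocks`, the block size and the sample points are all hypothetical, and the "consumer" here simply keeps each row's nearest-neighbour distance, whereas in practice you might write each block to disk instead.

```python
import math

def distance_blocks(rows, block_size):
    """Yield (start, block) pairs, where block[r][c] is the Euclidean distance
    between rows[start + r] and rows[c]. Only one block exists at a time."""
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        yield start, [[math.dist(p, q) for q in rows] for p in chunk]

# Hypothetical sample data.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0), (1.0, 0.0)]

# Example consumer: keep a per-row summary rather than the full matrix.
nearest = []
n_blocks = 0
for start, block in distance_blocks(points, block_size=2):
    n_blocks += 1
    for r, row in enumerate(block):
        # Skip the zero distance from each point to itself.
        nearest.append(min(d for c, d in enumerate(row) if c != start + r))
```

Peak memory is then governed by `block_size` times the number of rows rather than by the square of the number of rows.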
(Speed Challenge) Any faster method to calculate distance matrix between rows of two matrices, in terms of Euclidean distance?
method_XXX <- function() {
  sqrt(outer(rowSums(x^2), rowSums(y^2), '+') - tcrossprod(x, 2 * y))
}
Unit: relative
                      expr       min        lq     mean    median        uq      max
method_ThomasIsCoding_v1() 12.151624 10.486417 9.213107 10.162740 10.235274 5.278517
method_ThomasIsCoding_v2()  6.923647  6.055417 5.549395  6.161603  6.140484 3.438976
method_ThomasIsCoding_v3()  7.133525  6.218283 5.709549  6.438797  6.382204 3.383227
     method_AllanCameron()  7.093680  6.071482 5.776172  6.447973  6.497385 3.608604
              method_XXX()  1.000000  1.000000 1.000000  1.000000  1.000000 1.000000
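`method_XXX` relies on the expansion |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, which turns the whole problem into two row-sum vectors and one matrix product. A minimal standard-library Python sketch of the same identity is shown below; the function name `cross_distances` and the sample data are hypothetical. One deliberate difference from the R one-liner: the squared distance is clamped at zero before the square root, because rounding can make it slightly negative for identical or nearly identical rows.

```python
import math

def cross_distances(x, y):
    """Distances between every row of x and every row of y, computed via
    |a - b|^2 = |a|^2 + |b|^2 - 2 a.b instead of explicit differences."""
    x_sq = [sum(v * v for v in row) for row in x]  # analogue of rowSums(x^2)
    y_sq = [sum(v * v for v in row) for row in y]  # analogue of rowSums(y^2)
    out = []
    for a, a_sq in zip(x, x_sq):
        out_row = []
        for b, b_sq in zip(y, y_sq):
            dot = sum(p * q for p, q in zip(a, b))  # one entry of tcrossprod(x, y)
            # Assumption: clamp at 0 so rounding never yields sqrt of a negative.
            out_row.append(math.sqrt(max(0.0, a_sq + b_sq - 2.0 * dot)))
        out.append(out_row)
    return out

# Hypothetical sample data: 2 rows in x, 3 rows in y.
x = [[0.0, 0.0], [1.0, 2.0]]
y = [[3.0, 4.0], [1.0, 2.0], [-1.0, 0.5]]
d = cross_distances(x, y)
```

In R the speed comes from `tcrossprod` dispatching to optimized BLAS routines; the pure-Python loops here show the arithmetic only and are not meant to be fast.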
how to calculate Euclidean distance between two matrices in R
You can use the package pdist:
library(pdist)
dists <- pdist(t(mat1), t(mat2))
as.matrix(dists)
         [,1]      [,2]      [,3]
[1,]  9220.40  9260.735  8866.033
[2,] 12806.35 12820.086 12121.927
[3,] 11630.86 11665.869 11155.823
This will give you all Euclidean distances of the pairs: (mat1$x,mat2$x), (mat1$x,mat2$y),..., (mat1$z,mat2$z)
Efficiently Calculating a Euclidean Distance Matrix Using Numpy
You can take advantage of the complex type:
# build a complex array of your cells
z = np.array([complex(c.m_x, c.m_y) for c in cells])
First solution
# mesh this array so that you will have all combinations
m, n = np.meshgrid(z, z)
# get the distance via the norm
out = abs(m-n)
Second solution
Meshing is the main idea. But numpy is clever, so you don't have to generate m & n. Just compute the difference using a transposed version of z. The mesh is done automatically:
out = abs(z[..., np.newaxis] - z)
Third solution
And if z is directly set as a 2-dimensional array, you can use z.T instead of the weird z[..., np.newaxis]. So finally, your code will look like this:
z = np.array([[complex(c.m_x, c.m_y) for c in cells]]) # notice the [[ ... ]]
out = abs(z.T-z)
Example
>>> z = np.array([[0.+0.j, 2.+1.j, -1.+4.j]])
>>> abs(z.T-z)
array([[ 0. , 2.23606798, 4.12310563],
[ 2.23606798, 0. , 4.24264069],
[ 4.12310563, 4.24264069, 0. ]])
As a complement, you may want to remove duplicates afterwards, taking the upper triangle:
>>> np.triu(out)
array([[ 0. , 2.23606798, 4.12310563],
[ 0. , 0. , 4.24264069],
[ 0. , 0. , 0. ]])
Some benchmarks
>>> timeit.timeit('abs(z.T-z)', setup='import numpy as np;z = np.array([[0.+0.j, 2.+1.j, -1.+4.j]])')
4.645645342274779
>>> timeit.timeit('abs(z[..., np.newaxis] - z)', setup='import numpy as np;z = np.array([0.+0.j, 2.+1.j, -1.+4.j])')
5.049334864854522
>>> timeit.timeit('m, n = np.meshgrid(z, z); abs(m-n)', setup='import numpy as np;z = np.array([0.+0.j, 2.+1.j, -1.+4.j])')
22.489568296184686