Efficient (Memory-Wise) Function for Repeated Distance Matrix Calculations and Chunking of Extra Large Distance Matrices

I've come up with a chunking solution for those extra large matrices that dist() can't handle, which I'm posting here in case anyone else finds it helpful (or finds fault with it, please!). It is significantly slower than dist(), but that is kind of irrelevant, since it should only ever be used when dist() throws an error - usually one of the following:

"Error in double(N * (N - 1)/2) : vector size specified is too large" 
"Error: cannot allocate vector of size 6.0 Gb"
"Error: negative length vectors are not allowed"

The function calculates the mean distance for the matrix, but you can change that to anything else; if you actually want to save the matrix, I believe some sort of file-backed bigmemory matrix is in order (a sketch follows the example below). Kudos to the linked post for the idea and to Ari for his help!

FunDistanceMatrixChunking <- function(df, blockSize = 100) {
  n <- nrow(df)
  blocks <- n %/% blockSize
  if ((n %% blockSize) > 0) blocks <- blocks + 1
  chunk.means <- matrix(NA, nrow = blocks * (blocks + 1) / 2, ncol = 2)
  dex <- 1:blockSize
  chunk <- 0
  for (i in 1:blocks) {
    p <- dex + (i - 1) * blockSize
    lex <- (blockSize + 1):(2 * blockSize)
    lex <- lex[p <= n]
    p <- p[p <= n]
    for (j in 1:blocks) {
      q <- dex + (j - 1) * blockSize
      q <- q[q <= n]
      if (i == j) {
        # diagonal block: distances within one chunk
        chunk <- chunk + 1
        x <- dist(df[p, ])
        chunk.means[chunk, ] <- c(length(x), mean(x))
      }
      if (i > j) {
        # off-diagonal block: distances between two different chunks
        chunk <- chunk + 1
        x <- as.matrix(dist(df[c(q, p), ]))[lex, dex]
        chunk.means[chunk, ] <- c(length(x), mean(x))
      }
    }
  }
  # combine the per-chunk means, weighted by the number of distances in each chunk
  mean <- weighted.mean(chunk.means[, 2], chunk.means[, 1])
  return(mean)
}
df <- cbind(var1=rnorm(1000), var2=rnorm(1000))
mean(dist(df))
FunDistanceMatrixChunking(df, blockSize=100)

Not sure whether I should have posted this as an edit instead of an answer. It does solve my problem, although I didn't really specify it this way.
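Since saving the matrix itself came up above, here is a rough, untested sketch of how the same chunking loop could write its blocks into a file-backed bigmemory matrix instead of averaging them. The function name, file names and arguments are made up for illustration; it assumes the bigmemory package is installed.

library(bigmemory)

FunDistanceMatrixToBigMemory <- function(df, blockSize = 100,
                                         backingfile = "dist.bin",
                                         descriptorfile = "dist.desc") {
  n <- nrow(df)
  # file-backed n x n matrix, so the full distance matrix never has to live in RAM
  D <- filebacked.big.matrix(n, n, init = 0, type = "double",
                             backingfile = backingfile,
                             descriptorfile = descriptorfile)
  blocks <- ceiling(n / blockSize)
  for (i in 1:blocks) {
    p <- ((i - 1) * blockSize + 1):min(i * blockSize, n)
    for (j in 1:i) {
      q <- ((j - 1) * blockSize + 1):min(j * blockSize, n)
      if (i == j) {
        D[p, q] <- as.matrix(dist(df[p, ]))
      } else {
        # distances between rows p and rows q; fill both triangles
        x <- as.matrix(dist(df[c(q, p), ]))[length(q) + seq_along(p), seq_along(q)]
        D[p, q] <- x
        D[q, p] <- t(x)
      }
    }
  }
  D
}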

R: Distm for big data? Calculating minimum distances between two matrices

You can use this R(cpp) function:

#include <Rcpp.h>
using namespace Rcpp;

double compute_a(double lat1, double long1, double lat2, double long2) {

double sin_dLat = ::sin((lat2 - lat1) / 2);
double sin_dLon = ::sin((long2 - long1) / 2);

return sin_dLat * sin_dLat + ::cos(lat1) * ::cos(lat2) * sin_dLon * sin_dLon;
}

int find_min(double lat1, double long1,
             const NumericVector& lat2,
             const NumericVector& long2,
             int current0) {

  int m = lat2.size();
  double lat_k, lat_min, lat_max, a, a0;
  int k, current = current0;

  a0 = compute_a(lat1, long1, lat2[current], long2[current]);
  // Search before current0
  lat_min = lat1 - 2 * ::asin(::sqrt(a0));
  for (k = current0 - 1; k >= 0; k--) {
    lat_k = lat2[k];
    if (lat_k > lat_min) {
      a = compute_a(lat1, long1, lat_k, long2[k]);
      if (a < a0) {
        a0 = a;
        current = k;
        lat_min = lat1 - 2 * ::asin(::sqrt(a0));
      }
    } else {
      // No need to search further
      break;
    }
  }
  // Search after current0
  lat_max = lat1 + 2 * ::asin(::sqrt(a0));
  for (k = current0 + 1; k < m; k++) {
    lat_k = lat2[k];
    if (lat_k < lat_max) {
      a = compute_a(lat1, long1, lat_k, long2[k]);
      if (a < a0) {
        a0 = a;
        current = k;
        lat_max = lat1 + 2 * ::asin(::sqrt(a0));
      }
    } else {
      // No need to search further
      break;
    }
  }

  return current;
}

// [[Rcpp::export]]
IntegerVector find_closest_point(const NumericVector& lat1,
                                 const NumericVector& long1,
                                 const NumericVector& lat2,
                                 const NumericVector& long2) {

  int n = lat1.size();
  IntegerVector res(n);

  int current = 0;
  for (int i = 0; i < n; i++) {
    res[i] = current = find_min(lat1[i], long1[i], lat2, long2, current);
  }

  return res; // need +1
}

/*** R
N <- 2000 # 2e6
M <- 500 # 2e4

pixels.latlon=cbind(runif(N,min=-180, max=-120), runif(N, min=50, max=85))
grwl.latlon=cbind(runif(M,min=-180, max=-120), runif(M, min=50, max=85))
# grwl.latlon <- grwl.latlon[order(grwl.latlon[, 2]), ]

library(geosphere)
system.time({
  #calculate the distance matrix
  dist.matrix = distm(pixels.latlon, grwl.latlon, fun=distHaversine)
  #Pick out the indices of the minimum distance
  rnum = apply(dist.matrix, 1, which.min)
})

find_closest <- function(lat1, long1, lat2, long2) {

  toRad <- pi / 180
  lat1 <- lat1 * toRad
  long1 <- long1 * toRad
  lat2 <- lat2 * toRad
  long2 <- long2 * toRad

  ord1 <- order(lat1)
  rank1 <- match(seq_along(lat1), ord1)
  ord2 <- order(lat2)

  ind <- find_closest_point(lat1[ord1], long1[ord1], lat2[ord2], long2[ord2])

  ord2[ind + 1][rank1]
}

system.time(
  test <- find_closest(pixels.latlon[, 2], pixels.latlon[, 1],
                       grwl.latlon[, 2], grwl.latlon[, 1])
)
all.equal(test, rnum)

N <- 2e4
M <- 2e4
pixels.latlon=cbind(runif(N,min=-180, max=-120), runif(N, min=50, max=85))
grwl.latlon=cbind(long = runif(M,min=-180, max=-120), lat = runif(M, min=50, max=85))
system.time(
  test <- find_closest(pixels.latlon[, 2], pixels.latlon[, 1],
                       grwl.latlon[, 2], grwl.latlon[, 1])
)
*/

It takes 0.5 sec for N = 2e4 and 4.2 sec for N = 2e5.
I can't make your code work to compare.

dist function with large number of points

There are some things you could try, depending on what exactly you need:

  • Calculate the distances in a loop, and only keep those that match the criterion (see the sketch after this list). Especially when the number of matches is much smaller than the total size of the distance matrix, this saves a lot of RAM. Such a loop is probably very slow if implemented in pure R, which is also why dist does not use R but, I believe, C to perform the calculations. This could mean that you get your results, but have to wait a while. Alternatively, the excellent Rcpp package would allow you to write it in C/C++, probably making it much, much faster.
  • Start using packages like bigmemory for storing the distance matrix. You then build it in a loop and store it iteratively in the bigmemory object (I have not worked with bigmemory before, so I don't know the exact details). After building the matrix, you can access it to extract your desired results. Effectively, all tricks for handling large data in R apply to this bullet. See e.g. R SO posts on big data.
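A minimal sketch of the first bullet (pure R, so slow but memory-friendly), assuming a numeric data matrix m, Euclidean distance, and a cutoff below which pairs are kept; the names m, cutoff and keep_close_pairs are illustrative, not from the question:

keep_close_pairs <- function(m, cutoff) {
  n <- nrow(m)
  out <- vector("list", n - 1)
  for (i in 1:(n - 1)) {
    # distances from row i to all later rows, one row of the distance matrix at a time
    d <- sqrt(colSums((t(m[(i + 1):n, , drop = FALSE]) - m[i, ])^2))
    hits <- which(d <= cutoff)
    if (length(hits)) out[[i]] <- cbind(i = i, j = i + hits, dist = d[hits])
  }
  do.call(rbind, out)
}

Only the matching pairs are ever stored, so peak memory use stays around one row of the distance matrix at a time.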

Some interesting links (found googling for r distance matrix for large vector):

  • Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices
  • (lucky you!) http://stevemosher.wordpress.com/2012/04/08/using-bigmemory-for-a-distance-matrix/

Big matrix and memory problems

On converting a "dist" object to a "(big.)matrix":
stats:::as.matrix.dist contains calls to row, col, t and operators that create large intermediate objects. Avoiding these, you could, among other alternatives, use something like:

With data:

nr = 1e4
m = matrix(runif(nr), nr, 10)
d = dist(m)

Then, slowly, allocate and fill a "matrix":

#as.matrix(d) #this gives error on my machine
n = attr(d, "Size")
md = matrix(0, n, n)
id = cumsum(c(1L, (n - 1L) - 0:(n - 2L))) #to split "d"
for (j in 1:(n - 1L)) {
  i = (j + 1L):n
  md[i, j] = md[j, i] = d[id[j]:(id[j] + (n - (j + 1L)))]
}

(It seems that allocating "md" as big.matrix(n, n, init = 0) works equally well.)

md[2:5, 1]
#[1] 2.64625973 2.01071637 0.09207748 0.09346157
d[1:4]
#[1] 2.64625973 2.01071637 0.09207748 0.09346157

Using smaller "nr" we could test:

all.equal(as.matrix(md), as.matrix(d), check.attributes = FALSE)
#[1] TRUE

variable specific distance matrices compatible with strings

  1. For x and y, think "row x" compared with "row y". It may be informative to change your function to be

    dist_func <- function(x, y) {
      browser()
      length(intersect(x, y))/3
    }

    and then run it, looking at the actual values of x and y on the first call to your function. (You probably won't need to go beyond the first and second instantiation of your function.)

  2. intersect does not know anything about position within the vector; it is solely set-based, meaning "presence of". The help page is even titled "Sets" and starts with

    Performs *set* union, intersection, ...

    To get what you want, aren't you just looking for plain equality?

    dist_func <- function(x, y) sum(x == y)/3

    NB: true equality can be problematic when looking at numeric (non-integer) numbers, per R FAQ 7.31; a tolerance-based variant is sketched after this list.

  3. Your data is plagued with factors, not characters. You might notice

    str(df2)
    # 'data.frame': 10 obs. of 4 variables:
    # $ npi : int 51 52 53 54 55 56 57 58 59 60
    # $ dier : Factor w/ 9 levels "aap","beer","kip",..: 1 8 6 3 2 1 7 9 5 4
    # $ getal : Factor w/ 8 levels "acht","drie",..: 3 4 4 2 5 6 7 8 1 1
    # $ mubilair: Factor w/ 9 levels "bank","bureau",..: 7 6 9 3 3 2 1 4 8 5

    Notice, for example, that the first value in $getal is "acht", which is integer 3 internally within the factors. You'll also notice that the fourth integer values for the three columns are 3, 2, 3 (respectively), which matches the distance metric of 0.667 in column "51" and row "54".

    Either use read.table(..., stringsAsFactors = FALSE) or change your distance function to be something like:

    dist_func2 <- function(x, y) {
      if (is.factor(x)) x <- as.character(x)
      if (is.factor(y)) y <- as.character(y)
      sum(x == y)/3
    }

    (I suggest stringsAsFactors personally, but YMMV.)
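Regarding the numeric-tolerance caveat in point 2, here is a minimal sketch of what a tolerance-based comparison could look like if the columns were numeric rather than character; the function name and tolerance value are illustrative only:

dist_func_num <- function(x, y, tol = 1e-8) {
  # counts coordinates as equal when they differ by less than tol
  sum(abs(as.numeric(x) - as.numeric(y)) < tol) / 3
}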

Pairwise Distance with Large NumPy Arrays (Chunking?)

You can split your array into smaller ones and calculate the distances for each pair of chunks separately.

splits = np.array_split(data, 10)
for i in range(len(splits)):
    for j in range(i, len(splits)):
        m = scipy.spatial.distance.cdist(splits[i], splits[j])
        # do something with m
As most of the calculations occur inside scipy, the overhead of the Python loops will be minimal.

If your boolean array fits into memory and you are trying to find values which are in a certain range, you can do:

import numpy as np
import scipy.spatial.distance

boolean = np.zeros((350, 350), dtype=np.bool_)
a = np.random.randn(350, 2)
splits = np.array_split(a, 10)
shift = splits[0].shape[0]
minDist = -0.5
maxDist = +0.5
for i in range(len(splits)):
    for j in range(i, len(splits)):
        m = scipy.spatial.distance.cdist(splits[i], splits[j])
        masked = (minDist <= m) & (m <= maxDist)
        boolean[i * shift: (i + 1) * shift, j * shift: (j + 1) * shift] = masked
        boolean[j * shift: (j + 1) * shift, i * shift: (i + 1) * shift] = masked.T

Parallelizing the transformation of a very large matrix from n^2 x 3 to n x n?

Since the input file is too large for memory, the transformed output will also be too large. So I'm assuming the goal is to produce a new output file, not to figure out a way to hold all of the information in memory at one time (the latter question might involve sparse matrices or some other technique).

For example, suppose we start with this data.

1   2   0.5
3   4   0.8
5   6   2.7
2   3   0.7
1   3   1.1
3   6   3.1
4   5   0.5
1   6   4.6

First split the input file apart into a bunch of intermediate input files, one per ORIGIN. In our example, we end up with 5 files.

1   2   0.5
1   3   1.1
1   6   4.6

2   3   0.7

3   4   0.8
3   6   3.1

4   5   0.5

5   6   2.7

Then use multiple processes to transform the intermediate input files into intermediate output files, each having the new matrix structure. Here are the resulting files based on the example.

1   .     0.5   1.1   .     .     4.6

2   .     .     0.7   .     .     .

3   .     .     .     0.8   .     3.1

4   .     .     .     .     0.5   .

5   .     .     .     .     .     2.7

Then concatenate the intermediate output files to produce the final output.
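A rough sketch of the transform step, in R, with illustrative names (transform_origin_file, n_dest, origin_1.txt) that are not from the question; the "." placeholder and column count mirror the example above:

transform_origin_file <- function(infile, n_dest) {
  # one intermediate input file holds all rows for a single ORIGIN
  e <- read.table(infile, col.names = c("origin", "dest", "value"))
  row <- rep(".", n_dest)
  row[e$dest] <- e$value
  paste(c(e$origin[1], row), collapse = "   ")
}

# e.g. run one worker per intermediate file (parallel::mclapply or similar),
# write each returned line to an intermediate output file, then concatenate:
# transform_origin_file("origin_1.txt", n_dest = 6)

Each worker only ever holds one origin's rows in memory, which keeps the per-process footprint small.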

The general strategy described above can probably be optimized for speed in various ways by skipping some of the intermediate files. For example, you could probably avoid having a bunch of intermediate files by doing the following: (A) create a single intermediate input file, merge-sorted by ORIGIN; (B) while doing that also keep track of the file-seek (START, END) locations for each ORIGIN; then (C) use multiple processes to produce the final output, based on the merge-sorted file and the seek metadata. That approach might be speedier (it also might not), but it requires some more bookkeeping. My first instinct would be to start simple and evolve from there.


