Dist Function with Large Number of Points

dist function with large number of points

There are some things you could try, depending on what you need exactly:

  • Calculate the distances in a loop, and only keep those that match the criterion. Especially when the number of matches is much smaller than the total size of the distance matrix, this saves a lot of RAM. Such a loop is probably very slow if implemented in pure R; that is also why dist does not use R but (I believe) C to perform the calculations. This could mean that you get your results, but have to wait a while. Alternatively, the excellent Rcpp package would allow you to write it in C/C++, which would probably be much faster (a small sketch of the pure-R loop follows this list).
  • Use a package like bigmemory to store the distance matrix. You then build it in a loop and store it iteratively in the bigmemory object (I have not worked with bigmemory before, so I don't know the exact details). After building the matrix, you can access it to extract your desired results. Effectively, all tricks for handling large data in R apply to this bullet. See e.g. R SO posts on big data.
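
A minimal sketch of the first bullet, in pure R. It assumes a numeric matrix pts with one point per row and a cutoff maxdist; both names are hypothetical:

close_pairs <- function(pts, maxdist) {
  out <- list()
  for (i in seq_len(nrow(pts) - 1)) {
    # Euclidean distances from point i to every later point, one row at a time
    d <- sqrt(colSums((t(pts[-(1:i), , drop = FALSE]) - pts[i, ])^2))
    keep <- which(d < maxdist)
    if (length(keep) > 0) {
      # keep only the pairs that match the criterion
      out[[length(out) + 1]] <- data.frame(i = i, j = i + keep, dist = d[keep])
    }
  }
  do.call(rbind, out)
}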

Some interesting links (found by googling for "r distance matrix for large vector"):

  • Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices
  • (lucky you!) http://stevemosher.wordpress.com/2012/04/08/using-bigmemory-for-a-distance-matrix/

How to efficiently calculate the distance between a large set of points and an arbitrary function?

I guess that you are more concerned about speed than accuracy. In that case, reducing the problem to something that can be solved by NumPy's np.roots function should be fine. You didn't say that you are only interested in polynomials, but your example is a polynomial, so I'm presuming that. If you want to handle something more complicated, you will need to define the scope carefully, because this problem is impossible in general if curve_func can be an arbitrary function.

We use SymPy to find the general expression for the squared distance and then differentiate it to get the equation for the stationary points (minima and maxima). Then we use lambdify to efficiently generate the polynomial coefficients to be passed to np.roots. A post-processing step is needed to figure out which root returned by np.roots is the real root that you want. You can then substitute that back to find the distance:

In [55]: import numpy as np

In [56]: from sympy import *

In [57]: x, y = symbols('x, y')

In [58]: x1, y1 = symbols('x1, y1')

In [59]: curve_func = x**2+5*x+10

In [60]: dist2 = (x1 - x)**2 + (y1 - curve_func)**2

In [61]: dist2
Out[61]: (-x + x₁)² + (-x² - 5⋅x + y₁ - 10)²

In [62]: coeffs = Tuple(*dist2.diff(x).as_poly(x).all_coeffs())

In [63]: coeffs
Out[63]: (4, 30, 92 - 4⋅y₁, -2⋅x₁ - 10⋅y₁ + 100)

In [64]: poly_func = lambdify((x1, y1), coeffs)

In [65]: poly_func(2, 3)
Out[65]: (4, 30, 80, 66)

In [66]: np.roots(_)
Out[66]: array([-3. +1.41421356j, -3. -1.41421356j, -1.5+0.j ])

In [67]: dist2.subs(x, _[-1]).subs({x1:2, y1:3})
Out[67]: 15.3125000000000

We could also collect the outputs of np.roots into an array and use lambdify to evaluate that.

If you were more concerned about accuracy than speed then it might make sense to use SymPy's real_roots function rather than np.roots:

In [71]: real_roots(Poly(coeffs.subs({x1: 2, y1:3}), x))
Out[71]: [-3/2]

The timings here, on a not especially powerful machine, are about 1 ms to find one distance (it would probably be faster inside an intensive loop):

In [72]: %time poly_func(4, 5)
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 19.3 µs
Out[72]: (4, 30, 72, 42)

In [73]: %time np.roots(_)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 998 µs
Out[73]:
array([-3.32468811+1.13592756j, -3.32468811-1.13592756j,
-0.85062379+0.j ])

Number of points within a radius with large datasets - R

You could have done the counts in an apply:

d <- d < 1609.34                                            # logical: is the pair within one mile?
nt <- apply(d, 1, sum)                                      # parcels within the radius
nd <- apply(d, 1, function(i) length(unique(owner[i]))) - 1 # distinct other owners

I think that your computation of numdif is not correct: it counts another owner multiple times if they have multiple parcels within the radius (if owner B has two nearby parcels, summing the matches counts B twice, whereas unique(owner[i]) counts B once).

Given the large number of observations, I would consider this route:

# for each point, the indices of all points within a mile (no huge matrix needed)
d <- lapply(1:nrow(coords), function(i) which(distGeo(coords[i, , drop=FALSE], coords) < 1609.34))
ntot <- sapply(d, length)                                    # parcels within the radius
ndif <- sapply(d, function(i) length(unique(owner[i]))) - 1  # distinct other owners

That is slower, but it won't create a crazy large matrix.

I should also add that your approach assumes that the parcels are small relative to the distance considered, so that using the centroid is OK.
If this is not the case, it is possible to do the computations on the polygons with rgeos::gWithinDistance, at increased computational cost; a sketch follows.
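
A minimal sketch of the polygon variant, assuming parcels is a SpatialPolygons object in a projected CRS (gWithinDistance works in coordinate units, so these should be metres) and owner is ordered like parcels; both names are hypothetical:

w <- rgeos::gWithinDistance(parcels, dist = 1609.34, byid = TRUE)  # logical n x n matrix
ntot <- rowSums(w) - 1                                             # exclude the parcel itself
ndif <- apply(w, 1, function(i) length(unique(owner[i]))) - 1      # distinct other owners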

Closest pair for any of a huge number of points

The traditional approach is to preprocess the data and put it in a data structure, often a K-d tree, for which the "nearest point" query is very fast.

There is an implementation in the nnclust package.

library(nnclust)
# five 2-D points; the third and fourth coincide
foo <- cbind(x = c(1, 2, 4, 4, 10), y = c(1, 2, 4, 4, 10))
i <- nnfind(foo)$neighbour   # index of each point's nearest neighbour
plot(foo)
arrows(foo[, 1], foo[, 2], foo[i, 1], foo[i, 2])   # link each point to its neighbour
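
If nnclust is hard to come by, here is a sketch of the same query using the k-d trees in the RANN package (my substitution, not part of the original answer); k = 2 because each point's nearest hit is itself:

library(RANN)
foo <- cbind(x = c(1, 2, 4, 4, 10), y = c(1, 2, 4, 4, 10))
nn <- nn2(foo, k = 2)   # k-d tree lookup of the two nearest points
i <- nn$nn.idx[, 2]     # the second hit is the nearest other point
d <- nn$nn.dists[, 2]   # distance to that neighbour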

Calculating all distances between one point and a group of points efficiently in R

Rather than iterating across data points, you can condense that into a matrix operation, meaning you only have to iterate across K.

# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)        # one column per data point
centers <- matrix(rnorm(K * d), ncol = K)  # one column per center

system.time(
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)  # squared distances from every point to this center
  })
)

Runs in:

   user  system elapsed
  0.100   0.008   0.108

on my laptop.
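
dists is then an n x K matrix of squared distances. If this is an assignment step in the style of K-means (an assumption about the use case), the nearest center per point drops out directly:

nearest <- max.col(-dists)  # row-wise index of the smallest squared distance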

Dist function between matrix objects in R

Update 2: If we assume we have complete data and this accuracy is acceptable, we can implement an even faster version of the distances using Rcpp. I have added this below; the fastest version uses the bytecode compiler. For those with experience with RcppParallel, this can likely be improved further.

Update: The function rdist from the fields package is by far the fastest method found so far (see Calculating all distances between one point and a group of points efficiently in R). It appears to be the fastest even when not using the bytecode compiler.

Briefly testing the previous results, I found that vapply is faster than all other methods when using the bytecode compiler (the first run compiles the functions, which is why the max time is greater during the bytecode runs).

I have tried the methods from @akrun and @ThomasIsCoding here also.

library(microbenchmark)
library(compiler)
library(collapse)
library(fields)
library(Rcpp)

set.seed(999)
data <- matrix(runif(1100), nrow = 11, ncol = 10)

x <- data[1, ]             # the reference point
y <- data[2:nrow(data), ]  # the other ten points, one per row

distances <- dist(rbind(x, y))  # reference result from dist()

func1 <- function(x, y) {
  apply(y, MARGIN = 1, function(a, b = x) {
    sqrt(sum((a - b)^2))
  })
}

func2 <- function(x, y) {
  dist(rbind(x, y))
}

func3 <- function(x, y) {
  dapply(y, function(a, b = x) {
    sqrt(sum((a - b)^2))
  }, MARGIN = 1)
}

func4 <- function(x, y) {
  vapply(seq_len(nrow(y)), function(i, b = x) sqrt(sum((y[i, ] - b)^2)), numeric(1))
}

func5 <- function(x, y) {
  rdist(rbind(x, y))
}

cppFunction('NumericVector func6(NumericVector x, NumericVector y) {
  int n = x.size();       // coordinates per point
  int n2 = y.size();      // y arrives as the matrix flattened in column-major order
  int maxiters = n2 / n;  // number of points in y

  NumericVector results(maxiters);

  for (int i = 0; i < maxiters; i++) {
    results[i] = 0;
    for (int j = 0; j < n; j++) {
      double val = x[j] - y[j * maxiters + i];  // element (i, j) of y
      results[i] += val * val;
    }
    results[i] = sqrt(results[i]);
  }

  return results;
}')

func7 <- function(x, y) sqrt(rowSums((y - x[col(y)])^2))  # recycle x across the columns of y

func8 <- function(x, y) sqrt(colSums((t(y) - x)^2))       # transpose so x recycles down the rows
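
As a quick sanity check (my addition, not part of the original benchmark), each function can be compared against the dist() reference computed above; this should return TRUE:

# first row of the full distance matrix = distances from x to each row of y
all.equal(func1(x, y), as.matrix(distances)[1, -1], check.attributes = FALSE)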

compiler::enableJIT(0)  # bytecode compiler off; returns the previous JIT level
#> [1] 3

microbenchmark::microbenchmark(
  func1(x, y),
  func2(x, y),
  func3(x, y),
  func4(x, y),
  func5(x, y),
  func6(x, y),
  func7(x, y),
  func8(x, y)
)
#> Unit: microseconds
#>        expr    min      lq     mean  median      uq      max neval
#> func1(x, y) 37.001 42.8010 50.53103 45.4520 53.6515  138.302   100
#> func2(x, y) 20.201 25.3510 30.23096 27.8515 31.4010   70.401   100
#> func3(x, y) 23.901 27.6510 55.45699 30.0010 35.7505 2248.902   100
#> func4(x, y) 20.501 23.2010 28.20101 24.6020 31.4010  119.501   100
#> func5(x, y)  6.100  8.6020 19.27804  9.6515 11.4510  891.001   100
#> func6(x, y)  1.501  2.4010 11.60706  2.9010  3.4510  848.102   100
#> func7(x, y) 14.401 17.2505 27.73793 19.7510 23.2510  596.002   100
#> func8(x, y) 18.901 22.5510 27.91699 24.9015 29.3010   73.301   100

compiler::enableJIT(3)  # bytecode compiler back on
#> [1] 0

microbenchmark::microbenchmark(
  func1(x, y),
  func2(x, y),
  func3(x, y),
  func4(x, y),
  func5(x, y),
  func6(x, y),
  func7(x, y),
  func8(x, y)
)
#> Unit: microseconds
#>        expr    min      lq     mean  median      uq      max neval
#> func1(x, y) 32.100 35.9510 85.49213 39.4015 44.2510 4298.002   100
#> func2(x, y) 19.701 23.6010 45.11697 26.0505 29.6005 1732.702   100
#> func3(x, y) 19.801 22.2515 76.96108 24.8010 27.6510 5023.201   100
#> func4(x, y) 16.302 19.2510 77.46094 20.3010 21.8005 5564.701   100
#> func5(x, y)  6.201  8.5010 41.53397  9.4510 11.0510 3032.301   100
#> func6(x, y)  1.401  2.3010 13.95802  2.7005  3.0020 1101.801   100
#> func7(x, y) 14.201 16.7010 64.09999 18.6510 21.0015 4307.901   100
#> func8(x, y) 19.201 22.4500 64.33288 24.8510 27.5010 3776.101   100

Created on 2021-04-04 by the reprex package (v2.0.0)
