Numpy Distance Calculations of Different Shaped Arrays

how to calculate distance between two different shapes numpy array in order to do KNN implementation

The problem is that the subtraction is performed along the first axis by default. And since the arrays differ in number of rows, it returns an error:

train = np.array([[1,2,3],[4,5,6]])
test = np.array([[1,2,3],[4,5,6],[7,8,9]])

((train - test)**2).sum(axis=1)
> ValueError: operands could not be broadcast together with shapes (2,3) (3,3)

To subtract along the first axis, you can add another axis to test:

((train - test[:,None])**2).sum(axis=1)
array([[ 9, 9, 9],
[ 9, 9, 9],
[45, 45, 45]])

Numpy distance calculations of different shaped arrays

The documentation of scipy.spatial.distance.euclidean states, that only 1D-vectors are allowed as inputs. Thus you must loop over your arrays like:

distances = np.empty(b.shape[0])
for i in range(b.shape[0]):
distances[i] = scipy.spatial.distance.euclidean(a, b[i])

If you want to have a vectorized implementation, you need to write your own function. Perhaps using np.vectorize with a correct signature will also work, but this is in fact also just a short-hand for a for-loop and will thus have the same performance as a simple for-loop.

As stated in my comment to hannes wittingham's solution, I'll post a one-liner which is focussing on performance:

distances = ((b - a)**2).sum(axis=1)**0.5

Writing out all the calculations reduces the number of separate functions calls and thus assignments of the intermediate results to new arrays. Thus it is about 22% faster than using the solution of hannes wittingham for an array shape of b.shape == (20, 3) and about 5% faster for an array shape of
b.shape == (20000, 3):

a = np.array([1, 1, 1,])
b = np.random.rand(20, 3)
%timeit ((b - a)**2).sum(axis=1)**0.5
# 5.37 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit euclidean_distances(a, b)
# 6.89 µs ± 345 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

b = np.random.rand(20000, 3)
%timeit ((b - a)**2).sum(axis=1)**0.5
# 588 µs ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distances(a, b)
# 616 µs ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But your are losing the flexibility of being able to easily change to distance calculation routine. When using the scipy.spatial.distance module, you can change the calculation routing by simply calling another method.

To improve the calculation performance even further, you can use a jit (just in time) compiler like numba for your functions:

import numba as nb
@nb.njit
def euc(a, b):
return ((b - a)**2).sum(axis=1)**0.5

This reduces the time needed to do the calculations by about 70% for small arrays and by about 60% for large arrays. Unluckily the axis keyword for np.linalg.norm is not yet supported by numba.

Calculating distances between vectors of two numpy arrays

The best solution depends on the input size of the arrays

For lower dimensional problems dim<20 or less, a kdtree approach is usually the way to go. There are quite a lot of answers regarding this topic eg. one I have written a few weeks ago.

If the dimension of the problems is too high you can switch to brute-force algorithms. Both of the following algorithms are much faster than your optimized approach, but on larger input sizes and low dimensional problems much slower than a kdtree approach O(log(n)) instead of O(n^2).

Brute force 1

The following example uses an algorithm described here. It is very fast on large dimensional problems because most of the calculation is done in a highly optimized matrix-matrix multiplication algorithm.
The disadvantage is high memory usage (all distances are calculated in one function call) and precision problems, because of the more error prone calculation method.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def nearest_samples_matrix_2(R,W):
R_Temp=R
W_Temp=W.reshape(-1,W.shape[2])
dist=euclidean_distances(R_Temp, W_Temp)
ind_1,ind_2=np.unravel_index(np.argmin(dist,axis=1),shape=(W.shape[0],W.shape[1]))
return np.vstack((ind_1,ind_2)).T

Brute force 2

This is quite similar to your naive approach, but uses a JIT-Compiler (Numba) to get good performance. Temporary arrays are not necessary and the precision should be good (as long as no overflow occurs). There is room for further optimization (loop tiling) on larger input sizes.

import numpy as np
import numba as nb

#parallelization is only beneficial on larger input data
@nb.njit(fastmath=True,parallel=True,cache=True)
def nearest_samples_matrix_3(r, W):
ind_i=0
ind_j=0
out=np.empty((r.shape[0],2),dtype=np.int64)
for x in nb.prange(r.shape[0]):
delta=0
for k in range(W.shape[2]):
delta += (r[x,k] - W[0,0,k])**2

for i in range(W.shape[0]):
for j in range(W.shape[1]):
norm = 0
for k in range(W.shape[2]):
norm += (r[x,k] - W[i,j,k])**2
if norm < delta:
delta=norm
ind_i=i
ind_j=j
out[x,0]=ind_i
out[x,1]=ind_j
return out

Timings

#small Arrays
N, M = 100, 200
F = 30
S = 50
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))

#your function
%timeit nearest_samples_matrix(R,W)
#268 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nearest_samples_matrix_2(R,W)
#5.62 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nearest_samples_matrix_3(R,W)
#3.68 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#larger arrays
N, M = 1_000, 2_000
F = 50
S = 100
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))

#%timeit nearest_samples_matrix_1(R,W)
#too slow
%timeit nearest_samples_matrix_2(R,W)
#2.76 s ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nearest_samples_matrix_3(R,W)
#1.42 s ± 402 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

calculating distance between two numpy arrays

If you just want the distances between each pair of points, then you don't need to calculate a full distance matrix.

Instead, calculate it directly:

import numpy as np

x = np.array([[[1,2,3,4,5],
[5,6,7,8,5],
[5,6,7,8,5]],
[[11,22,23,24,5],
[25,26,27,28,5],
[5,6,7,8,5]]])

y = np.array([[[31,32,33,34,5],
[35,36,37,38,5],
[5,6,7,8,5]],
[[41,42,43,44,5],
[45,46,47,48,5],
[5,6,7,8,5]]])

xx = x.reshape(2, -1)
yy = y.reshape(2, -1)
dist = np.hypot(*(xx - yy))

print dist

To explain a bit more about what's going on, first we reshape the arrays such that they have a 2xN shape (-1 is a placeholder that tells numpy to calculate the correct size along that axis automatically):

In [2]: x.reshape(2, -1)
Out[2]:
array([[ 1, 2, 3, 4, 5, 5, 6, 7, 8, 5, 5, 6, 7, 8, 5],
[11, 22, 23, 24, 5, 25, 26, 27, 28, 5, 5, 6, 7, 8, 5]])

Therefore, when we subtract xx and yy, we'll get a 2xN array:

In [3]: xx - yy
Out[3]:
array([[-30, -30, -30, -30, 0, -30, -30, -30, -30, 0, 0, 0, 0,
0, 0],
[-30, -20, -20, -20, 0, -20, -20, -20, -20, 0, 0, 0, 0,
0, 0]])

We can then unpack this in to dx and dy components:

In [4]: dx, dy = xx - yy

In [5]: dx
Out[5]:
array([-30, -30, -30, -30, 0, -30, -30, -30, -30, 0, 0, 0, 0,
0, 0])

In [6]: dy
Out[6]:
array([-30, -20, -20, -20, 0, -20, -20, -20, -20, 0, 0, 0, 0,
0, 0])

And calculate the distance (np.hypot is equivalent to np.sqrt(dx**2 + dy**2)):

In [7]: np.hypot(dx, dy)
Out[7]:
array([ 42.42640687, 36.05551275, 36.05551275, 36.05551275,
0. , 36.05551275, 36.05551275, 36.05551275,
36.05551275, 0. , 0. , 0. ,
0. , 0. , 0. ])

Or we can have the unpacking done automatically and do it all in one step:

In [8]: np.hypot(*(xx - yy))
Out[8]:
array([ 42.42640687, 36.05551275, 36.05551275, 36.05551275,
0. , 36.05551275, 36.05551275, 36.05551275,
36.05551275, 0. , 0. , 0. ,
0. , 0. , 0. ])

If you want to calculate other types of distances, just change np.hypot to the function you'd like to use. For example, for Manhattan/city-block distances:

In [9]: dist = np.sum(np.abs(xx - yy), axis=0)

In [10]: dist
Out[10]: array([60, 50, 50, 50, 0, 50, 50, 50, 50, 0, 0, 0, 0, 0, 0])

Numpy to replace for loop to calculate distances between a single point and multiple different points individually

Replace your for loop with this one-liner:

distances = np.linalg.norm(x_y - current_point, axis=1)

Calculate Euclidean distance between two python arrays

Here, you can just use np.linalg.norm to compute the Euclidean distance. Your bug is due to np.subtract is expecting the two inputs are of the same length.

import numpy as np

list_a = np.array([[0,1], [2,2], [5,4], [3,6], [4,2]])
list_b = np.array([[0,1],[5,4]])

def run_euc(list_a,list_b):
return np.array([[ np.linalg.norm(i-j) for j in list_b] for i in list_a])

print(run_euc(list_a, list_b))

The code produces:

[[0.         5.83095189]
[2.23606798 3.60555128]
[5.83095189 0. ]
[5.83095189 2.82842712]
[4.12310563 2.23606798]]


Related Topics



Leave a reply



Submit