How to Do N-D Distance and Nearest Neighbor Calculations on Numpy Arrays

#1. All Distances

  • Only using numpy

The naive method is:

D = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :])**2, axis = -1))

However, this approach is slow and allocates a large (i, j, n)-shaped intermediate array, which can use a lot of memory.
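
For reference, the snippets in this section assume X and Y are 2-D float arrays of shape (i, n) and (j, n). A minimal setup with hypothetical shapes, so the examples are runnable as written:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 3))   # i = 1000 points in n = 3 dimensions
Y = rng.random((2000, 3))   # j = 2000 points in the same 3 dimensions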

However, thanks to a trick from @Divakar (eucl_dist package, wiki), we can use a bit of algebra and np.einsum to decompose the squared distance as (X - Y)**2 = X**2 - 2*X*Y + Y**2:

D = np.sqrt(                                  # (X - Y) ** 2
    np.einsum('ij, ij ->i', X, X)[:, None]   # = X ** 2
    + np.einsum('ij, ij ->i', Y, Y)          # + Y ** 2
    - 2 * X.dot(Y.T))                        # - 2 * X * Y
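
As a quick sanity check (assuming SciPy is available), the decomposed form agrees with the direct computation up to floating-point error:

from scipy.spatial.distance import cdist
assert np.allclose(D, cdist(X, Y))   # same values, up to rounding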
  • Y is X

Similar to above:

XX = np.einsum('ij, ij ->i', X, X)
D = np.sqrt(XX[:, None] + XX - 2 * X.dot(X.T))

Beware that floating-point imprecision can make the diagonal terms deviate very slightly from zero with this method. If you need them to be exactly zero, you'll have to set them explicitly:

np.einsum('ii->i', D)[:] = 0 
  • Any Package

scipy.spatial.distance.cdist is the most intuitive built-in function for this, and far faster than bare numpy:

from scipy.spatial.distance import cdist
D = cdist(X, Y)

cdist can also handle many other distance measures, as well as user-defined distance measures (although these are not optimized). Check the documentation linked above for details.

  • Y is X

For distances within a single set of points, scipy.spatial.distance.pdist works similarly to cdist, but returns a 1-D condensed distance array, saving space by storing each pairwise term only once (the full matrix is symmetric). You can convert this to a square matrix using squareform:

from scipy.spatial.distance import pdist, squareform
D_cond = pdist(X)
D = squareform(D_cond)

#2. K Nearest Neighbors (KNN)

  • Only using numpy

We could use np.argpartition to get the k-nearest indices and use those to get the corresponding distance values. So, with D as the array holding the distance values obtained above, we would have -

if k == 1:
    k_i = D.argmin(0)[None]   # keep a leading axis so take_along_axis works below
else:
    k_i = D.argpartition(k, axis = 0)[:k]
k_d = np.take_along_axis(D, k_i, axis = 0)

However, we can speed this up a bit by deferring the square roots until after we have reduced our dataset: np.sqrt is the slowest part of computing the Euclidean norm, and since squaring is monotonic, the squared distances have the same ordering, so we don't need the root until the end.

D_sq = np.einsum('ij, ij ->i', X, X)[:, None] \
       + np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
if k == 1:
    k_i = D_sq.argmin(0)[None]
else:
    k_i = D_sq.argpartition(k, axis = 0)[:k]
k_d = np.sqrt(np.take_along_axis(D_sq, k_i, axis = 0))

Now, np.argpartition performs an indirect partition: it only guarantees that the first k elements are the k smallest, not that they come out in sorted order. So, for a sorted output, we need to apply argsort to the output of the previous step:

sorted_idx = k_d.argsort(axis = 0)
k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
k_d_sorted = np.take_along_axis(k_d, sorted_idx, axis = 0)

If you only need k_i, you never need the square root at all:

D_sq = np.einsum('ij, ij ->i', X, X)[:, None] \
       + np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
if k == 1:
    k_i = D_sq.argmin(0)[None]
else:
    k_i = D_sq.argpartition(k, axis = 0)[:k]
k_d_sq = np.take_along_axis(D_sq, k_i, axis = 0)
sorted_idx = k_d_sq.argsort(axis = 0)
k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
  • X is Y

In the above code, replace:

D_sq = np.einsum('ij, ij ->i', X, X)[:, None] \
       + np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)

with:

XX = np.einsum('ij, ij ->i', X, X)
D_sq = XX[:, None] + XX - 2 * X.dot(X.T)
  • Any Package

A KD-tree is a much faster method for finding neighbors and constrained distances. Be aware that while a KDTree is usually much faster than the brute-force solutions above in 3D (as long as you have more than about 8 points), in n dimensions it only scales well if you have more than 2**n points. For discussion of, and more advanced methods for, high dimensions, see here.

The most commonly recommended implementations are scipy's scipy.spatial.KDTree and scipy.spatial.cKDTree:

from scipy.spatial import KDTree
X_tree = KDTree(X)
k_d, k_i = X_tree.query(Y, k = k)

Unfortunately, scipy's KDTree implementation is slow and has a tendency to segfault on larger data sets. As pointed out by @HansMusgrave here, pykdtree improves performance a lot, but is not as widely available a dependency as scipy and currently only handles Euclidean distance (while the KDTree in scipy can handle Minkowski p-norms of any order).
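
If pykdtree is installed, the query is essentially a drop-in replacement. A sketch based on its documented interface:

from pykdtree.kdtree import KDTree as PyKDTree

X_ptree = PyKDTree(X)
k_d, k_i = X_ptree.query(Y, k = k)   # same (distances, indices) convention as scipy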

  • X is Y

Use instead:

k_d, k_i = X_tree.query(X, k = k)
  • Arbitrary metrics

A BallTree has similar algorithmic properties to a KDTree. I'm not aware of a parallel/vectorized/fast BallTree in Python, but using scikit-learn's BallTree we can still have reasonable KNN queries for user-defined metrics. If available, built-in metrics will be much faster.

import numpy as np
import sklearn.neighbors

def d(a, b):
    # example user-defined metric: Chebyshev (L-infinity) distance
    return max(np.abs(a - b))

tree = sklearn.neighbors.BallTree(X, metric = d)
k_d, k_i = tree.query(Y)   # defaults to k = 1

This approach will give wrong answers if d() is not a true metric. The only reason a BallTree is faster than brute force is that the properties of a metric allow it to rule out some candidates. For truly arbitrary functions, brute force is actually necessary.

#3. Radius search

  • Only using numpy

The simplest method is just to use boolean indexing:

mask = D_sq < r**2
r_i, r_j = np.where(mask)
r_d = np.sqrt(D_sq[mask])
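
Because of the floating-point issue noted in #1, D_sq can dip very slightly below zero for near-coincident points, which makes np.sqrt produce NaNs. A defensive variant of the last line:

r_d = np.sqrt(np.clip(D_sq[mask], 0, None))   # clip tiny negative values before the root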
  • Any Package

Similar to above, you can use scipy.spatial.KDTree.query_ball_point

r_ij = X_tree.query_ball_point(Y, r = r)

or scipy.spatial.KDTree.query_ball_tree

Y_tree = KDTree(Y)
r_ij = X_tree.query_ball_tree(Y_tree, r = r)

Unfortunately, r_ij ends up being a list of index arrays, which is a bit difficult to untangle for later use.
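
If you do need flat index arrays, here is a sketch for untangling the result of query_ball_point(Y, r = r), where entry m lists the indices of the X points near Y[m]:

lengths = [len(neighbors) for neighbors in r_ij]
r_j = np.repeat(np.arange(len(r_ij)), lengths)   # query-point index into Y
r_i = np.concatenate([np.asarray(neighbors, dtype = int) for neighbors in r_ij])   # indices into X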

Much easier is to use cKDTree's sparse_distance_matrix, which can output a coo_matrix

from scipy.spatial import cKDTree
X_cTree = cKDTree(X)
Y_cTree = cKDTree(Y)
D_coo = X_cTree.sparse_distance_matrix(Y_cTree, r, output_type = 'coo_matrix')
r_i = D_coo.row
r_j = D_coo.col
r_d = D_coo.data

This is an extraordinarily flexible format for the distance matrix, as it stays an actual (sparse) matrix and, if converted to csr, can also be used for many vectorized operations.
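
For instance, a sketch of one such vectorized operation: counting how many points of Y lie within r of each point of X:

D_csr = D_coo.tocsr()
neighbor_counts = D_csr.getnnz(axis = 1)   # stored entries per row = neighbors within r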

Find nearest neighbors of a numpy array in a list of numpy arrays using Euclidean distance

Use scipy's kd-tree.

A small example is available here.

Many people seem to complain about its performance, though, and recommend sklearn's implementation instead (see sklearn.neighbors, which uses this data structure internally).
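
A minimal sketch (with hypothetical data) of the kd-tree approach, stacking the list into a single 2-D array first:

import numpy as np
from scipy.spatial import cKDTree

arr_list = [np.array([1.0, 2.0]), np.array([3.0, 1.0]), np.array([0.5, 2.5])]   # hypothetical data
query = np.array([1.1, 2.1])

tree = cKDTree(np.vstack(arr_list))   # stack the list into one (m, n) array
dist, idx = tree.query(query, k = 1)
nearest = arr_list[idx]               # the closest array in the list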

Python: Calculating the distance between points in an array

The Pythagorean theorem states sqrt(a^2 + b^2) = c, where c is the length of the hypotenuse: here, the distance between two points whose coordinate offsets are a and b.

from math import sqrt

dist_list = []
for i in A1[1:]:
    dist = sqrt(pow(A1[0][1] - i[1], 2) + pow(A1[0][2] - i[2], 2))
    dist_list.append(dist)

If you don't want to import math:

dist_list2 = []
for i in A1[1:]:
    dist = pow(pow(A1[0][1] - i[1], 2) + pow(A1[0][2] - i[2], 2), 0.5)
    dist_list2.append(dist)
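
In keeping with the rest of this page, the loop can also be vectorized. A sketch, assuming A1 converts to a 2-D numpy array whose columns 1 and 2 hold the coordinates:

import numpy as np

A1_arr = np.asarray(A1, dtype = float)
dist_arr = np.sqrt(((A1_arr[1:, 1:3] - A1_arr[0, 1:3])**2).sum(axis = 1))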

Find closest/similar value(vector) inside a matrix

Approach #1

We can use a Cython-powered kd-tree for quick nearest-neighbor lookup, which is very efficient both memory-wise and performance-wise -

In [276]: from scipy.spatial import cKDTree

In [277]: matrix[cKDTree(matrix).query(search_vec, k=1)[1]]
Out[277]: array([2, 2])

Approach #2

With SciPy's cdist -

In [286]: from scipy.spatial.distance import cdist

In [287]: matrix[cdist(matrix, np.atleast_2d(search_vec)).argmin()]
Out[287]: array([2, 2])

Approach #3

With Scikit-learn's Nearest Neighbors -

from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=1).fit(matrix)
closest_vec = matrix[nbrs.kneighbors(np.atleast_2d(search_vec))[1][0,0]]

Approach #4

With Scikit-learn's kdtree -

from sklearn.neighbors import KDTree
kdt = KDTree(matrix, metric='euclidean')
cv = matrix[kdt.query(np.atleast_2d(search_vec), k=1, return_distance=False)[0,0]]

Approach #5

From the eucl_dist package (disclaimer: I am its author) and following the wiki contents, we could leverage matrix multiplication -

M = matrix.dot(search_vec)
d = np.einsum('ij,ij->i', matrix, matrix) + np.inner(search_vec, search_vec) - 2 * M
closest_vec = matrix[d.argmin()]

How to find all distances between points in a matrix without duplicates?

Looks like in general you want a KDTree implementation, with query_pairs.

from scipy.spatial import KDTree
points_tree = KDTree(points)
points_in_radius = points_tree.query_pairs(radius)

This will be much faster than actually computing all of the pairwise distances and then applying a tolerance.
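
query_pairs returns a set of (i, j) index tuples by default. A sketch for turning it into arrays and recovering the corresponding distances (assuming numpy is imported as np and at least one pair was found):

pairs = np.array(sorted(points_in_radius))   # (n_pairs, 2) array of index pairs with i < j
pair_d = np.linalg.norm(points[pairs[:, 0]] - points[pairs[:, 1]], axis = 1)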


