Distance Calculation on Large Vectors [Performance]

Most efficient way to calculate every L2 distance between vectors of vector array A and vectors of vector array B?

sklearn.metrics.pairwise_distances solves exactly this problem.
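A minimal sketch of that approach (the array shapes here are made up for illustration):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

A = np.random.rand(1000, 64)
B = np.random.rand(500, 64)

# D[i, j] is the Euclidean (L2) distance between A[i] and B[j].
D = pairwise_distances(A, B, metric='euclidean', n_jobs=-1)
print(D.shape)  # (1000, 500)
```

With `n_jobs=-1` the computation is split across all available cores.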

Efficient euclidean distance calculation in python for millions of rows

Did you look at scipy.spatial.cKDTree?

You can build this data structure from one of your data sets, then query it with each point of the second data set to get that point's nearest-neighbor distance.

tree = scipy.spatial.cKDTree(df1)
distances, indexes = tree.query(df2, n_jobs=-1)

I set n_jobs=-1 here to use all available processors. (In SciPy 1.6+ this parameter is named workers.)
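A self-contained sketch of the same idea, with plain NumPy arrays standing in for df1 and df2 (the sizes are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
df1 = rng.random((10000, 3))   # reference points
df2 = rng.random((2000, 3))    # query points

tree = cKDTree(df1)
# For each row of df2: distance to, and index of, its nearest row in df1.
distances, indexes = tree.query(df2, k=1)
print(distances.shape, indexes.shape)  # (2000,) (2000,)
```

Note that query returns only nearest-neighbor distances; for the full pairwise matrix you would use one of the dense approaches above.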

(Speed Challenge) Any faster method to calculate distance matrix between rows of two matrices, in terms of Euclidean distance?

method_XXX <- function() {
  sqrt(outer(rowSums(x^2), rowSums(y^2), '+') - tcrossprod(x, 2 * y))
}

Unit: relative
                       expr       min        lq     mean    median        uq      max
 method_ThomasIsCoding_v1() 12.151624 10.486417 9.213107 10.162740 10.235274 5.278517
 method_ThomasIsCoding_v2()  6.923647  6.055417 5.549395  6.161603  6.140484 3.438976
 method_ThomasIsCoding_v3()  7.133525  6.218283 5.709549  6.438797  6.382204 3.383227
      method_AllanCameron()  7.093680  6.071482 5.776172  6.447973  6.497385 3.608604
               method_XXX()  1.000000  1.000000 1.000000  1.000000  1.000000 1.000000
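The trick behind method_XXX is the expansion ||x - y||² = ||x||² + ||y||² - 2·x·y, which turns the whole distance matrix into one matrix product. The same idea in NumPy (a sketch, not part of the original R answer; shapes are made up):

```python
import numpy as np

x = np.random.rand(300, 16)
y = np.random.rand(200, 16)

# ||x||^2 + ||y||^2 - 2 x.y, broadcast over all row pairs.
sq = (x ** 2).sum(axis=1)[:, None] + (y ** 2).sum(axis=1)[None, :] - 2 * x @ y.T
# Clamp tiny negatives caused by floating-point cancellation before the sqrt.
D = np.sqrt(np.maximum(sq, 0))
```

The clamp matters in practice: for nearly identical rows the subtraction can go slightly negative and sqrt would return NaN.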

Calculating all distances between one point and a group of points efficiently in R

Rather than iterating across data points, you can just condense that to a matrix operation, meaning you only have to iterate across K.

# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)

system.time(
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)
  })
)

Runs in:

   user  system elapsed 
  0.100   0.008   0.108 

on my laptop.
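For comparison, a hypothetical NumPy translation of the apply() approach above, with points and centers stored as columns exactly as in the R code:

```python
import numpy as np

n, K, d = 3823, 10, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((d, n))        # data points as columns
centers = rng.standard_normal((d, K))  # K centers as columns

# For each center, sum squared differences down the columns of x.
dists = np.stack([((x - c[:, None]) ** 2).sum(axis=0) for c in centers.T],
                 axis=1)
print(dists.shape)  # (3823, 10)
```

As in the R version, the loop runs only over the K centers; each iteration is one vectorized pass over all n points.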

Pandas distance matrix performance with vector data

This is certainly more efficient and easier to read than using for loops.

df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                  index=raw_data['counters_'].keys()).T

>>> df.head()
       4716823618  5072134420  5072142538
51811           1           1           4
51812         NaN           1           4
51820           1           1           4
51833          56          56          56
51835           8           9           8

# raw_data no longer needed. Delete to reduce memory footprint.
del raw_data

# Compute the L2 norm of each column vector.
scalars = ((df ** 2).sum()) ** .5

>>> scalars
4716823618    289.679133
5072134420    330.548030
5072142538    331.957829
dtype: float64

def v_dist(col_1, col_2):
    # Cosine distance (1 - cosine similarity) between columns col_1 and col_2.
    return 1 - ((df.iloc[:, col_1] * df.iloc[:, col_2]).sum() /
                (scalars.iloc[col_1] * scalars.iloc[col_2]))

>>> v_dist(0, 1)
0.09036665882900885

>>> v_dist(0, 2)
0.060016436804916085

>>> v_dist(1, 2)
0.015009898476505357

m = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)

>>> m
            4716823618  5072134420  5072142538
4716823618         NaN         NaN         NaN
5072134420         NaN         NaN         NaN
5072142538         NaN         NaN         NaN

for row in range(m.shape[0]):
    for col in range(row, m.shape[1]):  # Note: m.shape[0] equals m.shape[1].
        if row == col:
            # No need to calculate a value for the diagonal.
            m.iat[row, col] = 0
        else:
            # Do two calculations in one due to symmetry.
            m.iat[row, col] = m.iat[col, row] = v_dist(row, col)

>>> m
            4716823618  5072134420  5072142538
4716823618    0.000000    0.090367    0.060016
5072134420    0.090367    0.000000    0.015010
5072142538    0.060016    0.015010    0.000000

Wrapping all of this into a function:

def calc_matrix(raw_data):
    df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                      index=raw_data['counters_'].keys()).T
    scalars = ((df ** 2).sum()) ** .5
    m = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)
    for row in range(m.shape[0]):
        for col in range(row, m.shape[1]):
            if row == col:
                m.iat[row, col] = 0
            else:
                m.iat[row, col] = m.iat[col, row] = (1 -
                    (df.iloc[:, row] * df.iloc[:, col]).sum() /
                    (scalars.iloc[row] * scalars.iloc[col]))
    return m
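The double loop can also be collapsed into a single matrix product. A sketch of that alternative (not from the original answer; it assumes a df with no NaNs, whereas the looped version silently skips them via pandas' NaN-aware sum):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: three vectors as columns.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': [2.0, 4.0, 6.0],
                   'c': [3.0, 0.0, 1.0]})

# Column norms, then 1 - cosine similarity via one matrix product.
norms = np.sqrt((df ** 2).sum())
m = 1 - (df.T @ df) / np.outer(norms, norms)
```

Columns a and b are parallel, so their cosine distance comes out as 0, and the matrix is symmetric with a zero diagonal, matching the looped result.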

Vector Distance Calculation in Java - Optimization

Move the & 0xFF's outside the loop.

Do this by computing int[] versions of both a and b up front, then rewriting the loop to use those arrays.


