Distance Calculation on Large Vectors [Performance]

Most efficient way to calculate every L2 distance between vectors of vector array A and vectors of vector array B?

sklearn.metrics.pairwise_distances solves exactly this problem.
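A minimal sketch of that approach (the array shapes here are made up for illustration):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

A = np.random.rand(1000, 64)
B = np.random.rand(500, 64)

# D[i, j] is the Euclidean (L2) distance between A[i] and B[j].
D = pairwise_distances(A, B, metric='euclidean', n_jobs=-1)
print(D.shape)  # (1000, 500)
```

With `n_jobs=-1` the computation is split across all available cores.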

Efficient euclidean distance calculation in python for millions of rows

Did you look at scipy.spatial.cKDTree?

You can build this data structure from one of your data sets, then query it with each point of the second data set to get that point's nearest-neighbor distance.

tree = scipy.spatial.cKDTree(df1)
distances, indexes = tree.query(df2, n_jobs=-1)

I set n_jobs=-1 here to use all available processors. (In SciPy 1.6+ this parameter is named workers.)
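A self-contained sketch of the same idea, with plain NumPy arrays standing in for df1 and df2 (the sizes are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
df1 = rng.random((10000, 3))   # reference points
df2 = rng.random((2000, 3))    # query points

tree = cKDTree(df1)
# For each row of df2: distance to, and index of, its nearest row in df1.
distances, indexes = tree.query(df2, k=1)
print(distances.shape, indexes.shape)  # (2000,) (2000,)
```

Note that query returns only nearest-neighbor distances; for the full pairwise matrix you would use one of the dense approaches above.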

(Speed Challenge) Any faster method to calculate distance matrix between rows of two matrices, in terms of Euclidean distance?

method_XXX <- function() {
  sqrt(outer(rowSums(x^2), rowSums(y^2), '+') - tcrossprod(x, 2 * y))
}

Unit: relative
                       expr       min        lq     mean    median        uq      max
 method_ThomasIsCoding_v1() 12.151624 10.486417 9.213107 10.162740 10.235274 5.278517
 method_ThomasIsCoding_v2()  6.923647  6.055417 5.549395  6.161603  6.140484 3.438976
 method_ThomasIsCoding_v3()  7.133525  6.218283 5.709549  6.438797  6.382204 3.383227
      method_AllanCameron()  7.093680  6.071482 5.776172  6.447973  6.497385 3.608604
               method_XXX()  1.000000  1.000000 1.000000  1.000000  1.000000 1.000000
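The trick behind method_XXX is the expansion ||x - y||² = ||x||² + ||y||² - 2·x·y, which turns the whole distance matrix into one matrix product. The same idea in NumPy (a sketch, not part of the original R answer; shapes are made up):

```python
import numpy as np

x = np.random.rand(300, 16)
y = np.random.rand(200, 16)

# ||x||^2 + ||y||^2 - 2 x.y, broadcast over all row pairs.
sq = (x ** 2).sum(axis=1)[:, None] + (y ** 2).sum(axis=1)[None, :] - 2 * x @ y.T
# Clamp tiny negatives caused by floating-point cancellation before the sqrt.
D = np.sqrt(np.maximum(sq, 0))
```

The clamp matters in practice: for nearly identical rows the subtraction can go slightly negative and sqrt would return NaN.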

Calculating all distances between one point and a group of points efficiently in R

Rather than iterating across data points, you can just condense that to a matrix operation, meaning you only have to iterate across K.

# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)

system.time(
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)
  })
)

Runs in:

   user  system elapsed 
  0.100   0.008   0.108 

on my laptop.
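For comparison, a hypothetical NumPy translation of the apply() approach above, with points and centers stored as columns exactly as in the R code:

```python
import numpy as np

n, K, d = 3823, 10, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((d, n))        # data points as columns
centers = rng.standard_normal((d, K))  # K centers as columns

# For each center, sum squared differences down the columns of x.
dists = np.stack([((x - c[:, None]) ** 2).sum(axis=0) for c in centers.T],
                 axis=1)
print(dists.shape)  # (3823, 10)
```

As in the R version, the loop runs only over the K centers; each iteration is one vectorized pass over all n points.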

Pandas distance matrix performance with vector data

This is certainly more efficient and easier to read than using for loops.

df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                  index=raw_data['counters_'].keys()).T

>>> df.head()
       4716823618  5072134420  5072142538
51811           1           1           4
51812         NaN           1           4
51820           1           1           4
51833          56          56          56
51835           8           9           8

# raw_data no longer needed. Delete to reduce memory footprint.
del raw_data

# Compute the L2 norm of each column vector.
scalars = ((df ** 2).sum()) ** .5

>>> scalars
4716823618    289.679133
5072134420    330.548030
5072142538    331.957829
dtype: float64

def v_dist(col_1, col_2):
    # Cosine distance (1 - cosine similarity) between columns col_1 and col_2.
    return 1 - ((df.iloc[:, col_1] * df.iloc[:, col_2]).sum() /
                (scalars.iloc[col_1] * scalars.iloc[col_2]))

>>> v_dist(0, 1)
0.09036665882900885

>>> v_dist(0, 2)
0.060016436804916085

>>> v_dist(1, 2)
0.015009898476505357

m = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)

>>> m
            4716823618  5072134420  5072142538
4716823618         NaN         NaN         NaN
5072134420         NaN         NaN         NaN
5072142538         NaN         NaN         NaN

for row in range(m.shape[0]):
    for col in range(row, m.shape[1]):  # Note: m.shape[0] equals m.shape[1].
        if row == col:
            # No need to calculate a value for the diagonal.
            m.iat[row, col] = 0
        else:
            # Do two calculations in one due to symmetry.
            m.iat[row, col] = m.iat[col, row] = v_dist(row, col)

>>> m
            4716823618  5072134420  5072142538
4716823618    0.000000    0.090367    0.060016
5072134420    0.090367    0.000000    0.015010
5072142538    0.060016    0.015010    0.000000

Wrapping all of this into a function:

def calc_matrix(raw_data):
    df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                      index=raw_data['counters_'].keys()).T
    scalars = ((df ** 2).sum()) ** .5
    m = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)
    for row in range(m.shape[0]):
        for col in range(row, m.shape[1]):
            if row == col:
                m.iat[row, col] = 0
            else:
                m.iat[row, col] = m.iat[col, row] = (1 -
                    (df.iloc[:, row] * df.iloc[:, col]).sum() /
                    (scalars.iloc[row] * scalars.iloc[col]))
    return m
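The double loop can also be collapsed into a single matrix product. A sketch of that alternative (not from the original answer; it assumes a df with no NaNs, whereas the looped version silently skips them via pandas' NaN-aware sum):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: three vectors as columns.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': [2.0, 4.0, 6.0],
                   'c': [3.0, 0.0, 1.0]})

# Column norms, then 1 - cosine similarity via one matrix product.
norms = np.sqrt((df ** 2).sum())
m = 1 - (df.T @ df) / np.outer(norms, norms)
```

Columns a and b are parallel, so their cosine distance comes out as 0, and the matrix is symmetric with a zero diagonal, matching the looped result.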

Vector Distance Calculation in Java - Optimization

Move the & 0xFF's outside the loop.

Do this by computing int[] versions of both a and b up front, then rewriting the loop to use those arrays.


