Most efficient way to calculate every L2 distance between vectors of vector array A and vectors of vector array B?
sklearn.metrics.pairwise_distances solves exactly this problem.
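As a quick illustration, here is a minimal sketch of calling it on two small arrays. The arrays A and B are made-up sample data, not taken from the question; rows are treated as vectors.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Made-up sample data: 2 vectors in A, 3 vectors in B, each of dimension 2.
A = np.array([[0.0, 0.0], [3.0, 4.0]])
B = np.array([[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]])

# D[i, j] is the Euclidean (L2) distance between A[i] and B[j].
D = pairwise_distances(A, B, metric="euclidean")
print(D.shape)
```

The result has one row per vector in A and one column per vector in B, so memory grows with the product of the two array lengths.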
Efficient euclidean distance calculation in python for millions of rows
Have you looked at scipy.spatial.cKDTree?
You can construct this data structure for one of your data sets and query it to get, for each point in the second data set, the distance to its nearest neighbour in the first.
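import scipy.spatial
tree = scipy.spatial.cKDTree(df1)
# SciPy >= 1.6 names this parameter `workers`; older releases called it `n_jobs`.
distances, indexes = tree.query(df2, workers=-1)
I set workers=-1 here to use all available processors.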
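To make the return values concrete, here is a small self-contained sketch on random data. The array names and sizes are assumptions for illustration; the brute-force matrix is there only to check the tree's answer and would be too large for millions of rows.

```python
import numpy as np
from scipy.spatial import cKDTree

# Made-up data: 1000 reference points and 200 query points in 3 dimensions.
rng = np.random.default_rng(0)
points_a = rng.random((1000, 3))
points_b = rng.random((200, 3))

tree = cKDTree(points_a)
# For each row of points_b: distance to, and index of, its nearest row in points_a.
distances, indexes = tree.query(points_b, k=1)

# Brute-force check (only feasible for small inputs).
brute = np.sqrt(((points_b[:, None, :] - points_a[None, :, :]) ** 2).sum(axis=2))
print(np.allclose(distances, brute.min(axis=1)))
```

Note that the query returns only nearest-neighbour distances, not the full distance matrix; passing a larger k returns the k closest points per query row.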
(Speed Challenge) Any faster method to calculate distance matrix between rows of two matrices, in terms of Euclidean distance?
method_XXX <- function() {
  sqrt(outer(rowSums(x^2), rowSums(y^2), '+') - tcrossprod(x, 2 * y))
}
Unit: relative
expr                              min         lq       mean     median         uq        max
method_ThomasIsCoding_v1()  12.151624  10.486417   9.213107  10.162740  10.235274   5.278517
method_ThomasIsCoding_v2()   6.923647   6.055417   5.549395   6.161603   6.140484   3.438976
method_ThomasIsCoding_v3()   7.133525   6.218283   5.709549   6.438797   6.382204   3.383227
method_AllanCameron()        7.093680   6.071482   5.776172   6.447973   6.497385   3.608604
method_XXX()                 1.000000   1.000000   1.000000   1.000000   1.000000   1.000000
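method_XXX relies on the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, which turns the whole computation into one matrix product. A rough NumPy sketch of the same idea follows; the function name pairwise_l2 is made up for illustration, and the clipping step is an addition that is not in the R code.

```python
import numpy as np

def pairwise_l2(x, y):
    """Euclidean distances between rows of x (n, d) and rows of y (m, d)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_sq = np.einsum("ij,ij->i", x, x)   # squared norm of each row of x
    y_sq = np.einsum("ij,ij->i", y, y)   # squared norm of each row of y
    # ||x_i||^2 + ||y_j||^2 - 2 * x_i . y_j for every pair (i, j)
    sq = x_sq[:, None] + y_sq[None, :] - 2.0 * (x @ y.T)
    # Rounding can leave tiny negative values; clip them before the sqrt.
    np.maximum(sq, 0.0, out=sq)
    return np.sqrt(sq)

print(pairwise_l2([[0, 0], [3, 4]], [[0, 0], [6, 8]]))
```

The expansion trades a little numerical accuracy for speed: for nearly identical rows the subtraction can cancel badly, which is why the sketch clips at zero.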
Calculating all distances between one point and a group of points efficiently in R
Rather than iterating across data points, you can condense that into a matrix operation, meaning you only have to iterate across K.
# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)
system.time(
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)
  })
)
Runs in:
   user  system elapsed
  0.100   0.008   0.108
on my laptop.
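The same pattern carries over to NumPy. This is a minimal sketch, assuming points are stored as rows (the R code above stores them as columns); the function name sq_dists_to_centers is made up for illustration. Like the R snippet, it returns squared distances.

```python
import numpy as np

# Fake data with the same sizes as the R example, but one point per row.
rng = np.random.default_rng(0)
n, K, d = 3823, 10, 64
x = rng.standard_normal((n, d))
centers = rng.standard_normal((K, d))

def sq_dists_to_centers(x, centers):
    """Squared Euclidean distance from every row of x to every center."""
    out = np.empty((x.shape[0], centers.shape[0]))
    # Loop only over the K centers; each step is vectorised over all n points.
    for k, center in enumerate(centers):
        diff = x - center                      # broadcasts center across rows
        out[:, k] = np.einsum("ij,ij->i", diff, diff)
    return out

dists = sq_dists_to_centers(x, centers)
print(dists.shape)
```

Take np.sqrt(dists) if actual distances are needed; for nearest-center assignment the squared values give the same ordering.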
Pandas distance matrix performance with vector data
This is certainly more efficient and easier to read than using for loops.
import numpy as np
import pandas as pd
df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                  index=raw_data['counters_'].keys()).T
>>> df.head()
       4716823618  5072134420  5072142538
51811           1           1           4
51812         NaN           1           4
51820           1           1           4
51833          56          56          56
51835           8           9           8
# raw_data no longer needed. Delete to reduce memory footprint.
del raw_data
# Create scalars.
scalars = ((df ** 2).sum()) ** .5
>>> scalars
4716823618 289.679133
5072134420 330.548030
5072142538 331.957829
dtype: float64
def v_dist(col_1, col_2):
    return 1 - ((df.iloc[:, col_1] * df.iloc[:, col_2]).sum() /
                (scalars.iloc[col_1] * scalars.iloc[col_2]))
>>> v_dist(0, 1)
0.09036665882900885
>>> v_dist(0, 2)
0.060016436804916085
>>> v_dist(1, 2)
0.015009898476505357
m = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)
>>> m
            4716823618  5072134420  5072142538
4716823618         NaN         NaN         NaN
5072134420         NaN         NaN         NaN
5072142538         NaN         NaN         NaN
for row in range(m.shape[0]):
    for col in range(row, m.shape[1]):  # Note: m.shape[0] equals m.shape[1]
        if row == col:
            # No need to calculate value for diagonal.
            m.iat[row, col] = 0
        else:
            # Fill both mirrored cells with one calculation due to symmetry.
            m.iat[row, col] = m.iat[col, row] = v_dist(row, col)
>>> m
            4716823618  5072134420  5072142538
4716823618    0.000000    0.090367    0.060016
5072134420    0.090367    0.000000    0.015010
5072142538    0.060016    0.015010    0.000000
Wrapping all of this into a function:
def calc_matrix(raw_data):
    df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                      index=raw_data['counters_'].keys()).T
    # Euclidean norm of each column (NaN entries are skipped by sum()).
    scalars = ((df ** 2).sum()) ** .5
    m = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)
    for row in range(m.shape[0]):
        for col in range(row, m.shape[1]):
            if row == col:
                m.iat[row, col] = 0
            else:
                m.iat[row, col] = m.iat[col, row] = (1 -
                    (df.iloc[:, row] * df.iloc[:, col]).sum() /
                    (scalars.iloc[row] * scalars.iloc[col]))
    return m
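The quantity computed above is the cosine distance (1 minus cosine similarity) between columns. If the nested loop becomes the bottleneck, the whole matrix can also be produced with a single matrix product. This is a sketch of that alternative, not the answer's own method; the name cosine_distance_matrix is made up, and it assumes that treating NaN as 0 is acceptable, which matches how sum() skips NaN in the loop version.

```python
import numpy as np

def cosine_distance_matrix(data):
    """Cosine distance between all pairs of columns of a 2-D array."""
    values = np.asarray(data, dtype=float)
    # Treat missing entries as 0 so they contribute nothing to sums.
    values = np.where(np.isnan(values), 0.0, values)
    norms = np.sqrt((values ** 2).sum(axis=0))       # one norm per column
    similarity = (values.T @ values) / np.outer(norms, norms)
    dist = 1.0 - similarity
    np.fill_diagonal(dist, 0.0)                      # exact zeros on the diagonal
    return dist

# Made-up example: three column vectors (1,0), (0,1) and (1,1).
example = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0]])
print(cosine_distance_matrix(example))
```

With a DataFrame, something like pd.DataFrame(cosine_distance_matrix(df.to_numpy()), index=df.columns, columns=df.columns) would restore the labels. A column whose norm is zero yields NaN entries, as it would in the loop version.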
Vector Distance Calculation in Java - Optimization
Move the & 0xFF's outside the loop. Do this by calculating an int[] version of both a and b, and rewrite your loop using these.