Most Efficient Way to Construct Similarity Matrix

Most efficient way to construct similarity matrix

There are two useful functions within scipy.spatial.distance that you can use for this: pdist and squareform. Using pdist will give you the pairwise distances between observations as a one-dimensional array, and squareform will convert this to a square distance matrix.

One catch is that pdist uses distance measures by default, and not similarity, so you'll need to manually specify your similarity function. Judging by the commented output in your code, your DataFrame is also not in the orientation pdist expects, so I've undone the transpose you did in your code.

import pandas as pd
from scipy.spatial.distance import euclidean, pdist, squareform


def similarity_func(u, v):
    return 1 / (1 + euclidean(u, v))

DF_var = pd.DataFrame.from_dict({"s1": [1.2, 3.4, 10.2], "s2": [1.4, 3.1, 10.7], "s3": [2.1, 3.7, 11.3], "s4": [1.5, 3.2, 10.9]})
DF_var.index = ["g1", "g2", "g3"]

dists = pdist(DF_var, similarity_func)
DF_euclid = pd.DataFrame(squareform(dists), columns=DF_var.index, index=DF_var.index)
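
One small optional follow-up (my addition, not part of the original answer): squareform places zeros on the diagonal, so if you want each sample's self-similarity to be 1 under this measure, set it explicitly before building the DataFrame:

import numpy as np

sim = squareform(dists)
np.fill_diagonal(sim, 1.0)  # self-similarity is 1 / (1 + 0) = 1
DF_sim = pd.DataFrame(sim, columns=DF_var.index, index=DF_var.index)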

create a fast custom similarity matrix in python

One way is to use scipy.spatial. That is already a lot more efficient than what you have rolled yourself. In particular, you could do the following, using pdist and a custom metric function:

import numpy as np
from scipy.spatial.distance import pdist, squareform


def sim_mat(df, weights):
    mat = squareform(pdist(df.values, metric=lambda x, y: (x == y) @ weights))
    np.fill_diagonal(mat, sum(weights))
    return mat
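
For example, a quick check on some made-up toy data (the columns, values and weights below are illustrative assumptions, not from the original question):

import pandas as pd

# Hypothetical integer-coded categorical features; one weight per column.
df_toy = pd.DataFrame({"f1": [1, 1, 2], "f2": [3, 4, 4], "f3": [5, 5, 5]})
weights_toy = np.array([0.5, 0.3, 0.2])

print(sim_mat(df_toy, weights_toy))
# 3x3 symmetric matrix; the diagonal equals sum(weights_toy) == 1.0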

Comparing this approach to your original method on datasets of increasing size, I obtain the following results:

[benchmark plot: runtime comparison of the two approaches as dataset size grows]

Efficient method to create NxN similarity/distance matrix in python

Using numpy.ma.MaskedArray and making full use of broadcasting, you can get very good performance.

First get the values of df:

import numpy as np
from numpy import nan


ratings = np.array([[1., 1., nan, 4., 5.],
                    [nan, 1., 1., 5., 5.],
                    [4., nan, 4., 1., 2.]])
# ratings = df_ratings.values

Convert to MaskedArray:

from numpy.ma import masked_invalid


ratings = masked_invalid(ratings)
# masked_array(
#   data=[[1.0, 1.0, --, 4.0, 5.0],
#         [--, 1.0, 1.0, 5.0, 5.0],
#         [4.0, --, 4.0, 1.0, 2.0]],
#   mask=[[False, False, True, False, False],
#         [ True, False, False, False, False],
#         [False, True, False, False, False]],
#   fill_value=1e+20)

Calculate, for each pair of users, which ratings do not agree (i.e. lie on opposite sides of the 2.5 midpoint):

temp = ratings - 2.5
not_agreements = temp[:, None] * temp[None] < 0
# Equivalent to
# from numpy.ma import masked_array
# not_agreements = masked_array([masked_array([(i - 2.5) * (j - 2.5) < 0 for j in ratings]) for i in ratings])

Similarly, calculate all the proximity, impact and popularity terms; here I assume that rating_max, rating_min and rating_median are all scalars (one possible way to define them is sketched after the code below):

dist = np.abs(ratings[:, None] - ratings[None])
dist[not_agreements] *= 2
prox = ((2 * (rating_max - rating_min) + 1) - dist) ** 2

temp = np.abs(ratings - rating_median) + 1
impact_score = temp[:, None] * temp[None]
impact_score[not_agreements] = 1 / impact_score[not_agreements]

mu_k = ratings.mean(0)
temp = ratings - mu_k
shape = ratings.shape
pop = np.ones(shape[:1] + shape)
mask = temp[:, None] * temp[None] > 0
pop[mask] += ((temp[:, None] + temp[None]) / 2)[mask] ** 2
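
If you don't already have rating_max, rating_min and rating_median as scalars, one possible definition (my assumption of a 1-5 rating scale, not something stated in the original answer; adjust to your data) is:

rating_max, rating_min = 5.0, 1.0              # assumed 1-5 rating scale
rating_median = (rating_max + rating_min) / 2  # midpoint of that scale

They need to be in scope before the prox and impact_score lines above are run.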

Multiply them together and sum along the last axis, then set the values on the diagonal to 1, and you get the result you want:

similarity_matrix = (prox * impact_score * pop).sum(-1)
similarity_matrix[np.diag_indices_from(similarity_matrix)] = 1

similarity_matrix_df = pd.DataFrame(similarity_matrix, index=df_ratings.index, columns=df_ratings.index)

In testing, your loop-based method and this vectorized method take similar time on your small example, but as the array grows the loop-based method slows down very quickly: once the array reaches shape (48, 50) it takes about 10 s, while the vectorized method takes only about 0.06 s.

Efficiently calculate large similarity matrix

Here are some bits and pieces of an answer; there are still too many gaps in what you've told us to permit a good answer, but you can fill those in yourself. From everything you've told us, I don't think that the major part of your task is to efficiently calculate a large similarity matrix; I think that the major parts are to efficiently retrieve values from such a matrix and to efficiently update the matrix.

As we've already determined, the matrix is sparse and symmetric; it would be useful to know how sparse. This reduces the storage requirements considerably, but we don't know by how much.

You've told us a bit about updates to user profiles, but does your similarity matrix have to be updated as frequently? My expectation (another assumption) is that similarity measures do not change quickly or sharply when a user modifies his/her profile. From this I hypothesise that working with a similarity measure which is a few minutes (even a few hours) out of date won't do any serious harm.

I think that all this takes us into the domain of databases, which should support fast access to stored similarity measures of the volumes you indicate. I'd be looking to do batch updates of the measures, and only of the measures for users whose profiles have changed, at an interval to suit your demands and availability of computer power.
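
Purely to illustrate that idea, here is a minimal sketch (my assumptions: SQLite as the store, and a canonical user_a < user_b ordering so each symmetric pair is stored only once; none of these names come from the question):

import sqlite3

conn = sqlite3.connect("similarity.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS similarity (
        user_a INTEGER NOT NULL,
        user_b INTEGER NOT NULL,
        score  REAL    NOT NULL,
        PRIMARY KEY (user_a, user_b)
    )
""")

def upsert_similarity(u, v, score):
    # Store each pair once, under (min, max), since the matrix is symmetric.
    a, b = (u, v) if u < v else (v, u)
    conn.execute(
        "INSERT INTO similarity (user_a, user_b, score) VALUES (?, ?, ?) "
        "ON CONFLICT(user_a, user_b) DO UPDATE SET score = excluded.score",
        (a, b, score),
    )

def get_similarity(u, v):
    a, b = (u, v) if u < v else (v, u)
    row = conn.execute(
        "SELECT score FROM similarity WHERE user_a = ? AND user_b = ?",
        (a, b),
    ).fetchone()
    return row[0] if row else 0.0  # missing pairs treated as zero similarity (sparse)

A batch update job would then recompute and upsert only the rows for users whose profiles changed since the last run.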

As for the initial creation of the first version of the similarity matrix, so what if it takes a week in the background, you're only going to do it once.

Efficient computation of similarity matrix in Python (NumPy)

I'm not sure that you can do this using only numpy. I would use the cdist function from the scipy library, something like this:

import numpy as np
from scipy.spatial.distance import cdist

B = 5
X = np.random.rand(B * B).reshape((B, B))
dist = cdist(X, X, metric='euclidean')
K = np.exp(dist)

dist
array([[ 0. , 1.2659804 , 0.98231231, 0.80089176, 1.19326493],
[ 1.2659804 , 0. , 0.72658078, 0.80618767, 0.3776364 ],
[ 0.98231231, 0.72658078, 0. , 0.70205336, 0.81352455],
[ 0.80089176, 0.80618767, 0.70205336, 0. , 0.60025858],
[ 1.19326493, 0.3776364 , 0.81352455, 0.60025858, 0. ]])
K
array([[ 1. , 3.5465681 , 2.67062441, 2.22752646, 3.29783084],
[ 3.5465681 , 1. , 2.06799756, 2.23935453, 1.45883242],
[ 2.67062441, 2.06799756, 1. , 2.01789192, 2.25584482],
[ 2.22752646, 2.23935453, 2.01789192, 1. , 1.82259002],
[ 3.29783084, 1.45883242, 2.25584482, 1.82259002, 1. ]])

Hope this helps, and good luck with your work.

EDIT
You can also compute this using only numpy arrays (which translates directly to a Theano implementation); note that this formula gives the squared pairwise distances:

dist = (X ** 2).sum(1).reshape((X.shape[0], 1)) + (X ** 2).sum(1).reshape((1, X.shape[0])) - 2 * X.dot(X.T)

It should work!
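
As a quick sanity check (my addition, reusing X and dist from the snippets above, where dist now holds the squared distances):

# Clip guards against tiny negative values caused by floating-point round-off.
assert np.allclose(np.sqrt(np.clip(dist, 0, None)), cdist(X, X, metric='euclidean'))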


