What's the Fastest Way in Python to Calculate Cosine Similarity Given Sparse Matrix Data

What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of scikit-learn 0.17 it also supports sparse output:

import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([[0, 1, 0, 0, 1],
              [0, 0, 1, 1, 1],
              [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

# can also output sparse matrices
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

Results:

pairwise dense output:
[[ 1.          0.40824829  0.40824829]
 [ 0.40824829  1.          0.33333333]
 [ 0.40824829  0.33333333  1.        ]]

pairwise sparse output:
(0, 1) 0.408248290464
(0, 2) 0.408248290464
(0, 0) 1.0
(1, 0) 0.408248290464
(1, 2) 0.333333333333
(1, 1) 1.0
(2, 1) 0.333333333333
(2, 0) 0.408248290464
(2, 2) 1.0
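
The sparse output is a SciPy sparse matrix (CSR), so you can convert it back to a regular array when needed. A minimal sketch, using similarities_sparse from above:

# convert the sparse similarity matrix to a dense ndarray when you need
# plain indexing or plotting
dense = similarities_sparse.toarray()
print(dense[0, 1])  # similarity between rows 0 and 1 -> 0.40824829...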

If you want column-wise cosine similarities, simply transpose your input matrix beforehand:

similarities = cosine_similarity(A_sparse.transpose())
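
For instance, with the 3x5 A_sparse from above this gives a 5x5 matrix of column-pair similarities (a quick sketch):

# after transposing, the original columns become rows, so cosine_similarity
# now compares the columns of A against each other
col_sims = cosine_similarity(A_sparse.transpose())
print(col_sims.shape)  # (5, 5)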

How to find the cosine similarity of one vector vs. a matrix

Leverage 2D vectorized matrix-multiplication

Here's one with NumPy using matrix multiplication on 2D data, treating the last column of mat as the query vector and comparing it against all the other columns -

from numpy.linalg import norm

p1 = mat[:,-1].dot(mat[:,:-1])                # dot products: last column vs. all other columns
p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])  # products of the corresponding norms
out1 = p1/p2                                  # cosine similarities

Explanation: p1 is the vectorized equivalent of looping over dot(mat[:,i], mat[:,-1]), and p2 of norm(mat[:,i])*norm(mat[:,-1]).

Sample run for verification -

In [57]: np.random.seed(0)
    ...: mat = np.random.rand(149,1001)

In [58]: out = np.empty(mat.shape[1]-1)
    ...: for i in range(mat.shape[1]-1):
    ...:     out[i] = np.dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))

In [59]: p1 = mat[:,-1].dot(mat[:,:-1])
    ...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
    ...: out1 = p1/p2

In [60]: np.allclose(out, out1)
Out[60]: True

Timings -

In [61]: %%timeit
    ...: out = np.empty(mat.shape[1]-1)
    ...: for i in range(mat.shape[1]-1):
    ...:     out[i] = np.dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
18.5 ms ± 977 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [62]: %%timeit
    ...: p1 = mat[:,-1].dot(mat[:,:-1])
    ...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
    ...: out1 = p1/p2
939 µs ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# @yatu's soln (sklearn's cosine_similarity on the same data)
In [89]: a = mat

In [90]: %timeit cosine_similarity(a[None,:,-1], a.T[:-1])
2.47 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Further optimization of norm with einsum

Alternatively, we could compute p2 with np.einsum.

So, norm(mat[:,:-1],axis=0) could be replaced by:

np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))

Hence, giving us a modified p2:

p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])
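
A quick sanity check (a sketch, reusing mat from the sample run above) that the einsum expression matches the norm call:

# einsum computes the column-wise sums of squares in one pass; taking
# sqrt reproduces norm(..., axis=0) exactly
np.allclose(norm(mat[:,:-1], axis=0),
            np.sqrt(np.einsum('ij,ij->j', mat[:,:-1], mat[:,:-1])))
# -> True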

Timings on same setup as earlier -

In [82]: %%timeit
    ...: p1 = mat[:,-1].dot(mat[:,:-1])
    ...: p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])
    ...: out1 = p1/p2
607 µs ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That's a 30x+ speedup over the loopy version!
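
Putting it all together, here's a minimal self-contained sketch of the fast routine (the function name cosine_sim_vec is my own, not from the answers above):

import numpy as np
from numpy.linalg import norm

def cosine_sim_vec(M, v):
    # cosine similarity of vector v (length m) against each of the n
    # columns of the (m, n) array M, with einsum for the column norms
    p1 = v.dot(M)                                        # n dot products at once
    p2 = np.sqrt(np.einsum('ij,ij->j', M, M)) * norm(v)  # products of norms
    return p1 / p2

np.random.seed(0)
mat = np.random.rand(149, 1001)
print(cosine_sim_vec(mat[:, :-1], mat[:, -1]).shape)  # (1000,)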


