What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1],[1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)
similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))
#also can output sparse matrices
similarities_sparse = cosine_similarity(A_sparse,dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))
Results:
pairwise dense output:
[[ 1. 0.40824829 0.40824829]
[ 0.40824829 1. 0.33333333]
[ 0.40824829 0.33333333 1. ]]
pairwise sparse output:
(0, 1) 0.408248290464
(0, 2) 0.408248290464
(0, 0) 1.0
(1, 0) 0.408248290464
(1, 2) 0.333333333333
(1, 1) 1.0
(2, 1) 0.333333333333
(2, 0) 0.408248290464
(2, 2) 1.0
If you want column-wise cosine similarities simply transpose your input matrix beforehand:
A_sparse.transpose()
How to find cosine similarity of one vector vs matrix
Leverage 2D
vectorized matrix-multiplication
Here's one with NumPy using matrix-multiplication on 2D data -
p1 = mat[:,-1].dot(mat[:,:-1])
p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
out1 = p1/p2
Explanation : p1
is the vectorized equivalent of looping of dot(mat[:,i], mat[:,-1])
. p2
is of (norm(mat[:,i])*norm(mat[:,-1]))
.
Sample run for verification -
In [57]: np.random.seed(0)
...: mat = np.random.rand(149,1001)
In [58]: out = np.empty(mat.shape[1]-1)
...: for i in range(mat.shape[1]-1):
...: out[i] = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
In [59]: p1 = mat[:,-1].dot(mat[:,:-1])
...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
...: out1 = p1/p2
In [60]: np.allclose(out, out1)
Out[60]: True
Timings -
In [61]: %%timeit
...: out = np.empty(mat.shape[1]-1)
...: for i in range(mat.shape[1]-1):
...: out[i] = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
18.5 ms ± 977 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [62]: %%timeit
...: p1 = mat[:,-1].dot(mat[:,:-1])
...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
...: out1 = p1/p2
939 µs ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# @yatu's soln
In [89]: a = mat
In [90]: %timeit cosine_similarity(a[None,:,-1] , a.T[:-1])
2.47 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Further optimize on norm
with einsum
Alternatively, we could compute p2
with np.einsum
.
So, norm(mat[:,:-1],axis=0)
could be replaced by :
np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))
Hence, giving us a modified p2
:
p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])
Timings on same setup as earlier -
In [82]: %%timeit
...: p1 = mat[:,-1].dot(mat[:,:-1])
...: p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])
...: out1 = p1/p2
607 µs ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
30x+
speedup over loopy one!
Related Topics
Naturally Sorting Pandas Dataframe
How to Use Qscrollarea to Make Scrollbars Appear
How to Find Numeric Columns in Pandas
How to Apply Piecewise Linear Fit in Python
Inserting Line at Specified Position of a Text File
How to Login to a Website with Python
When Would the -E, --Editable Option Be Useful with Pip Install
Python Typeerror: Not Enough Arguments for Format String
How to Do/Workaround a Conditional Join in Python Pandas
How to Get an Event Callback When a Tkinter Entry Widget Is Modified
How to Move a Tick Label in Matplotlib
Does Python Have a Bitfield Type
Beginner Python: Reading and Writing to the Same File
Repeating Each Element of a Numpy Array 5 Times
Matching Nested Structures with Regular Expressions in Python