How to calculate the cosine similarity of two string lists with sklearn?
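It seems cosine_similarity needs:
- word vectors (numeric counts rather than the strings themselves),
- two-dimensional data (a list of word vectors), so each vector is wrapped in a list: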
print(cosine_similarity( [a_vect], [b_vect] ))
Full working code:
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']
# count word occurrences
a_vals = Counter(a_file)
b_vals = Counter(b_file)
# convert to word-vectors
words = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words]
b_vect = [b_vals.get(word, 0) for word in words]
# find cosine
len_a = sum(av*av for av in a_vect) ** 0.5
len_b = sum(bv*bv for bv in b_vect) ** 0.5
dot = sum(av*bv for av,bv in zip(a_vect, b_vect))
cosine = dot / (len_a * len_b)
print(cosine)
print(cosine_similarity([a_vect], [b_vect]))
Result:
0.2886751345948129
[[0.28867513]]
EDIT:
You can also pass a single list with all the vectors (the second argument then stays at its default, None), and it will compare all pairs: (a,a), (a,b), (b,a), (b,b).
print(cosine_similarity( [a_vect, b_vect] ))
Result:
[[1.         0.28867513]
 [0.28867513 1.        ]]
You can use a longer list [a, b, c, ...] and it will check all possible pairs.
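To illustrate the "all possible pairs" behaviour without depending on sklearn, here is a minimal hand-rolled sketch that builds the same kind of pairwise matrix for three token lists. The helper name cosine_matrix and the third list c_file are made up for this example and are not part of the original answer; the function also assumes no list is empty (an empty list has zero norm and would raise ZeroDivisionError).

```python
from collections import Counter
from math import sqrt


def cosine_matrix(token_lists):
    """Return the pairwise cosine similarity matrix for a list of token lists.

    Hypothetical helper for illustration; assumes every list is non-empty.
    """
    counts = [Counter(tokens) for tokens in token_lists]
    # Shared vocabulary, sorted so the column order is reproducible
    vocab = sorted(set().union(*counts))
    # One count vector per input list, all over the same vocabulary
    vectors = [[c.get(word, 0) for word in vocab] for c in counts]
    norms = [sqrt(sum(v * v for v in vec)) for vec in vectors]
    return [
        [
            sum(x * y for x, y in zip(u, v)) / (nu * nv)
            for v, nv in zip(vectors, norms)
        ]
        for u, nu in zip(vectors, norms)
    ]


a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']
c_file = ['a', 'b', 'z']  # hypothetical third list, not in the original answer

sims = cosine_matrix([a_file, b_file, c_file])
for row in sims:
    print([round(value, 4) for value in row])
```

Entry sims[i][j] is the similarity between list i and list j, so the matrix is symmetric with ones on the diagonal, which should match what cosine_similarity([a_vect, b_vect, c_vect]) returns for the same count vectors.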
Documentation: cosine_similarity
Cosine Similarity scores for each array combination in a list of arrays Python
If I understand correctly, what you are trying to do is to get the cosine similarity when treating each matrix as a 1 x n vector. The easiest approach, in my opinion, is to implement the cosine similarity in vectorized form with numpy functions. As a reminder, given two 1D vectors x and y, the cosine similarity is given by:
cosine_similarity = x.dot(y) / (np.linalg.norm(x, 2) * np.linalg.norm(y, 2))
To do this with the three matrices, we first flatten each of them into a 1D representation and stack them together:
matrices_1d = np.vstack((C.reshape(1, -1), D.reshape(1, -1), E.reshape(1, -1)))
Now that we have the vector representation of each matrix, we can compute the L2 norm of each row using np.linalg.norm (see the NumPy documentation for this function) as follows:
norm_vec = np.linalg.norm(matrices_1d, ord=2, axis=1)
And finally, we can compute the cosine similarities as follows:
cos_sim = matrices_1d.dot(matrices_1d.T) / np.outer(norm_vec, norm_vec)
# array([[1.        , 0.9126993 , 0.9699609 ],
#        [0.9126993 , 1.        , 0.93485159],
#        [0.9699609 , 0.93485159, 1.        ]])
Note that, as a sanity check, the diagonal values are 1, since the cosine similarity of a vector with itself is 1.
The cosine distance is defined as 1 - cos_sim and is easy to compute once you have the similarity.
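The steps above can be put together into one runnable sketch. The original answer does not show the contents of C, D and E, so the 2x2 arrays below are hypothetical stand-ins chosen only to make the example self-contained; the printed numbers will therefore differ from the matrix shown above.

```python
import numpy as np

# Hypothetical inputs standing in for C, D and E (not the original data)
C = np.array([[1.0, 2.0], [3.0, 4.0]])
D = np.array([[2.0, 1.0], [4.0, 3.0]])
E = np.array([[1.0, 0.0], [0.0, 1.0]])

# Flatten each matrix to a row vector and stack: shape (3, 4)
matrices_1d = np.vstack([m.reshape(1, -1) for m in (C, D, E)])

# L2 norm of every row
norm_vec = np.linalg.norm(matrices_1d, ord=2, axis=1)

# All pairwise dot products divided by the matching products of norms
cos_sim = matrices_1d.dot(matrices_1d.T) / np.outer(norm_vec, norm_vec)

# Cosine distance follows directly from the similarity
cos_dist = 1.0 - cos_sim

print(cos_sim)
print(cos_dist)
```

The diagonal of cos_sim should be 1 and the diagonal of cos_dist should be 0, up to floating-point rounding.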
What is the fastest way to calculate cosine similarity between rows of two same-shape matrices
It is possible in a single line: the trick is to specify the axis over which to compute the norm and the dot product.
import numpy as np
X = np.random.randn(3,2)
Y = np.random.randn(3,2)
(X * Y).sum(axis=1) / np.linalg.norm(X, axis=1) / np.linalg.norm(Y, axis=1)
The first part, (X * Y).sum(axis=1), computes the dot product. axis=1 specifies that we sum over the columns, i.e. we get one result per row (the data points).
The second part simply divides by the norm of each vector, computed along the same axis.
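As a quick sanity check on the one-liner, the sketch below wraps the same idea in a function and compares it against a plain per-row loop. The function name rowwise_cosine is made up for this example, and np.einsum is used here as an alternative way to write the row-wise dot product; it is not what the original answer uses. Rows of all zeros are assumed not to occur, since they would cause a division by zero.

```python
import numpy as np


def rowwise_cosine(X, Y):
    """Cosine similarity between X[i] and Y[i] for every row i.

    Hypothetical helper; assumes X and Y have the same shape and no zero rows.
    """
    # Row-wise dot product, equivalent to (X * Y).sum(axis=1)
    dot = np.einsum("ij,ij->i", X, Y)
    return dot / (np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1))


rng = np.random.default_rng(0)  # seeded so the run is reproducible
X = rng.standard_normal((3, 2))
Y = rng.standard_normal((3, 2))

fast = rowwise_cosine(X, Y)

# Reference implementation: one pair of rows at a time
slow = np.array([
    x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))
    for x, y in zip(X, Y)
])

print(np.allclose(fast, slow))
```

Both versions return one value per row, so the result has shape (3,) here rather than the (3, 3) matrix a full pairwise comparison would give.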
Computing cosine similarity in a vectorized operation
Use cosine_similarity from sklearn:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
df = pd.DataFrame(np.random.randn(20).reshape(5,4), columns=["ref_x", "ref_y", "alt_x", "alt_y"])
co_sim = cosine_similarity(df.to_numpy())
pd.DataFrame(co_sim)
Output:
          0         1         2         3         4
0  1.000000  0.085483 -0.126060 -0.137558 -0.411323
1  0.085483  1.000000 -0.447271 -0.277837  0.440389
2 -0.126060 -0.447271  1.000000  0.309562 -0.306372
3 -0.137558 -0.277837  0.309562  1.000000 -0.811515
4 -0.411323  0.440389 -0.306372 -0.811515  1.000000
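If the rows of the DataFrame carry meaningful labels, it may help to keep them on the result so each score can be traced back to a pair of rows. The sketch below is one possible way to do that; the row labels r0 to r4 and the seeded generator are assumptions added for illustration, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)  # seeded so the run is reproducible
df = pd.DataFrame(
    rng.standard_normal((5, 4)),
    columns=["ref_x", "ref_y", "alt_x", "alt_y"],
    index=["r0", "r1", "r2", "r3", "r4"],  # hypothetical row labels
)

# Pairwise similarity between rows, labelled with the original index
co_sim = pd.DataFrame(
    cosine_similarity(df.to_numpy()),
    index=df.index,
    columns=df.index,
)

print(co_sim.round(3))
```

With the labels in place, a single score can be looked up as co_sim.loc["r0", "r3"].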