Cosine Similarity Between 2 Number Lists

How to calculate the cosine similarity of two string lists with sklearn?

It seems it needs:

  • word-vectors,
  • two-dimensional data (a list with many word-vectors):

print(cosine_similarity([a_vect], [b_vect]))

Full working code:

from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']

# count word occurrences
a_vals = Counter(a_file)
b_vals = Counter(b_file)

# convert to word-vectors
words = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words]
b_vect = [b_vals.get(word, 0) for word in words]

# find cosine
len_a = sum(av*av for av in a_vect) ** 0.5
len_b = sum(bv*bv for bv in b_vect) ** 0.5
dot = sum(av*bv for av, bv in zip(a_vect, b_vect))
cosine = dot / (len_a * len_b)

print(cosine)
print(cosine_similarity([a_vect], [b_vect]))

Result:

0.2886751345948129
[[0.28867513]]

EDIT:

You can also use one list with all the data (so the second argument will be None), and it will compare all pairs: (a,a), (a,b), (b,a), (b,b).

print(cosine_similarity([a_vect, b_vect]))

Result:

[[1.         0.28867513]
 [0.28867513 1.        ]]

You can use a longer list [a, b, c, ...] and it will check all possible pairs.
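
For example, a minimal sketch extending the code above with a third word-vector (c_file is a made-up example, counted and vectorized the same way as a_file and b_file):

c_file = ['a', 'x', 'x']
c_vals = Counter(c_file)
# note: words not already in `words` would be dropped here;
# in practice, rebuild the vocabulary from all files
c_vect = [c_vals.get(word, 0) for word in words]

# one call returns the full 3x3 matrix of pairwise similarities
print(cosine_similarity([a_vect, b_vect, c_vect]))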


Documentation: cosine_similarity

Cosine Similarity scores for each array combination in a list of arrays in Python

If I understand correctly, what you are trying to do is to get the cosine similarity when using each matrix as a 1×n vector. The easiest approach, in my opinion, is to implement the cosine similarity vectorially with numpy functions. As a reminder, given two 1D vectors x and y, the cosine similarity is given by:

cosine_similarity = x.dot(y) / (np.linalg.norm(x, 2) * np.linalg.norm(y, 2))

To do this with the three matrices, we first flatten each into a 1D representation and stack them together:

matrices_1d = np.vstack((C.reshape(1, -1), D.reshape(1, -1), E.reshape(1, -1)))

Now that we have the vector representation of each matrix, we can compute the L2 norm of each using np.linalg.norm (see the numpy documentation) as follows:

norm_vec = np.linalg.norm(matrices_1d, ord=2, axis=1)

And finally, we can compute the cosine similarities as follows:

cos_sim = matrices_1d.dot(matrices_1d.T) / np.outer(norm_vec, norm_vec)
# array([[1.        , 0.9126993 , 0.9699609 ],
#        [0.9126993 , 1.        , 0.93485159],
#        [0.9699609 , 0.93485159, 1.        ]])

Note that, as a sanity check, the diagonal values are 1, since the cosine similarity of a vector with itself is 1.

The cosine distance is defined to be 1 - cos_sim and is easy to compute once you have the similarity.
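
Putting the pieces together, a minimal self-contained sketch (C, D, and E here are random placeholders, standing in for whatever same-shape matrices you actually have):

import numpy as np

# placeholder matrices; substitute your own same-shape arrays
rng = np.random.default_rng(0)
C, D, E = (rng.standard_normal((4, 5)) for _ in range(3))

# flatten each matrix to a 1D row and stack the rows together
matrices_1d = np.vstack([M.reshape(1, -1) for M in (C, D, E)])

# L2 norm of each row
norm_vec = np.linalg.norm(matrices_1d, ord=2, axis=1)

# pairwise cosine similarities, and the corresponding distances
cos_sim = matrices_1d.dot(matrices_1d.T) / np.outer(norm_vec, norm_vec)
cos_dist = 1 - cos_sim

print(cos_sim)   # 3x3 symmetric matrix with 1.0 on the diagonal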

What is the fastest way to calculate cosine similarity between rows of two same-shape matrices?

Possible in a single line: the trick is just to specify the axis over which to perform the norm and the dot product.

import numpy as np

X = np.random.randn(3, 2)
Y = np.random.randn(3, 2)
# dot product over columns, divided by the row-wise norms
(X * Y).sum(axis=1) / np.linalg.norm(X, axis=1) / np.linalg.norm(Y, axis=1)

The first part, (X * Y).sum(axis=1), takes care of computing the dot product. axis=1 specifies that we perform the dot product over the columns, i.e. we get one result for each row (the data points).

The second part simply computes the norm of each vector, with the same method.
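
As a quick sanity check, here is a sketch comparing the one-liner against sklearn's pairwise function; the row-wise result should equal the diagonal of the full pairwise matrix:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

X = np.random.randn(3, 2)
Y = np.random.randn(3, 2)

# row-wise cosine similarity in one line
row_wise = (X * Y).sum(axis=1) / np.linalg.norm(X, axis=1) / np.linalg.norm(Y, axis=1)

# the same values sit on the diagonal of the full pairwise matrix
print(np.allclose(row_wise, np.diag(cosine_similarity(X, Y))))   # True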

Computing cosine similarity in a vectorized operation

Use cosine_similarity from sklearn:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(np.random.randn(20).reshape(5, 4),
                  columns=["ref_x", "ref_y", "alt_x", "alt_y"])
co_sim = cosine_similarity(df.to_numpy())
pd.DataFrame(co_sim)

output:

          0         1         2         3         4
0  1.000000  0.085483 -0.126060 -0.137558 -0.411323
1  0.085483  1.000000 -0.447271 -0.277837  0.440389
2 -0.126060 -0.447271  1.000000  0.309562 -0.306372
3 -0.137558 -0.277837  0.309562  1.000000 -0.811515
4 -0.411323  0.440389 -0.306372 -0.811515  1.000000
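
If you want the result labeled like the input, a small follow-up (reusing df and co_sim from above):

# carry the original row index over to both axes of the similarity matrix
co_sim_df = pd.DataFrame(co_sim, index=df.index, columns=df.index)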

