Calculate cosine similarity given 2 sentence strings
A simple pure-Python implementation would be:
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)
text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."
vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
print("Cosine:", cosine)
Prints:
Cosine: 0.861640436855
The cosine formula used here is the standard one: the dot product of the two term-count vectors divided by the product of their magnitudes.
This does not include weighting of the words by tf-idf, but in order to use tf-idf, you need a reasonably large corpus from which to estimate the tf-idf weights.
You can also develop it further, by using a more sophisticated way to extract words from a piece of text, stem or lemmatise it, etc.
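As a minimal sketch of the tf-idf route, scikit-learn's TfidfVectorizer combines tokenization and tf-idf weighting in one step (keeping in mind that a two-document corpus is far too small to estimate meaningful idf weights; in practice you would fit on a larger corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "This is a foo bar sentence .",
    "This sentence is similar to a foo bar sentence .",
]

# Fit the vectorizer on the corpus and transform both sentences into
# tf-idf weighted vectors; cosine_similarity accepts the sparse rows.
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(corpus)

print(cosine_similarity(matrix[0], matrix[1]))
```

Because "similar" and "to" appear in only one sentence, the tf-idf vectors differ and the similarity lands strictly between 0 and 1.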
Computing cosine similarity between specific strings using dplyr
My solution gathers all the candidate combinations and then calculates the similarity. Try this:
library(tidyverse)
library(stringdist)

df1 <- tibble(
  abel = rnorm(10),
  abby = rnorm(10),
  bret = rnorm(10))

df2 <- tibble(
  barista = rnorm(10),
  beekeeper = rnorm(10),
  economist = rnorm(10),
  lawyer = rnorm(10),
  ranger = rnorm(10),
  trader = rnorm(10))

df_2 <- colnames(df2)

df_1 <- as_tibble_col(colnames(df1), "col1") %>%
  mutate(col2 = map(col1, function(x) mutate(as_tibble_col(unique(unlist(str_split(x, ""))[-1]), "first_letters"),
                                             match = map(first_letters, ~df_2[which(str_detect(df_2, paste0("^", .)))])))) %>%
  unnest(everything()) %>%
  unnest(everything()) %>%
  mutate(cos_sim = stringsim(col1, match, "cosine"))
# # A tibble: 9 × 4
# col1 first_letters match cos_sim
# <chr> <chr> <chr> <dbl>
# 1 abel b barista 0.5
# 2 abel b beekeeper 0.557
# 3 abel e economist 0.151
# 4 abel l lawyer 0.612
# 5 abby b barista 0.544
# 6 abby b beekeeper 0.152
# 7 bret r ranger 0.530
# 8 bret e economist 0.302
# 9 bret t trader 0.707
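For reference, stringsim(..., "cosine") with its default q = 1 is just cosine similarity over single-character counts, which can be sketched in a few lines of Python (the pair values below match the table above, e.g. bret/trader ≈ 0.707):

```python
from collections import Counter
from math import sqrt

def char_cosine(s1, s2):
    """Cosine similarity over single-character counts,
    equivalent to stringdist's "cosine" method with q = 1."""
    c1, c2 = Counter(s1), Counter(s2)
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

print(round(char_cosine("bret", "trader"), 3))   # 0.707
print(round(char_cosine("abel", "lawyer"), 3))   # 0.612
```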
How can I calculate cosine similarity between two string vectors
You can use the lsa package; see the package manual for details:
# create some files
library('lsa')
td = tempfile()
dir.create(td)
write( c("HDa","2Pb","2","BxU","BuQ","Bve"), file=paste(td, "D1", sep="/"))
write( c("HCK","2Pb","2","09","F","G"), file=paste(td, "D2", sep="/"))
# read files into a document-term matrix
myMatrix = textmatrix(td, minWordLength=1)
EDIT: this is what the myMatrix object looks like:
myMatrix
#myMatrix
# docs
# terms D1 D2
# 2 1 1
# 2pb 1 1
# buq 1 0
# bve 1 0
# bxu 1 0
# hda 1 0
# 09 0 1
# f 0 1
# g 0 1
# hck 0 1
# Calculate cosine similarity
res <- lsa::cosine(myMatrix[,1], myMatrix[,2])
res
#0.3333
To find cosine similarity between two string(names)
As mentioned in the other answer, the cosine similarity is one because the two strings have the exact same representation.
That means that this code:
tfidf_vectorizer=TfidfVectorizer()
tfidf_matrix=tfidf_vectorizer.fit_transform(documents)
produces, well:
print(tfidf_matrix.toarray())
[[ 1.]
[ 1.]]
This means that the two strings/documents (here the rows in the array) have the same representation.
That is because the TfidfVectorizer
tokenizes your document using word tokens, and keeps only words with at least 2 characters.
So you could do one of the following:
Use:
tfidf_vectorizer=TfidfVectorizer(analyzer="char")
to get character n-grams instead of word n-grams.
Change the token pattern so that it keeps one-letter tokens:
tfidf_vectorizer=TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
This is just a simple modification of the default pattern shown in the documentation. Note that the \b occurrences in the regular expression have to be escaped (or, as here, a raw string used), otherwise you get an 'empty vocabulary' error.
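A minimal sketch of both options, using a hypothetical pair of strings ("john x" / "john y") that the default tokenizer collapses to the same single token:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# "x" and "y" are one-letter tokens, so the default tokenizer drops
# them and both strings reduce to the single token "john".
documents = ["john x", "john y"]

word_sim = cosine_similarity(TfidfVectorizer().fit_transform(documents))
char_sim = cosine_similarity(
    TfidfVectorizer(analyzer="char").fit_transform(documents))
pat_sim = cosine_similarity(
    TfidfVectorizer(token_pattern=r'(?u)\b\w+\b').fit_transform(documents))

print(word_sim[0, 1])  # 1.0 - identical word-level representations
print(char_sim[0, 1])  # below 1.0 - character counts differ
print(pat_sim[0, 1])   # below 1.0 - the one-letter tokens are kept
```

Either change breaks the degenerate "everything is identical" case, because the differing parts of the strings now survive vectorization.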
Hope this helps.
How to calculate the cosine similarity of two string list by sklearn?
It seems cosine_similarity needs:
- word vectors,
- two-dimensional data (a list with many word vectors):
print(cosine_similarity( [a_vect], [b_vect] ))
Full working code:
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']
# count word occurrences
a_vals = Counter(a_file)
b_vals = Counter(b_file)
# convert to word-vectors
words = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words]
b_vect = [b_vals.get(word, 0) for word in words]
# find cosine
len_a = sum(av*av for av in a_vect) ** 0.5
len_b = sum(bv*bv for bv in b_vect) ** 0.5
dot = sum(av*bv for av,bv in zip(a_vect, b_vect))
cosine = dot / (len_a * len_b)
print(cosine)
print(cosine_similarity([a_vect], [b_vect]))
Result:
0.2886751345948129
[[0.28867513]]
EDIT: You can also use one list with all the data (so the second argument will be None) and it will compare all pairs: (a,a), (a,b), (b,a), (b,b).
print(cosine_similarity( [a_vect, b_vect] ))
Result:
[[1. 0.28867513]
[0.28867513 1. ]]
You can use a longer list [a, b, c, ...] and it will check all possible pairs.
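For instance, with three hypothetical count vectors the result is a full 3x3 similarity matrix, symmetric with ones on the diagonal:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Three toy count vectors over a shared three-word vocabulary.
vectors = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]

sim = cosine_similarity(vectors)
print(sim.shape)  # (3, 3): one row and one column per input vector
print(sim)        # diagonal is 1.0; each off-diagonal pair is 0.5 here
```

Entry sim[i][j] is the cosine similarity between vectors[i] and vectors[j], so only the upper (or lower) triangle carries new information.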
Documentation: cosine_similarity