Calculate cosine similarity given 2 sentence strings
A simple pure-Python implementation would be:
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)
text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."
vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
print("Cosine:", cosine)
Prints:
Cosine: 0.861640436855
The cosine formula used here is the standard one: the dot product of the two term-count vectors divided by the product of their magnitudes.
This does not include weighting of the words by tf-idf, but in order to use tf-idf, you need a reasonably large corpus from which to estimate the tf-idf weights.
You can also develop it further, by using a more sophisticated way to extract words from a piece of text, stem or lemmatise it, etc.
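As a minimal sketch of the tf-idf route, scikit-learn's TfidfVectorizer combines tokenization and tf-idf weighting in one step (keeping in mind that a two-document corpus is far too small to estimate meaningful idf weights; in practice you would fit on a larger corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "This is a foo bar sentence .",
    "This sentence is similar to a foo bar sentence .",
]

# Fit the vectorizer on the corpus and transform both sentences into
# tf-idf weighted vectors; cosine_similarity accepts the sparse rows.
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(corpus)

print(cosine_similarity(matrix[0], matrix[1]))
```

Because "similar" and "to" appear in only one sentence, the tf-idf vectors differ and the similarity lands strictly between 0 and 1.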
Computing cosine similarity between specific strings using dplyr
My solution gathers all the candidate combinations and then calculates the similarity. Try this:
library(tidyverse)
library(stringdist)

df1 <- tibble(
  abel = rnorm(10),
  abby = rnorm(10),
  bret = rnorm(10))

df2 <- tibble(
  barista = rnorm(10),
  beekeeper = rnorm(10),
  economist = rnorm(10),
  lawyer = rnorm(10),
  ranger = rnorm(10),
  trader = rnorm(10))

df_2 <- colnames(df2)

df_1 <- as_tibble_col(colnames(df1), "col1") %>%
  mutate(col2 = map(col1, function(x) mutate(as_tibble_col(unique(unlist(str_split(x, ""))[-1]), "first_letters"),
                                             match = map(first_letters, ~df_2[which(str_detect(df_2, paste0("^", .)))])))) %>%
  unnest(everything()) %>%
  unnest(everything()) %>%
  mutate(cos_sim = stringsim(col1, match, "cosine"))
# # A tibble: 9 × 4
# col1 first_letters match cos_sim
# <chr> <chr> <chr> <dbl>
# 1 abel b barista 0.5
# 2 abel b beekeeper 0.557
# 3 abel e economist 0.151
# 4 abel l lawyer 0.612
# 5 abby b barista 0.544
# 6 abby b beekeeper 0.152
# 7 bret r ranger 0.530
# 8 bret e economist 0.302
# 9 bret t trader 0.707
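For reference, stringsim(..., "cosine") with its default q = 1 is just cosine similarity over single-character counts, which can be sketched in a few lines of Python (the pair values below match the table above, e.g. bret/trader ≈ 0.707):

```python
from collections import Counter
from math import sqrt

def char_cosine(s1, s2):
    """Cosine similarity over single-character counts,
    equivalent to stringdist's "cosine" method with q = 1."""
    c1, c2 = Counter(s1), Counter(s2)
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

print(round(char_cosine("bret", "trader"), 3))   # 0.707
print(round(char_cosine("abel", "lawyer"), 3))   # 0.612
```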
How can I calculate cosine similarity between two string vectors
You can use the lsa package; see the package manual for details:
# create some files
library('lsa')
td = tempfile()
dir.create(td)
write( c("HDa","2Pb","2","BxU","BuQ","Bve"), file=paste(td, "D1", sep="/"))
write( c("HCK","2Pb","2","09","F","G"), file=paste(td, "D2", sep="/"))
# read files into a document-term matrix
myMatrix = textmatrix(td, minWordLength=1)
EDIT: this is what the myMatrix object looks like:
myMatrix
#myMatrix
# docs
# terms D1 D2
# 2 1 1
# 2pb 1 1
# buq 1 0
# bve 1 0
# bxu 1 0
# hda 1 0
# 09 0 1
# f 0 1
# g 0 1
# hck 0 1
# Calculate cosine similarity
res <- lsa::cosine(myMatrix[,1], myMatrix[,2])
res
#0.3333
To find cosine similarity between two string(names)
As mentioned in the other answer, the cosine similarity is one because the two strings have the exact same representation.
That means that this code:
tfidf_vectorizer=TfidfVectorizer()
tfidf_matrix=tfidf_vectorizer.fit_transform(documents)
produces, well:
print(tfidf_matrix.toarray())
[[ 1.]
[ 1.]]
This means that the two strings/documents (here the rows in the array) have the same representation.
That is because the TfidfVectorizer
tokenizes your document using word tokens, and keeps only words with at least 2 characters.
So you could do one of the following:
Use:
tfidf_vectorizer=TfidfVectorizer(analyzer="char")
to get character n-grams instead of word n-grams.
Change the token pattern so that it keeps one-letter tokens:
tfidf_vectorizer=TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
This is just a simple modification of the default pattern shown in the documentation. Note that the \b occurrences in the regular expression have to be escaped (or, as here, a raw string used), otherwise you get an 'empty vocabulary' error.
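A minimal sketch of both options, using a hypothetical pair of strings ("john x" / "john y") that the default tokenizer collapses to the same single token:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# "x" and "y" are one-letter tokens, so the default tokenizer drops
# them and both strings reduce to the single token "john".
documents = ["john x", "john y"]

word_sim = cosine_similarity(TfidfVectorizer().fit_transform(documents))
char_sim = cosine_similarity(
    TfidfVectorizer(analyzer="char").fit_transform(documents))
pat_sim = cosine_similarity(
    TfidfVectorizer(token_pattern=r'(?u)\b\w+\b').fit_transform(documents))

print(word_sim[0, 1])  # 1.0 - identical word-level representations
print(char_sim[0, 1])  # below 1.0 - character counts differ
print(pat_sim[0, 1])   # below 1.0 - the one-letter tokens are kept
```

Either change breaks the degenerate "everything is identical" case, because the differing parts of the strings now survive vectorization.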
Hope this helps.
How to calculate the cosine similarity of two string list by sklearn?
It seems cosine_similarity needs:
- word vectors,
- two-dimensional data (a list with many word vectors):
print(cosine_similarity( [a_vect], [b_vect] ))
Full working code:
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']
# count word occurrences
a_vals = Counter(a_file)
b_vals = Counter(b_file)
# convert to word-vectors
words = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words]
b_vect = [b_vals.get(word, 0) for word in words]
# find cosine
len_a = sum(av*av for av in a_vect) ** 0.5
len_b = sum(bv*bv for bv in b_vect) ** 0.5
dot = sum(av*bv for av,bv in zip(a_vect, b_vect))
cosine = dot / (len_a * len_b)
print(cosine)
print(cosine_similarity([a_vect], [b_vect]))
Result:
0.2886751345948129
[[0.28867513]]
EDIT: You can also use one list with all the data (so the second argument will be None) and it will compare all pairs: (a,a), (a,b), (b,a), (b,b).
print(cosine_similarity( [a_vect, b_vect] ))
Result:
[[1. 0.28867513]
[0.28867513 1. ]]
You can use a longer list [a, b, c, ...] and it will check all possible pairs.
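For instance, with three hypothetical count vectors the result is a full 3x3 similarity matrix, symmetric with ones on the diagonal:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Three toy count vectors over a shared three-word vocabulary.
vectors = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]

sim = cosine_similarity(vectors)
print(sim.shape)  # (3, 3): one row and one column per input vector
print(sim)        # diagonal is 1.0; each off-diagonal pair is 0.5 here
```

Entry sim[i][j] is the cosine similarity between vectors[i] and vectors[j], so only the upper (or lower) triangle carries new information.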
Documentation: cosine_similarity