Python: tf-idf-cosine: to find document similarity
With the help of @excray's comment, I managed to figure out the answer. What we need to do is write a simple for loop to iterate over the two arrays that represent the train data and the test data.
First implement a simple lambda function to hold formula for the cosine calculation:
cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
Then write a simple for loop to iterate over the two arrays. The logic is: for each vector in trainVectorizerArray, find the cosine similarity with each vector in testVectorizerArray.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA
train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query
stopWords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
print('Transform Vectorizer to test set', testVectorizerArray)
cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
# Compare every train vector with every test vector.
for vector in trainVectorizerArray:
    print(vector)
    for testV in testVectorizerArray:
        print(testV)
        cosine = cx(vector, testV)
        print(cosine)
transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())
transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())
Here is the output:
Fit Vectorizer to train set [[1 0 1 0]
[0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]
[1 0 1 0]
[0 1 1 1]
0.408
[0 1 0 1]
[0 1 1 1]
0.816
[[ 0.70710678 0. 0.70710678 0. ]
[ 0. 0.70710678 0. 0.70710678]]
[[ 0. 0.57735027 0.57735027 0.57735027]]
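As a quick check of the numbers above, here is a minimal Python 3 sketch that recomputes the two similarities from the printed count vectors using only NumPy. The helper name `cosine` and the zero-vector guard are additions for illustration, not part of the original answer; the guard matters because the plain formula divides by zero when a document contains only stop words.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D vectors, or 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# Count vectors copied from the output above.
train_vectors = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])
test_vector = np.array([0, 1, 1, 1])

similarities = [round(cosine(row, test_vector), 3) for row in train_vectors]
print(similarities)  # [0.408, 0.816]
```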
Using sklearn how do I calculate the tf-idf cosine similarity between documents and a query?
Here is my suggestion:
- We don't have to fit the model twice; we can reuse the same vectorizer.
- A text-cleaning function can be plugged into TfidfVectorizer directly through its preprocessor parameter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
vectorizer = TfidfVectorizer(preprocessor=nlp.clean_tf_idf_text)
docs_tfidf = vectorizer.fit_transform(allDocs)
def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: TfidfVectorizer model
    docs_tfidf: tfidf vectors for all docs
    query: query doc
    return: cosine similarity between query and all docs
    """
    query_tfidf = vectorizer.transform([query])
    cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    return cosineSimilarities
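For a self-contained run of the same idea, the sketch below fits one vectorizer on a toy corpus and reuses it for the query. The names `clean_text` and `all_docs` are hypothetical stand-ins (the original `nlp.clean_tf_idf_text` and `allDocs` are not shown). Note that a custom preprocessor replaces the built-in lowercasing step, so the cleaner lowercases the text itself.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean_text(text):
    # Hypothetical cleaner: lowercase, then turn non-alphanumerics into spaces.
    return re.sub(r"[^a-z0-9]+", " ", text.lower())

# Toy corpus standing in for allDocs.
all_docs = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
]

vectorizer = TfidfVectorizer(preprocessor=clean_text)
docs_tfidf = vectorizer.fit_transform(all_docs)      # fit once, on the corpus

query_tfidf = vectorizer.transform(["Bright SUN!"])  # reuse the fitted vocabulary
similarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()

best = int(similarities.argmax())
print(best, all_docs[best])
```

The first document shares no term with the query, so its similarity is zero; the second document wins because it is shorter than the third while containing both query terms.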
TF-IDF Find Cosine Similarity Between New Document and Dataset
I have found a way for it to work. Instead of calling fit_transform on the new document, you first need to fit the vectorizer on the corpus, so that the query is later mapped onto the same vocabulary as the corpus TF-IDF matrix:
queryTFIDF = TfidfVectorizer().fit(words)
Now we can transform the query into a vector with the same columns as that matrix by using the transform function:
queryTFIDF = queryTFIDF.transform([query])
Where query is the query string.
We can then find cosine similarities and find the 10 most similar/relevant documents:
cosine_similarities = cosine_similarity(queryTFIDF, datasetTFIDF).flatten()
related_product_indices = cosine_similarities.argsort()[:-11:-1]
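The slice `[:-11:-1]` walks the sorted indices backwards and stops after ten of them. A small sketch of the same pattern for an arbitrary k, using made-up scores (the helper name `top_k_indices` is hypothetical):

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, best first (fewer if len(scores) < k)."""
    k = max(0, min(k, len(scores)))
    # argsort is ascending, so read it from the end and stop after k entries.
    return np.argsort(scores)[:-(k + 1):-1]

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])  # made-up similarity scores
top3 = top_k_indices(scores, 3)
print(top3)          # [1 3 4]
print(scores[top3])  # [0.9 0.7 0.5]
```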
TF-IDF/Cosine Similarity - Similarity Histogram
Looking at the histogram, it would seem that the document similarity is not that concentrated (for TF-IDF vectors, cosine similarity is bounded in [0, 1], and your histogram range is ~0.2-1). Whether this is good or bad depends on your expectation of the data, and on what you want to do with the TF-IDF matrix later on. If you have a diverse corpus (e.g. Wikipedia), then you would expect a wide range and be suspicious if you had a narrow range of cosine similarity scores. However, if your corpus is derived from a highly similar set of documents (e.g. book reports from a class of students), then a narrow range of high scores would be expected.
In general, the distribution of your similarity scores is more just an FYI than a measure of dataset quality.
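If you want to reproduce such a histogram from a similarity matrix, one possible sketch is shown below. The matrix is invented for illustration; only the upper triangle is used, so that each document pair is counted once and the self-similarities on the diagonal are left out.

```python
import numpy as np

# Hypothetical symmetric cosine-similarity matrix for four documents.
similarity = np.array([
    [1.0, 0.2, 0.5, 0.0],
    [0.2, 1.0, 0.3, 0.1],
    [0.5, 0.3, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
])

# Indices strictly above the diagonal: one entry per unordered document pair.
rows, cols = np.triu_indices_from(similarity, k=1)
pair_scores = similarity[rows, cols]

# Bin over the full [0, 1] range so that empty regions stay visible.
counts, bin_edges = np.histogram(pair_scores, bins=5, range=(0.0, 1.0))
print(counts)
print(bin_edges)
```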
How to compute the similarity between two text documents?
The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See esp. Introduction to Information Retrieval, which is free and available online.
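To make the two steps concrete, here is a from-scratch sketch in plain Python. It uses one common textbook weighting, raw term count times (log(N/df) + 1), which is an assumption made for illustration and not the exact formula of any particular library; tokenization is a naive whitespace split.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    return [
        {term: freq * idf[term] for term, freq in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = ["the sky is blue", "the sun is bright", "the sun in the sky is bright"]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])
sim_12 = cosine(vecs[1], vecs[2])
print(round(sim_01, 3), round(sim_12, 3))
```

The second and third documents share more (and rarer) terms than the first and second, so `sim_12` comes out larger than `sim_01`.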
Computing Pairwise Similarities
TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T
or, if the documents are plain strings,
>>> corpus = ["I'd like an apple",
... "An apple a day keeps the doctor away",
... "Never compare an apple to an orange",
... "I prefer scikit-learn to Orange",
... "The scikit-learn docs are Orange and Blue"]
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")
>>> tfidf = vect.fit_transform(corpus)
>>> pairwise_similarity = tfidf * tfidf.T
though Gensim may have more options for this kind of task.
See also this question.
[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]
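The reason the plain matrix product works is that TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows is already their cosine. A small sketch to verify this on the corpus above; it uses `@` instead of `*` as a precaution, since `*` means element-wise multiplication for SciPy's newer sparse array types.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

# Every row should have unit L2 norm (the default norm="l2").
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

product = (tfidf @ tfidf.T).toarray()  # plain matrix product
explicit = cosine_similarity(tfidf)    # explicit cosine similarity

print(np.allclose(row_norms, 1.0))
print(np.allclose(product, explicit))
```

If you construct the vectorizer with `norm=None`, the shortcut no longer holds and `cosine_similarity` is the safer choice.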
Interpreting the Results
From above, pairwise_similarity is a SciPy sparse matrix that is square in shape, with the number of rows and columns equal to the number of documents in the corpus.
>>> pairwise_similarity
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 17 stored elements in Compressed Sparse Row format>
You can convert the sparse array to a NumPy array via .toarray() or .A:
>>> pairwise_similarity.toarray()
array([[1. , 0.17668795, 0.27056873, 0. , 0. ],
[0.17668795, 1. , 0.15439436, 0. , 0. ],
[0.27056873, 0.15439436, 1. , 0.19635649, 0.16815247],
[0. , 0. , 0.19635649, 1. , 0.54499756],
[0. , 0. , 0.16815247, 0.54499756, 1. ]])
Let's say we want to find the document most similar to the final document, "The scikit-learn docs are Orange and Blue". This document has index 4 in corpus. You can find the index of the most similar document by taking the argmax of that row, but first you'll need to mask the 1's, which represent the similarity of each document to itself. You can do the latter through np.fill_diagonal(), and the former through np.nanargmax():
>>> import numpy as np
>>> arr = pairwise_similarity.toarray()
>>> np.fill_diagonal(arr, np.nan)
>>> input_doc = "The scikit-learn docs are Orange and Blue"
>>> input_idx = corpus.index(input_doc)
>>> input_idx
4
>>> result_idx = np.nanargmax(arr[input_idx])
>>> corpus[result_idx]
'I prefer scikit-learn to Orange'
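The same masking trick extends to all documents at once by passing `axis=1`. The sketch below copies the similarity values printed above into a plain NumPy array, so it runs without scikit-learn; the variable name `most_similar` is introduced here for illustration.

```python
import numpy as np

# Similarity values copied from the pairwise_similarity output above.
arr = np.array([
    [1.0,        0.17668795, 0.27056873, 0.0,        0.0],
    [0.17668795, 1.0,        0.15439436, 0.0,        0.0],
    [0.27056873, 0.15439436, 1.0,        0.19635649, 0.16815247],
    [0.0,        0.0,        0.19635649, 1.0,        0.54499756],
    [0.0,        0.0,        0.16815247, 0.54499756, 1.0],
])

np.fill_diagonal(arr, np.nan)              # mask each document's self-similarity
most_similar = np.nanargmax(arr, axis=1)   # best match for every row at once
print(most_similar)  # [2 0 0 4 3]
```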
Note: the purpose of using a sparse matrix is to save a substantial amount of space for a large corpus and vocabulary. Instead of converting to a NumPy array, you could do:
>>> n, _ = pairwise_similarity.shape
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()
3
(TF-IDF)How to return the five related article after calculating cosine similarity
First: if you want 5 articles then instead of [:-5:-1] you have to use [:-6:-1], because slicing with a negative stop value works a little differently: the stop index itself is excluded.
Or use [::-1][:5]: [::-1] reverses all values, and then you can use the normal [:5].
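A short sketch with a plain list shows the off-by-one (the values are arbitrary):

```python
scores = [10, 20, 30, 40, 50, 60, 70]

# The stop index -5 is excluded, so this yields only four items.
four_items = scores[:-5:-1]
print(four_items)        # [70, 60, 50, 40]

# Stopping one position earlier gives the intended five items.
five_items = scores[:-6:-1]
print(five_items)        # [70, 60, 50, 40, 30]

# Reversing first and then cutting is equivalent and easier to read.
reversed_then_cut = scores[::-1][:5]
print(reversed_then_cut == five_items)  # True
```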
When you have related_docs_indices, you can use .iloc[] to get the elements from the DataFrame:
sample_df.iloc[ related_docs_indices ]
If you have elements with the same similarity, this gives them in reversed order.
BTW: you can also add the similarities to the DataFrame
sample_df['similarity'] = cosine_similarities
and then sort (reversed) and get 5 items.
sample_df.sort_values('similarity', ascending=False)[:5]
If you have elements with the same similarity, this gives them in the original order.
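The difference in tie order can be reproduced with NumPy alone. The sketch below uses made-up scores and passes `kind="stable"` explicitly, which is an assumption added here so that the order of ties is guaranteed rather than left to the default sort algorithm.

```python
import numpy as np

# Made-up scores with a tie between positions 0 and 3.
similarities = np.array([0.6, 0.0, 0.0, 0.6, 0.2])
number = 2

# Ascending stable sort, then reversed: tied items come out in reversed order.
reversed_ties = np.argsort(similarities, kind="stable")[::-1][:number]
print(reversed_ties)  # [3 0]

# Stable sort on the negated scores: tied items keep their original order.
original_ties = np.argsort(-similarities, kind="stable")[:number]
print(original_ties)  # [0 3]
```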
Minimal working code with some data, so everyone can copy and test it.
Because I have only 5 elements in the DataFrame, I search for 2 elements.
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
sample_df = pd.DataFrame({
    'paper_id': [1, 2, 3, 4, 5],
    'title': ['Covid19', 'Flu', 'Cancer', 'Covid19 Again', 'New Air Conditioners'],
    'abstract': ['covid19', 'flu', 'cancer', 'covid19', 'air conditioner'],
    'body_text': ['Hello covid19', 'Hello flu', 'Hello cancer', 'Hello covid19 again', 'Buy new air conditioner'],
})
def get_cleaned_text(df, row):
    return row
txt_cleaned = get_cleaned_text(sample_df, sample_df['abstract'])
question = ['Can covid19 transmit through air']
tfidf_vector = TfidfVectorizer()
tfidf = tfidf_vector.fit_transform(txt_cleaned)
tfidf_question = tfidf_vector.transform(question)
cosine_similarities = linear_kernel(tfidf_question,tfidf).flatten()
sample_df['similarity'] = cosine_similarities
number = 2
#related_docs_indices = cosine_similarities.argsort()[:-(number+1):-1]
related_docs_indices = cosine_similarities.argsort()[::-1][:number]
print('index:', related_docs_indices)
print('similarity:', cosine_similarities[related_docs_indices])
print('\n--- related_docs_indices ---\n')
print(sample_df.iloc[related_docs_indices])
print('\n--- sort_values ---\n')
print( sample_df.sort_values('similarity', ascending=False)[:number] )
Result:
index: [3 0]
similarity: [0.62791376 0.62791376]
--- related_docs_indices ---
paper_id title abstract body_text similarity
3 4 Covid19 Again covid19 Hello covid19 again 0.627914
0 1 Covid19 covid19 Hello covid19 0.627914
--- sort_values ---
paper_id title abstract body_text similarity
0 1 Covid19 covid19 Hello covid19 0.627914
3 4 Covid19 Again covid19 Hello covid19 again 0.627914
Calculate cosine similarity of document relevance
You got the errors because you are attempting to forcibly convert an RDD into Vectors.
You can achieve what you need without doing the conversion by following these steps:
- Join both your RDDs into one RDD. Note that I am assuming you do not have a unique index in both RDDs for joining.
# Adding index to both RDDs by row.
rdd1 = normalizedtfidf.zipWithIndex().map(lambda arg : (arg[1], arg[0]))
rdd2 = keywordTF.zipWithIndex().map(lambda arg : (arg[1], arg[0]))
# Join both RDDs.
rdd_joined = rdd1.join(rdd2)
- Map the joined RDD with a function to calculate the cosine distance.
def cosine_dist(row):
    x = row[1][0]
    y = row[1][1]
    return (1 - x.dot(y)/(x.norm(2)*y.norm(2)))
res = rdd_joined.map(cosine_dist)
You can then use your results or run collect to see them.
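To sanity-check the distance formula without a Spark cluster, here is a local sketch in which plain NumPy arrays stand in for the Spark vectors and a Python list stands in for the joined RDD. This is only an illustration of the arithmetic, not Spark code; the tuples mimic the (index, (x, y)) shape that the join produces.

```python
import numpy as np

def cosine_dist(row):
    # row mimics one element of the joined RDD: (index, (x, y)).
    x, y = row[1]
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

rows = [
    (0, (np.array([1.0, 2.0]), np.array([2.0, 4.0]))),  # parallel vectors
    (1, (np.array([1.0, 0.0]), np.array([0.0, 3.0]))),  # orthogonal vectors
]

distances = [cosine_dist(row) for row in rows]
print(distances)  # close to 0 for the parallel pair, 1 for the orthogonal pair
```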