How to compute the similarity between two text documents?
The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See especially Introduction to Information Retrieval (Manning, Raghavan, and Schütze), which is freely available online.
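As a minimal sketch of what "cosine similarity" means here, the snippet below computes it by hand with NumPy. The helper name `cosine` and the toy three-term vectors are illustrative assumptions, not part of any library.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors (0.0 if either is all zeros)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Toy term-weight vectors over a hypothetical 3-word vocabulary.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, just longer
doc_c = [0.0, 0.0, 3.0]  # shares no terms with doc_a

print(cosine(doc_a, doc_b))  # close to 1: identical term proportions
print(cosine(doc_a, doc_c))  # 0: no overlap
```

Because only the direction of the vectors matters, a long document and a short one with the same term proportions score as highly similar.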
Computing Pairwise Similarities
TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as:
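from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
# text_files: a list of paths to the documents (defined elsewhere)
documents = [Path(f).read_text() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize: TfidfVectorizer returns L2-normalized rows by default,
# so the dot product of two rows is already their cosine similarity
pairwise_similarity = tfidf * tfidf.T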
or, if the documents are plain strings,
>>> corpus = ["I'd like an apple",
... "An apple a day keeps the doctor away",
... "Never compare an apple to an orange",
... "I prefer scikit-learn to Orange",
... "The scikit-learn docs are Orange and Blue"]
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")
>>> tfidf = vect.fit_transform(corpus)
>>> pairwise_similarity = tfidf * tfidf.T
though Gensim may have more options for this kind of task.
See also this question.
[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]
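The claim that `tfidf * tfidf.T` already yields cosine similarities can be checked against scikit-learn's explicit `cosine_similarity` function. This is only a sanity-check sketch on the five-sentence corpus from above; it relies on the vectorizer's default `norm="l2"` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

# With the default norm="l2", every non-empty row has unit length.
row_norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1))).ravel()

# Plain matrix product versus the explicit cosine computation.
product = (tfidf @ tfidf.T).toarray()
explicit = cosine_similarity(tfidf)

print(np.allclose(row_norms, 1.0))
print(np.allclose(product, explicit))
```

If the vectorizer were created with `norm=None`, the product would be a matrix of raw dot products instead, and `cosine_similarity` would be the safer choice.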
Interpreting the Results
From above, pairwise_similarity is a SciPy sparse matrix that is square in shape, with the number of rows and columns equal to the number of documents in the corpus.
>>> pairwise_similarity
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 17 stored elements in Compressed Sparse Row format>
You can convert the sparse matrix to a NumPy array via .toarray() (or the .A shorthand on older SciPy versions):
>>> pairwise_similarity.toarray()
array([[1. , 0.17668795, 0.27056873, 0. , 0. ],
[0.17668795, 1. , 0.15439436, 0. , 0. ],
[0.27056873, 0.15439436, 1. , 0.19635649, 0.16815247],
[0. , 0. , 0.19635649, 1. , 0.54499756],
[0. , 0. , 0.16815247, 0.54499756, 1. ]])
Let's say we want to find the document most similar to the final document, "The scikit-learn docs are Orange and Blue". This document has index 4 in corpus. You can find the index of the most similar document by taking the argmax of that row, but first you'll need to mask the 1's, which represent the similarity of each document to itself. You can do the latter through np.fill_diagonal(), and the former through np.nanargmax():
>>> import numpy as np
>>> arr = pairwise_similarity.toarray()
>>> np.fill_diagonal(arr, np.nan)
>>> input_doc = "The scikit-learn docs are Orange and Blue"
>>> input_idx = corpus.index(input_doc)
>>> input_idx
4
>>> result_idx = np.nanargmax(arr[input_idx])
>>> corpus[result_idx]
'I prefer scikit-learn to Orange'
Note: the purpose of using a sparse matrix is to save a substantial amount of space for a large corpus and vocabulary. Instead of converting to a NumPy array, you could do:
>>> n, _ = pairwise_similarity.shape
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()
3
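The same idea extends from a single best match to a ranked list. The sketch below is one possible approach, not something from the original answer: the helper `top_k_similar` is a hypothetical name, and it densifies only the one row of interest so that the full matrix stays sparse.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

def top_k_similar(tfidf, idx, k=2):
    """Return (index, score) pairs for the k rows most similar to row idx."""
    # Similarity of one document against all others; only this row is densified.
    scores = (tfidf[idx] @ tfidf.T).toarray().ravel()
    scores[idx] = -np.inf  # exclude the document itself
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

for i, score in top_k_similar(tfidf, idx=4, k=2):
    print(f"{score:.3f}  {corpus[i]}")
```

For very large corpora, `np.argpartition` would avoid the full sort, at the cost of slightly more code.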
Ways of obtaining a similarity metric between two full text documents?
There is no simple answer to this question, as a given similarity measure will perform better or worse depending on the particular task you want to perform.
Having said that, you do have a couple of options for comparing blocks of text. This post compares and ranks several different ways of computing sentence similarity, which you can then aggregate to obtain full-document similarity. How to aggregate will also depend on your particular task. A simple but often well-performing approach is to compute the average of the sentence similarities between the two (or more) documents.
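One way the averaging approach could look is sketched below. Several pieces are assumptions made for illustration: the regex sentence splitter is deliberately naive, TF-IDF cosine similarity stands in for whichever sentence-similarity measure you actually choose, and the function names are hypothetical.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (a real sentence tokenizer is better)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def mean_sentence_similarity(doc_a, doc_b):
    """Average TF-IDF cosine similarity over all cross-document sentence pairs."""
    sents_a, sents_b = split_sentences(doc_a), split_sentences(doc_b)
    # Fit one vocabulary over both documents so the vectors are comparable.
    vect = TfidfVectorizer().fit(sents_a + sents_b)
    a, b = vect.transform(sents_a), vect.transform(sents_b)
    cross = (a @ b.T).toarray()  # shape: (len(sents_a), len(sents_b))
    return float(cross.mean())

doc_1 = "Apples are sweet. Oranges are sour."
doc_2 = "Apples are sweet. Bananas are yellow."
doc_3 = "The train was late. Nobody complained."

print(mean_sentence_similarity(doc_1, doc_2))  # shares a sentence with doc_1
print(mean_sentence_similarity(doc_1, doc_3))  # no shared vocabulary
```

The sketch does not handle documents that contain no sentences at all. Taking the maximum per sentence before averaging is a common variant when the documents differ a lot in length.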
Other useful links on this topic include:
- Introduction to Information Retrieval (free book)
- Doc2Vec (from gensim, for paragraph embeddings, which is probably very suitable for your case)
Similarity between two text documents in Python
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(min_df=1)
tfidf = vect.fit_transform(["My name is Ankit",
                            "Ankit name is very famous",
                            "Ankit like his name",
                            "India has a lot of beautiful cities"])
# rows are L2-normalized, so this product is the pairwise cosine similarity
print((tfidf * tfidf.T).toarray())
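To read the printed matrix programmatically, one option is to look only at the entries above the diagonal and pick the largest. This is a small sketch under the same setup as the snippet above; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["My name is Ankit",
        "Ankit name is very famous",
        "Ankit like his name",
        "India has a lot of beautiful cities"]

tfidf = TfidfVectorizer().fit_transform(docs)
sim = (tfidf @ tfidf.T).toarray()

# Indices of the strict upper triangle: each unordered pair once, no self-pairs.
rows, cols = np.triu_indices(len(docs), k=1)
best = int(np.argmax(sim[rows, cols]))
i, j = int(rows[best]), int(cols[best])

print(f"Most similar pair: {docs[i]!r} and {docs[j]!r}")
```

The last document shares no words with the other three, so its similarity to each of them is zero.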