R: Calculate cosine distance from a term-document matrix with tm and proxy
Since tm's term-document matrices are just sparse "simple triplet matrices" from the slam package, you can use the functions there to calculate the distances directly from the definition of cosine similarity:
library(slam)
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
This takes advantage of sparse matrix multiplication. In my hands, a tdm with 2963 terms in 220 documents and 97% sparsity took barely a couple of seconds.
I haven't profiled this, so I have no idea if it's any faster than proxy::dist().
NOTE: for this to work, you should not coerce the tdm into a regular matrix, i.e., don't do tdm <- as.matrix(tdm).
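As a sanity check on the formula, here is a minimal dense sketch in base R. It assumes a small made-up term-document matrix called `toy` (terms in rows, documents in columns; the counts are illustrative only) and mirrors the slam one-liner with base `crossprod()` and `colSums()`. That is only sensible for matrices small enough to hold densely.

```r
# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
# The counts are made up for illustration.
toy <- matrix(c(1, 0, 2, 0,
                2, 0, 4, 0,
                0, 3, 0, 1),
              nrow = 4,
              dimnames = list(paste0("term", 1:4), paste0("doc", 1:3)))

# Dot products between every pair of document columns.
dots <- crossprod(toy)

# Euclidean (L2) norm of each document column.
norms <- sqrt(colSums(toy^2))

# Cosine distance = 1 - dot product / (product of the two norms).
cosine_dist_toy <- 1 - dots / outer(norms, norms)
```

doc1 and doc2 point in the same direction, so their distance is 0 up to floating-point error; doc1 and doc3 share no terms, so their distance is 1.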
Calculate Cosine Similarity between two documents in TermDocumentMatrix of tm Package in R
A 120,000 x 120,000 matrix at 8 bytes per element (double-precision float) takes 115.2 gigabytes. This isn't necessarily beyond the capability of R, but you do need at least that much memory, regardless of what language you use. Realistically, you'll probably want to write to disk, either using an SQL database (e.g. via the RSQLite package) or, if you plan to use only R in your analysis, the "ff" package for storing and accessing large matrices on disk.
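The arithmetic can be reproduced in a couple of lines. This is just a back-of-the-envelope sketch in decimal gigabytes; the variable names are mine. The second figure assumes you store only the lower triangle of the symmetric result, which is what a "dist" object does.

```r
# Size of a dense double-precision n x n matrix, in decimal gigabytes.
n <- 120000
bytes_per_double <- 8
full_gb <- n^2 * bytes_per_double / 1e9

# A symmetric distance matrix only needs its lower triangle,
# which roughly halves the cost.
lower_tri_gb <- n * (n - 1) / 2 * bytes_per_double / 1e9

c(full = full_gb, lower_triangle = lower_tri_gb)
```

The full matrix comes to 115.2 GB and the lower triangle to about 57.6 GB, so halving the storage still leaves a very large object.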
You could do this iteratively and multithread it to improve the speed of calculation.
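A rough sketch of the iterative idea, in base R on a small dense matrix with one document per row. `cosine_sim_by_block`, `block_size` and `on_block` are hypothetical names, not part of any package, and the example collects the blocks in memory only so the result can be inspected.

```r
# Hypothetical helper: cosine similarity computed one block of rows at a time.
# `m` has one document per row; `on_block` receives each finished block.
cosine_sim_by_block <- function(m, block_size, on_block) {
  # Normalise every row to unit L2 length once, up front.
  m_norm <- m / sqrt(rowSums(m^2))
  starts <- seq(1, nrow(m), by = block_size)
  for (s in starts) {
    rows <- s:min(s + block_size - 1, nrow(m))
    # Similarities of this block of rows against all documents.
    block <- tcrossprod(m_norm[rows, , drop = FALSE], m_norm)
    on_block(rows, block)
  }
  invisible(NULL)
}

# Small made-up example: 5 documents, 4 terms, no empty documents.
set.seed(42)
m <- matrix(rpois(20, 3) + 1, nrow = 5)

# For illustration the blocks are collected in memory; a real run would
# write each block to disk (e.g. a database or an ff object) instead.
blocks <- list()
cosine_sim_by_block(m, block_size = 2, function(rows, block) {
  blocks[[length(blocks) + 1]] <<- block
})
sim_blocked <- do.call(rbind, blocks)
```

Only one block of `block_size` rows is held at a time inside the loop, and because the blocks are independent of each other, the loop body is also the natural unit to hand to parallel workers.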
To find the distance between two docs, you can do something like this:
proxy::dist(t(as.matrix(tdm[, 1])), t(as.matrix(tdm[, 2])), method = "cosine")
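If you only need a single pair, the definition is short enough to write out by hand. A minimal sketch in base R, assuming cosine distance is taken as one minus cosine similarity; `cosine_dist_vec` and the two count vectors are made up for illustration.

```r
# Cosine distance between two numeric vectors of equal length.
cosine_dist_vec <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Made-up term counts for two documents over the same five terms.
doc_a <- c(2, 0, 1, 0, 3)
doc_b <- c(1, 1, 0, 0, 3)

cosine_dist_vec(doc_a, doc_b)
```

With a real term-document matrix you would pass two dense columns, e.g. the result of as.matrix() on a one-column subset.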
Cosine similarity of 2 DTMs in R
Here is a way to calculate the cosine distance between two matrices. The use of tm is just for data purposes...
library(slam)
library(tm)
data("acq")
data("crude")
dtm <- DocumentTermMatrix(c(acq, crude))
index <- sample(1:70, size = 10)
dtm1 <- dtm[index, ]
dtm2 <- dtm[-index, ]
cosine_sim <- tcrossprod_simple_triplet_matrix(dtm1, dtm2)/sqrt(row_sums(dtm1^2) %*% t(row_sums(dtm2^2)))
The cosine function was adapted from this SO post: R: Calculate cosine distance from a term-document matrix with tm and proxy
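To see why the denominator has the right shape, here is the same calculation on two tiny dense matrices in base R. This is only an illustrative sketch: `a` and `b` are made-up data, and `outer()` stands in for the `%*% t(...)` construction used above.

```r
# Two made-up document-term matrices sharing the same 4 term columns.
set.seed(1)
a <- matrix(rpois(12, 2) + 1, nrow = 3)  # 3 documents
b <- matrix(rpois(8, 2) + 1, nrow = 2)   # 2 documents

# Dot products of every row of `a` with every row of `b` (3 x 2).
dots <- tcrossprod(a, b)

# Outer product of the row norms gives the matching 3 x 2 denominator.
cross_sim <- dots / outer(sqrt(rowSums(a^2)), sqrt(rowSums(b^2)))

# Entry [i, j] is the cosine similarity of a[i, ] and b[j, ].
```

The result has one row per document in the first matrix and one column per document in the second, which is the same layout as cosine_sim above.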
cosine similarity between 2 document term matrix
Disregarding all the tm stuff, as it seems to be beside the point, proxy::dist() has the argument pairwise, which lets you do what you want.
set.seed(1)
N <- 6*8
m <- matrix(sample(c(0, 1, 1), N, rep=TRUE)*rpois(N, 6), 6)
dimnames(m) <- list(c(paste0("ID", 1:3, "_2000"), paste0("ID", 1:3, "_2001")),
sample(LETTERS, ncol(m)))
library(proxy)
proxy::dist(m[1:3,], m[4:6,], pairwise=TRUE, method="cosine")
# 0.6160563 0.2746764 0.2038266
# Which is the same as
diag(proxy::dist(m[1:3,], m[4:6,], method="cosine"))
# 0.6160563 0.2746764 0.2038266
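The equivalence between the pairwise result and the diagonal of the full cross-distance matrix can also be checked without proxy. A small base R sketch, assuming cosine distance is one minus cosine similarity; `x` and `y` are made-up matrices whose rows are matched one to one.

```r
# Two made-up matrices with matching rows (e.g. the same IDs in two years).
x <- matrix(c(1, 0, 2,
              0, 3, 1,
              4, 1, 0), nrow = 3, byrow = TRUE)
y <- matrix(c(2, 0, 4,
              1, 0, 0,
              0, 1, 4), nrow = 3, byrow = TRUE)

# Row-by-row cosine distance: only the matched pairs are computed.
row_dist <- 1 - rowSums(x * y) / (sqrt(rowSums(x^2)) * sqrt(rowSums(y^2)))

# Full cross-distance matrix; its diagonal holds the same matched pairs.
cross_dist <- 1 - tcrossprod(x, y) /
  outer(sqrt(rowSums(x^2)), sqrt(rowSums(y^2)))
```

The row-by-row version does work proportional to the number of rows, while the full matrix does work proportional to its square, which is why the pairwise form is preferable when only matched pairs are needed.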
Distance matrix calculation taking too long in R
Cosine similarity is simply the dot product of two L2-normalized matrices. In your case it is even simpler: the product of the L2-normalized dtm with its own transpose. Here is a reproducible example using the Matrix and text2vec packages:
library(text2vec)
library(Matrix)
cosine <- function(m) {
m_normalized <- m / sqrt(rowSums(m ^ 2))
tcrossprod(m_normalized)
}
data("movie_review")
data = rep(movie_review$review, 3)
it = itoken(data, tolower, word_tokenizer)
v = create_vocabulary(it) %>%
prune_vocabulary(term_count_min = 5)
vectorizer = vocab_vectorizer(v)
it = itoken(data, tolower, word_tokenizer)
dtm = create_dtm(it, vectorizer)
dim(dtm)
# 15000 24548
system.time( dtm_cos <- cosine(dtm) )
# user system elapsed
# 41.914 6.963 50.761
dim(dtm_cos)
# 15000 15000
EDIT:
For the tm package, see this question: R: Calculate cosine distance from a term-document matrix with tm and proxy
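One thing to watch with the cosine() helper above: it divides by the row norms, so a document with no terms left after pruning would give a zero norm and NaN similarities. Below is a hedged variant using only the Matrix package. `cosine_sparse` is a hypothetical name, and treating an empty document as having similarity 0 with everything is my assumption, not something the original answer specifies.

```r
library(Matrix)

# Hypothetical variant of a sparse cosine helper that tolerates empty rows.
cosine_sparse <- function(m) {
  norms <- sqrt(rowSums(m^2))
  # Assumption: an all-zero document gets similarity 0 with everything,
  # so its norm is replaced by 1 to avoid dividing by zero.
  norms[norms == 0] <- 1
  # Scaling rows with a diagonal matrix keeps the result sparse.
  m_normalized <- Diagonal(x = 1 / norms) %*% m
  tcrossprod(m_normalized)
}

# Made-up 3-document x 4-term sparse matrix; document 3 is empty.
dtm_toy <- sparseMatrix(i = c(1, 1, 2, 2), j = c(1, 3, 1, 3),
                        x = c(1, 2, 2, 4), dims = c(3, 4))
sim_toy <- cosine_sparse(dtm_toy)
```

Documents 1 and 2 are parallel, so their similarity is 1, and every entry involving the empty document 3 is 0 rather than NaN.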