R: Calculate Cosine Distance from a Term-Document Matrix with Tm and Proxy

R: Calculate cosine distance from a term-document matrix with tm and proxy

Since tm's term document matrices are just sparse "simple triplet matrices" from the slam package, you could use the functions there to calculate the distances directly from the definition of cosine similarity:

library(slam)
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))

This takes advantage of sparse matrix multiplication. On my machine, a tdm with 2963 terms in 220 documents and 97% sparsity took only a couple of seconds.

I haven't profiled this, so I have no idea if it's any faster than proxy::dist().

NOTE: for this to work, do not coerce the tdm into a regular matrix, i.e., don't do tdm <- as.matrix(tdm).
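To convince yourself the formula is right, here is a minimal dense base-R sketch of the same computation on a made-up 4 x 3 matrix. The names toy, norms and cosine_dist_toy are hypothetical, and the dense form is meant only for checking small inputs, not for a real tdm:

```r
# Made-up term-document matrix: 4 terms (rows) x 3 documents (columns)
toy <- matrix(c(1, 0, 2,
                0, 3, 0,
                4, 0, 8,
                0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("term", 1:4), paste0("doc", 1:3)))

# Euclidean (L2) norm of each document column
norms <- sqrt(colSums(toy^2))

# crossprod(toy) is t(toy) %*% toy, i.e. all pairwise dot products between documents
cosine_dist_toy <- 1 - crossprod(toy) / outer(norms, norms)

round(cosine_dist_toy, 3)
```

doc1 and doc3 point in the same direction, so their distance is 0; doc2 shares no terms with either, so its distance to both is 1.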

Calculate Cosine Similarity between two documents in TermDocumentMatrix of tm Package in R

A 120,000 x 120,000 matrix * 8 bytes (double-precision float) = 115.2 gigabytes. This isn't necessarily beyond the capability of R, but you do need at least that much memory, regardless of what language you use. Realistically, you'll probably want to write to disk, either using an SQL database (e.g. via the RSQLite package) or, if you plan to use only R in your analysis, the "ff" package for storing and accessing large matrices on disk.
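The arithmetic is easy to reproduce for any corpus size. A rough sketch (n_docs stands for your own document count; this counts only the raw doubles and ignores R's object overhead and any temporary copies):

```r
n_docs <- 120000
bytes_per_double <- 8

# Full square matrix of doubles
full_gb <- n_docs^2 * bytes_per_double / 1e9
full_gb
# 115.2

# Only the lower triangle, which is what a "dist" object stores
tri_gb <- n_docs * (n_docs - 1) / 2 * bytes_per_double / 1e9
tri_gb
# 57.59952
```

Storing just one triangle roughly halves the requirement, which is still far more than most machines have in RAM.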

You could do this iteratively and multithread it to improve the speed of calculation.
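One way to read "iteratively" is to compute the similarity matrix one block of rows at a time, so that only one block needs to be in memory before it is written out. A minimal base-R sketch on made-up data (m, chunk_size and the other names are hypothetical; here the blocks are simply bound back together so the result can be checked):

```r
set.seed(42)
# Made-up document-term counts: 20 documents (rows) x 6 terms (columns)
m <- matrix(rpois(20 * 6, lambda = 2), nrow = 20)
m[, 1] <- m[, 1] + 1   # make sure no document is all zeros

# L2-normalize each row once, up front
m_norm <- m / sqrt(rowSums(m^2))

chunk_size <- 5
starts <- seq(1, nrow(m), by = chunk_size)

# Each block holds the similarities of one chunk of documents against all documents
blocks <- lapply(starts, function(s) {
  idx <- s:min(s + chunk_size - 1, nrow(m))
  tcrossprod(m_norm[idx, , drop = FALSE], m_norm)
})

# Bind the blocks back together only to verify against the one-shot result
sim_chunked <- do.call(rbind, blocks)
all.equal(sim_chunked, tcrossprod(m_norm))
```

In a real run you would write each block to disk (or keep only the top matches per document) instead of binding them together. Since the blocks are independent of each other, lapply could be swapped for parallel::mclapply on systems that support forking.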

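To find the distance between two docs, you can do something like this (the "cosine" method comes from proxy::dist; base R's dist does not have it):

proxy::dist(t(as.matrix(tdm[, 1])), t(as.matrix(tdm[, 2])), method = "cosine")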
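If you only need a single pair, the definition is short enough to write out by hand. A minimal sketch (cosine_dist_pair is a hypothetical helper, not part of tm or proxy, and it expects the term counts of the two documents as plain numeric vectors of equal length):

```r
cosine_dist_pair <- function(a, b) {
  stopifnot(length(a) == length(b))
  # 1 minus (dot product divided by the product of the two L2 norms)
  1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

cosine_dist_pair(c(1, 2, 0), c(2, 4, 0))  # same direction, so the distance is 0
cosine_dist_pair(c(1, 0), c(0, 1))        # no shared terms, so the distance is 1
```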

Cosine similarity of 2 DTMs in R

Here is a way to calculate the cosine similarity between the rows of two matrices. The use of tm is just to get some example data.

library(slam)
library(tm)
data("acq")
data("crude")

dtm <- DocumentTermMatrix(c(acq, crude))

index <- sample(1:70, size = 10)

dtm1 <- dtm[index, ]
dtm2 <- dtm[-index, ]

cosine_sim <- tcrossprod_simple_triplet_matrix(dtm1, dtm2)/sqrt(row_sums(dtm1^2) %*% t(row_sums(dtm2^2)))

The cosine function was adapted from this SO post: R: Calculate cosine distance from a term-document matrix with tm and proxy
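The result has one row per document in the first matrix and one column per document in the second, so a typical next step is to look up the closest match for each row. A small dense base-R sketch with made-up matrices (a, b, sim and best are hypothetical names, and the dense tcrossprod() stands in for the slam version only to keep the example self-contained):

```r
# Made-up document-term matrices with the same 3 terms in the columns
a <- matrix(c(1, 0, 2,
              0, 3, 1),
            nrow = 2, byrow = TRUE, dimnames = list(c("a1", "a2"), NULL))
b <- matrix(c(2, 0, 4,
              0, 1, 0,
              1, 1, 1),
            nrow = 3, byrow = TRUE, dimnames = list(c("b1", "b2", "b3"), NULL))

# Rows of a against rows of b: dot products divided by the products of the norms
sim <- tcrossprod(a, b) / outer(sqrt(rowSums(a^2)), sqrt(rowSums(b^2)))
dim(sim)
# 2 3

# Most similar document in b for each document in a
best <- colnames(sim)[apply(sim, 1, which.max)]
best
# "b1" "b2"
```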

cosine similarity between 2 document term matrix

Disregarding all the tm stuff, as it seems to be beside the point, proxy::dist() has the argument pairwise, which lets you do what you want.

set.seed(1)
N <- 6*8
m <- matrix(sample(c(0, 1, 1), N, replace = TRUE) * rpois(N, 6), 6)
dimnames(m) <- list(c(paste0("ID", 1:3, "_2000"), paste0("ID", 1:3, "_2001")),
                    sample(LETTERS, ncol(m)))

library(proxy)
proxy::dist(m[1:3,], m[4:6,], pairwise=TRUE, method="cosine")
# 0.6160563 0.2746764 0.2038266

# Which is the same as
diag(proxy::dist(m[1:3,], m[4:6,], method="cosine"))
# 0.6160563 0.2746764 0.2038266
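To see why the two calls agree, and why the pairwise form is cheaper, the row-by-row distance can be written without ever forming the full cross matrix. A base-R sketch on made-up data (x, y and the result names are hypothetical, and this is not meant to describe how proxy computes it internally):

```r
set.seed(1)
# Two made-up matrices with matching rows: 3 documents x 8 terms each
x <- matrix(rpois(3 * 8, lambda = 6), nrow = 3)
y <- matrix(rpois(3 * 8, lambda = 6), nrow = 3)

# Row i of x against row i of y only: 3 dot products
pairwise_dist <- 1 - rowSums(x * y) / sqrt(rowSums(x^2) * rowSums(y^2))

# Full cross matrix (3 x 3 dot products), then keep the diagonal
full_dist <- 1 - tcrossprod(x, y) / outer(sqrt(rowSums(x^2)), sqrt(rowSums(y^2)))

all.equal(pairwise_dist, diag(full_dist))
```

The pairwise version does n dot products instead of n^2, which matters once the matrices get large.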

Distance matrix calculation taking too long in R

Cosine similarity is simply the dot product of two L2-normalized matrices. In your case it is even simpler: the product of the L2-normalized dtm with its own transpose. Here is a reproducible example using the Matrix and text2vec packages:

library(text2vec)
library(Matrix)

cosine <- function(m) {
  # Scale each row (document) to unit L2 norm
  m_normalized <- m / sqrt(rowSums(m ^ 2))
  # Dot products between all pairs of normalized rows
  tcrossprod(m_normalized)
}

data("movie_review")
data = rep(movie_review$review, 3)
it = itoken(data, tolower, word_tokenizer)
v = create_vocabulary(it) %>%
  prune_vocabulary(term_count_min = 5)
vectorizer = vocab_vectorizer(v)
it = itoken(data, tolower, word_tokenizer)
dtm = create_dtm(it, vectorizer)
dim(dtm)
# 15000 24548

system.time( dtm_cos <- cosine(dtm) )
# user system elapsed
# 41.914 6.963 50.761
dim(dtm_cos)
# 15000 15000
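One thing to watch with the cosine() function above: a document with no terms left after pruning has norm zero, so dividing by that norm can produce NaN. A possible variant, sketched on a made-up sparse matrix, scales the rows with a diagonal matrix and maps zero norms to zero (cosine_safe is a hypothetical name; sparseMatrix() and Diagonal() come from the Matrix package):

```r
library(Matrix)

cosine_safe <- function(m) {
  norms <- sqrt(rowSums(m^2))
  # Empty rows get a scaling factor of 0 instead of 1/0
  inv_norms <- ifelse(norms > 0, 1 / norms, 0)
  # Left-multiplying by a diagonal matrix rescales rows and keeps the result sparse
  m_normalized <- Diagonal(x = inv_norms) %*% m
  tcrossprod(m_normalized)
}

# Made-up 4 x 3 document-term matrix; the 4th document has no terms
m <- sparseMatrix(i = c(1, 1, 2, 3, 3),
                  j = c(1, 3, 2, 1, 3),
                  x = c(1, 2, 3, 2, 4),
                  dims = c(4, 3))

sim <- cosine_safe(m)
anyNA(as.matrix(sim))
```

With this variant the empty document ends up with similarity 0 to everything, including itself, which may or may not be what you want downstream.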

EDIT:
For the tm package, see this question: R: Calculate cosine distance from a term-document matrix with tm and proxy


