How to Efficiently Retrieve Top K-Similar Vectors by Cosine Similarity Using R

No need to compute the similarity for every row. You can use this instead:

coSim2 <- function(mat1, mat2, topK) {
  # similarity computation
  xy <- tcrossprod(mat1, mat2)
  xx <- rowSums(mat1^2)
  yy <- rowSums(mat2^2)
  result <- xy / sqrt(outer(xx, yy))
  # top similar rows from train (per row in test)
  top <- apply(result, 2, order, decreasing = TRUE)[1:topK, ]
  result_df <- data.frame(testRowId = c(col(top)), trainRowId = c(top))
  result_df$CosineSimilarity <- result[as.matrix(result_df[, 2:1])]
  list(similarity = result_df, index = t(top))
}

Test data (I've reduced your train matrix):

set.seed(123)
train <- matrix(round(runif(100000), 0), nrow = 500, ncol = 200)
set.seed(987)
test <- matrix(round(runif(400000), 0), nrow = 2000, ncol = 200)

Result:

> system.time(cosineSim <- coSim(train, test, topK = 50))
   user  system elapsed
  41.71    1.59   43.72

> system.time(cosineSim2 <- coSim2(train, test, topK = 50))
   user  system elapsed
   0.46    0.02    0.49

Using your full 5000 x 200 train matrix, coSim2 runs in 7.8 sec.

Also note:

> any(cosineSim$similarity != cosineSim2$similarity)
[1] FALSE
> any(cosineSim$index != cosineSim2$index)
[1] FALSE

You can't use identical() here because my function returns integers instead of doubles for the row IDs.
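If you want an identical()-style check anyway, a minimal sketch (assuming both result objects from the timings above) is to coerce the integer ID columns to double first; identical() can then succeed if only the storage types differed:

sim2 <- cosineSim2$similarity
sim2[] <- lapply(sim2, as.numeric)   # integer -> double, column by column
identical(cosineSim$similarity, sim2)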

Retrieving top k similar rows in a matrix for each row via cosine similarity in R

This function is based on the linked answer:

CosineSimilarities <- function(m, top.k) {
  # Computes cosine similarity between each row and all other rows in a matrix.
  #
  # Args:
  #   m: Matrix of values.
  #   top.k: Number of top rows to show for each row.
  #
  # Returns:
  #   Data frame with columns for pair of rows, and cosine similarity, for top
  #   `top.k` rows per row.
  #
  # Similarity computation
  cp <- tcrossprod(m)
  mm <- rowSums(m^2)
  result <- cp / sqrt(outer(mm, mm))
  # Top similar rows (per row)
  # Use `top.k + 1` so the self-reference (similarity = 1) can be dropped below
  top <- apply(result, 2, order, decreasing = TRUE)[seq(top.k + 1), ]
  result.df <- data.frame(row.id1 = c(col(top)), row.id2 = c(top))
  result.df$cosine.similarity <- result[as.matrix(result.df[, 2:1])]
  # Remove same-row records and return
  return(result.df[result.df$row.id1 != result.df$row.id2, ])
}

For example:

(m <- matrix(1:9, nrow=3))
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9
CosineSimilarities(m, 1)
#   row.id1 row.id2 cosine.similarity
# 2       1       2            0.9956
# 4       2       3            0.9977
# 6       3       2            0.9977

Top N Values of Cosine Similarity Matrix in R

The way I would do this is to convert the matrix to a tibble. We can follow the steps to convert the upper triangular part of the matrix to a data frame in two columns (see: Convert upper triangular part of a matrix to 3-column long format).

After this we can simply use top_n(2, val) to keep the two rows with the largest values. Another option at this step is to arrange the values in descending order using arrange(desc(val)) and then take the first two with head(2).

I have produced a reprex of both methods below.

library(tidyverse)

southpark_matrix <- structure(
  c(0, 0.165272735625452, 0.386480286121192, 0.170696960480773,
    0.0869562860988618, 0.165272735625452, 0, 0.251690602341816,
    0.472701602991984, 0.137486001150133, 0.386480286121192, 0.251690602341816,
    0, 0.255849200006255, 0.0972813221214626, 0.170696960480773,
    0.472701602991984, 0.255849200006255, 0, 0.156449701347234, 0.0869562860988618,
    0.137486001150133, 0.0972813221214626, 0.156449701347234, 0),
  .Dim = c(5L, 5L),
  .Dimnames = list(
    Docs = c("Mr. Garrison_2", "Cartman_3", "Mr. Garrison_3", "Cartman_4", "Jimbo_5"),
    Docs = c("Mr. Garrison_2", "Cartman_3", "Mr. Garrison_3", "Cartman_4", "Jimbo_5")))

# Convert the upper triangular part of the matrix to 3-column long format
ind <- which(upper.tri(southpark_matrix, diag = TRUE), arr.ind = TRUE)
dimnam <- dimnames(southpark_matrix)
df <- data.frame(row = dimnam[[1]][ind[, 1]],
                 col = dimnam[[2]][ind[, 2]],
                 val = southpark_matrix[ind])

# top_n method
df %>%
  tibble() %>%
  top_n(2, val)
#> # A tibble: 2 x 3
#>   row            col              val
#>   <chr>          <chr>          <dbl>
#> 1 Mr. Garrison_2 Mr. Garrison_3 0.386
#> 2 Cartman_3      Cartman_4      0.473

# arrange and head method
df %>%
  arrange(desc(val)) %>%
  head(2)
#> # A tibble: 2 x 3
#>   row            col              val
#>   <chr>          <chr>          <dbl>
#> 1 Cartman_3      Cartman_4      0.473
#> 2 Mr. Garrison_2 Mr. Garrison_3 0.386

Created on 2021-04-04 by the reprex package (v2.0.0)
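As an aside, top_n() has since been superseded in newer releases of dplyr; slice_max() is the current idiom for the same selection (a sketch, assuming dplyr >= 1.0.0):

# equivalent to top_n(2, val): the two rows with the largest val
df %>%
  slice_max(val, n = 2)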

Using Cosine similarity in String vector to filter out similar strings

So, to rephrase what you want: You'd like to calculate the pairwise similarities for all string pairs. You would then like to use that similarity matrix to identify groups of similar strings that are distinct from one another. For each of these groups, you want to drop all but the longest string and return that. Did I get that right?

After some experimenting, here is my proposed solution, step by step:

  • calculate the similarity matrix and binarize it using the threshold value
  • identify distinct groups (cliques) using a graph algorithm from the igraph package
  • find all strings in each clique and retain the longest string

NB: I had to adjust the threshold to 0.4 to make your example work.


Similarity Matrix

This is heavily based on the code you provided, but I packed it up as a function and used the tidyverse to make the code, at least to my taste, a little bit more readable.

library(tm)
library(lsa)
library(tidyverse)

get_cos_sim <- function(corpus) {
  # pre-process corpus
  doc <- corpus %>%
    VectorSource() %>%
    tm::VCorpus()
  # get term frequency matrix
  tfm <- doc %>%
    DocumentTermMatrix(
      control = list(
        removePunctuation = TRUE,
        wordLengths = c(1, Inf),
        weighting = weightTf)) %>%
    as.matrix()
  # get row-wise similarity
  sim <- NULL
  for (i in 1:nrow(tfm)) {
    sim_i <- apply(
      X = tfm,
      MARGIN = 1,
      FUN = lsa::cosine,
      tfm[i, ])
    sim <- rbind(sim, sim_i)
  }
  # set identity diagonal to zero
  diag(sim) <- 0
  # label and return
  rownames(sim) <- corpus
  return(sim)
}
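As an aside, the explicit loop over rows is not strictly necessary: the whole similarity matrix can be computed in one vectorised step with the same tcrossprod() idea used in the first answer (a sketch operating on tfm as built inside the function):

# vectorised equivalent of the loop: cosine similarity between all row pairs
rs <- rowSums(tfm^2)
sim <- tcrossprod(tfm) / sqrt(outer(rs, rs))
diag(sim) <- 0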

Now we apply this function to your example data:

# example corpus
strings <- c(
  "Dan is a good man and very smart",
  "A good man is rare",
  "Alex can be trusted with anything",
  "Dan likes to share his food",
  "Rare are man who can be trusted",
  "Please share food")

# get pairwise similarities
sim <- get_cos_sim(strings)

# binarize (using a different threshold to make your example work)
sim <- sim > .4

Identify Distinct Groups

This turned out to be an interesting problem! I found the paper Chalermsook & Chuzhoy: Maximum Independent Set of Rectangles, which led me to the max_cliques() implementation in the igraph package. Basically, we treat similar strings as connected vertices in a graph and then look for distinct groups (cliques) in the graph of the whole similarity matrix.

library(igraph)

# create graph from adjacency matrix
cliques <- sim %>%
  dplyr::as_data_frame() %>%
  mutate(from = row_number()) %>%
  gather(key = 'to', value = 'edge', -from) %>%
  filter(edge == TRUE) %>%
  graph_from_data_frame(directed = FALSE) %>%
  max_cliques()

Find Longest String

Now we can use the list of cliques to retrieve the strings for each of the vertices and pick the longest string per clique. Caveat: strings that have no similar strings in the corpus are missing from the graph, so I am adding them back in manually. There might be a function in the igraph package that is better at dealing with this; I'd be interested if anyone finds something.
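One caveat first: the code below calls a helper find_longest() that was not defined above. A minimal version consistent with how it is used (my assumption about the intended behaviour: return the longest of the strings at the given indices) could look like this:

# hypothetical helper: given a vector of string indices, return the
# longest of the corresponding strings
find_longest <- function(index, strings) {
  candidates <- strings[index]
  candidates[which.max(nchar(candidates))]
}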

# get the string indices per vertex clique first
string_cliques_index <- cliques %>%
  unlist() %>%
  names() %>%
  as.numeric()

# find the indices that are distinct but not in a clique
# (i.e. unconnected vertices)
string_uniques_index <- colnames(sim)[!colnames(sim) %in% string_cliques_index] %>%
  as.numeric()

# get a list with all indices
all_distinct <- cliques %>%
  lapply(names) %>%
  lapply(as.numeric) %>%
  c(string_uniques_index)

# get a list of distinct strings
lapply(all_distinct, find_longest, strings)

Test case:

Let's test this with a longer vector of different strings:

strings <- c(
  "Dan is a good man and very smart",
  "A good man is rare",
  "Alex can be trusted with anything",
  "Dan likes to share his food",
  "Rare are man who can be trusted",
  "Please share food",
  "NASA is a government organisation",
  "The FBI organisation is part of the government of USA",
  "Hurricanes are a tragedy",
  "Mangoes are very tasty to eat ",
  "I like to eat tasty food",
  "The thief was caught by the FBI")

I get this binarized similarity matrix (columns in the same order as the rows):

Dan is a good man and very smart                      FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
A good man is rare                                     TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Alex can be trusted with anything                     FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Dan likes to share his food                           FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Rare are man who can be trusted                       FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Please share food                                     FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
NASA is a government organisation                     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The FBI organisation is part of the government of USA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
Hurricanes are a tragedy                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Mangoes are very tasty to eat                         FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
I like to eat tasty food                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
The thief was caught by the FBI                       FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Based on these similarities, the expected outcome would be:

# included
Dan is a good man and very smart
Alex can be trusted with anything
Dan likes to share his food
NASA is a government organisation
The FBI organisation is part of the government of USA
Hurricanes are a tragedy
Mangoes are very tasty to eat

# omitted
A good man is rare
Rare are man who can be trusted
Please share food
I like to eat tasty food
The thief was caught by the FBI

The actual output has the right elements, but not in the original order. You can reorder it using your original string vector, though (see the sketch after the output below):

[[1]]
[1] "The FBI organisation is part of the government of USA"

[[2]]
[1] "Dan is a good man and very smart"

[[3]]
[1] "Alex can be trusted with anything"

[[4]]
[1] "Dan likes to share his food"

[[5]]
[1] "Mangoes are very tasty to eat "

[[6]]
[1] "NASA is a government organisation"

[[7]]
[1] "Hurricanes are a tragedy"

That's all! I hope this is what you were looking for and that it may be useful to others.

What method to find n most similar/related vectors within N numeric vectors in R? (N >> n)

Put the vectors as columns in a matrix, then calculate the cosine similarity of each column pair using crossprod as in this answer. Then you can find the n largest values in each column (a sketch of that last step follows the output below).

# collect the vectors a..j (from the question) as columns of a matrix
v <- mapply(get, letters[1:10], mode = "numeric")
crossprod(v)/(sqrt(tcrossprod(colSums(v^2))))*(1 - diag(ncol(v)))
#>           a         b         c         d         e         f         g         h         i         j
#> a 0.0000000 0.9958592 0.9958592 0.9958592 0.8062730 0.8946692 0.5415304 0.9616223 0.8162174 0.9280323
#> b 0.9958592 0.0000000 1.0000000 1.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> c 0.9958592 1.0000000 0.0000000 1.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> d 0.9958592 1.0000000 1.0000000 0.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> e 0.8062730 0.7636241 0.7636241 0.7636241 0.0000000 0.9226539 0.6904075 0.7748305 0.7656671 0.7887775
#> f 0.8946692 0.8736978 0.8736978 0.8736978 0.9226539 0.0000000 0.7374396 0.9189132 0.8929772 0.9557896
#> g 0.5415304 0.4943473 0.4943473 0.4943473 0.6904075 0.7374396 0.0000000 0.6968355 0.8281550 0.6265219
#> h 0.9616223 0.9592858 0.9592858 0.9592858 0.7748305 0.9189132 0.6968355 0.0000000 0.9292092 0.9614179
#> i 0.8162174 0.8106045 0.8106045 0.8106045 0.7656671 0.8929772 0.8281550 0.9292092 0.0000000 0.9003699
#> j 0.9280323 0.9318911 0.9318911 0.9318911 0.7887775 0.9557896 0.6265219 0.9614179 0.9003699 0.0000000
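For the last step, picking the n most similar vectors per column, you can order each column of the similarity matrix and take the first n indices (a sketch; n = 3 and the name cs are illustrative choices):

cs <- crossprod(v)/(sqrt(tcrossprod(colSums(v^2))))*(1 - diag(ncol(v)))
n <- 3
# for each column, the row indices of the n largest similarities
apply(cs, 2, function(col) order(col, decreasing = TRUE)[seq_len(n)])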

Comparing to correlation using cor:

cor(v)*(1 - 2*diag(ncol(v)))
#>              a           b           c           d           e           f           g           h           i           j
#> a -1.00000000   0.9863939   0.9863939   0.9863939 -0.05488213  0.18898224  -0.8565862  0.77151675  -0.3434014   0.5669467
#> b  0.98639392  -1.0000000   1.0000000   1.0000000 -0.15338363  0.18641093  -0.8745781  0.83712138  -0.1919465   0.6524383
#> c  0.98639392   1.0000000  -1.0000000   1.0000000 -0.15338363  0.18641093  -0.8745781  0.83712138  -0.1919465   0.6524383
#> d  0.98639392   1.0000000   1.0000000  -1.0000000 -0.15338363  0.18641093  -0.8745781  0.83712138  -0.1919465   0.6524383
#> e -0.05488213  -0.1533836  -0.1533836  -0.1533836 -1.00000000  0.45635690  -0.2276335 -0.66054273  -0.7086322  -0.2696654
#> f  0.18898224   0.1864109   0.1864109   0.1864109  0.45635690 -1.00000000  -0.4174789 -0.02916059  -0.3374632   0.6428571
#> g -0.85658615  -0.8745781  -0.8745781  -0.8745781 -0.22763353 -0.41747888  -1.0000000 -0.53565298   0.2415150  -0.6304783
#> h  0.77151675   0.8371214   0.8371214   0.8371214 -0.66054273 -0.02916059  -0.5356530 -1.00000000   0.2331471   0.6998542
#> i -0.34340141  -0.1919465  -0.1919465  -0.1919465 -0.70863219 -0.33746319   0.2415150  0.23314715  -1.0000000   0.1817109
#> j  0.56694671   0.6524383   0.6524383   0.6524383 -0.26966544  0.64285714  -0.6304783  0.69985421   0.1817109  -1.0000000

Fast computation of 10^6 cosine vector similarities in R

Here's my take on it.

If I define cosine similarity as

coss <- function(x) {crossprod(x)/(sqrt(tcrossprod(colSums(x^2))))}

(I think that is about as fast as I can make it with base R functions; the often-overlooked crossprod is a little gem.) If I compare it with an Rcpp function using RcppArmadillo (slightly updated as suggested by @f-privé):

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix cosine_similarity(NumericMatrix x) {
  // reuse the R matrix's memory without copying
  arma::mat X(x.begin(), x.nrow(), x.ncol(), false);

  // compute the crossprod
  arma::mat res = X.t() * X;
  int n = x.ncol();
  arma::vec diag(n);
  int i, j;

  // column norms from the diagonal of the crossprod
  for (i = 0; i < n; i++) {
    diag(i) = sqrt(res(i, i));
  }

  // normalise each entry by the product of the two column norms
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      res(i, j) /= diag(i)*diag(j);

  return(wrap(res));
}

(this might possibly be optimised with some of the specialized functions in the armadillo library - just wanted to get some timing measurements).
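To run the Rcpp version, the usual route is to save it to a .cpp file and compile it with Rcpp::sourceCpp() (a usage sketch; the file name is illustrative):

library(Rcpp)
# compiles the C++ code above and exposes cosine_similarity() in R
sourceCpp("cosine_similarity.cpp")
x <- matrix(rnorm(120 * 1600), ncol = 1600)
all.equal(cosine_similarity(x), coss(x))  # should be TRUE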

Comparing those yields

> x <- matrix(rnorm(120*1600), ncol=1600)
> microbenchmark::microbenchmark(coss(x), coss2(x), cosine_similarity2(x), times=50)
Unit: milliseconds
                   expr      min       lq     mean   median       uq      max neval cld
                coss(x) 173.0975 183.0606 192.8333 187.6082 193.2885 331.9206    50   a
               coss2(x) 162.4193 171.3178 183.7533 178.8296 184.9762 319.7934    50   b
  cosine_similarity2(x) 169.6075 175.5601 191.4402 181.3405 186.4769 319.8792    50   a

which is really not that bad. The gain from computing the cosine similarity in C++ is very small (with @f-privé's solution being fastest), so I'm guessing your timing issues come from converting the text from words to numbers, not from calculating the cosine similarity. Without knowing more about your specific code, it is hard for us to help you.


