How to efficiently retrieve top K-similar vectors by cosine similarity using R?
No need to compute the similarity for every row. You can use this instead:
coSim2<-function(mat1, mat2, topK){
#similarity computation:
xy <- tcrossprod(mat1, mat2)
xx <- rowSums(mat1^2)
yy <- rowSums(mat2^2)
result <- xy/sqrt(outer(xx,yy))
#top similar rows from train (per row in test):
top <- apply(result, 2, order, decreasing=TRUE)[1:topK,]
result_df <- data.frame(testRowId=c(col(top)), trainRowId=c(top))
result_df$CosineSimilarity <- result[as.matrix(result_df[,2:1])]
list(similarity=result_df, index=t(top))
}
Test data (I've reduced your train
matrix)
set.seed(123)
train<-matrix(round(runif(100000),0),nrow=500,ncol=200)
set.seed(987)
test<-matrix(round(runif(400000),0),nrow=2000,ncol=200)
Result:
> system.time(cosineSim<-coSim(train, test, topK=50)) #380secs
user system elapsed
41.71 1.59 43.72
> system.time(cosineSim2<-coSim2(train, test, topK=50)) #380secs
user system elapsed
0.46 0.02 0.49
Using your full 5000 x 200 train
matrix, coSim2
runs in 7.8 sec.
Also note:
> any(cosineSim$similarity != cosineSim2$similarity)
[1] FALSE
> any(cosineSim$index != cosineSim2$index)
[1] FALSE
You can't use identical
because my function returns integers instead of doubles for row IDs.
Retrieving top k similar rows in a matrix for each row via cosine similarity in R
This function is based on the linked answer:
CosineSimilarities <- function(m, top.k) {
# Computes cosine similarity between each row and all other rows in a matrix.
#
# Args:
# m: Matrix of values.
# top.k: Number of top rows to show for each row.
#
# Returns:
# Data frame with columns for pair of rows, and cosine similarity, for top
# `top.k` rows per row.
#
# Similarity computation
cp <- tcrossprod(m)
mm <- rowSums(m ^ 2)
result <- cp / sqrt(outer(mm, mm))
# Top similar rows from train (per row)
# Use `top.k + 1` to remove the self-reference (similarity = 1)
top <- apply(result, 2, order, decreasing=TRUE)[seq(top.k + 1), ]
result.df <- data.frame(row.id1=c(col(top)), row.id2=c(top))
result.df$cosine.similarity <- result[as.matrix(result.df[, 2:1])]
# Remove same-row records and return
return(result.df[result.df$row.id1 != result.df$row.id2, ])
}
For example:
(m <- matrix(1:9, nrow=3))
# [,1] [,2] [,3]
# [1,] 1 4 7
# [2,] 2 5 8
# [3,] 3 6 9
CosineSimilarities(m, 1)
# row.id1 row.id2 cosine.similarity
# 2 1 2 0.9956
# 4 2 3 0.9977
# 6 3 2 0.9977
Top N Values of Cosine Similarity Matrix in R
The way I would do this would be to convert the matrix to a tibble. We can do this by following the steps here to convert the upper triangular part of the matrix to a dataframe in 2 columns (see here: Convert upper triangular part of a matrix to 3-column long format).
After this we can simply use the top_n(2, val) function weigthed by our values. Another method at this step would be to arrange the values in descending order using arrange(desc(val)) and then use the head(2) function to take the top 2 values.
I have produced a reprex of my methods below
library(tidyverse)
southpark_matrix <- structure(c(0, 0.165272735625452, 0.386480286121192, 0.170696960480773,
0.0869562860988618, 0.165272735625452, 0, 0.251690602341816,
0.472701602991984, 0.137486001150133, 0.386480286121192, 0.251690602341816,
0, 0.255849200006255, 0.0972813221214626, 0.170696960480773,
0.472701602991984, 0.255849200006255, 0, 0.156449701347234, 0.0869562860988618,
0.137486001150133, 0.0972813221214626, 0.156449701347234, 0), .Dim = c(5L,
5L), .Dimnames = list(Docs = c("Mr. Garrison_2", "Cartman_3",
"Mr. Garrison_3", "Cartman_4", "Jimbo_5"), Docs = c("Mr. Garrison_2",
"Cartman_3", "Mr. Garrison_3", "Cartman_4", "Jimbo_5")))
# Convert the matrix to an upper diagonal form
ind <- which(upper.tri(southpark_matrix, diag = TRUE), arr.ind = TRUE)
dimnam <- dimnames(southpark_matrix)
df <- data.frame(row = dimnam[[1]][ind[, 1]],
col = dimnam[[2]][ind[, 2]],
val = southpark_matrix[ind])
#top n method
df %>%
tibble() %>%
top_n(2, val)
#> # A tibble: 2 x 3
#> row col val
#> <chr> <chr> <dbl>
#> 1 Mr. Garrison_2 Mr. Garrison_3 0.386
#> 2 Cartman_3 Cartman_4 0.473
#arrange and head method
df %>%
arrange(desc(val)) %>%
head(2)
#> # A tibble: 2 x 3
#> row col val
#> <chr> <chr> <dbl>
#> 1 Cartman_3 Cartman_4 0.473
#> 2 Mr. Garrison_2 Mr. Garrison_3 0.386
Created on 2021-04-04 by the reprex package (v2.0.0)
Using Cosine similarity in String vector to filter out similar strings
So, to rephrase what you want: You'd like to calculate the pairwise similarities for all string pairs. You would then like to use that similarity matrix to identify groups of strings that are dissimilar enough to form distinct groups. For each of these groups, you want to drop all but the longest string and return that. Did I get that right?
After some experimenting, here is my proposed solution, step by step:
- calculate the similarity matrix and binarize it using the threshold value
- identify distinct groups (cliques) using a graph algorithm from the
igraph
package - find all strings in each clique and retain the longest string
NB: I had to adjust the threshold to 0.4 to make your example work.
Similarity Matrix
This is heavily based on the code you provided, but I packed it up as a function and used the tidyverse
to make the code, at least to my taste, a little bit more readable.
library(tm)
library(lsa)
library(tidyverse)
get_cos_sim <- function(corpus) {
# pre-process corpus
doc <- corpus %>%
VectorSource %>%
tm::VCorpus()
# get term frequency matrix
tfm <- doc %>%
DocumentTermMatrix(
control = corpus %>% list(
removePunctuation = TRUE,
wordLengths = c(1, Inf),
weighting = weightTf)) %>%
as.matrix()
# get row-wise similarity
sim <- NULL
for(i in 1:nrow(tfm)) {
sim_i <- apply(
X = tfm,
MARGIN = 1,
FUN = lsa::cosine,
tfm[i,])
sim <- rbind(sim, sim_i)
}
# set identity diagonal to zero
diag(sim) <- 0
# label and return
rownames(sim) <- corpus
return(sim)
}
Now we apply this function to your example data
# example corpus
strings <- c(
"Dan is a good man and very smart",
"A good man is rare",
"Alex can be trusted with anything",
"Dan likes to share his food",
"Rare are man who can be trusted",
"Please share food")
# get pairwise similarities
sim <- get_cos_sim(strings)
# binarize (using a different threshold to make your example work)
sim <- sim > .4
Identify Distinct Groups
This turned out to be an interesting problem! I found this paper, Chalermsook & Chuzhoy: Maximum Independent Set of Rectangles, that led me to this implementation in the igraph
package. Basically, we treat similar strings as connected vertices in a graph and then look for distinct groups in the graph of the whole similarity matrix
library(igraph)
# create graph from adjacency matrix
cliques <- sim %>%
dplyr::as_data_frame() %>%
mutate(from = row_number()) %>%
gather(key = 'to', value = 'edge', -from) %>%
filter(edge == T) %>%
graph_from_data_frame(directed = FALSE) %>%
max_cliques()
Find Longest String
Now we can use the list of cliques to retrieve the strings for each of the vertices
and pick the longest string per clique. Caveat: strings that have no similar strings in the corpus are missing from the graph. I am adding them back in manually. There might be a function in the igraph
package that's better at dealing with it, would be interested if anyone finds something
# get the string indices per vertex clique first
string_cliques_index <- cliques %>%
unlist %>%
names %>%
as.numeric
# find the indices that are distinct but not in a clique
# (i.e. unconnected vertices)
string_uniques_index <- colnames(sim)[!colnames(sim) %in% string_cliques_index] %>%
as.numeric
# get a list with all indices
all_distict <- cliques %>%
lapply(names) %>%
lapply(as.numeric) %>%
c(string_uniques_index)
# get a list of distinct strings
lapply(all_distict, find_longest, strings)
Test case:
Let's test this with a longer vector of different strings:
strings <- c(
"Dan is a good man and very smart",
"A good man is rare",
"Alex can be trusted with anything",
"Dan likes to share his food",
"Rare are man who can be trusted",
"Please share food",
"NASA is a government organisation",
"The FBI organisation is part of the government of USA",
"Hurricanes are a tragedy",
"Mangoes are very tasty to eat ",
"I like to eat tasty food",
"The thief was caught by the FBI")
I get this binarized similarity matrix:
Dan is a good man and very smart FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
A good man is rare TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Alex can be trusted with anything FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Dan likes to share his food FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Rare are man who can be trusted FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Please share food FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
NASA is a government organisation FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The FBI organisation is part of the government of USA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Hurricanes are a tragedy FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Mangoes are very tasty to eat FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
I like to eat tasty food FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
The thief was caught by the FBI FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Based on these similarities, the expected outcome would be:
# included
Dan is a good man and very smart
Alex can be trusted with anything
Dan likes to share his food
NASA is a government organisation
The FBI organisation is part of the government of USA
Hurricanes are a tragedy
Mangoes are very tasty to eat
# omitted
A good man is rare
Rare are man who can be trusted
Please share food
I like to eat tasty food
The thief was caught by the FBI
The actual output has the right elements, but not in the original order.
You can reorder using your original string vector though
[[1]]
[1] "The FBI organisation is part of the government of USA"
[[2]]
[1] "Dan is a good man and very smart"
[[3]]
[1] "Alex can be trusted with anything"
[[4]]
[1] "Dan likes to share his food"
[[5]]
[1] "Mangoes are very tasty to eat "
[[6]]
[1] "NASA is a government organisation"
[[7]]
[1] "Hurricanes are a tragedy"
That's all!
Hope it is what you were looking for and may be useful to others.
What method to find n most similar/related vectors within N numeric vectors in R? (N n)
Put each vector as columns in a matrix, then calculate the cosine similarity of each column pair using crossprod
as in this answer. Then you could find the maximum n
values in each column.
v <- mapply(get, letters[1:10], mode = "numeric")
crossprod(v)/(sqrt(tcrossprod(colSums(v^2))))*(1 - diag(ncol(v)))
#> a b c d e f g h i j
#> a 0.0000000 0.9958592 0.9958592 0.9958592 0.8062730 0.8946692 0.5415304 0.9616223 0.8162174 0.9280323
#> b 0.9958592 0.0000000 1.0000000 1.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> c 0.9958592 1.0000000 0.0000000 1.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> d 0.9958592 1.0000000 1.0000000 0.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> e 0.8062730 0.7636241 0.7636241 0.7636241 0.0000000 0.9226539 0.6904075 0.7748305 0.7656671 0.7887775
#> f 0.8946692 0.8736978 0.8736978 0.8736978 0.9226539 0.0000000 0.7374396 0.9189132 0.8929772 0.9557896
#> g 0.5415304 0.4943473 0.4943473 0.4943473 0.6904075 0.7374396 0.0000000 0.6968355 0.8281550 0.6265219
#> h 0.9616223 0.9592858 0.9592858 0.9592858 0.7748305 0.9189132 0.6968355 0.0000000 0.9292092 0.9614179
#> i 0.8162174 0.8106045 0.8106045 0.8106045 0.7656671 0.8929772 0.8281550 0.9292092 0.0000000 0.9003699
#> j 0.9280323 0.9318911 0.9318911 0.9318911 0.7887775 0.9557896 0.6265219 0.9614179 0.9003699 0.0000000
Comparing to correlation using cor
:
cor(v)*(1 - 2*diag(ncol(v)))
#> a b c d e f g h i j
#> a -1.00000000 0.9863939 0.9863939 0.9863939 -0.05488213 0.18898224 -0.8565862 0.77151675 -0.3434014 0.5669467
#> b 0.98639392 -1.0000000 1.0000000 1.0000000 -0.15338363 0.18641093 -0.8745781 0.83712138 -0.1919465 0.6524383
#> c 0.98639392 1.0000000 -1.0000000 1.0000000 -0.15338363 0.18641093 -0.8745781 0.83712138 -0.1919465 0.6524383
#> d 0.98639392 1.0000000 1.0000000 -1.0000000 -0.15338363 0.18641093 -0.8745781 0.83712138 -0.1919465 0.6524383
#> e -0.05488213 -0.1533836 -0.1533836 -0.1533836 -1.00000000 0.45635690 -0.2276335 -0.66054273 -0.7086322 -0.2696654
#> f 0.18898224 0.1864109 0.1864109 0.1864109 0.45635690 -1.00000000 -0.4174789 -0.02916059 -0.3374632 0.6428571
#> g -0.85658615 -0.8745781 -0.8745781 -0.8745781 -0.22763353 -0.41747888 -1.0000000 -0.53565298 0.2415150 -0.6304783
#> h 0.77151675 0.8371214 0.8371214 0.8371214 -0.66054273 -0.02916059 -0.5356530 -1.00000000 0.2331471 0.6998542
#> i -0.34340141 -0.1919465 -0.1919465 -0.1919465 -0.70863219 -0.33746319 0.2415150 0.23314715 -1.0000000 0.1817109
#> j 0.56694671 0.6524383 0.6524383 0.6524383 -0.26966544 0.64285714 -0.6304783 0.69985421 0.1817109 -1.0000000
Fast computation of 10^6 cosine vector similarities in R
Here's my take on it.
If I define cosine similarity as
coss <- function(x) {crossprod(x)/(sqrt(tcrossprod(colSums(x^2))))}
(I think that is about as quickly as I can make it with base R functions and the often overseen crossprod
which is a little gem). If I compare it with an RCpp function using RCppArmadillo (slightly updated as suggested by @f-privé)
NumericMatrix cosine_similarity(NumericMatrix x) {
arma::mat X(x.begin(), x.nrow(), x.ncol(), false);
// Compute the crossprod
arma::mat res = X.t() * X;
int n = x.ncol();
arma::vec diag(n);
int i, j;
for (i=0; i<n; i++) {
diag(i) = sqrt(res(i,i));
}
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
res(i, j) /= diag(i)*diag(j);
return(wrap(res));
}
(this might possibly be optimised with some of the specialized functions in the armadillo library - just wanted to get some timing measurements).
Comparing those yields
> XX <- matrix(rnorm(120*1600), ncol=1600)
> microbenchmark::microbenchmark(cosine_similarity(XX), coss(XX), coss2(XX), times=50)
> microbenchmark::microbenchmark(coss(x), coss2(x), cosine_similarity(x), cosine_similarity2(x), coss3(x), times=50)
Unit: milliseconds
expr min lq mean median uq max
coss(x) 173.0975 183.0606 192.8333 187.6082 193.2885 331.9206
coss2(x) 162.4193 171.3178 183.7533 178.8296 184.9762 319.7934
cosine_similarity2(x) 169.6075 175.5601 191.4402 181.3405 186.4769 319.8792
neval cld
50 a
50 b
50 a
which is really not that bad. The gain in computing the cosine similarity using C++ is super small (with @ f-privé's solution being fastest) so I'm guessing your timing issues are due to what you are doing to convert the text from the words to numbers and not when calculating the cosine similarity. Without knowing more about your specific code it is hard for us to help you.
Related Topics
Change Distance Between X-Axis Ticks in Ggplot2
Change The Color of a Ggplot Geom a Posteriori (After Having Specified Another Color)
Show Source Code for a Function in a Package in R
Cumulative Sums Over Run Lengths. Can This Loop Be Vectorized
Split Data.Frame Row into Multiple Rows Based on Commas
Shiny Sliderinput from Max to Min
How to Make UI Respond to Reactive Values in for Loop
Convert a Row of a Data Frame to a Simple Vector in R
How to Extract Coefficients' Standard Error from an "Aov" Model
How to Define Multiple Variables with Lapply
Ggplot: Subset a Layer Where Data Is Passed Using a Pipe
Clear R Environment of All Objetcs & Packages