Findassocs for Multiple Terms in R

findAssocs for multiple terms in R

If I understand correctly, an lapply solution is probably the way to answer your question. This is the same approach as the answer that you link to, but here's a self-contained example that might be closer to your use case:

Load libraries and reproducible data (please include these in your future questions here)

library(tm)
library(RWeka)
data(crude)

Your bigram tokenizer...

#Tokenizer for n-grams and passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
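To make concrete what a bigram tokenizer produces, here is a minimal base-R sketch. It is not RWeka's implementation; `simple_bigrams` is a hypothetical helper that only splits on whitespace and pairs each word with the one that follows it.

```r
# Hypothetical helper, for illustration only: split on whitespace and
# paste each word together with the word that follows it.
simple_bigrams <- function(x) {
  words <- strsplit(trimws(x), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

simple_bigrams("crude oil prices fell")
```

For the sentence above this yields the three bigrams "crude oil", "oil prices" and "prices fell", which is the kind of term that ends up in the rows of the term-document matrix.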

Check that it worked by inspecting a small slice of the matrix...

inspect(txtTdmBi[1000:1005, 10:15])
A term-document matrix (6 terms, 6 documents)

Non-/sparse entries: 1/35
Sparsity           : 97%
Maximal term length: 18 
Weighting          : term frequency (tf)

                    Docs
Terms                248 273 349 352 353 368
  for their            0   0   0   0   0   0
  for west             0   0   0   0   0   0
  forced it            0   0   0   0   0   0
  forced to            0   0   0   0   0   0
  forces trying        1   0   0   0   0   0
  foreign investment   0   0   0   0   0   0

Here is the answer to your question:

Now use lapply to calculate the associated words for every item in the vector of terms in the term-document matrix. The vector of terms is most simply accessed with txtTdmBi$dimnames$Terms. For example, txtTdmBi$dimnames$Terms[[1005]] is "foreign investment".

Here I've used llply from the plyr package so we can have a progress bar (comforting for big jobs), but it's basically the same as the base lapply function.

library(plyr)
dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5), .progress = "text" )

The output is a list where each item in the list is a vector of named numbers where the name is the term and the number is the correlation value. For example, to see the terms associated with "foreign investment", we can access the list like so:

dat[[1005]]

and here are the terms associated with that term (I've just pasted in the top few)

168 million              1986 was            1987 early               300 mln                31 pct 
                 1.00                  1.00                  1.00                  1.00                  1.00 
                a bit          a crossroads             a leading           a political          a population 
                 1.00                  1.00                  1.00                  1.00                  1.00 
            a reduced              a series            a slightly            about zero    activity continues 
                 1.00                  1.00                  1.00                  1.00                  1.00 
         advisers are   agricultural sector       agriculture the              all such          also reviews 
                 1.00                  1.00                  1.00                  1.00                  1.00 
         and advisers           and attract           and imports       and liberalised             and steel 
                 1.00                  1.00                  1.00                  1.00                  1.00 
            and trade           and virtual       announced since            appears to           are equally 
                 1.00                  1.00                  1.00                  1.00                  1.00 
     are recommending             areas for              areas of                 as it              as steps 
                 1.00                  1.00                  1.00                  1.00                  1.00 
            asia with          asian member    assesses indonesia           attract new            balance of 
                 1.00                  1.00                  1.00                  1.00                  1.00
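Looking results up by position (1005) is fragile, so it may be more convenient to name the list by term. The sketch below shows the idea in base R on a toy matrix; `tdm_toy` and `assoc_for` are hypothetical stand-ins for the real term-document matrix and for findAssocs, not the tm implementation.

```r
# Toy term-document matrix (terms in rows, documents in columns).
tdm_toy <- matrix(
  c(0, 0, 1,
    0, 1, 1,
    1, 0, 1,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(Terms = c("and", "another", "crude oil", "oil prices"),
                  Docs  = c("1", "2", "3"))
)

# Hypothetical stand-in for findAssocs(): correlate one term with all others
# and keep the correlations above corlimit, sorted in decreasing order.
assoc_for <- function(m, term, corlimit) {
  others <- setdiff(rownames(m), term)
  r <- vapply(others, function(o) cor(m[term, ], m[o, ]), numeric(1))
  sort(round(r[!is.na(r) & r > corlimit], 2), decreasing = TRUE)
}

terms <- rownames(tdm_toy)
dat_named <- setNames(lapply(terms, function(i) assoc_for(tdm_toy, i, 0.4)), terms)

# Look up by term rather than by position.
dat_named[["and"]]
```

With the real objects from this answer, the same effect should be achievable with names(dat) <- txtTdmBi$dimnames$Terms, after which dat[["foreign investment"]] replaces dat[[1005]].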

Is that what you want to do?

Incidentally, if your term-document matrix is very large, you may want to try this version of findAssocs:

# u is a term document matrix
# term is your term
# corlimit is a value -1 to 1

findAssocsBig <- function(u, term, corlimit){
  suppressWarnings(x.cor <-  gamlr::corr(t(u[ !u$dimnames$Terms == term, ]),        
                                         as.matrix(t(u[  u$dimnames$Terms == term, ]))  ))  
  x <- sort(round(x.cor[(x.cor[, term] > corlimit), ], 2), decreasing = TRUE)
  return(x)
}

This can be used like so:

dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5), .progress = "text" )

The advantage of this is that it uses a different method of converting the TDM to a matrix than tm::findAssocs does. This method uses memory more efficiently, and so prevents messages like this one: Error: cannot allocate vector of size 1.9 Gb.
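To get a feel for why a dense conversion can run out of memory, here is a rough back-of-the-envelope estimate. The dimensions and sparsity are hypothetical, and the 16 bytes per non-zero cell (two integer indices plus one double value) is an assumption about a triplet-style sparse layout, not a measurement.

```r
# Hypothetical dimensions and sparsity, chosen only for illustration.
n_terms  <- 50000
n_docs   <- 5000
sparsity <- 0.99   # share of cells that are zero

# A dense numeric (double) matrix needs 8 bytes per cell.
dense_gib <- n_terms * n_docs * 8 / 1024^3

# Assumed sparse layout: ~16 bytes per non-zero cell (two indices + one value).
sparse_gib <- n_terms * n_docs * (1 - sparsity) * 16 / 1024^3

round(c(dense = dense_gib, sparse = sparse_gib), 2)
```

Under these assumptions the dense matrix needs roughly 1.86 GiB while the sparse form needs a few hundredths of a GiB, which is why avoiding the dense conversion matters for large matrices.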

Quick benchmarking shows that both findAssocs functions are about the same speed, so the main difference is in the use of memory:

library(microbenchmark)
microbenchmark(
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5)),
dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5)),
times = 10)

Unit: seconds
                                                                                     expr      min       lq   median
 dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi,      i, 0.5)) 10.82369 11.03968 11.25492
     dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi,      i, 0.5)) 10.70980 10.85640 11.14156
       uq      max neval
 11.39326 11.89754    10
 11.18877 11.97978    10

R : Visualize correlated words against one or more words

A slightly different approach is required for two words. Here's a quick attempt:

require(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)

# Compute correlations and store in data frame...

toi1 <- "oil" # term of interest
toi2 <- "winter"
corlimit <- 0.7 #  lower correlation bound limit.

corr1 <-  findAssocs(tdm, toi1, corlimit)[[1]]
corr1 <- cbind(read.table(text = names(corr1), stringsAsFactors = FALSE), corr1)
corr2 <- findAssocs(tdm, toi2, corlimit)[[1]]
corr2 <- cbind(read.table(text = names(corr2), stringsAsFactors = FALSE), corr2)

# join them together
library(dplyr)
two_terms_corrs <- full_join(corr1, corr2, by = "V1")  # V1 holds the associated word

# gather for plotting
library(tidyr)
two_terms_corrs_gathered <- gather(two_terms_corrs, term, correlation, corr1:corr2)

# insert the actual terms of interest so they show up on the legend
two_terms_corrs_gathered$term <- ifelse(two_terms_corrs_gathered$term  == "corr1", toi1, toi2)

# Draw the plot...

require(ggplot2)
ggplot(two_terms_corrs_gathered, aes(x = V1, y = correlation, colour =  term ) ) +
  geom_point(size = 3) +
  ylab(paste0("Correlation with the terms ", "\"", toi1,  "\"", " and ",  "\"", toi2, "\"")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

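The join-and-gather steps above handle exactly two terms. For the "one or more" case, one possible generalisation is to build the long data frame directly from a named list of associations. This is a base-R sketch only: the correlation values are made up and stand in for what findAssocs would return for each term of interest.

```r
# Made-up correlation values standing in for findAssocs() output,
# one named numeric vector per term of interest.
assocs <- list(
  oil    = c(opec = 0.87, barrel = 0.74),
  winter = c(cold = 0.90, opec = 0.71),
  prices = c(barrel = 0.80)
)

# Stack everything into one long data frame: one row per (term, word) pair.
corrs_long <- do.call(rbind, lapply(names(assocs), function(toi) {
  a <- assocs[[toi]]
  data.frame(term = rep(toi, length(a)),
             V1 = names(a),
             correlation = unname(a),
             stringsAsFactors = FALSE)
}))

corrs_long
```

The column names (V1, correlation, term) are chosen to match the aesthetics in the ggplot call, so the result could be plotted the same way once the ylab text is adjusted for the number of terms.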

Math of tm::findAssocs how does this function work?

 findAssocs
#function (x, term, corlimit) 
#UseMethod("findAssocs", x)
#<environment: namespace:tm>

methods(findAssocs )
#[1] findAssocs.DocumentTermMatrix* findAssocs.matrix*   findAssocs.TermDocumentMatrix*

 getAnywhere(findAssocs.DocumentTermMatrix)
#-------------
A single object matching ‘findAssocs.DocumentTermMatrix’ was found
It was found in the following places
  registered S3 method for findAssocs from namespace tm
  namespace:tm
with value

function (x, term, corlimit) 
{
    ind <- term == Terms(x)
    suppressWarnings(x.cor <- cor(as.matrix(x[, ind]), as.matrix(x[, 
        !ind])))

    # The !ind above is where the self-reference (the term itself) is removed.

    findAssocs(x.cor, term, corlimit)
}
<environment: namespace:tm>
#-------------
 getAnywhere(findAssocs.matrix)
#-------------
A single object matching ‘findAssocs.matrix’ was found
It was found in the following places
  registered S3 method for findAssocs from namespace tm
  namespace:tm
with value

function (x, term, corlimit) 
sort(round(x[term, which(x[term, ] > corlimit)], 2), decreasing = TRUE)
<environment: namespace:tm>
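The two methods printed above can be traced by hand on a small dense matrix. The sketch below repeats the same two steps (correlate the term's column with all other columns, then filter, round and sort) in base R; the matrix `dtm_toy` is hypothetical toy data, with documents in rows and terms in columns.

```r
# Hypothetical toy document-term matrix (documents in rows, terms in columns).
dtm_toy <- matrix(
  c(1, 1, 1, 1, 1,
    1, 1, 0, 1, 1,
    0, 0, 1, 1, 1,
    0, 0, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(Docs  = c("1", "2", "3", "4"),
                  Terms = c("oil", "opec", "prices", "barrel", "the"))
)

term     <- "oil"
corlimit <- 0.5

# Step 1 (DocumentTermMatrix method): correlate the term with every other term.
# "the" occurs in every document, so its correlation is NA (with a warning).
ind   <- term == colnames(dtm_toy)
x.cor <- suppressWarnings(
  cor(dtm_toy[, ind, drop = FALSE], dtm_toy[, !ind, drop = FALSE])
)

# Step 2 (matrix method): keep correlations above corlimit, round and sort.
# which() silently drops the NA entries.
res <- sort(round(x.cor[term, which(x.cor[term, ] > corlimit)], 2),
            decreasing = TRUE)
res
```

Here "opec" survives with a correlation of 1 and "barrel" with 0.58, while "prices" (correlation 0) and "the" (NA) are dropped.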

word association - findAssocs and numeric (0)

Consider the following example:

library(tm)
corp <- VCorpus(VectorSource(
          c("hello world", "hello another World ", "and hello yet another world")))
tdm <- TermDocumentMatrix(corp)
inspect(tdm)
#          Docs
# Terms     1 2 3
#   and     0 0 1
#   another 0 1 1
#   hello   1 1 1
#   world   1 1 1
#   yet     0 0 1

Now consider

findAssocs(x=tdm, terms=c("hello", "yet"), corlimit=.4)
# $hello
# numeric(0)
# 
# $yet
#     and another 
#     1.0     0.5

From what I understand, findAssocs looks at the correlations of hello with everything except hello and yet, and of yet with everything except hello and yet. yet and and have a correlation coefficient of 1.0, which is above the lower limit of 0.4. yet and another have a correlation coefficient of 0.5 (yet also happens to occur in 50% of the documents containing another), which is above the 0.4 limit as well.
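These coefficients can be checked directly with base R's cor on the rows of the matrix printed by inspect(tdm) above. This is a quick sanity check typed in by hand, not part of tm:

```r
# Rows of the term-document matrix shown above, one value per document.
and     <- c(0, 0, 1)
another <- c(0, 1, 1)
yet     <- c(0, 0, 1)

cor_yet_and     <- cor(yet, and)      # identical vectors
cor_yet_another <- cor(yet, another)

round(c(and = cor_yet_and, another = cor_yet_another), 2)
```

The rounded values are 1 and 0.5, matching the $yet entry in the findAssocs output.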

Here's another example showcasing this:

findAssocs(x=tdm, terms=c("yet", "another"), corlimit=0)
# $yet
# and 
#   1 
# 
# $another
# and 
# 0.5

Note that hello (and world) don't yield any results because they are in every document. This means the term frequency has zero variance and cor under the hood yields NA (like cor(rep(1,3), 1:3), which gives NA plus a zero-standard-deviation-warning).
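The path from a zero-variance term to numeric(0) can be reproduced in a few lines of base R. This sketch assumes the filtering step uses which() on the comparison, as in the findAssocs.matrix method shown earlier; other versions of tm may differ in detail.

```r
# "hello" occurs once in every document, so its row has zero variance.
hello   <- c(1, 1, 1)
another <- c(0, 1, 1)

# cor() returns NA (with a warning) when one input has zero standard deviation.
r <- suppressWarnings(cor(hello, another))
is.na(r)

# NA > 0.4 is NA, and which() drops NA, so nothing is selected.
keep <- which(r > 0.4)
length(keep)

# Subsetting with an empty index gives an empty numeric vector.
r[keep]
```

Since every correlation involving hello is NA, the subset is empty for any corlimit, which is what shows up as numeric(0).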
