Finding 2 & 3 Word Phrases Using the R tm Package

Using the tm package to mine PDFs for two- and three-word phrases

Here is a way to get what you want using the tm package together with RWeka. You need to create a separate tokenizer function that you plug into the DocumentTermMatrix function. RWeka plays very nicely with tm for this.

If you don't want to install RWeka because of its Java dependencies, you can use another package such as tidytext or quanteda. If you need speed because of the size of your data, I advise using the quanteda package (example below the tm code). quanteda runs in parallel, and with quanteda_options() you can specify how many cores you want to use (2 cores is the default).

Note: the unigrams and bigrams in your dictionary overlap. In the example below you will see that in text 127, "prices" (3) and "contract prices" (1) double count the occurrences of "prices".

library(tm)
library(RWeka)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")


# adjust to min = 2 and max = 3 for 2- and 3-word ngrams
RWeka_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 2))
}

dtm <- DocumentTermMatrix(crude, control = list(tokenize = RWeka_tokenizer,
                                                dictionary = my_words))

# create a data.frame from the DocumentTermMatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm),
                  row.names = NULL, check.names = FALSE)

If you have a big corpus and need speed, quanteda might be better:

library(quanteda)

corp_crude <- corpus(crude)
# adjust n to 2:3 for 2- and 3-word ngrams
toks_crude <- tokens(corp_crude)
toks_crude <- tokens_ngrams(toks_crude, n = 1:2, concatenator = " ")
toks_crude <- tokens_keep(toks_crude, pattern = dictionary(list(words = my_words)), valuetype = "fixed")
dfm_crude <- dfm(toks_crude)
df1 <- convert(dfm_crude, to = "data.frame")
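If you prefer the tidytext route mentioned above, a minimal sketch (not part of the original answer) could look like the following; the data-frame construction and the column names doc and ngram are just illustrative:

library(dplyr)
library(tidytext)

# one row per document; paste multi-line documents into a single string
texts_df <- tibble(doc = names(crude),
                   text = sapply(crude, function(d) paste(as.character(d), collapse = " ")))

texts_df %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%  # use n = 3 for trigrams
  filter(ngram %in% my_words) %>%  # only the two-word entries of my_words can match here
  count(doc, ngram, sort = TRUE)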

Text Mining in R: Counting 2-3 word phrases

Removing stopwords can remove noise from the data, causing issues such as those you are having above:

library(tm)
library(corpus)
library(dplyr)
library(stringr)

corpus <- Corpus(VectorSource(gutenberg_corpus(55)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

term_stats(corpus, ngrams = 2:3) %>%
  arrange(desc(count)) %>%
  group_by(grp = str_extract(as.character(term), "\\w+\\s+\\w+")) %>%
  mutate(count_unique = ifelse(length(unique(count)) > 1, max(count) - min(count), count)) %>%
  ungroup() %>%
  select(-grp)

Find 2-word phrases using tm in R

I haven't found the reason either, but if you are only interested in the counts, regardless of which documents the bigrams occurred in, you could alternatively get them via this pipeline:

library(tm)
library(dplyr)
library(quanteda)

# ... construct the corpus as in your post ...

corpus %>%
  unlist() %>%
  tokens() %>%
  tokens_ngrams(n = 2, concatenator = " ") %>%
  unlist() %>%
  as.data.frame() %>%
  group_by(`.`) %>%
  summarize(cnt = n()) %>%
  arrange(desc(cnt))

Find multi-word strings in more than one document

Here you need a method for detecting collocations, which fortunately quanteda has in the form of textstat_collocations(). Once you have detected them, you can compound your tokens so that each phrase becomes a single "token", and then get their frequencies in the standard way.

You do not need to know the length in advance, but you do need to specify a range. Below, I've added some more text and included a size range from 2 to 3. This also picks up "criminal background checks" without being confused by the term "background", which also appears in the phrase "background work". (By default, detection is case-insensitive.)

library("quanteda")
## Package version: 2.1.0

text <- c(
  "Introduction Here you see something Related work another info here",
  "Introduction another text Background work something to now",
  "Background work is related to related work",
  "criminal background checks are useful",
  "The law requires criminal background checks"
)

colls <- textstat_collocations(text, size = 2:3)
colls
##                  collocation count count_nested length    lambda          z
## 1        criminal background     2            2      2  4.553877  2.5856967
## 2          background checks     2            2      2  4.007333  2.3794386
## 3               related work     2            2      2  2.871680  2.3412833
## 4            background work     2            2      2  2.322388  2.0862256
## 5 criminal background checks     2            0      3 -1.142097 -0.3426584

Here we can see that the phrases are being detected and distinguished. Now we can use tokens_compound to join them:

toks <- tokens(text) %>%
  tokens_compound(colls, concatenator = " ")

dfm(toks) %>%
  dfm_trim(min_termfreq = 2) %>%
  dfm_remove(stopwords("en")) %>%
  textstat_frequency()
##                      feature frequency rank docfreq group
## 1              introduction         2    1       2   all
## 2                 something         2    1       2   all
## 3                   another         2    1       2   all
## 4              related work         2    1       2   all
## 5           background work         2    1       2   all
## 6 criminal background checks         2    1       2   all

Text Mining - Count Frequencies of Phrases (more than one word)

I created the following function to obtain word n-grams and their corresponding frequencies:

library(tau)
library(data.table)

# given a string vector and the ngram size, this function returns
# the word ngrams with their corresponding frequencies
createNgram <- function(stringVector, ngramSize) {

  ng <- textcnt(stringVector, method = "string", n = ngramSize, tolower = FALSE)

  if (ngramSize == 1) {
    ngram <- data.table(w1 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  } else {
    ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  return(ngram)
}

Given a string like

text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."

Here is how to call the function for pairs of words; for phrases of length 3, pass 3 as the second argument:

res <- createNgram(text, 2)

Printing res outputs:

             w1w2 freq length
 1:        I want    2      6
 2:        R text    2      6
 3:       This is    2      7
 4:         and I    2      5
 5:        and is    1      6
 6:     count the    2      9
 7:   example and    2     11
 8:  frequency of    2     12
 9:         is my    3      5
10:      little R    2      8
11:     my little    2      9
12:         my of    1      5
13:       of This    1      7
14:       of some    2      7
15:   pattern and    1     11
16:   some patter    1     11
17:  some pattern    1     12
18:  text example    2     12
19: the frequency    2     13
20:      to count    2      8
21:       want to    2      7
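For three-word phrases, pass 3 as the second argument, e.g.:

# three-word phrases and their frequencies
res3 <- createNgram(text, 3)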

R tm package: select a huge number of words to keep in a text corpus

I looked at your requirements, and maybe a combination of tm and quanteda can help. See below.

Once you have a list of frequent words, you can use quanteda in parallel to get the bigrams.
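As a rough sketch of where that list of frequent words could come from on the tm side (the DocumentTermMatrix name and the lowfreq threshold below are placeholders, not part of the original answer):

library(tm)

# hypothetical: a DocumentTermMatrix built from your tm corpus,
# e.g. dtm <- DocumentTermMatrix(txt_corpus)
# keep every term that occurs at least 50 times (threshold is a placeholder)
frequent_words <- findFreqTerms(dtm, lowfreq = 50)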

library(quanteda)

# set number of threads
quanteda_options(threads = 4)

my_corp <- corpus(crude) # corpus from tm can be used here (txt_corpus)
my_toks <- tokens(my_corp, remove_punct = TRUE) # add extra removal if needed

# Use list of frequent words from tm.
# speed gain should occur here
my_toks <- tokens_keep(my_toks, frequent_words)

# ngrams, concatenator is _ by default
bitoks <- tokens_ngrams(my_toks)

textstat_frequency(dfm(bitoks)) # ordered from high to low

   feature frequency rank docfreq group
1    to_to        41    1      12   all
2    to_of        35    2      15   all
3   oil_to        33    3      17   all
4    to_in        32    4      12   all
5    of_to        29    5      14   all
6    in_to        28    6      11   all
7    in_of        21    7       8   all
8   to_oil        21    7      13   all
9    of_in        21    7      10   all
10  of_oil        20   10      14   all
11   of_of        20   10       8   all
12  in_oil        19   12      10   all
13  oil_in        18   13      11   all
14  oil_of        18   13      11   all
15   in_in        14   15       9   all
16 oil_oil        13   16      10   all

quanteda does have a topfeatures() function, but it doesn't work like tm's findFreqTerms(). Otherwise you could do it completely in quanteda.
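To illustrate the difference: topfeatures() simply returns the n most frequent features of a dfm, rather than filtering by a minimum frequency the way findFreqTerms() does (n = 20 below is only an example):

# the 20 most frequent bigram features, ordered from high to low
topfeatures(dfm(bitoks), n = 20)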

If the dfm generation takes too much memory, you can use as.character() to transform the tokens object and use the result either in dplyr or data.table. See the code below.

library(dplyr)

out_dp <- tibble(features = as.character(bitoks)) %>%
  group_by(features) %>%
  tally()

library(data.table)

out_dt <- data.table(features = as.character(bitoks))
out_dt <- out_dt[, .N, by = features]

