Finding 2 & 3 Word Phrases Using the R tm Package

Using the tm package to mine PDFs for two- and three-word phrases

Here is a way to get what you want using the tm package together with RWeka. You need to create a separate tokenizer function that you plug into the DocumentTermMatrix function. RWeka plays very nicely with tm for this.

If you don't want to install RWeka because of its Java dependencies, you can use another package such as tidytext or quanteda. If you need speed because of the size of your data, I advise using the quanteda package (example below the tm code). quanteda runs in parallel, and with quanteda_options() you can specify how many cores you want to use (2 cores is the default).

Note: the unigrams and bigrams in your dictionary overlap. In the example below you will see that in text 127, "prices" (3) and "contract prices" (1) double count the occurrences of "prices".

library(tm)
library(RWeka)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")


# adjust to min = 2 and max = 3 for 2- and 3-word ngrams
RWeka_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 2))
}

dtm <- DocumentTermMatrix(crude, control = list(tokenize = RWeka_tokenizer,
                                                dictionary = my_words))

# create a data.frame from the DocumentTermMatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm),
                  row.names = NULL, check.names = FALSE)

If you have a big corpus and need speed, quanteda might be better:

library(quanteda)

corp_crude <- corpus(crude)
# adjust n to 2:3 for 2- and 3-word ngrams
toks_crude <- tokens(corp_crude)
toks_crude <- tokens_ngrams(toks_crude, n = 1:2, concatenator = " ")
toks_crude <- tokens_keep(toks_crude, pattern = dictionary(list(words = my_words)), valuetype = "fixed")
dfm_crude <- dfm(toks_crude)
df1 <- convert(dfm_crude, to = "data.frame")
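If you prefer the tidytext route mentioned above, a minimal sketch (not part of the original answer) could look like the following; the data-frame construction and the column names doc and ngram are just illustrative:

library(dplyr)
library(tidytext)

# one row per document; paste multi-line documents into a single string
texts_df <- tibble(doc = names(crude),
                   text = sapply(crude, function(d) paste(as.character(d), collapse = " ")))

texts_df %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%  # use n = 3 for trigrams
  filter(ngram %in% my_words) %>%  # only the two-word entries of my_words can match here
  count(doc, ngram, sort = TRUE)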

Text Mining in R: Counting 2-3 word phrases

Removing stopwords can remove noise from the data, causing issues such as those you are having above:

library(tm)
library(corpus)
library(dplyr)
library(stringr)

corpus <- Corpus(VectorSource(gutenberg_corpus(55)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

term_stats(corpus, ngrams = 2:3) %>%
  arrange(desc(count)) %>%
  group_by(grp = str_extract(as.character(term), "\\w+\\s+\\w+")) %>%
  mutate(count_unique = ifelse(length(unique(count)) > 1, max(count) - min(count), count)) %>%
  ungroup() %>%
  select(-grp)

Find 2-word phrases using tm in R

I haven't found the reason either, but if you are only interested in the counts, regardless of which documents the bigrams occurred in, you could alternatively get them via this pipeline:

library(tm)
library(dplyr)
library(quanteda)

# ... construct the corpus as in your post ...

corpus %>%
  unlist() %>%
  tokens() %>%
  tokens_ngrams(n = 2, concatenator = " ") %>%
  unlist() %>%
  as.data.frame() %>%
  group_by(`.`) %>%
  summarize(cnt = n()) %>%
  arrange(desc(cnt))

Find multi-word strings in more than one document

Here you need a method for detecting collocations, which fortunately quanteda has in the form of textstat_collocations(). Once you have detected them, you can compound your tokens so that each phrase becomes a single "token", and then get their frequencies in the standard way.

You do not need to know the length in advance, but you do need to specify a range. Below, I've added some more text and included a size range from 2 to 3. This also picks up "criminal background checks" without being confused by the term "background", which also appears in the phrase "background work". (By default, detection is case-insensitive.)

library("quanteda")
## Package version: 2.1.0

text <- c(
  "Introduction Here you see something Related work another info here",
  "Introduction another text Background work something to now",
  "Background work is related to related work",
  "criminal background checks are useful",
  "The law requires criminal background checks"
)

colls <- textstat_collocations(text, size = 2:3)
colls
##                  collocation count count_nested length    lambda          z
## 1        criminal background     2            2      2  4.553877  2.5856967
## 2          background checks     2            2      2  4.007333  2.3794386
## 3               related work     2            2      2  2.871680  2.3412833
## 4            background work     2            2      2  2.322388  2.0862256
## 5 criminal background checks     2            0      3 -1.142097 -0.3426584

Here we can see that the phrases are being detected and distinguished. Now we can use tokens_compound to join them:

toks <- tokens(text) %>%
  tokens_compound(colls, concatenator = " ")

dfm(toks) %>%
  dfm_trim(min_termfreq = 2) %>%
  dfm_remove(stopwords("en")) %>%
  textstat_frequency()
##                      feature frequency rank docfreq group
## 1              introduction         2    1       2   all
## 2                 something         2    1       2   all
## 3                   another         2    1       2   all
## 4              related work         2    1       2   all
## 5           background work         2    1       2   all
## 6 criminal background checks         2    1       2   all

Text Mining - Count Frequencies of Phrases (more than one word)

I created the following function to obtain word n-grams and their corresponding frequencies:

library(tau)
library(data.table)

# given a string vector and the ngram size, this function returns
# the word ngrams with their corresponding frequencies
createNgram <- function(stringVector, ngramSize) {

  ng <- textcnt(stringVector, method = "string", n = ngramSize, tolower = FALSE)

  if (ngramSize == 1) {
    ngram <- data.table(w1 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  } else {
    ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  return(ngram)
}

Given a string like

text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."

Here is how to call the function for pairs of words; for phrases of length 3, pass 3 as the second argument:

res <- createNgram(text, 2)

Printing res outputs:

             w1w2 freq length
 1:        I want    2      6
 2:        R text    2      6
 3:       This is    2      7
 4:         and I    2      5
 5:        and is    1      6
 6:     count the    2      9
 7:   example and    2     11
 8:  frequency of    2     12
 9:         is my    3      5
10:      little R    2      8
11:     my little    2      9
12:         my of    1      5
13:       of This    1      7
14:       of some    2      7
15:   pattern and    1     11
16:   some patter    1     11
17:  some pattern    1     12
18:  text example    2     12
19: the frequency    2     13
20:      to count    2      8
21:       want to    2      7
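For three-word phrases, pass 3 as the second argument, e.g.:

# three-word phrases and their frequencies
res3 <- createNgram(text, 3)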

R tm package: select a huge number of words to keep in a text corpus

I looked at your requirements, and maybe a combination of tm and quanteda can help. See below.

Once you have a list of frequent words, you can use quanteda in parallel to get the bigrams.
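As a rough sketch of where that list of frequent words could come from on the tm side (the DocumentTermMatrix name and the lowfreq threshold below are placeholders, not part of the original answer):

library(tm)

# hypothetical: a DocumentTermMatrix built from your tm corpus,
# e.g. dtm <- DocumentTermMatrix(txt_corpus)
# keep every term that occurs at least 50 times (threshold is a placeholder)
frequent_words <- findFreqTerms(dtm, lowfreq = 50)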

library(quanteda)

# set number of threads
quanteda_options(threads = 4)

my_corp <- corpus(crude) # corpus from tm can be used here (txt_corpus)
my_toks <- tokens(my_corp, remove_punct = TRUE) # add extra removal if needed

# Use list of frequent words from tm.
# speed gain should occur here
my_toks <- tokens_keep(my_toks, frequent_words)

# ngrams, concatenator is _ by default
bitoks <- tokens_ngrams(my_toks)

textstat_frequency(dfm(bitoks)) # ordered from high to low

   feature frequency rank docfreq group
1    to_to        41    1      12   all
2    to_of        35    2      15   all
3   oil_to        33    3      17   all
4    to_in        32    4      12   all
5    of_to        29    5      14   all
6    in_to        28    6      11   all
7    in_of        21    7       8   all
8   to_oil        21    7      13   all
9    of_in        21    7      10   all
10  of_oil        20   10      14   all
11   of_of        20   10       8   all
12  in_oil        19   12      10   all
13  oil_in        18   13      11   all
14  oil_of        18   13      11   all
15   in_in        14   15       9   all
16 oil_oil        13   16      10   all

quanteda does have a topfeatures() function, but it doesn't work like tm's findFreqTerms(). Otherwise you could do it completely in quanteda.
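To illustrate the difference: topfeatures() simply returns the n most frequent features of a dfm, rather than filtering by a minimum frequency the way findFreqTerms() does (n = 20 below is only an example):

# the 20 most frequent bigram features, ordered from high to low
topfeatures(dfm(bitoks), n = 20)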

If the dfm generation takes too much memory, you can use as.character() to transform the tokens object and use the result either in dplyr or data.table. See the code below.

library(dplyr)

out_dp <- tibble(features = as.character(bitoks)) %>%
  group_by(features) %>%
  tally()

library(data.table)

out_dt <- data.table(features = as.character(bitoks))
out_dt <- out_dt[, .N, by = features]

