More Efficient Means of Creating a Corpus and Dtm with 4M Rows


I think you may want to consider a more regex-focused solution. These are some of the problems I'm wrestling with as a developer. I'm currently looking heavily at the stringi package for development, as it has consistently named functions that are wicked fast for string manipulation.

In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). Here I haven't even explored parallel processing or data.table/dplyr and instead focus on string manipulation with stringi and keeping the data in a matrix and manipulating with specific packages meant to handle that format. I take your example and multiply it 100000x. Even with stemming, this takes 17 seconds on my machine.

data <- data.frame(
    text=c("Let the big dogs hunt",
        "No holds barred",
        "My child is an honor student"
    ), stringsAsFactors = FALSE)

## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop=FALSE]

library(stringi)
library(SnowballC)
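## tokenise first, then stem each token: wordStem() expects single words, not whole sentences
## (in old stringi versions stri_extract_all_words was named 'stri_extract_words')
out <- lapply(stri_extract_all_words(stri_trans_tolower(data[[1]])),
    SnowballC::wordStem, language = "english")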
names(out) <- paste0("doc", 1:length(out))

lev <- sort(unique(unlist(out)))
dat <- do.call(cbind, lapply(out, function(x, lev) {
    tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- sort(lev)

library(tm)
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ] 

library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)

tdm <- tm::as.TermDocumentMatrix(dat2, weighting=weightTf)
tdm

## or...
dtm <- tm::as.DocumentTermMatrix(dat2, weighting=weightTf)
dtm
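
The dense cbind() step is where memory goes, since it allocates one cell per term per document. Here is a minimal base R sketch of an alternative that counts only the (term, document) pairs that actually occur. The toy documents and the names docs, ti, tj and tv are hypothetical. The resulting triplets are in the i/j/v form that slam::simple_triplet_matrix() accepts, though I have not timed this against the approach above.

```r
# Toy tokenised documents; hypothetical stand-ins for a tokenised corpus
docs <- list(doc1 = c("let", "the", "big", "dog", "hunt"),
             doc2 = c("no", "hold", "bar"),
             doc3 = c("my", "child", "is", "an", "honor", "student", "student"))

lev <- sort(unique(unlist(docs)))
i <- match(unlist(docs), lev)              # term index of every token
j <- rep(seq_along(docs), lengths(docs))   # document index of every token

# Encode each (term, document) pair as one number; doubles avoid integer overflow
key <- (j - 1) * length(lev) + i
runs <- rle(sort(key))                     # run lengths are the cell counts

ti <- (runs$values - 1) %% length(lev) + 1    # term index of each non-zero cell
tj <- (runs$values - 1) %/% length(lev) + 1   # document index of each non-zero cell
tv <- runs$lengths                            # count in each non-zero cell

data.frame(term = lev[ti], doc = names(docs)[tj], count = tv)
```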

R - slowly working lapply with sort on ordered factor

For now I've sped it up by replacing

sort(as.integer(factor(x, levels = lev, ordered = TRUE)))

with

ind = which(lev %in% x)
cnt = as.integer(factor(x, levels = lev[ind], ordered = TRUE))
sort(ind[cnt])

Now timings are:

expr      min       lq     mean   median       uq      max neval
...  5.248479 6.202161 6.892609 6.501382 7.313061 10.17205   100
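
A quick way to convince yourself the replacement is safe is to check that both forms return the same integers. The sketch below does that on made-up data (lev and x are hypothetical), and also tries sort(match(x, lev)), a base R alternative that skips factor() altogether and may be cheaper still. That last point is an assumption worth benchmarking on your own data.

```r
set.seed(42)
lev <- sprintf("term%04d", 1:5000)             # hypothetical sorted vocabulary
x <- sample(lev[1:200], 50, replace = TRUE)    # one document using few of the levels

# Original form: factor over the full vocabulary
a <- sort(as.integer(factor(x, levels = lev, ordered = TRUE)))

# Replacement: factor over only the levels that occur in x
ind <- which(lev %in% x)
cnt <- as.integer(factor(x, levels = lev[ind], ordered = TRUE))
b <- sort(ind[cnt])

# Alternative without factor(): look up positions directly
d <- sort(match(x, lev))

identical(a, b)
identical(a, d)
```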

Convert Corpus from quanteda to tm

You can construct a tm VCorpus directly by wrapping the quanteda corpus in VectorSource() and passing the result to VCorpus(), because a quanteda corpus is just a special character vector.

library("tm")
## Loading required package: NLP

# from version 3.0 of quanteda
data(data_corpus_inaugural, package = "quanteda")

VCorpus(VectorSource(data_corpus_inaugural))
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 59

However... Do you really want/need to do this?

R: sparse matrix multiplication with data.table and quanteda package?

This works just fine:

mytext <- c("Let the big dogs hunt", 
            "No holds barred", 
            "My child is an honor student")     
myMatrix <- dfm(mytext)

myMatrix %*% t(myMatrix)
## 3 x 3 sparse Matrix of class "dgCMatrix"
##       text1 text2 text3
## text1     5     .     .
## text2     .     3     .
## text3     .     .     6

No need to coerce to a dense matrix using as.matrix(). Note that it is no longer a "dfmSparse" object because it's no longer a matrix of documents by features.
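
If you want to see the same behaviour without quanteda, here is a small sketch using only the Matrix package. It builds the document-feature matrix by hand, which is an assumption for illustration and not what dfm() does internally, and then multiplies it by its transpose.

```r
library(Matrix)

mytext <- c(text1 = "Let the big dogs hunt",
            text2 = "No holds barred",
            text3 = "My child is an honor student")

toks  <- strsplit(tolower(mytext), " ", fixed = TRUE)
feats <- unique(unlist(toks))

# Documents in rows, features in columns; repeated (i, j) pairs are summed
m <- sparseMatrix(i = rep(seq_along(toks), lengths(toks)),
                  j = match(unlist(toks), feats),
                  x = 1,
                  dims = c(length(toks), length(feats)),
                  dimnames = list(names(mytext), feats))

sim  <- m %*% t(m)       # the product stays sparse
sim2 <- tcrossprod(m)    # same product without an explicit transpose
sim
```

Because the three texts share no words, only the diagonal is non-zero, and each diagonal entry is that text's token count.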

Create dfm step by step with quanteda

We designed dfm() not as a "black box" but more as a Swiss army knife that combines many of the options that typical users want to apply when converting their texts to a matrix of documents and features. However, all of these options are also available through lower-level processing commands, should you wish to exert a finer level of control.

However, one of the design principles of quanteda is that text only becomes "features" through the process of tokenisation. If you have a set of tokenised features that you wish to exclude, you must first tokenise your text, or you cannot exclude them. Unlike other text packages for R (e.g. tm), these steps are applied "downstream" from a corpus, so that the corpus remains an unprocessed set of texts to which manipulations will be applied (but will not itself be a transformed set of texts). The purpose of this is to preserve generality but also to promote reproducibility and transparency in text analysis.

In response to your questions:

1. You can however override our encouraged behaviour using the texts(myCorpus) <- function, where what is assigned to the texts will override the existing texts. So you could use regular expressions to remove punctuation and numbers -- for example with the stringi commands, using the Unicode classes for punctuation and numerals to identify patterns.
2. I would recommend you tokenise before removing stopwords. Stop "words" are tokens, so there is no way to remove these from the text before you tokenise the text. Even applying regular expressions to substitute them for "" involves specifying some form of word boundary in the regex - again, this is tokenisation.
3. To tokenise into unigrams and bigrams:
   tokens(myCorpus, ngrams = 1:2)
4. To create the dfm, simply call dfm(myTokens). (You could also have applied step 3, for ngrams, at this stage.)
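
To make the order of operations concrete, here is a base R sketch of the same four steps. It is a stand-in for the idea only, not the quanteda API: the texts, the stop word list and all object names are hypothetical, and it uses POSIX character classes where stringi would let you use the Unicode classes mentioned above.

```r
txt <- c(d1 = "The 2 big dogs hunt, and the cat sleeps.",
         d2 = "No holds barred!")
stopw <- c("the", "and", "no")   # hypothetical, tiny stop word list

# 1. Strip punctuation and digits from the raw text
clean <- gsub("[[:punct:][:digit:]]+", "", tolower(txt))

# 2. Tokenise on whitespace, then drop stop words as whole tokens
toks <- strsplit(trimws(clean), "[[:space:]]+")
toks <- lapply(toks, function(x) x[!x %in% stopw])

# 3. Form bigrams from adjacent remaining tokens
bigrams <- lapply(toks, function(x) {
  if (length(x) < 2) return(character(0))
  paste(head(x, -1), tail(x, -1), sep = "_")
})

# 4. Tabulate unigrams and bigrams per document
lapply(Map(c, toks, bigrams), table)
```

Note that step 2 only works because the text has already been split into tokens, which is the point of item 2.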

Bonus 1: n=2 collocations produces the same list as bigrams, except in a different format. Did you intend something else? (Separate SO question perhaps?)

Bonus 2: See dfm_trim(x, sparsity = ). The removeSparseTerms() options are quite confusing to most people, but this is included for migrants from tm. See this post for a full explanation.
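
In case the notion of sparsity itself is the confusing part, here is a base R sketch on a toy matrix. It assumes the tm convention that a feature's sparsity is the share of documents in which it does not occur, and that features above the threshold are dropped. The matrix and the 0.5 threshold are made up for illustration.

```r
# Toy document-feature counts: 3 documents, 4 features (hypothetical)
m <- matrix(c(1, 0, 0, 0,
              1, 1, 0, 0,
              1, 1, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), c("a", "b", "c", "d")))

# Sparsity of a feature: proportion of documents where its count is zero
sparsity <- colMeans(m == 0)
sparsity

# Keep only features whose sparsity does not exceed the threshold
threshold <- 0.5
kept <- m[, sparsity <= threshold, drop = FALSE]
colnames(kept)
```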

BTW: Use texts() instead of ie2010Corpus$documents$texts -- we will rewrite the object structure of a corpus soon, so you should not access its internals this way when there is an extractor function. (Also, this step is unnecessary - here you have simply recreated the corpus.)

Update 2018-01:

The new name for the corpus object is data_corpus_irishbudget2010, and the collocation scoring function is textstat_collocations().

Naive Bayes in Quanteda vs caret: wildly different results

The answer is that caret (which uses naive_bayes from the naivebayes package) assumes a Gaussian distribution, whereas quanteda::textmodel_nb() is based on a more text-appropriate multinomial distribution (with the option of a Bernoulli distribution as well).

The documentation for textmodel_nb() replicates the example from the IIR book (Manning, Raghavan, and Schütze 2008) and a further example from Jurafsky and Martin (2018) is also referenced. See:

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. An Introduction to Information Retrieval. Cambridge University Press (Chapter 13). https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Jurafsky, Daniel, and James H. Martin. 2018. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of 3rd edition, September 23, 2018 (Chapter 4). https://web.stanford.edu/~jurafsky/slp3/4.pdf
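
To see what the multinomial model actually computes, here is a base R sketch on the small "China" example from Chapter 13 of the IIR book, with add-one (Laplace) smoothing. This is a hand-rolled illustration, not the textmodel_nb() implementation, and the object names are my own.

```r
# Training documents and classes from the IIR Chapter 13 toy example
train <- list(c("Chinese", "Beijing", "Chinese"),
              c("Chinese", "Chinese", "Shanghai"),
              c("Chinese", "Macao"),
              c("Tokyo", "Japan", "Chinese"))
cls  <- c("yes", "yes", "yes", "no")
test <- c("Chinese", "Chinese", "Chinese", "Tokyo", "Japan")

vocab   <- sort(unique(unlist(train)))
classes <- unique(cls)

# Class priors: share of training documents in each class
prior <- sapply(classes, function(k) mean(cls == k))

# Smoothed P(term | class): (count + 1) / (tokens in class + vocabulary size)
cond <- sapply(classes, function(k) {
  counts <- as.vector(table(factor(unlist(train[cls == k]), levels = vocab)))
  setNames((counts + 1) / (sum(counts) + length(vocab)), vocab)
})

# Unnormalised posterior: prior times the product of the token likelihoods
score <- sapply(classes, function(k) prior[[k]] * prod(cond[test, k]))
score
names(which.max(score))
```

Word counts enter the model directly here, which is why it suits text better than fitting a Gaussian to each feature column.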

Another package, e1071, produces the same results you found as it is also based on a Gaussian distribution.

library("e1071")
nb_e1071 <- naiveBayes(x = training_m,
                       y = as.factor(docvars(training_dfm, "Sentiment")))
nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
table(actual_class, nb_e1071_pred)
##             nb_e1071_pred
## actual_class neg pos
##          neg 246   3
##          pos 249   2

However both caret and e1071 work on dense matrices, which is one reason they are so mind-numbingly slow compared to the quanteda approach which operates on the sparse dfm. So from the standpoint of appropriateness, efficiency, and (as per your results) the performance of the classifier, it should be pretty clear which one is preferred!

library("rbenchmark")
benchmark(
    quanteda = { 
        nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))
        predicted_class <- predict(nb_quanteda, newdata = test_dfm)
    },
    caret = {
        nb_caret <- train(x = training_m,
                          y = as.factor(docvars(training_dfm, "Sentiment")),
                          method = "naive_bayes",
                          trControl = trainControl(method = "none"),
                          tuneGrid = data.frame(laplace = 1,
                                                usekernel = FALSE,
                                                adjust = FALSE),
                          verbose = FALSE)
        predicted_class_caret <- predict(nb_caret, newdata = test_m)
    },
    e1071 = {
        nb_e1071 <- naiveBayes(x = training_m,
                       y = as.factor(docvars(training_dfm, "Sentiment")))
        nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
    },
    replications = 1
)
##       test replications elapsed relative user.self sys.self user.child sys.child
## 2    caret            1  29.042  123.583    25.896    3.095          0         0
## 3    e1071            1 217.177  924.157   215.587    1.169          0         0
## 1 quanteda            1   0.235    1.000     0.213    0.023          0         0
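
Part of that gap is simply the size of the objects being handled. As a rough sketch with the Matrix package, on simulated data whose dimensions are arbitrary assumptions, compare the memory footprint of a sparse matrix with its dense equivalent:

```r
library(Matrix)

set.seed(1)
n_docs  <- 1000   # hypothetical number of documents
n_feats <- 2000   # hypothetical number of features
nnz     <- 5000   # number of randomly placed entries

sp <- sparseMatrix(i = sample(n_docs, nnz, replace = TRUE),
                   j = sample(n_feats, nnz, replace = TRUE),
                   x = 1,
                   dims = c(n_docs, n_feats))
dn <- as.matrix(sp)   # dense copy: one double per cell, zeros included

size_sparse <- as.numeric(object.size(sp))
size_dense  <- as.numeric(object.size(dn))
c(sparse = size_sparse, dense = size_dense, ratio = size_dense / size_sparse)
```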

Really fast word ngram vectorization in R

This is a really interesting problem, and one that I have spent a lot of time grappling with in the quanteda package. It involves three aspects that I will comment on, although it's only the third that really addresses your question. But the first two points explain why I have only focused on the ngram creation function, since -- as you point out -- that is where the speed improvement can be made.

1. Tokenization. Here you are using stringi::stri_split_fixed() on the space character, which is the fastest, but not the best method for tokenizing. We implemented this almost exactly the same way in quanteda::tokenize(x, what = "fastestword"). It's not the best because stringi can do much smarter implementations of whitespace delimiters. (Even the character class \\s is smarter, but slightly slower -- this is implemented as what = "fasterword"). Your question was not about tokenization though, so this point is just context.
2. Tabulating the document-feature matrix. Here we also use the Matrix package, and index the documents and features (I call them features, not terms), and create a sparse matrix directly as you do in the code above. But your use of match() is a lot faster than the match/merge methods we were using through data.table. I am going to recode the quanteda::dfm() function since your method is more elegant and faster. Really, really glad I saw this!
3. ngram creation. Here I think I can actually help in terms of performance. We implement this in quanteda through an argument to quanteda::tokenize(), called ngrams = c(1), where the value can be any integer set. The equivalent for unigrams and bigrams would be ngrams = 1:2, for instance. You can examine the code at https://github.com/kbenoit/quanteda/blob/master/R/tokenize.R, see the internal function ngram(). I've reproduced this below and made a wrapper so that we can directly compare it to your find_ngrams() function.

Code:

# wrapper
find_ngrams2 <- function(x, ngrams = 1, concatenator = " ") { 
    if (sum(1:length(ngrams)) == sum(ngrams)) {
        result <- lapply(x, ngram, n = length(ngrams), concatenator = concatenator, include.all = TRUE)
    } else {
        result <- lapply(x, function(x) {
            xnew <- c()
            for (n in ngrams) 
                xnew <- c(xnew, ngram(x, n, concatenator = concatenator, include.all = FALSE))
            xnew
        })
    }
    result
}

# does the work
ngram <- function(tokens, n = 2, concatenator = "_", include.all = FALSE) {

    if (length(tokens) < n) 
        return(NULL)

    # start with lower ngrams, or just the specified size if include.all = FALSE
    start <- ifelse(include.all, 
                    1, 
                    ifelse(length(tokens) < n, 1, n))

    # set max size of ngram at max length of tokens
    end <- ifelse(length(tokens) < n, length(tokens), n)

    all_ngrams <- c()
    # outer loop for all ngrams down to 1
    for (width in start:end) {
        new_ngrams <- tokens[1:(length(tokens) - width + 1)]
        # inner loop for ngrams of width > 1
        if (width > 1) {
            for (i in 1:(width - 1)) 
                new_ngrams <- paste(new_ngrams, 
                                    tokens[(i + 1):(length(tokens) - width + 1 + i)], 
                                    sep = concatenator)
        }
        # paste onto previous results and continue
        all_ngrams <- c(all_ngrams, new_ngrams)
    }

    all_ngrams
}
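
The core trick in ngram() is to paste shifted slices of the token vector together instead of looping over token positions. Here is a stripped-down sketch of just that idea for a single n. The function name shift_ngrams is hypothetical and this is not the package code.

```r
shift_ngrams <- function(tokens, n = 2, sep = "_") {
  len <- length(tokens)
  if (len < n) return(character(0))
  # n shifted views of the token vector, each of length len - n + 1
  slices <- lapply(seq_len(n), function(k) tokens[k:(len - n + k)])
  # paste the views element-wise: one vectorised call per n
  do.call(paste, c(slices, sep = sep))
}

toks <- c("the", "quick", "brown", "fox")
shift_ngrams(toks, 2)
shift_ngrams(toks, 3)
```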

Here is the comparison for a simple text:

txt <- c("The quick brown fox named Seamus jumps over the lazy dog.", 
         "The dog brings a newspaper from a boy named Seamus.")
tokens <- tokenize(toLower(txt), removePunct = TRUE)
tokens
# [[1]]
# [1] "the"    "quick"  "brown"  "fox"    "named"  "seamus" "jumps"  "over"   "the"    "lazy"   "dog"   
# 
# [[2]]
# [1] "the"       "dog"       "brings"    "a"         "newspaper" "from"      "a"         "boy"       "named"     "seamus"   
# 
# attr(,"class")
# [1] "tokenizedTexts" "list"     

microbenchmark::microbenchmark(zach_ng <- find_ngrams(tokens, 2),
                               ken_ng <- find_ngrams2(tokens, 1:2))
# Unit: microseconds
#                                expr     min       lq     mean   median       uq     max neval
#   zach_ng <- find_ngrams(tokens, 2) 288.823 326.0925 433.5831 360.1815 542.9585 897.469   100
# ken_ng <- find_ngrams2(tokens, 1:2)  74.216  87.5150 130.0471 100.4610 146.3005 464.794   100

str(zach_ng)
# List of 2
# $ : chr [1:21] "the" "quick" "brown" "fox" ...
# $ : chr [1:19] "the" "dog" "brings" "a" ...
str(ken_ng)
# List of 2
# $ : chr [1:21] "the" "quick" "brown" "fox" ...
# $ : chr [1:19] "the" "dog" "brings" "a" ...

For your really large, simulated text, here is the comparison:

tokens <- stri_split_fixed(sents1, ' ')
zach_ng1_t1 <- system.time(zach_ng1 <- find_ngrams(tokens, 2))
ken_ng1_t1 <- system.time(ken_ng1 <- find_ngrams2(tokens, 1:2))
zach_ng1_t1
#    user  system elapsed 
# 230.176   5.243 246.389 
ken_ng1_t1
#   user  system elapsed 
# 58.264   1.405  62.889

That is already an improvement, but I'd be delighted if it could be improved further. I also should be able to implement the faster dfm() method into quanteda so that you can get what you want simply through:

dfm(sents1, ngrams = 1:2, what = "fastestword",
    toLower = FALSE, removePunct = FALSE, removeNumbers = FALSE, removeTwitter = TRUE)

(That already works but is slower than your overall result, because the way you create the final sparse matrix object is faster - but I will change this soon.)
