Naive Bayes in Quanteda vs caret: wildly different results

The answer is that caret (which uses naive_bayes from the naivebayes package) assumes a Gaussian distribution, whereas quanteda::textmodel_nb() is based on a more text-appropriate multinomial distribution (with the option of a Bernoulli distribution as well).
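
For illustration, here is a minimal sketch of fitting both text-appropriate distributions (this assumes the training_dfm object and its "Sentiment" docvar from your question; in recent quanteda versions, textmodel_nb() lives in the companion quanteda.textmodels package):

## multinomial is the default; "Bernoulli" models binary term occurrence per document
nb_multi <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"),
                         distribution = "multinomial")
nb_bern <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"),
                        distribution = "Bernoulli")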

The documentation for textmodel_nb() replicates the worked example from the IIR book (Manning, Raghavan, and Schütze 2008), and a further example from Jurafsky and Martin (2018) is also referenced. See:

  • Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. An Introduction to Information Retrieval. Cambridge University Press (Chapter 13). https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

  • Jurafsky, Daniel, and James H. Martin. 2018. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of 3rd edition, September 23, 2018 (Chapter 4). https://web.stanford.edu/~jurafsky/slp3/4.pdf

Another package, e1071, produces the same results you found, as it is also based on a Gaussian distribution.

library("e1071")
nb_e1071 <- naiveBayes(x = training_m,
y = as.factor(docvars(training_dfm, "Sentiment")))
nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
table(actual_class, nb_e1071_pred)
## nb_e1071_pred
## actual_class neg pos
## neg 246 3
## pos 249 2

However, both caret and e1071 work on dense matrices, which is one reason they are so mind-numbingly slow compared to the quanteda approach, which operates on the sparse dfm. So from the standpoint of appropriateness, efficiency, and (as per your results) the performance of the classifier, it should be pretty clear which one is preferred!
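
The benchmark below assumes the dense training_m and test_m matrices already exist; presumably they were built from the sparse dfms along these lines (a sketch, since the question does not show this step, and itself a slow, memory-hungry conversion):

training_m <- as.matrix(training_dfm)
test_m <- as.matrix(test_dfm)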

library("rbenchmark")
benchmark(
  quanteda = {
    nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))
    predicted_class <- predict(nb_quanteda, newdata = test_dfm)
  },
  caret = {
    nb_caret <- train(x = training_m,
                      y = as.factor(docvars(training_dfm, "Sentiment")),
                      method = "naive_bayes",
                      trControl = trainControl(method = "none"),
                      tuneGrid = data.frame(laplace = 1,
                                            usekernel = FALSE,
                                            adjust = FALSE),
                      verbose = FALSE)
    predicted_class_caret <- predict(nb_caret, newdata = test_m)
  },
  e1071 = {
    nb_e1071 <- naiveBayes(x = training_m,
                           y = as.factor(docvars(training_dfm, "Sentiment")))
    nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
  },
  replications = 1
)
## test replications elapsed relative user.self sys.self user.child sys.child
## 2 caret 1 29.042 123.583 25.896 3.095 0 0
## 3 e1071 1 217.177 924.157 215.587 1.169 0 0
## 1 quanteda 1 0.235 1.000 0.213 0.023 0 0

Trying to use the Naive Bayes Learner in R but predict() giving different results than model would suggest

Actually, the predict function works just fine. Don't get me wrong, but the problem is in what you are doing. You are building the model using the formula type ~ ., right? It is clear what we have on the left-hand side of the formula, so let's look at the right-hand side.

In your data you have only two variables, type and statement, and because type is the dependent variable, the only thing that counts as an independent variable is statement. So far everything is clear.

Let's take a look at the Bayesian classifier. The a priori probabilities are obvious, right? What about the conditional probabilities? From the classifier's point of view, you have only one categorical variable (your sentences), and to the classifier it is just a list of labels. All of them are unique, so the a posteriori probabilities will be close to the a priori ones.

In other words, the only thing we can tell when we get a new observation is that the probability of it being spam is equal to the probability of a message being spam in your training set.

If you want to use any machine learning method to work with natural language, you have to pre-process your data first. Depending on your problem, that could mean, for example, stemming, lemmatization, computing n-gram statistics, or tf-idf weighting. Training the classifier is the last step.
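
To make that concrete, here is a minimal sketch of such a pipeline using quanteda (the data frame df and its columns are hypothetical, matching the question's description; in recent quanteda versions textmodel_nb() lives in the quanteda.textmodels package):

library("quanteda")
library("quanteda.textmodels")

## hypothetical data frame matching the question: columns 'statement' and 'type'
toks <- tokens(df$statement, remove_punct = TRUE)
toks <- tokens_wordstem(toks)          # stem each token
dfmat <- dfm(toks)                     # documents x word-feature counts
model <- textmodel_nb(dfmat, df$type)  # train on word features, not raw sentences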

Why are results different for column / row of a Quanteda freq. co-occurence matrix?

This is because by default, fcm() returns only the upper triangle of the symmetric co-occurrence matrix (symmetric when ordered = FALSE). To make the two index slices equivalent, you would need to specify tri = FALSE.

library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

toks <- tokens(c("a a a b b c", "a a c e", "a c e f g"))

# default is only upper triangle
fcm(toks, context = "window", window = 2, tri = TRUE)
## Feature co-occurrence matrix of: 6 by 6 features.
## features
## features a b c e f g
## a 8 3 3 2 0 0
## b 0 2 2 0 0 0
## c 0 0 0 2 1 0
## e 0 0 0 0 1 1
## f 0 0 0 0 0 1
## g 0 0 0 0 0 0

Setting tri = FALSE makes it symmetric, in which case the index slicing is the same:

fcmat2 <- fcm(toks, context = "window", window = 2, tri = FALSE)
fcmat2
## Feature co-occurrence matrix of: 6 by 6 features.
## features
## features a b c e f g
## a 8 3 3 2 0 0
## b 3 2 2 0 0 0
## c 3 2 0 2 1 0
## e 2 0 2 0 1 1
## f 0 0 1 1 0 1
## g 0 0 0 1 1 0

fcmat2[, "a"]
## Feature co-occurrence matrix of: 6 by 1 features.
## features
## features a
## a 8
## b 3
## c 3
## e 2
## f 0
## g 0
t(fcmat2["a", ])
## Feature co-occurrence matrix of: 6 by 1 features.
## features
## features a
## a 8
## b 3
## c 3
## e 2
## f 0
## g 0

How is PcGw computed in quanteda's Naive Bayes?

The application is clearly explained in the book chapter you cite, but in essence, the difference is that PcGw is the "probability of the class given the word", while PwGc is the "probability of the word given the class". The former is the posterior, and it is what we need for computing the probability of class membership for a group of words using the joint probability (in quanteda, this is applied via the predict() function). The latter is simply the likelihood that comes from the relative frequencies of the features in each class, smoothed by default by adding one to the counts in each class.

You can verify this if you want as follows. First, group the training documents by training class, and then smooth them.

trainingset_bygroup <- dfm_group(trainingset[1:4, ], trainingclass[-5]) %>%
  dfm_smooth(smoothing = 1)
trainingset_bygroup
# Document-feature matrix of: 2 documents, 6 features (0.0% sparse).
# 2 x 6 sparse Matrix of class "dfm"
# features
# docs Chinese Beijing Shanghai Macao Tokyo Japan
# N 2 1 1 1 2 2
# Y 6 2 2 2 1 1

Then you can see that the (smoothed) word likelihoods are the same as PwGc.

trainingset_bygroup / rowSums(trainingset_bygroup)
# Document-feature matrix of: 2 documents, 6 features (0.0% sparse).
# 2 x 6 sparse Matrix of class "dfm"
# features
# docs Chinese Beijing Shanghai Macao Tokyo Japan
# N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
# Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857

tmod1$PwGc
# features
# classes Chinese Beijing Shanghai Macao Tokyo Japan
# N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
# Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857

But you probably care more about P(class|word), since that is what Bayes' formula is all about: it incorporates the prior class probabilities P(c).
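
To sketch that arithmetic (an assumption here: that tmod1 was fit with prior = "docfreq", matching the IIR worked example where P(Y) = 3/4 and P(N) = 1/4; with a different prior the numbers change accordingly):

Pc <- c(N = 1/4, Y = 3/4)  # assumed document-frequency priors
PwGc <- as.matrix(trainingset_bygroup / rowSums(trainingset_bygroup))
joint <- PwGc * Pc                    # P(w|c) * P(c), one row per class
PcGw <- t(t(joint) / colSums(joint))  # normalise over classes for each word
PcGw                                  # should match tmod1$PcGw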

More efficient means of creating a corpus and DTM with 4M rows

I think you may want to consider a more regex-focused solution. These are some of the problems/thinking I'm wrestling with as a developer. I'm currently looking closely at the stringi package for development, as it has some consistently named functions that are wicked fast for string manipulation.

In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). Here I haven't even explored parallel processing or data.table/dplyr; instead I focus on string manipulation with stringi, keeping the data in a matrix, and manipulating it with packages meant to handle that format. I take your example and multiply it 100,000x. Even with stemming, this takes 17 seconds on my machine.

data <- data.frame(
  text = c("Let the big dogs hunt",
           "No holds barred",
           "My child is an honor student"),
  stringsAsFactors = FALSE
)

## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop = FALSE]

library(stringi)
library(SnowballC)

## extract lower-cased words first, then stem each word (stemming the raw
## sentences would only stem the last word of each string)
out <- lapply(stri_extract_all_words(stri_trans_tolower(data[[1]])),
              SnowballC::wordStem, "english") # in old stringi versions: 'stri_extract_words'
names(out) <- paste0("doc", seq_along(out))

lev <- sort(unique(unlist(out)))
dat <- do.call(cbind, lapply(out, function(x, lev) {
  tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- lev  # lev is already sorted

library(tm)
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ]

library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)

tdm <- tm::as.TermDocumentMatrix(dat2, weighting = weightTf)
tdm

## or (note the transpose: dat2 holds terms in rows and documents in columns)...
dtm <- tm::as.DocumentTermMatrix(t(dat2), weighting = weightTf)
dtm

Using textstat_simil with a dictionary or globs in Quanteda

It's possible, but first you would need to convert the glob matches with "rain*" into "rain" by using dfm_lookup(). (Note: there are other ways to do this, such as tokenizing and then using tokens_lookup() or tokens_replace(), but I think the lookup approach is more straightforward, and it is also what you asked about in the question.)

Also note that for feature similarity, you must have more than a single document, which explains why I added two more here.

txt <- c("It is raining. It rains a lot during the rainy season",
"Raining today, and it rained yesterday.",
"When it's raining it must be rainy season.")

rain_dfm <- dfm(txt)

Then use a dictionary to convert glob matches (the default) with "rain*" to "rain", while keeping the other features. (In this particular case, you are correct that dfm_wordstem() could have accomplished the same thing.)

rain_dfm <- dfm_lookup(rain_dfm,
                       dictionary(list(rain = "rain*")),
                       exclusive = FALSE,
                       capkeys = FALSE)
rain_dfm
## Document-feature matrix of: 3 documents, 17 features (52.9% sparse).
## 3 x 17 sparse Matrix of class "dfm"
## features
## docs it is rain . a lot during the season today , and yesterday when it's must be
## text1 2 1 3 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## text2 1 0 2 1 0 0 0 0 0 1 1 1 1 0 0 0 0
## text3 1 0 2 1 0 0 0 0 1 0 0 0 0 1 1 1 1

And now, you can compute the cosine similarity for the target feature of "rain":

textstat_simil(rain_dfm, selection = "rain", method = "cosine", margin = "features")
## rain
## it 0.9901475
## is 0.7276069
## rain 1.0000000
## . 0.9801961
## a 0.7276069
## lot 0.7276069
## during 0.7276069
## the 0.7276069
## season 0.8574929
## today 0.4850713
## , 0.4850713
## and 0.4850713
## yesterday 0.4850713
## when 0.4850713
## it's 0.4850713
## must 0.4850713
## be 0.4850713
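
A side note: in quanteda v2 and later, textstat_simil() moved to the quanteda.textstats package and the selection argument was replaced by y, so the equivalent call would look roughly like this (a sketch, assuming a v2+ installation):

library("quanteda.textstats")
textstat_simil(rain_dfm, y = rain_dfm[, "rain"],
               method = "cosine", margin = "features")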

rtexttools package alternative for R version 3.5.2 or newest R version

First, the packages are not on CRAN anymore, but you can still use them if you want. The easiest way is to install them from the archive:

install.packages("https://cran.r-project.org/src/contrib/Archive/maxent/maxent_1.3.3.1.tar.gz", type = "source", repos = NULL)
install.packages("https://cran.r-project.org/src/contrib/Archive/RTextTools/RTextTools_1.4.2.tar.gz", type = "source", repos = NULL)

I tested them recently against some more modern implementations; maxent in particular still holds up pretty well, and it will maybe find a new home at some point.

Second, there are a number of alternatives for text classification and machine learning. For machine learning itself, the caret package (manual) is not bad and can handle some text classification; however, keep in mind that it is not optimized for text. A really cool new package that will hopefully make it to CRAN soon is quanteda.classifiers, while quanteda itself already has Naive Bayes implemented (tutorial).

Third, there are a lot of other packages out there that I don't know about, and I don't dare suggest that any one of them is better suited to your task than the others. I found this thread a while ago that discusses some options: https://github.com/bnosac/ruimtehol/issues/11.


