Really fast word ngram vectorization in R

This is a really interesting problem, and one that I have spent a lot of time grappling with in the quanteda package. It involves three aspects that I will comment on, although it's only the third that really addresses your question. But the first two points explain why I have only focused on the ngram creation function, since -- as you point out -- that is where the speed improvement can be made.

  1. Tokenization. Here you are using stringi::stri_split_fixed() on the space character, which is the fastest but not the best method for tokenizing. We implemented this almost exactly the same way in quanteda::tokenize(x, what = "fastestword"). It's not the best because stringi can do much smarter implementations of whitespace delimiters. (Even the character class \\s is smarter, but slightly slower -- this is implemented as what = "fasterword".) Your question was not about tokenization, though, so this point is just context.

  2. Tabulating the document-feature matrix. Here we also use the Matrix package, indexing the documents and features (I call them features, not terms) and creating a sparse matrix directly, as you do in the code above. But your use of match() is a lot faster than the match/merge methods we were using through data.table. I am going to recode the quanteda::dfm() function, since your method is more elegant and faster. Really, really glad I saw this! (A short sketch of this match()-based construction appears at the end of this answer.)

  3. ngram creation. Here I think I can actually help in terms of performance. We implement this in quanteda through an argument to quanteda::tokenize() called ngrams = c(1), where the value can be any integer set. The call for unigrams and bigrams would be ngrams = 1:2, for instance. You can examine the code at https://github.com/kbenoit/quanteda/blob/master/R/tokenize.R; see the internal function ngram(). I've reproduced this below and made a wrapper so that we can directly compare it to your find_ngrams() function.

Code:

# wrapper
find_ngrams2 <- function(x, ngrams = 1, concatenator = " ") {
    if (sum(1:length(ngrams)) == sum(ngrams)) {
        result <- lapply(x, ngram, n = length(ngrams),
                         concatenator = concatenator, include.all = TRUE)
    } else {
        result <- lapply(x, function(x) {
            xnew <- c()
            for (n in ngrams)
                xnew <- c(xnew, ngram(x, n, concatenator = concatenator,
                                      include.all = FALSE))
            xnew
        })
    }
    result
}

# does the work
ngram <- function(tokens, n = 2, concatenator = "_", include.all = FALSE) {

    if (length(tokens) < n)
        return(NULL)

    # start with lower ngrams, or just the specified size if include.all = FALSE
    start <- ifelse(include.all,
                    1,
                    ifelse(length(tokens) < n, 1, n))

    # set max size of ngram at max length of tokens
    end <- ifelse(length(tokens) < n, length(tokens), n)

    all_ngrams <- c()
    # outer loop for all ngram widths
    for (width in start:end) {
        new_ngrams <- tokens[1:(length(tokens) - width + 1)]
        # inner loop for ngrams of width > 1
        if (width > 1) {
            for (i in 1:(width - 1))
                new_ngrams <- paste(new_ngrams,
                                    tokens[(i + 1):(length(tokens) - width + 1 + i)],
                                    sep = concatenator)
        }
        # paste onto previous results and continue
        all_ngrams <- c(all_ngrams, new_ngrams)
    }

    all_ngrams
}

Here is the comparison for a simple text:

txt <- c("The quick brown fox named Seamus jumps over the lazy dog.", 
"The dog brings a newspaper from a boy named Seamus.")
tokens <- tokenize(toLower(txt), removePunct = TRUE)
tokens
# [[1]]
# [1] "the" "quick" "brown" "fox" "named" "seamus" "jumps" "over" "the" "lazy" "dog"
#
# [[2]]
# [1] "the" "dog" "brings" "a" "newspaper" "from" "a" "boy" "named" "seamus"
#
# attr(,"class")
# [1] "tokenizedTexts" "list"

microbenchmark::microbenchmark(zach_ng <- find_ngrams(tokens, 2),
                               ken_ng <- find_ngrams2(tokens, 1:2))
# Unit: microseconds
#                                 expr     min       lq     mean   median       uq     max neval
#    zach_ng <- find_ngrams(tokens, 2) 288.823 326.0925 433.5831 360.1815 542.9585 897.469   100
#  ken_ng <- find_ngrams2(tokens, 1:2)  74.216  87.5150 130.0471 100.4610 146.3005 464.794   100

str(zach_ng)
# List of 2
# $ : chr [1:21] "the" "quick" "brown" "fox" ...
# $ : chr [1:19] "the" "dog" "brings" "a" ...
str(ken_ng)
# List of 2
# $ : chr [1:21] "the" "quick" "brown" "fox" ...
# $ : chr [1:19] "the" "dog" "brings" "a" ...

For your really large, simulated text, here is the comparison:

tokens <- stri_split_fixed(sents1, ' ')
zach_ng1_t1 <- system.time(zach_ng1 <- find_ngrams(tokens, 2))
ken_ng1_t1 <- system.time(ken_ng1 <- find_ngrams2(tokens, 1:2))
zach_ng1_t1
# user system elapsed
# 230.176 5.243 246.389
ken_ng1_t1
# user system elapsed
# 58.264 1.405 62.889

That is already an improvement, and I'd be delighted if it could be improved further. I should also be able to implement the faster dfm() method in quanteda, so that you can get what you want simply through:

dfm(sents1, ngrams = 1:2, what = "fastestword",
    toLower = FALSE, removePunct = FALSE, removeNumbers = FALSE, removeTwitter = TRUE)

(That already works, but it is slower than your overall result, because the way you create the final sparse matrix object is faster -- I will change this soon.)
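To make point 2 concrete, here is a minimal sketch of the match()-plus-sparseMatrix construction described there. The names dfm_sketch and toks are illustrative only, not quanteda internals; toks is assumed to be a list of character vectors, one per document.

library(Matrix)

dfm_sketch <- function(toks) {
    feats <- unique(unlist(toks, use.names = FALSE))
    i <- rep(seq_along(toks), lengths(toks))            # document index of each token
    j <- match(unlist(toks, use.names = FALSE), feats)  # feature index of each token
    # duplicated (i, j) pairs are summed, which yields per-document counts
    sparseMatrix(i = i, j = j, x = 1L,
                 dims = c(length(toks), length(feats)),
                 dimnames = list(NULL, feats))
}

dfm_sketch(list(c("the", "cat", "sat"), c("the", "dog")))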

Vectorized version of charToRaw with good performance

This is a version that uses the internal C source for charToRaw without any of the error checking. The loop in Rcpp should be as fast as you can get, although I don't know if there's a better way to handle the memory allocation. As you can see, you don't get a statistically significant performance bump over purrr::map, but it is better than sapply.

library(Rcpp)

Rcpp::cppFunction('List charToRaw_cpp(CharacterVector x) {
  int n = x.size();
  List l = List(n);

  for (int i = 0; i < n; ++i) {
    int nc = LENGTH(x[i]);
    RawVector ans = RawVector(nc);
    memcpy(RAW(ans), CHAR(x[i]), nc);
    l[i] = ans;
  }
  return l;
}')
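
A quick sanity check (my addition, not part of the original answer) that the Rcpp version returns the same bytes as base R:

# should be TRUE: identical raw vectors, element by element
identical(charToRaw_cpp(c("abc", "R")), lapply(c("abc", "R"), charToRaw))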

# Random vector of 5000 strings of 5000 characters each
x <- unlist(purrr::rerun(5000, stringr::str_c(
  sample(c(letters, LETTERS), 5000, replace = TRUE), collapse = "")))

microbenchmark::microbenchmark(
  sapply(x, charToRaw),
  purrr::map(x, charToRaw),
  charToRaw_cpp(x)
)
Unit: milliseconds
                     expr       min        lq      mean    median       uq       max neval cld
     sapply(x, charToRaw) 60.337729 69.313684 76.908557 73.232365 78.99251 398.00732   100   b
 purrr::map(x, charToRaw)  8.849688  9.201125 17.117435  9.376843 10.09294 292.74068   100  a
         charToRaw_cpp(x)  5.578212  5.827794  7.998507  6.151864  7.10292  23.81905   100  a

With 1000 iterations you start to see an effect:

Unit: milliseconds
                     expr      min       lq      mean   median        uq      max neval cld
 purrr::map(x, charToRaw) 8.773802 9.191173 13.674963 9.425828 10.602676 302.7293  1000   b
         charToRaw_cpp(x) 5.591585 5.868381  9.370648 6.119673  7.445649 295.1833  1000  a

Edited note on performance:

I assumed you would see a bigger difference in performance with larger strings and vectors. But actually the biggest difference so far is for a 50-length vector of 50-character strings:

Unit: microseconds
                     expr    min     lq     mean median      uq     max neval cld
     sapply(x, charToRaw) 66.245 69.045 77.44593 70.288 72.4650 862.110   500   b
 purrr::map(x, charToRaw) 65.313 68.733 75.85236 70.599 72.7765 621.392   500   b
         charToRaw_cpp(x)  4.666  6.221  7.47512  6.844  7.7770  58.159   500  a
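
The setup for that last benchmark is not shown above; presumably it followed the same pattern as the earlier one, along these lines:

# 50 strings of 50 characters each (assumed setup, mirroring the one above)
x <- unlist(purrr::rerun(50, stringr::str_c(
  sample(c(letters, LETTERS), 50, replace = TRUE), collapse = "")))

microbenchmark::microbenchmark(
  sapply(x, charToRaw),
  purrr::map(x, charToRaw),
  charToRaw_cpp(x),
  times = 500
)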

Compute ngrams for each row of text data in R

Is this what you're after?

library("RWeka")
library("tm")

TrigramTokenizer <- function(x) NGramTokenizer(x,
                                               Weka_control(min = 3, max = 3))
# Using Tyler's method of making the 'Text' object here
tdm <- TermDocumentMatrix(Corpus(VectorSource(Text)),
                          control = list(tokenize = TrigramTokenizer))

inspect(tdm)

A term-document matrix (4 terms, 5 documents)

Non-/sparse entries: 4/16
Sparsity           : 80%
Maximal term length: 20
Weighting          : term frequency (tf)

                      Docs
Terms                  1 2 3 4 5
  are you today        0 0 1 0 0
  blah blah blahdy     0 0 0 0 1
  how are you          0 0 1 0 0
  i love stackoverflow 0 0 0 1 0

Finding words/phrases from a big dataframe with faster way

You can use the tm package to create a document term matrix, together with a tokeniser from RWeka.

library(tm)
library(RWeka)

First, create the bigram tokeniser:

bigram_tokeniser <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

Then create a corpus from phrases:

corpus <- VCorpus(VectorSource(phrases)) 

Then create the document term matrix. In this case only the terms in the vector words will be considered; you can change that by changing the control:

dtm <- DocumentTermMatrix(corpus,
                          control = list(tokenize = bigram_tokeniser,
                                         dictionary = words))

You can then convert the document term matrix to a matrix and get the desired output:

as.matrix(dtm)

    Terms
Docs continuous improvement revenue stock
   1                      0       1     1
   2                      0       1     0
   3                      1       0     0
   4                      0       0     0
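
As noted above, the dictionary entry is what restricts the columns to those terms. To keep every unigram and bigram the tokeniser produces, just drop it from the control list (dtm_all is an illustrative name):

dtm_all <- DocumentTermMatrix(corpus,
                              control = list(tokenize = bigram_tokeniser))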

R Looking for faster alternative for sapply()

Depending on your use case you might want to consider alternative packages (even though ngram proclaims to be fast). The fastest alternative here (when ng = 1) is to split the text and count the unique tokens.

library(stringi)

stringi_get_unigrams <- function(text)
  lengths(lapply(stri_split(text, fixed = " "), unique))

system.time(res3 <- stringi_get_unigrams(df$text))
# user system elapsed
# 0.84 0.00 0.86
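
To see what it returns -- the number of unique unigrams per document:

stringi_get_unigrams(c("a b b a", "x y z"))
# [1] 2 3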

If you need something more general (e.g. ng != 1), you have to build the adjacent token pairs and count the unique ones, which is a bit more involved.

stringi_get_duograms <- function(text) {
  splits <- stri_split(text, fixed = " ")
  comp <- function(x)
    nrow(unique(matrix(c(x[-1], x[-length(x)]), ncol = 2)))
  res <- sapply(splits, comp)
  res[res == 0] <- NA_integer_
  res
}
system.time(res <- stringi_get_duograms(df$text))
# user system elapsed
# 5.94 0.02 5.93

Here we have the added benefit of not crashing when a document contains no matching word pairs: such documents simply come back as NA.
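
For example, a one-word document has no adjacent pairs, so it yields NA rather than an error:

stringi_get_duograms(c("just one example", "word"))
# [1]  2 NA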

Times on my CPU

system.time({
  res <- get_unigrams(df$text)  # the function from the question, timed as a baseline
})
# user system elapsed
# 12.72 0.16 12.94

alternative parallel implementation:

get_unigrams_par <- function(text) {
  require(purrr)
  require(ngram)
  sapply(text, function(text)
    ngram(text, n = 1) %>% get.ngrams() %>% length()
  )
}
cl <- parallel::makeCluster(nc <- parallel::detectCores())
print(nc)
# [1] 12
system.time(
  res2 <- unname(unlist(parallel::parLapply(cl,
                                            split(df$text,
                                                  sort(1:nrow(df) %% nc)),
                                            get_unigrams_par)))
)
#  user  system elapsed
#  0.20    0.11    2.95
parallel::stopCluster(cl)

And just to check that all results are identical:

identical(unname(res), res2)
# TRUE
identical(res2, res3)
# TRUE

Edit:

Of course there's nothing stopping us from combining parallelization with any result above:

cl <- parallel::makeCluster(nc <- parallel::detectCores())
parallel::clusterEvalQ(cl, library(stringi))
system.time(
  res4 <- unname(unlist(parallel::parLapply(cl,
                                            split(df$text,
                                                  sort(1:nrow(df) %% nc)),
                                            stringi_get_unigrams)))
)
#  user  system elapsed
#  0.01    0.16    0.27
parallel::stopCluster(cl)

Converting a list of tokens to n-grams

Here's one way with embed.

find_ngrams <- function(x, n) {
  if (n == 1) return(x)
  c(x, apply(embed(x, n), 1, function(row) paste(rev(row), collapse = " ")))
}
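
Applied to a short token vector, for example, it returns the original unigrams followed by the bigrams:

find_ngrams(c("the", "quick", "brown", "fox"), 2)
# "the" "quick" "brown" "fox" "the quick" "quick brown" "brown fox"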

There seems to be a bug in your function. If you fix that, we can do a benchmark.

Compute unweighted bag-of-words based TCM using text2vec in R?

UPDATE 2017-02-01: I pushed an update to GitHub -- you can now specify the weighting vector directly in create_tcm().
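
A hedged sketch of that direct route (the weights argument follows the GitHub update above, so check it against your installed version of text2vec; the toy corpus and names here are mine):

library(text2vec)

# standard text2vec setup on a toy corpus
txt <- c("a b c b", "b c d")
it <- itoken(txt, tokenizer = word_tokenizer)
vectorizer <- vocab_vectorizer(create_vocabulary(it))

# equal weights across the whole window should give an unweighted, bag-of-words TCM
tcm <- create_tcm(it, vectorizer, skip_grams_window = 2L, weights = rep(1, 2L))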

The weighting function is defined here. If you need equal weight for each term within the window, you need to adjust the weighting function to always return 1 (just clone the repo, change the function definition, and build the package from source with devtools or R CMD build):

inline float weighting_fun(uint32_t offset) {
  return 1.0;
}

However, several people have already asked for this feature, and I will probably include such an option in the next release.

Quickly Write Vector to File r

After trying several options, I found the fastest to be data.table::fwrite. Like @Gregor says in his first comment, it is faster by an order of magnitude, which is well worth the extra package load. It is also one of the options that produces bigger files. (The other one is readr::write_lines -- thanks to the comment by Calum You, I had forgotten that one.)

library(data.table)
library(readr)

set.seed(1) # make the results reproducible
n <- 1e6
x <- rnorm(n)

t1 <- system.time({
  sink(file = "test_sink.txt")
  cat(x, "\n")
  sink()
})
t2 <- system.time({
  cat(x, "\n", file = "test_cat.txt")
})
t3 <- system.time({
  write(x, file = "test_write.txt")
})
t4 <- system.time({
  fwrite(list(x), file = "test_fwrite.txt")
})
t5 <- system.time({
  write_lines(x, "test_write_lines.txt")
})

rbind(sink = t1[1:3], cat = t2[1:3],
      write = t3[1:3], fwrite = t4[1:3],
      readr = t5[1:3])
#        user.self sys.self elapsed
# sink        4.18    11.64   15.96
# cat         3.70     4.80    8.57
# write       3.71     4.87    8.64
# fwrite      0.42     0.02    0.51
# readr       2.37     0.03    6.66

In his second comment, Gregor notes that as.list and list behave differently. The difference is important. The former writes the vector as one row and many columns, the latter writes one column and many rows.
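
A tiny way to see that shape difference on disk (my illustration; whether fwrite also writes a V1, V2, ... header row for unnamed input can depend on the data.table version, so the point here is just the shape):

tmp1 <- tempfile(); tmp2 <- tempfile()
fwrite(as.list(1:3), tmp1)  # one wide data row: 1,2,3
fwrite(list(1:3), tmp2)     # one column, three rows
readLines(tmp1)
readLines(tmp2)
unlink(c(tmp1, tmp2))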

The speed difference is also noticeable:

fw1 <- system.time({
  fwrite(as.list(x), file = "test_fwrite.txt")
})
fw2 <- system.time({
  fwrite(list(x), file = "test_fwrite2.txt")
})

rbind(as.list = fw1[1:3], list = fw2[1:3])
#         user.self sys.self elapsed
# as.list      0.67     0.00    0.75
# list         0.19     0.03    0.11

Final clean up.

unlink(c("test_sink.txt", "test_cat.txt", "test_write.txt",
         "test_fwrite.txt", "test_fwrite2.txt", "test_write_lines.txt"))


Related Topics



Leave a reply



Submit