Really fast word ngram vectorization in R

This is a really interesting problem, and one that I have spent a lot of time grappling with in the quanteda package. It involves three aspects that I will comment on, although it's only the third that really addresses your question. But the first two points explain why I have only focused on the ngram creation function, since -- as you point out -- that is where the speed improvement can be made.

  1. Tokenization. Here you are using stringi::stri_split_fixed() on the space character, which is the fastest but not the best method for tokenizing. We implemented this almost exactly the same way in quanteda::tokenize(x, what = "fastestword"). It's not the best because stringi can do much smarter implementations of whitespace delimiters. (Even the character class \\s is smarter, but slightly slower -- this is implemented as what = "fasterword".) Your question was not about tokenization, though, so this point is just context.

  2. Tabulating the document-feature matrix. Here we also use the Matrix package, indexing the documents and features (I call them features, not terms) and creating a sparse matrix directly, as you do in the code above. But your use of match() is a lot faster than the match/merge methods we were using through data.table. I am going to recode the quanteda::dfm() function, since your method is more elegant and faster. Really, really glad I saw this! (A short sketch of this match()-based construction appears at the end of this answer.)

  3. ngram creation. Here I think I can actually help in terms of performance. We implement this in quanteda through an argument to quanteda::tokenize() called ngrams = c(1), where the value can be any integer set. The call for unigrams and bigrams would be ngrams = 1:2, for instance. You can examine the code at https://github.com/kbenoit/quanteda/blob/master/R/tokenize.R; see the internal function ngram(). I've reproduced this below and made a wrapper so that we can directly compare it to your find_ngrams() function.

Code:

# wrapper
find_ngrams2 <- function(x, ngrams = 1, concatenator = " ") {
    if (sum(1:length(ngrams)) == sum(ngrams)) {
        result <- lapply(x, ngram, n = length(ngrams),
                         concatenator = concatenator, include.all = TRUE)
    } else {
        result <- lapply(x, function(x) {
            xnew <- c()
            for (n in ngrams)
                xnew <- c(xnew, ngram(x, n, concatenator = concatenator,
                                      include.all = FALSE))
            xnew
        })
    }
    result
}

# does the work
ngram <- function(tokens, n = 2, concatenator = "_", include.all = FALSE) {

    if (length(tokens) < n)
        return(NULL)

    # start with lower ngrams, or just the specified size if include.all = FALSE
    start <- ifelse(include.all,
                    1,
                    ifelse(length(tokens) < n, 1, n))

    # set max size of ngram at max length of tokens
    end <- ifelse(length(tokens) < n, length(tokens), n)

    all_ngrams <- c()
    # outer loop for all ngram widths
    for (width in start:end) {
        new_ngrams <- tokens[1:(length(tokens) - width + 1)]
        # inner loop for ngrams of width > 1
        if (width > 1) {
            for (i in 1:(width - 1))
                new_ngrams <- paste(new_ngrams,
                                    tokens[(i + 1):(length(tokens) - width + 1 + i)],
                                    sep = concatenator)
        }
        # paste onto previous results and continue
        all_ngrams <- c(all_ngrams, new_ngrams)
    }

    all_ngrams
}

Here is the comparison for a simple text:

txt <- c("The quick brown fox named Seamus jumps over the lazy dog.", 
"The dog brings a newspaper from a boy named Seamus.")
tokens <- tokenize(toLower(txt), removePunct = TRUE)
tokens
# [[1]]
# [1] "the" "quick" "brown" "fox" "named" "seamus" "jumps" "over" "the" "lazy" "dog"
#
# [[2]]
# [1] "the" "dog" "brings" "a" "newspaper" "from" "a" "boy" "named" "seamus"
#
# attr(,"class")
# [1] "tokenizedTexts" "list"

microbenchmark::microbenchmark(zach_ng <- find_ngrams(tokens, 2),
                               ken_ng <- find_ngrams2(tokens, 1:2))
# Unit: microseconds
#                                 expr     min       lq     mean   median       uq     max neval
#    zach_ng <- find_ngrams(tokens, 2) 288.823 326.0925 433.5831 360.1815 542.9585 897.469   100
#  ken_ng <- find_ngrams2(tokens, 1:2)  74.216  87.5150 130.0471 100.4610 146.3005 464.794   100

str(zach_ng)
# List of 2
# $ : chr [1:21] "the" "quick" "brown" "fox" ...
# $ : chr [1:19] "the" "dog" "brings" "a" ...
str(ken_ng)
# List of 2
# $ : chr [1:21] "the" "quick" "brown" "fox" ...
# $ : chr [1:19] "the" "dog" "brings" "a" ...

For your really large, simulated text, here is the comparison:

tokens <- stri_split_fixed(sents1, ' ')
zach_ng1_t1 <- system.time(zach_ng1 <- find_ngrams(tokens, 2))
ken_ng1_t1 <- system.time(ken_ng1 <- find_ngrams2(tokens, 1:2))
zach_ng1_t1
# user system elapsed
# 230.176 5.243 246.389
ken_ng1_t1
# user system elapsed
# 58.264 1.405 62.889

That is already an improvement, and I'd be delighted if it could be improved further. I should also be able to implement the faster dfm() method in quanteda, so that you can get what you want simply through:

dfm(sents1, ngrams = 1:2, what = "fastestword",
    toLower = FALSE, removePunct = FALSE, removeNumbers = FALSE, removeTwitter = TRUE)

(That already works, but it is slower than your overall result, because the way you create the final sparse matrix object is faster -- I will change this soon.)
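To make point 2 concrete, here is a minimal sketch of the match()-plus-sparseMatrix construction described there. The names dfm_sketch and toks are illustrative only, not quanteda internals; toks is assumed to be a list of character vectors, one per document.

library(Matrix)

dfm_sketch <- function(toks) {
    feats <- unique(unlist(toks, use.names = FALSE))
    i <- rep(seq_along(toks), lengths(toks))            # document index of each token
    j <- match(unlist(toks, use.names = FALSE), feats)  # feature index of each token
    # duplicated (i, j) pairs are summed, which yields per-document counts
    sparseMatrix(i = i, j = j, x = 1L,
                 dims = c(length(toks), length(feats)),
                 dimnames = list(NULL, feats))
}

dfm_sketch(list(c("the", "cat", "sat"), c("the", "dog")))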

Vectorized version of charToRaw with good performance

This is a version that uses the internal C source for charToRaw without any of the error checking. The loop in Rcpp should be as fast as you can get, although I don't know if there's a better way to handle the memory allocation. As you can see, you don't get a statistically significant performance bump over purrr::map, but it is better than sapply.

library(Rcpp)

Rcpp::cppFunction('List charToRaw_cpp(CharacterVector x) {
  int n = x.size();
  List l = List(n);

  for (int i = 0; i < n; ++i) {
    int nc = LENGTH(x[i]);
    RawVector ans = RawVector(nc);
    memcpy(RAW(ans), CHAR(x[i]), nc);
    l[i] = ans;
  }
  return l;
}')
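
A quick sanity check (my addition, not part of the original answer) that the Rcpp version returns the same bytes as base R:

# should be TRUE: identical raw vectors, element by element
identical(charToRaw_cpp(c("abc", "R")), lapply(c("abc", "R"), charToRaw))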

# Random vector of 5000 strings of 5000 characters each
x <- unlist(purrr::rerun(5000, stringr::str_c(
  sample(c(letters, LETTERS), 5000, replace = TRUE), collapse = "")))

microbenchmark::microbenchmark(
  sapply(x, charToRaw),
  purrr::map(x, charToRaw),
  charToRaw_cpp(x)
)
Unit: milliseconds
                     expr       min        lq      mean    median       uq       max neval cld
     sapply(x, charToRaw) 60.337729 69.313684 76.908557 73.232365 78.99251 398.00732   100   b
 purrr::map(x, charToRaw)  8.849688  9.201125 17.117435  9.376843 10.09294 292.74068   100  a
         charToRaw_cpp(x)  5.578212  5.827794  7.998507  6.151864  7.10292  23.81905   100  a

With 1000 iterations you start to see an effect:

Unit: milliseconds
                     expr      min       lq      mean   median        uq      max neval cld
 purrr::map(x, charToRaw) 8.773802 9.191173 13.674963 9.425828 10.602676 302.7293  1000   b
         charToRaw_cpp(x) 5.591585 5.868381  9.370648 6.119673  7.445649 295.1833  1000  a

Edited note on performance:

I assumed you would see a bigger difference in performance with larger strings and vectors. But actually the biggest difference so far is for a 50-length vector of 50-character strings:

Unit: microseconds
                     expr    min     lq     mean median      uq     max neval cld
     sapply(x, charToRaw) 66.245 69.045 77.44593 70.288 72.4650 862.110   500   b
 purrr::map(x, charToRaw) 65.313 68.733 75.85236 70.599 72.7765 621.392   500   b
         charToRaw_cpp(x)  4.666  6.221  7.47512  6.844  7.7770  58.159   500  a
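
The setup for that last benchmark is not shown above; presumably it followed the same pattern as the earlier one, along these lines:

# 50 strings of 50 characters each (assumed setup, mirroring the one above)
x <- unlist(purrr::rerun(50, stringr::str_c(
  sample(c(letters, LETTERS), 50, replace = TRUE), collapse = "")))

microbenchmark::microbenchmark(
  sapply(x, charToRaw),
  purrr::map(x, charToRaw),
  charToRaw_cpp(x),
  times = 500
)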

Compute ngrams for each row of text data in R

Is this what you're after?

library("RWeka")
library("tm")

TrigramTokenizer <- function(x) NGramTokenizer(x,
                                               Weka_control(min = 3, max = 3))
# Using Tyler's method of making the 'Text' object here
tdm <- TermDocumentMatrix(Corpus(VectorSource(Text)),
                          control = list(tokenize = TrigramTokenizer))

inspect(tdm)

A term-document matrix (4 terms, 5 documents)

Non-/sparse entries: 4/16
Sparsity           : 80%
Maximal term length: 20
Weighting          : term frequency (tf)

                      Docs
Terms                  1 2 3 4 5
  are you today        0 0 1 0 0
  blah blah blahdy     0 0 0 0 1
  how are you          0 0 1 0 0
  i love stackoverflow 0 0 0 1 0

Finding words/phrases from a big dataframe with faster way

You can use the tm package to create a document term matrix, together with a tokeniser from RWeka.

library(tm)
library(RWeka)

First, create the bigram tokeniser:

bigram_tokeniser <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

Then create a corpus from phrases:

corpus <- VCorpus(VectorSource(phrases)) 

Then create the document term matrix. In this case only the terms in the vector words will be considered; you can change that by changing the control:

dtm <- DocumentTermMatrix(corpus,
                          control = list(tokenize = bigram_tokeniser,
                                         dictionary = words))

You can then convert the document term matrix to a matrix and get the desired output:

as.matrix(dtm)

    Terms
Docs continuous improvement revenue stock
   1                      0       1     1
   2                      0       1     0
   3                      1       0     0
   4                      0       0     0
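
As noted above, the dictionary entry is what restricts the columns to those terms. To keep every unigram and bigram the tokeniser produces, just drop it from the control list (dtm_all is an illustrative name):

dtm_all <- DocumentTermMatrix(corpus,
                              control = list(tokenize = bigram_tokeniser))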

R Looking for faster alternative for sapply()

Depending on your use case you might want to consider alternative packages (even though ngram proclaims to be fast). The fastest alternative here (when ng = 1) is to split the text and count the unique tokens.

library(stringi)

stringi_get_unigrams <- function(text)
  lengths(lapply(stri_split(text, fixed = " "), unique))

system.time(res3 <- stringi_get_unigrams(df$text))
# user system elapsed
# 0.84 0.00 0.86
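
To see what it returns -- the number of unique unigrams per document:

stringi_get_unigrams(c("a b b a", "x y z"))
# [1] 2 3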

If you need something more general (e.g. ng != 1), you have to build the adjacent token pairs and count the unique ones, which is a bit more involved.

stringi_get_duograms <- function(text) {
  splits <- stri_split(text, fixed = " ")
  comp <- function(x)
    nrow(unique(matrix(c(x[-1], x[-length(x)]), ncol = 2)))
  res <- sapply(splits, comp)
  res[res == 0] <- NA_integer_
  res
}
system.time(res <- stringi_get_duograms(df$text))
# user system elapsed
# 5.94 0.02 5.93

Here we have the added benefit of not crashing when a document contains no matching word pairs: such documents simply come back as NA.
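
For example, a one-word document has no adjacent pairs, so it yields NA rather than an error:

stringi_get_duograms(c("just one example", "word"))
# [1]  2 NA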

Times on my CPU

system.time({
  res <- get_unigrams(df$text)  # the function from the question, timed as a baseline
})
# user system elapsed
# 12.72 0.16 12.94

alternative parallel implementation:

get_unigrams_par <- function(text) {
  require(purrr)
  require(ngram)
  sapply(text, function(text)
    ngram(text, n = 1) %>% get.ngrams() %>% length()
  )
}
cl <- parallel::makeCluster(nc <- parallel::detectCores())
print(nc)
# [1] 12
system.time(
  res2 <- unname(unlist(parallel::parLapply(cl,
                                            split(df$text,
                                                  sort(1:nrow(df) %% nc)),
                                            get_unigrams_par)))
)
#  user  system elapsed
#  0.20    0.11    2.95
parallel::stopCluster(cl)

And just to check that all results are identical:

identical(unname(res), res2)
# TRUE
identical(res2, res3)
# TRUE

Edit:

Of course there's nothing stopping us from combining parallelization with any result above:

cl <- parallel::makeCluster(nc <- parallel::detectCores())
parallel::clusterEvalQ(cl, library(stringi))
system.time(
  res4 <- unname(unlist(parallel::parLapply(cl,
                                            split(df$text,
                                                  sort(1:nrow(df) %% nc)),
                                            stringi_get_unigrams)))
)
#  user  system elapsed
#  0.01    0.16    0.27
parallel::stopCluster(cl)

Converting a list of tokens to n-grams

Here's one way with embed.

find_ngrams <- function(x, n) {
  if (n == 1) return(x)
  c(x, apply(embed(x, n), 1, function(row) paste(rev(row), collapse = " ")))
}
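
Applied to a short token vector, for example, it returns the original unigrams followed by the bigrams:

find_ngrams(c("the", "quick", "brown", "fox"), 2)
# "the" "quick" "brown" "fox" "the quick" "quick brown" "brown fox"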

There seems to be a bug in your function. If you fix that, we can do a benchmark.

Compute unweighted bag-of-words based TCM using text2vec in R?

UPDATE 2017-02-01: I pushed an update to GitHub -- you can now specify the weighting vector directly in create_tcm().
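
A hedged sketch of that direct route (the weights argument follows the GitHub update above, so check it against your installed version of text2vec; the toy corpus and names here are mine):

library(text2vec)

# standard text2vec setup on a toy corpus
txt <- c("a b c b", "b c d")
it <- itoken(txt, tokenizer = word_tokenizer)
vectorizer <- vocab_vectorizer(create_vocabulary(it))

# equal weights across the whole window should give an unweighted, bag-of-words TCM
tcm <- create_tcm(it, vectorizer, skip_grams_window = 2L, weights = rep(1, 2L))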

The weighting function is defined here. If you need equal weight for each term within the window, you need to adjust the weighting function to always return 1 (just clone the repo, change the function definition, and build the package from source with devtools or R CMD build):

inline float weighting_fun(uint32_t offset) {
  return 1.0;
}

However, several people have already asked for this feature, and I will probably include such an option in the next release.

Quickly Write Vector to File r

After trying several options, I found the fastest to be data.table::fwrite. Like @Gregor says in his first comment, it is faster by an order of magnitude, which is well worth the extra package load. It is also one of the options that produces bigger files. (The other one is readr::write_lines -- thanks to the comment by Calum You, I had forgotten that one.)

library(data.table)
library(readr)

set.seed(1) # make the results reproducible
n <- 1e6
x <- rnorm(n)

t1 <- system.time({
  sink(file = "test_sink.txt")
  cat(x, "\n")
  sink()
})
t2 <- system.time({
  cat(x, "\n", file = "test_cat.txt")
})
t3 <- system.time({
  write(x, file = "test_write.txt")
})
t4 <- system.time({
  fwrite(list(x), file = "test_fwrite.txt")
})
t5 <- system.time({
  write_lines(x, "test_write_lines.txt")
})

rbind(sink = t1[1:3], cat = t2[1:3],
      write = t3[1:3], fwrite = t4[1:3],
      readr = t5[1:3])
#        user.self sys.self elapsed
# sink        4.18    11.64   15.96
# cat         3.70     4.80    8.57
# write       3.71     4.87    8.64
# fwrite      0.42     0.02    0.51
# readr       2.37     0.03    6.66

In his second comment, Gregor notes that as.list and list behave differently. The difference is important. The former writes the vector as one row and many columns, the latter writes one column and many rows.
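
A tiny way to see that shape difference on disk (my illustration; whether fwrite also writes a V1, V2, ... header row for unnamed input can depend on the data.table version, so the point here is just the shape):

tmp1 <- tempfile(); tmp2 <- tempfile()
fwrite(as.list(1:3), tmp1)  # one wide data row: 1,2,3
fwrite(list(1:3), tmp2)     # one column, three rows
readLines(tmp1)
readLines(tmp2)
unlink(c(tmp1, tmp2))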

The speed difference is also noticeable:

fw1 <- system.time({
  fwrite(as.list(x), file = "test_fwrite.txt")
})
fw2 <- system.time({
  fwrite(list(x), file = "test_fwrite2.txt")
})

rbind(as.list = fw1[1:3], list = fw2[1:3])
#         user.self sys.self elapsed
# as.list      0.67     0.00    0.75
# list         0.19     0.03    0.11

Final clean up.

unlink(c("test_sink.txt", "test_cat.txt", "test_write.txt",
         "test_fwrite.txt", "test_fwrite2.txt", "test_write_lines.txt"))


Related Topics



Leave a reply



Submit