tm_map has parallel::mclapply error in R 3.0.1 on Mac

I suspect you don't have the SnowballC package installed, which seems to be required. tm_map is supposed to run stemDocument on all the documents via mclapply. Try running stemDocument on a single document so you can surface the actual error:

stemDocument(crude[[1]])

For me, I got an error:

Error in loadNamespace(name) : there is no package called ‘SnowballC’

So I just went ahead and installed SnowballC and it worked. Clearly, SnowballC should be a dependency.
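A minimal sketch of that fix, checking for the package first so the install only runs when it is actually missing (the crude corpus is the tm example data used above):

```r
# Install SnowballC only if it isn't already available.
if (!requireNamespace("SnowballC", quietly = TRUE)) {
  install.packages("SnowballC")
}

library(tm)
data(crude)

# With SnowballC present, stemming a single document no longer errors.
stemDocument(crude[[1]])
```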

parallel foreach loops produce mclapply error

You're getting that error because registerDoMC expects an integer argument, not a cluster object, while registerDoParallel expects either an integer or a cluster object. Basically, you need to decide which package to use and not mix them.

If you use doMC, then you never create a cluster object. A minimal doMC example looks like:

library(doMC)
registerDoMC(3)
foreach(i=1:10) %dopar% sqrt(i)

The doParallel package is a mashup of the doMC and doSNOW packages, and so you don't need to use cluster objects. Converting the previous example to doParallel is very simple:

library(doParallel)
registerDoParallel(3)
foreach(i=1:10) %dopar% sqrt(i)

The confusing thing is that on Windows, doParallel will actually create and use a cluster object behind the scenes, while on Linux and Mac OS X, it doesn't use a cluster object because it uses mclapply just as in the doMC package. I think that is convenient, but it can be a source of confusion.
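If you do want an explicit cluster object, e.g. for code that must also run on Windows, registerDoParallel accepts one as noted above. A minimal sketch:

```r
library(doParallel)

# Explicit cluster object: behaves the same on Windows, Linux, and OS X.
cl <- makeCluster(3)
registerDoParallel(cl)

res <- foreach(i = 1:10) %dopar% sqrt(i)

# Clean up the workers when done.
stopCluster(cl)
```

This trades the convenience of fork-based mclapply on Unix-alikes for portability: the same code runs everywhere a socket cluster can be created.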

tm_map error in R

tm_map has to be applied to a Corpus object, not a character vector. But iconv turns your TweetCorpus object from a Corpus back into a character vector.

To fix this, switch the order of your pre-processing, so that you use iconv before you turn the tweets into a Corpus object:

TweetList <- c("hello", "world", "Hooray", "yep")
TweetList <- iconv(TweetList, to = "UTF-8")
TweetCorpus <- Corpus(VectorSource(TweetList))
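With the Corpus built last, tm_map now receives the object type it expects. A short continuation, assuming tm >= 0.6 where base functions are wrapped with content_transformer:

```r
library(tm)

TweetList <- c("hello", "world", "Hooray", "yep")
TweetList <- iconv(TweetList, to = "UTF-8")
TweetCorpus <- Corpus(VectorSource(TweetList))

# tm_map gets a Corpus, so transformations apply cleanly.
TweetCorpus <- tm_map(TweetCorpus, content_transformer(tolower))
content(TweetCorpus[[3]])  # "hooray"
```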

R tm package Upgrade - Error in converting corpus to data frame

Looks very complicated. How about:

data <- c("Lorem ipsum dolor sit amet account: 999 red balloons.",
          "Some English words are just made for stemming!")

require(quanteda)

# makes the texts into a list of tokens with the same treatment
# as your tm mapped functions
toks <- tokenize(toLower(data), removePunct = TRUE, removeNumbers = TRUE)
# toks is just a named list
toks
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "lorem" "ipsum" "dolor" "sit" "amet" "account" "red" "balloons"
##
## Component 2 :
## [1] "some" "english" "words" "are" "just" "made" "for" "stemming"

# remove selected terms
toks <- removeFeatures(toks, c(stopwords("english"), "hi", "account", "can"))

# apply stemming
toks <- wordstem(toks)

# make into a data frame by reassembling the cleaned tokens
(df <- data.frame(text = sapply(toks, paste, collapse = " ")))
## text
## 1 lorem ipsum dolor sit amet red balloon
## 2 english word just made stem

Parallelization: package parallel instead of mclapply

Your first code calls a function

function(file, fID)

Your second code, by contrast, uses

function(dirPath, fID)

That mismatch in the first argument's name (file vs. dirPath) is the error.
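The mismatch matters because R matches named arguments exactly. A hypothetical reconstruction (the function and argument names here are illustrative, not from the original code):

```r
# Worker with the first signature:
readFile <- function(file, fID) {
  paste(file, fID)
}

# Calling it with the second signature's argument name fails:
# readFile(dirPath = "data/", fID = 1)
# Error: unused argument (dirPath = "data/")

# Matching the declared argument names works:
readFile(file = "data/", fID = 1)
```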

Why is DocumentTermMatrix running out of memory when plenty left?

From this post, I figured out how to fix this by limiting the number of cores used. Since there is no explicit option via DocumentTermMatrix, I had to do it via options:

num.cores <- getOption("mc.cores")
options(mc.cores=1)
dtm <- DocumentTermMatrix(vct)
options(mc.cores=num.cores)
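To make the save/restore robust even when DocumentTermMatrix itself throws an error, the same idea can be wrapped in a helper with on.exit. A sketch (dtm_single_core is a name I'm introducing; vct is the corpus from the original code):

```r
library(tm)

dtm_single_core <- function(corpus) {
  old <- options(mc.cores = 1)  # options() returns the previous settings
  on.exit(options(old))         # restored even if the next line errors
  DocumentTermMatrix(corpus)
}

dtm <- dtm_single_core(vct)
```

This way the global mc.cores option can never be left stuck at 1 by a failure mid-call.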

