tm_map has parallel::mclapply error in R 3.0.1 on Mac

I suspect you don't have the SnowballC package installed, which seems to be required. tm_map is supposed to run stemDocument on all the documents via mclapply. Try running stemDocument on a single document so you can surface the actual error:

stemDocument(crude[[1]])

For me, I got an error:

Error in loadNamespace(name) : there is no package called ‘SnowballC’

So I just went ahead and installed SnowballC and it worked. Clearly, SnowballC should be a dependency.
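A minimal sketch of that fix, checking for the package first so the install only runs when it is actually missing (the crude corpus is the tm example data used above):

```r
# Install SnowballC only if it isn't already available.
if (!requireNamespace("SnowballC", quietly = TRUE)) {
  install.packages("SnowballC")
}

library(tm)
data(crude)

# With SnowballC present, stemming a single document no longer errors.
stemDocument(crude[[1]])
```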

parallel foreach loops produce mclapply error

You're getting that error because registerDoMC expects an integer argument, not a cluster object, while registerDoParallel expects either an integer or a cluster object. Basically, you need to decide which package to use and not mix them.

If you use doMC, then you never create a cluster object. A minimal doMC example looks like:

library(doMC)
registerDoMC(3)
foreach(i=1:10) %dopar% sqrt(i)

The doParallel package is a mashup of the doMC and doSNOW packages, and so you don't need to use cluster objects. Converting the previous example to doParallel is very simple:

library(doParallel)
registerDoParallel(3)
foreach(i=1:10) %dopar% sqrt(i)

The confusing thing is that on Windows, doParallel will actually create and use a cluster object behind the scenes, while on Linux and Mac OS X, it doesn't use a cluster object because it uses mclapply just as in the doMC package. I think that is convenient, but it can be a source of confusion.
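If you do want an explicit cluster object, e.g. for code that must also run on Windows, registerDoParallel accepts one as noted above. A minimal sketch:

```r
library(doParallel)

# Explicit cluster object: behaves the same on Windows, Linux, and OS X.
cl <- makeCluster(3)
registerDoParallel(cl)

res <- foreach(i = 1:10) %dopar% sqrt(i)

# Clean up the workers when done.
stopCluster(cl)
```

This trades the convenience of fork-based mclapply on Unix-alikes for portability: the same code runs everywhere a socket cluster can be created.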

tm_map error in R

tm_map has to be applied to a Corpus object, not a character vector. But iconv turns your TweetCorpus object from a Corpus back into a character vector.

To fix this, switch the order of your pre-processing, so that you use iconv before you turn the tweets into a Corpus object:

TweetList <- c("hello", "world", "Hooray", "yep")
TweetList <- iconv(TweetList, to = "UTF-8")
TweetCorpus <- Corpus(VectorSource(TweetList))
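With the Corpus built last, tm_map now receives the object type it expects. A short continuation, assuming tm >= 0.6 where base functions are wrapped with content_transformer:

```r
library(tm)

TweetList <- c("hello", "world", "Hooray", "yep")
TweetList <- iconv(TweetList, to = "UTF-8")
TweetCorpus <- Corpus(VectorSource(TweetList))

# tm_map gets a Corpus, so transformations apply cleanly.
TweetCorpus <- tm_map(TweetCorpus, content_transformer(tolower))
content(TweetCorpus[[3]])  # "hooray"
```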

R tm package Upgrade - Error in converting corpus to data frame

Looks very complicated. How about:

data <- c("Lorem ipsum dolor sit amet account: 999 red balloons.",
          "Some English words are just made for stemming!")

require(quanteda)

# makes the texts into a list of tokens with the same treatment
# as your tm mapped functions
toks <- tokenize(toLower(data), removePunct = TRUE, removeNumbers = TRUE)
# toks is just a named list
toks
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "lorem" "ipsum" "dolor" "sit" "amet" "account" "red" "balloons"
##
## Component 2 :
## [1] "some" "english" "words" "are" "just" "made" "for" "stemming"

# remove selected terms
toks <- removeFeatures(toks, c(stopwords("english"), "hi", "account", "can"))

# apply stemming
toks <- wordstem(toks)

# make into a data frame by reassembling the cleaned tokens
(df <- data.frame(text = sapply(toks, paste, collapse = " ")))
## text
## 1 lorem ipsum dolor sit amet red balloon
## 2 english word just made stem

Parallelization: package parallel instead of mclapply

Your first code calls a function

function(file, fID)

Your second code, by contrast, uses

function(dirPath, fID)

That mismatch in the first argument's name (file vs. dirPath) is the error.
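The mismatch matters because R matches named arguments exactly. A hypothetical reconstruction (the function and argument names here are illustrative, not from the original code):

```r
# Worker with the first signature:
readFile <- function(file, fID) {
  paste(file, fID)
}

# Calling it with the second signature's argument name fails:
# readFile(dirPath = "data/", fID = 1)
# Error: unused argument (dirPath = "data/")

# Matching the declared argument names works:
readFile(file = "data/", fID = 1)
```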

Why is DocumentTermMatrix running out of memory when plenty left?

From this post, I figured out how to fix this by limiting the number of cores used. Since there is no explicit option via DocumentTermMatrix, I had to do it via options:

num.cores <- getOption("mc.cores")
options(mc.cores=1)
dtm <- DocumentTermMatrix(vct)
options(mc.cores=num.cores)
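To make the save/restore robust even when DocumentTermMatrix itself throws an error, the same idea can be wrapped in a helper with on.exit. A sketch (dtm_single_core is a name I'm introducing; vct is the corpus from the original code):

```r
library(tm)

dtm_single_core <- function(corpus) {
  old <- options(mc.cores = 1)  # options() returns the previous settings
  on.exit(options(old))         # restored even if the next line errors
  DocumentTermMatrix(corpus)
}

dtm <- dtm_single_core(vct)
```

This way the global mc.cores option can never be left stuck at 1 by a failure mid-call.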

