tm_map has parallel::mclapply error in R 3.0.1 on Mac
I suspect you don't have the SnowballC package installed, which seems to be required. tm_map is supposed to run stemDocument on all the documents using mclapply. Try just running the stemDocument function on one document, so you can extract the error:
stemDocument(crude[[1]])
For me, I got an error:
Error in loadNamespace(name) : there is no package called ‘SnowballC’
So I just went ahead and installed SnowballC and it worked. Clearly, SnowballC should be a dependency.
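A quick way to confirm this diagnosis before calling tm_map is to check whether the packages can be loaded at all. The sketch below uses base R's requireNamespace; the helper name missing_packages is made up for illustration.

```r
# Hypothetical helper: return the names of packages that cannot be loaded.
missing_packages <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

needed <- missing_packages(c("tm", "SnowballC"))
if (length(needed) > 0) {
  message("Install these first: ", paste(needed, collapse = ", "))
  # install.packages(needed)  # uncomment to install from CRAN
}
```

If SnowballC shows up in the message, installing it should make the stemDocument call above succeed.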
parallel foreach loops produce mclapply error
You're getting that error because registerDoMC expects an integer argument, not a cluster object, while registerDoParallel expects either an integer or a cluster object. Basically, you need to decide which package to use and not mix them.
If you use doMC, then you never create a cluster object. A minimal doMC example looks like:
library(doMC)
registerDoMC(3)
foreach(i=1:10) %dopar% sqrt(i)
The doParallel package is a mashup of the doMC and doSNOW packages, and so you don't need to use cluster objects. Converting the previous example to doParallel is very simple:
library(doParallel)
registerDoParallel(3)
foreach(i=1:10) %dopar% sqrt(i)
The confusing thing is that on Windows, doParallel will actually create and use a cluster object behind the scenes, while on Linux and Mac OS X, it doesn't use a cluster object because it uses mclapply just as in the doMC package. I think that is convenient, but it can be a source of confusion.
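Since registerDoParallel also accepts a cluster object, the explicit-cluster variant can be sketched as follows. This assumes the doParallel package is installed; the worker count of 2 is arbitrary.

```r
library(doParallel)  # also attaches foreach and parallel

cl <- makeCluster(2)      # create the cluster explicitly
registerDoParallel(cl)    # pass the cluster object, not a core count
res <- foreach(i = 1:10, .combine = c) %dopar% sqrt(i)
stopCluster(cl)           # shut the workers down when finished
```

Passing cl to registerDoMC instead is exactly the mix-up that triggers the error described above.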
tm_map is error in R
tm_map has to be applied to a Corpus object, not a character vector. But iconv turns your TweetCorpus object from a Corpus back into a character vector.
To fix this, switch the order of your pre-processing, so that you use iconv before you turn the tweets into a Corpus object:
TweetList <- c("hello", "world", "Hooray", "yep")
TweetList <- iconv(TweetList, to ="utf-8")
TweetCorpus <- Corpus(VectorSource(TweetList))
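If the corpus already exists, another option is to wrap iconv so that tm_map still hands back a corpus. This is only a sketch: it assumes a version of tm that provides content_transformer, and the sub = "byte" argument is my own addition to avoid NA results on input that cannot be converted.

```r
library(tm)

TweetList <- c("hello", "world", "Hooray", "yep")
TweetCorpus <- Corpus(VectorSource(TweetList))

# content_transformer() wraps a character-level function so that
# tm_map() returns a corpus rather than a bare character vector.
to_utf8 <- content_transformer(function(x) iconv(x, to = "UTF-8", sub = "byte"))
TweetCorpus <- tm_map(TweetCorpus, to_utf8)

# Check that the result is still a corpus.
inherits(TweetCorpus, "Corpus")
```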
R tm package Upgrade - Error in converting corpus to data frame
Looks very complicated. How about:
data <- c("Lorem ipsum dolor sit amet account: 999 red balloons.",
"Some English words are just made for stemming!")
require(quanteda)
# makes the texts into a list of tokens with the same treatment
# as your tm mapped functions
toks <- tokenize(toLower(data), removePunct = TRUE, removeNumbers = TRUE)
# toks is just a named list
toks
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "lorem" "ipsum" "dolor" "sit" "amet" "account" "red" "balloons"
##
## Component 2 :
## [1] "some" "english" "words" "are" "just" "made" "for" "stemming"
# remove selected terms
toks <- removeFeatures(toks, c(stopwords("english"), "hi", "account", "can"))
# apply stemming
toks <- wordstem(toks)
# make into a data frame by reassembling the cleaned tokens
(df <- data.frame(text = sapply(toks, paste, collapse = " ")))
##                                     text
## 1 lorem ipsum dolor sit amet red balloon
## 2            english word just made stem
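The function names above come from an early quanteda release and were renamed later. A rough equivalent for a more recent quanteda might look like this, assuming a version where tokens(), tokens_remove() and tokens_wordstem() are available. I have left out the printed output because it can differ between versions.

```r
library(quanteda)

data <- c("Lorem ipsum dolor sit amet account: 999 red balloons.",
          "Some English words are just made for stemming!")

# tokenize after lower-casing, dropping punctuation and numbers
toks <- tokens(char_tolower(data), remove_punct = TRUE, remove_numbers = TRUE)
# remove selected terms
toks <- tokens_remove(toks, c(stopwords("english"), "hi", "account", "can"))
# apply stemming
toks <- tokens_wordstem(toks)
# reassemble the cleaned tokens into a data frame, one row per document
df <- data.frame(text = sapply(as.list(toks), paste, collapse = " "))
```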
Parallelization: package parallel instead of mclapply
Your first code calls a function
function(file, fID)
Your second code, by contrast, uses
function(dirPath,fID)
That mismatch in the first argument's name (file versus dirPath) is the error.
Why is DocumentTermMatrix running out of memory when plenty left?
From this post, I figured out how to fix this by limiting the number of cores used. Since there is no explicit option via DocumentTermMatrix, I had to do it via options:
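# remember the current setting (NULL if it was never set)
num.cores <- getOption("mc.cores")
# force a single core while building the matrix
options(mc.cores = 1)
dtm <- DocumentTermMatrix(vct)
# restore the previous setting
options(mc.cores = num.cores)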
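The same save-and-restore pattern can be wrapped in a function so the old value comes back even if the call fails. A minimal sketch in base R; with_single_core is a made-up helper name, not part of tm or parallel.

```r
# Hypothetical helper: evaluate `expr` with mc.cores temporarily set to 1.
with_single_core <- function(expr) {
  old <- options(mc.cores = 1)        # options() returns the previous values
  on.exit(options(old), add = TRUE)   # restore on exit, even after an error
  force(expr)                         # lazy evaluation: runs with mc.cores = 1
}

# Usage with a stand-in expression; in practice this would be
# with_single_core(DocumentTermMatrix(vct))
with_single_core(getOption("mc.cores"))
```

Because the argument is evaluated lazily, the expression only runs after the option has been changed.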