DocumentTermMatrix error on Corpus argument
It seems this would have worked just fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (it looks like the older version may have done the conversion automatically). They instead return character vectors, and DocumentTermMatrix isn't sure how to handle a corpus of character vectors.
So you could change to
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
Or you can run
corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
after all of your non-standard transformations (those not in getTransformations()) are done and just before you create the DocumentTermMatrix. That should make sure all of your data is in PlainTextDocument form and should make DocumentTermMatrix happy.
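As a minimal sketch of the first approach, the following builds a toy corpus and wraps the base functions in content_transformer() before creating the matrix. The sample texts are made up for illustration, and trimws (base R) stands in for the trim function mentioned above, which is an assumption about what that function did.

```r
library(tm)

# Toy corpus standing in for news_corpus (hypothetical data)
texts <- c("  Stocks RALLY on earnings ", "Rain expected on Friday")
news_corpus <- VCorpus(VectorSource(texts))

# Wrap base functions so each document keeps its document class
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
# trimws is an assumed stand-in for the original trim function
corpus_clean <- tm_map(corpus_clean, content_transformer(trimws))

# Transformations listed in getTransformations() can be passed directly
corpus_clean <- tm_map(corpus_clean, removePunctuation)

dtm <- DocumentTermMatrix(corpus_clean)
inherits(dtm, "DocumentTermMatrix")
```

Wrapping with content_transformer() changes only the text content of each document, so the corpus stays in a form that DocumentTermMatrix accepts.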
Error while using DocumentTermMatrix function in R
def<-read.csv("Defect.csv",header = T)
docs<-Corpus(VectorSource(def$Summary))
docs<-tm_map(docs,content_transformer(tolower))
docs<-tm_map(docs,removeNumbers)
docs<-tm_map(docs,removeWords,stopwords("english"))
docs<-tm_map(docs,removePunctuation)
docs<-tm_map(docs,stripWhitespace)
docs<-tm_map(docs,stemDocument,language = "english")
Note: use TermDocumentMatrix rather than DocumentTermMatrix here, so that terms are rows and rowSums gives the total count of each term.
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
rownames(d) <- NULL
Your data frame should now look like this:
> head(d,10)
        word freq
1       file  157
2       data  151
3  incorrect  136
4     target  120
5       issu   95
6       tabl   82
7      sourc   69
8     column   63
9        get   61
10   process   56
R Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE
removeSparseTerms and findFreqTerms expect a DocumentTermMatrix or a TermDocumentMatrix object, not a plain matrix.
Create the DocumentTermMatrix and pass it to these functions without first converting it to a matrix, and you won't get the error.
dtm <- DocumentTermMatrix(corpus)
sparse <- removeSparseTerms(dtm, 0.80)
freq <- findFreqTerms(dtm, 2)
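A small self-contained sketch of the same idea, using a made-up three-document corpus: the tm functions are called on the DocumentTermMatrix object, and the conversion with as.matrix happens only afterwards, when plain numeric operations are needed.

```r
library(tm)

# Hypothetical sample documents
docs <- c("file data error", "file data target", "file column issue")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Both functions operate on the DocumentTermMatrix object itself
sparse <- removeSparseTerms(dtm, 0.80)
freq <- findFreqTerms(dtm, 2)

# Convert to a plain matrix only afterwards, e.g. to total each term
m <- as.matrix(sparse)
term_totals <- sort(colSums(m), decreasing = TRUE)
```

Because documents are rows in a DocumentTermMatrix, colSums gives per-term totals here; with a TermDocumentMatrix the equivalent would be rowSums.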
TermDocumentMatrix Error after Cleaning Corpus
Removing
corpus <- tm_map(corpus, bracketX)
did the job, and the code is now functioning correctly.
R DocumentTermMatrix loses results less than 100
If you read the ?TermDocumentMatrix help page you can see that additional control= options are listed in the ?termFreq help page.
There is a wordLengths parameter which filters the length of the words used in the matrix. It defaults to c(3,Inf), so it excludes two-character words, which includes two-digit numbers. Try setting the value to control=list(wordLengths=c(2,Inf)) to include those short words. (Note that when passing control parameters, you should name the parameters in the list.)
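To see the effect of wordLengths, here is a minimal sketch on a one-document corpus. The sample sentence is invented for illustration; the comparison is between the default control settings and a named wordLengths entry.

```r
library(tm)

# Hypothetical document containing a two-digit and a three-digit number
corpus <- VCorpus(VectorSource("items 99 and 100 go home"))

# Default: only terms of three or more characters are kept
dtm_default <- DocumentTermMatrix(corpus)

# Named control parameter lowers the minimum term length to two
dtm_short <- DocumentTermMatrix(corpus,
                                control = list(wordLengths = c(2, Inf)))

Terms(dtm_default)
Terms(dtm_short)
```

Comparing the two Terms() results shows which tokens the default length filter drops.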
DocumentTermMatrix needs to have a term frequency weighting Error
Part of the problem is that you are weighting the document-term matrix by tf-idf, but LDA requires term counts. In addition, this method of removing sparse terms seems to be creating some documents where all terms have been removed.
It is easier to get from your text to topic models using the quanteda package. Here's how:
require(quanteda)
myCorpus <- corpus(textfile("http://homepage.stat.uiowa.edu/~thanhtran/askreddit201508.csv",
textField = "title"))
myDfm <- dfm(myCorpus, stem = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 160,707 documents
## ... indexing features: 39,505 feature types
## ... stemming features (English), trimmed 12563 feature variants
## ... created a 160707 x 26942 sparse dfm
## ... complete.
# remove infrequent terms: see http://stats.stackexchange.com/questions/160539/is-this-interpretation-of-sparsity-accurate/160599#160599
sparsityThreshold <- round(ndoc(myDfm) * (1 - 0.99999))
myDfm2 <- trim(myDfm, minDoc = sparsityThreshold)
## Features occurring in fewer than 1.60707 documents: 12579
nfeature(myDfm2)
## [1] 14363
# fit the LDA model
require(topicmodels)
LDA2 <- LDA(quantedaformat2dtm(myDfm2), 100)
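If you would rather stay within tm, the two problems named above can be addressed along these lines. This is only a sketch on invented sample text, not the original poster's data: it keeps the default term-frequency weighting instead of tf-idf and drops documents whose rows have no remaining terms, using row_sums from the slam package.

```r
library(tm)
library(slam)

# Hypothetical documents; the second has only two-letter words, so all of
# its terms are filtered out by the default wordLengths setting
docs <- c("topic models need counts", "of it", "counts not weights")
corpus <- VCorpus(VectorSource(docs))

# No weighting argument: the matrix holds raw term counts
dtm <- DocumentTermMatrix(corpus)

# Keep only documents with at least one non-zero entry
dtm_nonempty <- dtm[row_sums(dtm) > 0, ]

nDocs(dtm)
nDocs(dtm_nonempty)
```

The filtered matrix can then be passed to LDA from topicmodels in place of the tf-idf weighted one.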