DocumentTermMatrix error on Corpus argument

This would have worked just fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that functions such as tolower and trim won't necessarily return TextDocuments (the older version appears to have done the conversion automatically). They return character vectors instead, and DocumentTermMatrix isn't sure how to handle a corpus of plain character vectors.

So you could change to

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

Or you can run

corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

after all of your non-standard transformations (those not in getTransformations()) are done and just before you create the DocumentTermMatrix. That should make sure all of your data is in PlainTextDocument and should make DocumentTermMatrix happy.
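
As a minimal sketch of the first option, the snippet below wraps two base R functions in content_transformer() and then builds the matrix. The two sample strings are made up for illustration, and trimws is used here as a stand-in for whatever trimming function the original code called.

```r
library(tm)

# Hypothetical two-document corpus, for illustration only
news_corpus <- VCorpus(VectorSource(c("  Breaking NEWS  ", "Second STORY")))

# content_transformer() applies the function to each document's content
# and returns a document, so the corpus keeps holding TextDocuments
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, content_transformer(trimws))

# The documents are still TextDocuments, so this call is safe
dtm <- DocumentTermMatrix(corpus_clean)
inherits(corpus_clean[[1]], "TextDocument")
```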

Error while using DocumentTermMatrix function in R

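library(tm)  # stemDocument relies on the SnowballC package being installed

# Read the text as character strings rather than factors
def <- read.csv("Defect.csv", header = TRUE, stringsAsFactors = FALSE)
docs <- Corpus(VectorSource(def$Summary))
docs <- tm_map(docs, content_transformer(tolower))  # wrap base functions
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument, language = "english")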

Note: use TermDocumentMatrix rather than DocumentTermMatrix here, so that the terms are the rows and rowSums gives the total frequency of each term.

tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)                        # terms in rows, documents in columns
v <- sort(rowSums(m), decreasing = TRUE)   # total count per term, largest first
d <- data.frame(word = names(v), freq = v)
rownames(d) <- NULL

Your data frame should now look like this:

> head(d,10)
        word freq
1       file  157
2       data  151
3  incorrect  136
4     target  120
5       issu   95
6       tabl   82
7      sourc   69
8     column   63
9        get   61
10   process   56

R Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE

removeSparseTerms and findFreqTerms expect a DocumentTermMatrix or a TermDocumentMatrix object, not a plain matrix.

Create the DocumentTermMatrix without converting to a matrix and you won't get the error.

dtm <- DocumentTermMatrix(corpus)
sparse <- removeSparseTerms(dtm, 0.80)
freq <- findFreqTerms(dtm, 2)
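
A small sketch of the difference, using a made-up three-document corpus: the tm functions run on the DocumentTermMatrix itself, and as.matrix() is applied only as the last step, once the tm-specific functions are no longer needed.

```r
library(tm)

# Hypothetical sample corpus, for illustration only
corpus <- VCorpus(VectorSource(c("apple banana apple",
                                 "banana cherry",
                                 "apple cherry cherry")))

dtm <- DocumentTermMatrix(corpus)

# Both functions receive the DocumentTermMatrix object
sparse <- removeSparseTerms(dtm, 0.80)
freq <- findFreqTerms(dtm, 2)

# Passing a plain matrix instead triggers the inherits() error
res <- tryCatch(findFreqTerms(as.matrix(dtm), 2), error = function(e) e)
inherits(res, "error")

# Convert to a plain matrix only at the end
m <- as.matrix(sparse)
```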

TermDocumentMatrix Error after Cleaning Corpus

Removing the line
corpus <- tm_map(corpus, bracketX)
did the job, and the code now works correctly.

R DocumentTermMatrix loses results less than 100

If you read the ?TermDocumentMatrix help page, you can see that the additional control= options are listed in the ?termFreq help page.

There is a wordLengths parameter which filters the words used in the matrix by length. It defaults to c(3,Inf), so it excludes words of fewer than three characters. Try setting control=list(wordLengths=c(2,Inf)) to include two-character words. (Note that when passing control parameters, you should name the parameters in the list.)
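
A minimal sketch of the effect, assuming a small made-up corpus in which the interesting tokens are two characters long. The first call uses the default word lengths, and the second passes a named control list.

```r
library(tm)

# Hypothetical sample corpus, for illustration only
docs <- VCorpus(VectorSource(c("the id is missing", "id ok")))

# Default control: words shorter than three characters are dropped
dtm_default <- DocumentTermMatrix(docs)

# Named control list: keep words of two characters or more
dtm_short <- DocumentTermMatrix(docs, control = list(wordLengths = c(2, Inf)))

Terms(dtm_default)
Terms(dtm_short)
```

Comparing the two Terms() results should show the short tokens such as "id" only in the second matrix.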

DocumentTermMatrix needs to have a term frequency weighting Error

Part of the problem is that you are weighting the document-term matrix by tf-idf, but LDA requires term counts. In addition, this method of removing sparse terms seems to be creating some documents where all terms have been removed.
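
Staying within tm, both points can be sketched roughly as follows. The corpus and the 0.5 sparsity threshold are made up for illustration; the idea is to keep the default count weighting and to drop any documents left empty after sparse terms are removed.

```r
library(tm)
library(slam)

# Hypothetical sample corpus; the third document shares no terms with the others
docs <- VCorpus(VectorSource(c("data file error",
                               "data file missing",
                               "zzz")))

# The default weighting is raw term frequency (counts), which LDA needs;
# do not pass weighting = weightTfIdf here
dtm <- DocumentTermMatrix(docs)

# Removing sparse terms can leave some documents with no terms at all
dtm_trimmed <- removeSparseTerms(dtm, 0.5)

# Keep only the documents that still contain at least one term
dtm_nonempty <- dtm_trimmed[slam::row_sums(dtm_trimmed) > 0, ]
nDocs(dtm_nonempty)
```

The resulting count matrix, with its empty rows removed, is the kind of input that topicmodels::LDA expects.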

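It is easier to get from your text to topic models using the quanteda package. Here's how (this uses the quanteda API as it was when the answer was written; several of these functions have been renamed in later releases):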

require(quanteda)
myCorpus <- corpus(textfile("http://homepage.stat.uiowa.edu/~thanhtran/askreddit201508.csv",
                            textField = "title"))
myDfm <- dfm(myCorpus, stem = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 160,707 documents
##    ... indexing features: 39,505 feature types
##    ... stemming features (English), trimmed 12563 feature variants
##    ... created a 160707 x 26942 sparse dfm
##    ... complete. 

# remove infrequent terms: see http://stats.stackexchange.com/questions/160539/is-this-interpretation-of-sparsity-accurate/160599#160599
sparsityThreshold <- round(ndoc(myDfm) * (1 - 0.99999))
myDfm2 <- trim(myDfm, minDoc = sparsityThreshold)
## Features occurring in fewer than 1.60707 documents: 12579
nfeature(myDfm2)
## [1] 14363

# fit the LDA model
require(topicmodels)
LDA2 <- LDA(quantedaformat2dtm(myDfm2), 100)
