R tm package VCorpus: Error in converting corpus to data frame

Your corpus is really just a character vector with some extra attributes. So it's best to convert it to character, then you can save that to a data.frame like so:

library(tm)
x <- c("Hello. Sir!","Tacos? On Tuesday?!?")
mycorpus <- Corpus(VectorSource(x))
mycorpus <- tm_map(mycorpus, removePunctuation)

dataframe <- data.frame(text = unlist(sapply(mycorpus, `[`, "content")),
                        stringsAsFactors = FALSE)

which returns

              text
1        Hello Sir
2 Tacos On Tuesday

UPDATE: Newer versions of tm seem to have changed the as.list.SimpleCorpus method, which breaks the sapply and lapply approach shown above. With those versions you would have to use

dataframe <- data.frame(text = sapply(mycorpus, identity),
                        stringsAsFactors = FALSE)
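
One way to sidestep the version difference is to build a VCorpus rather than the default Corpus, since each element of a VCorpus is a text document that as.character() can unwrap. This is only a minimal sketch under that assumption; the names vc and df are illustrative and not from the original answer.

```r
library(tm)

docs <- c("Hello. Sir!", "Tacos? On Tuesday?!?")

# Assumption: a VCorpus is acceptable in place of the default Corpus
vc <- VCorpus(VectorSource(docs))
vc <- tm_map(vc, removePunctuation)

# Unwrap each document to a single string, then collect them in a data frame
df <- data.frame(
  text = vapply(vc, function(d) paste(as.character(d), collapse = " "), character(1)),
  stringsAsFactors = FALSE
)
```

vapply() is used instead of sapply() so that the result is guaranteed to be a character vector with one entry per document.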

Unable to convert a Corpus to Data Frame in R

This ought to do it:

data.frame(text = sapply(myCorpus, as.character), stringsAsFactors = FALSE)

Edited with a working solution, using crude as the example.

The problem here is that you cannot apply stemCompletion as a transformation.

getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords" "stemDocument" "stripWhitespace"

does not include stemCompletion, which takes a vector of stemmed tokens as input.

So this should do it: first you extract the transformed texts and tokenise them, then complete the stems, then paste back together. Here I have illustrated the solution using the built-in crude corpus.

library(tm)
data(crude)
myCorpus <- crude
myCorpus <- tm_map(myCorpus, removeWords, stopwords('english'))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
dictCorpus <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)
# tokenize the corpus
myCorpusTokenized <- lapply(myCorpus, scan_tokenizer)
# stem complete each token vector
myTokensStemCompleted <- lapply(myCorpusTokenized, stemCompletion, dictCorpus)
# concatenate tokens by document, create data frame
myDf <- data.frame(text = sapply(myTokensStemCompleted, paste, collapse = " "), stringsAsFactors = FALSE)

convert corpus into data.frame in R

Applying

gsub("http\\w+", "", mycorpus)

returns an object of class character rather than a corpus, which worked in my case.
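
If you would rather keep the corpus structure until the final conversion, the same gsub call can be wrapped in tm's content_transformer and applied with tm_map. The sketch below assumes a VCorpus and uses made-up sample text; strip_http and cleaned are illustrative names.

```r
library(tm)

# Made-up sample text containing a URL with its punctuation already stripped
docs <- c("check httpstcoabc now", "no links here")
vc <- VCorpus(VectorSource(docs))

# Wrap gsub so it edits each document's content and returns a document
strip_http <- content_transformer(function(x) gsub("http\\w+", "", x))
vc <- tm_map(vc, strip_http)

# Convert the cleaned corpus to a data frame, one row per document
cleaned <- data.frame(
  text = vapply(vc, function(d) paste(as.character(d), collapse = " "), character(1)),
  stringsAsFactors = FALSE
)
```

Note that the pattern only matches "http" followed by word characters, so a URL that still contains ":" or "/" would only be partially removed.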

Convert Corpus from quanteda to tm

You can construct a tm VCorpus directly by wrapping the quanteda corpus in VectorSource() and passing the result to VCorpus(), because a quanteda corpus is just a special character vector.

library("tm")
## Loading required package: NLP

# from version 3.0 of quanteda
data(data_corpus_inaugural, package = "quanteda")

VCorpus(VectorSource(data_corpus_inaugural))
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 59

However... Do you really want/need to do this?

Error when importing tm Vcorpus into Quanteda corpus

Impossible to know the exact problem without a) version information on your packages and b) a reproducible example.

Why use tm at all? You could have created a quanteda corpus directly as:

corpus(citiesText)

Converting a VCorpus works fine for me.

library("quanteda")
## Package version: 2.0.1

library("tm")
packageVersion("tm")
## [1] ‘0.7.7’

reut21578 <- system.file("texts", "crude", package = "tm")
VCorp <- VCorpus(
  DirSource(reut21578, mode = "binary"),
  list(reader = readReut21578XMLasPlain)
)

corpus(VCorp)
## Corpus consisting of 20 documents and 16 docvars.
## text1 :
## "Diamond Shamrock Corp said that effective today it had cut i..."
##
## text2 :
## "OPEC may be forced to meet before a scheduled June session t..."
##
## text3 :
## "Texaco Canada said it lowered the contract price it will pay..."
##
## text4 :
## "Marathon Petroleum Co said it reduced the contract price it ..."
##
## text5 :
## "Houston Oil Trust said that independent petroleum engineers ..."
##
## text6 :
## "Kuwait"s Oil Minister, in remarks published today, said ther..."
##
## [ reached max_ndoc ... 14 more documents ]

How can I convert an R data frame with a single column into a corpus for tm such that each row is taken as a document?

I would recommend you read the tm-vignette first before proceeding. Answer to your specific question below.

Create example data:

txt <- strsplit("I wanted to use the findAssocs of the tm package. but it works only when there are more than one documents in the corpus. I have a data frame table which has one column and each row has a tweet text. Is it possible to convert the into a corpus which takes each row as a new document?", split=" ")[[1]]
data <- data.frame(text=txt, stringsAsFactors=FALSE)
data[1:5, ]

Import your data into a "Source", your "Source" into a "Corpus", and then make a TDM out of your "Corpus":

library(tm)
tdm <- TermDocumentMatrix(Corpus(DataframeSource(data)))

show(tdm)
#A term-document matrix (35 terms, 58 documents)
#
#Non-/sparse entries: 43/1987
#Sparsity : 98%
#Maximal term length: 10
#Weighting : term frequency (tf)

str(tdm)
#List of 6
# $ i : int [1:43] 32 31 28 12 28 21 3 35 20 33 ...
# $ j : int [1:43] 2 4 5 6 8 10 11 13 14 15 ...
# $ v : num [1:43] 1 1 1 1 1 1 1 1 1 1 ...
# $ nrow : int 35
# $ ncol : int 58
# $ dimnames:List of 2
# ..$ Terms: chr [1:35] "and" "are" "but" "column" ...
# ..$ Docs : chr [1:58] "1" "2" "3" "4" ...
# - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
# - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
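
In more recent tm releases (the 0.7 series, as far as I can tell), DataframeSource expects the first column to be named doc_id and the second to be named text, so the call above may fail on a single-column data frame. Here is a minimal sketch of two workarounds, assuming a hypothetical data frame called tweets with one text column:

```r
library(tm)

# Hypothetical single-column data frame: one tweet per row
tweets <- data.frame(
  text = c("tacos on tuesday", "hello sir", "tacos again"),
  stringsAsFactors = FALSE
)

# Option 1: skip DataframeSource and pass the column as a character vector
corp_vec <- VCorpus(VectorSource(tweets$text))

# Option 2: add the doc_id column that newer DataframeSource versions expect
tweets_src <- data.frame(
  doc_id = as.character(seq_len(nrow(tweets))),
  text = tweets$text,
  stringsAsFactors = FALSE
)
corp_df <- VCorpus(DataframeSource(tweets_src))

# Either corpus treats each row as its own document
tdm <- TermDocumentMatrix(corp_vec)
dim(tdm)  # terms x documents
```

Option 1 is the shorter route when the data frame carries no metadata; Option 2 is worth the extra column if you later add more columns that should travel with each document.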

