R tm package vcorpus: Error in converting corpus to data frame
Your corpus is really just a character vector with some extra attributes. So it's best to convert it to character, then you can save that to a data.frame like so:
library(tm)
x <- c("Hello. Sir!","Tacos? On Tuesday?!?")
mycorpus <- Corpus(VectorSource(x))
mycorpus <- tm_map(mycorpus, removePunctuation)
dataframe <- data.frame(text = unlist(sapply(mycorpus, `[`, "content")),
                        stringsAsFactors = FALSE)
which returns
text
1 Hello Sir
2 Tacos On Tuesday
UPDATE: With newer versions of tm, they seem to have updated the as.list.SimpleCorpus method, which really messes with using sapply and lapply. Now I guess you'd have to use
dataframe <- data.frame(text = sapply(mycorpus, identity),
                        stringsAsFactors = FALSE)
Unable to convert a Corpus to Data Frame in R
This ought to do it:
data.frame(text = sapply(myCorpus, as.character), stringsAsFactors = FALSE)
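If some documents hold more than one line of text, sapply may hand back a list rather than a character vector. Here is a minimal sketch of a slightly more defensive version, assuming a VCorpus built from a VectorSource; corpus_to_df is a hypothetical helper name, not something tm provides.

```r
library(tm)

# Hypothetical helper (not part of tm): one row per document.
# Multi-line documents are collapsed into a single string first,
# so vapply always gets exactly one character value per document.
corpus_to_df <- function(corpus) {
  texts <- vapply(
    corpus,
    function(doc) paste(as.character(doc), collapse = "\n"),
    character(1)
  )
  data.frame(
    doc_id = names(corpus),
    text = unname(texts),
    stringsAsFactors = FALSE
  )
}

x <- c("Hello. Sir!", "Tacos? On Tuesday?!?")
corp <- VCorpus(VectorSource(x))
corp <- tm_map(corp, removePunctuation)

df <- corpus_to_df(corp)
str(df)
```

vapply is used instead of sapply so that an unexpected document shape fails loudly rather than silently producing a list column.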
Edited with a working solution, using crude as an example.
The problem here is that you cannot apply stemCompletion as a transformation.
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords" "stemDocument" "stripWhitespace"
does not include stemCompletion, which takes a vector of stemmed tokens as input.
So this should do it: first you extract the transformed texts and tokenise them, then complete the stems, then paste back together. Here I have illustrated the solution using the built-in crude corpus.
library(tm)
data(crude)
myCorpus <- crude
myCorpus <- tm_map(myCorpus, removeWords, stopwords('english'))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
dictCorpus <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)
# tokenize the corpus
myCorpusTokenized <- lapply(myCorpus, scan_tokenizer)
# stem complete each token vector
myTokensStemCompleted <- lapply(myCorpusTokenized, stemCompletion, dictCorpus)
# concatenate tokens by document, create data frame
myDf <- data.frame(text = sapply(myTokensStemCompleted, paste, collapse = " "), stringsAsFactors = FALSE)
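To see why stemCompletion does not fit into tm_map, it may help to call it on its own. The sketch below assumes the crude corpus as the dictionary and uses three made-up stems; it only shows the input and output shapes, not a full pipeline.

```r
library(tm)
data(crude)

# stemCompletion works on a plain character vector of stems,
# not on a document, which is why it cannot be passed to tm_map directly.
stems <- c("compan", "entit", "suppl")

# The dictionary can be a corpus (as here) or a character vector of words.
completed <- stemCompletion(stems, dictionary = crude)
print(completed)
```

The result is a character vector named by the input stems, so it has to be pasted back together per document, as in the lapply/sapply steps above.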
convert corpus into data.frame in R
By applying
gsub("http\\w+", "", mycorpus)
the output has class = character, so it works in my case.
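If you would rather keep a corpus object instead of dropping to a character vector, one option is to wrap the same gsub call in content_transformer and apply it with tm_map. This is only a sketch under the assumption that punctuation has already been removed (so URLs look like "httptcoabc"); the sample texts and the name strip_http are made up for illustration.

```r
library(tm)

# Made-up tweets, as they might look after removePunctuation.
tweets <- c("check this httptcoabc now", "no link here")
corp <- VCorpus(VectorSource(tweets))

# Wrap gsub so that tm_map modifies each document's content
# and the result is still a corpus.
strip_http <- content_transformer(function(x) gsub("http\\w+", "", x))
corp <- tm_map(corp, strip_http)

# Pull the cleaned text back out to check it.
cleaned <- unname(vapply(corp, as.character, character(1)))
print(cleaned)
```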
Convert Corpus from quanteda to tm
You can construct a tm VCorpus directly by wrapping the quanteda corpus in VectorSource() and passing that to VCorpus(), because a quanteda corpus is just a special character vector.
library("tm")
## Loading required package: NLP
# from version 3.0 of quanteda
data(data_corpus_inaugural, package = "quanteda")
VCorpus(VectorSource(data_corpus_inaugural))
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 59
However... Do you really want/need to do this?
Error when importing tm Vcorpus into Quanteda corpus
Impossible to know the exact problem without a) version information on your packages and b) a reproducible example.
Why use tm at all? You could have created a quanteda corpus directly as:
corpus(citiesText)
Converting a VCorpus works fine for me.
library("quanteda")
## Package version: 2.0.1
library("tm")
packageVersion("tm")
## [1] ‘0.7.7’
reut21578 <- system.file("texts", "crude", package = "tm")
VCorp <- VCorpus(
  DirSource(reut21578, mode = "binary"),
  list(reader = readReut21578XMLasPlain)
)
corpus(VCorp)
## Corpus consisting of 20 documents and 16 docvars.
## text1 :
## "Diamond Shamrock Corp said that effective today it had cut i..."
##
## text2 :
## "OPEC may be forced to meet before a scheduled June session t..."
##
## text3 :
## "Texaco Canada said it lowered the contract price it will pay..."
##
## text4 :
## "Marathon Petroleum Co said it reduced the contract price it ..."
##
## text5 :
## "Houston Oil Trust said that independent petroleum engineers ..."
##
## text6 :
## "Kuwait"s Oil Minister, in remarks published today, said ther..."
##
## [ reached max_ndoc ... 14 more documents ]
How can I convert an R data frame with a single column into a corpus for tm such that each row is taken as a document?
I would recommend you read the tm vignette first before proceeding. Answer to your specific question below.
Create example data:
txt <- strsplit("I wanted to use the findAssocs of the tm package. but it works only when there are more than one documents in the corpus. I have a data frame table which has one column and each row has a tweet text. Is it possible to convert the into a corpus which takes each row as a new document?", split=" ")[[1]]
data <- data.frame(text=txt, stringsAsFactors=FALSE)
data[1:5, ]
Import your data into a "Source", your "Source" into a "Corpus", and then make a TDM out of your "Corpus":
library(tm)
tdm <- TermDocumentMatrix(Corpus(DataframeSource(data)))
show(tdm)
#A term-document matrix (35 terms, 58 documents)
#
#Non-/sparse entries: 43/1987
#Sparsity : 98%
#Maximal term length: 10
#Weighting : term frequency (tf)
str(tdm)
#List of 6
# $ i : int [1:43] 32 31 28 12 28 21 3 35 20 33 ...
# $ j : int [1:43] 2 4 5 6 8 10 11 13 14 15 ...
# $ v : num [1:43] 1 1 1 1 1 1 1 1 1 1 ...
# $ nrow : int 35
# $ ncol : int 58
# $ dimnames:List of 2
# ..$ Terms: chr [1:35] "and" "are" "but" "column" ...
# ..$ Docs : chr [1:58] "1" "2" "3" "4" ...
# - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
# - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
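Note that recent versions of tm are stricter about DataframeSource: as far as I know, the first column must be named doc_id and the second text, so a one-column data frame is rejected. A minimal sketch of two ways around that, using a made-up three-row data frame called tweets:

```r
library(tm)

# Made-up example data: one tweet per row.
tweets <- data.frame(
  text = c("first tweet text", "second tweet here", "third one"),
  stringsAsFactors = FALSE
)

# Option 1: skip the data frame source and pass the column as a vector.
corp1 <- VCorpus(VectorSource(tweets$text))

# Option 2: add the doc_id column that DataframeSource expects
# (doc_id first, text second).
docs <- data.frame(
  doc_id = as.character(seq_len(nrow(tweets))),
  text = tweets$text,
  stringsAsFactors = FALSE
)
corp2 <- VCorpus(DataframeSource(docs))

# Either corpus treats each row as its own document.
tdm <- TermDocumentMatrix(corp1)
nDocs(tdm)
```

Option 1 is the shorter route when the text is the only column you care about; Option 2 keeps any extra columns as document-level metadata.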