How do I resolve dataloss & error with TermDocumentMatrix() and DocumentTermMatrix(), respectively?
I slightly changed your original data because each of your emoticons appears only once in the text, which turns every tf-idf value into 1 (see the data below; I just randomly added a few). I'm using quanteda instead of tm as it is faster and has far fewer problems with encoding.
library(dplyr)
library(quanteda)
# tokenise first: calling dfm() directly on raw text is deprecated in quanteda >= 3
tweets_dfm <- dfm(tokens(TextSet$tweet)) # convert to document-feature matrix
tweets_dfm %>%
  dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
  dfm_tfidf() %>% # weight with tfidf
  convert("data.frame") # turn into data.frame to display more easily
#> document <U+0001F914> <U+0001F4AA> <U+0001F603> <U+0001F953> <U+0001F37A>
#> 1 text1 1.39794 1 0 0 0
#> 2 text2 0.00000 0 1 0 0
#> 3 text3 0.00000 0 0 0 0
#> 4 text4 0.00000 0 0 0 0
#> 5 text5 0.00000 0 0 0 0
#> 6 text6 0.69897 0 0 0 0
#> 7 text7 0.00000 0 0 1 1
#> 8 text8 0.00000 0 0 0 0
#> 9 text9 0.00000 0 0 0 0
#> 10 text10 0.00000 0 0 0 0
The column names (i.e., emojis) are displayed correctly in my Viewer and it should be possible to export the resulting data.frame.
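To see where the numbers in that table come from, here is a minimal base-R sketch of the weighting, assuming the dfm_tfidf() defaults (raw term count multiplied by log10 of the number of documents over the document frequency). The helper name tfidf_value is made up for illustration and is not part of quanteda.

```r
# Hypothetical helper: tf-idf for one cell, assuming count * log10(N / docfreq)
tfidf_value <- function(count, docfreq, n_docs = 10) {
  count * log10(n_docs / docfreq)
}

once_in_one_doc  <- tfidf_value(count = 1, docfreq = 1)  # emoticon used once in a single tweet
twice_in_two_docs <- tfidf_value(count = 2, docfreq = 2) # used twice here, present in 2 tweets
once_in_two_docs  <- tfidf_value(count = 1, docfreq = 2) # used once here, present in 2 tweets

round(c(once_in_one_doc, twice_in_two_docs, once_in_two_docs), 5)
#> [1] 1.00000 1.39794 0.69897
```

This is why an emoticon that occurs exactly once in the whole set always ends up with a weight of 1 when there are 10 documents.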
data
TagSet <- data.frame(emoticon = c("\U0001F914", "\U0001F4AA", "\U0001F603", "\U0001F953", "\U0001F37A"),
                     stringsAsFactors = FALSE)
TextSet <- data.frame(tweet = c("Sharp, adversarial⚔️~pro choice~ban Pit Bulls☠️~BSL️~aberant psychology~common sense~the Piper will lead us to reason~sealskin woman",
                                "Blocked by Owen, Adonis. Abbott & many #FBPE Love seaside, historic houses & gardens, family & pets. RTs & likes/ Follows may=interest not agreement",
                                " #healthy #vegetarian #beatchronicillness fix infrastructure",
                                "LIBERTY-IDENTITARIAN. My bio, photo at Site Info. And kindly add my site to your Daily Favorites bar. Thank you, Eric",
                                "I #BackTheBlue for my son! Facts Over Feelings. Border Security saves lives! #ThankYouICE",
                                " I play Pedal Steel @CooderGraw & #CharlieShafter #GoStars #LiberalismIsAMentalDisorder",
                                "#Englishman #Londoner @Chelseafc ️♂️",
                                "F*** the Anti-White Agenda #Christian #Traditional #TradThot #TradGirl #European #MAGA #AltRight #Folk #Family #WhitePride",
                                "❄️Do not dwell in the past, do not dream of the future, concentrate the mind on the present moment.️❄️",
                                "Ordinary girl in a messed up World | Christian | Anti-War | Anti-Zionist | Pro-Life | Pro | Hello intro on the Minds Link |"),
                      stringsAsFactors = FALSE)
DocumentTermMatrix error on Corpus argument
It seems this would have worked just fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (it looks like the older version may have automatically done the conversion). They instead return character vectors, and DocumentTermMatrix isn't sure how to handle a corpus of character vectors.
So you could change to
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
Or you can run
corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
after all of your non-standard transformations (those not in getTransformations()) are done and just before you create the DocumentTermMatrix. That should make sure all of your data is in PlainTextDocument form and should make DocumentTermMatrix happy.
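Put together, the first option might look like the sketch below. It assumes tm 0.6.0 or later is installed; the two sample sentences and the choice of VCorpus are made up for illustration, and stripWhitespace stands in for any of the standard transformations listed by getTransformations().

```r
library(tm)

# Hypothetical sample text standing in for the real news data
docs <- c("The QUICK brown fox", "Jumped OVER   the lazy dog")
news_corpus <- VCorpus(VectorSource(docs))

# Base functions such as tolower return plain character vectors,
# so wrap them in content_transformer() to keep each element a TextDocument
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

# Transformations that ship with tm can be passed to tm_map() directly
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

dtm <- DocumentTermMatrix(corpus_clean)
dim(dtm) # documents x terms
```

Wrapping at the point of use tends to be less error-prone than converting everything back with PlainTextDocument at the end, since document metadata is kept along the way.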
big document term matrix - error when counting the number of characters of documents
You might be able to work around this if you keep your data in the dtm, which uses a sparse matrix representation that is much more memory efficient than a regular matrix.
The reason the apply function gives an error is that it converts the sparse matrix into a regular matrix (the matrix object in your Q; btw it's poor style to give data objects names that are also names of functions, especially base functions). This means that R has to allocate memory for every cell in the dtm, and since those cells are typically mostly zeros, that's a lot of memory with zeros in it. With a sparse matrix R doesn't need to store any of the zeros.
Here are the first few lines of the source for apply; see the last line for the conversion to a regular matrix:
apply
function (X, MARGIN, FUN, ...)
{
    FUN <- match.fun(FUN)
    dl <- length(dim(X))
    if (!dl)
        stop("dim(X) must have a positive length")
    if (is.object(X))
        X <- if (dl == 2L)
            as.matrix(X) # this is where your memory gets filled with zeros
So how to avoid that conversion? Here's one way to loop over the rows to get their sums while keeping the sparse matrix format:
sapply(seq(nrow(matrix)), function(i) sum(matrix[i,]))
[1] 2 1 2 2 1
Subsetting this way preserves the sparse format and does not convert the object to the more memory expensive common matrix representation. We can check the representation:
str(matrix[1,])
List of 6
$ i : int [1:2] 1 1
$ j : int [1:2] 1 3
$ v : num [1:2] 1 1
$ nrow : int 1
$ ncol : int 6
$ dimnames:List of 2
..$ Docs : chr "1"
..$ Terms: chr [1:6] "document" "file" "first" "second" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
So in the sapply function we are always working on a sparse matrix. And even if sum (or whatever function you use there) does some kind of conversion, it's only going to be converting one row of the dtm, rather than the entire thing.
The general principle when working with largish text data in R is to keep your dtm as a sparse matrix and then you should be able to keep within memory limits.
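If you only need row totals, the slam package (which provides the simple_triplet_matrix class that tm's dtm is built on) has row_sums(), which works on the sparse representation directly. A minimal sketch, assuming slam is installed; the toy matrix below is made up and stands in for a real dtm:

```r
library(slam)

# Hypothetical 4 x 3 sparse matrix: only the five non-zero cells are stored
m <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 3L, 3L), # row index of each non-zero cell
  j = c(1L, 3L, 2L, 1L, 2L), # column index of each non-zero cell
  v = c(1, 1, 1, 2, 1),      # the non-zero values themselves
  nrow = 4, ncol = 3
)

# Sum each row without ever building the dense 4 x 3 matrix
totals <- row_sums(m)
as.numeric(totals)
#> [1] 2 1 3 0
```

The same call should work on a DocumentTermMatrix, since that class inherits from simple_triplet_matrix; col_sums() is the column-wise counterpart.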
TermDocumentMatrix sometimes throwing error
So after a bit of playing around, the following line of code has completely fixed my issue:
t <- iconv(t,to="utf-8-mac")
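Note that "utf-8-mac" is, as far as I know, a conversion name that only macOS's iconv understands, so that line may not carry over to other platforms. A more portable sketch of the same idea, assuming the offending text is really Latin-1 (the sample strings are made up):

```r
# Hypothetical input: the first string holds a raw Latin-1 byte (0xE9, "é")
x <- c("caf\xe9", "plain ascii")

# Check which elements are already valid UTF-8
valid_before <- validUTF8(x)

# Re-encode, stating the source encoding explicitly (an assumption about the data)
fixed <- iconv(x, from = "latin1", to = "UTF-8")
valid_after <- validUTF8(fixed)

valid_before
#> [1] FALSE  TRUE
valid_after
#> [1] TRUE TRUE
```

If the source encoding is unknown, iconv(x, "UTF-8", "UTF-8", sub = "") drops the bytes that cannot be interpreted, at the cost of losing those characters.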
Bigram analysis and Term document Matrix
In my experience, the order of words in n-grams is critical. You would not want to treat the n-grams "Putin attacked" and "attacked Putin" as the same, since they have very different contextual meanings.
So no, you are not messing up the code. You may just want to do a little more research into n-gram models. A good place to start is Chapter 4 of Speech and Language Processing by Jurafsky and Martin.
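A small base-R sketch of why the two stay separate: bigrams are built from adjacent words in reading order, so reversing the words produces a different string. The bigrams helper below is made up for illustration and is not taken from tm or any other package.

```r
# Hypothetical helper: lower-case, split on whitespace, pair each word with the next
bigrams <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

a <- bigrams("Putin attacked")
b <- bigrams("attacked Putin")

a
#> [1] "putin attacked"
b
#> [1] "attacked putin"
identical(a, b)
#> [1] FALSE
```

In a term-document matrix these end up as two distinct columns, which is the behaviour you want.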