Twitter Data Analysis - Error in Term Document Matrix

How do I resolve data loss and an error with TermDocumentMatrix() and DocumentTermMatrix(), respectively?

I slightly changed your original data because each of your emoticons appeared only once in the text, which turns every tf-idf value into 1 (a count of 1 times log10(10/1) for ten tweets; see below, I just randomly added a few more). I'm using quanteda instead of tm because it is faster and has far fewer problems with encoding.

library(dplyr)
library(quanteda)

tweets_dfm <- dfm(tokens(TextSet$tweet))  # tokenize, then convert to document-feature matrix

tweets_dfm %>% 
  dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
  dfm_tfidf() %>%                 # weight with tfidf
  convert("data.frame")           # turn into data.frame to display more easily
#>    document <U+0001F914> <U+0001F4AA> <U+0001F603> <U+0001F953> <U+0001F37A>
#> 1     text1      1.39794            1            0            0            0
#> 2     text2      0.00000            0            1            0            0
#> 3     text3      0.00000            0            0            0            0
#> 4     text4      0.00000            0            0            0            0
#> 5     text5      0.00000            0            0            0            0
#> 6     text6      0.69897            0            0            0            0
#> 7     text7      0.00000            0            0            1            1
#> 8     text8      0.00000            0            0            0            0
#> 9     text9      0.00000            0            0            0            0
#> 10   text10      0.00000            0            0            0            0

The column names (i.e., emojis) are displayed correctly in my Viewer and it should be possible to export the resulting data.frame.
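
To see where the numbers in the table come from, here is a minimal sketch of the arithmetic. It assumes the dfm_tfidf() defaults (raw counts as term frequency, and log10 of number of documents over document frequency as inverse document frequency). The helper tfidf_by_hand is a made-up name for illustration, not part of quanteda:

```r
# Hypothetical helper (not a quanteda function): tf-idf for a single cell,
# assuming count-based tf and idf = log10(n_docs / doc_freq).
tfidf_by_hand <- function(count, doc_freq, n_docs) {
  count * log10(n_docs / doc_freq)
}

# An emoji used once, in 1 of 10 tweets
once <- tfidf_by_hand(count = 1, doc_freq = 1, n_docs = 10)

# An emoji used twice in one tweet and present in 2 of 10 tweets
twice <- tfidf_by_hand(count = 2, doc_freq = 2, n_docs = 10)

print(c(once = once, twice = twice))
```

The second value is 2 * log10(5), which matches the 1.39794 in the text1 row above; the first is why an emoji that occurs only once in the whole set always gets a weight of 1.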

data

TagSet <- data.frame(emoticon = c("🤔","💪","😃","🥓","🍺"),
                     stringsAsFactors = FALSE)

TextSet <- data.frame(tweet = c("Sharp, adversarial⚔️~pro choice💪~ban Pit Bulls☠️~BSL~aberant psychology~common sense🤔~the Piper will lead us to reason~sealskin woman🤔",
                                "Blocked by Owen, Adonis. Abbott & many #FBPE Love seaside, historic houses & gardens, family & pets. RTs & likes/ Follows may=interest not agreement 😃",
                                " #healthy #vegetarian #beatchronicillness fix infrastructure",
                                "LIBERTY-IDENTITARIAN. My bio, photo at Site Info. And kindly add my site to your Daily Favorites bar. Thank you, Eric",
                                "I #BackTheBlue for my son! Facts Over Feelings. Border Security saves lives! #ThankYouICE",
                                "🤔 I play Pedal Steel @CooderGraw & #CharlieShafter #GoStars #LiberalismIsAMentalDisorder",
                                "#Englishman  #Londoner  @Chelseafc  ♂️ 🥓 🍺",
                                "F*** the Anti-White Agenda #Christian #Traditional #TradThot #TradGirl #European #MAGA #AltRight #Folk #Family #WhitePride",
                                "❄️Do not dwell in the past, do not dream of the future, concentrate the mind on the present moment.️❄️",
                                "Ordinary girl in a messed up World | Christian | Anti-War | Anti-Zionist | Pro-Life | Pro | Hello intro on the Minds Link |"),
                      stringsAsFactors = FALSE)

DocumentTermMatrix error on Corpus argument

It seems this would have worked just fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (it looks like the older version may have done the conversion automatically). Instead they return character vectors, and DocumentTermMatrix isn't sure how to handle a corpus of characters.

So you could change to

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

Or you can run

corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

after all of your non-standard transformations (those not in getTransformations()) are done and just before you create the DocumentTermMatrix. That should make sure all of your data is in PlainTextDocument and should make DocumentTermMatrix happy.
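
As a rough end-to-end sketch of the first option: the two toy documents and the object names below are invented for illustration, and this assumes tm 0.6.0 or later is installed.

```r
library(tm)

# Made-up toy documents standing in for the real news data
docs <- c("First DOCUMENT here", "Second document HERE")
toy_corpus <- VCorpus(VectorSource(docs))

# Wrapping tolower in content_transformer() keeps each element a TextDocument
toy_clean <- tm_map(toy_corpus, content_transformer(tolower))

# Built-in transformations (those listed by getTransformations()) need no wrapper
toy_clean <- tm_map(toy_clean, stripWhitespace)

toy_dtm <- DocumentTermMatrix(toy_clean)
print(Terms(toy_dtm))
```

Calling tm_map(toy_corpus, tolower) without the wrapper is the pattern that leaves plain character vectors in the corpus and trips up DocumentTermMatrix.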

big document term matrix - error when counting the number of characters of documents

You might be able to work around this if you keep your data in the dtm, which uses a sparse matrix representation that is much more memory efficient than a regular matrix.

The apply function gives an error because it converts the sparse matrix into a regular matrix (the matrix object in your question; by the way, it's poor style to give data objects names that are also names of functions, especially base functions). R then has to allocate memory for every cell of the dtm, and since a dtm is typically mostly zeros, that is a lot of memory spent on zeros. With a sparse matrix, R doesn't need to store any of the zeros.

Here's the first few lines of the source for apply, see the last line here for the conversion to regular matrix:

apply
function (X, MARGIN, FUN, ...) 
{
    FUN <- match.fun(FUN)
    dl <- length(dim(X))
    if (!dl) 
        stop("dim(X) must have a positive length")
    if (is.object(X)) 
        X <- if (dl == 2L) 
            as.matrix(X) # this is where your memory gets filled with zeros

So how to avoid that conversion? Here's one way to loop over the rows to get their sums while keeping the sparse matrix format:

sapply(seq(nrow(matrix)), function(i) sum(matrix[i,]))
[1] 2 1 2 2 1

Subsetting this way preserves the sparse format and does not convert the object to the more memory expensive common matrix representation. We can check the representation:

str(matrix[1,])
List of 6
 $ i       : int [1:2] 1 1
 $ j       : int [1:2] 1 3
 $ v       : num [1:2] 1 1
 $ nrow    : int 1
 $ ncol    : int 6
 $ dimnames:List of 2
  ..$ Docs : chr "1"
  ..$ Terms: chr [1:6] "document" "file" "first" "second" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

So in the sapply function we are always working on a sparse matrix. And even if sum (or whatever function you use there) does some kind of conversion, it's only going to be converting one row of the dtm, rather than the entire thing.

The general principle when working with largish text data in R is to keep your dtm as a sparse matrix and then you should be able to keep within memory limits.
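
If row totals are all you need, the slam package (which provides the simple_triplet_matrix class seen in the str() output above) has row_sums(), which works on the sparse representation directly. A small sketch with made-up toy data; toy_stm is an invented name, and with a real dtm you would pass the dtm itself:

```r
library(slam)

# Invented 3 x 4 sparse matrix: only the five non-zero cells are stored
toy_stm <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 3L, 3L),  # row index of each non-zero cell
  j = c(1L, 3L, 2L, 1L, 4L),  # column index of each non-zero cell
  v = c(1, 1, 1, 1, 1),       # the non-zero values themselves
  nrow = 3, ncol = 4
)

# Row totals computed without expanding to a dense matrix
totals <- row_sums(toy_stm)
print(totals)

# Same result with the row-by-row loop from the answer
looped <- sapply(seq_len(nrow(toy_stm)), function(i) sum(toy_stm[i, ]))
print(looped)
```

Both approaches stay in the sparse format; row_sums() just avoids the explicit loop.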

TermDocumentMatrix sometimes throwing error

So after a bit of playing around, the following line of code has completely fixed my issue:

t <- iconv(t,to="utf-8-mac")
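
The "utf-8-mac" target may not be available outside macOS. On other platforms a similar cleanup can be sketched with plain "UTF-8". The example strings below are invented, and using sub = "" to drop bytes that cannot be converted is an assumption about what you would want, not part of the original fix:

```r
# Invented example: a Latin-1 encoded string (\xe9 is e-acute in Latin-1)
x <- "caf\xe9"
Encoding(x) <- "latin1"
clean <- iconv(x, from = "latin1", to = "UTF-8")

# Invented example: a stray byte that is not valid UTF-8
broken <- "abc\xff"
stripped <- iconv(broken, from = "UTF-8", to = "UTF-8", sub = "")

# Check that both results are now valid UTF-8
print(validUTF8(c(clean, stripped)))
```

Dropping bytes loses information, so it is worth checking a sample of the affected tweets before and after.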

Bigram analysis and Term document Matrix

In my experience, the order of words in n-grams is critical. You would not want to consider the n-grams "Putin attacked" and "attacked Putin" to be the same, as they have very different contextual meanings.

So no, you are not messing up the code. You may just want to do a little more research into n-gram models. A good start may be Chapter 4 of Speech and Language Processing by Jurafsky and Martin.
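
To make the point about order concrete, here is a small base R sketch. The bigrams helper is a made-up function for illustration (real tokenizers handle punctuation and edge cases that this ignores):

```r
# Hypothetical helper: split on whitespace and pair each word with the next one
bigrams <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

a <- bigrams("Putin attacked")
b <- bigrams("attacked Putin")

print(a)
print(b)
print(identical(a, b))  # the two orderings are distinct terms
```

Each ordering becomes its own column in a term document matrix, which is the behaviour you want from an n-gram model.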