Remove Empty Documents from DocumentTermMatrix in R topicmodels

Remove empty documents from DocumentTermMatrix in R topicmodels?


"Each row of the input matrix needs to contain at least one non-zero entry"

The error means the sparse matrix contains a row without any entries (words). One idea is to compute the sum of words in each row:

rowTotals <- apply(dtm, 1, sum)   # find the sum of words in each document
dtm.new   <- dtm[rowTotals > 0, ] # remove all docs without words
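Note that apply() coerces the sparse DTM to a dense matrix, which can exhaust memory on a large corpus. A sketch of the same computation on the sparse representation, using slam (which tm already depends on):

library(slam)

# same row totals, computed without densifying the sparse DTM
rowTotals <- slam::row_sums(dtm)
dtm.new   <- dtm[rowTotals > 0, ]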

Trying to remove words from a DocumentTermMatrix in order to use topicmodels

The answer to your question is over here: https://stackoverflow.com/a/13370840/1036500 (give it an upvote!)

In brief, more recent versions of the tm package do not include minDocFreq; instead they use bounds. For example, your

smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))

should now be

require(tm)
data("crude")

smaller <- DocumentTermMatrix(crude, control = list(bounds = list(global = c(5, Inf))))
dim(smaller) # terms that appear in fewer than 5 documents are discarded
[1] 20 67
smaller <- DocumentTermMatrix(crude, control = list(bounds = list(global = c(10, Inf))))
dim(smaller) # terms that appear in fewer than 10 documents are discarded
[1] 20 17
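The bounds control takes a lower and an upper document count, so the same call can also drop overly common terms. For example, with a hypothetical upper cutoff of 15 documents:

# keep only terms appearing in at least 5 and at most 15 documents
smaller <- DocumentTermMatrix(crude, control = list(bounds = list(global = c(5, 15))))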

Removing an empty character item from a corpus of documents in R?

I deal with text a lot but not tm, so here are two approaches to get rid of the "" elements you have. The extra "" characters are likely caused by a double space between sentences. You can treat this condition before or after you turn the text into a bag of words: replace every double space with a single space before the strsplit, or drop the empty strings afterward (you have to unlist after strsplit).

x <- "I like to ride my bicycle.  Do you like to ride too?"

# Option 1: treat before splitting (collapse repeated spaces)
a <- gsub(" +", " ", x)
strsplit(a, " ")

# Option 2: treat after splitting (drop the empty strings)
y <- unlist(strsplit(x, " "))
y[!y %in% ""]

You might also try:

newtext <- lapply(newtext, function(x) gsub(" +", " ", x))

Again, I don't use tm, so this may not be of help, but this post hadn't seen any action, so I figured I'd share some possibilities.
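If you do want to stay inside tm, it ships a stripWhitespace transformation that does the same collapsing at the corpus level. A minimal sketch, assuming a plain character vector x as above:

library(tm)

docs <- VCorpus(VectorSource(x))      # wrap the text in a corpus
docs <- tm_map(docs, stripWhitespace) # collapse runs of whitespace to single spaces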

Removing Stop Phrases from DocumentTermMatrix

I came across this solution from the gofastr package in R:

dtm2 <- remove_stopwords(dtm, stopwords = stopwords)

However, I still saw stop phrases in the results. After reviewing the documentation, I found that remove_stopwords assumes a sorted list -- you can prep your stopwords/phrases with the prep_stopwords() function from the same package.

stopwords <- prep_stopwords(stopwords)
dtm2 <- remove_stopwords(dtm, stopwords = stopwords)

To do this and also stem, we can perform the stemming in the tm_map part of the code and remove the stopwords as follows:

stopwords <- prep_stopwords(stemDocument(stopwords))
dtm2 <- remove_stopwords(dtm, stopwords = stopwords)

This stems the stopwords so that they match the already-stemmed terms in the dtm.
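Putting it together, a sketch of the full flow, assuming you start from a tm corpus named corpus and a character vector of stop phrases named stop_phrases (both hypothetical names):

library(tm)
library(gofastr)

corpus <- tm_map(corpus, stemDocument)               # stem the documents
dtm    <- DocumentTermMatrix(corpus)
stops  <- prep_stopwords(stemDocument(stop_phrases)) # stem, then sort, the stop phrases
dtm2   <- remove_stopwords(dtm, stopwords = stops)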

Remove terms in a DocumentTermMatrix that appear in ALL documents

No minimal example but this should work:

library(slam)

# terms are columns in a DocumentTermMatrix, so subset the columns
# whose document frequency is less than the number of documents
dtm[, slam::col_sums(dtm > 0) != nrow(dtm)]
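To sanity-check it on sample data, a sketch using the crude corpus bundled with tm:

library(tm)
library(slam)
data("crude")
dtm <- DocumentTermMatrix(crude)

doc_freq <- slam::col_sums(dtm > 0)      # number of documents each term appears in
dtm_trim <- dtm[, doc_freq != nrow(dtm)] # drop terms that occur in every document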

Remove rows with character(0) from a data.frame before proceeding to dtm

This is a situation where embracing tidy data principles can really offer a nice solution. To start with, "annotate" the dataframe you presented with a new column, doc_id, that keeps track of which document each word belongs to, and then use unnest_tokens() to transform it into a tidy data structure.

library(tidyverse)
library(tidytext)
library(stm)

df <- tibble(
  reviews = c("buenisimoooooo", "excelente", "excelent",
              "awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
              "phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
              "//:", "//:", "phone work card non sim card description",
              "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend",
              "1111111", "great bang buck",
              "actually happy little sister really first good great picture late",
              "good phone good reception home fringe area screen lovely just right size good buy",
              "@#haha", "phone verizon contract phone buyer beware", "这东西太棒了",
              "excellent product total satisfaction",
              "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
              "good phone price fine", "phone star battery little soon yes"),
  rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1),
  source = c("amazon", "bestbuy", "amazon", "newegg", "amazon",
             "amazon", "zappos", "newegg", "amazon", "amazon",
             "amazon", "amazon", "amazon", "zappos", "amazon",
             "amazon", "newegg", "amazon", "amazon", "amazon"))


tidy_df <- df %>%
  mutate(doc_id = row_number()) %>%
  unnest_tokens(word, reviews)

tidy_df
#> # A tibble: 154 x 4
#>    rating source  doc_id word
#>     <dbl> <chr>    <int> <chr>
#>  1      4 amazon       1 buenisimoooooo
#>  2      4 bestbuy      2 excelente
#>  3      4 amazon       3 excelent
#>  4      4 newegg       4 awesome
#>  5      4 newegg       4 phone
#>  6      4 newegg       4 awesome
#>  7      4 newegg       4 price
#>  8      4 newegg       4 almost
#>  9      4 newegg       4 month
#> 10      4 newegg       4 issue
#> # … with 144 more rows

Notice that all the information you had before is still there; it is just arranged in a different structure. You can fine-tune the tokenization process to fit your particular analysis needs, perhaps handling non-English text however you need, or keeping/dropping punctuation, and so on. This is also the step where empty documents get thrown out, if that is appropriate for your analysis.
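For example, unnest_tokens() forwards extra arguments to the underlying tokenizer (tokenizers::tokenize_words for the default word tokens), so a sketch of keeping case and punctuation would look like:

tidy_df_raw <- df %>%
  mutate(doc_id = row_number()) %>%
  unnest_tokens(word, reviews, to_lower = FALSE, strip_punct = FALSE)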

Next, transform this tidy data structure into a sparse matrix, for use in topic modeling. The columns correspond to the words and the rows correspond to the documents.

sparse_reviews <- tidy_df %>%
  count(doc_id, word) %>%
  cast_sparse(doc_id, word, n)

colnames(sparse_reviews) %>% head()
#> [1] "buenisimoooooo" "excelente"      "excelent"       "almost"
#> [5] "awesome"        "blu"
rownames(sparse_reviews) %>% head()
#> [1] "1" "2" "3" "4" "5" "8"

Next, from the tidy dataset you already have, make a dataframe of covariate (i.e. document-level metadata) information to use in topic modeling.

covariates <- tidy_df %>%
  distinct(doc_id, rating, source)

covariates
#> # A tibble: 18 x 3
#>    doc_id rating source
#>     <int>  <dbl> <chr>
#>  1      1      4 amazon
#>  2      2      4 bestbuy
#>  3      3      4 amazon
#>  4      4      4 newegg
#>  5      5      4 amazon
#>  6      8      4 newegg
#>  7      9      1 amazon
#>  8     10      4 amazon
#>  9     11      3 amazon
#> 10     12      1 amazon
#> 11     13      4 amazon
#> 12     14      3 zappos
#> 13     15      1 amazon
#> 14     16      2 amazon
#> 15     17      4 newegg
#> 16     18      4 amazon
#> 17     19      1 amazon
#> 18     20      1 amazon

Now you can put this together into stm(). For example, if you want to train a topic model with document-level covariates, looking at whether topics change a) with source and b) smoothly with rating, you would do something like this:

topic_model <- stm(sparse_reviews, K = 0, init.type = "Spectral",
                   prevalence = ~source + s(rating),
                   data = covariates,
                   verbose = FALSE)

Created on 2019-08-03 by the reprex package (v0.3.0)
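From there, a sketch of standard stm follow-ups (not part of the original reprex) is to inspect the topics and estimate how prevalence varies with the covariates:

summary(topic_model) # top words for each topic

# estimate how topic prevalence varies with source and rating
effects <- estimateEffect(~ source + s(rating), topic_model, metadata = covariates)
summary(effects)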


