Remove empty documents from DocumentTermMatrix in R topicmodels?
"Each row of the input matrix needs to contain at least one non-zero entry"
The error means that the sparse matrix contains a row without any entries (words). One fix is to compute the sum of words in each row and keep only the rows with a nonzero total:
rowTotals <- apply(dtm, 1, sum)   # find the sum of words in each document
dtm.new   <- dtm[rowTotals > 0, ] # remove all docs without words
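When you drop rows this way, any document-level metadata has to be filtered with the same logical vector, or it falls out of sync with the matrix. Here is a minimal base-R sketch with a toy count matrix (hypothetical data; the same subsetting works on a real DocumentTermMatrix):

```r
# Toy document-term count matrix (hypothetical data)
m <- matrix(c(2, 0, 1,
              0, 0, 0,   # an "empty" document
              1, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), c("apple", "pear", "plum")))
meta <- data.frame(doc_id = rownames(m), rating = c(5, 3, 4))

rowTotals <- apply(m, 1, sum)         # word count per document
keep      <- rowTotals > 0
m.new     <- m[keep, , drop = FALSE]  # drop empty documents
meta.new  <- meta[keep, ]             # keep the metadata in sync
rownames(m.new)  # "doc1" "doc3"
```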
Trying to remove words from a DocumentTermMatrix in order to use topicmodels
The answer to your question is over here: https://stackoverflow.com/a/13370840/1036500 (give it an upvote!)
In brief, more recent versions of the tm package no longer include the minDocFreq control option; they use bounds instead. For example, your
smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))
should now be
require(tm)
data("crude")
smaller <- DocumentTermMatrix(crude, control=list(bounds = list(global = c(5,Inf))))
dim(smaller) # terms that appear in <5 documents have been discarded
[1] 20 67
smaller <- DocumentTermMatrix(crude, control=list(bounds = list(global = c(10,Inf))))
dim(smaller) # terms that appear in <10 documents have been discarded
[1] 20 17
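Under the hood, global bounds filter terms by document frequency: a term is kept only if the number of documents containing it falls inside [lower, upper]. A base-R sketch of the same idea on a toy count matrix (hypothetical data):

```r
# Toy document-term count matrix (hypothetical data)
m <- matrix(c(1, 1, 0, 2,
              1, 0, 0, 1,
              1, 2, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("phone", "price", "rare", "good")))
docfreq <- colSums(m > 0)   # number of documents each term appears in
lower <- 2; upper <- Inf    # mimics bounds = list(global = c(2, Inf))
smaller <- m[, docfreq >= lower & docfreq <= upper, drop = FALSE]
colnames(smaller)  # "phone" "price" "good"
```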
Removing an empty character item from a corpus of documents in R?
I work with text a lot, but not with tm, so here are two approaches to get rid of the empty "" strings you have. The extra "" entries are likely caused by double spaces between sentences. You can treat this condition either before or after you turn the text into a bag of words: replace every run of two or more spaces with a single space before the strsplit, or drop the empty strings afterward (you have to unlist after strsplit).
x <- "I like to ride my bicycle. Do you like to ride too?"
#TREAT BEFORE(OPTION):
a <- gsub(" +", " ", x)
strsplit(a, " ")
#TREAT AFTER OPTION:
y <- unlist(strsplit(x, " "))
y[!y%in%""]
You might also try:
newtext <- lapply(newtext, function(x) gsub(" +", " ", x))
Again I don't use tm so this may not be of help but this post hadn't seen any action so I figured I'd share possibilities.
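One more note: the `" +"` pattern only collapses runs of spaces. If the text can also contain tabs or newlines, a whitespace character class handles those too (base R, no tm needed):

```r
# Collapse all whitespace (spaces, tabs, newlines) before splitting
x <- "I like  to\tride my\nbicycle."
y <- unlist(strsplit(gsub("[[:space:]]+", " ", x), " "))
y  # no empty "" strings left
```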
Removing Stop Phrases from DocumentTermMatrix
I came across this solution from the "gofastr" package in R:
dtm2 <- remove_stopwords(dtm, stopwords = stopwords)
However, I still saw stop phrases in the results. After reviewing the documentation, remove_stopwords assumes it has a sorted list -- you can prep your stopwords/phrases using the prep_stopwords() function from the same package.
stopwords <- prep_stopwords(stopwords)
dtm2 <- remove_stopwords(dtm, stopwords = stopwords)
To combine this with stemming, we can perform the stemming in the tm_map part of the code and stem the stop phrases as well before removing them:
stopwords <- prep_stopwords(stemDocument(stopwords))
dtm2 <- remove_stopwords(dtm, stopwords = stopwords)
This stems the stopwords so that they match the already-stemmed terms in the dtm.
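A toy illustration of why that matters, using a naive suffix stripper as a stand-in for stemDocument (the stemmer here is hypothetical, for illustration only):

```r
# Naive stand-in for a real stemmer -- illustration only
stem <- function(w) sub("(ing|s)$", "", w)

dtm_terms <- stem(c("thanks", "thanking"))  # terms as they look in a stemmed dtm
stoplist  <- c("thanks")                    # unstemmed stop list

dtm_terms %in% stoplist        # FALSE FALSE: nothing would be removed
dtm_terms %in% stem(stoplist)  # TRUE TRUE: matches once the stop list is stemmed too
```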
Remove terms in a DocumentTermMatrix that appear in ALL documents
No minimal example but this should work:
library(slam)
dtm[, slam::col_sums(dtm > 0) < nrow(dtm)] # keep only terms absent from at least one document
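The same idea on a plain base-R matrix (toy data): a term that occurs in every document has a document frequency equal to the number of rows, and dropping those columns removes it:

```r
# Toy document-term count matrix (hypothetical data)
m <- matrix(c(1, 0, 0,
              3, 1, 1,
              2, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("phone", "good", "rare")))
in_all <- colSums(m > 0) == nrow(m)  # TRUE for terms present in every document
m.new  <- m[, !in_all, drop = FALSE]
colnames(m.new)  # "good" "rare"
```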
Remove rows with character(0) from a data.frame before proceeding to dtm
This is a situation where embracing tidy data principles can really offer a nice solution. To start with, "annotate" the dataframe you presented with a new column, doc_id, that keeps track of which document each word belongs to, and then use unnest_tokens() to transform this into a tidy data structure.
library(tidyverse)
library(tidytext)
library(stm)
df <- tibble(reviews = c("buenisimoooooo", "excelente", "excelent",
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
"//:", "//:", "phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend", "1111111", "great bang buck", "actually happy little sister really first good great picture late",
"good phone good reception home fringe area screen lovely just right size good buy", "@#haha", "phone verizon contract phone buyer beware", "这东西太棒了",
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
"good phone price fine", "phone star battery little soon yes"),
rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1),
source = c("amazon", "bestbuy", "amazon", "newegg", "amazon",
"amazon", "zappos", "newegg", "amazon", "amazon",
"amazon", "amazon", "amazon", "zappos", "amazon",
"amazon", "newegg", "amazon", "amazon", "amazon"))
tidy_df <- df %>%
mutate(doc_id = row_number()) %>%
unnest_tokens(word, reviews)
tidy_df
#> # A tibble: 154 x 4
#> rating source doc_id word
#> <dbl> <chr> <int> <chr>
#> 1 4 amazon 1 buenisimoooooo
#> 2 4 bestbuy 2 excelente
#> 3 4 amazon 3 excelent
#> 4 4 newegg 4 awesome
#> 5 4 newegg 4 phone
#> 6 4 newegg 4 awesome
#> 7 4 newegg 4 price
#> 8 4 newegg 4 almost
#> 9 4 newegg 4 month
#> 10 4 newegg 4 issue
#> # … with 144 more rows
Notice that all the information you had before is still there; it is just arranged in a different structure. You can fine-tune the tokenization process to fit your particular analysis needs, perhaps handling non-English text however you need, or keeping/not keeping punctuation, etc. This is also the step where empty documents get thrown out, if that is appropriate for your case.
Next, transform this tidy data structure into a sparse matrix, for use in topic modeling. The columns correspond to the words and the rows correspond to the documents.
sparse_reviews <- tidy_df %>%
count(doc_id, word) %>%
cast_sparse(doc_id, word, n)
colnames(sparse_reviews) %>% head()
#> [1] "buenisimoooooo" "excelente" "excelent" "almost"
#> [5] "awesome" "blu"
rownames(sparse_reviews) %>% head()
#> [1] "1" "2" "3" "4" "5" "8"
Next, make a dataframe of covariate (i.e., metadata) information to use in topic modeling, from the tidy dataset you already have.
covariates <- tidy_df %>%
distinct(doc_id, rating, source)
covariates
#> # A tibble: 18 x 3
#> doc_id rating source
#> <int> <dbl> <chr>
#> 1 1 4 amazon
#> 2 2 4 bestbuy
#> 3 3 4 amazon
#> 4 4 4 newegg
#> 5 5 4 amazon
#> 6 8 4 newegg
#> 7 9 1 amazon
#> 8 10 4 amazon
#> 9 11 3 amazon
#> 10 12 1 amazon
#> 11 13 4 amazon
#> 12 14 3 zappos
#> 13 15 1 amazon
#> 14 16 2 amazon
#> 15 17 4 newegg
#> 16 18 4 amazon
#> 17 19 1 amazon
#> 18 20 1 amazon
Now you can put this together in stm(). For example, if you want to train a topic model with document-level covariates, looking at whether topics change a) with source and b) smoothly with rating, you would do something like this:
topic_model <- stm(sparse_reviews, K = 0, init.type = "Spectral",
prevalence = ~source + s(rating),
data = covariates,
verbose = FALSE)
Created on 2019-08-03 by the reprex package (v0.3.0)