How Does removeSparseTerms in R Work?

How does the removeSparseTerms in R work?

In the sense of the sparse argument to removeSparseTerms(), sparsity refers to the threshold of relative document frequency for a term, above which the term will be removed. Relative document frequency here means a proportion of documents. As the help page for the command states (although not very clearly), the filter becomes less aggressive as sparse approaches 1.0: a value near 1.0 removes almost nothing, while a value near 0 removes almost everything. (Note that sparse cannot take the values 0 or 1.0, only values in between.)

For example, if you set sparse = 0.99 as the argument to removeSparseTerms(), then this will remove only terms that are more sparse than 0.99.
The exact interpretation of sparse = 0.99 is that, for each term $j$, you retain the term when
$df_j > N * (1 - 0.99)$, where $df_j$ is the number of documents containing term $j$ and $N$ is the total number of documents -- at this threshold, probably all terms will be retained (see the example below).
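This retention rule can be sketched in plain base R; sparse_filter below is a hypothetical helper for illustration, not part of tm:

```r
# Keep the terms whose document frequency df_j exceeds N * (1 - sparse).
# m is an ordinary matrix with documents in rows and terms in columns.
sparse_filter <- function(m, sparse) {
  N  <- nrow(m)
  df <- colSums(m > 0)                    # document frequency of each term
  m[, df > N * (1 - sparse), drop = FALSE]
}

m <- matrix(c(1, 1, 1,                    # term A: present in all 3 documents
              1, 0, 0),                   # term B: present in 1 of 3 documents
            ncol = 2, dimnames = list(NULL, c("A", "B")))

# sparse = 0.5 keeps terms with df > 3 * (1 - 0.5) = 1.5, i.e. df >= 2
colnames(sparse_filter(m, 0.5))           # "A"
```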

Near the other extreme, if sparse = .01, then only terms that appear in (nearly) every document will be retained. (Of course this depends on the number of terms and the number of documents, and in natural language, common words like "the" are likely to occur in every document and hence never be "sparse".)

An example at the sparsity threshold of 0.99, with a term that occurs (first example) in less than 0.01 of the documents, and (second example) in just over 0.01 of the documents:

> # second term occurs in just 1 of 101 documents
> myTdm1 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,1), rep(0, 100)), ncol=2)),
+ weighting = weightTf)
> removeSparseTerms(myTdm1, .99)
<<DocumentTermMatrix (documents: 101, terms: 1)>>
Non-/sparse entries: 101/0
Sparsity : 0%
Maximal term length: 2
Weighting : term frequency (tf)
>
> # second term occurs in 2 of 101 documents
> myTdm2 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,2), rep(0, 99)), ncol=2)),
+ weighting = weightTf)
> removeSparseTerms(myTdm2, .99)
<<DocumentTermMatrix (documents: 101, terms: 2)>>
Non-/sparse entries: 103/99
Sparsity : 49%
Maximal term length: 2
Weighting : term frequency (tf)
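The arithmetic behind these two results is straightforward; a quick sanity check in base R:

```r
N <- 101
sparse <- 0.99
threshold <- N * (1 - sparse)   # approximately 1.01 documents

# A term in 1 document fails df > threshold and is removed;
# a term in 2 documents passes and is retained.
c(one_doc = 1 > threshold, two_docs = 2 > threshold)   # FALSE, TRUE
```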

Here are a few additional examples with actual text and terms:

> myText <- c("the quick brown furry fox jumped over a second furry brown fox",
"the sparse brown furry matrix",
"the quick matrix")

> require(tm)
> myVCorpus <- VCorpus(VectorSource(myText))
> myTdm <- DocumentTermMatrix(myVCorpus)
> as.matrix(myTdm)
Terms
Docs brown fox furry jumped matrix over quick second sparse the
1 2 2 2 1 0 1 1 1 0 1
2 1 0 1 0 1 0 0 0 1 1
3 0 0 0 0 1 0 1 0 0 1
> as.matrix(removeSparseTerms(myTdm, .01))
Terms
Docs the
1 1
2 1
3 1
> as.matrix(removeSparseTerms(myTdm, .99))
Terms
Docs brown fox furry jumped matrix over quick second sparse the
1 2 2 2 1 0 1 1 1 0 1
2 1 0 1 0 1 0 0 0 1 1
3 0 0 0 0 1 0 1 0 0 1
> as.matrix(removeSparseTerms(myTdm, .5))
Terms
Docs brown furry matrix quick the
1 2 2 0 1 1
2 1 1 1 0 1
3 0 0 1 1 1

In the last example, with sparse = 0.5, only terms occurring in at least two-thirds of the documents (2 of 3) were retained.
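You can verify the sparse = 0.5 result by computing the document frequencies directly, using the same three documents:

```r
library(tm)

myText <- c("the quick brown furry fox jumped over a second furry brown fox",
            "the sparse brown furry matrix",
            "the quick matrix")
myTdm <- DocumentTermMatrix(VCorpus(VectorSource(myText)))

df <- colSums(as.matrix(myTdm) > 0)   # in how many documents each term appears
# sparse = 0.5 keeps terms with df > 3 * (1 - 0.5) = 1.5, i.e. in 2+ documents
names(df[df > 1.5])                   # "brown" "furry" "matrix" "quick" "the"
```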

An alternative approach for trimming terms from document-term matrices based on document frequency is the text analysis package quanteda. The equivalent functionality there refers not to sparsity but directly to the document frequency of terms (as used in tf-idf).

> require(quanteda)
> myDfm <- dfm(myText, verbose = FALSE)
> docfreq(myDfm)
a brown fox furry jumped matrix over quick second sparse the
1 2 1 2 1 2 1 2 1 1 3
> dfm_trim(myDfm, minDoc = 2)
Features occurring in fewer than 2 documents: 6
Document-feature matrix of: 3 documents, 5 features.
3 x 5 sparse Matrix of class "dfmSparse"
features
docs brown furry the matrix quick
text1 2 2 1 0 1
text2 1 1 1 1 0
text3 0 0 1 1 1

This usage seems much more straightforward to me.
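Note that the quanteda API has changed since this was written: in recent releases (3.x), dfm() takes a tokens object and the minDoc argument is named min_docfreq. A sketch of the same trim under the current API:

```r
library(quanteda)

# tokenize first, then build the document-feature matrix
toks  <- tokens(c("the quick brown furry fox jumped over a second furry brown fox",
                  "the sparse brown furry matrix",
                  "the quick matrix"))
myDfm <- dfm(toks)

# keep features occurring in at least 2 documents (minDoc in older releases)
dfm_trim(myDfm, min_docfreq = 2)
```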

removeSparseTerms with training and testing set

library(tm)
library(Rstem)
data(crude)
set.seed(1)

spl <- runif(length(crude)) < 0.7
train <- crude[spl]
test <- crude[!spl]

controls <- list(
  tolower = TRUE,
  removePunctuation = TRUE,
  stopwords = stopwords("english"),
  stemming = function(word) wordStem(word, language = "english")
)

train_dtm <- DocumentTermMatrix(train, controls)

train_dtm <- removeSparseTerms(train_dtm, 0.8)

test_dtm <- DocumentTermMatrix(
  test,
  c(controls, dictionary = list(dimnames(train_dtm)$Terms))
)

## train_dtm
## A document-term matrix (13 documents, 91 terms)
##
## Non-/sparse entries: 405/778
## Sparsity : 66%
## Maximal term length: 9
## Weighting : term frequency (tf)

## test_dtm
## A document-term matrix (7 documents, 91 terms)
##
## Non-/sparse entries: 149/488
## Sparsity : 77%
## Maximal term length: 9
## Weighting : term frequency (tf)

## all(dimnames(train_dtm)$Terms == dimnames(test_dtm)$Terms)
## [1] TRUE

I had issues using the default stemmer. Also there is a bounds option for controls, but I couldn't get the same results as removeSparseTerms when using it. I tried bounds = list(local = c(0.2 * length(train), Inf)) with floor and ceiling with no luck.
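On the bounds point: in DocumentTermMatrix(), the document-frequency cutoff lives under bounds$global, while bounds$local constrains within-document term frequency, which may be why local gave different results. A sketch of the corresponding cutoff; note that floating-point rounding can still flip terms sitting exactly on the boundary, so the match with removeSparseTerms is approximate:

```r
library(tm)
data(crude)

N <- length(crude)   # 20 documents
s <- 0.8
# removeSparseTerms keeps terms with df > N * (1 - s); bounds$global takes an
# inclusive lower bound, so the corresponding minimum is floor(N * (1 - s)) + 1
lower <- floor(N * (1 - s)) + 1

dtm <- DocumentTermMatrix(crude,
                          control = list(bounds = list(global = c(lower, Inf))))
dtm
```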

Same value but different result? About removeSparseTerms (R)

The problem seems to be related to the popular question "7.31 Why doesn’t R think these numbers are equal?":

The only numbers that can be represented exactly in R’s numeric type
are integers and fractions whose denominator is a power of 2. All
other numbers are internally rounded to (typically) 53 binary digits
accuracy. As a result, two floating point numbers will not reliably be
equal unless they have been computed by the same algorithm, and not
always even then.

Given

(x <- seq(0.45, 0.6, 0.05))
# [1] 0.45 0.50 0.55 0.60
(y <- seq(0.45, 0.8, 0.05))
# [1] 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80

then

x==y
# [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
x[4]==y[4]
# [1] FALSE
x[4]-y[4]
# [1] -1.110223e-16
x[3]-y[3]
# [1] 0

Since

MASS::as.fractions(x)
# [1] 9/20 1/2 11/20 3/5

I guess only the two 0.5 values (the fraction 1/2, whose denominator is a power of 2) are reliably equal here. Thus, your function may yield different results for apparently equal sparse values.
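When such values must be compared, the usual base-R remedy is all.equal(), which compares within a tolerance rather than bit-for-bit:

```r
x <- seq(0.45, 0.6, 0.05)
y <- seq(0.45, 0.8, 0.05)

x[4] == y[4]                   # FALSE: exact binary comparison fails
isTRUE(all.equal(x[4], y[4]))  # TRUE: equal within the default tolerance
```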

text mining sparse/Non-sparse meaning

By this code you have created a document term matrix of the corpus

frequencies = DocumentTermMatrix(corpus)

A Document Term Matrix (DTM) lists all occurrences of words in the corpus, by document. In the DTM, the documents are represented by rows and the terms (or words) by columns. The matrix entry corresponding to a given row and column is the number of times that word occurs in that document: if a word does not occur in the document the entry is 0, and if it occurs twice the entry is 2.

As an example, consider a corpus containing two documents.

Doc1: bananas are good

Doc2: bananas are yellow

DTM for the above corpus would look like

        banana   are   yellow   good
Doc1         1     1        0      1

Doc2         1     1        1      0
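This toy DTM can be reproduced with tm; a sketch, noting that DocumentTermMatrix sorts terms alphabetically and, without stemming, the term stays "bananas" rather than "banana":

```r
library(tm)

docs <- VCorpus(VectorSource(c("bananas are good",
                               "bananas are yellow")))
dtm  <- DocumentTermMatrix(docs)
as.matrix(dtm)   # columns: are, bananas, good, yellow
```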

The output

<<DocumentTermMatrix (documents: 299, terms: 1297)>>
Non-/sparse entries: 6242/381561
Sparsity : 98%
Maximal term length: 19
Weighting : term frequency (tf)

The output signifies that the DTM has 299 documents (rows) and 1297 terms (columns) that appear at least once.

sparse = removeSparseTerms(frequencies, 0.97)

Now you are removing the terms that don't appear often in your data. With sparse = 0.97, any term that does not appear in at least 3% of the documents is removed. In terms of the DTM created above, we are removing the columns whose entries are nonzero in the fewest documents.

Now if you look at the output

> sparse
<<DocumentTermMatrix (documents: 299, terms: 166)>>
Non-/sparse entries: 3773/45861
Sparsity : 92%
Maximal term length: 10
Weighting : term frequency (tf)

The number of documents is still the same, i.e. 299, but the number of terms that appear at least once has dropped to 166.
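The 3% figure follows directly from the threshold formula; a quick check of the arithmetic:

```r
N <- 299
sparse <- 0.97
N * (1 - sparse)   # 8.97: a surviving term must appear in at least 9 documents
9 / N              # about 0.03, i.e. roughly 3% of the documents
```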

r remove sparse terms from more than one tdm

This seems to work:

for (i in 1:2) {
  assign(paste0("TDM_", i),
         removeSparseTerms(get(paste0("tdm_", i)), 0.98))
}
TDM_1
# <<TermDocumentMatrix (terms: 707, documents: 16)>>
# Non-/sparse entries: 1245/10067
# Sparsity : 89%
# Maximal term length: 13
# Weighting : term frequency (tf)
TDM_2
# <<TermDocumentMatrix (terms: 308, documents: 4)>>
# Non-/sparse entries: 377/855
# Sparsity : 69%
# Maximal term length: 16
# Weighting : term frequency (tf)
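A more idiomatic alternative to assign()/get() is to keep the matrices in a list and use lapply(). A sketch using two stand-in TDMs built from slices of the crude corpus, since the question's original tdm_1/tdm_2 objects are not shown:

```r
library(tm)
data(crude)

# two stand-in term-document matrices (in place of the question's tdm_1, tdm_2)
tdms <- list(TermDocumentMatrix(crude[1:10]),
             TermDocumentMatrix(crude[11:20]))

# trim each matrix in turn, with no assign()/get() bookkeeping
TDMs <- lapply(tdms, removeSparseTerms, 0.98)
TDMs[[1]]
```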

Removing Sparsity in matrix

The sparsity parameter helps you remove the terms which have at least a certain percentage of sparse (zero) entries. Very roughly speaking: if you want to keep the terms that appear in at least 3% of the documents, set the parameter to 0.97; if you want the terms that occur in at least 70% of the documents, set it to 0.3. The value must be greater than 0 and smaller than 1.

In your case, if you want each term to appear in at least 10% of the documents, you need to set sparse to 0.9.

See the code example:

library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
crude <- tm_map(crude, stemDocument)
dtm <- DocumentTermMatrix(crude)
sdtm <- removeSparseTerms(dtm, 0.3)
sdtm2 <- removeSparseTerms(dtm, 0.7)

sdtm$ncol
inspect(sdtm) # 4 words returned
sdtm2$ncol
inspect(sdtm2) # 24 words returned
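The cutoffs can be checked against document frequencies computed by hand. A sketch on the unpreprocessed crude corpus, so the term counts differ slightly from the cleaned version above:

```r
library(tm)
data(crude)

dtm <- DocumentTermMatrix(crude)
N   <- nrow(dtm)                     # 20 documents
df  <- colSums(as.matrix(dtm) > 0)   # document frequency of each term

# sparse = 0.9 keeps terms with df > N * (1 - 0.9), i.e. terms in at least
# 10% of the documents (exact boundary behaviour is subject to floating-point
# rounding, as discussed in the section on equal values above)
sum(df > N * (1 - 0.9))
```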

