Text-mining with the tm-package - word stemming
I'm not 100% sure what you're after, and I don't fully understand how tm_map works, but if I read the question correctly, you want to supply a list of words that should not be stemmed. The following does that. I'm using the qdap package, mostly because I'm lazy and it has a function, mgsub, that I like.
Note that combining mgsub with tm_map kept throwing an error, so I just used lapply instead.
texts <- c("i am member of the XYZ association",
"apply for our open associate position",
"xyz memorial lecture takes place on wednesday",
"vote for the most popular lecturer")
library(tm)
# Step 1: Create corpus
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))
library(qdap)
# Step 2: list of words to retain and their identifier keys
retain <- c("lecturer", "lecture")
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_")
# Step 3: sub the words you want to retain with identifier keys
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)
# Step 4: Stem it
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)
inspect(corpus) #inspect the pieces for the folks playing along at home
inspect(corpus.copy)
inspect(corpus.temp)
# Step 6: complete the stem
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)
Basically it works by:
- subbing out a unique identifier key for the supplied "NO STEM" words (the first mgsub)
- then stemming (using stemDocument)
- next reversing it, subbing the identifier keys with the "NO STEM" words (the second mgsub)
- last, completing the stems (stemCompletion)
Here's the output:
## > inspect(corpus.final)
## A corpus with 4 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`1`
## i am member of the XYZ associate
##
## $`2`
## for our open associate position
##
## $`3`
## xyz memorial lecture takes place on wednesday
##
## $`4`
## vote for the most popular lecturer
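A simpler variant of the same idea can be sketched without tm or qdap: instead of swapping the protected words for identifier keys and back, split each text into tokens and stem only the tokens that are not on the retain list. This assumes the SnowballC package is installed (wordStem is its stemming function). stem_except is a hypothetical helper name, and the whitespace tokenisation is a simplification that will not match words with attached punctuation.

```r
library(SnowballC)  # assumption: SnowballC is installed; it provides wordStem()

# Hypothetical helper: stem every whitespace-separated token in `text`
# except the ones listed in `retain` (matched case-insensitively).
stem_except <- function(text, retain, language = "english") {
  tokens <- strsplit(text, "\\s+")[[1]]
  protected <- tolower(tokens) %in% tolower(retain)
  tokens[!protected] <- wordStem(tokens[!protected], language = language)
  paste(tokens, collapse = " ")
}

texts <- c("i am member of the XYZ association",
           "apply for our open associate position",
           "xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")

retain <- c("lecturer", "lecture")

# Apply the helper to each text; the result is a character vector of the same length.
stemmed <- vapply(texts, stem_except, character(1),
                  retain = retain, USE.NAMES = FALSE)
print(stemmed)
```

Unlike the pipeline above, this leaves the unprotected words as bare stems; a completion step would still be needed to turn them back into full words.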
Stemming words using tm package in R does not work properly?
It seems that the stemming transform can only be applied to PlainTextDocument types. See ?stemDocument.
sp.corpus = Corpus(VectorSource(c("la niñera. los niños. las niñas. la niña. el niño.")))
docs <- tm_map(sp.corpus, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("spanish"))
docs <- tm_map(docs, PlainTextDocument) # needs to come before stemming
docs <- tm_map(docs, stemDocument, "spanish")
print(docs[[1]]$content)
# " niñer niñ niñ niñ niñ"
versus
# WRONG
sp.corpus = Corpus(VectorSource(c("la niñera. los niños. las niñas. la niña. el niño.")))
docs <- tm_map(sp.corpus, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("spanish"))
docs <- tm_map(docs, stemDocument, "spanish") # WRONG: apply PlainTextDocument first
docs <- tm_map(docs, PlainTextDocument)
print(docs[[1]]$content)
# " niñera niños niñas niña niñ"
In my opinion, this detail is not obvious and it'd be nice to get at least a warning when stemDocument is invoked on a non-PlainTextDocument.
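For what it's worth, a likely explanation (assuming tm 0.6 or later) is that plain tolower hands back a bare character vector rather than a PlainTextDocument, so later transforms no longer see a proper document. Wrapping it in content_transformer should keep the document type intact and make the extra PlainTextDocument step unnecessary. A minimal sketch under that assumption, using VCorpus rather than Corpus:

```r
library(tm)  # assumption: tm >= 0.6, which provides content_transformer()

sp.corpus <- VCorpus(VectorSource("la niñera. los niños. las niñas. la niña. el niño."))

docs <- tm_map(sp.corpus, removePunctuation)
docs <- tm_map(docs, removeNumbers)
# content_transformer() applies tolower to the text only and returns a document,
# so the elements of the corpus stay PlainTextDocument objects.
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("spanish"))
docs <- tm_map(docs, stemDocument, language = "spanish")

# Check the class before trusting the stemmed text.
print(class(docs[[1]]))
print(content(docs[[1]]))
```

Checking class(docs[[1]]) after each step is a cheap way to spot the point where a transform drops the document type.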
How to use a custom stemming algorithm with tm package in R?
You can integrate other functions with content_transformer, which you can then use in a tm_map call. You just need to know what the receiving function needs. In this case, cistem needs the words, so you can use the words function from the NLP package (loaded automatically when you load the tm library) to get them. An unlist and an lapply are also needed.
Note: cistem returns the words in lowercase, so be aware of this.
library(cistem)
library(tm)
# Some text
txt <- c("Dies ist ein deutscher Text.",
"Dies ist ein anderer deutscher Text.")
# the stemmer based on cistem
my_stemmer <- content_transformer(function(x) {
  unlist(lapply(x, function(line) {  # lapply over the lines, then unlist the result
    paste(cistem(words(line)), collapse = " ")  # paste the words back together
  }))
})
my_corpus <- VCorpus(VectorSource(txt))
# stem the corpus
my_stemmed_corpus <- tm_map(my_corpus, my_stemmer)
# check output
inspect(my_stemmed_corpus[[1]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 26
dies ist ein deutsch text.
inspect(my_stemmed_corpus[[2]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 32
dies ist ein ander deutsch text.
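The same wrapping pattern should work for any function that maps a vector of words to a vector of stems. As a sketch, here it is with SnowballC's wordStem standing in for cistem. make_stemmer is a hypothetical helper name, and plain whitespace splitting replaces words() to keep the example self-contained, so punctuation stays attached to the tokens.

```r
library(tm)
library(SnowballC)  # assumption: SnowballC is installed; it provides wordStem()

# Hypothetical factory: turn any word-level stemming function into a
# transformation that tm_map() can apply to each document.
make_stemmer <- function(stem_fun) {
  content_transformer(function(x) {
    # x is the character content of one document, possibly several lines
    vapply(x, function(line) {
      tokens <- unlist(strsplit(line, "\\s+"))
      paste(stem_fun(tokens), collapse = " ")
    }, character(1), USE.NAMES = FALSE)
  })
}

# Stand-in stemmer: Snowball's German stemmer instead of cistem
snowball_german <- make_stemmer(function(w) wordStem(w, language = "german"))

txt <- c("Dies ist ein deutscher Text.",
         "Dies ist ein anderer deutscher Text.")
stemmed_corpus <- tm_map(VCorpus(VectorSource(txt)), snowball_german)

print(content(stemmed_corpus[[1]]))
```

Passing the stemming function in as an argument means swapping algorithms only changes the one line that builds the transformer.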