Stemming with R Text Analysis
We could set up a list of synonyms and replace those values. For example:
synonyms <- list(
list(word="account", syns=c("acount", "accounnt"))
)
This says we want to replace "acount" and "accounnt" with "account" (I'm assuming we're doing this after stemming). Now let's create test data.
raw<-c("accounts", "account", "accounting", "acounting",
"acount", "acounts", "accounnt")
And now let's define a transformation function that will replace the words in our list with the primary synonym.
library(tm)
replaceSynonyms <- content_transformer(function(x, syn=NULL) {
Reduce(function(a,b) {
gsub(paste0("\\b(", paste(b$syns, collapse="|"),")\\b"), b$word, a)}, syn, x)
})
Here we use the content_transformer function to define a custom transformation. Basically, we just do a gsub to replace each of the words. We can then use this on a corpus:
tm <- Corpus(VectorSource(raw))
tm <- tm_map(tm, stemDocument)
tm <- tm_map(tm, replaceSynonyms, synonyms)
inspect(tm)
We can see that all these values are transformed into "account" as desired. To add other synonyms, just add additional lists to the main synonyms list. Each sub-list should have the names "word" and "syns".
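As a minimal sketch of adding a second entry, the same Reduce/gsub logic can be run on a plain character vector without tm. The "colour" entry and the helper name replace_synonyms_plain are made up for illustration; they are not part of the original answer.

```r
# Synonym table: the first entry is from the answer above,
# the second ("colour") is a hypothetical addition.
synonyms <- list(
  list(word = "account", syns = c("acount", "accounnt")),
  list(word = "colour",  syns = c("color", "colr"))
)

# Hypothetical helper: apply every synonym entry in turn to a character vector.
replace_synonyms_plain <- function(x, syn) {
  Reduce(function(a, b) {
    pattern <- paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b")
    gsub(pattern, b$word, a)
  }, syn, x)
}

stems <- c("acount", "accounnt", "color", "account")
result <- replace_synonyms_plain(stems, synonyms)
print(result)
```

Each entry is applied in list order, so a later entry sees the output of the earlier ones.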
Text-mining with the tm-package - word stemming
I'm not 100% sure what you're after and don't totally get how tm_map works. If I understand correctly, the following works. As I understand it, you want to supply a list of words that should not be stemmed. I'm using the qdap package mostly because I'm lazy and it has a function, mgsub, that I like.
Note that I got frustrated with using mgsub and tm_map together as it kept throwing an error, so I just used lapply instead.
texts <- c("i am member of the XYZ association",
"apply for our open associate position",
"xyz memorial lecture takes place on wednesday",
"vote for the most popular lecturer")
library(tm)
# Step 1: Create corpus
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))
library(qdap)
# Step 2: list to retain and identifier keys
retain <- c("lecturer", "lecture")
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_")
# Step 3: sub the words you want to retain with identifier keys
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)
# Step 4: Stem it
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)
inspect(corpus) #inspect the pieces for the folks playing along at home
inspect(corpus.copy)
inspect(corpus.temp)
# Step 6: complete the stem
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)
Basically it works by:
- subbing out a unique identifier key for the supplied "NO STEM" words (the mgsub)
- then you stem (using stemDocument)
- next you reverse it and sub the identifier keys with the "NO STEM" words (the mgsub)
- last, complete the stem (stemCompletion)
Here's the output:
## > inspect(corpus.final)
## A corpus with 4 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`1`
## i am member of the XYZ associate
##
## $`2`
## for our open associate position
##
## $`3`
## xyz memorial lecture takes place on wednesday
##
## $`4`
## vote for the most popular lecturer
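If qdap is not available, a related but simpler idea can be sketched in base R: instead of masking the protected words with identifier keys, split the sentence into words and skip the protected ones when stemming. This is a different technique from the one above, and both toy_stem and stem_except are hypothetical names; toy_stem is a crude stand-in for a real stemmer, used only so the sketch runs without extra packages.

```r
# Toy stand-in for a real stemmer (an assumption for illustration only):
# strips a trailing "ing", "er" or "s" when at least three characters remain.
toy_stem <- function(words) {
  sub("^(.{3,}?)(ing|er|s)$", "\\1", words, perl = TRUE)
}

# Hypothetical helper: stem every word of one sentence except those in `retain`.
stem_except <- function(text, retain, stem_fun = toy_stem) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  protected <- words %in% retain
  words[!protected] <- stem_fun(words[!protected])
  paste(words, collapse = " ")
}

retain <- c("lecturer", "lecture")
stemmed <- stem_except("the lecturer and the teacher are teaching", retain)
print(stemmed)
```

Here "lecturer" is left alone while "teacher" and "teaching" are both cut to "teach". A real stemmer, for example SnowballC::wordStem, could be passed as stem_fun instead of the toy one.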
Stemming function in r
If you want to remove stopwords from each sentence, you could use lapply:
library(tm)
lapply(sentences, removeWords, stopwords())
#[[1]]
#[1] "" "color" "blue" "neutralize" "orange" "yellow" "reflection" "."
#[[2]]
#[1] "zod" "stabbed" "" "" "blue" "kryptonite" "."
#...
#...
However, from your expected output it looks like you want to paste the text together.
lapply(sentences, paste0, collapse = " ")
#[[1]]
#[1] "the color blue neutralize orange yellow reflection ."
#[[2]]
#[1] "zod stabbed me with blue kryptonite ."
#....
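Note that removeWords leaves empty strings where stop words used to be, and pasting the original sentences keeps the stop words. A rough base R sketch that combines both steps might look like the following; the sentences are shortened versions of the ones above, and stop_words is a tiny hypothetical list used in place of tm's stopwords().

```r
# Tokenised sentences, shortened from the output shown above.
sentences <- list(
  c("the", "color", "blue", "."),
  c("zod", "stabbed", "me", "with", "blue", "kryptonite", ".")
)

# Hypothetical, tiny stop word list; tm::stopwords() returns a fuller one.
stop_words <- c("the", "me", "with")

# Drop stop words first, then collapse each sentence into one string.
cleaned <- lapply(sentences, function(tokens) {
  paste(tokens[!tokens %in% stop_words], collapse = " ")
})
print(cleaned)
```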
r text analysis stem completion
tm has a function stemCompletion(). The dictionary should be built from the unstemmed text, so keep a copy of the corpus before stemming:
x <- c("completed","complete","completion","teach","taught")
tm <- Corpus(VectorSource(x))
# keep an unstemmed copy to use as the completion dictionary
dictCorpus <- tm
tm <- tm_map(tm, stemDocument)
inspect(tm)
tm <- tm_map(tm, stripWhitespace)
tm <- tm_map(tm, stemCompletion, dictionary=dictCorpus)
As for completing verbs to the present tense, I am not sure that is possible with tm. Maybe RWeka, word2vec or qdap will have methods but I am not sure.
A quick and dirty solution may be to set type = "shortest" in stemCompletion; generally I think present tense words will be shorter than past tense forms and gerunds.
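What a "shortest" completion amounts to can be sketched in base R: for each stem, take the dictionary words that start with it and keep the shortest one. This is only an illustration of the idea, not tm's implementation, and complete_shortest is a hypothetical helper name.

```r
# Hypothetical helper: complete each stem to the shortest dictionary word
# that starts with it; stems without any match become NA.
complete_shortest <- function(stems, dictionary) {
  vapply(stems, function(s) {
    candidates <- dictionary[startsWith(dictionary, s)]
    if (length(candidates) == 0) return(NA_character_)
    candidates[which.min(nchar(candidates))]
  }, character(1), USE.NAMES = FALSE)
}

dictionary <- c("completed", "complete", "completion", "teach", "taught")
completed <- complete_shortest(c("complet", "teach", "taught"), dictionary)
print(completed)
```

With this dictionary, "complet" is completed to "complete". Irregular forms such as "taught" are untouched, because prefix matching cannot relate them to "teach".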
Misspelling-aware stemming with R Text Analysis
For the basic automatic construction of a stemmer from a standard English dictionary, Tyler Rinker's answer already shows what you want.
All you need to add is code for synthesizing likely misspellings, or matching (common) misspellings in your corpus using a word-distance metric like Levenshtein distance (see adist) to find the closest match in the dictionary.
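A minimal sketch of the closest-match idea with adist (from the utils package that ships with R) could look like this. The dictionary and the misspelled words are made-up examples.

```r
# Hypothetical dictionary of correct words and some misspelled inputs.
dictionary <- c("account", "accounting", "stem")
misspelled <- c("acount", "accounnt", "acounting")

# adist() returns a matrix of Levenshtein distances:
# one row per input word, one column per dictionary word.
distances <- adist(misspelled, dictionary)

# For each input word, pick the dictionary entry with the smallest distance.
closest <- dictionary[apply(distances, 1, which.min)]
print(closest)
```

which.min breaks ties by taking the first dictionary entry, so with a real dictionary it is probably worth also rejecting matches whose distance exceeds some threshold.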
problems in stemming in text analysis (Swedish data)
You are almost there, but using PlainTextDocument is interfering with your goal.
The following code will return your expected result. I'm removing punctuation because otherwise the stemming will not work on the words that are at the end of a sentence. You will also see warning messages appearing after both tm_map calls. You can ignore these.
corpus.prep <- Corpus(VectorSource(text), readerControl =list(reader=readPlain, language="swe"))
corpus.prep <- tm_map(corpus.prep, removePunctuation)
corpus.prep <- tm_map(corpus.prep, stemDocument, language = "swedish")
head(content(corpus.prep))
[1] "TV och var med kompis" "Jobb på kompis huset" "Ta det lugnt umgås med kompis" "Umgås med kompis vänn"
[5] "koll anim med kompis"
For this kind of work I tend to use quanteda. It has better support and works a lot better than tm.
library(quanteda)
# remove_punct not really needed as quanteda treats the "." as a separate token.
my_dfm <- dfm(text, remove_punct = TRUE)
dfm_wordstem(my_dfm, language = "swedish")
Document-feature matrix of: 5 documents, 15 features (69.3% sparse).
5 x 15 sparse Matrix of class "dfm"
       features
docs    tv och var med kompis jobb på huset ta det lugnt umgås vänn koll anim
  text1  1   1   1   1      1    0  0     0  0   0     0     0    0    0    0
  text2  0   0   0   0      1    1  1     1  0   0     0     0    0    0    0
  text3  0   0   0   1      1    0  0     0  1   1     1     1    0    0    0
  text4  0   0   0   1      1    0  0     0  0   0     0     1    1    0    0
  text5  0   0   0   1      1    0  0     0  0   0     0     0    0    1    1
Word stemming in R
That is just how the Porter Stemmer works. The reason for this is that it allows fairly simple rules to create the stems without having to store a large English vocabulary. For example, I think that you would not like that both change and changing go to chang. It seems more natural that they should both stem to change. So would you make a rule that if you take ing off the end of a word, you should add back e to get the stem? Then what would happen with clang and clanging? The Porter Stemmer gives clang. Adding e would give the non-word clange. Either you use simple processing rules that sometimes create stems that are not words, or you must include a large vocabulary and have more complex rules that depend on what the words are. The Porter Stemmer uses the simple rules method.
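The trade-off can be seen with two toy suffix rules in base R. These are illustrations of the dilemma only, not the Porter algorithm, and both function names are made up.

```r
# Two toy suffix rules (not the Porter algorithm itself).
strip_ing <- function(words) sub("ing$", "", words)         # just drop "ing"
strip_ing_add_e <- function(words) sub("ing$", "e", words)  # drop "ing", add "e"

words <- c("changing", "clanging")
print(strip_ing(words))        # "chang"  "clang"
print(strip_ing_add_e(words))  # "change" "clange"
```

Neither rule gets both words right, which is why a purely rule-based stemmer accepts non-word stems such as chang.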
Not getting the right text after stemming in text analysis (Swedish)
What you describe here is actually not stemming but is called lemmatization (see @Newl's link for the difference).
To get the correct lemmas, you can use the R package udpipe, which is a wrapper around the UDPipe C++ library.
Here is a quick example of how you would do what you want:
# install.packages("udpipe")
library(udpipe)
dl <- udpipe_download_model(language = "swedish-lines")
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.3/master/inst/udpipe-ud-2.3-181115/swedish-lines-ud-2.3-181115.udpipe to C:/Users/Johannes Gruber/AppData/Local/Temp/RtmpMhaF8L/reprex8e40d80ef3/swedish-lines-ud-2.3-181115.udpipe
udmodel_swed <- udpipe_load_model(file = dl$file_model)
text_example <- c("projekt", "papper", "arbete")
x <- udpipe_annotate(udmodel_swed, x = text_example)
x <- as.data.frame(x)
x$lemma
#> [1] "projekt" "papper" "arbete"
R stemming a string/document/corpus
The RTextTools package on CRAN allows you to do this.
library(RTextTools)
worder1<- c("I am taking","these are the samples",
"He speaks differently","This is distilled","It was placed")
df1 <- data.frame(id=1:5, words=worder1)
matrix <- create_matrix(df1, stemWords=TRUE, removeStopwords=FALSE, minWordLength=2)
colnames(matrix) # SEE THE STEMMED TERMS
This returns a DocumentTermMatrix that can be used with package tm. You can play around with the other parameters (e.g. removing stopwords, changing the minimum word length, using a stemmer for a different language) to get the results you need. When displayed with as.matrix, the example produces the following term matrix:
                         Terms
Docs                      am are differ distil he is it place sampl speak take the these this was
  1 I am taking            1   0      0      0  0  0  0     0     0     0    1   0     0    0   0
  2 these are the samples  0   1      0      0  0  0  0     0     1     0    0   1     1    0   0
  3 He speaks differently  0   0      1      0  1  0  0     0     0     1    0   0     0    0   0
  4 This is distilled      0   0      0      1  0  1  0     0     0     0    0   0     0    1   0
  5 It was placed          0   0      0      0  0  0  1     1     0     0    0   0     0    0   1