Stemming with R Text Analysis

We could set up a list of synonyms and replace those values. For example:

synonyms <- list(
list(word="account", syns=c("acount", "accounnt"))
)

This says we want to replace "acount" and "accounnt" with "account" (I'm assuming we're doing this after stemming). Now let's create test data.

raw <- c("accounts", "account", "accounting", "acounting",
         "acount", "acounts", "accounnt")

And now let's define a transformation function that will replace the words in our list with the primary synonym.

library(tm)
replaceSynonyms <- content_transformer(function(x, syn=NULL) {
    Reduce(function(a,b) {
        gsub(paste0("\\b(", paste(b$syns, collapse="|"),")\\b"), b$word, a)
    }, syn, x)
})

Here we use the content_transformer function to define a custom transformation. Inside it, Reduce walks the synonym list and a gsub replaces each listed variant with the primary word. We can then use this on a corpus:

tm <- Corpus(VectorSource(raw))
tm <- tm_map(tm, stemDocument)
tm <- tm_map(tm, replaceSynonyms, synonyms)
inspect(tm)

We can see that all of these values are transformed into "account", as desired. To add other synonyms, just add additional lists to the main synonyms list. Each sub-list should have the elements "word" and "syns".
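
The replacement step itself does not depend on tm, so it can be tried on plain strings. The following is only a sketch in base R: the helper name replace_synonyms and the second synonym group ("receive") are made up for illustration, and the input stands in for text that has already been stemmed.

```r
# Same structure as the synonyms list above, plus a second,
# hypothetical group to show how further entries are added.
synonyms <- list(
  list(word = "account", syns = c("acount", "accounnt")),
  list(word = "receive", syns = c("recieve"))
)

# Hypothetical helper: the Reduce/gsub logic without content_transformer.
replace_synonyms <- function(x, syn) {
  Reduce(function(a, b) {
    # Build one whole-word pattern per synonym group, e.g. \b(acount|accounnt)\b
    pattern <- paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b")
    gsub(pattern, b$word, a)
  }, syn, x)
}

# Stand-in for already stemmed text
stemmed <- c("account", "acount", "accounnt", "recieve the acount")
result <- replace_synonyms(stemmed, synonyms)
print(result)
```

Because the pattern is anchored with word boundaries, a variant is only replaced when it appears as a whole word.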

Text-mining with the tm-package - word stemming

I'm not 100% sure what you're after, and I don't totally get how tm_map works, but if I understand correctly, the following works. As I understand it, you want to supply a list of words that should not be stemmed. I'm using the qdap package, mostly because I'm lazy and it has a function, mgsub, that I like.

Note that I got frustrated with using mgsub with tm_map, as it kept throwing an error, so I just used lapply instead.

texts <- c("i am member of the XYZ association",
"apply for our open associate position",
"xyz memorial lecture takes place on wednesday",
"vote for the most popular lecturer")

library(tm)
# Step 1: Create corpus
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))

library(qdap)
# Step 2: list to retain and identifier keys
retain <- c("lecturer", "lecture")
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_")

# Step 3: sub the words you want to retain with identifier keys
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)

# Step 4: Stem it
corpus.temp <- tm_map(corpus, stemDocument, language = "english")

# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)

inspect(corpus) #inspect the pieces for the folks playing along at home
inspect(corpus.copy)
inspect(corpus.temp)

# Step 6: complete the stem
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)

Basically it works by:

  1. subbing in a unique identifier key for each supplied "NO STEM" word (the mgsub)
  2. stemming (using stemDocument)
  3. reversing the substitution, so the identifier keys become the "NO STEM" words again (the mgsub)
  4. completing the stems (stemCompletion)
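
The protect, stem, restore part of this can be sketched in base R without tm or qdap. This is only an illustration under stated assumptions: toy_stem is a made-up stand-in that strips a few suffixes (it is not stemDocument), and swap is a hypothetical helper playing the role of mgsub.

```r
texts <- c("xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")

# Words to protect (longest first) and their identifier keys
retain <- c("lecturer", "lecture")
keys <- paste(seq_along(retain), "SPECIAL_WORD", sep = "_")

# Hypothetical stand-in for a real stemmer: strips a few common suffixes
toy_stem <- function(x) gsub("(er|es|e|s)\\b", "", x, perl = TRUE)

# Hypothetical helper: replace each from[i] with to[i], in order
swap <- function(x, from, to) {
  for (i in seq_along(from)) x <- gsub(from[i], to[i], x, fixed = TRUE)
  x
}

protected <- swap(texts, retain, keys)   # hide the protected words
stemmed   <- toy_stem(protected)         # stem everything else
restored  <- swap(stemmed, keys, retain) # put the protected words back
print(restored)
```

The keys survive the stemming step because the toy stemmer only touches lowercase suffixes. With a real stemmer it is worth checking that the chosen keys are left unchanged.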

Here's the output:

## >     inspect(corpus.final)
## A corpus with 4 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`1`
## i am member of the XYZ associate
##
## $`2`
## for our open associate position
##
## $`3`
## xyz memorial lecture takes place on wednesday
##
## $`4`
## vote for the most popular lecturer

Stemming function in r

If you want to remove stopwords from each sentence, you could use lapply:

library(tm)
lapply(sentences, removeWords, stopwords())

#[[1]]
#[1] "" "color" "blue" "neutralize" "orange" "yellow" "reflection" "."

#[[2]]
#[1] "zod" "stabbed" "" "" "blue" "kryptonite" "."
#...
#...

However, from your expected output it looks like you want to paste the text together.

lapply(sentences, paste0, collapse = " ")

#[[1]]
#[1] "the color blue neutralize orange yellow reflection ."

#[[2]]
#[1] "zod stabbed me with blue kryptonite ."
#....
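
If the goal is to do both (drop the stopwords and then paste each sentence back together), one possible sketch in base R is below. The token lists and the tiny stop_words vector are assumptions for illustration. Filtering with %in% is a swapped-in technique: unlike removeWords, it leaves no empty strings behind, so the pasted result has no doubled spaces.

```r
# Hypothetical tokenized sentences and a tiny stand-in stop word list
sentences <- list(
  c("the", "color", "blue", "neutralize", "orange", "yellow", "reflection", "."),
  c("zod", "stabbed", "me", "with", "blue", "kryptonite", ".")
)
stop_words <- c("the", "me", "with")

cleaned <- lapply(sentences, function(tokens) {
  kept <- tokens[!(tokens %in% stop_words)]  # drop stop words entirely
  paste(kept, collapse = " ")                # rebuild one string per sentence
})
print(cleaned)
```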

r text analysis stem completion

The tm package has a function stemCompletion():

library(tm)
x <- c("completed", "complete", "completion", "teach", "taught")
tm <- Corpus(VectorSource(x))
dictCorpus <- tm   # keep the unstemmed corpus as the completion dictionary
tm <- tm_map(tm, stemDocument)
inspect(tm)
tm <- tm_map(tm, stripWhitespace)

tm <- tm_map(tm, stemCompletion, dictionary = dictCorpus)

As for completing verbs to the present tense, I am not sure that is possible with tm. Maybe RWeka, word2vec or qdap have methods for this, but I am not sure.

A quick and dirty solution may be to set type = "shortest" in stemCompletion; generally, I think present tense words will be shorter than past tense forms and gerunds.
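
To make the "shortest" idea concrete, here is a small base R sketch. It is an illustration of the heuristic, not tm's implementation: complete_shortest is a hypothetical helper, and the stems are written out by hand rather than produced by a stemmer.

```r
dictionary <- c("completed", "complete", "completion", "teach", "taught")
stems <- c("complet", "teach", "taught")  # hand-written stems for illustration

# Hypothetical helper: pick the shortest dictionary word starting with the stem
complete_shortest <- function(stem, dictionary) {
  candidates <- dictionary[startsWith(dictionary, stem)]
  if (length(candidates) == 0) return(stem)  # nothing to complete to
  candidates[which.min(nchar(candidates))]
}

completed <- vapply(stems, complete_shortest, character(1),
                    dictionary = dictionary, USE.NAMES = FALSE)
print(completed)
```

Note that "taught" stays "taught" in this sketch: a prefix match cannot map an irregular past tense back to "teach", which is why the heuristic is only quick and dirty.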

Misspelling-aware stemming with R Text Analysis

For the basic automatic construction of a stemmer from a standard English dictionary, Tyler Rinker's answer already shows what you want.

All you need to add is code for synthesizing likely misspellings, or matching (common) misspellings in your corpus using a word-distance metric like Levenshtein distance (see adist) to find the closest match in the dictionary.
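
As a rough sketch of the second option, adist (from the utils package) returns a matrix of edit distances, and the closest dictionary entry can be picked per row. The dictionary and the misspellings here are invented for illustration.

```r
dictionary <- c("account", "lecture", "associate")  # hypothetical dictionary
misspelled <- c("acount", "lectrue", "asociate")    # hypothetical corpus words

# Rows are misspelled words, columns are dictionary entries
distances <- utils::adist(misspelled, dictionary)

# For each misspelled word, take the dictionary entry with the smallest distance
closest <- dictionary[apply(distances, 1, which.min)]
print(closest)
```

In practice a maximum distance threshold is probably needed as well, so that words with no reasonable match are left alone rather than forced onto the nearest entry.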

problems in stemming in text analysis (Swedish data)

You are almost there, but using PlainTextDocument is interfering with your goal.

The following code will return your expected result. I'm using removePunctuation; otherwise the stemming will not work on the words at the end of a sentence. You will also see warning messages after both tm_map calls. You can ignore these.

corpus.prep <- Corpus(VectorSource(text), readerControl = list(reader=readPlain, language="swe"))
corpus.prep <- tm_map(corpus.prep, removePunctuation)
corpus.prep <- tm_map(corpus.prep, stemDocument, language = "swedish")

head(content(corpus.prep))

[1] "TV och var med kompis" "Jobb på kompis huset" "Ta det lugnt umgås med kompis" "Umgås med kompis vänn"
[5] "koll anim med kompis"

For this kind of work I tend to use quanteda. It has better support and works a lot better than tm.

library(quanteda)

# remove_punct is not really needed, as quanteda treats the "." as a separate token.
# Tokenize first: recent quanteda versions no longer accept raw text in dfm().
my_dfm <- dfm(tokens(text, remove_punct = TRUE))
dfm_wordstem(my_dfm, language = "swedish")

Document-feature matrix of: 5 documents, 15 features (69.3% sparse).
5 x 15 sparse Matrix of class "dfm"
       features
docs    tv och var med kompis jobb på huset ta det lugnt umgås vänn koll anim
  text1  1   1   1   1      1    0  0     0  0   0     0     0    0    0    0
  text2  0   0   0   0      1    1  1     1  0   0     0     0    0    0    0
  text3  0   0   0   1      1    0  0     0  1   1     1     1    0    0    0
  text4  0   0   0   1      1    0  0     0  0   0     0     1    1    0    0
  text5  0   0   0   1      1    0  0     0  0   0     0     0    0    1    1

Word stemming in R

That is just how the Porter Stemmer works. The reason for this is that it allows fairly simple rules to create the stems without having to store a large English vocabulary. For example, I think that you would not like that both change and changing go to chang. It seems more natural that they should both stem to change. So would you make a rule that if you take ing off the end of a word, you should add back e to get the stem? Then what would happen with clang and clanging? The Porter Stemmer gives clang. Adding e would give the non-word clange. Either you use simple processing rules that sometimes create stems that are not words, or you must include a large vocabulary and have more complex rules that depend on what the words are. The Porter Stemmer uses the simple rules method.
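
The trade-off can be seen with two toy rules in base R. These are not the Porter algorithm, only the two hypothetical rules discussed above ("take ing off" versus "take ing off and add e"), with made-up function names.

```r
words <- c("changing", "clanging")

# Rule 1: just strip a trailing "ing"
strip_ing <- function(w) sub("ing$", "", w)

# Rule 2: strip a trailing "ing" and add back an "e"
strip_ing_add_e <- function(w) sub("ing$", "e", w)

plain  <- strip_ing(words)        # "chang"  "clang"
with_e <- strip_ing_add_e(words)  # "change" "clange"
print(plain)
print(with_e)
```

Neither rule gets both words right without knowing which results are real words, which is the vocabulary a simple rule-based stemmer avoids storing.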

Not getting the right text after stemming in text analysis (Swedish)

What you describe here is actually not stemming but is called lemmatization (see @Newl's link for the difference).

To get the correct lemmas, you can use the R package UDPipe, which is a wrapper around the UDPipe C++ library.

Here is a quick example of how you would do what you want:

# install.packages("udpipe")    
library(udpipe)
dl <- udpipe_download_model(language = "swedish-lines")
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.3/master/inst/udpipe-ud-2.3-181115/swedish-lines-ud-2.3-181115.udpipe to C:/Users/Johannes Gruber/AppData/Local/Temp/RtmpMhaF8L/reprex8e40d80ef3/swedish-lines-ud-2.3-181115.udpipe

udmodel_swed <- udpipe_load_model(file = dl$file_model)

text_example <- c("projekt", "papper", "arbete")

x <- udpipe_annotate(udmodel_swed, x = text_example)
x <- as.data.frame(x)
x$lemma
#> [1] "projekt" "papper" "arbete"

R stemming a string/document/corpus

The RTextTools package on CRAN allows you to do this.

library(RTextTools)
worder1<- c("I am taking","these are the samples",
"He speaks differently","This is distilled","It was placed")
df1 <- data.frame(id=1:5, words=worder1)

matrix <- create_matrix(df1, stemWords=TRUE, removeStopwords=FALSE, minWordLength=2)
colnames(matrix) # SEE THE STEMMED TERMS

This returns a DocumentTermMatrix that can be used with the tm package. You can play around with the other parameters (e.g. removing stopwords, changing the minimum word length, using a stemmer for a different language) to get the results you need. When displayed with as.matrix, the example produces the following term matrix:

                        Terms
Docs                    am are differ distil he is it place sampl speak take the these this was
1 I am taking            1   0      0      0  0  0  0     0     0     0    1   0     0    0   0
2 these are the samples  0   1      0      0  0  0  0     0     1     0    0   1     1    0   0
3 He speaks differently  0   0      1      0  1  0  0     0     0     1    0   0     0    0   0
4 This is distilled      0   0      0      1  0  1  0     0     0     0    0   0     0    1   0
5 It was placed          0   0      0      0  0  0  1     1     0     0    0   0     0    0   1

