Stemming Algorithm That Produces Real Words

The core issue here is that stemming algorithms operate mechanically, based purely on the language's spelling rules, with no actual understanding of the language they're working with. To produce real words, you'll probably have to merge the stemmer's output with some form of lookup function that converts the stems back to real words. I can see two basic ways to do this:

  1. Locate or create a large dictionary which maps each possible stem back to an actual word. (e.g., communiti -> community)
  2. Create a function which compares each stem to a list of the words that were reduced to that stem and attempts to determine which is most similar. (e.g., comparing "communiti" against "community" and "communities" in such a way that "community" will be recognized as the more similar option)

Personally, I would use a dynamic form of #1: build up a custom dictionary database by recording every word examined along with what it stemmed to, then assume that the most common word is the one that should be used. (e.g., if my body of source text uses "communities" more often than "community", then map communiti -> communities.) A dictionary-based approach will be more accurate in general, and building it from the stemmer's own input will give results customized to your texts. The primary drawback is the space required, which is generally not an issue these days.
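
As a minimal sketch of that dynamic approach (assuming NLTK's PorterStemmer as the underlying stemmer; any stemmer with a stem() method would work):

from collections import Counter, defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
seen = defaultdict(Counter)  # stem -> how often each source word produced it

def observe(word):
    # Record every word examined along with what it stemmed to.
    seen[stemmer.stem(word)][word] += 1

def destem(stem):
    # Map a stem back to the most common word that produced it.
    return seen[stem].most_common(1)[0][0] if stem in seen else stem

for w in ['community', 'communities', 'community']:
    observe(w)

print(destem('communiti'))  # -> 'community' (the most frequent source word)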

Get the word from stem (stemming)

What, so you want to take a stem and map it to a list of possible words in a dictionary that stem back to it?

This is difficult because the stemming process is lossy and because it's not a 1:1 transformation.

That said, in some cases like environ -> {environment, environments, environmental} and educ -> {educate, educational, education, educated, educating} you can get by with a trie structure where you do a prefix lookup. Things get more interesting for stems like happi, which has to map back to happy.
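
Here's a sketch of the prefix-lookup idea, using a plain dict-of-dicts trie with the words from above. Note that it only covers the pure prefix case, not stems like happi:

def build_trie(words):
    # A dict-of-dicts trie; '$' marks the end of a complete word.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = word
    return root

def words_with_prefix(trie, prefix):
    # Walk down to the prefix's node, then collect every word below it.
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        for key, child in stack.pop().items():
            if key == '$':
                found.append(child)
            else:
                stack.append(child)
    return found

trie = build_trie(['environment', 'environments', 'environmental'])
print(sorted(words_with_prefix(trie, 'environ')))
# ['environment', 'environmental', 'environments']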

In the general case, you would have to start with a dictionary and then produce an inverted index by stemming each word and mapping the stem back to the source word in the index. Using the inverted index you can then look up matches given a stem.
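
A minimal sketch of that inverted index, again assuming NLTK's PorterStemmer and a tiny word list standing in for a real dictionary:

from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# A tiny word list standing in for a full dictionary.
dictionary = ['educate', 'educational', 'education', 'educated', 'educating']

inverted = defaultdict(set)  # stem -> all source words that reduce to it
for word in dictionary:
    inverted[stemmer.stem(word)].add(word)

print(sorted(inverted['educ']))
# ['educate', 'educated', 'educating', 'education', 'educational']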

Hope this helps.

Is it possible to get a natural word after it has been stemmed?

A stemmer is able to process artificial, non-existent words. Would you like them to be returned as elements of the set of all possible words? How would you know that a word doesn't exist and shouldn't be returned?

As an option: find a dictionary of all words and their forms. Find the stem for each of them. Save this projection as a map: ( stem, list of all word forms ). Then you'll be able to get the list of all word forms for a given stem.

UPD:
If you need all possible words, including non-existent ones, then I can offer the following algorithm (it's not checked, just a suggestion):

Take the Porter stemming algorithm. We need a reversed version of it.

If a rule in the forward algorithm has the form (m>1) E -> (delete last E), then the reversed rule would be a "fork with E", meaning we need to try alternative ways. E.g., in the forward algorithm probate -> probat; in the reversed one we have two alternatives: probat -> { probat, probate }. Each of these alternatives should be processed further separately. Note that this is a set of alternatives, so we will process only distinct words. Such a rule would have the following form: A -> { , B, C }, which means "replace ending A in three alternative ways: leave it as-is, replace with B, or replace with C".

Step 5b: (m>1) *L -> { , +L } // Add L if there's L at the end.
Step 5a: (m>1) -> { , +E }
(m=1 and not *o) -> { , +E } // *o is a special condition, it's not *O.
Step 4: (m>1) *S or *T -> { , +ION }
(m>1) -> { , +AL, +ANCE, +ENCE, ..., +IVE, +IZE }
Step 3: (m>0) *AL -> { , +IZE }
(m>0) *IC -> { , +ATE, +ITI, +AL }
(m>0) -> { , +ATIVE, +FUL, +NESS }
Step 2: (m>0) *ATE -> { , ATIONAL } // Replace ATE.
(m>0) *TION -> { , +AL } // Add AL at the end.
(m>0) *ENCE -> { , ENCI } // Replace ENCE.
...
(m>0) *BLE -> { , BILITI } // Replace BLE.
Step 1c: (*v*) *I -> { , Y } // Replace I.
Step 1b: (m=1 and *oE) -> { , +D, delete last E and add ING } // *o is a special condition.
(*v*c and not (*L or *S or *Z)) -> { , add last consonant +ED, add last consonant + ING }
*IZE -> { , IZING, +D }
(*v*BLE) -> { , +D, delete last E and add ING }
*ATE -> { , ATING, +D }
(*v*) -> { , +ED, +ING }
(m>0) *EE -> { , +D }
Step 1a: *I -> { , +ES }
*SS -> { , +ES }
not *S -> { , +S }

The forward algorithm had to choose the first (longest) matching rule; the reversed algorithm should use all the rules.

Example (straight):

Input: PLAYING
Step 1a doesn't match.
PLAYING -> PLAY (Step 1b)
PLAY -> PLAI (Step 1c)
m=0, so the steps 2-5 don't match.
Result: PLAI

Reversed:

Input: PLAI
m=0, so the steps 2-5 are skipped
Step 1c:
PLAI -> { PLAI, PLAY }
Step 1b:
PLAI -> { PLAI, PLAIED, PLAIING }
PLAY -> { PLAY, PLAYED, PLAYING }
Resulting set: { PLAI, PLAIED, PLAIING, PLAY, PLAYED, PLAYING }
Step 1a:
PLAI -> { PLAI, PLAIS, PLAIES }
PLAIED -> { PLAIED, PLAIEDS }
PLAIING -> { PLAIING, PLAIINGS }
PLAY -> { PLAY, PLAYS }
PLAYED -> { PLAYED, PLAYEDS }
PLAYING -> { PLAYING, PLAYINGS }
Resulting set: { PLAI, PLAIS, PLAIES, PLAIED, PLAIEDS, PLAIING, PLAIINGS, PLAY, PLAYS, PLAYED, PLAYEDS, PLAYING, PLAYINGS }
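
To make the fork mechanism concrete, here is a toy sketch that reproduces this Step 1 walkthrough. The rule tables are hand-simplified (the m-measure and letter-pattern conditions are omitted, and only the rules that fire for this input are listed), so it's a suggestion in the spirit of the algorithm above, not a complete reversed stemmer:

def expand(words, rules):
    # Apply one reversed step: every matching word forks into itself
    # plus all alternatives produced by the rules.
    result = set(words)
    for word in words:
        for suffix, replacements in rules:
            if word.endswith(suffix):
                stem = word[:len(word) - len(suffix)] if suffix else word
                for alt in replacements:
                    result.add(stem + alt)
    return result

words = {'PLAI'}
words = expand(words, [('I', ['Y'])])                 # Step 1c: *I -> { , Y }
words = expand(words, [('', ['ED', 'ING'])])          # Step 1b: (*v*) -> { , +ED, +ING }
words = expand(words, [('I', ['IES']), ('', ['S'])])  # Step 1a: *I -> { , +ES }; not *S -> { , +S }
print(sorted(words))
# ['PLAI', 'PLAIED', 'PLAIEDS', 'PLAIES', 'PLAIING', 'PLAIINGS', 'PLAIS',
#  'PLAY', 'PLAYED', 'PLAYEDS', 'PLAYING', 'PLAYINGS', 'PLAYS']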

I've checked all those words at Michael Tontchev's link. The result for each of them is "plai" (note that the site doesn't accept upper-case input).

Stemming on tokenized words

You may find better answers, but I personally find the LemmInflect library to be the best for lemmatization and inflection.

#!pip install lemminflect
from lemminflect import getInflection, getAllLemmas

word = 'testing'
lemma = list(getAllLemmas(word, upos='NOUN').values())[0]
inflect = getInflection(lemma[0], tag='VBD')

print(word, lemma, inflect)
# testing ('test',) ('tested',)

I would avoid stemming because it's not really useful if you want to work with language models, or with text classification where context matters. Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word in the language.

Inflection is the opposite of lemmatization: it generates an inflected surface form from the lemma.



sentence = ['I', 'am', 'testing', 'my', 'new', 'library']

def lemmatize(sentence):
    lemmatized_sent = []
    for word in sentence:
        try:
            # Take the first available lemma for the word.
            lemmatized_sent.append(list(getAllLemmas(word, upos='NOUN').values())[0][0])
        except IndexError:
            # No lemma found; keep the word unchanged.
            lemmatized_sent.append(word)
    return lemmatized_sent

lemmatize(sentence)
# ['I', 'be', 'test', 'my', 'new', 'library']

# To apply it to a dataframe column:
df['sentences'].apply(lemmatize)

Do read the documentation for LemmInflect. You can do so much more with it.

Why don't the results of my Porter stemmer implementation match the root words they should?

Quoting from your link:

2. Why is the stemmer not producing proper words?


It is often taken to be a crude error that a stemming algorithm does not leave a real word after removing the stem. But the purpose of stemming is to bring variant forms of a word together, not to map a word onto its ‘paradigm’ form.

And connected with this,

3. Why are there errors?


The question normally comes in the form, why should word X be stemmed to x1, when one would have expected it to be stemmed to x2? It is important to remember that the stemming algorithm cannot achieve perfection. On balance it will (or may) improve IR performance, but in individual cases it may sometimes make what are, or what seem to be, errors. Of course, this is a different matter from suggesting an additional rule that might be included in the stemmer to improve its performance.

Another Porter stemming algorithm implementation question?

The Porter stemmer and other stemming algorithms don't always return words; they return word stems. The goal is that related words should have the same stem. As long as "happiness", "happy", and "happily" all reduce to the same stem, then your stemmer is doing its job, even if the stem isn't a real word.
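
For example, with NLTK's PorterStemmer (one concrete implementation, assumed here):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Related forms reduce to one shared stem, which need not be a real word.
print({stemmer.stem(w) for w in ['happy', 'happiness']})  # {'happi'}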

What is the difference between lemmatization vs stemming?

Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

From the NLTK docs:

Lemmatization and stemming are special cases of normalization. They identify a canonical representative for a set of related word forms.
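
To see the difference concretely, here's a small side-by-side sketch using NLTK's PorterStemmer and WordNetLemmatizer (the lemmatizer needs the wordnet corpus downloaded, and the part of speech is supplied by hand):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['studies', 'studying']:
    # The stem ('studi') is not a word; the lemma ('study') is.
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos='v'))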


