Lemmatization with Apache Lucene
In case someone still needs it, I decided to return to this question and illustrate how to use the russianmorphology library I found earlier to do lemmatization for English and Russian.
First of all, you will need these dependencies (besides lucene-core):
<!-- if you need Russian -->
<dependency>
    <groupId>org.apache.lucene.morphology</groupId>
    <artifactId>russian</artifactId>
    <version>1.1</version>
</dependency>
<!-- if you need English -->
<dependency>
    <groupId>org.apache.lucene.morphology</groupId>
    <artifactId>english</artifactId>
    <version>1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene.morphology</groupId>
    <artifactId>morph</artifactId>
    <version>1.1</version>
</dependency>
Then, make sure you import the right analyzer (and CharTermAttribute, used below to read the tokens):
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.morphology.english.EnglishAnalyzer;
import org.apache.lucene.morphology.russian.RussianAnalyzer;
These analyzers, unlike the standard Lucene analyzers, use a MorphologyFilter which converts each word into the set of its normal forms.
So if you use the following code
String text = "The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from \"produced\", the lemma is \"produce\", but the stem is \"produc-\". This is because there are words such as production";
Analyzer analyzer = new EnglishAnalyzer();
TokenStream stream = analyzer.tokenStream("field", text);
stream.reset();
while (stream.incrementToken()) {
    String lemma = stream.getAttribute(CharTermAttribute.class).toString();
    System.out.print(lemma + " ");
}
stream.end();
stream.close();
it will print
the stem be the part of the word that never change even when
morphologically inflected inflect a lemma be the base form of the word
for example from produced produce the lemma be produce but the stem be
produc this be because there are be word such as production
And for the Russian text
String text = "Продолжаю цикл постов об астрологии и науке. Астрология не имеет научного обоснования, но является частью истории науки, частью культуры и общественного сознания. Поэтому астрологический взгляд на науку весьма интересен.";
the RussianAnalyzer will print the following:
продолжать цикл пост об астрология и наука астрология не иметь научный
обоснование но являться часть частью история наука часть частью
культура и общественный сознание поэтому астрологический взгляд на
наука весьма интересный
You may notice that some words have more than one base form, e.g. inflected is converted to [inflected, inflect]. If you don't like this behaviour, you will have to change the implementation of org.apache.lucene.morphology.analyzer.MorphologyFilter (if you are interested in how exactly to do it, let me know and I'll elaborate on this).
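For instance, one way to keep only the first normal form of each word is a custom TokenFilter. This is a hypothetical sketch (the class name FirstLemmaOnlyFilter is mine), assuming that alternative normal forms are emitted at the same position, i.e. with a position increment of 0, which is the usual Lucene convention for synonym-like tokens; I have not verified exactly how russianmorphology emits them, so treat this as a starting point:

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Keeps only the first normal form of each word, assuming that
 * alternative forms are emitted with a position increment of 0.
 */
public final class FirstLemmaOnlyFilter extends TokenFilter {
    private final PositionIncrementAttribute posIncAtt =
            addAttribute(PositionIncrementAttribute.class);

    public FirstLemmaOnlyFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            // posInc > 0 means a new position, i.e. the first form of a word;
            // posInc == 0 marks an alternative form at the same position.
            if (posIncAtt.getPositionIncrement() > 0) {
                return true;
            }
        }
        return false;
    }
}
```

You could then build a custom Analyzer that chains this filter after the MorphologyFilter.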
Hope it helps, good luck!
How to use Stemmer or Lemmatizer to stem specific word
A complete example:
import nltk
from nltk.corpus import wordnet
from difflib import get_close_matches as gcm
from itertools import chain
from nltk.stem.porter import PorterStemmer

texts = ["apples are good. My teeth will fall out.",
         "roses are red. cars are great to have"]

lmtzr = nltk.WordNetLemmatizer()
stemmer = PorterStemmer()

for text in texts:
    tokens = nltk.word_tokenize(text)  # ideally you should sentence-tokenize first
    # Take your pick here between the lemmatizer and WordNet synsets.
    token_lemma = [lmtzr.lemmatize(token) for token in tokens]
    wn_lemma = [gcm(word, list(set(chain(*[i.lemma_names() for i in wordnet.synsets(word)])))) for word in tokens]
    # print(wn_lemma)  # works for unconventional words like 'teeth' --> 'tooth'; worth a closer look
    tokens_final = [stemmer.stem(tokens[i]) if len(tokens[i]) > len(token_lemma[i]) else token_lemma[i]
                    for i in range(len(tokens))]
    print(tokens_final)
Output
['appl', 'are', 'good', '.', 'My', 'teeth', 'will', 'fall', 'out', '.']
['rose', 'are', 'red', '.', 'car', 'are', 'great', 'to', 'have']
Explanation
Notice the expression stemmer.stem(tokens[i]) if len(tokens[i]) > len(token_lemma[i]) else token_lemma[i]: this is where the magic happens. If the lemma is shorter than the original word (the lemmatizer actually reduced it), the word gets stemmed instead; otherwise it just remains lemmatized.
Note
The lemmatization that you are attempting has some edge cases. WordNetLemmatizer is not smart enough to handle exceptional cases like 'teeth' --> 'tooth'. In those cases you may want to take a look at wordnet.synsets, which might come in handy.
I have included a small case in the comments for your investigation.
StanfordNLP lemmatization cannot handle -ing words
Lemmatization crucially depends on the part of speech of the token. Only tokens with the same part of speech are mapped to the same lemma.
In the sentence "This is confusing", confusing is analyzed as an adjective, and therefore it is lemmatized to confusing. In the sentence "I was confusing you with someone else", by contrast, confusing is analyzed as a verb, and is lemmatized to confuse.
If you want tokens with different parts of speech to be mapped to the same lemma, you can use a stemming algorithm such as Porter Stemming, which you can simply call on each token.
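To see the part-of-speech dependence in action, here is a minimal sketch using the Stanford CoreNLP CoreDocument API (the class name LemmaDemo and the printing format are mine; the exact output depends on the models you have loaded):

```java
import java.util.Properties;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class LemmaDemo {
    public static void main(String[] args) {
        // The lemma annotator requires tokenization, sentence splitting and POS tags.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String[] sentences = {
            "This is confusing.",
            "I was confusing you with someone else."
        };
        for (String text : sentences) {
            CoreDocument doc = new CoreDocument(text);
            pipeline.annotate(doc);
            for (CoreLabel tok : doc.tokens()) {
                // Print each token with its POS tag and lemma.
                System.out.print(tok.word() + "/" + tok.tag() + "/" + tok.lemma() + " ");
            }
            System.out.println();
        }
    }
}
```

In the first sentence "confusing" should be tagged as an adjective and keep its form; in the second it should be tagged as a verb and come out as "confuse".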
how to use lucene for lemmatization and elimination of empty French words
It's easy: all you need is a FrenchAnalyzer, like this:
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_45, new FrenchAnalyzer(Version.LUCENE_45, FrenchAnalyzer.getDefaultStopSet()));
For empty (stop) words we use FrenchAnalyzer.getDefaultStopSet(), as in the code above. Lemmatization is already integrated into this analyzer; you can notice that when you look at the important words (by tf-idf).
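As a sketch of what the analysis step looks like on its own (the field name and sample sentence are made up; the exact tokens depend on the Lucene version):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical sample text: stop words like "les" should be dropped and
// the remaining tokens reduced to their normalized forms.
FrenchAnalyzer analyzer = new FrenchAnalyzer(Version.LUCENE_45);
TokenStream stream = analyzer.tokenStream("field",
        new StringReader("Les chats mangent les souris"));
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.print(term.toString() + " ");
}
stream.end();
stream.close();
```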
Stanford CoreNLP lemmatization
The CoreLabel class has a lemma() method that returns the lemma, e.g.
// token is a CoreLabel instance
String lemma = token.lemma();
How to handle LemmatizerTrainer 'UTFDataFormatException: encoded string too long'?
Recently, I've written a patch to cure OpenNLP-1366. The related PR https://github.com/apache/opennlp/pull/427 documents the problem and the solution in detail.
In this context, the upcoming OpenNLP version 2.0.1 will bring the fix for the problem reported in the OP. Updating to that version will resolve the crash while writing trained model files.
Note: I verified that the patch works with UD_German-HDT, UD_German-GSD, and other treebanks for the German language.
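Once 2.0.1 is released, updating the dependency should be all that's needed; a sketch for Maven users (assuming you depend on the core opennlp-tools artifact):

```xml
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>2.0.1</version>
</dependency>
```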