Lemmatization with Apache Lucene
In case someone still needs it, I decided to return to this question and illustrate how to use the russianmorphology library I found earlier to do lemmatization for English and Russian.
First of all, you will need these dependencies (besides lucene-core):
<!-- if you need Russian -->
<dependency>
    <groupId>org.apache.lucene.morphology</groupId>
    <artifactId>russian</artifactId>
    <version>1.1</version>
</dependency>
<!-- if you need English -->
<dependency>
    <groupId>org.apache.lucene.morphology</groupId>
    <artifactId>english</artifactId>
    <version>1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene.morphology</groupId>
    <artifactId>morph</artifactId>
    <version>1.1</version>
</dependency>
Then, make sure you import the right analyzer (and CharTermAttribute, used below to read the tokens):
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.morphology.english.EnglishAnalyzer;
import org.apache.lucene.morphology.russian.RussianAnalyzer;
These analyzers, unlike the standard Lucene analyzers, use a MorphologyFilter which converts each word into the set of its normal forms.
So if you use the following code
String text = "The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from \"produced\", the lemma is \"produce\", but the stem is \"produc-\". This is because there are words such as production";
Analyzer analyzer = new EnglishAnalyzer();
TokenStream stream = analyzer.tokenStream("field", text);
stream.reset();
while (stream.incrementToken()) {
    String lemma = stream.getAttribute(CharTermAttribute.class).toString();
    System.out.print(lemma + " ");
}
stream.end();
stream.close();
it will print
the stem be the part of the word that never change even when
morphologically inflected inflect a lemma be the base form of the word
for example from produced produce the lemma be produce but the stem be
produc this be because there are be word such as production
And for the Russian text
String text = "Продолжаю цикл постов об астрологии и науке. Астрология не имеет научного обоснования, но является частью истории науки, частью культуры и общественного сознания. Поэтому астрологический взгляд на науку весьма интересен.";
the RussianAnalyzer will print the following:
продолжать цикл пост об астрология и наука астрология не иметь научный
обоснование но являться часть частью история наука часть частью
культура и общественный сознание поэтому астрологический взгляд на
наука весьма интересный
You may notice that some words have more than one base form, e.g. inflected is converted to [inflected, inflect]. If you don't like this behaviour, you will have to change the implementation of org.apache.lucene.morphology.analyzer.MorphologyFilter (if you are interested in how exactly to do it, let me know and I'll elaborate on this).
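For instance, one way to keep only the first normal form of each word is a custom TokenFilter. This is a hypothetical sketch (the class name FirstLemmaOnlyFilter is mine), assuming that alternative normal forms are emitted at the same position, i.e. with a position increment of 0, which is the usual Lucene convention for synonym-like tokens; I have not verified exactly how russianmorphology emits them, so treat this as a starting point:

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Keeps only the first normal form of each word, assuming that
 * alternative forms are emitted with a position increment of 0.
 */
public final class FirstLemmaOnlyFilter extends TokenFilter {
    private final PositionIncrementAttribute posIncAtt =
            addAttribute(PositionIncrementAttribute.class);

    public FirstLemmaOnlyFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            // posInc > 0 means a new position, i.e. the first form of a word;
            // posInc == 0 marks an alternative form at the same position.
            if (posIncAtt.getPositionIncrement() > 0) {
                return true;
            }
        }
        return false;
    }
}
```

You could then build a custom Analyzer that chains this filter after the MorphologyFilter.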
Hope it helps, good luck!
How to use Stemmer or Lemmatizer to stem specific word
A complete example:
import nltk
from nltk.corpus import wordnet
from difflib import get_close_matches as gcm
from itertools import chain
from nltk.stem.porter import PorterStemmer

texts = ["apples are good. My teeth will fall out.",
         "roses are red. cars are great to have"]

lmtzr = nltk.WordNetLemmatizer()
stemmer = PorterStemmer()

for text in texts:
    tokens = nltk.word_tokenize(text)  # ideally you should sentence-tokenize first
    # Take your pick here between the lemmatizer and WordNet synsets.
    token_lemma = [lmtzr.lemmatize(token) for token in tokens]
    wn_lemma = [gcm(word, list(set(chain(*[i.lemma_names() for i in wordnet.synsets(word)])))) for word in tokens]
    # print(wn_lemma)  # works for unconventional words like 'teeth' --> 'tooth'; worth a closer look
    tokens_final = [stemmer.stem(tokens[i]) if len(tokens[i]) > len(token_lemma[i]) else token_lemma[i]
                    for i in range(len(tokens))]
    print(tokens_final)
Output
['appl', 'are', 'good', '.', 'My', 'teeth', 'will', 'fall', 'out', '.']
['rose', 'are', 'red', '.', 'car', 'are', 'great', 'to', 'have']
Explanation
Notice the expression stemmer.stem(tokens[i]) if len(tokens[i]) > len(token_lemma[i]) else token_lemma[i]: this is where the magic happens. If the lemma is shorter than the original word (the lemmatizer actually reduced it), the word gets stemmed instead; otherwise it just remains lemmatized.
Note
The lemmatization that you are attempting has some edge cases. WordNetLemmatizer is not smart enough to handle exceptional cases like 'teeth' --> 'tooth'. In those cases you may want to take a look at wordnet.synsets, which might come in handy.
I have included a small case in the comments for your investigation.
StanfordNLP lemmatization cannot handle -ing words
Lemmatization crucially depends on the part of speech of the token. Only tokens with the same part of speech are mapped to the same lemma.
In the sentence "This is confusing", confusing is analyzed as an adjective, and therefore it is lemmatized to confusing. In the sentence "I was confusing you with someone else", by contrast, confusing is analyzed as a verb, and is lemmatized to confuse.
If you want tokens with different parts of speech to be mapped to the same lemma, you can use a stemming algorithm such as Porter Stemming, which you can simply call on each token.
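To see the part-of-speech dependence in action, here is a minimal sketch using the Stanford CoreNLP CoreDocument API (the class name LemmaDemo and the printing format are mine; the exact output depends on the models you have loaded):

```java
import java.util.Properties;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class LemmaDemo {
    public static void main(String[] args) {
        // The lemma annotator requires tokenization, sentence splitting and POS tags.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String[] sentences = {
            "This is confusing.",
            "I was confusing you with someone else."
        };
        for (String text : sentences) {
            CoreDocument doc = new CoreDocument(text);
            pipeline.annotate(doc);
            for (CoreLabel tok : doc.tokens()) {
                // Print each token with its POS tag and lemma.
                System.out.print(tok.word() + "/" + tok.tag() + "/" + tok.lemma() + " ");
            }
            System.out.println();
        }
    }
}
```

In the first sentence "confusing" should be tagged as an adjective and keep its form; in the second it should be tagged as a verb and come out as "confuse".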
how to use lucene for lemmatization and elimination of empty French words
It's easy: all you need is a FrenchAnalyzer, like this:
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_45, new FrenchAnalyzer(Version.LUCENE_45, FrenchAnalyzer.getDefaultStopSet()));
For empty (stop) words we use FrenchAnalyzer.getDefaultStopSet(), as in the code above. Lemmatization is already integrated into this analyzer; you can notice that when you look at the important words (by tf-idf).
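As a sketch of what the analysis step looks like on its own (the field name and sample sentence are made up; the exact tokens depend on the Lucene version):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical sample text: stop words like "les" should be dropped and
// the remaining tokens reduced to their normalized forms.
FrenchAnalyzer analyzer = new FrenchAnalyzer(Version.LUCENE_45);
TokenStream stream = analyzer.tokenStream("field",
        new StringReader("Les chats mangent les souris"));
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.print(term.toString() + " ");
}
stream.end();
stream.close();
```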
Stanford CoreNLP lemmatization
The CoreLabel class has a lemma() method that returns the lemma, e.g.
// token is a CoreLabel instance
String lemma = token.lemma();
How to handle LemmatizerTrainer 'UTFDataFormatException: encoded string too long'?
Recently, I've written a patch to cure OpenNLP-1366. The related PR https://github.com/apache/opennlp/pull/427 documents the problem and the solution in detail.
In this context, the upcoming OpenNLP version 2.0.1 will bring the fix for the problem reported in the OP. Updating to that version will resolve the crash while writing trained model files.
Note: I verified that the patch works with UD_German-HDT, UD_German-GSD, and other treebanks for the German language.
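Once 2.0.1 is released, updating the dependency should be all that's needed; a sketch for Maven users (assuming you depend on the core opennlp-tools artifact):

```xml
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>2.0.1</version>
</dependency>
```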