WordNet lemmatization and POS tagging in Python
First of all, you can use nltk.pos_tag() directly without training it.
The function loads a pretrained tagger from a file. In older NLTK releases you can see the file name with the private attribute nltk.tag._POS_TAGGER (newer releases no longer expose it):
nltk.tag._POS_TAGGER
# 'taggers/maxent_treebank_pos_tagger/english.pickle'
As it was trained with the Treebank corpus, it also uses the Treebank tag set.
The following function would map the treebank tags to WordNet part of speech names:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''
You can then use the return value with the lemmatizer:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('going', wordnet.VERB)
# 'go'
Check the return value before passing it to the lemmatizer, because an empty string would raise a KeyError.
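One way to make that check explicit is a small wrapper around the lemmatizer call. This is only a sketch: safe_lemmatize is a hypothetical helper, not part of NLTK, and it takes the lemmatizing function as an argument so the guard can be tried without WordNet data installed. The toy_lemmatize stand-in is likewise an assumption made for the demonstration.

```python
def safe_lemmatize(lemmatize, word, wordnet_pos):
    """Lemmatize word only if wordnet_pos is a usable WordNet POS.

    lemmatize is any callable taking (word, pos), for example the
    lemmatize method of an nltk WordNetLemmatizer instance.
    """
    if not wordnet_pos:  # '' or None: the tag has no WordNet equivalent
        return word      # leave the word unchanged instead of raising
    return lemmatize(word, wordnet_pos)


# Stand-in lemmatizer for demonstration only (an assumption, not NLTK).
def toy_lemmatize(word, pos):
    table = {('going', 'v'): 'go'}
    return table.get((word, pos), word)


print(safe_lemmatize(toy_lemmatize, 'going', 'v'))  # prints: go
print(safe_lemmatize(toy_lemmatize, 'the', ''))     # prints: the
```

With NLTK itself, the equivalent call would be safe_lemmatize(lemmatizer.lemmatize, 'going', get_wordnet_pos('VBG')).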
Using WordNetLemmatizer.lemmatize() with pos_tags throws KeyError
You get a KeyError because wordnet does not use the same pos labels as the tagger. Based on the source code, the pos labels that wordnet accepts are adj, adv, noun and verb (wordnet.ADJ, wordnet.ADV, wordnet.NOUN and wordnet.VERB).
EDIT based on @bivouac0's comment:
To work around this issue you have to write a mapper. The mapping function is heavily based on the get_wordnet_pos function shown earlier. Words with an unsupported POS are not lemmatized.
import nltk
import pandas as pd
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    # Map a Treebank tag to a WordNet POS constant; None means "unsupported".
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_tagged(tagged_tokens):
    # Lemmatize (word, tag) pairs; words with an unsupported POS are kept as-is.
    lemmas = []
    for word, tag in tagged_tokens:
        pos = get_wordnet_pos(tag)
        lemmas.append(lemmatizer.lemmatize(word, pos=pos) if pos else word)
    return ' '.join(lemmas)

x = pd.DataFrame(data=[['this is a sample of text.'], ['one more text.']],
                 columns=['Phrase'])
# Tokenize and POS-tag each phrase, giving a list of (word, tag) pairs.
x['Phrase'] = x['Phrase'].apply(lambda v: nltk.pos_tag(nltk.word_tokenize(v)))
x['Phrase_lemma'] = x['Phrase'].apply(lemmatize_tagged)
NLTK: lemmatizer and pos_tag
You need to convert the tag from the pos_tagger to one of the four "syntactic categories" that wordnet recognizes, then pass that to the lemmatizer as the pos argument.
From the docs:
Syntactic category: n for noun files, v for verb files, a for adjective files, r for adverb files.
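As a rough illustration of that conversion, the mapping can be written as a lookup table keyed on the first letter of the Treebank tag. The names TREEBANK_PREFIX_TO_WORDNET and treebank_to_wordnet are hypothetical; the sketch uses the plain category letters from the docs quoted above rather than importing the wordnet constants, so it runs without NLTK.

```python
# First letter of a Penn Treebank tag -> WordNet syntactic category letter.
# The letters are the ones listed in the docs: n, v, a, r.
TREEBANK_PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}


def treebank_to_wordnet(treebank_tag):
    """Return 'a', 'v', 'n' or 'r' for a Treebank tag, or None if unsupported."""
    return TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:1])


for tag in ['JJ', 'VBG', 'NNS', 'RB', 'DT']:
    print(tag, '->', treebank_to_wordnet(tag))
```

Returning None for tags such as DT lets the caller decide whether to skip lemmatization or fall back to the lemmatizer's default.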
Why NLTK's Wordnet Lemmatizer Does Not Lemmatize Adverbs and Adjectives?
For the words lovely and absolutely, the lemma is the same as the word itself. Here are a few closely related words you can try in NLTK.
word:pos -> lemma
-------------------------
absolute:adj -> absolute
absolutely:adv -> absolutely
lovely:adj -> lovely
lovelier:adj -> lovely
loveliest:adj -> lovely
Be aware that to get the correct lemma you need the correct part-of-speech (pos) tag, and to get the correct pos tag you need to parse a well-formed sentence containing the word, so the tagger has the context. Without this, you will often get the wrong pos tag for the word.
In general, NLTK is fairly poor at pos tagging and at lemmatization. It's an old library that is rule-based and doesn't use more modern techniques. I would generally not recommend using NLTK.
spaCy is probably the most popular NLP system, and it does pos tagging and lemmatization (among other things) in the same step. Unfortunately, spaCy's lemmatizer uses the same basic design as NLTK's, and while its performance is better, it's still not the best.
LemmInflect gives the best overall performance, but it's only a lemma/inflection lookup. It doesn't include a pos tagger, so you still need to get the tag from somewhere. LemmInflect also acts as a plug-in for spaCy, and using the two together will give you the best performance. LemmInflect's homepage shows how to do this, along with some stats on performance compared to NLTK and spaCy.
However, remember that you won't get the right lemmas without the right pos tag, and for spaCy, or any tagger, to get that right, the word needs to be in a full sentence.