Wordnet Lemmatization and Pos Tagging in Python

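First of all, you can use nltk.pos_tag() directly without training it.
The function loads a pretrained tagger from a file. In older NLTK releases you can see
the file name with nltk.tag._POS_TAGGER (newer releases default to an averaged perceptron tagger and no longer expose this attribute):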

nltk.tag._POS_TAGGER  # 'taggers/maxent_treebank_pos_tagger/english.pickle'

As it was trained with the Treebank corpus, it also uses the Treebank tag set.

The following function would map the treebank tags to WordNet part of speech names:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the matching WordNet POS constant.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # No WordNet equivalent (determiners, prepositions, ...).
        return ''

You can then use the return value with the lemmatizer:

from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('going', wordnet.VERB)  # 'go'

Check the return value before passing it to the lemmatizer, because an empty string would raise a KeyError.
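One way to do that check is sketched below. This is only an illustration: lemmatize_checked is a hypothetical helper name, the POS codes are written as the plain strings 'a', 'v', 'n' and 'r' (the values behind wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV), and the lemmatizer is passed in as a callable so the sketch does not depend on the WordNet data being installed. Leaving unmapped words unchanged is a design choice; falling back to the lemmatizer's default POS would be another.

```python
# WordNet POS codes as plain strings; in NLTK they are also available as
# wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"


def get_wordnet_pos(treebank_tag):
    """Same mapping as above: '' when the tag has no WordNet equivalent."""
    if treebank_tag.startswith("J"):
        return ADJ
    if treebank_tag.startswith("V"):
        return VERB
    if treebank_tag.startswith("N"):
        return NOUN
    if treebank_tag.startswith("R"):
        return ADV
    return ""


def lemmatize_checked(word, treebank_tag, lemmatize):
    """Lemmatize `word` only when its tag maps to a WordNet POS.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize (hypothetical wiring, supplied by the caller).
    """
    pos = get_wordnet_pos(treebank_tag)
    if not pos:
        # Unmapped tag (determiner, preposition, ...): leave the word unchanged.
        return word
    return lemmatize(word, pos)
```

With NLTK and its WordNet data installed, a call would look like lemmatize_checked('going', 'VBG', WordNetLemmatizer().lemmatize).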

Using WordNetLemmatizer.lemmatize() with pos_tags throws KeyError

You get a KeyError because WordNet does not use the same POS labels as the tagger. Based on the source code, the POS labels WordNet accepts are these: adj, adv, noun and verb.

EDIT based on @bivouac0 's comment:

So to bypass this issue you have to write a mapper. The mapping function is heavily based on this answer. Words with an unsupported POS are not lemmatized.

import nltk
import pandas as pd
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

x = pd.DataFrame(data=[['this is a sample of text.'], ['one more text.']], 
                 columns=['Phrase'])

x['Phrase'] = x['Phrase'].apply(lambda v: nltk.pos_tag(nltk.word_tokenize(v)))

# Lemmatize tokens whose tag maps to a WordNet POS; keep the others unchanged.
x['Phrase_lemma'] = x['Phrase'].transform(
    lambda value: ' '.join(
        lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag)) if get_wordnet_pos(tag) else word
        for word, tag in value))

NLTK: lemmatizer and pos_tag

You need to convert the tag from the pos_tagger to one of the four "syntactic categories" that WordNet recognizes, then pass that to the lemmatizer as the pos argument.

From the docs:

Syntactic category: n for noun files, v for verb files, a for adjective files, r for adverb files.
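As a rough sketch of that conversion, the four categories can be kept in a lookup table keyed on the start of the Treebank tag. The names TREEBANK_PREFIX_TO_WORDNET, to_wordnet_pos and convert_tags are made up for this example. Matching on two letters rather than one is a choice made here, not something from the docs: it keeps the particle tag RP from being treated as an adverb.

```python
# Two-letter Penn Treebank prefix -> WordNet syntactic category.
TREEBANK_PREFIX_TO_WORDNET = {
    "JJ": "a",  # JJ, JJR, JJS
    "VB": "v",  # VB, VBD, VBG, VBN, VBP, VBZ
    "NN": "n",  # NN, NNS, NNP, NNPS
    "RB": "r",  # RB, RBR, RBS
}


def to_wordnet_pos(treebank_tag):
    """Return 'a', 'v', 'n' or 'r', or None when there is no equivalent."""
    return TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:2])


def convert_tags(tagged_tokens):
    """Convert [(word, treebank_tag), ...] to [(word, wordnet_pos or None), ...]."""
    return [(word, to_wordnet_pos(tag)) for word, tag in tagged_tokens]
```

The output of nltk.pos_tag() has the (word, tag) shape that convert_tags expects; tokens that come back with None should be skipped or passed to the lemmatizer without a pos.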

Why NLTK's Wordnet Lemmatizer Does Not Lemmatize Adverbs and Adjectives?

For the words lovely and absolutely, the lemma is the same as the word itself. Here are a few closely related words you can try in NLTK.

word:pos       -> lemma
-------------------------
absolute:adj   -> absolute
absolutely:adv -> absolutely
lovely:adj     -> lovely
lovelier:adj   -> lovely
loveliest:adj  -> lovely

Be aware that to get the correct lemma you need the correct part-of-speech (pos) tag, and to get the correct pos tag you need to parse a well formed sentence with the word in it, so the tagger has the context. Without this, you will often get the wrong pos tag for the word.

In general, NLTK is fairly poor at POS tagging and lemmatization. It's an old library; its lemmatizer is rule based and doesn't use more modern techniques. I would generally not recommend using NLTK.

spaCy is probably the most popular NLP system, and it does POS tagging and lemmatization (among other things) all in the same step. Unfortunately, spaCy's lemmatizer uses the same basic design as NLTK's, and while its performance is better, it's still not the best.

LemmInflect gives the best overall performance, but it's only a lemma/inflection lookup. It doesn't include a POS tagger, so you still need to get the tag from somewhere. LemmInflect also acts as a plug-in for spaCy, and using the two together will give you the best performance. LemmInflect's homepage shows how to do this, along with some stats on performance compared to NLTK and spaCy.

However, remember that you won't get the right lemmas without the right POS tag, and for spaCy, or any tagger, to get that right, the word needs to be in a full sentence.
