Python NLTK pos_tag not returning the correct part-of-speech tag
In short:
NLTK is not perfect. In fact, no model is perfect.
Note:
As of NLTK version 3.1, the default pos_tag function no longer uses the old MaxEnt English pickle.
It now uses the perceptron tagger from @Honnibal's implementation; see nltk.tag.pos_tag:
>>> import inspect
>>> from nltk import pos_tag
>>> print(inspect.getsource(pos_tag))
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)
It's better, but still not perfect:
>>> from nltk import pos_tag
>>> pos_tag("The quick brown fox jumps over the lazy dog".split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
If you just want a TL;DR solution, see https://github.com/alvations/nltk_cli
In long:
Try using another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag), e.g.:
- HunPos
- Stanford POS
- Senna
Using the old default MaxEnt POS tagger from NLTK (before version 3.1), i.e. nltk.pos_tag:
>>> from nltk import word_tokenize, pos_tag
>>> text = "The quick brown fox jumps over the lazy dog"
>>> pos_tag(word_tokenize(text))
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]
Using the Stanford POS tagger (newer NLTK releases name this class StanfordPOSTagger):
$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-2015-04-20.zip
$ unzip stanford-postagger-2015-04-20.zip
$ mv stanford-postagger-2015-04-20 stanford-postagger
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.stanford import POSTagger
>>> _path_to_model = home + '/stanford-postagger/models/english-bidirectional-distsim.tagger'
>>> _path_to_jar = home + '/stanford-postagger/stanford-postagger.jar'
>>> st = POSTagger(path_to_model=_path_to_model, path_to_jar=_path_to_jar)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[(u'The', u'DT'), (u'quick', u'JJ'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'jumps', u'VBZ'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]
Using HunPOS (NOTE: the default encoding is ISO-8859-1 not UTF8):
$ cd ~
$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ tar zxvf hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ gzip -d en_wsj.model.gz
$ mv en_wsj.model hunpos-1.0-linux/
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.hunpos import HunposTagger
>>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
>>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
>>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> ht.tag(text.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
Using Senna (make sure you have the latest version of NLTK; some changes were made to the API):
$ cd ~
$ wget http://ronan.collobert.com/senna/senna-v3.0.tgz
$ tar zxvf senna-v3.0.tgz
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.senna import SennaTagger
>>> st = SennaTagger(home+'/senna')
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[('The', u'DT'), ('quick', u'JJ'), ('brown', u'JJ'), ('fox', u'NN'), ('jumps', u'VBZ'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'NN')]
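The outputs above are easier to compare token by token. The following is a minimal sketch that uses only the tag sequences quoted in this answer; the helper name `disagreements` is made up for illustration, and the Stanford output serves as an arbitrary reference point, not as a gold standard.

```python
# Tag sequences copied from the sessions above, in token order.
tokens = "The quick brown fox jumps over the lazy dog".split()
outputs = {
    "perceptron": ["DT", "JJ", "NN", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
    "maxent":     ["DT", "NN", "NN", "NN", "NNS", "IN", "DT", "NN", "NN"],
    "stanford":   ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
    "hunpos":     ["DT", "JJ", "JJ", "NN", "NNS", "IN", "DT", "JJ", "NN"],
    "senna":      ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
}


def disagreements(tags_a, tags_b, tokens):
    """Return (token, tag_a, tag_b) for every position where the tags differ."""
    return [(tok, a, b) for tok, a, b in zip(tokens, tags_a, tags_b) if a != b]


# Compare every other tagger against the Stanford output.
for name in ("perceptron", "maxent", "hunpos", "senna"):
    diff = disagreements(outputs[name], outputs["stanford"], tokens)
    print(name, "vs stanford:", diff)
```

On this one sentence, the old MaxEnt output differs from the Stanford output on four tokens, HunPos on one ("jumps"), and Senna on none. A single sentence says little about overall accuracy, so treat this as a quick sanity check only.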
Or try building a better POS tagger:
- Ngram Tagger: http://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/
- Affix/Regex Tagger: http://streamhacker.com/2008/11/10/part-of-speech-tagging-with-nltk-part-2/
- Build Your Own Brill (read the code, it's a pretty fun tagger: http://www.nltk.org/_modules/nltk/tag/brill.html), see http://streamhacker.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/
- Perceptron Tagger: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
- LDA Tagger: http://scm.io/blog/hack/2015/02/lda-intentions/
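To give a feel for what the affix/regex approach from the list above looks like, here is a minimal pure-Python sketch. The rules and the name `regex_tag` are hypothetical and written only for illustration; they are not taken from the linked posts or from NLTK.

```python
import re

# Hypothetical hand-written rules; the first matching pattern wins.
RULES = [
    (re.compile(r"^(the|a|an)$", re.IGNORECASE), "DT"),  # articles
    (re.compile(r".*ing$"), "VBG"),                      # gerunds
    (re.compile(r".*ed$"), "VBD"),                       # simple past
    (re.compile(r".*s$"), "NNS"),                        # plural nouns (crude)
]


def regex_tag(tokens, default="NN"):
    """Tag each token with the first matching rule, else with `default`."""
    tagged = []
    for token in tokens:
        for pattern, tag in RULES:
            if pattern.match(token):
                tagged.append((token, tag))
                break
        else:
            # No rule matched, so fall back to the default tag.
            tagged.append((token, default))
    return tagged


print(regex_tag("The quick brown fox jumps over the lazy dog".split()))
```

Note that these rules tag "jumps" as NNS, the same mistake the old MaxEnt tagger makes above, which is why such rules are usually used as a fallback behind a trained tagger rather than on their own.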
Complaints about pos_tag accuracy on Stack Overflow include:
- POS tagging - NLTK thinks noun is adjective
- python NLTK POS tagger not behaving as expected
- How to obtain better results using NLTK pos tag
- pos_tag in NLTK does not tag sentences correctly
Issues about NLTK HunPos include:
- How do I tag textfiles with hunpos in nltk?
- Does anyone know how to configure the hunpos wrapper class on nltk?
Issues with NLTK and Stanford POS tagger include:
- trouble importing stanford pos tagger into nltk
- Java Command Fails in NLTK Stanford POS Tagger
- Error using Stanford POS Tagger in NLTK Python
- How to improve speed with Stanford NLP Tagger and NLTK
- Nltk stanford pos tagger error : Java command failed
- Instantiating and using StanfordTagger within NLTK
- Running Stanford POS tagger in NLTK leads to "not a valid Win32 application" on Windows
pos_tag in NLTK does not tag sentences correctly
Short answer: you can't. Slightly longer answer: you can override specific words using a manually created UnigramTagger. See my answer for custom tagging with nltk for details on this method.
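The override idea can be sketched without NLTK as a post-processing step over any tagger's output. This is not the UnigramTagger itself: `tag_with_overrides` and `fake_tagger` are made-up names, and `fake_tagger` only replays the old MaxEnt output quoted earlier so the example runs on its own. In NLTK, passing a word-to-tag dict as `model` and another tagger as `backoff` to UnigramTagger should give a similar effect.

```python
def tag_with_overrides(tokens, overrides, fallback_tagger):
    """Tag tokens, preferring hand-picked tags over the fallback tagger's.

    overrides: dict mapping a word to the tag it should always receive.
    fallback_tagger: any callable that takes a token list and returns
    (word, tag) pairs, for example nltk.pos_tag.
    """
    return [(word, overrides.get(word, tag))
            for word, tag in fallback_tagger(tokens)]


def fake_tagger(tokens):
    """Stand-in tagger (an assumption): replays the old MaxEnt output."""
    canned = {"The": "DT", "quick": "NN", "brown": "NN", "fox": "NN",
              "jumps": "NNS", "over": "IN", "the": "DT", "lazy": "NN",
              "dog": "NN"}
    return [(tok, canned.get(tok, "NN")) for tok in tokens]


# Words whose tags we want to force, regardless of what the tagger says.
overrides = {"quick": "JJ", "brown": "JJ", "jumps": "VBZ", "lazy": "JJ"}
tokens = "The quick brown fox jumps over the lazy dog".split()
print(tag_with_overrides(tokens, overrides, fake_tagger))
```

The obvious limitation is that an override is context-free: a word such as "jumps" gets the forced tag everywhere, even in sentences where it really is a plural noun.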
Python NLTK: How to tag sentences with the simplified set of part-of-speech tags?
To simplify tags from the default tagger, you can use nltk.tag.simplify.simplify_wsj_tag (available in NLTK 2.x; the module was removed in NLTK 3), like so:
>>> import nltk
>>> from nltk.tag.simplify import simplify_wsj_tag
>>> tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
>>> tagged_sent = nltk.pos_tag(tokens)
>>> simplified = [(word, simplify_wsj_tag(tag)) for word, tag in tagged_sent]
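If nltk.tag.simplify is not available in your NLTK version, the same idea can be sketched with a plain dictionary. The mapping below is partial and hand-written for illustration (an assumption, not NLTK's own table), and `simplify` is a made-up helper name. The pos_tag signature shown earlier has a `tagset` parameter, so `nltk.pos_tag(tokens, tagset='universal')` is probably the more direct route in NLTK 3.

```python
# Partial, hand-written mapping from Penn Treebank tags to coarse
# categories. This is an illustrative assumption, not NLTK's table.
COARSE = {
    "DT": "DET",
    "JJ": "ADJ",
    "NN": "NOUN",
    "NNS": "NOUN",
    "VBZ": "VERB",
    "IN": "ADP",
}


def simplify(tagged_sent, mapping=COARSE, unknown="X"):
    """Replace each fine-grained tag with its coarse category."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_sent]


# Perceptron tagger output quoted earlier on this page.
tagged_sent = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"),
               ("fox", "NN"), ("jumps", "VBZ"), ("over", "IN"),
               ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(simplify(tagged_sent))
```

Tags missing from the mapping fall back to "X", which makes gaps in the table easy to spot.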