NLTK Named Entity recognition to a Python list
nltk.ne_chunk returns a nested nltk.tree.Tree object, so you would have to traverse the Tree object to get to the NEs.
Take a look at Named Entity Recognition with Regular Expression: NLTK
>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> from nltk.tree import Tree
>>>
>>> def get_continuous_chunks(text):
...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
...     continuous_chunk = []
...     current_chunk = []
...     for i in chunked:
...         # Named entities are Tree nodes; other tokens are (word, tag) tuples.
...         if isinstance(i, Tree):
...             current_chunk.append(" ".join([token for token, pos in i.leaves()]))
...         if current_chunk:
...             named_entity = " ".join(current_chunk)
...             if named_entity not in continuous_chunk:
...                 continuous_chunk.append(named_entity)
...             # Reset even for a repeated entity so it is not glued onto the next one.
...             current_chunk = []
...     return continuous_chunk
...
>>> my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
>>> get_continuous_chunks(my_sent)
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']
>>> my_sent = "How's the weather in New York and Brooklyn"
>>> get_continuous_chunks(my_sent)
['New York', 'Brooklyn']
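If the entity type is needed as well as the text, the same traversal can return (text, label) pairs. The following is a minimal sketch, not part of the original answer: extract_labeled_entities is a hypothetical helper, and MiniTree is a stand-in class that mimics only the label() and leaves() methods of nltk.tree.Tree so that the example runs without NLTK or its models. The hand-built tree and its GPE labels are illustrative, not real tagger output. With NLTK installed, the result of ne_chunk(pos_tag(word_tokenize(text))) could probably be passed in instead.

```python
class MiniTree(list):
    """Hypothetical stand-in for nltk.tree.Tree: a list of children with a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label

    def leaves(self):
        # Children here are always (token, pos) tuples, so no recursion is needed.
        return list(self)


def extract_labeled_entities(chunked):
    """Return (entity_text, entity_label) pairs in order of appearance."""
    entities = []
    for node in chunked:
        # Entity subtrees expose label(); plain tokens are (token, pos) tuples.
        if hasattr(node, "label"):
            text = " ".join(token for token, pos in node.leaves())
            entities.append((text, node.label()))
    return entities


# Hand-built tree shaped like ne_chunk output (assumed labels, for illustration).
chunked = MiniTree("S", [
    ("How", "WRB"), ("'s", "VBZ"), ("the", "DT"), ("weather", "NN"), ("in", "IN"),
    MiniTree("GPE", [("New", "NNP"), ("York", "NNP")]),
    ("and", "CC"),
    MiniTree("GPE", [("Brooklyn", "NNP")]),
])

print(extract_labeled_entities(chunked))
# [('New York', 'GPE'), ('Brooklyn', 'GPE')]
```

Unlike get_continuous_chunks, this sketch keeps repeated entities, which may matter if you want to count mentions.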
NLP Named Entity Recognition using NLTK and Spacy
spaCy models are statistical, so the named entities they recognize depend on the data sets they were trained on.
According to the spaCy documentation, a named entity is a “real-world object” that’s assigned a name, for example a person, a country, a product or a book title.
For example, the name Zoni is not common, so the model doesn't recognize it as a named entity (person). If I change Zoni to William in your sentence, spaCy recognizes William as a person.
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp('William I want to find a pencil, a eraser and a sharpener')
for entity in doc.ents:
    print(entity.label_, ' | ', entity.text)
#output
PERSON | William
One might expect pencil, eraser and sharpener to be classified as products, because the spaCy documentation lists objects under the PRODUCT label. That does not seem to be the case for the three objects in your sentence.
I also noticed that if no named entities are found in the input text, doc.ents is empty and nothing is printed.
import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp('Zoni I want to find a pencil, a eraser and a sharpener')
if not doc.ents:
    print('No named entities were recognized in the input text.')
else:
    for entity in doc.ents:
        print(entity.label_, ' | ', entity.text)
Named Entity Recognition using NLTK: Extract Auditor name, address and organisation
Try spacy instead of NLTK:
https://spacy.io/usage/linguistic-features#named-entities
I think spacy's pretrained models are likely to perform better. The results (with spacy 2.1, en_core_web_lg) for your sentence are:
Alastair John Richard Nuttall PERSON
Ernst & Young LLP ORG
Leeds GPE
Named Entity Recognition for NLTK in Python. Identifying the NE
This answer may be off base, in which case I'll delete it, as I don't have NLTK installed here to try it, but I think you can just do:
>>> sent3[2].node
'NE'
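sent3[2][0] returns the first child of the tree, not the node itself.
Note that in NLTK 3 and later the node attribute was replaced by the label() method, so there the call is sent3[2].label().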
Edit: I tried this when I got home, and it does indeed work.
Extracting multi-word named entities using NLTK Stanford NER in Python
The StanfordNERTagger in nltk doesn't retain information on the boundaries of named entities. If you try to parse the output of the tagger, there is no way to tell whether two consecutive nouns with the same tag are part of the same entity or whether they are distinct.
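A common workaround is to merge consecutive tokens that carry the same non-O tag, which also shows the limitation directly. This is a minimal sketch, not the original answer's code: merge_tagged_tokens is a hypothetical helper, and the tagged list is hand-written in the (token, tag) shape that the tagger's tag method returns, assuming 'O' marks tokens outside any entity.

```python
from itertools import groupby


def merge_tagged_tokens(tagged):
    """Merge runs of consecutive tokens that share the same entity tag."""
    entities = []
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag != "O":  # "O" marks tokens outside any entity
            entities.append((" ".join(token for token, _ in run), tag))
    return entities


# Hand-written sample in the (token, tag) format; not real tagger output.
tagged = [
    ("Loretta", "PERSON"), ("Lynch", "PERSON"), ("visited", "O"),
    ("Brooklyn", "LOCATION"), ("New", "LOCATION"), ("York", "LOCATION"),
]

print(merge_tagged_tokens(tagged))
# [('Loretta Lynch', 'PERSON'), ('Brooklyn New York', 'LOCATION')]
```

"Loretta Lynch" is merged correctly, but "Brooklyn" and "New York" collapse into a single LOCATION because nothing in the tags marks where one entity ends and the next begins.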
Alternatively, https://stanfordnlp.github.io/CoreNLP/other-languages.html#python indicates that the Stanford team is actively developing a Python package called Stanza, which uses Stanford CoreNLP. It is slow, but really easy to use.
$ pip3 install stanza
>>> import stanza
>>> stanza.download('en')
>>> nlp = stanza.Pipeline('en')
>>> results = nlp(<insert your text string here>)
The chunked entities are in results.ents.
NLTK Named Entity recognition for a column in a dataset
Try apply:
df['ne'] = df['content'].apply(get_continuous_chunks)
For the code in your second example, create a function and apply it the same way:
def my_st(text):
    tokenized_text = word_tokenize(text)
    return st.tag(tokenized_text)

df['st'] = df['content'].apply(my_st)