NLTK Named Entity Recognition to a Python List

NLTK Named Entity recognition to a Python list

nltk.ne_chunk returns a nested nltk.tree.Tree object, so you have to traverse the Tree object to get to the NEs.

Take a look at Named Entity Recognition with Regular Expression: NLTK

>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> from nltk.tree import Tree
>>>
>>> def get_continuous_chunks(text):
...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
...     continuous_chunk = []
...     current_chunk = []
...     for i in chunked:
...         if type(i) == Tree:
...             current_chunk.append(" ".join([token for token, pos in i.leaves()]))
...         if current_chunk:
...             named_entity = " ".join(current_chunk)
...             if named_entity not in continuous_chunk:
...                 continuous_chunk.append(named_entity)
...                 current_chunk = []
...         else:
...             continue
...     return continuous_chunk
...
>>> my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
>>> get_continuous_chunks(my_sent)
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']

>>> my_sent = "How's the weather in New York and Brooklyn"
>>> get_continuous_chunks(my_sent)
['New York', 'Brooklyn']

NLP Named Entity Recognition using NLTK and Spacy

spaCy models are statistical, so the named entities these models recognize depend on the data sets they were trained on.

According to the spaCy documentation, a named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title.

For example, the name Zoni is not common, so the model doesn't recognize it as a named entity (person). If I change the name Zoni to William in your sentence, spaCy recognizes William as a person.

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp('William I want to find a pencil, a eraser and a sharpener')

for entity in doc.ents:
    print(entity.label_, ' | ', entity.text)

# output
PERSON | William

One would assume that pencil, eraser and sharpener are objects, so they could potentially be classified as products, since the spaCy documentation states that 'objects' are products. But that does not seem to be the case for the three objects in your sentence.

I also noted that if no named entities are found in the input text, then doc.ents is empty.

import spacy

nlp = spacy.load("en_core_web_lg")

doc = nlp('Zoni I want to find a pencil, a eraser and a sharpener')
if not doc.ents:
    print('No named entities were recognized in the input text.')
else:
    for entity in doc.ents:
        print(entity.label_, ' | ', entity.text)

Named Entity Recognition using NLTK: Extract Auditor name, address and organisation

Try spacy instead of NLTK:

https://spacy.io/usage/linguistic-features#named-entities

I think spaCy's pretrained models are likely to perform better. The results (with spaCy 2.1, en_core_web_lg) for your sentence are:

Alastair John Richard Nuttall PERSON

Ernst & Young LLP ORG

Leeds GPE

Named Entity Recognition for NLTK in Python. Identifying the NE

This answer may be off base, in which case I'll delete it, as I don't have NLTK installed here to try it, but I think you can just do:

>>> sent3[2].node
'NE'

sent3[2][0] returns the first child of the tree, not the node itself

Edit: I tried this when I got home, and it does indeed work.
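Note that .node is the pre-3.0 spelling: in NLTK 3.x the Tree.node attribute was removed in favor of Tree.label(). A minimal sketch (sent3 isn't available here, so the chunk subtree is built by hand):

```python
from nltk.tree import Tree

# sent3 isn't available here, so build a comparable chunk subtree by hand,
# shaped like the subtrees ne_chunk produces.
ne = Tree('NE', [('Hampshire', 'NNP')])

# In NLTK 3.x, use .label() where older code used .node:
print(ne.label())  # NE
```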

Extracting multi-word named entities using NLTK Stanford NER in Python

The StanfordNERTagger in nltk doesn't retain information on the boundaries of named entities. If you try to parse the output of the tagger, there is no way to tell whether two consecutive nouns with the same tag are part of the same entity or whether they are distinct.
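To make that limitation concrete, here is a minimal sketch of the obvious workaround, grouping consecutive tokens that share a tag. The (token, tag) pairs are hypothetical tagger output, not produced by actually running StanfordNERTagger:

```python
from itertools import groupby

# Hypothetical flat (token, tag) output, shaped like StanfordNERTagger's.
tagged = [('Barack', 'PERSON'), ('Obama', 'PERSON'), ('visited', 'O'),
          ('New', 'LOCATION'), ('York', 'LOCATION')]

# Merge consecutive tokens with the same tag into one entity.
entities = []
for tag, group in groupby(tagged, key=lambda pair: pair[1]):
    if tag != 'O':
        entities.append((' '.join(token for token, _ in group), tag))

print(entities)  # [('Barack Obama', 'PERSON'), ('New York', 'LOCATION')]
```

The catch is exactly the boundary problem above: if two distinct people appeared back to back, both tagged PERSON, this grouping would wrongly merge them into a single entity, and the tagger's output gives you no way to tell.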

Alternatively, https://stanfordnlp.github.io/CoreNLP/other-languages.html#python indicates that the Stanford team is actively developing a Python package called Stanza that uses Stanford CoreNLP. It is slow, but really easy to use.

$ pip3 install stanza

>>> import stanza
>>> stanza.download('en')
>>> nlp = stanza.Pipeline('en')
>>> results = nlp(<insert your text string here>)

The chunked entities are in results.ents.

NLTK Named Entity recognition for a column in a dataset

Try apply:

df['ne'] = df['content'].apply(get_continuous_chunks)

For the code in your second example, create a function and apply it the same way:

def my_st(text):
    tokenized_text = word_tokenize(text)
    return st.tag(tokenized_text)

df['st'] = df['content'].apply(my_st)
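To see the apply pattern in isolation, here is a minimal sketch with a toy stand-in function (fake_entities is hypothetical, just so the example runs without loading any NLTK models or taggers):

```python
import pandas as pd

# Toy stand-in for get_continuous_chunks: grab capitalized words,
# so the apply pattern runs without any NLTK models.
def fake_entities(text):
    return [w for w in text.split() if w[:1].isupper()]

df = pd.DataFrame({'content': ['Alice met Bob in Paris', 'no entities here']})
df['ne'] = df['content'].apply(fake_entities)
print(df['ne'].tolist())  # [['Alice', 'Bob', 'Paris'], []]
```

apply calls the function once per row of the column and collects the return values into the new column, which is exactly what happens with get_continuous_chunks or my_st above.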

