How to Extract an Address from Raw Text Using NLTK in Python

How can I extract address from raw text using NLTK in python?

Definitely regular expressions :)

Something like

import re

txt = "The office is at 44 West 22nd Street, New York, NY 12345."  # sample text for illustration
regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)

# address = ['44 West 22nd Street, New York, NY 12345']

Explanation:

[0-9]{1,3}: 1 to 3 digits, the address number

(space): a space between the number and the street name

.+: street name, any character for any number of occurrences

, : a comma and a space before the city

.+: city, any character for any number of occurrences

, : a comma and a space before the state

[A-Z]{2}: exactly 2 uppercase letters, the state abbreviation

[0-9]{5}: 5 digits, the ZIP code

re.findall(expr, string) returns a list with all the (non-overlapping) occurrences found.
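For example, using a non-greedy .+? so that two addresses in the same string are not merged into a single match, findall returns each one separately. The second address below is made up purely for illustration:

import re

txt = ("Visit us at 44 West 22nd Street, New York, NY 12345 "
       "or at 123 Main Street, Springfield, IL 62701.")  # second address is hypothetical
regexp = r"[0-9]{1,3} .+?, .+?, [A-Z]{2} [0-9]{5}"
print(re.findall(regexp, txt))
# ['44 West 22nd Street, New York, NY 12345', '123 Main Street, Springfield, IL 62701']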

Improving the extraction of human names with NLTK

I must agree with the suggestion that "make my code better" isn't well suited for this site, but I can point you to a few places to dig in.

Disclaimer: This answer is ~7 years old. It definitely needs to be updated for newer Python and NLTK versions. Please try to do that yourself, and if it works, share your know-how with us.

Take a look at the Stanford Named Entity Recognizer (NER). A binding for it has been included since NLTK 2.0, but you must download some core files. Here is a script which can do all of that for you.

I wrote this script:

import nltk
from nltk.tag.stanford import NERTagger  # renamed to StanfordNERTagger in NLTK 3+

st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz',
               'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1] == 'PERSON':
            print(tag)

and got output that was not bad at all:

('Francois', 'PERSON')
('R.', 'PERSON')
('Velde', 'PERSON')
('Richard', 'PERSON')
('Branson', 'PERSON')
('Virgin', 'PERSON')
('Galactic', 'PERSON')
('Bitcoin', 'PERSON')
('Bitcoin', 'PERSON')
('Paul', 'PERSON')
('Krugman', 'PERSON')
('Larry', 'PERSON')
('Summers', 'PERSON')
('Bitcoin', 'PERSON')
('Nick', 'PERSON')
('Colas', 'PERSON')
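If you are on a current Python/NLTK stack, a minimal sketch of the same idea would look roughly like this (assuming the Stanford NER model and jar sit at the same paths as above; NLTK 3 renamed the class to StanfordNERTagger):

import nltk
from nltk.tag.stanford import StanfordNERTagger

# Assumption: the Stanford NER files were downloaded and unpacked into ./stanford-ner/
st = StanfordNERTagger('stanford-ner/all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tags = st.tag(nltk.word_tokenize(sent))
    persons = [token for token, label in tags if label == 'PERSON']
    print(persons)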

Hope this is helpful.

How to extract countries from a text?

You could use pycountry for this task (it also works with Python 3):

pip install pycountry

import pycountry

text = "United States (New York), United Kingdom (London)"
for country in pycountry.countries:
    if country.name in text:
        print(country.name)
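Note that country.name is the formal ISO name, so a plain substring test misses entries like "Bolivia, Plurinational State of" when the text just says "Bolivia". A slightly more forgiving sketch (official_name and common_name exist only on some pycountry entries, hence the getattr fallback):

import pycountry

text = "Bolivia and Venezuela signed the agreement in London."
found = set()
for country in pycountry.countries:
    # Fall back to empty strings for entries without these optional attributes.
    names = {country.name,
             getattr(country, 'official_name', ''),
             getattr(country, 'common_name', '')}
    if any(name and name in text for name in names):
        found.add(country.name)
print(sorted(found))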

Corpus extraction of nouns using NLTK

For each sentence you get a list of (word, tag) pairs (let's call the tag "pos") with tagged = nltk.pos_tag(words). E.g., for the first sentence

u"PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all."

you would get:

[(u'PRESIDENT', 'NNP'), (u'GEORGE', 'NNP'), (u'W.', 'NNP'), (u'BUSH','NNP'), 
(u"'S", 'POS'), (u'ADDRESS', 'NNP'), (u'BEFORE', 'IN'), (u'A', 'NNP'), (u'JOINT', 'NNP'),
(u'SESSION', 'NNP'), (u'OF', 'IN'), (u'THE', 'NNP'), (u'CONGRESS', 'NNP'), (u'ON', 'NNP'),
(u'THE', 'NNP'), (u'STATE', 'NNP'), (u'OF', 'IN'), (u'THE', 'NNP'), (u'UNION', 'NNP'),
(u'January', 'NNP'), (u'31', 'CD'), (u',', ','), (u'2006', 'CD'), (u'THE', 'NNP'),
(u'PRESIDENT', 'NNP'), (u':', ':'), (u'Thank', 'NNP'), (u'you', 'PRP'), (u'all', 'DT'),
(u'.', '.')]

If you want to retrieve all the words whose pos is 'NN', 'NNP', 'NNS' or 'NNPS', you can do

nouns = [word for (word, pos) in tagged if pos in ['NN','NNP','NNS','NNPS']]

Then you would get a list of nouns for each sentence:

[u'PRESIDENT', u'GEORGE', u'W.', u'BUSH', u'ADDRESS', u'A', u'JOINT', u'SESSION', u'THE', u'CONGRESS', u'ON', u'THE', u'STATE', u'THE', u'UNION', u'January', u'THE', u'PRESIDENT', u'Thank']
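Putting it together for a whole text, a minimal sketch (assuming the punkt tokenizer and POS tagger models have already been fetched via nltk.download):

import nltk

# Assumption: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
# were run once beforehand.
text = (u"PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS "
        u"ON THE STATE OF THE UNION. January 31, 2006. THE PRESIDENT: Thank you all.")

for sent in nltk.sent_tokenize(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))
    nouns = [word for (word, pos) in tagged if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
    print(nouns)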

Extracting a URL in Python

In response to the OP's edit, I borrowed the approach from "Find Hyperlinks in Text using Python (twitter related)" and came up with this:

import re

myString = "This is my tweet check it out http://example.com/blah"

print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
# http://example.com/blah
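Note that re.search returns None when no URL is present, so chaining .group onto it will raise AttributeError in that case. A small sketch that collects every URL and copes with the no-match case (findall simply returns an empty list):

import re

def extract_urls(text):
    # \S+ stops at the first whitespace character after the scheme.
    return re.findall(r"https?://\S+", text)

print(extract_urls("This is my tweet check it out http://example.com/blah"))  # ['http://example.com/blah']
print(extract_urls("no links here"))  # []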

