How can I extract address from raw text using NLTK in python?
Definitely regular expressions :)
Something like
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3} : 1 to 3 digits, the house number
(space) : a space between the number and the street name
.+ : the street name, any character repeated one or more times
, (comma, space) : a comma and a space before the city
.+ : the city, any character repeated one or more times
, (comma, space) : a comma and a space before the state
[A-Z]{2} : exactly 2 uppercase letters from A to Z, the state code
(space) : a space between the state and the ZIP code
[0-9]{5} : 5 digits, the ZIP code
re.findall(expr, string) returns a list of all the occurrences found. Because .+ is greedy, two addresses on the same line may come back as a single match.
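As a hedged illustration, the sketch below runs the pattern from the answer on a made-up sentence and compares it with a slightly stricter variant. The sample text, the name find_addresses, and the [^,\n]+ tightening are my own assumptions, not part of the original answer, and real-world addresses vary far more than either pattern allows.

```python
import re

# Pattern from the answer above: greedy ".+" for street and city.
GREEDY = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"

# Assumed variant (not from the original answer): "[^,\n]+" stops each field
# at the next comma, and "\b" keeps the match from starting or ending mid-number.
STRICT = r"\b[0-9]{1,3} [^,\n]+, [^,\n]+, [A-Z]{2} [0-9]{5}\b"


def find_addresses(text, pattern=STRICT):
    """Return every substring of text that matches the address pattern."""
    return re.findall(pattern, text)


# Made-up sample text with two addresses on one line.
sample = ("Offices: 44 West 22nd Street, New York, NY 12345 "
          "and 7 Main Road, Springfield, IL 62704.")

strict_hits = find_addresses(sample)
greedy_hits = find_addresses(sample, GREEDY)
print(strict_hits)
print(greedy_hits)
```

On this sample the greedy pattern returns both addresses glued together as one string, while the stricter variant returns them separately.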
Improving the extraction of human names with nltk
I agree that "make my code better" isn't well suited to this site, but I can point you to something you can dig into.
Disclaimer: This answer is ~7 years old and needs to be updated for newer Python and NLTK versions. Please try updating it yourself and, if it works, share your know-how with us.
Take a look at the Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v2.0, but you must download some core files. Here is a script that can do all of that for you.
I wrote this script:
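import nltk
from nltk.tag.stanford import NERTagger  # newer NLTK releases name this class StanfordNERTagger

# Paths to the downloaded Stanford NER model and jar
st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        # Keep only tokens labelled as part of a person's name
        if tag[1] == 'PERSON':
            print(tag)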
and got output that is not bad, although it includes false positives such as 'Virgin', 'Galactic', and 'Bitcoin':
('Francois', 'PERSON')
('R.', 'PERSON')
('Velde', 'PERSON')
('Richard', 'PERSON')
('Branson', 'PERSON')
('Virgin', 'PERSON')
('Galactic', 'PERSON')
('Bitcoin', 'PERSON')
('Bitcoin', 'PERSON')
('Paul', 'PERSON')
('Krugman', 'PERSON')
('Larry', 'PERSON')
('Summers', 'PERSON')
('Bitcoin', 'PERSON')
('Nick', 'PERSON')
('Colas', 'PERSON')
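The tagger labels one token at a time, so a full name arrives as several separate tuples. A minimal sketch of joining adjacent PERSON tokens back into names is shown here; the helper name merge_person_tokens and the sample input are hypothetical, and the "O" label for non-entity tokens is an assumption about the tagger's output format.

```python
from itertools import groupby


def merge_person_tokens(tagged_tokens):
    """Join runs of adjacent PERSON-tagged tokens into full names."""
    names = []
    # groupby splits the sequence into runs of tokens sharing the same tag.
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == "PERSON":
            names.append(" ".join(token for token, _ in run))
    return names


# Hypothetical tagger output for one sentence; "O" marks non-entity tokens.
tagged = [
    ("Paul", "PERSON"), ("Krugman", "PERSON"), ("and", "O"),
    ("Larry", "PERSON"), ("Summers", "PERSON"), ("disagree", "O"), (".", "O"),
]

full_names = merge_person_tokens(tagged)
print(full_names)
```

Note that two different people mentioned back to back with nothing in between would be merged into one name by this approach.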
Hope this is helpful.
How to extract countries from a text?
You could use pycountry for this task (it also works with Python 3):
pip install pycountry
import pycountry
text = "United States (New York), United Kingdom (London)"
for country in pycountry.countries:
    # Plain substring test against each country's name
    if country.name in text:
        print(country.name)
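A plain substring test can over-match, for example "Niger" is a substring of "Nigeria". The sketch below shows one possible way to require whole-word matches; the function find_countries and the short stand-in name list are hypothetical, used so the example runs without pycountry installed.

```python
import re


def find_countries(text, country_names):
    """Return the names that occur in text as whole words, in list order."""
    found = []
    for name in country_names:
        # re.escape guards names containing regex metacharacters;
        # the lookarounds reject matches embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text):
            found.append(name)
    return found


# Short stand-in list; with pycountry you could pass
# [country.name for country in pycountry.countries] instead.
names = ["Niger", "Nigeria", "United Kingdom", "United States"]

hits = find_countries("Lagos is in Nigeria, not in the United Kingdom.", names)
print(hits)
```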
corpus extraction of noun using nltk
For each sentence, tagged = nltk.pos_tag(words) gives you a list of (word, tag) pairs (let's call the tag "pos"). E.g., for the first sentence
u"PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all."
you would get:
[(u'PRESIDENT', 'NNP'), (u'GEORGE', 'NNP'), (u'W.', 'NNP'), (u'BUSH','NNP'),
(u"'S", 'POS'), (u'ADDRESS', 'NNP'), (u'BEFORE', 'IN'), (u'A', 'NNP'), (u'JOINT', 'NNP'),
(u'SESSION', 'NNP'), (u'OF', 'IN'), (u'THE', 'NNP'), (u'CONGRESS', 'NNP'), (u'ON', 'NNP'),
(u'THE', 'NNP'), (u'STATE', 'NNP'), (u'OF', 'IN'), (u'THE', 'NNP'), (u'UNION', 'NNP'),
(u'January', 'NNP'), (u'31', 'CD'), (u',', ','), (u'2006', 'CD'), (u'THE', 'NNP'),
(u'PRESIDENT', 'NNP'), (u':', ':'), (u'Thank', 'NNP'), (u'you', 'PRP'), (u'all', 'DT'),
(u'.', '.')]
If you want to retrieve all the words with pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS', you can do
nouns = [word for (word, pos) in tagged if pos in ['NN','NNP','NNS','NNPS']]
Then you would get a list of nouns for each sentence:
[u'PRESIDENT', u'GEORGE', u'W.', u'BUSH', u'ADDRESS', u'A', u'JOINT', u'SESSION', u'THE', u'CONGRESS', u'ON', u'THE', u'STATE', u'THE', u'UNION', u'January', u'THE', u'PRESIDENT', u'Thank']
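Since the four noun tags listed above all begin with "NN", the filter can also be written as a prefix check. This is only a sketch of an equivalent formulation; the function name extract_nouns is hypothetical, and the input is a handful of (word, tag) pairs copied from the tagged sentence above so the example runs without the NLTK tagger models.

```python
def extract_nouns(tagged):
    """Return the words whose tag is NN, NNS, NNP, or NNPS."""
    return [word for word, pos in tagged if pos.startswith("NN")]


# A few (word, tag) pairs copied from the tagged sentence above.
tagged = [
    ("PRESIDENT", "NNP"), ("GEORGE", "NNP"), ("'S", "POS"),
    ("BEFORE", "IN"), ("31", "CD"), ("you", "PRP"), ("all", "DT"),
]

nouns = extract_nouns(tagged)
print(nouns)
```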
Extracting a URL in Python
In response to the OP's edit, I borrowed from "Find Hyperlinks in Text using Python (twitter related)" and came up with this:
import re
myString = "This is my tweet check it out http://example.com/blah"
print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
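The one-liner above returns only the first URL and raises an AttributeError when the text contains none, because re.search then returns None. A minimal sketch that collects every match is given here; the name extract_urls and the choice to strip trailing punctuation are my own assumptions, not part of the original answer.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(text):
    """Return all http(s) URLs in text, without trailing punctuation."""
    # rstrip drops sentence punctuation that \S+ swallows at the end of a URL.
    return [url.rstrip(".,;:!?") for url in URL_PATTERN.findall(text)]


urls = extract_urls("Check http://example.com/blah, then https://example.org/x?y=1.")
print(urls)
print(extract_urls("no links here"))
```

Stripping punctuation this way is a heuristic: a URL that legitimately ends in one of those characters would be shortened.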