Why is my NLTK function slow when processing the DataFrame?
Your original nlkt() loops through each row 3 times.

def nlkt(val):
    val = repr(val)
    clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]
    nopunc = [char for char in str(clean_txt) if char not in string.punctuation]
    nonum = [char for char in nopunc if not char.isdigit()]
    words_string = ''.join(nonum)
    return words_string
Also, each time you call nlkt(), you're re-initializing these again and again:

stopwords.words('english')
string.punctuation

These should be global.
stoplist = stopwords.words('english') + list(string.punctuation)
Going through things line by line:

val = repr(val)

I'm not sure why you need to do this, but you could easily cast a column to str type instead. This should be done outside of your preprocessing function.
Hopefully this is self-explanatory:
>>> import pandas as pd
>>> df = pd.DataFrame([[0, 1, 2], [2, 'xyz', 4], [5, 'abc', 'def']])
>>> df
0 1 2
0 0 1 2
1 2 xyz 4
2 5 abc def
>>> df[1]
0 1
1 xyz
2 abc
Name: 1, dtype: object
>>> df[1].astype(str)
0 1
1 xyz
2 abc
Name: 1, dtype: object
>>> list(df[1])
[1, 'xyz', 'abc']
>>> list(df[1].astype(str))
['1', 'xyz', 'abc']
Now on to the next line:

clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]

Using str.split() is awkward; you should use a proper tokenizer. Otherwise, punctuation may stay stuck to the preceding word, e.g.
>>> from nltk.corpus import stopwords
>>> from nltk import word_tokenize
>>> import string
>>> stoplist = stopwords.words('english') + list(string.punctuation)
>>> stoplist = set(stoplist)
>>> text = 'This is foo, bar and doh.'
>>> [word for word in text.split() if word.lower() not in stoplist]
['foo,', 'bar', 'doh.']
>>> [word for word in word_tokenize(text) if word.lower() not in stoplist]
['foo', 'bar', 'doh']
Also, the .isdigit() check should be done in the same pass:
>>> text = 'This is foo, bar, 234, 567 and doh.'
>>> [word for word in word_tokenize(text) if word.lower() not in stoplist and not word.isdigit()]
['foo', 'bar', 'doh']
Putting it all together, your nlkt() should look like this:

def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in stoplist and not word.isdigit()]
And you can use DataFrame.apply:

data['Anylize_Text'].apply(preprocess)
NLTK-based text processing with pandas
Your function is slow and incomplete. First, the issues -

- You're not lowercasing your data.
- You're not getting rid of digits and punctuation properly.
- You're not returning a string (you should join the list using str.join and return it).
- Furthermore, a list comprehension with text processing is a prime way to introduce readability issues, not to mention possible redundancies (you may call a function multiple times, once for each if condition it appears in).
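On the third point, here's a minimal sketch of joining the token list back into a string (toy token list, no NLTK needed):

```python
tokens = ['foo', 'bar', 'doh']

# str.join turns the token list back into a single string,
# which is what a downstream text column usually expects.
cleaned = ' '.join(tokens)
print(cleaned)  # foo bar doh
```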
Next, there are a couple of glaring inefficiencies with your function, especially with the stopword removal code.

- Your stopwords structure is a list, and in checks on lists are slow. The first thing to do is convert it to a set, making the not in check constant time.
- You're using nltk.word_tokenize, which is unnecessarily slow.
- Lastly, you shouldn't always rely on apply, even if you are working with NLTK, where there's rarely any vectorised solution available. There are almost always other ways to do the exact same thing. Oftentimes, even a python loop is faster. But this isn't set in stone.
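To make the first point concrete, here's a toy sketch (a hypothetical word list standing in for the real stopword data) showing that a set gives the same membership answers as a list while keeping each lookup constant time:

```python
# Stand-in word list; in the real code this would be
# stopwords.words('english') + list(string.punctuation).
words = ['word%d' % i for i in range(1000)]
word_set = set(words)

# Semantics are identical; only the lookup cost differs
# (O(n) scan for the list vs O(1) hash lookup for the set).
assert ('word999' in words) == ('word999' in word_set)
assert ('missing' in words) == ('missing' in word_set)
print('membership results agree')
```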
First, create your enhanced stopwords as a set -

user_defined_stop_words = ['st','rd','hong','kong']
i = nltk.corpus.stopwords.words('english')
j = list(string.punctuation) + user_defined_stop_words
stopwords = set(i).union(j)
The next fix is to get rid of the list comprehension and convert this into a multi-line function. This makes things so much easier to work with. Each line of your function should be dedicated to solving a particular task (example, getting rid of digits/punctuation, or getting rid of stopwords, or lowercasing) -
def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())            # get rid of noise
    x = [w for w in x.split() if w not in stopwords]  # remove stopwords (already a set)
    return ' '.join(x)                                # join the list
This would then be applied to your column -
df['Clean_addr'] = df['Adj_Addr'].apply(preprocess)
As an alternative, here's an approach that doesn't rely on apply. This should work well for small sentences.
Load your data into a series -
v = miss_data['Adj_Addr']
v
0 23FLOOR 9 DES VOEUX RD WEST HONG KONG
1 PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT...
2 C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIE...
Name: Adj_Addr, dtype: object
Now comes the heavy lifting.
- Lowercase with str.lower
- Remove noise using str.replace
- Split words into separate cells using str.split
- Apply stopword removal using pd.DataFrame.isin + pd.DataFrame.where
- Finally, join the dataframe using agg.
v = v.str.lower().str.replace(r'[^a-z\s]', '').str.split(expand=True)

v.where(~v.isin(stopwords) & v.notnull(), '')\
 .agg(' '.join, axis=1)\
 .str.replace(r'\s+', ' ')\
 .str.strip()
0 floor des voeux west
1 pag consulting flat aia central connaught central
2 co city lost studios flat f hillier sheung
dtype: object
To use this on multiple columns, place this code in a function preprocess2 and call apply -
def preprocess2(v):
    v = v.str.lower().str.replace(r'[^a-z\s]', '').str.split(expand=True)
    return v.where(~v.isin(stopwords) & v.notnull(), '')\
            .agg(' '.join, axis=1)\
            .str.replace(r'\s+', ' ')\
            .str.strip()
c = ['Col1', 'Col2', ...] # columns to operate
df[c] = df[c].apply(preprocess2, axis=0)
You'll still need an apply call, but with a small number of columns, it shouldn't scale too badly. If you dislike apply, then here's a loopy variant for you -
for _c in c:
    df[_c] = preprocess2(df[_c])
Let's see the difference between our non-loopy version and the original -
s = pd.concat([s] * 100000, ignore_index=True)
s.size
300000
First, a sanity check -
preprocess2(s).eq(s.apply(preprocess)).all()
True
Now come the timings.
%timeit preprocess2(s)
1 loop, best of 3: 13.8 s per loop
%timeit s.apply(preprocess)
1 loop, best of 3: 9.72 s per loop
This is surprising, because apply is seldom faster than a non-loopy solution. But it makes sense here: we've optimised preprocess quite a bit, and string operations in pandas are rarely truly vectorised (they loop element-wise under the hood, so the performance gain isn't as much as you'd expect).
Let's see if we can do better, bypassing apply using np.vectorize -
preprocess3 = np.vectorize(preprocess)
%timeit preprocess3(s)
1 loop, best of 3: 9.65 s per loop
Which is near-identical to apply, but happens to be a bit faster because of the reduced overhead around the "hidden" loop.
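For context, np.vectorize is itself just a convenience wrapper around a Python-level loop rather than true vectorisation, which is why the gain is modest. A minimal sketch with a toy scalar function (hypothetical, not from the answer above):

```python
import numpy as np

def shout(s):
    # Toy scalar function standing in for preprocess.
    return s.upper()

# np.vectorize lets the scalar function broadcast over arrays,
# but under the hood it still loops in Python.
vshout = np.vectorize(shout)
print(vshout(np.array(['foo', 'bar'])))  # ['FOO' 'BAR']
```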
NLTK library working terribly slow
The WordNetLemmatizer may be the culprit. WordNet needs to read from several files to work, and all that OS-level file access can hinder performance. Consider using another lemmatizer, check whether the slow computer's hard drive is faulty, or try defragmenting it (if on Windows).
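If you do stick with WordNetLemmatizer, one cheap mitigation is to reuse a single instance and memoise its results, so each distinct word pays the lookup cost only once. A sketch using functools.lru_cache around a toy stand-in (the real call would delegate to a shared WordNetLemmatizer().lemmatize):

```python
from functools import lru_cache

calls = []  # tracks how often the "expensive" path actually runs

@lru_cache(maxsize=None)
def cached_lemmatize(word):
    # Stand-in for an expensive lemmatizer call that reads from disk;
    # real code would call a shared WordNetLemmatizer instance here.
    calls.append(word)
    return word.rstrip('s')  # toy "lemmatization" for illustration only

print(cached_lemmatize('cats'))  # cat
print(cached_lemmatize('cats'))  # served from the cache, no second call
print(len(calls))                # 1
```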
Python (NLTK) - more efficient way to extract noun phrases?
Take a look at Why is my NLTK function slow when processing the DataFrame?; there's no need to iterate through all the rows multiple times if you don't need the intermediate steps.
With ne_chunk and the solution from NLTK Named Entity recognition to a Python list and How can I extract GPE(location) using NLTK ne_chunk?

[code]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd
def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue
    return continuous_chunk
df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.',
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks(sent))
[out]:
0 [New York]
1 [Washington, Bruce Wayne]
Name: text, dtype: object
To use the custom RegexpParser:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd
# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)
def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue
    return continuous_chunk
df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.',
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))
[out]:
0 [bar sentence, New York city]
1 [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object
How to apply pos_tag_sents() to pandas dataframe efficiently
Input
$ cat test.csv
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat
TL;DR
>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0 cozily married practical athletics Mr. Brown flat
1 active married expensive soccer Mr. Chang flat
2 healthy single expensive badminton Mrs. Green ...
3 cozily married practical soccer Mr. Brown hier...
4 cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]
>>> df['POS'] = tagged_texts
>>> df
ID Task label \
0 1 Collect Information no response
1 2 New Credit no response
2 3 Collect Information response
3 4 Collect Information response
4 5 Collect Information response
Text \
0 cozily married practical athletics Mr. Brown flat
1 active married expensive soccer Mr. Chang flat
2 healthy single expensive badminton Mrs. Green ...
3 cozily married practical soccer Mr. Brown hier...
4 cozily single practical badminton Mr. Brown flat
POS
0 [(cozily, RB), (married, JJ), (practical, JJ),...
1 [(active, JJ), (married, VBD), (expensive, JJ)...
2 [(healthy, JJ), (single, JJ), (expensive, JJ),...
3 [(cozily, RB), (married, JJ), (practical, JJ),...
4 [(cozily, RB), (single, JJ), (practical, JJ), ...
In Long:

First, you can extract the Text column to a list of strings:
texts = df['Text'].tolist()
Then you can apply the word_tokenize
function:
map(word_tokenize, texts)
Note that @Boud's suggestion is almost the same, using df.apply:

df['Text'].apply(word_tokenize)
Then you dump the tokenized text into a list of lists of strings:
df['Text'].apply(word_tokenize).tolist()
Then you can use pos_tag_sents
:
pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
Then you add the column back to the DataFrame:
df['POS'] = pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )