Why Is My NLTK Function Slow When Processing the DataFrame

Why is my NLTK function slow when processing the DataFrame?

Your original nlkt() loops through each row's text 3 times:

def nlkt(val):
    val = repr(val)
    clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]
    nopunc = [char for char in str(clean_txt) if char not in string.punctuation]
    nonum = [char for char in nopunc if not char.isdigit()]
    words_string = ''.join(nonum)
    return words_string

Also, each time you call nlkt(), you re-initialize these objects again and again:

  • stopwords.words('english')
  • string.punctuation

These should be global.

stoplist = stopwords.words('english') + list(string.punctuation)

Going through things line by line:

val = repr(val)

I'm not sure why you need to do this, but you could easily cast a column to str. This should be done outside of your preprocessing function.

Hopefully this is self-explanatory:

>>> import pandas as pd
>>> df = pd.DataFrame([[0, 1, 2], [2, 'xyz', 4], [5, 'abc', 'def']])
>>> df
   0    1    2
0  0    1    2
1  2  xyz    4
2  5  abc  def
>>> df[1]
0      1
1    xyz
2    abc
Name: 1, dtype: object
>>> df[1].astype(str)
0      1
1    xyz
2    abc
Name: 1, dtype: object
>>> list(df[1])
[1, 'xyz', 'abc']
>>> list(df[1].astype(str))
['1', 'xyz', 'abc']

Now on to the next line:

clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]

Using str.split() is awkward; you should use a proper tokenizer. Otherwise, punctuation may stay stuck to the preceding word, e.g.

>>> from nltk.corpus import stopwords
>>> from nltk import word_tokenize
>>> import string
>>> stoplist = stopwords.words('english') + list(string.punctuation)
>>> stoplist = set(stoplist)

>>> text = 'This is foo, bar and doh.'

>>> [word for word in text.split() if word.lower() not in stoplist]
['foo,', 'bar', 'doh.']

>>> [word for word in word_tokenize(text) if word.lower() not in stoplist]
['foo', 'bar', 'doh']

Also, the .isdigit() check can be done in the same pass:

>>> text = 'This is foo, bar, 234, 567 and doh.'
>>> [word for word in word_tokenize(text) if word.lower() not in stoplist and not word.isdigit()]
['foo', 'bar', 'doh']

Putting it all together, your nlkt() should look like this:

def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in stoplist and not word.isdigit()]

And you can use the DataFrame.apply:

data['Anylize_Text'].apply(preprocess)
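
Putting the pieces together, here's a minimal end-to-end sketch (the 'Anylize_Text' column name comes from the question; the sample row is made up purely for illustration):

import string
import pandas as pd
from nltk.corpus import stopwords
from nltk import word_tokenize

# Build the stoplist once, globally, as a set for fast membership checks.
stoplist = set(stopwords.words('english') + list(string.punctuation))

def preprocess(text):
    # Tokenize, then drop stopwords, punctuation and digits in a single pass.
    return [word for word in word_tokenize(text)
            if word.lower() not in stoplist and not word.isdigit()]

# Toy data for illustration only.
data = pd.DataFrame({'Anylize_Text': ['This is foo, bar, 234 and doh.']})
data['Tokens'] = data['Anylize_Text'].astype(str).apply(preprocess)
print(data['Tokens'][0])  # ['foo', 'bar', 'doh']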

NLTK-based text processing with pandas

Your function is slow and incomplete. First, the issues -

  1. You're not lowercasing your data.
  2. You're not getting rid of digits and punctuation properly.
  3. You're not returning a string (you should join the list using str.join and return it)
  4. Furthermore, a list comprehension with text processing is a prime way to introduce readability issues, not to mention possible redundancies (you may end up calling a function multiple times, once for each if condition it appears in).

Next, there are a couple of glaring inefficiencies with your function, especially with the stopword removal code.

  1. Your stopwords structure is a list, and in checks on lists are slow. The first thing to do is to convert it to a set, making the not in check constant time (a toy benchmark illustrating this follows the stopword-set snippet below).

  2. You're using nltk.word_tokenize which is unnecessarily slow.

  3. Lastly, you shouldn't always rely on apply, even if you are working with NLTK, where there's rarely any vectorised solution available. There are almost always other ways to do the exact same thing. Oftentimes, even a plain Python loop is faster. But this isn't set in stone.

First, create your enhanced stopwords as a set -

import string
import nltk

user_defined_stop_words = ['st', 'rd', 'hong', 'kong']

i = nltk.corpus.stopwords.words('english')
j = list(string.punctuation) + user_defined_stop_words

stopwords = set(i).union(j)
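
As a quick illustration of point 1 above (a toy benchmark, not part of the original answer), membership tests against a set are far cheaper than against a list:

import timeit

stop_list = ['word%d' % k for k in range(1000)]  # stopwords kept in a list
stop_set = set(stop_list)                        # the same words as a set

# The list scan is O(n) per lookup; the set lookup is (amortised) O(1).
print(timeit.timeit("'word999' in stop_list", globals=globals(), number=100000))
print(timeit.timeit("'word999' in stop_set", globals=globals(), number=100000))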

The next fix is to get rid of the list comprehension and convert this into a multi-line function. This makes things so much easier to work with. Each line of your function should be dedicated to solving a particular task (for example, getting rid of digits/punctuation, removing stopwords, or lowercasing) -

import re

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())             # get rid of noise (digits, punctuation)
    x = [w for w in x.split() if w not in stopwords]   # remove stopwords (already a set)
    return ' '.join(x)                                 # join the list back into a string

As an example, this would then be applied to your column -

df['Clean_addr'] = df['Adj_Addr'].apply(preprocess)

As an alternative, here's an approach that doesn't rely on apply. This should work well for small sentences.

Load your data into a series -

v = miss_data['Adj_Addr']
v

0             23FLOOR 9 DES VOEUX RD WEST HONG KONG
1    PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT...
2    C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIE...
Name: Adj_Addr, dtype: object

Now comes the heavy lifting.

  1. Lowercase with str.lower
  2. Remove noise using str.replace
  3. Split words into separate cells using str.split
  4. Apply stopword removal using pd.DataFrame.isin + pd.DataFrame.where
  5. Finally, join the words back into strings using agg.

v = v.str.lower().str.replace(r'[^a-z\s]', '').str.split(expand=True)

v.where(~v.isin(stopwords) & v.notnull(), '')\
 .agg(' '.join, axis=1)\
 .str.replace(r'\s+', ' ')\
 .str.strip()

0                                 floor des voeux west
1    pag consulting flat aia central connaught central
2           co city lost studios flat f hillier sheung
dtype: object

To use this on multiple columns, place this code in a function preprocess2 and call apply -

def preprocess2(v):
    v = v.str.lower().str.replace(r'[^a-z\s]', '').str.split(expand=True)

    return v.where(~v.isin(stopwords) & v.notnull(), '')\
            .agg(' '.join, axis=1)\
            .str.replace(r'\s+', ' ')\
            .str.strip()

c = ['Col1', 'Col2', ...]  # columns to operate on
df[c] = df[c].apply(preprocess2, axis=0)

You'll still need an apply call, but with a small number of columns, it shouldn't scale too badly. If you dislike apply, then here's a loopy variant for you -

for _c in c:
    df[_c] = preprocess2(df[_c])

Let's see the difference between our non-loopy version and the original -

s = pd.concat([miss_data['Adj_Addr']] * 100000, ignore_index=True)  # repeat the 3-row sample to build a larger test series

s.size
300000

First, a sanity check -

preprocess2(s).eq(s.apply(preprocess)).all()
True

Now come the timings.

%timeit preprocess2(s)   
1 loop, best of 3: 13.8 s per loop

%timeit s.apply(preprocess)
1 loop, best of 3: 9.72 s per loop

This is surprising, because apply is seldom faster than a non-loopy solution. But it makes sense here: we've optimised preprocess quite a bit, and pandas string operations are rarely truly vectorised (the .str accessor methods still loop in Python under the hood), so the gain isn't as much as you'd expect.

Let's see if we can do better by bypassing apply and using np.vectorize -

import numpy as np

preprocess3 = np.vectorize(preprocess)

%timeit preprocess3(s)
1 loop, best of 3: 9.65 s per loop

This is essentially identical to apply, but happens to be a bit faster because of the reduced overhead around the "hidden" loop.
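
One caveat worth adding (not from the original answer): np.vectorize returns a NumPy array rather than a Series, so if you need the original index back for alignment, rewrap the result:

import pandas as pd

# preprocess3(s) yields an ndarray; rewrap it to keep s's index for alignment.
cleaned = pd.Series(preprocess3(s), index=s.index)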

NLTK library working terribly slow

The WordNetLemmatizer may be the culprit. WordNet needs to read from several files to work, and all that OS-level file access may hinder performance. Consider using another lemmatizer, check whether the hard drive of the slow computer is faulty, or try defragmenting it (if on Windows).
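
One mitigation worth trying (an addition, not something the original answer prescribes) is to memoize lemmatizer calls, so each distinct word only triggers a WordNet lookup once:

from functools import lru_cache
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

@lru_cache(maxsize=None)
def lemmatize(word):
    # Repeated words are served from the cache instead of hitting WordNet again.
    return lemmatizer.lemmatize(word)

print([lemmatize(w) for w in ['cats', 'running', 'cats', 'running']])
# ['cat', 'running', 'cat', 'running']  (default POS is noun)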

Python (NLTK) - more efficient way to extract noun phrases?

Take a look at Why is my NLTK function slow when processing the DataFrame? above; there's no need to iterate through all the rows multiple times if you don't need the intermediate steps.

With ne_chunk and the solutions from

  • NLTK Named Entity recognition to a Python list and

  • How can I extract GPE(location) using NLTK ne_chunk?

[code]:

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue

    return continuous_chunk

df = pd.DataFrame({'text': ['This is a foo, bar sentence with New York city.',
                            'Another bar foo Washington DC thingy with Bruce Wayne.']})

df['text'].apply(lambda sent: get_continuous_chunks(sent))

[out]:

0                    [New York]
1    [Washington, Bruce Wayne]
Name: text, dtype: object

To use the custom RegexpParser:

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

# Defining a grammar & Parser
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue

    return continuous_chunk

df = pd.DataFrame({'text': ['This is a foo, bar sentence with New York city.',
                            'Another bar foo Washington DC thingy with Bruce Wayne.']})

df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))

[out]:

0                  [bar sentence, New York city]
1    [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object

How to apply pos_tag_sents() to pandas dataframe efficiently

Input

$ cat test.csv 
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat

TL;DR

>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0    cozily married practical athletics Mr. Brown flat
1       active married expensive soccer Mr. Chang flat
2    healthy single expensive badminton Mrs. Green ...
3    cozily married practical soccer Mr. Brown hier...
4     cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]

>>> df['POS'] = tagged_texts
>>> df
   ID                 Task        label  \
0   1  Collect Information  no response
1   2           New Credit  no response
2   3  Collect Information     response
3   4  Collect Information     response
4   5  Collect Information     response

                                                Text  \
0  cozily married practical athletics Mr. Brown flat
1     active married expensive soccer Mr. Chang flat
2  healthy single expensive badminton Mrs. Green ...
3  cozily married practical soccer Mr. Brown hier...
4   cozily single practical badminton Mr. Brown flat

                                                 POS
0  [(cozily, RB), (married, JJ), (practical, JJ),...
1  [(active, JJ), (married, VBD), (expensive, JJ)...
2  [(healthy, JJ), (single, JJ), (expensive, JJ),...
3  [(cozily, RB), (married, JJ), (practical, JJ),...
4  [(cozily, RB), (single, JJ), (practical, JJ), ...

In Long:

First, you can extract the Text column into a list of strings:

texts = df['Text'].tolist()

Then you can apply the word_tokenize function:

map(word_tokenize, texts)

Note that @Boud's suggestion is almost the same, using df.apply:

df['Text'].apply(word_tokenize)

Then you dump the tokenized text into a list of lists of strings:

df['Text'].apply(word_tokenize).tolist()

Then you can use pos_tag_sents:

pos_tag_sents(df['Text'].apply(word_tokenize).tolist())

Then you add the column back to the DataFrame:

df['POS'] = pos_tag_sents(df['Text'].apply(word_tokenize).tolist())

