NLTK-based text processing with pandas
Your function is slow and incomplete. First, the issues -
- You're not lowercasing your data.
- You're not getting rid of digits and punctuation properly.
- You're not returning a string (you should join the list using str.join and return it).
- Furthermore, a list comprehension with text processing is a prime way to introduce readability issues, not to mention possible redundancies (you may call a function multiple times, once for each if condition it appears in).
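To see that redundancy concretely, here's a minimal stand-alone sketch (the clean function and word list are hypothetical): when the same call appears in both the if condition and the output expression, it runs twice for every element that survives the filter.

```python
calls = 0

def clean(w):
    """Hypothetical expensive cleaning step; counts how often it runs."""
    global calls
    calls += 1
    return w.strip().lower()

words = [' Foo ', 'BAR', ' baz ']

# clean() appears in both the condition and the output expression,
# so it runs twice for every word that passes the filter.
result = [clean(w) for w in words if clean(w) != 'bar']
print(result)  # ['foo', 'baz']
print(calls)   # 5 - not 3: one extra call for each of the two surviving words
```

Binding the cleaned value once (in a plain loop or a generator) avoids the duplicate work.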
Next, there are a couple of glaring inefficiencies in your function, especially with the stopword removal code.
- Your stopwords structure is a list, and "in" checks on lists are slow. The first thing to do is convert that to a set, making the "not in" check constant time.
- You're using nltk.word_tokenize, which is unnecessarily slow.
- Lastly, you shouldn't always rely on apply, even if you are working with NLTK, where a vectorised solution is rarely available. There are almost always other ways to do the exact same thing; oftentimes, even a Python loop is faster. But this isn't set in stone.
First, create your enhanced stopwords as a set -
import string
import nltk

user_defined_stop_words = ['st', 'rd', 'hong', 'kong']
i = nltk.corpus.stopwords.words('english')
j = list(string.punctuation) + user_defined_stop_words
stopwords = set(i).union(j)
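Why the set matters can be shown with a small, self-contained timing sketch (stdlib only; the stop_list below is a made-up stand-in for the real NLTK list):

```python
import timeit

# Stand-in stopword list (hypothetical); the real one comes from
# nltk.corpus.stopwords.words('english') plus punctuation.
stop_list = ['the', 'a', 'an', 'in', 'on', 'st', 'rd', 'hong', 'kong'] * 100
stop_set = set(stop_list)

# 'not in' on a list scans every element (O(n)); on a set it is a
# hash lookup (O(1) on average), so the set lookup should win easily.
list_time = timeit.timeit(lambda: 'zebra' not in stop_list, number=10_000)
set_time = timeit.timeit(lambda: 'zebra' not in stop_set, number=10_000)
print(set_time < list_time)
```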
The next fix is to get rid of the list comprehension and convert this into a multi-line function. This makes things so much easier to work with. Each line of your function should be dedicated to solving a particular task (for example, getting rid of digits/punctuation, getting rid of stopwords, or lowercasing) -
import re

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())            # get rid of noise
    x = [w for w in x.split() if w not in stopwords]  # remove stopwords (set lookup)
    return ' '.join(x)                                # join the list
This would then be applied to your column -
df['Clean_addr'] = df['Adj_Addr'].apply(preprocess)
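To check the behaviour stand-alone, here's a sketch of the same function with a small hand-picked stopword set in place of the full NLTK list (so it runs without NLTK or pandas):

```python
import re

# Small stand-in stopword set; in the real code this is the NLTK
# English list plus punctuation plus the user-defined words.
stopwords = {'st', 'rd', 'hong', 'kong', 'the', 'of'}

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())            # lowercase, drop digits/punctuation
    x = [w for w in x.split() if w not in stopwords]  # drop stopwords
    return ' '.join(x)                                # back to a single string

print(preprocess('23FLOOR 9 DES VOEUX RD WEST HONG KONG'))
# floor des voeux west
```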
As an alternative, here's an approach that doesn't rely on apply. This should work well for small sentences.
Load your data into a series -
v = miss_data['Adj_Addr']
v
0 23FLOOR 9 DES VOEUX RD WEST HONG KONG
1 PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT...
2 C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIE...
Name: Adj_Addr, dtype: object
Now comes the heavy lifting.
- Lowercase with str.lower
- Remove noise using str.replace
- Split words into separate cells using str.split
- Apply stopword removal using pd.DataFrame.isin + pd.DataFrame.where
- Finally, join the dataframe using agg.
v = v.str.lower().str.replace(r'[^a-z\s]', '').str.split(expand=True)
v.where(~v.isin(stopwords) & v.notnull(), '')\
    .agg(' '.join, axis=1)\
    .str.replace(r'\s+', ' ')\
    .str.strip()
0 floor des voeux west
1 pag consulting flat aia central connaught central
2 co city lost studios flat f hillier sheung
dtype: object
To use this on multiple columns, place this code in a function preprocess2
and call apply
-
def preprocess2(v):
    v = v.str.lower().str.replace(r'[^a-z\s]', '').str.split(expand=True)
    return v.where(~v.isin(stopwords) & v.notnull(), '')\
            .agg(' '.join, axis=1)\
            .str.replace(r'\s+', ' ')\
            .str.strip()
c = ['Col1', 'Col2', ...] # columns to operate on
df[c] = df[c].apply(preprocess2, axis=0)
You'll still need an apply call, but with a small number of columns, it shouldn't scale too badly. If you dislike apply, then here's a loopy variant for you -
for _c in c:
    df[_c] = preprocess2(df[_c])
Let's see the difference between our non-loopy version and the original -
s = pd.concat([v] * 100000, ignore_index=True)
s.size
300000
First, a sanity check -
preprocess2(s).eq(s.apply(preprocess)).all()
True
Now come the timings.
%timeit preprocess2(s)
1 loop, best of 3: 13.8 s per loop
%timeit s.apply(preprocess)
1 loop, best of 3: 9.72 s per loop
This is surprising, because apply is seldom faster than a non-loopy solution. But it makes sense in this case: we've optimised preprocess quite a bit, and pandas string operations, while convenient, are rarely truly vectorised - they loop under the hood, so the performance gain isn't as much as you'd expect.
Let's see if we can do better by bypassing apply, using np.vectorize -
preprocess3 = np.vectorize(preprocess)
%timeit preprocess3(s)
1 loop, best of 3: 9.65 s per loop
This is identical to apply, but happens to be a bit faster because of the reduced overhead around the "hidden" loop.
Python text processing: NLTK and pandas
The benefit of using a pandas DataFrame would be to apply the nltk functionality to each row, like so:
import numpy as np
import pandas as pd

word_file = "/usr/share/dict/words"
words = open(word_file).read().splitlines()[10:50]
random_word_list = [[' '.join(np.random.choice(words, size=1000, replace=True))] for i in range(50)]
df = pd.DataFrame(random_word_list, columns=['text'])
df.head()
text
0 Aaru Aaronic abandonable abandonedly abaction ...
1 abampere abampere abacus aback abalone abactor...
2 abaisance abalienate abandonedly abaff abacina...
3 Ababdeh abalone abac abaiser abandonable abact...
4 abandonable abandon aba abaiser abaft Abama ab...
len(df)
50
from nltk import word_tokenize

txt = df.text.apply(word_tokenize)
txt.head()
0 [Aaru, Aaronic, abandonable, abandonedly, abac...
1 [abampere, abampere, abacus, aback, abalone, a...
2 [abaisance, abalienate, abandonedly, abaff, ab...
3 [Ababdeh, abalone, abac, abaiser, abandonable,...
4 [abandonable, abandon, aba, abaiser, abaft, Ab...
txt.apply(len)
0 1000
1 1000
2 1000
3 1000
4 1000
....
44 1000
45 1000
46 1000
47 1000
48 1000
49 1000
Name: text, dtype: int64
As a result, you get the .count() for each row entry:
txt = txt.apply(lambda x: nltk.Text(x).count('abac'))
txt.head()
0 27
1 24
2 17
3 25
4 32
You can then sum the result using:
txt.sum()
1239
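nltk.Text(tokens).count(word) is effectively an occurrence count over the token list, so the same per-row tally and the final sum can be sketched with plain lists (the token data below is made up):

```python
# Counting one token's occurrences in a token list, as nltk.Text.count does.
tokens = ['abac', 'abaft', 'abac', 'abalone', 'abac']
print(tokens.count('abac'))  # 3

# Summing per-row counts across several token lists mirrors txt.sum().
rows = [tokens, ['abac'], ['abaft']]
print(sum(row.count('abac') for row in rows))  # 4
```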
Preprocessing corpus stored in DataFrame with NLTK
First issue: with stop_words = set(stopwords.words('english')) and ... if word not in [stop_words], you created a list with just one element - the whole set of stopwords. No word equals that whole set, therefore stopwords are not removed. So it must be:
stop_words = stopwords.words('english')
df['tokenized_text'].apply(lambda words: [word for word in words if word not in stop_words + list(string.punctuation)])
Second issue: lemmatizer = WordNetLemmatizer - here you assign the class, but you need to create an instance of it: lemmatizer = WordNetLemmatizer(). Also, you can't lemmatize a whole list in one call; you need to lemmatize word by word:
df['tokenized_text'].apply(lambda words: [lemmatizer.lemmatize(word) for word in words])
Python NLTK and Pandas - text classifier - (newbie ) - importing my data in a format similar to provided example
I figured it out. I basically just needed to combine two lists into a tuple.
def merge(customerreview, reviewrating):
    merged_list = [(customerreview[i], reviewrating[i]) for i in range(len(customerreview))]
    return merged_list

train = merge(customerreview, reviewrating)
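For what it's worth, the same positional pairing can be done more idiomatically with the built-in zip (the sample review data here is hypothetical):

```python
# zip pairs elements by position and stops at the shorter list,
# producing the same (review, rating) tuples as the index-based merge.
customerreview = ['great product', 'terrible', 'okay I guess']
reviewrating = ['pos', 'neg', 'neutral']

train = list(zip(customerreview, reviewrating))
print(train)
# [('great product', 'pos'), ('terrible', 'neg'), ('okay I guess', 'neutral')]
```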
How to apply NLTK word_tokenize library on a Pandas dataframe for Twitter data?
In short:
df['Text'].apply(word_tokenize)
Or if you want to add another column to store the tokenized list of strings:
df['tokenized_text'] = df['Text'].apply(word_tokenize)
There are tokenizers written specifically for twitter text, see http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual
To use nltk.tokenize.TweetTokenizer
:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
df['Text'].apply(tt.tokenize)
Similar to:
How to apply pos_tag_sents() to pandas dataframe efficiently
how to use word_tokenize in data frame
How to apply pos_tag_sents() to pandas dataframe efficiently
Tokenizing words into a new column in a pandas dataframe
Run nltk sent_tokenize through Pandas dataframe
Python text processing: NLTK and pandas
Passing a pandas dataframe column to an NLTK tokenizer
I'm assuming this is an NLTK tokenizer. I believe these work by taking sentences as input and returning tokenised words as output.
What you're passing is raw_df - a pd.DataFrame object, not a str. You cannot expect it to apply the function row-wise without telling it to do so yourself. There's a function called apply for that.
raw_df['tokenized_sentences'] = raw_df['sentences'].apply(tokenizer.tokenize)
Assuming this works without any hitches, tokenized_sentences will be a column of lists.
Since you're performing text processing on DataFrames, I'd recommend taking a look at another answer of mine here: Applying NLTK-based text pre-proccessing on a pandas dataframe
how to use word_tokenize in data frame
You can use the apply method of the DataFrame API:
import pandas as pd
import nltk
df = pd.DataFrame({'sentences': ['This is a very good site. I will recommend it to others.', 'Can you please give me a call at 9983938428. have issues with the listings.', 'good work! keep it up']})
df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)
Output:
>>> df
sentences \
0 This is a very good site. I will recommend it ...
1 Can you please give me a call at 9983938428. h...
2 good work! keep it up
tokenized_sents
0 [This, is, a, very, good, site, ., I, will, re...
1 [Can, you, please, give, me, a, call, at, 9983...
2 [good, work, !, keep, it, up]
For finding the length of each text, try using apply and a lambda function again:
df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)
>>> df
sentences \
0 This is a very good site. I will recommend it ...
1 Can you please give me a call at 9983938428. h...
2 good work! keep it up
tokenized_sents sents_length
0 [This, is, a, very, good, site, ., I, will, re... 14
1 [Can, you, please, give, me, a, call, at, 9983... 15
2 [good, work, !, keep, it, up] 6
pandas: text analysis: Transfer raw data to dataframe
Possibly something like this:
import pandas as pd
import re

with open("12.txt", "r") as f:
    data = f.read()
# print(data)

# ########## findall text in quotes
m = re.findall(r'\"(.+)\"', data)
print("RESULT: \n", m)
df = pd.DataFrame({'rep': m})
print(df)

# ########## remove the quoted text
m = re.sub(r'\"(.+)\"', r'', data)

# ########## get First Name & Last Name from the rest of the text in each line
regex = re.compile("([A-Z]{1}[a-z]+ [A-Z]{1}[a-z]+)")
mm = regex.findall(m)
df1 = pd.DataFrame({'author': mm})
print(df1)

# ########## join the 2 dataframes
fin = pd.concat([df, df1], axis=1)
print(fin)
All the prints are just for checking (remove them for cleaner code).
Just "C. Montgomery Burns" is losing his first initial...
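That happens because the name pattern requires exactly two capitalised words, so a leading initial like "C." falls outside the match. A small sketch of the issue and one possible fix, allowing an optional initial:

```python
import re

line = 'C. Montgomery Burns'

# The original pattern matches two capitalised words only,
# so the initial "C." is dropped from the match.
print(re.findall(r'[A-Z][a-z]+ [A-Z][a-z]+', line))   # ['Montgomery Burns']

# A non-capturing optional "initial + dot" prefix keeps the full name.
pattern = re.compile(r'(?:[A-Z]\.\s+)?[A-Z][a-z]+ [A-Z][a-z]+')
print(pattern.findall(line))                          # ['C. Montgomery Burns']
```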