Stopword Removal with NLTK

I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:

operators = set(('and', 'or', 'not'))
stop = set(stopwords.words('english')) - operators

Then you can simply test if a word is in or not in the set without relying on whether your operators are part of the stopword list. You can then later switch to another stopword list or add an operator.

if word.lower() not in stop:
    # use word

Struggling with removing stop words using nltk

word for word in text iterates over the characters of text, not over the words. You should change your code as below:

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize

stop_words = set(nltk.corpus.stopwords.words('english'))

def stop_word_remover(text):
    word_tokens = word_tokenize(text)
    word_list = [word for word in word_tokens if word.lower() not in stop_words]
    return " ".join(word_list)

stop_word_remover("I don't like ice cream.")

## "n't like ice cream ."

Stopword Removal Dilemma

import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
type(stop_words)
print(len(stop_words))

If you look at the output, the type of stop_words is list. Then:

personal_pronouns = ['i', 'you', 'she', 'he', 'they']  # add any other words you want removed
for word in personal_pronouns:
    if word in stop_words:
        stop_words.remove(word)
        print(word + ' Deleted')
print(len(stop_words))

python nltk processing with text, remove stopwords quickly

Try converting the stopwords to a set. Using a list, your approach is O(n*m), where n is the number of words in the text and m is the number of stop words; using a set, the approach is O(n + m). Let's compare the two approaches, list vs. set:

import timeit
from nltk.corpus import stopwords

def list_clean(text):
    stop_words = stopwords.words('english')
    return [w for w in text if w.lower() not in stop_words]

def set_clean(text):
    set_stop_words = set(stopwords.words('english'))
    return [w for w in text if w.lower() not in set_stop_words]

text = ['the', 'cat', 'is', 'on', 'the', 'table', 'that', 'is', 'in', 'some', 'room'] * 100000

if __name__ == "__main__":
    print(timeit.timeit('list_clean(text)', 'from __main__ import text,list_clean', number=5))
    print(timeit.timeit('set_clean(text)', 'from __main__ import text,set_clean', number=5))

Output

7.6629380420199595
0.8327891009976156

In the code above, list_clean is a function that removes stopwords using a list, and set_clean is a function that removes stopwords using a set. The first time corresponds to list_clean and the second to set_clean. For the given example, set_clean is almost ten times faster.

UPDATE

The O(n*m) and O(n + m) are examples of big O notation, a theoretical way of measuring the efficiency of algorithms. Basically, the larger the polynomial, the less efficient the algorithm; in this case O(n*m) is larger than O(n + m), so the list_clean method is theoretically less efficient than the set_clean method. These numbers come from the fact that searching in a list is O(n), while searching in a set takes a constant amount of time, often referred to as O(1).
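As a small illustration of that difference in lookup cost (an added micro-benchmark, not from the original answer):

```python
import timeit

items = list(range(10_000))
as_list = items
as_set = set(items)

# Membership in a list scans elements one by one (O(n) per lookup);
# membership in a set is a hash lookup (O(1) on average). Looking up
# the last element 1000 times makes the gap obvious.
t_list = timeit.timeit(lambda: 9_999 in as_list, number=1_000)
t_set = timeit.timeit(lambda: 9_999 in as_set, number=1_000)
print(t_list > t_set)  # the list lookups are consistently slower
```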

Removing Stop Word From a Text in Python Without Using NLTK

Check this out (this only works if the language in question can be broken on spaces, but that hasn't been clarified – thanks to Oso):

import numpy as np
your_stop_words = ['something','sth_else','and ...']
new_string = input()
words = np.array(new_string.split())
is_stop_word = np.isin(words,your_stop_words)
filtered_words = words[~is_stop_word]
clean_text = ' '.join(filtered_words)

If the language in question cannot be broken on spaces, you can use this solution:

your_stop_words = ['something','sth_else','and ...']
new_string = input()
clean_text = new_string
for stop_word in your_stop_words:
    clean_text = clean_text.replace(stop_word, "")

In this case, you need to ensure that a stop word cannot be part of another word. How you do that depends on your language; for example, you can require spaces around your stop words.
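One way to enforce that word boundary without NLTK is the standard re module (an alternative sketch, not part of the original answer; the stop words here are chosen to show the problem):

```python
import re

your_stop_words = ['is', 'the']
new_string = "this is the thesis"

# \b word boundaries keep 'is' from being cut out of 'this'
# and 'the' from being cut out of 'thesis', which plain
# str.replace would do.
pattern = r'\b(?:' + '|'.join(map(re.escape, your_stop_words)) + r')\b'
clean_text = re.sub(pattern, '', new_string)
clean_text = ' '.join(clean_text.split())  # collapse leftover spaces
print(clean_text)  # 'this thesis'
```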

Remove stopwords from most common words from set of sentences in Python

You just need to pass the parameter stop_words='english' to CountVectorizer():

vectorizer = CountVectorizer(stop_words='english')

You should now get:

['wear', 'mother', 'red', 'school', 'rt']


Refer to the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


