Stopword removal with NLTK
I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:
from nltk.corpus import stopwords

operators = set(('and', 'or', 'not'))
stop = set(stopwords.words('english')) - operators
Then you can simply test whether a word is in or not in the set, without relying on whether your operators are part of the stopword list. You can then later switch to another stopword list or add an operator.
if word.lower() not in stop:
# use word
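Here is a minimal sketch of the idea (my addition, assuming the NLTK English stopword list is already downloaded), showing that the operators survive the filter while ordinary stopwords are removed:
from nltk.corpus import stopwords

operators = set(('and', 'or', 'not'))
stop = set(stopwords.words('english')) - operators

words = ['cats', 'and', 'not', 'the', 'dogs']
print([w for w in words if w.lower() not in stop])
## ['cats', 'and', 'not', 'dogs']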
Struggling with removing stop words using nltk
word for word in text iterates over the characters of text (not over words!), so you should change your code as below:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize

stop_words = set(nltk.corpus.stopwords.words('english'))

def stop_word_remover(text):
    word_tokens = word_tokenize(text)  # tokenize into words, not characters
    word_list = [word for word in word_tokens if word.lower() not in stop_words]
    return " ".join(word_list)
stop_word_remover("I don't like ice cream.")
## "n't like ice cream ."
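Note that word_tokenize splits the contraction "don't" into "do" and "n't", and keeps the final "." as its own token; "do" and "i" are stopwords, but "n't" and "." are not, which is why they survive. If you also want to drop punctuation and contraction fragments (an assumption about your goal, not part of the original answer), you can additionally filter for alphabetic tokens:
word_list = [word for word in word_tokens
             if word.lower() not in stop_words and word.isalpha()]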
Stopword Removal Dilemma
import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(type(stop_words))
print(len(stop_words))
If you look at the output, the type of stop_words is list. Then:
personal_pronouns = ['i', 'you', 'she', 'he', 'they']  # you can add more words to remove
for word in personal_pronouns:
    if word in stop_words:
        stop_words.remove(word)
        print(word + ' Deleted')
print(len(stop_words))
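As a quick follow-up sketch (my addition; the sample sentence is hypothetical), the trimmed list can then be used for filtering. Converting it to a set first keeps membership tests fast:
stop_set = set(stop_words)  # the pronouns removed above are no longer stopwords
sentence = "she said they met him at the station"
print([w for w in sentence.split() if w not in stop_set])
## ['she', 'said', 'they', 'met', 'station']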
python nltk processing with text, remove stopwords quickly
Try converting stopwords to a set. Using a list, your approach is O(n*m), where n is the number of words in text and m is the number of stop-words; using a set, the approach is O(n + m). Let's compare both approaches, list vs set:
import timeit
from nltk.corpus import stopwords

def list_clean(text):
    stop_words = stopwords.words('english')
    return [w for w in text if w.lower() not in stop_words]

def set_clean(text):
    set_stop_words = set(stopwords.words('english'))
    return [w for w in text if w.lower() not in set_stop_words]

text = ['the', 'cat', 'is', 'on', 'the', 'table', 'that', 'is', 'in', 'some', 'room'] * 100000

if __name__ == "__main__":
    print(timeit.timeit('list_clean(text)', 'from __main__ import text,list_clean', number=5))
    print(timeit.timeit('set_clean(text)', 'from __main__ import text,set_clean', number=5))
Output
7.6629380420199595
0.8327891009976156
In the code above, list_clean is a function that removes stopwords using a list and set_clean is a function that removes stopwords using a set. The first time corresponds to list_clean and the second to set_clean. For the given example, set_clean is almost 10 times faster.
UPDATE
The O(n*m) and O(n + m) are examples of big O notation, a theoretical way of measuring the efficiency of algorithms. Basically, the faster the expression grows, the less efficient the algorithm; in this case O(n*m) grows faster than O(n + m), so the list_clean method is theoretically less efficient than the set_clean method. These numbers come from the fact that searching in a list is O(n), while searching in a set takes a constant amount of time, often referred to as O(1).
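One further micro-optimization worth noting (my addition, not part of the original answer): both functions above rebuild the stopword list or set on every call. Building the set once at module level pays the O(m) construction cost only once; the name set_clean_cached is hypothetical:
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))  # built once, reused on every call

def set_clean_cached(text):
    return [w for w in text if w.lower() not in STOP_WORDS]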
Removing Stop Word From a Text in Python Without Using NLTK
Check this out (this only works if the language in question can be split on spaces, but that hasn't been clarified – thanks to Oso):
import numpy as np

your_stop_words = ['something', 'sth_else', 'and ...']
new_string = input()
words = np.array(new_string.split())
is_stop_word = np.isin(words, your_stop_words)  # boolean mask marking stop words
filtered_words = words[~is_stop_word]           # keep everything that is not a stop word
clean_text = ' '.join(filtered_words)
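As a quick check with concrete (hypothetical) values in place of input():
import numpy as np

your_stop_words = ['the', 'is', 'on']
words = np.array("the cat is on the mat".split())
print(' '.join(words[~np.isin(words, your_stop_words)]))
## cat mat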
If the language in question cannot be split on spaces, you can use this solution:
your_stop_words = ['something', 'sth_else', 'and ...']
new_string = input()
clean_text = new_string
for stop_word in your_stop_words:
    clean_text = clean_text.replace(stop_word, "")
In this case, you need to ensure that a stop word cannot be part of another word; how you do that depends on your language. For example, you can include spaces around your stop words, as in the sketch below.
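Here is a minimal sketch of that idea (my illustration, assuming a space-delimited text; the text itself is padded so stop words at the very start or end are also surrounded by spaces):
your_stop_words = ['is', 'on']
clean_text = ' ' + 'the mat is on the floor' + ' '  # pad the edges too
for stop_word in your_stop_words:
    clean_text = clean_text.replace(' ' + stop_word + ' ', ' ')
print(clean_text.strip())
## the mat the floor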
Remove stopwords from most common words from set of sentences in Python
You just need to include the parameter stop_words='english' to CountVectorizer():
vectorizer = CountVectorizer(stop_words='english')
You should now get:
['wear', 'mother', 'red', 'school', 'rt']
Refer to the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
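For context, here is a minimal, self-contained sketch; the sentences are hypothetical stand-ins, since the question's original data is not shown here:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the cat sat on the mat", "my mother wears red to school"]  # hypothetical data
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())
## ['cat' 'mat' 'mother' 'red' 'sat' 'school' 'wears']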