Struggling with removing stop words using nltk
word for word in text iterates over the characters of text, not over the words. You should change your code as below:
import nltk
nltk.download('stopwords')  # stopword lists
nltk.download('punkt')      # tokenizer models used by word_tokenize
from nltk.tokenize import word_tokenize

stop_words = set(nltk.corpus.stopwords.words('english'))

def stop_word_remover(text):
    word_tokens = word_tokenize(text)  # word-level tokens, not characters
    word_list = [word for word in word_tokens if word.lower() not in stop_words]
    return " ".join(word_list)

stop_word_remover("I don't like ice cream.")
## "n't like ice cream ."
Stopword removal with NLTK
I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:
operators = set(('and', 'or', 'not'))
stop = set(stopwords.words('english')) - operators
Then you can simply test whether a word is in or not in the set, without relying on whether your operators are part of the stopword list. You can then later switch to another stopword list or add an operator.
if word.lower() not in stop:
# use word
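Put together, the whole flow might look like this (a minimal sketch using a small inline stopword list in place of NLTK's full English list):

```python
# small inline stopword list standing in for stopwords.words('english')
stopwords_list = ['and', 'or', 'not', 'is', 'this', 'a']
operators = set(('and', 'or', 'not'))
stop = set(stopwords_list) - operators  # subtract the operators you want to keep

def keep_operators(text):
    # drop stop words but keep the operators
    return " ".join(w for w in text.split() if w.lower() not in stop)

keep_operators("this is not a good idea")
# → 'not good idea'
```

Because 'not' was subtracted from the stop set, it survives the filter even though it appears in the stopword list.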
Removing Stop Word From a Text in Python Without Using NLTK
Check this out (this only works if the language in question can be split on spaces, but that hasn't been clarified – thanks to Oso):
import numpy as np

your_stop_words = ['something', 'sth_else', 'and ...']
new_string = input()
words = np.array(new_string.split())            # split on whitespace
is_stop_word = np.isin(words, your_stop_words)  # boolean mask of stop words
filtered_words = words[~is_stop_word]           # keep only non-stop words
clean_text = ' '.join(filtered_words)
If the language in question cannot be split on spaces, you can use this solution:
your_stop_words = ['something', 'sth_else', 'and ...']
new_string = input()
clean_text = new_string
# remove each stop word as a substring
for stop_word in your_stop_words:
    clean_text = clean_text.replace(stop_word, "")
In this case, you need to ensure that a stop word cannot be part of another word. How to do that depends on your language; for example, you can surround your stop words with spaces.
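For space-delimited text, one way to enforce that boundary is to pad both the text and the stop words with spaces before replacing (a minimal sketch with a made-up stop list):

```python
stop_list = ["is", "a"]   # hypothetical stop words
text = "this is a mistake"

# naive replace also strips substrings inside other words ("this", "mistake")
naive = text
for sw in stop_list:
    naive = naive.replace(sw, "")

# padding with spaces ensures only whole words match
padded = f" {text} "
for sw in stop_list:
    padded = padded.replace(f" {sw} ", " ")
clean_text = padded.strip()
# → 'this mistake'
```

The naive loop mangles "this" and "mistake", while the padded version only removes the standalone words.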
python nltk processing with text, remove stopwords quickly
Try converting stopwords to a set. Using a list, your approach is O(n*m), where n is the number of words in text and m is the number of stop-words; using a set, the approach is O(n + m). Let's compare both approaches, list vs set:
import timeit
from nltk.corpus import stopwords

def list_clean(text):
    stop_words = stopwords.words('english')
    return [w for w in text if w.lower() not in stop_words]

def set_clean(text):
    set_stop_words = set(stopwords.words('english'))
    return [w for w in text if w.lower() not in set_stop_words]

text = ['the', 'cat', 'is', 'on', 'the', 'table', 'that', 'is', 'in', 'some', 'room'] * 100000

if __name__ == "__main__":
    print(timeit.timeit('list_clean(text)', 'from __main__ import text,list_clean', number=5))
    print(timeit.timeit('set_clean(text)', 'from __main__ import text,set_clean', number=5))
Output
7.6629380420199595
0.8327891009976156
In the code above, list_clean is a function that removes stopwords using a list and set_clean is a function that removes stopwords using a set. The first time corresponds to list_clean and the second to set_clean. For the given example, set_clean is almost 10 times faster.
UPDATE
O(n*m) and O(n + m) are examples of big-O notation, a theoretical way of measuring the efficiency of algorithms. Basically, the larger the polynomial, the less efficient the algorithm; in this case O(n*m) grows faster than O(n + m), so the list_clean method is theoretically less efficient than the set_clean method. These numbers come from the fact that searching in a list is O(n), while searching in a set takes a constant amount of time, often referred to as O(1).
Remove Stop words from multi-lingual Text
I want to remove stopwords from all the languages at once.
Merge the results of each stopwords(cc) call, and pass that to a single tm_map(corpus, removeWords, allStopwords) call.
I don't want to write the name of every language in the code to remove the stopwords
You could use stopwords_getlanguages() to get a list of all the supported languages, and do it as a loop. See an example at https://www.rdocumentation.org/packages/stopwords/versions/2.3
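The merge-then-loop idea is language-agnostic; as a rough Python sketch (with small made-up per-language lists standing in for the stopwords(cc) calls):

```python
# hypothetical per-language stopword lists; in R these would come from stopwords(cc)
stopwords_by_lang = {
    "en": ["the", "and", "is"],
    "es": ["el", "y"],
    "de": ["der", "und"],
}

# merge every language's list into one set, then filter once
all_stopwords = set().union(*stopwords_by_lang.values())

tokens = ["the", "Hund", "und", "el", "gato"]
kept = [t for t in tokens if t.lower() not in all_stopwords]
# → ['Hund', 'gato']
```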
For what it's worth, I think this (using stopwords of all languages) is a bad idea. What is a stop word in one language could be a high-information word in another language. E.g. just skimming https://github.com/stopwords-iso/stopwords-es/blob/master/stopwords-es.txt I spotted "embargo", "final", "mayor", "salvo", "sea", which are not in the English stopword list, and could carry information.
Of course it depends on what you are doing with the data once all these words have been stripped out.
But if you are doing something like searching for drug names or other keywords, just do that on the original data, without removing stopwords.