How to Remove Stop Words Using NLTK or Python

Struggling with removing stop words using nltk

word for word in text iterates over the characters of text (not over the words!).
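For example, iterating over a string directly yields single characters:

text = "I don't like ice cream."
print([word for word in text][:5])
# ['I', ' ', 'd', 'o', 'n']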
You should change your code as below:

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize

stop_words = set(nltk.corpus.stopwords.words('english'))

def stop_word_remover(text):
    word_tokens = word_tokenize(text)
    word_list = [word for word in word_tokens if word.lower() not in stop_words]
    return " ".join(word_list)

stop_word_remover("I don't like ice cream.")

## "n't like ice cream ."

Stopword removal with NLTK

I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:

operators = set(('and', 'or', 'not'))
stop = set(stopwords.words('english')) - operators

Then you can simply test if a word is in or not in the set without relying on whether your operators are part of the stopword list. You can then later switch to another stopword list or add an operator.

if word.lower() not in stop:
    # use word
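Putting those pieces together, a minimal sketch (assuming the English NLTK stopword list; the filter_stopwords name and the plain whitespace split are just for illustration):

from nltk.corpus import stopwords

# Operator words we want to keep even though the stopword list contains them.
operators = {'and', 'or', 'not'}
stop = set(stopwords.words('english')) - operators

def filter_stopwords(text):
    # Keep every word that is not in the reduced stopword set.
    return [word for word in text.split() if word.lower() not in stop]

print(filter_stopwords("This is not a drill and the server is down"))
# ['not', 'drill', 'and', 'server'] with the current English list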

Removing Stop Words From a Text in Python Without Using NLTK

Check this out (this only works if the language in question can be split on spaces, but that hasn't been clarified – thanks to Oso):

import numpy as np
your_stop_words = ['something','sth_else','and ...']
new_string = input()
# Split the text on whitespace and mark which words are stop words.
words = np.array(new_string.split())
is_stop_word = np.isin(words, your_stop_words)
# Keep only the words that are not stop words and join them back together.
filtered_words = words[~is_stop_word]
clean_text = ' '.join(filtered_words)

If the language in question cannot be split on spaces, you can use this solution:

your_stop_words = ['something','sth_else','and ...']
new_string = input()
clean_text = new_string
# Remove every occurrence of each stop word from the text.
for stop_word in your_stop_words:
    clean_text = clean_text.replace(stop_word, "")

In this case, you need to ensure that a stop word cannot be part of another word. How you do that depends on your language; for example, you can put spaces around your stop words, as in the sketch below.
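A minimal sketch of that idea (hypothetical stop-word list; the text and the stop words are padded with spaces so only whole words are removed):

your_stop_words = ['is', 'a']
new_string = "this is a mistake"
# Pad the text so even the first and last word are surrounded by spaces.
clean_text = " " + new_string + " "
for stop_word in your_stop_words:
    # " is " matches the word "is" but not the "is" inside "this".
    clean_text = clean_text.replace(" " + stop_word + " ", " ")
clean_text = clean_text.strip()
# clean_text == "this mistake"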

python nltk processing with text, remove stopwords quickly

Try converting stopwords to a set. Using a list, your approach is O(n*m), where n is the number of words in text and m is the number of stop words; using a set, the approach is O(n + m). Let's compare both approaches, list vs set:

import timeit
from nltk.corpus import stopwords

def list_clean(text):
    stop_words = stopwords.words('english')
    return [w for w in text if w.lower() not in stop_words]

def set_clean(text):
    set_stop_words = set(stopwords.words('english'))
    return [w for w in text if w.lower() not in set_stop_words]

text = ['the', 'cat', 'is', 'on', 'the', 'table', 'that', 'is', 'in', 'some', 'room'] * 100000

if __name__ == "__main__":
    print(timeit.timeit('list_clean(text)', 'from __main__ import text,list_clean', number=5))
    print(timeit.timeit('set_clean(text)', 'from __main__ import text,set_clean', number=5))

Output

7.6629380420199595
0.8327891009976156

In the code above, list_clean is a function that removes stopwords using a list and set_clean is a function that removes stopwords using a set. The first time corresponds to list_clean and the second to set_clean. For the given example, set_clean is almost 10 times faster.

UPDATE

O(n*m) and O(n + m) are examples of big-O notation, a theoretical way of measuring the efficiency of algorithms. Basically, the faster the expression grows, the less efficient the algorithm; O(n*m) grows faster than O(n + m), so the list_clean method is theoretically less efficient than the set_clean method. These numbers come from the fact that searching a list takes time proportional to its length, so each of the n words may be compared against all m stop words, while looking a word up in a set takes a constant amount of time, often referred to as O(1); the O(m) term is just the one-time cost of building the set.
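You can see the lookup cost in isolation with a quick sketch (arbitrary sizes; the exact timings will vary by machine) that times a failed membership test in a list versus a set:

import timeit

items_list = list(range(100_000))
items_set = set(items_list)

# -1 is not present, so the list has to be scanned end to end,
# while the set answers with a single hash lookup.
print(timeit.timeit('-1 in items_list', globals=globals(), number=100))
print(timeit.timeit('-1 in items_set', globals=globals(), number=100))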

Remove Stop words from multi-lingual Text

I want to remove stopwords from all the languages at once.

Merge the results of each stopwords(cc) call, and pass that to a single tm_map(corpus, removeWords, allStopwords) call.

I don't want to write the name of every language in the code to remove the stopwords

You could use stopwords_getlanguages() to get a list of all the supported languages, and do it as a loop. See an example at https://www.rdocumentation.org/packages/stopwords/versions/2.3
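That answer is about R's tm and stopwords packages. As a rough Python/NLTK analogue of the same idea, assuming the NLTK stopword corpus is downloaded, you can merge the lists of every language the corpus ships without naming them one by one (fileids() returns the available languages):

from nltk.corpus import stopwords  # requires nltk.download('stopwords')

# Union of the stopword lists for every language in the corpus.
all_stopwords = set()
for lang in stopwords.fileids():
    all_stopwords.update(stopwords.words(lang))

def remove_stopwords(tokens):
    # remove_stopwords is just an illustrative helper name.
    return [w for w in tokens if w.lower() not in all_stopwords]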

For what it's worth, I think this (using stopwords of all languages) is a bad idea. What is a stop word in one language could be a high-information word in another language. E.g. just skimming https://github.com/stopwords-iso/stopwords-es/blob/master/stopwords-es.txt I spotted "embargo", "final", "mayor", "salvo", "sea", which are not in the English stopword list, and could carry information.

Of course it depends on what you are doing with the data once all these words have been stripped out.

But if you are doing something like searching for drug names or other keywords, just do that on the original data, without removing stopwords.


