nltk NaiveBayesClassifier training for sentiment analysis
You need to change your data structure. Here is your train list as it currently stands:
>>> train = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]
The problem is that the first element of each tuple should be a dictionary of features, not a raw string. The following converts your list into a data structure that the classifier can work with:
>>> from nltk.tokenize import word_tokenize # or use some other tokenizer
>>> all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
>>> t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]
Your data should now be structured like this:
>>> t
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), . . .]
Note that the first element of each tuple is now a dictionary. (Because all_words is lowercased but the tokens it is compared against are not, a capitalized word such as "I" is marked False here.) With the data in this shape, you can train the classifier like so:
>>> import nltk
>>> classifier = nltk.NaiveBayesClassifier.train(t)
>>> classifier.show_most_informative_features()
Most Informative Features
this = True neg : pos = 2.3 : 1.0
this = False pos : neg = 1.8 : 1.0
an = False neg : pos = 1.6 : 1.0
. = True pos : neg = 1.4 : 1.0
. = False neg : pos = 1.4 : 1.0
awesome = False neg : pos = 1.2 : 1.0
of = False pos : neg = 1.2 : 1.0
feel = False neg : pos = 1.2 : 1.0
place = False neg : pos = 1.2 : 1.0
horrible = False pos : neg = 1.2 : 1.0
To use the classifier, begin with a test sentence:
>>> test_sentence = "This is the best band I've ever heard!"
Then, you tokenize the sentence and figure out which words the sentence shares with all_words. These constitute the sentence's features.
>>> test_sent_features = {word: (word in word_tokenize(test_sentence.lower())) for word in all_words}
Your features will now look like this:
>>> test_sent_features
{'love': False, 'deal': False, 'tired': False, 'feel': False, 'is': True, 'am': False, 'an': False, 'sandwich': False, 'ca': False, 'best': True, '!': True, 'what': False, 'i': True, '.': False, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'these': False, 'stuff': False, 'place': False, 'my': False, 'view': False}
Then you simply classify those features:
>>> classifier.classify(test_sent_features)
'pos' # note 'best' == True in the sentence features above
This test sentence appears to be positive.
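One way to keep training and test features consistent is to send both through a single helper. The following is only a sketch under stated assumptions, not the code from the answer: tokenize is a hypothetical regex stand-in for word_tokenize (it needs no NLTK data and splits text slightly differently), and extract_features is a name introduced here for illustration. Unlike the list comprehension above, it lowercases the text before comparing, so capitalized words match the vocabulary.

```python
import re

def tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize: lowercase the text, then
    # split it into runs of word characters/apostrophes or single punctuation marks.
    return re.findall(r"[\w']+|[^\w\s]", text.lower())

def extract_features(text, vocabulary):
    # Tokenize once, then record for every vocabulary word whether it occurs.
    tokens = set(tokenize(text))
    return {word: (word in tokens) for word in vocabulary}

# Two of the training sentences from the answer above.
train = [('I love this sandwich.', 'pos'),
         ('I do not like this restaurant', 'neg')]

vocabulary = {word for text, _ in train for word in tokenize(text)}

# Same helper for training data and for a new sentence.
train_features = [(extract_features(text, vocabulary), label) for text, label in train]
test_features = extract_features("This is the best sandwich!", vocabulary)
```

Here train_features has the same (dictionary, label) shape as t, and test_features the same shape as test_sent_features, so they could be passed to nltk.NaiveBayesClassifier.train and classifier.classify in the same way.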
How to predict Sentiments after training and testing the model by using NLTK NaiveBayesClassifier in Python?
Documentation and example
The line that gives you the error calls the method SentimentAnalyzer.evaluate(...).
This method does the following.
Evaluate and print classifier performance on the test set.
See SentimentAnalyzer.evaluate.
The method has one mandatory parameter: test_set.
test_set – A list of (tokens, label) tuples to use as gold set.
In the example at http://www.nltk.org/howto/sentiment.html test_set has the following structure:
[({'contains(,)': False, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ({'contains(,)': True, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ...]
Here is a symbolic representation of the structure.
[(dictionary,label), ... , (dictionary,label)]
Error in your code
You are passing
list(zip(new_data['Articles']))
to SentimentAnalyzer.evaluate. I assume you are getting the error because
list(zip(new_data['Articles']))
does not create a list of (tokens, label) tuples: zip with a single argument yields 1-tuples, each holding one article and no label. You can check that by storing the list in a variable and printing it, or by inspecting the variable while debugging.
For example:
test_set = list(zip(new_data['Articles']))
print("begin test_set")
print(test_set)
print("end test_set")
You are calling evaluate correctly three lines above the line that raises the error:
score = analyzer.evaluate(list(zip(_test_X, test_y)))
I guess you want to call SentimentAnalyzer.classify(instance) to predict unlabeled data. See SentimentAnalyzer.classify.
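To see the difference between the two zip calls without running the analyzer, a small shape check can help. This is a sketch only: find_malformed_items is a hypothetical helper introduced here (it is not part of NLTK), and the articles and labels are made-up placeholders.

```python
def find_malformed_items(test_set):
    # Return a description of every item that is not a 2-tuple; an empty
    # list means each item at least has the (something, label) pair shape.
    problems = []
    for index, item in enumerate(test_set):
        if not (isinstance(item, tuple) and len(item) == 2):
            problems.append(f"item {index} is not a 2-tuple: {item!r}")
    return problems

articles = ["first article text", "second article text"]  # made-up data
labels = ["subj", "obj"]                                   # made-up labels

unlabeled = list(zip(articles))        # zip with one argument -> 1-tuples
labeled = list(zip(articles, labels))  # zip with two arguments -> pairs

print(find_malformed_items(unlabeled))
print(find_malformed_items(labeled))
```

The first call reports both items, because each one is a 1-tuple; the second reports nothing.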
NLTK Naive Bayes Classifier Training issues
There is a typo in your code:
feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]
The loop variable is misspelled as sentment, so sentiment is never rebound inside the comprehension. It has the same value all the time (namely the value of the last tweet from your preprocessing step), so training is pointless and all features are irrelevant.
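The effect can be reproduced in isolation. In this sketch, find_features is a hypothetical stand-in for your function, the two documents are made up, and the stale sentiment value is assumed to come from an earlier loop, as described above.

```python
def find_features(words):
    # Hypothetical stand-in for the asker's find_features.
    return {word: True for word in words}

documents = [(["great", "film"], "positive"), (["awful", "film"], "negative")]

sentiment = "negative"  # stale value left behind by an earlier loop (assumed)

# Misspelled loop variable: `sentiment` is never rebound per document,
# so every example gets the stale label.
buggy = [(find_features(words), sentiment) for (words, sentment) in documents]

# Correct spelling: each document keeps its own label.
fixed = [(find_features(words), sentiment) for (words, sentiment) in documents]

buggy_labels = [label for _, label in buggy]
fixed_labels = [label for _, label in fixed]
print(buggy_labels)
print(fixed_labels)
```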
Fix it and you will get:
('Naive Bayes Accuracy:', 66.75)
Most Informative Features
-- = True positi : negati = 6.9 : 1.0
these = True positi : negati = 5.6 : 1.0
face = True positi : negati = 5.6 : 1.0
saw = True positi : negati = 5.6 : 1.0
] = True positi : negati = 4.4 : 1.0
later = True positi : negati = 4.4 : 1.0
love = True positi : negati = 4.1 : 1.0
ta = True positi : negati = 4.0 : 1.0
quite = True positi : negati = 4.0 : 1.0
trying = True positi : negati = 4.0 : 1.0
small = True positi : negati = 4.0 : 1.0
thx = True positi : negati = 4.0 : 1.0
music = True positi : negati = 4.0 : 1.0
p = True positi : negati = 4.0 : 1.0
husband = True positi : negati = 4.0 : 1.0
NLTK NaiveBayesClassifier classifier issues
Training a sentiment model means that your model learns how words affect the sentiment. So it is not about specifying which words are positive and which are negative; it is about training your model to infer that from the text by itself.
The simplest implementation is called "bag of words" (often combined with TF-IDF normalization). Bag of words works this way: you split your text into words and count the occurrences of each word within the given text block (or review). Rows then correspond to different reviews, and columns to the number of occurrences of a given word within a given review. This table becomes your X, and the target sentiment to predict becomes your Y (say 0 for negative and 1 for positive).
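To make the table concrete, here is a minimal pure-Python sketch of the counting step. The two reviews and their labels are made up for illustration, and splitting on whitespace is a simplification of what a real vectorizer does.

```python
from collections import Counter

reviews = ["good good movie", "bad movie"]  # made-up reviews
Y = [1, 0]                                   # 1 = positive, 0 = negative

# One column per distinct word, in sorted order so the layout is stable.
vocabulary = sorted({word for review in reviews for word in review.split()})

# One row per review; each cell is how often that column's word occurs in it.
X = []
for review in reviews:
    counts = Counter(review.split())
    X.append([counts[word] for word in vocabulary])

print(vocabulary)
print(X)
```

With these two reviews the columns are bad, good, movie, and the rows are [0, 2, 1] and [1, 0, 1].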
Then you train your classifier:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
reviews, Y = your_load_function()
vectorizer = TfidfVectorizer() # or CountVectorizer()
X = vectorizer.fit_transform(reviews) # convert text to TF-IDF weights (plain word counts with CountVectorizer)
model = MultinomialNB()
model.fit(X, Y)
After the model is trained you can make predictions:
new_reviews = your_load_function2()
new_X = vectorizer.transform(new_reviews)
predicted_Y = model.predict(new_X)
Further reading:
https://en.wikipedia.org/wiki/Bag-of-words_model
https://en.wikipedia.org/wiki/Tf-idf
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Naive Bayes Classifier and training data
When doing machine learning, we want to learn a model that performs well on new (unseen) data. This is called generalization.
The purpose of the test set is, amongst others, to verify the generalization behavior of your classifier. If your model predicts the same label for every test instance, then we cannot confirm that it generalizes. The test set should be representative of the conditions in which you apply the classifier later.
As a rule of thumb, I like to keep 25-50% of the data as a test set. This of course depends on the situation. 30/4000 is less than one percent.
A second point that comes to mind: when your classifier is biased towards one class, make sure each class is represented nearly equally in the training and validation sets. This prevents the classifier from 'just' learning the distribution of the whole set, instead of learning which features are relevant.
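A related technique is a stratified split, sketched below. Note that this is not the same as balancing: it keeps each class's share the same in the training and test sets, but it does not equalize classes that are unequal to begin with. stratified_split is a name introduced here, and the samples are made up.

```python
import random
from collections import defaultdict

def stratified_split(samples, test_fraction, seed=0):
    # Group samples by label, then cut the same fraction from every group so
    # the train and test sets keep the class proportions of the full set.
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append((features, label))

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(round(len(group) * test_fraction))
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

# Made-up data: 8 positive and 4 negative samples.
samples = ([({"id": i}, "pos") for i in range(8)] +
           [({"id": i}, "neg") for i in range(8, 12)])
train, test = stratified_split(samples, test_fraction=0.25)
```

With a test fraction of 0.25, the test set receives 2 of the 8 positive samples and 1 of the 4 negative ones.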
As a final note, normally we report metrics such as precision, recall and F1 (Fβ with β = 1) to evaluate a classifier. The code in your sample seems to report something based on the global sentiment in all tweets. Are you sure that is what you want? Are the tweets a representative collection?
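For reference, these three metrics can be computed from gold and predicted labels with a few lines of plain Python. This is a sketch for a single class treated as "positive"; the function name and the four labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    # Count true positives, false positives and false negatives for one class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["pos", "pos", "neg", "neg"]  # made-up gold labels
y_pred = ["pos", "neg", "pos", "neg"]  # made-up predictions
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="pos")
print(precision, recall, f1)
```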
Why did NLTK NaiveBayes classifier misclassify one record?
Here is the modified code:
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import stopwords  # requires the NLTK 'stopwords' corpus to be downloaded
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]
def word_feats(words):
    # Mark every word in the list as a present feature.
    return dict([(word, True) for word in words])
positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]
train_set = negative_features_1 + positive_features_1 + neutral_features_1
classifier = NaiveBayesClassifier.train(train_set)
# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad."
sentence = sentence.lower()
sentences = sentence.split('.')  # this is a list of sentences
for sent in sentences:
    if sent != "":
        # Keep only the words that are not English stop words.
        words = [word for word in sent.split(" ") if word not in stopwords.words('english')]
        classResult = classifier.classify(word_feats(words))
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
        print(str(sent) + ' --> ' + str(classResult))
        print()
I changed the part where you passed a single 'list of words' to your classifier. You actually need to classify one sentence at a time, which means you need to work from a 'list of sentences'.
Also, for each sentence, you need to pass 'words as features', which means you need to split the sentence on whitespace.
Also, if you want your classifier to work properly for sentiment analysis, you need to give less weight to stop words such as "it", "they" and "is", because these words are not sufficient to decide whether a sentence is positive, negative or neutral.
The code above gives the following output:
awesome movie --> pos
i like it --> pos
it is so bad --> neg
So for any classifier, the input format used for training and the input format used for prediction should be the same. During training you provide a list of words, so use the same method to convert your test set as well.
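One way to enforce this is to build both the training and the test features with one function. The sketch below is an illustration under assumptions: sentence_features is a name introduced here, and STOP_WORDS is a tiny made-up list standing in for NLTK's English stop words.

```python
STOP_WORDS = {"i", "it", "is", "so", "the"}  # tiny illustrative list, not NLTK's

def sentence_features(sentence, stop_words=STOP_WORDS):
    # Lowercase, split on whitespace, strip surrounding punctuation, drop
    # stop words and empty strings, then mark each remaining word as present.
    words = (word.strip(".,!?") for word in sentence.lower().split())
    return {word: True for word in words if word and word not in stop_words}

# The same function builds a training example and a test instance.
training_example = (sentence_features("Awesome"), "pos")
test_instance = sentence_features("It is so awesome.")
print(training_example)
print(test_instance)
```

Because both go through the same conversion, the word "awesome" ends up as the same feature key in the training example and in the test instance.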
Sentiment Analysis, Naive Bayes Accuracy
Well, as the error message says, the classifier you are trying to use (NaiveBayesClassifier) doesn't have the method classify_many that the nltk.classify.util.accuracy function requires.
(Reference: https://www.nltk.org/_modules/nltk/classify/naivebayes.html)
Now, that looks like an NLTK bug, but you can get your answer easily on your own:
from sklearn.metrics import accuracy_score
y_predicted = [classifier.classify(x) for x in proc_set]
accuracy = accuracy_score(y_true, y_predicted)
Here y_true holds the sentiment values corresponding to the proc_set inputs (which I don't see you actually creating in the code shown above, though).
Hope that helps.
EDIT: Or, without using the sklearn accuracy function, in pure Python:
hits = [yp == yt for yp, yt in zip(y_predicted, y_true)]
accuracy = sum(hits)/len(hits)
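Since y_true is not created in the code you posted, here is one hedged way to derive it together with proc_set from a list of labeled pairs. Everything in this sketch is illustrative: AlwaysPositive is a hypothetical stand-in for your trained classifier, and the labeled data is made up.

```python
class AlwaysPositive:
    # Hypothetical stand-in for a trained classifier with a classify() method.
    def classify(self, features):
        return "pos"

# Made-up labeled data in (features, label) form.
labeled = [({"good": True}, "pos"), ({"bad": True}, "neg"), ({"fine": True}, "pos")]

# Split the labeled pairs into inputs and gold labels, keeping the order aligned.
proc_set = [features for features, _ in labeled]
y_true = [label for _, label in labeled]

classifier = AlwaysPositive()
y_predicted = [classifier.classify(x) for x in proc_set]

# Fraction of predictions that match the gold label.
hits = [yp == yt for yp, yt in zip(y_predicted, y_true)]
accuracy = sum(hits) / len(hits)
print(accuracy)
```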