Nltk Naivebayesclassifier Training for Sentiment Analysis

You need to change your data structure. Here is your train list as it currently stands:

>>> train = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

The problem is, though, that the first element of each tuple should be a dictionary of features. So I will change your list into a data structure that the classifier can work with:

>>> from nltk.tokenize import word_tokenize # or use some other tokenizer
>>> all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
>>> t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]

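Your data should now be structured like this. (Because x[0] is not lowercased here while all_words is, capitalized tokens such as "I" do not match, which is why 'i' is False below.)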

>>> t
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), . . .]
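Since the same lookup is needed again at prediction time, it can help to put it in one function. The following is a minimal sketch, not the code from the answer: extract_features and tokenize are hypothetical names, and a simple regular expression stands in for word_tokenize so that the example has no dependencies. It lowercases every passage before matching, so training and test sentences are treated the same way.

```python
import re

def tokenize(text):
    # Stand-in for nltk's word_tokenize (assumption): lowercase, then split
    # into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def extract_features(text, vocabulary):
    # Tokenize once, then record for every vocabulary word whether it occurs.
    tokens = set(tokenize(text))
    return {word: (word in tokens) for word in vocabulary}

train = [('I love this sandwich.', 'pos'),
         ('I do not like this restaurant', 'neg')]

vocabulary = {word for passage, _ in train for word in tokenize(passage)}
featuresets = [(extract_features(passage, vocabulary), label)
               for passage, label in train]
```

The resulting featuresets list has the (dictionary, label) shape that the classifier expects, and extract_features can be reused unchanged on a new sentence.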

Now that the first element of each tuple is a dictionary, you can train the classifier like so:

>>> import nltk
>>> classifier = nltk.NaiveBayesClassifier.train(t)
>>> classifier.show_most_informative_features()
Most Informative Features
                    this = True              neg : pos    =      2.3 : 1.0
                    this = False             pos : neg    =      1.8 : 1.0
                      an = False             neg : pos    =      1.6 : 1.0
                       . = True              pos : neg    =      1.4 : 1.0
                       . = False             neg : pos    =      1.4 : 1.0
                 awesome = False             neg : pos    =      1.2 : 1.0
                      of = False             pos : neg    =      1.2 : 1.0
                    feel = False             neg : pos    =      1.2 : 1.0
                   place = False             neg : pos    =      1.2 : 1.0
                horrible = False             pos : neg    =      1.2 : 1.0

If you want to use the classifier, you can do it like this. First, you begin with a test sentence:

>>> test_sentence = "This is the best band I've ever heard!"

Then, you tokenize the sentence and figure out which words the sentence shares with all_words. These constitute the sentence's features.

>>> test_sent_features = {word: (word in word_tokenize(test_sentence.lower())) for word in all_words}

Your features will now look like this:

>>> test_sent_features
{'love': False, 'deal': False, 'tired': False, 'feel': False, 'is': True, 'am': False, 'an': False, 'sandwich': False, 'ca': False, 'best': True, '!': True, 'what': False, 'i': True, '.': False, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'these': False, 'stuff': False, 'place': False, 'my': False, 'view': False}

Then you simply classify those features:

>>> classifier.classify(test_sent_features)
'pos' # note 'best' == True in the sentence features above

This test sentence appears to be positive.

How to predict Sentiments after training and testing the model by using NLTK NaiveBayesClassifier in Python?

Documentation and example

The line that gives you the error calls the method SentimentAnalyzer.evaluate(...).
This method does the following:

Evaluate and print classifier performance on the test set.

See SentimentAnalyzer.evaluate.

The method has one mandatory parameter: test_set.

test_set – A list of (tokens, label) tuples to use as gold set.

In the example at http://www.nltk.org/howto/sentiment.html test_set has the following structure:

[({'contains(,)': False, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ({'contains(,)': True, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ...]

Here is a symbolic representation of the structure.

[(dictionary,label), ... , (dictionary,label)]
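A quick shape check along these lines can catch a malformed test set before it reaches evaluate. This is only a sketch: check_test_set is a hypothetical helper, not part of NLTK, and the sample data is made up.

```python
def check_test_set(test_set):
    # Hypothetical helper: return the indices of entries that are not
    # (dict, label) pairs.
    bad = []
    for i, item in enumerate(test_set):
        ok = (isinstance(item, tuple) and len(item) == 2
              and isinstance(item[0], dict))
        if not ok:
            bad.append(i)
    return bad

good = [({'contains(the)': True}, 'subj'), ({'contains(the)': False}, 'obj')]
malformed = [('some raw article text',), ({'contains(the)': True}, 'subj')]

print(check_test_set(good))       # []
print(check_test_set(malformed))  # [0]
```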

Error in your code

You are passing

list(zip(new_data['Articles']))

to SentimentAnalyzer.evaluate. I assume you're getting the error because

list(zip(new_data['Articles']))

does not create a list of (tokens, label) tuples. You can check that by storing the list in a variable and printing it, or by inspecting the variable in a debugger.
For example:

test_set = list(zip(new_data['Articles']))
print("begin test_set")
print(test_set)
print("end test_set")
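To see what such a printout would reveal, here is what zip does with one iterable versus two. The article strings and labels are made-up placeholders, not your data.

```python
articles = ['first article text', 'second article text']  # placeholder data
labels = ['pos', 'neg']                                    # placeholder labels

one_arg = list(zip(articles))           # one iterable in, 1-tuples out
two_args = list(zip(articles, labels))  # two iterables in, pairs out

print(one_arg)   # [('first article text',), ('second article text',)]
print(two_args)  # [('first article text', 'pos'), ('second article text', 'neg')]
```

With a single argument, every element is a 1-tuple with no label, so it cannot serve as a gold set.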

You are calling evaluate correctly 3 lines above the one that is giving the error.

score = analyzer.evaluate(list(zip(_test_X, test_y)))

I guess you want to call SentimentAnalyzer.classify(instance) to predict unlabeled data. See SentimentAnalyzer.classify.

NLTK Naive Bayes Classifier Training issues

There is a typo in your code: the loop variable is spelled sentment, but the expression uses sentiment:

feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]

This causes sentiment to have the same value all the time (namely the value of the last tweet from your preprocessing step) so training is pointless and all features are irrelevant.
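The effect can be reproduced in a few lines. This is a sketch with made-up documents; the empty for loop only simulates an earlier preprocessing step that leaves sentiment bound to the label of the last document.

```python
documents = [(['great', 'day'], 'positive'),
             (['awful', 'service'], 'negative'),
             (['love', 'it'], 'positive')]

# Simulate a preprocessing loop that leaves `sentiment` bound to the label
# of the last document it saw.
for words, sentiment in documents:
    pass

# Typo: the loop variable is `sentment`, so `sentiment` is the stale outer name.
buggy = [(words, sentiment) for (words, sentment) in documents]

# Fixed: the loop variable and the name used in the expression agree.
fixed = [(words, sentiment) for (words, sentiment) in documents]

print([label for _, label in buggy])  # ['positive', 'positive', 'positive']
print([label for _, label in fixed])  # ['positive', 'negative', 'positive']
```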

Fix it and you will get:

('Naive Bayes Accuracy:', 66.75)
Most Informative Features
                  -- = True           positi : negati =      6.9 : 1.0
               these = True           positi : negati =      5.6 : 1.0
                face = True           positi : negati =      5.6 : 1.0
                 saw = True           positi : negati =      5.6 : 1.0
                   ] = True           positi : negati =      4.4 : 1.0
               later = True           positi : negati =      4.4 : 1.0
                love = True           positi : negati =      4.1 : 1.0
                  ta = True           positi : negati =      4.0 : 1.0
               quite = True           positi : negati =      4.0 : 1.0
              trying = True           positi : negati =      4.0 : 1.0
               small = True           positi : negati =      4.0 : 1.0
                 thx = True           positi : negati =      4.0 : 1.0
               music = True           positi : negati =      4.0 : 1.0
                   p = True           positi : negati =      4.0 : 1.0
             husband = True           positi : negati =      4.0 : 1.0

NLTK NaiveBayesClassifier classifier issues

Training a sentiment model means that your model learns how words affect the sentiment. Thus it's not about specifying which words are positive and which are negative; it's about training your model to infer that from the text itself.

The simplest approach is called "bag of words" (often combined with TF-IDF normalization). It works like this: you split each text into words and count the occurrences of each word within the given text block (or review). The result is a table in which rows correspond to reviews and columns correspond to words, with each cell holding the number of occurrences of that word in that review. This table becomes your X, and the target sentiment to predict becomes your Y (say, 0 for negative and 1 for positive).
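To make the table concrete, here is a small hand-rolled version using only the standard library. It is an illustration with made-up reviews, not a replacement for a real vectorizer, and it uses plain whitespace splitting as the tokenizer.

```python
from collections import Counter

reviews = ["good good movie", "bad movie"]  # made-up reviews
Y = [1, 0]                                   # 1 = positive, 0 = negative

# Count word occurrences per review.
counts = [Counter(review.lower().split()) for review in reviews]

# One column per distinct word, in sorted order.
vocabulary = sorted(set(word for c in counts for word in c))

# One row per review; each cell is how often that word occurs in the review.
X = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)  # ['bad', 'good', 'movie']
print(X)           # [[0, 2, 1], [1, 0, 1]]
```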

Then you train your classifier:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

reviews, Y = your_load_function()

vectorizer = TfidfVectorizer()  # or CountVectorizer() for raw counts
X = vectorizer.fit_transform(reviews)  # convert text to a matrix of TF-IDF weights (or word counts)

model = MultinomialNB()
model.fit(X, Y)

After the model is trained you can make predictions:

new_reviews = your_load_function2()
new_X = vectorizer.transform(new_reviews)
predicted_Y = model.predict(new_X)

Further reading:

https://en.wikipedia.org/wiki/Bag-of-words_model

https://en.wikipedia.org/wiki/Tf-idf

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Naive Bayes Classifier and training data

When doing machine learning, we want to learn an algorithm that performs well on new (unseen) data. This is called generalization.

The purpose of the test set is, amongst others, to verify the generalization behavior of your classifier. If your model predicts the same label for every test instance, then we cannot confirm that it generalizes. The test set should be representative of the conditions in which you apply the model later.

As a rule of thumb, I like to keep 25-50% of the data as a test set. This of course depends on the situation. 30/4000 is less than one percent.
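A hold-out split of that size can be sketched with the standard library alone. The function name train_test_split, the 25% default and the fixed seed are assumptions made for this illustration, and the integers stand in for labeled tweets.

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    # Shuffle a copy so the caller's list is untouched, then take the first
    # n_test items as the test set and the rest as the training set.
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(round(len(items) * test_fraction))
    return items[n_test:], items[:n_test]

data = list(range(4000))  # stand-in for 4000 labeled tweets
train, test = train_test_split(data, test_fraction=0.25)

print(len(train), len(test))  # 3000 1000
print(30 / 4000)              # 0.0075, i.e. 0.75%
```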

A second point that comes to mind is that when your classifier is biased towards one class, make sure each class is represented nearly equally in the training and validation set. This prevents the classifier from 'just' learning the distribution of the whole set, instead of learning which features are relevant.
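One way to check the balance is to look at the share of each label before training. This is a small sketch; class_shares is a hypothetical helper and the label list is made up.

```python
from collections import Counter

def class_shares(labels):
    # Fraction of examples carrying each label.
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

train_labels = ['pos'] * 90 + ['neg'] * 10   # made-up, heavily skewed
print(class_shares(train_labels))            # {'pos': 0.9, 'neg': 0.1}
```

A classifier that always answers 'pos' would score 90% accuracy on data like this without having learned anything about the features.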

As a final note, normally we report metrics such as precision, recall and F1 (F_β with β = 1) to evaluate our classifier. The code in your sample seems to report something based on the global sentiment in all tweets. Are you sure that is what you want? Are the tweets a representative collection?
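For reference, these metrics can be computed by hand for one class treated as the positive one. The function below is a sketch with made-up labels; its name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive):
    # Count true positives, false positives and false negatives for one class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = ['pos', 'pos', 'neg', 'neg']
y_pred = ['pos', 'neg', 'pos', 'neg']
print(precision_recall_f1(y_true, y_pred, 'pos'))  # (0.5, 0.5, 0.5)
```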

Why did NLTK NaiveBayes classifier misclassify one record?

Here is the modified code for you:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
from nltk.corpus import stopwords

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set) 

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad."
sentence = sentence.lower()
sentences = sentence.split('.')   # this is actually a list of sentences

for sent in sentences:
    if sent != "":
        words = [word for word in sent.split(" ") if word not in stopwords.words('english')]
        classResult = classifier.classify(word_feats(words))
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
        print(str(sent) + ' --> ' + str(classResult))
        print()

I changed the part where you pass a 'list of words' as input to your classifier. You actually need to pass the sentences one by one, which means you need to iterate over a 'list of sentences'.

Also, for each sentence, you need to pass 'words as features', which means you need to split the sentence on whitespace.

Also, if you want your classifier to work properly for sentiment analysis, you need to give less weight to stop words such as "it", "they" and "is", because these words are not sufficient to decide whether a sentence is positive, negative or neutral.

The code above produces the following output:

awesome movie --> pos

 i like it --> pos

 it is so bad --> neg

So for any classifier, the input format should be the same for training and for prediction. During training you provide a list of words, so use the same method to convert your test set as well.
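One way to guarantee this is to route both training text and new text through a single conversion function. The sketch below shows only that preprocessing path. STOP_WORDS is a tiny illustrative set standing in for NLTK's English stop word list, and sentence_to_feats and text_to_featuresets are hypothetical names.

```python
STOP_WORDS = {'i', 'it', 'is', 'so', 'the', 'was'}  # tiny illustrative list

def word_feats(words):
    return {word: True for word in words}

def sentence_to_feats(sentence):
    # Lowercase, split on whitespace, drop stop words, then build features.
    words = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    return word_feats(words)

def text_to_featuresets(text):
    # One feature dict per non-empty sentence.
    return [sentence_to_feats(s) for s in text.split('.') if s.strip()]

print(text_to_featuresets("Awesome movie. I like it. It is so bad."))
# [{'awesome': True, 'movie': True}, {'like': True}, {'bad': True}]
```

Each dictionary in the result can then be passed to classifier.classify in turn.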

Sentiment Analysis, Naive Bayes Accuracy

Well, as the error message says, the classifier you are trying to use (NaiveBayesClassifier) doesn't have the method classify_many that the nltk.classify.util.accuracy function requires.

(Reference: https://www.nltk.org/_modules/nltk/classify/naivebayes.html)

Now, that looks like an NLTK bug, but you can get your answer easily on your own:

from sklearn.metrics import accuracy_score

y_predicted = [classifier.classify(x) for x in proc_set]

accuracy = accuracy_score(y_true, y_predicted)

Where y_true are the sentiment values corresponding to proc_set inputs (which I don't see you actually creating in your code shown above, though).

Hope that helps.

EDIT:

Or, without using the sklearn accuracy function, but pure Python:

hits = [yp == yt for yp, yt in zip(y_predicted, y_true)]

accuracy = sum(hits)/len(hits)
