Implementing Bayesian classifier in Ruby?
Ilya Grigorik gives a nice answer to this problem in his blog post on Bayesian classifiers.
Additionally, you may wish to take a look at the ai4r rubygem for some alternatives to Bayesian classifiers.
ID3 is a good choice because it produces a decision tree that is "understandable" even to someone without any real background in machine learning techniques.
What does a Bayesian Classifier score represent?
It's the logarithm of a probability. With a large training set, the actual probabilities are very small numbers, so their logarithms are easier to compare. Theoretically, scores range from infinitesimally close to zero down to negative infinity. 10**score * 100.0
will give you the actual probability as a percentage, which indeed has a maximum difference of 100.
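As a quick sketch of that conversion (assuming, as the expression above implies, that the score is a base-10 log of a probability; the scores here are made-up examples):

```ruby
# Hypothetical scores, assumed to be base-10 logs of probabilities.
score_a = -1.0   # corresponds to a probability of 0.1
score_b = -3.0   # corresponds to a probability of 0.001

# Convert a log score back into a percentage.
def score_to_percent(score)
  10**score * 100.0
end

puts score_to_percent(score_a)  # 10.0
puts score_to_percent(score_b)  # 0.1
```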
Implementation details of a Bayesian classifier
Usually the way you handle this is by taking logs and using adds, then doing an exp at the end if you want to get back into probability space:
p1 * p2 * p3 * ... * pn = exp(log(p1) + log(p2) + log(p3) + ... + log(pn))
You avoid underflow by working in log space.
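A small Ruby sketch of the same idea (the probabilities are toy values chosen for illustration):

```ruby
probs = [1e-5, 2e-7, 5e-6]

# Naive product: fine for a few factors, but it underflows to 0.0
# once enough small probabilities pile up.
direct = probs.reduce(:*)

# Log-space version: sum the logs, exponentiate once at the end.
log_sum   = probs.sum { |p| Math.log(p) }
recovered = Math.exp(log_sum)

# Demonstration of the underflow the log trick avoids: the true
# product here is 1e-600, far below the smallest Float.
underflowed = [1e-200, 1e-200, 1e-200].reduce(:*)  # => 0.0
```

For very long products even the final exp can underflow, which is why classifiers typically compare the log scores directly rather than converting back.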
Training Naive Bayes Classifier on ngrams
If you're OK with Python, I'd say NLTK would be perfect for you.
For example:
>>> import nltk
>>> s = "This is some sample data. Nltk will use the words in this string to make ngrams. I hope that this is useful.".split()
>>> model = nltk.NgramModel(2, s)
>>> model._ngrams
set([('to', 'make'), ('sample', 'data.'), ('the', 'words'), ('will', 'use'), ('some', 'sample'), ('', 'This'), ('use', 'the'), ('make', 'ngrams.'), ('ngrams.', 'I'), ('hope', 'that'), ('is', 'some'), ('is', 'useful.'), ('I', 'hope'), ('this', 'string'), ('Nltk', 'will'), ('words', 'in'), ('this', 'is'), ('data.', 'Nltk'), ('that', 'this'), ('string', 'to'), ('in', 'this'), ('This', 'is')])
You even have an nltk.NaiveBayesClassifier class.
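If you end up doing this in Ruby instead, the bigram extraction itself is easy to sketch with plain Enumerable#each_cons (no NLTK involved; this just mirrors the set shown above):

```ruby
words = "This is some sample data. Nltk will use the words in this string to make ngrams. I hope that this is useful.".split

# Consecutive word pairs, analogous to the bigram set above.
bigrams = words.each_cons(2).to_a

bigrams.first(3)
# => [["This", "is"], ["is", "some"], ["some", "sample"]]
```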
What would be a good language to implement a naive bayes classifier from scratch?
I would do it in C#, but that's only because it's the language I'm most familiar with at the moment, and because I know it has strong string handling. It could also be done in C++ with std::string, or in Ruby, Java, etc.
If I were building a naive Bayes classifier, I'd start with a simple example, like the one in Russell & Norvig's book (the one I learned from way back when, in the second edition) or the one in Mitchell's book (I used his because he taught the class). Make your learner generate rules in a general fashion: given input data, produce output rules, where the input data is something generalizable (it could be a block of text for spam detection, or a weather report used to predict whether someone's going to play tennis).
If you're trying to learn Bayes classifiers, a simple example like this is better to start with than a full-blown spam filter. Language parsing is hard in and of itself, and then determining whether or not there's garbage language is also difficult. Better to have a simple, small dataset, one where you can derive how your learner should learn and make sure that your program matches what you want it to do. Then, you can grow your dataset, or modify your program to incorporate things like language parsing.
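To make the "simple example first" advice concrete, here is a rough Ruby sketch of a naive Bayes learner over a tiny, made-up play-tennis-style dataset. The data, feature layout, and add-one smoothing are all illustrative choices of mine, not taken from either book:

```ruby
# Toy dataset (hypothetical): [outlook, humidity, play?]
DATA = [
  ['sunny',    'high',   :no],
  ['sunny',    'normal', :yes],
  ['overcast', 'high',   :yes],
  ['rain',     'high',   :no],
  ['rain',     'normal', :yes],
  ['overcast', 'normal', :yes]
]

# Count class frequencies and per-feature value frequencies.
def train(rows)
  model = {}
  rows.each do |*features, label|
    entry = model[label] ||= { count: 0, feature_counts: Array.new(features.size) { Hash.new(0) } }
    entry[:count] += 1
    features.each_with_index { |value, i| entry[:feature_counts][i][value] += 1 }
  end
  model
end

# Pick the label maximizing log prior + sum of log likelihoods,
# using add-one (Laplace) smoothing so unseen values never hit log(0).
def classify(model, features)
  total = model.values.sum { |e| e[:count] }
  best = model.max_by do |_label, e|
    score = Math.log(e[:count].to_f / total)
    features.each_with_index do |value, i|
      vocab = model.values.flat_map { |x| x[:feature_counts][i].keys }.uniq.size
      score += Math.log((e[:feature_counts][i][value] + 1.0) / (e[:count] + vocab))
    end
    score
  end
  best.first
end

model = train(DATA)
classify(model, ['sunny', 'normal'])  # => :yes
```

A dataset this small lets you check every probability by hand, which is exactly the point: verify the learner on data you can reason about, then swap in a bigger dataset or a text-parsing front end.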
Naive Bayesian for Topic detection using Bag of Words approach
Existing Implementations of Naive Bayes
You would probably be better off just using one of the existing packages that supports document classification using naive Bayes, e.g.:
Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
Perl - Perl has the Algorithm::NaiveBayes module, complete with a sample usage snippet in the package synopsis.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J. You can see a training and scoring code snippet here.
Bootstrapping Classification from Keywords
It sounds like you want to start with a set of keywords that are known to cue for certain topics and then use those keywords to bootstrap a classifier.
This is a reasonably clever idea. Take a look at the paper Text Classification by Bootstrapping with Keywords, EM and Shrinkage by McCallum and Nigam (1999). By following this approach, they were able to improve classification accuracy from the 45% they got using hard-coded keywords alone to 66% using a bootstrapped Naive Bayes classifier. For their data, the latter is close to human levels of agreement, since people agreed with each other about document labels 72% of the time.