Use Scikit-Learn to Classify into Multiple Categories

Use scikit-learn to classify into multiple categories

What you want is called multi-label classification. scikit-learn can do that. See here: http://scikit-learn.org/dev/modules/multiclass.html.

I'm not sure what's going wrong in your example; my version of sklearn apparently doesn't have WordNGramAnalyzer. Perhaps it's a question of using more training examples or trying a different classifier? Note, though, that current scikit-learn expects a multilabel target as a binary indicator matrix; lists of label lists can be converted with MultiLabelBinarizer, as in the example below.

The following works for me:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train = [[0], [0], [0], [0], [0], [0], [1], [1], [1], [1], [1], [1], [0, 1], [0, 1]]
X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

# Convert the lists of labels into a binary indicator matrix
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y_train)
predicted = classifier.predict(X_test)
for item, labels in zip(X_test, mlb.inverse_transform(predicted)):
    print('%s => %s' % (item, ', '.join(target_names[x] for x in labels)))

For me, this produces the output:

nice day in nyc => New York
welcome to london => London
hello welcome to new york. enjoy it here and london too => New York, London

Classify text into multiple categories from scikit learn

If you use an SVM, this question at Cross Validated may get you started. The idea is to interpret the per-class classification weights (the coef_ attribute), but that is not trivial; a rough sketch follows.
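As a minimal sketch (the toy corpus and labels here are made up; with a Pipeline you would first pull the fitted vectorizer and classifier out via named_steps, and get_feature_names_out needs scikit-learn 1.0+), the largest weights point at the most indicative terms:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["new york is a hell of a town", "london is in england",
        "nyc is nice", "it rains a lot in london"]
labels = [0, 1, 0, 1]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = LinearSVC().fit(X, labels)

# For a binary LinearSVC, coef_ has shape (1, n_features): positive
# weights push towards class 1, negative ones towards class 0.
feature_names = np.asarray(vec.get_feature_names_out())
order = np.argsort(clf.coef_[0])
print("most class-0-like terms:", feature_names[order[:3]])
print("most class-1-like terms:", feature_names[order[-3:]])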

Personally, I prefer to use a RandomForestClassifier, which has feature ranking built in; it is exposed by the feature_importances_ attribute. There is even an example in the scikit-learn documentation.
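For instance, a minimal sketch on the iris data (substitute your own features and labels):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# feature_importances_ sums to 1; higher values mean the feature was
# more useful for the forest's split decisions
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print('%s: %.3f' % (name, importance))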

Using scikit-learn classifier inside nltk, multiclass case

The NLTK wrapper for scikit-learn doesn't know about multilabel classification, and it shouldn't, because it doesn't implement the MultiClassifierI interface. Implementing that would require a separate class, sketched below.
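As a rough, untested sketch of what such a class might look like (the class name and its train method are hypothetical, not part of NLTK; it assumes NLTK's MultiClassifierI interface and scikit-learn's MultiLabelBinarizer):

from nltk.classify.api import MultiClassifierI
from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

class SklearnMultiClassifier(MultiClassifierI):
    """Hypothetical wrapper: NLTK featureset dicts in, sets of labels out."""

    def __init__(self, estimator):
        self._vec = DictVectorizer()
        self._mlb = MultiLabelBinarizer()
        self._clf = OneVsRestClassifier(estimator)

    def train(self, labeled_featuresets):
        featuresets, labelsets = zip(*labeled_featuresets)
        X = self._vec.fit_transform(featuresets)
        Y = self._mlb.fit_transform(labelsets)
        self._clf.fit(X, Y)
        return self

    def labels(self):
        return list(self._mlb.classes_)

    def classify(self, featureset):
        X = self._vec.transform([featureset])
        Y = self._clf.predict(X)
        return set(self._mlb.inverse_transform(Y)[0])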

You can either implement the missing functionality along those lines, or use scikit-learn without the wrapper. Newer versions of scikit-learn have a DictVectorizer that accepts roughly the same inputs that the NLTK wrapper accepts:

from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

X_train_raw = [{'a': 1}, {'b': 1}, {'c': 1}]
y_train_raw = [('first',), ('second',), ('first', 'second')]

v = DictVectorizer()
X_train = v.fit_transform(X_train_raw)

# Current scikit-learn expects multilabel targets as a binary indicator matrix
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(y_train_raw)

clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X_train, y_train)

You can then use X_test = v.transform(X_test_raw) to transform test samples to matrices. A sklearn.pipeline.Pipeline makes this easier by tying a vectorizer and a classifier together in a single object.
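For example, a minimal sketch reusing X_train_raw and the binarized y_train from the snippet above:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vectorizer', DictVectorizer()),
    ('clf', OneVsRestClassifier(MultinomialNB()))])

pipeline.fit(X_train_raw, y_train)            # raw feature dicts go straight in
print(pipeline.predict([{'a': 1, 'c': 1}]))   # out come binary indicator rows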

Disclaimer: according to the FAQ, I should disclose my affiliation. I wrote both DictVectorizer and the NLTK wrapper for scikit-learn.

Can OneVsRestClassifier be used to produce individual binary classifier models in Python Scikit-Learn?

Once you have trained your OneVsRestClassifier model, all the binary classifiers are saved in the estimators_ attribute. Here is a quick example of how to use them:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()  # iris has 3 classes, just like your example
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

RFC = RandomForestClassifier(n_estimators=100, random_state=42)
OVRC = OneVsRestClassifier(RFC)

OVRC.fit(X_train, y_train)

Your three classifiers can be accessed via:

OVRC.estimators_[0] # label 0 vs the rest
OVRC.estimators_[1] # label 1 vs the rest
OVRC.estimators_[2] # label 2 vs the rest

Their individual predictions can be obtained as follows:

print(OVRC.estimators_[0].predict_proba(X_test[0:5]))
print(OVRC.estimators_[1].predict_proba(X_test[0:5]))
print(OVRC.estimators_[2].predict_proba(X_test[0:5]))

>>> [[1.   0.  ]
 [0.03 0.97]   # vote for label 0
 [1.   0.  ]
 [1.   0.  ]
 [1.   0.  ]]
[[0.02 0.98]   # vote for label 1
 [0.97 0.03]
 [0.97 0.03]
 [0.   1.  ]   # vote for label 1
 [0.19 0.81]]  # vote for label 1
[[0.99 0.01]
 [1.   0.  ]
 [0.   1.  ]   # vote for label 2
 [0.99 0.01]
 [0.85 0.15]]

This is consistent with the overall prediction, which is:

print(OVRC.predict_proba(X_test[0:5]))

>>> [[0.         0.98989899 0.01010101]
 [0.97       0.03       0.        ]
 [0.         0.02912621 0.97087379]
 [0.         0.99009901 0.00990099]
 [0.         0.84375    0.15625   ]]
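The mapping between the two: OneVsRestClassifier takes the positive-class column of each binary estimator's predict_proba and normalizes each row to sum to 1. A quick check, reusing OVRC and X_test from above:

import numpy as np

# Stack the positive-class probability of each binary estimator as a column
pos = np.column_stack([est.predict_proba(X_test[0:5])[:, 1]
                       for est in OVRC.estimators_])

# Row-normalize; this reproduces OVRC.predict_proba(X_test[0:5])
print(pos / pos.sum(axis=1, keepdims=True))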

Why do predictions and scores return different results in classification using scikit-learn?

For multilabel classification you should use

y_pred_ = np.where(classifier.decision_function(X_test) > 0, 1, 0)

to replicate the output of the predict() method, since in this case the classes are not mutually exclusive, i.e. a given sample can belong to multiple classes at once.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report

# Load the data
iris = load_iris()
X = iris.data
y = label_binarize(iris.target, classes=[0, 1, 2])

# Split the data into training and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Create classifier
classifier = OneVsRestClassifier(
    make_pipeline(StandardScaler(), LinearSVC(random_state=0))
)

# Train the model
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)
y_pred_ = np.where(classifier.decision_function(X_test) > 0, 1, 0)

print(classification_report(y_test, y_pred))
#               precision    recall  f1-score   support
#            0       1.00      1.00      1.00        21
#            1       0.58      0.37      0.45        30
#            2       0.95      0.83      0.89        24
#    micro avg       0.85      0.69      0.76        75
#    macro avg       0.84      0.73      0.78        75
# weighted avg       0.82      0.69      0.74        75
#  samples avg       0.66      0.69      0.67        75

print(classification_report(y_test, y_pred_))
#               precision    recall  f1-score   support
#            0       1.00      1.00      1.00        21
#            1       0.58      0.37      0.45        30
#            2       0.95      0.83      0.89        24
#    micro avg       0.85      0.69      0.76        75
#    macro avg       0.84      0.73      0.78        75
# weighted avg       0.82      0.69      0.74        75
#  samples avg       0.66      0.69      0.67        75

For multiclass classification you can instead use

y_pred_ = np.argmax(classifier.decision_function(X_test), axis=1)

as in your code, since in this case the classes are mutually exclusive, i.e. each sample is assigned to exactly one class.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report

# Load the data
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Create classifier
classifier = OneVsRestClassifier(
    make_pipeline(StandardScaler(), LinearSVC(random_state=0))
)

# Train the model
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)
y_pred_ = np.argmax(classifier.decision_function(X_test), axis=1)

print(classification_report(y_test, y_pred))
#               precision    recall  f1-score   support
#            0       1.00      1.00      1.00        21
#            1       0.85      0.73      0.79        30
#            2       0.71      0.83      0.77        24
#     accuracy                           0.84        75
#    macro avg       0.85      0.86      0.85        75
# weighted avg       0.85      0.84      0.84        75

print(classification_report(y_test, y_pred_))
#               precision    recall  f1-score   support
#            0       1.00      1.00      1.00        21
#            1       0.85      0.73      0.79        30
#            2       0.71      0.83      0.77        24
#     accuracy                           0.84        75
#    macro avg       0.85      0.86      0.85        75
# weighted avg       0.85      0.84      0.84        75

