Predict Classes or Class Probabilities

Predict classes or class probabilities?

In principle & in theory, hard & soft classification (i.e. returning classes & probabilities respectively) are different approaches, each one with its own merits & downsides. Consider for example the following, from the paper Hard or Soft Classification? Large-margin Unified Machines:

Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers while some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on estimated probabilities. In contrast, hard classifiers directly target on the classification decision boundary without producing the probability estimation. These two types of classifiers are based on different philosophies and each has its own merits.

That said, in practice, most of the classifiers used today, including Random Forest (the only exception I can think of is the SVM family) are in fact soft classifiers: what they actually produce underneath is a probability-like measure, which subsequently, combined with an implicit threshold (usually 0.5 by default in the binary case), gives a hard class membership like 0/1 or True/False.

What is the right way to get the classified prediction result?

For starters, it is always possible to go from probabilities to hard classes, but the opposite is not true.

Generally speaking, and given the fact that your classifier is in fact a soft one, getting just the end hard classifications (True/False) gives a "black box" flavor to the process, which in principle should be undesirable; handling directly the produced probabilities, and (important!) controlling explicitly the decision threshold should be the preferable way here. According to my experience, these are subtleties that are often lost to new practitioners; consider for example the following, from the Cross Validated thread Reduce Classification probability threshold:

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

Apart from "soft" arguments (pun unintended) like the above, there are cases where you need to handle directly the underlying probabilities and thresholds, i.e. cases where the default threshold of 0.5 in binary classification will lead you astray, most notably when your classes are imbalanced; see my answer in High AUC but bad predictions with imbalanced data (and the links therein) for a concrete example of such a case.

To be honest, I am rather surprised by the behavior of H2O you report (I haven't use it personally), i.e. that the kind of the output is affected by the representation of the input; this should not be the case, and if it is indeed, we may have an issue of bad design. Compare for example the Random Forest classifier in scikit-learn, which includes two different methods, predict and predict_proba, to get the hard classifications and the underlying probabilities respectively (and checking the docs, it is apparent that the output of predict is based on the probability estimates, which have been computed already before).

If probabilities are the outcomes for numerical target values, then how do I handle it in case of a multiclass classification?

There is nothing new here in principle, apart from the fact that a simple threshold is no longer meaningful; again, from the Random Forest predict docs in scikit-learn:

the predicted class is the one with highest mean probability estimate

That is, for 3 classes (0, 1, 2), you get an estimate of [p0, p1, p2] (with elements summing up to one, as per the rules of probability), and the predicted class is the one with the highest probability, e.g. class #1 for the case of [0.12, 0.60, 0.28]. Here is a reproducible example with the 3-class iris dataset (it's for the GBM algorithm and in R, but the rationale is the same).

How to get independent probabilities of all classes for each sample with predict_proba?

Random forest is an ensemble method. Basically it builds individual decision trees with different subsets of the data (called bagging) and averages predictions across all trees to give you the probabilities. The help page is actually a good place to start:

In averaging methods, the driving principle is to build several
estimators independently and then to average their predictions. On
average, the combined estimator is usually better than any of the
single base estimator because its variance is reduced.
Examples: Bagging methods, Forests of randomized trees, …

Hence the probabilities will always sum up to one. Below is an example of how you access individual prediction of each tree:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10)
model.fit(X_train, y_train)

pred = model.predict_proba(X_test)
pred[:5,:]

array([[0. , 1. , 0. ],
       [1. , 0. , 0. ],
       [0. , 0. , 1. ],
       [0. , 0.9, 0.1],
       [0. , 0.9, 0.1]])

This is the prediction for the first tree:

model.estimators_[0].predict(X_test)
Out[42]: 
array([1., 0., 2., 2., 1., 0., 1., 2., 2., 1., 2., 0., 0., 0., 0., 2., 2.,
       1., 1., 2., 0., 2., 0., 2., 2., 2., 2., 2., 0., 0., 0., 0., 1., 0.,
       0., 2., 1., 0., 0., 0., 2., 2., 1., 0., 0., 1., 1., 2., 1., 2.])

We tally across all trees:

result = np.zeros((len(X_test),3))
for i in range(len(model.estimators_)):
    p = model.estimators_[i].predict(X_test).astype(int)
    result[range(len(X_test)),p] += 1

result[:5,:]
Out[63]: 
array([[ 0., 10.,  0.],
       [10.,  0.,  0.],
       [ 0.,  0., 10.],
       [ 0.,  9.,  1.],
       [ 0.,  9.,  1.]])

Dividing this by the number of trees gives the probability you obtained before:

result/10
Out[65]: 
array([[0. , 1. , 0. ],
       [1. , 0. , 0. ],
       [0. , 0. , 1. ],
       [0. , 0.9, 0.1],
       [0. , 0.9, 0.1],

How to get probabilities along with classification in LogisticRegression?

For most models in scikit-learn, we can get the probability estimates for the classes through predict_proba. Bear in mind that this is the actual output of the logistic function, the resulting classification is obtained by selecting the output with highest probability, i.e. an argmax is applied on the output. If we see the implementation here, you can see that it is essentially doing:

def predict(self, X):
    # decision func on input array
    scores = self.decision_function(X)
    # column indices of max values per row
    indices = scores.argmax(axis=1)
    # index class array using indices
    return self.classes_[indices]

In the case of calling predict_proba rather than predict, scores is returned. Here's an example use case training a LogisticRegression:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

lr= LogisticRegression()
lr.fit(X_train, y_train)
y_pred_prob = lr.predict_proba(X_test)

y_pred_prob
array([[1.06906558e-02, 9.02308167e-01, 8.70011771e-02],
       [2.57953117e-06, 7.88832490e-03, 9.92109096e-01],
       [2.66690975e-05, 6.73454730e-02, 9.32627858e-01],
       [9.88612145e-01, 1.13878133e-02, 4.12714660e-08],
       ...

And we can obtain the probabilities by taking the argmax, as mentioned, and index the array of classes as:

classes = load_iris().target_names
classes[indices]
array(['virginica', 'virginica', 'versicolor', 'virginica', 'setosa',
       'versicolor', 'versicolor', 'setosa', 'virginica', 'setosa',...

So for a single prediction, through the predicted probabilities we could easily do something like:

y_pred_prob = lr.predict_proba(X_test[0,None])
ix = y_pred_prob.argmax(1).item()

print(f'predicted class = {classes[ix]} and confidence = {y_pred_prob[0,ix]:.2%}')
# predicted class = virginica and confidence = 90.75%

How to get the 'predict_proba' for the class predicted by 'predict' in Random Forest Classifier?

The predict_proba() method returns a two-dimensional array, containing the estimated probabilities for each instance and each class:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1, 2, 3], 
              [4, 5, 6], 
              [7, 8, 9], 
              [10, 11, 12]])
y = np.array([0, 0, 1, 1])

model = RandomForestClassifier()
model.fit(X, y)

model.predict_proba(X)

array([[0.91, 0.09],
       [0.91, 0.09],
       [0.25, 0.75],
       [0.05, 0.95]])

As you note, for each instance the predicted class is the class with the maximum probability. So one simple way to get the estimated probabilities for the predicted classes is to use np.max():

np.max(model.predict_proba(X), axis=1)

array([0.91, 0.91, 0.75, 0.95])

Predicted class probabilities for both classes in R

Since this is a binary classification, the Prob(C2) = 1 - Prob(C1)

How to get the probability and label for each class?

We can zip the labelencoder.classes_ and confidence_score and pass the zip object to dict in order to create a dictionary

dict(zip(labelencoder.classes_, confidence_score.squeeze()))

In case you want to predict multiple samples in one go

[dict(zip(labelencoder.classes_, cs)) for cs in confidence_score]

Predict Classes or Class Probabilities