How Does Sklearn.Svm.Svc's Function Predict_Proba() Work Internally

How does sklearn.svm.svc's function predict_proba() work internally?

Scikit-learn uses LibSVM internally, and this in turn uses Platt scaling, as detailed in this note by the LibSVM authors, to calibrate the SVM to produce probabilities in addition to class predictions.

Platt scaling requires first training the SVM as usual, then optimizing two additional parameters A and B such that

P(y|X) = 1 / (1 + exp(A * f(X) + B))

where f(X) is the signed distance of a sample from the hyperplane (scikit-learn's decision_function method). You may recognize the logistic sigmoid in this definition, the same function that logistic regression and neural nets use for turning decision functions into probability estimates.

Mind you: the B parameter, the "intercept" or "bias" or whatever you like to call it, can cause predictions based on probability estimates from this model to be inconsistent with the ones you get from the SVM decision function f. E.g. suppose that f(X) = 10, then the prediction for X is positive; but if B = -9.9 and A = 1, then P(y|X) = .475. I'm pulling these numbers out of thin air, but you've noticed that this can occur in practice.
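
Checking that arithmetic against the formula above (a quick sanity check, nothing more):

import numpy as np

A, B, f_X = 1.0, -9.9, 10.0
print(1.0 / (1.0 + np.exp(A * f_X + B)))   # ~0.475, even though f(X) = 10 is clearly positive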

Effectively, Platt scaling trains a probability model on top of the SVM's outputs under a cross-entropy loss function. To prevent this model from overfitting, it uses an internal five-fold cross validation, meaning that training SVMs with probability=True can be quite a lot more expensive than a vanilla, non-probabilistic SVM.
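
To make the two-step procedure concrete, here is a minimal sketch. The data and the values of A and B below are made up for illustration; in practice libsvm estimates A and B itself on held-out folds.

import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data; any binary problem will do.
rng = np.random.RandomState(0)
X_train = rng.randn(100, 2)
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Step 1: train the SVM as usual.
svm = SVC(kernel="linear").fit(X_train, y_train)
f = svm.decision_function(X_train)             # signed distance to the hyperplane

# Step 2: map f(X) through the fitted sigmoid. A and B here are made-up
# illustrative values, not the ones libsvm would actually learn.
A, B = -1.5, 0.1
platt_proba = 1.0 / (1.0 + np.exp(A * f + B))  # estimate of P(y=1 | X)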

Why is the predict_proba function of sklearn.svm.SVC giving a probability greater than 1?

Please notice the e-06 or e-08 suffixes in those probabilities. That is scientific notation: e-08 means ×10^(-8). So the values you think are greater than 1 are in fact very, very small.

For example:

2.798594e-06 = 0.000002798594

Similarly,

7.7173288137e-08 = 0.000000077173288137

So when you sum those values you will get 1. If not exactly 1, it will be something like 0.99999999..., which is expected due to rounding of the displayed results.

So the predict_proba results are not inconsistent. They are actually correct.

Now, as for why the predicted result does not match the class with the highest predicted probability: that is described in the documentation and is expected behaviour due to the internals of the algorithm. Please look at the documentation:

  • http://scikit-learn.org/dev/modules/svm.html#scores-and-probabilities

The probability estimates may be inconsistent with the scores, in the
sense that the “argmax” of the scores may not be the argmax of the
probabilities. (E.g., in binary classification, a sample may be
labeled by predict as belonging to a class that has probability <½
according to predict_proba.)
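
Both points are easy to check yourself. A hedged snippet on toy data (nothing here is specific to your setup):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy data purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC(probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X)
print(np.allclose(proba.sum(axis=1), 1.0))    # rows sum to 1 (up to rounding)

# predict() is based on decision_function, not on predict_proba, so the
# argmax of the probabilities can occasionally disagree with it.
disagree = np.sum(clf.predict(X) != clf.classes_[proba.argmax(axis=1)])
print("samples where predict and argmax(predict_proba) differ:", disagree)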

How does sklearn's MLP predict_proba function work internally?

Looking within the source code, I found:

def _initialize(self, y, layer_units):

    # set all attributes, allocate weights etc for first call
    # Initialize parameters
    self.n_iter_ = 0
    self.t_ = 0
    self.n_outputs_ = y.shape[1]

    # Compute the number of layers
    self.n_layers_ = len(layer_units)

    # Output for regression
    if not is_classifier(self):
        self.out_activation_ = 'identity'
    # Output for multi class
    elif self._label_binarizer.y_type_ == 'multiclass':
        self.out_activation_ = 'softmax'
    # Output for binary class and multi-label
    else:
        self.out_activation_ = 'logistic'

It seems that MLPClassifier uses a logistic function for binary and multi-label classification, and a softmax function for multiclass classification, in order to build the output layer. This suggests that the output of the net is a probability vector, from which the net deduces its predictions.

If I look at the predict_proba method:

def predict_proba(self, X):
    """Probability estimates.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input data.

    Returns
    -------
    y_prob : ndarray of shape (n_samples, n_classes)
        The predicted probability of the sample for each class in the
        model, where classes are ordered as they are in `self.classes_`.
    """
    check_is_fitted(self)
    y_pred = self._predict(X)

    if self.n_outputs_ == 1:
        y_pred = y_pred.ravel()

    if y_pred.ndim == 1:
        return np.vstack([1 - y_pred, y_pred]).T
    else:
        return y_pred

That confirms that a softmax or logistic activation is applied at the output layer, so that the network outputs a probability vector.
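
A quick check of this behaviour on toy data (the dataset and hyperparameters below are arbitrary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Toy data purely for illustration.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000,
                    random_state=0).fit(X, y)

proba = clf.predict_proba(X)                   # softmax output over the 3 classes
print(np.allclose(proba.sum(axis=1), 1.0))     # each row is a probability vector
# Unlike SVC, predict() here is just the argmax of these probabilities.
print(np.array_equal(clf.predict(X), clf.classes_[proba.argmax(axis=1)]))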

Hope this helps.

Confusing probabilities of the predict_proba of scikit-learn's svm

As the documentation states, there is no guarantee that predict_proba and predict will give consistent results on SVC.
You can simply use decision_function. That is true for both linear and kernel SVM.
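
For example, if all you need is an ordering of samples or classes that agrees with predict, the raw decision_function is enough (a sketch on toy binary data):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Illustrative binary data; the point is only how the two methods relate.
X, y = make_classification(n_samples=150, n_features=4, random_state=1)
clf = SVC(kernel="rbf").fit(X, y)              # no probability=True needed

scores = clf.decision_function(X)              # consistent with predict()
pred_from_scores = clf.classes_[(scores > 0).astype(int)]
print(np.array_equal(pred_from_scores, clf.predict(X)))   # expected: True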

How can I know the probability of the class predicted by the predict() function in a Support Vector Machine?

Use clf.predict_proba([fv]) to obtain a list with predicted probabilities per class. However, this function is not available for all classifiers.

Regarding your comment, consider the following:

>>> prob = [0.01357713, 0.00662571, 0.00782155, 0.3841413, 0.07487401, 0.09861277, 0.00644468, 0.40790285]
>>> sum(prob)
1.0

The probabilities sum to 1.0, so multiply by 100 to get percentages.
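
Putting it together for SVC: note that probability=True must be set when constructing the estimator, otherwise predict_proba is not available. The dataset below is just a stand-in for your own, and fv is a placeholder feature vector:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(probability=True, random_state=0).fit(X, y)   # probability=True is required

fv = X[0]                              # one feature vector (placeholder)
prob = clf.predict_proba([fv])[0]      # one probability per class, ordered as clf.classes_
print(prob, prob.sum())                # sums to 1.0
print((prob * 100).round(2))           # the same values as percentages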

How does the predict_proba() function in LightGBM work internally?

LightGBM, like all gradient boosting methods for classification, essentially combines decision trees and logistic regression. We start with the same logistic function for turning scores into probabilities (its multiclass generalization is the softmax):

P(y = 1 | X) = 1 / (1 + exp(-Xw))

The interesting twist is that the feature matrix X is composed from the terminal nodes of a decision tree ensemble. These are all then weighted by w, a parameter that must be learned. The mechanism used to learn the weights depends on the precise learning algorithm used, and so does the construction of X. LightGBM, for example, introduced two novel features which won it performance improvements over XGBoost: "Gradient-based One-Side Sampling" and "Exclusive Feature Bundling". Generally though, each row collects the terminal leaves reached by one sample and each column represents one terminal leaf.

So here is what the docs could say...

Probability estimates.

The predicted class probabilities of an input sample are computed as the
softmax of the weighted terminal leaves from the decision tree ensemble corresponding to the provided sample.

For further details, you'd have to delve into the details of boosting, XGBoost, and finally the LightGBM paper, but that seems a bit heavy-handed given the other documentation examples you've given.
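
As a rough sketch of that description for the binary case (the leaf values below are made up; with the real library you can compare predict_proba against the sigmoid of the raw score, e.g. clf.booster_.predict(X, raw_score=True), assuming a binary LGBMClassifier):

import numpy as np

# Made-up terminal-leaf outputs for one sample, one value per tree in the ensemble.
leaf_values = np.array([0.3, -0.1, 0.25, 0.05, -0.2])

# Boosting sums the (already weighted/shrunk) leaf outputs into a raw score...
raw_score = leaf_values.sum()

# ...and the binary predict_proba is just the logistic sigmoid of that raw score.
p_class1 = 1.0 / (1.0 + np.exp(-raw_score))
print(p_class1, 1.0 - p_class1)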

Converting LinearSVC's decision function to probabilities (scikit-learn, Python)

I took a look at the APIs in the sklearn.svm.* family. All of the models below, e.g.,

  • sklearn.svm.SVC
  • sklearn.svm.NuSVC
  • sklearn.svm.SVR
  • sklearn.svm.NuSVR

have a common interface that supplies a

probability: boolean, optional (default=False) 

parameter to the model. If this parameter is set to True, libsvm will train a probability transformation model on top of the SVM's outputs, based on the idea of Platt scaling. The form of the transformation is similar to a logistic function, as you pointed out; however, two specific constants A and B are learned in a post-processing step. Also see this Stack Overflow post for more details.


I actually don't know why this post-processing is not available for LinearSVC. Otherwise, you would just call predict_proba(X) to get the probability estimate.

Of course, if you just apply a naive logistic transform, it will not perform as well as a calibrated approach like Platt scaling. If you can understand the underlying algorithm of Platt scaling, you can probably write your own or contribute to the scikit-learn svm family. :) Also feel free to use the above four SVM variations that support predict_proba.
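
If you would rather not implement Platt scaling yourself, scikit-learn's CalibratedClassifierCV can wrap LinearSVC and fit the same kind of sigmoid on top of its decision function. A minimal sketch on toy data (defaults left in place, adjust for your problem):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy data purely for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# method="sigmoid" is Platt scaling; the calibration uses internal cross-validation.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5).fit(X, y)
proba = calibrated.predict_proba(X)          # LinearSVC-based probability estimates
print(proba[:3])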

Scikit-learn predict_proba gives wrong answers

If you use svm.LinearSVC() as the estimator, and .decision_function() (which is like svm.SVC's .predict_proba()) for sorting the results from the most probable class to the least probable one, the ordering agrees with the .predict() function. Plus, this estimator is faster and gives almost the same results as svm.SVC().

The only drawback for you might be that .decision_function() gives a signed value, something like between -1 and 3, instead of a probability value, but it agrees with the prediction.


