Comparison of R, Statmodels, Sklearn for a Classification Task with Logistic Regression

Comparison of R, statmodels, sklearn for a classification task with logistic regression

I ran into a similar issue and ended up posting about it on /r/MachineLearning. It turns out the difference can be attributed to data standardization. Whatever approach scikit-learn is using to find the parameters of the model will yield better results if the data is standardized. scikit-learn has some documentation discussing preprocessing data (including standardization), which can be found here.

Results

Number of 'default' values : 333
Intercept: [-6.12556565]
Coefficients: [[ 2.73145133  0.27750788]]

Confusion matrix
[[9629   38]
 [ 225  108]]

Score          0.9737
Precision      0.7397
Recall         0.3243

Code

# scikit-learn vs. R
# http://stackoverflow.com/questions/28747019/comparison-of-r-statmodels-sklearn-for-a-classification-task-with-logistic-reg

import pandas as pd
import sklearn

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing

# Data is available here.
Default = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col = 0)

Default['default'] = Default['default'].map({'No':0, 'Yes':1})
Default['student'] = Default['student'].map({'No':0, 'Yes':1})

I = Default['default'] == 0
print("Number of 'default' values : {0}".format(Default[~I]['balance'].count()))

feats = ['balance', 'income']

Default[feats] = preprocessing.scale(Default[feats])

# C = 1e6 ~ no regularization.
classifier = LogisticRegression(C = 1e6, random_state = 42) 

classifier.fit(Default[feats], Default['default'])  #fit classifier on whole base
print("Intercept: {0}".format(classifier.intercept_))
print("Coefficients: {0}".format(classifier.coef_))

y_true = Default['default']
y_pred_cls = classifier.predict_proba(Default[feats])[:,1] > 0.5

confusion = confusion_matrix(y_true, y_pred_cls)
score = float((confusion[0, 0] + confusion[1, 1])) / float((confusion[0, 0] + confusion[1, 1] + confusion[0, 1] + confusion[1, 0]))
precision = float((confusion[1, 1])) / float((confusion[1, 1] + confusion[0, 1]))
recall = float((confusion[1, 1])) / float((confusion[1, 1] + confusion[1, 0]))
print("\nConfusion matrix")
print(confusion)
print('\n{s:{c}<{n}}{num:2.4}'.format(s = 'Score', n = 15, c = '', num = score))
print('{s:{c}<{n}}{num:2.4}'.format(s = 'Precision', n = 15, c = '', num = precision))
print('{s:{c}<{n}}{num:2.4}'.format(s = 'Recall', n = 15, c = '', num = recall))

Comparison of R and scikit-learn for a classification task with logistic regression

After some more reading I understand that scikit-learn implements a regularized logistic regression model, whereas glm in R is not regularized. Statsmodels' GLM implementation (python) is unregularized and gives identical results as in R.

http://statsmodels.sourceforge.net/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html#statsmodels.genmod.generalized_linear_model.GLM

The R package LiblineaR is similar to scikit-learn's logistic regression (when using 'liblinear' solver).

https://cran.r-project.org/web/packages/LiblineaR/

Coefficients for Logistic Regression scikit-learn vs statsmodels

There are some issues with your code.

To start with, the two models you show here are not equivalent: although you fit your scikit-learn LogisticRegression with fit_intercept=True (which is the default setting), you don't do so with your statsmodels one; from the statsmodels docs:

An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.

It seems that this is a frequent point of confusion - see for example scikit-learn & statsmodels - which R-squared is correct? (and own answer there as well).

The other issue is that, although you are in a binary classification setting, you ask for multi_class='multinomial' in your LogisticRegression, which should not be the case.

The third issue is that, as explained in the relevant Cross Validated thread Logistic Regression: Scikit Learn vs Statsmodels:

There is no way to switch off regularization in scikit-learn, but you can make it ineffective by setting the tuning parameter C to a large number.

which makes the two models again non-comparable in principle, but you have successfully addressed it here by setting C=1e8. In fact, since then (2016), scikit-learn has indeed added a way to switch regularization off, by setting penalty='none' since, according to the docs:

If ‘none’ (not supported by the liblinear solver), no regularization is applied.

which should now be considered the canonical way to switch off the regularization.

So, incorporating these changes in your code, we have:

np.random.seed(42) # for reproducibility

#### Statsmodels
# first artificially add intercept to x, as advised in the docs:
x_ = sm.add_constant(x)
res_sm = sm.Logit(y, x_).fit(method="ncg", maxiter=max_iter) # x_ here
print(res_sm.params)

Which gives the result:

Optimization terminated successfully.
         Current function value: 0.403297
         Iterations: 5
         Function evaluations: 6
         Gradient evaluations: 10
         Hessian evaluations: 5
[-1.65822763  3.65065752]

with the first element of the array being the intercept and the second the coefficient of x. While for scikit learn we have:

#### Scikit-Learn

res_sk = LogisticRegression(solver='newton-cg', max_iter=max_iter, fit_intercept=True, penalty='none')
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.intercept_, res_sk.coef_)

with the result being:

[-1.65822806] [[3.65065707]]

These results are practically identical, within the machine's numeric precision.

Repeating the procedure for different values of np.random.seed() does not change the essence of the results shown above.

scikit-learn & statsmodels - which R-squared is correct?

Arguably, the real challenge in such cases is to be sure that you compare apples to apples. And in your case, it seems that you don't. Our best friend is always the relevant documentation, combined with simple experiments. So...

Although scikit-learn's LinearRegression() (i.e. your 1st R-squared) is fitted by default with fit_intercept=True (docs), this is not the case with statsmodels' OLS (your 2nd R-squared); quoting from the docs:

An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.

Keeping this important detail in mind, let's run some simple experiments with dummy data:

import numpy as np
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# dummy data:
y = np.array([1,3,4,5,2,3,4])
X = np.array(range(1,8)).reshape(-1,1) # reshape to column

# scikit-learn:
lr = LinearRegression()
lr.fit(X,y)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
#     normalize=False)

lr.score(X,y)
# 0.16118421052631582

y_pred=lr.predict(X)
r2_score(y, y_pred)
# 0.16118421052631582

# statsmodels
# first artificially add intercept to X, as advised in the docs:
X_ = sm.add_constant(X)

model = sm.OLS(y,X_) # X_ here
results = model.fit()
results.rsquared
# 0.16118421052631593

For all practical purposes, these two values of R-squared produced by scikit-learn and statsmodels are identical.

Let's go a step further, and try a scikit-learn model without intercept, but where we use the artificially "intercepted" data X_ we have already built for use with statsmodels:

lr2 = LinearRegression(fit_intercept=False)
lr2.fit(X_,y) # X_ here
# LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None,
#         normalize=False)

lr2.score(X_, y)
# 0.16118421052631593

y_pred2 = lr2.predict(X_)
r2_score(y, y_pred2)
# 0.16118421052631593

Again, the R-squared is identical with the previous values.

So, what happens when we "accidentally" forget to account for the fact that statsmodels OLS is fitted without an intercept? Let's see:

model3 = sm.OLS(y,X) # X here, i.e. no intercept
results3 = model2.fit()
results3.rsquared
# 0.8058035714285714

Well, an R-squared of 0.80 is indeed very far from the one of 0.16 returned by a model with an intercept, and arguably this is exactly what has happened in your case.

So far so good, and I could easily finish the answer here; but there is indeed a point where this harmonious world breaks down: let's see what happens when we fit both models without intercept and with the initial data X where we have not artificially added any interception. We have already fitted the OLS model above, and got an R-squared of 0.80; what about a similar model from scikit-learn?

# scikit-learn
lr3 = LinearRegression(fit_intercept=False)
lr3.fit(X,y) # X here
lr3.score(X,y)
# -0.4309210526315792

y_pred3 = lr3.predict(X)
r2_score(y, y_pred3)
# -0.4309210526315792

Ooops...! What the heck??

It seems that scikit-earn, when computes the r2_score, always assumes an intercept, either explicitly in the model (fit_intercept=True) or implicitly in the data (the way we have produced X_ from X above, using statsmodels' add_constant); digging a little online reveals a Github thread (closed without a remedy) where it is confirmed that the situation is indeed like that.

[UPDATE Dec 2021: for a more detailed & in-depth investigation and explanation of why the two scores are different in this particular case (i.e. both models fitted without an intercept), see this great answer by Flavia]

Let me clarify that the discrepancy I have described above has nothing to do with your issue: in your case, the real issue is that you are actually comparing apples (a model with intercept) with oranges (a model without intercept).

So, why scikit-learn not only fails in such an (admittedly edge) case, but even when the fact emerges in a Github issue it is actually treated with indifference? (Notice also that the scikit-learn core developer who replies in the above thread casually admits that "I'm not super familiar with stats"...).

The answer goes a little beyond coding issues, such as the ones SO is mainly about, but it may be worth elaborating a little here.

Arguably, the reason is that the whole R-squared concept comes in fact directly from the world of statistics, where the emphasis is on interpretative models, and it has little use in machine learning contexts, where the emphasis is clearly on predictive models; at least AFAIK, and beyond some very introductory courses, I have never (I mean never...) seen a predictive modeling problem where the R-squared is used for any kind of performance assessment; neither it's an accident that popular machine learning introductions, such as Andrew Ng's Machine Learning at Coursera, do not even bother to mention it. And, as noted in the Github thread above (emphasis added):

In particular when using a test set, it's a bit unclear to me what the R^2 means.

with which I certainly concur.

As for the edge case discussed above (to include or not an intercept term?), I suspect it would sound really irrelevant to modern deep learning practitioners, where the equivalent of an intercept (bias parameters) is always included by default in neural network models...

See the accepted (and highly upvoted) answer in the Cross Validated question Difference between statsmodel OLS and scikit linear regression for a more detailed discussion along these last lines. The discussion (and links) in Is R-squared Useless?, triggered by some relevant (negative) remarks by the great statistician Cosma Shalizi, is also enlightening and highly recommended.

Logistic Regression different results with R and Python?

You seem to be missing the constant (offset) parameter in the Python logistic model.

To use R's formula syntax you're fitting two different models:

Python model: INFECTION ~ 0 + Flushed
R model     : INFECTION ~ Flushed

To add a constant to the Python model use sm.add_constant(...).

How to find the importance of the features for a logistic regression model?

One of the simplest options to get a feeling for the "influence" of a given parameter in a linear classification model (logistic being one of those), is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.

Consider this example:

import numpy as np    
from sklearn.linear_model import LogisticRegression

x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn()) > 0
X = np.column_stack([x1, x2, x3])

m = LogisticRegression()
m.fit(X, y)

# The estimated coefficients will all be around 1:
print(m.coef_)

# Those values, however, will show that the second parameter
# is more influential
print(np.std(X, 0)*m.coef_)

An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:

m.fit(X / np.std(X, 0), y)
print(m.coef_)

Note that this is the most basic approach and a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various "discriminative indices", etc).

I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.

Logistic regression, second column of confusion matrix shows zeros

the array from con_matrix is as follow , tn, fp, fn, tp.

your true negative are 1006, meaning people that the model consider that aren't able to buy a house,
and your false positive is 0, meaning that your model didn't predict that someone is able to buy a house while the can't in reality.

Your false negative is 125,meaning that these people in reality they can afford to buy a house but your model is saying they can.
and your true positive is also 0, meaning that your model didn't correctly predict the person who can afford a to buy a house as someone who actually can.

MY overall guess is that you might have a lot of people who can't buy a house compare to those who can and probably the features(balance in the bank, age ) are similar to both.

I would advise you to add the class_weight parameters in case you the dataset is imbalanced, if the class label are 0 for not able to buy a house, then set {0: 0.1} in case you have 90 records of not able to buy a house and 10 records of being able to buy a house

Fitting a Logistic Curve, obtain parameters for each record

Something like:

library(drc)

Split the data frame into a list of data frames, one per county:

split_df <- split(final.df, final.df$county)

A function to fit the model to a data set and return the coefficients:

fitfun <- function(d) {
    mL <- drm(percent_farm_tractor ~ year, data = d, fct = L.3(), type = "continuous")
    return(coef(mL))
}

Apply the function to each chunk of the data:

lapply(split_df, fitfun)

Comparison of R, Statmodels, Sklearn for a Classification Task with Logistic Regression