Find P-Value (Significance) in Scikit-Learn LinearRegression

Python sklearn - how to calculate p-values

Run the significance test on X and y directly. For example, using the 20 newsgroups dataset and the chi-squared test (chi2):

>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> from sklearn.feature_selection import chi2
>>> data = fetch_20newsgroups_vectorized()
>>> X, y = data.data, data.target
>>> scores, pvalues = chi2(X, y)
>>> pvalues
array([  4.10171798e-17,   4.34003018e-01,   9.99999996e-01, ...,
         9.99999995e-01,   9.99999869e-01,   9.99981414e-01])
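chi2 is meant for non-negative features with a classification target. For a continuous target, as in linear regression, scikit-learn's f_regression follows the same (scores, p-values) pattern. A minimal sketch on synthetic data (the seed, shapes, and coefficients are made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data (an assumption for illustration): only the first
# feature actually drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(size=100)

# One F-test per feature, each feature tested on its own against y
f_scores, pvalues = f_regression(X, y)
print(pvalues)
```

Note that these are univariate tests: each feature is tested alone, so the values are not the coefficient p-values of a joint multiple regression.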

get p value and r value from HuberRegressor in Sklearn

You can also use robust linear models in statsmodels. For example:

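import statsmodels.api as sm
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
x = iris.data[:, 0]  # sepal length
y = iris.data[:, 2]  # petal length

# Robust linear model with Huber's T norm; add_constant adds the intercept
rlm_model = sm.RLM(y, sm.add_constant(x), M=sm.robust.norms.HuberT())
rlm_results = rlm_model.fit()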

The p-value returned by scipy.stats.linregress tests the null hypothesis that the slope is zero. For the robust model, the equivalent value is in the P>|z| column of the summary:

rlm_results.summary()
                     
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -7.1311      0.539    -13.241      0.000      -8.187      -6.076
x1             1.8648      0.091     20.434      0.000       1.686       2.044
==============================================================================
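The P>|z| column can be reproduced by hand from the first two columns: z is the coefficient divided by its standard error, and the p-value is the two-sided tail probability of a standard normal. A rough sketch using the rounded x1 values printed above (so z differs slightly from the table's 20.434); the function name is hypothetical:

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    # erfc keeps precision in the far tail, where 1 - cdf would round to 0
    return math.erfc(abs(z) / math.sqrt(2))

# Rounded coefficient and standard error for x1, copied from the summary
coef, std_err = 1.8648, 0.091
z = coef / std_err
p = two_sided_p_from_z(z)
print(f"z = {z:.2f}, p = {p:.3g}")
```

The p-value is far too small to show at three decimals, which is why the table prints 0.000. In statsmodels the unrounded numbers are also available directly as rlm_results.params, rlm_results.bse, and rlm_results.pvalues.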

The r_value from linregress is a correlation coefficient, and it remains just that. A robust linear model weights observations differently to make the fit less sensitive to outliers, so the usual R-squared calculation does not carry over directly. If you compute it anyway, you may get a lower R-squared, because the fitted line is not pulled toward the outlying data points.
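To see what linregress reports, and how much a single outlier can move r, here is a small sketch on synthetic data (the data and the injected outlier are invented for illustration):

```python
import numpy as np
from scipy.stats import linregress

# Synthetic, roughly linear data (an assumption for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

clean = linregress(x, y)

# Same data with one gross outlier at the largest x
y_outlier = y.copy()
y_outlier[-1] += 100.0
dirty = linregress(x, y_outlier)

for name, res in [("clean", clean), ("outlier", dirty)]:
    print(f"{name}: slope={res.slope:.3f}, "
          f"r^2={res.rvalue ** 2:.3f}, p={res.pvalue:.3g}")
```

Ordinary least squares lets the outlier pull both the slope and r; a robust fit downweights that point instead, which is why its fit quality is not summarized well by the plain correlation coefficient.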

See the comments by @Josef (who maintains statsmodels) on the question below and its answer. The calculation given there is worth trying if you want a meaningful R-squared:

How to get R-squared for robust regression (RLM) in Statsmodels?
