Python sklearn - how to calculate p-values
Run the significance test directly on X and y. An example using the 20 newsgroups dataset and chi2:
>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> from sklearn.feature_selection import chi2
>>> data = fetch_20newsgroups_vectorized()
>>> X, y = data.data, data.target
>>> scores, pvalues = chi2(X, y)
>>> pvalues
array([ 4.10171798e-17, 4.34003018e-01, 9.99999996e-01, ...,
9.99999995e-01, 9.99999869e-01, 9.99981414e-01])
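To make the returned arrays easier to inspect, here is a minimal sketch on a tiny hand-made count matrix. The toy data and the 0.05 threshold are assumptions for illustration and are not part of the original answer. It shows that chi2 returns one score and one p-value per feature, which you can then threshold yourself.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy non-negative count features (illustrative data, not from 20 newsgroups).
# Column 0 only occurs in class 0, column 2 only in class 1,
# and column 1 is identical in both classes.
X = np.array([[5, 1, 0],
              [4, 1, 0],
              [0, 1, 6],
              [0, 1, 5]])
y = np.array([0, 0, 1, 1])

# One chi-squared statistic and one p-value per column of X.
scores, pvalues = chi2(X, y)

alpha = 0.05  # assumed significance level
significant = np.where(pvalues < alpha)[0]
print(significant)  # indices of features whose p-value is below alpha
```

The constant middle column carries no class information, so it is the one feature that does not pass the threshold here.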
Get the p-value and r-value from HuberRegressor in sklearn
You can also use robust linear models in statsmodels. For example:
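import statsmodels.api as sm
from sklearn import datasets

# Load the iris data (this step was missing, so `iris` was undefined)
iris = datasets.load_iris()
x = iris.data[:, 0]  # sepal length
y = iris.data[:, 2]  # petal length

# Robust linear model with Huber's T norm; add_constant adds the intercept column
rlm_model = sm.RLM(y, sm.add_constant(x), M=sm.robust.norms.HuberT())
rlm_results = rlm_model.fit()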
The p-value you get from scipy.stats.linregress is for a test of the null hypothesis that the slope is zero. The robust model reports the corresponding test in its summary, which you can get by doing:
rlm_results.summary()
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -7.1311      0.539    -13.241      0.000      -8.187      -6.076
x1             1.8648      0.091     20.434      0.000       1.686       2.044
==============================================================================
The r_value from linregress is a correlation coefficient, and it remains just that. A robust linear model weights the observations differently to make the fit less sensitive to outliers, so the usual r-squared calculation is not meaningful here. You may get a lower r-squared, because the fitted line is no longer pulled toward the outlying data points.
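For comparison, this is roughly what the ordinary (non-robust) numbers look like when they come from scipy.stats.linregress. The data here is made up for illustration, with one deliberate outlier; the point is only to show where `rvalue` and `pvalue` live on the result.

```python
import numpy as np
from scipy import stats

# Illustrative data: an exact line with a single outlier at the end.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[-1] += 20.0

res = stats.linregress(x, y)

# rvalue is the Pearson correlation between x and y;
# pvalue tests the null hypothesis that the slope is zero.
r_squared = res.rvalue ** 2
print(res.slope, res.rvalue, res.pvalue, r_squared)
```

Because this is an ordinary least-squares fit, the outlier pulls the slope away from 2 and lowers the correlation.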
See the comments by @Josef (who maintains statsmodels) on the question below and on its answer. You can try the calculation described there if you would like a meaningful r-squared:
How to get R-squared for robust regression (RLM) in Statsmodels?