Multiple linear regression in Python
sklearn.linear_model.LinearRegression
will do it:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
[t.y for t in texts])
Then clf.coef_
will have the regression coefficients.
sklearn.linear_model
also has similar interfaces to do various kinds of regularizations on the regression.
Multiple Linear Regression with TensorFlow
The key issues with your code are the following:
- While it is necessary to add a column of ones to the features matrix
x_data
before running the regression with statsmodels, this is not necessary when running the regression with tensorflow. This means that you are passing 3 features to tensorflow instead of 2, where the additional feature (the first column ofx_data
) is constant. - You are normalizing
x_data
after the first column of ones has already been added withx_data = sm.add_constant(x_data)
. As a column of ones has zero variance, after normalization you get a column ofnan
(as you are dividing by zero). This means that the first of the 3 features that you are passing to tensorflow is completely missing (i.e. it's alwaysnan
). - While statsmodels takes as inputs at first
y
and thenX
, tensorflow takes as inputs at firstX
and theny
. This means that you have switched the features and target when running the regression in tensorflow.
I included a complete example below.
import numpy as np
import statsmodels.api as sm
import tensorflow as tf
np.random.seed(0)
N = 500 # number of samples
K = 2 # number of features
# generate the features
m = np.array([0, 0]) # means
s = np.array([5, 1000]) # variances
X = np.random.multivariate_normal(m * np.ones(K), s * np.eye(K), N)
# generate the target
b = 0.5 # bias
w = np.array([0.107, 0]) # weights
y = b + np.dot(X, w) + np.random.normal(0, 1, N)
# normalize the features
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0, ddof=1)
# run a linear regression with statsmodels
reg = sm.OLS(y, sm.add_constant(X)).fit()
# run a linear regression with tensorflow
model = tf.keras.Sequential([
tf.keras.layers.Dense(units=1)
])
model.compile(loss=tf.losses.MeanSquaredError(), optimizer=tf.keras.optimizers.SGD(learning_rate=0.01))
model.fit(X, y, epochs=1000, verbose=0)
bias = model.layers[0].get_weights()[1]
weights = model.layers[0].get_weights()[0].flatten()
# compare the parameters
print('Statsmodels parameters:')
print(np.round(reg.params, 3))
# [0.523 0.25 0.063]
print('Tensorflow parameters:')
print(np.round(np.append(bias, weights), 3))
# [0.528 0.25 0.066]
python sklearn multiple linear regression display r-squared
There are many different ways to compute R^2
and the adjusted R^2
, the following are few of them (computed with the data you provided):
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X, y = df[['NumberofEmployees','ValueofContract']], df.AverageNumberofTickets
model.fit(X, y)
SST = SSR + SSE (ref definitions)
# compute with formulas from the theory
yhat = model.predict(X)
SS_Residual = sum((y-yhat)**2)
SS_Total = sum((y-np.mean(y))**2)
r_squared = 1 - (float(SS_Residual))/SS_Total
adjusted_r_squared = 1 - (1-r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)
print r_squared, adjusted_r_squared
# 0.877643371323 0.863248473832
# compute with sklearn linear_model, although could not find any function to compute adjusted-r-square directly from documentation
print model.score(X, y), 1 - (1-model.score(X, y))*(len(y)-1)/(len(y)-X.shape[1]-1)
# 0.877643371323 0.863248473832
Another way:
# compute with statsmodels, by adding intercept manually
import statsmodels.api as sm
X1 = sm.add_constant(X)
result = sm.OLS(y, X1).fit()
#print dir(result)
print result.rsquared, result.rsquared_adj
# 0.877643371323 0.863248473832
Yet another way:
# compute with statsmodels, another way, using formula
import statsmodels.formula.api as sm
result = sm.ols(formula="AverageNumberofTickets ~ NumberofEmployees + ValueofContract", data=df).fit()
#print result.summary()
print result.rsquared, result.rsquared_adj
# 0.877643371323 0.863248473832
Perform multiple linear regression for groups based on column unique values
One way to do this is the following. If you are to do the regression for the whole dataframe:
X = df[['a', 'b', 'c', 'd']]
Y = df['y_Values']
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
which will return
OLS Regression Results
=======================================================================================
Dep. Variable: y_Values R-squared (uncentered): 0.973
Model: OLS Adj. R-squared (uncentered): 0.946
Method: Least Squares F-statistic: 35.97
Date: Wed, 27 Oct 2021 Prob (F-statistic): 0.00216
Time: 13:12:10 Log-Likelihood: -37.992
No. Observations: 8 AIC: 83.98
Df Residuals: 4 BIC: 84.30
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
a 167.3835 45.459 3.682 0.021 41.170 293.597
b 1.6286 0.621 2.622 0.059 -0.096 3.353
c 83.8313 55.572 1.509 0.206 -70.461 238.123
d -2.7363 6.841 -0.400 0.710 -21.729 16.256
==============================================================================
Omnibus: 1.673 Durbin-Watson: 2.460
Prob(Omnibus): 0.433 Jarque-Bera (JB): 0.446
Skew: 0.574 Prob(JB): 0.800
Kurtosis: 2.860 Cond. No. 146.
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
You can pick and choose what values to extract.
To do it by individual status:
status = list(set(df['status']))
for status in status:
print( status)
df_redux = df[df['status']==status]
print(df_redux)
X = df_redux[['a', 'b', 'c', 'd']] # here we have 2 variables for multiple regression. If you just want to use one variable for simple linear regression, then use X = df['Interest_Rate'] for example.Alternatively, you may add additional variables within the brackets
Y = df_redux['y_Values']
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
which gives:
1
ID status y_Values a b c d
0 1 1 150.51 0.26 23 0.151 1.215
1 2 1 153.11 0.86 14 0.156 1.651
2 3 1 189.32 0.46 51 0.151 2.154
OLS Regression Results
==============================================================================
Dep. Variable: y_Values R-squared: 1.000
Model: OLS Adj. R-squared: nan
Method: Least Squares F-statistic: nan
Date: Wed, 27 Oct 2021 Prob (F-statistic): nan
Time: 13:12:15 Log-Likelihood: 77.832
No. Observations: 3 AIC: -149.7
Df Residuals: 0 BIC: -152.4
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
a -248.8778 inf -0 nan nan nan
b -4.9837 inf -0 nan nan nan
c 229.5383 inf 0 nan nan nan
d 242.9489 inf 0 nan nan nan
==============================================================================
Omnibus: nan Durbin-Watson: 0.443
Prob(Omnibus): nan Jarque-Bera (JB): 0.281
Skew: 0.016 Prob(JB): 0.869
Kurtosis: 1.500 Cond. No. 554.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
4
ID status y_Values a b c d
4 5 4 189.65 0.91 11 0.123 2.104
5 6 4 144.23 0.69 16 0.178 3.515
6 7 4 198.02 0.62 18 0.891 1.561
OLS Regression Results
==============================================================================
Dep. Variable: y_Values R-squared: 1.000
Model: OLS Adj. R-squared: nan
Method: Least Squares F-statistic: nan
Date: Wed, 27 Oct 2021 Prob (F-statistic): nan
Time: 13:12:15 Log-Likelihood: 82.381
No. Observations: 3 AIC: -158.8
Df Residuals: 0 BIC: -161.5
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
a 183.1273 inf 0 nan nan nan
b 8.9478 inf 0 nan nan nan
c -25.7862 inf -0 nan nan nan
d -34.3392 inf -0 nan nan nan
==============================================================================
Omnibus: nan Durbin-Watson: 1.154
Prob(Omnibus): nan Jarque-Bera (JB): 0.284
Skew: 0.072 Prob(JB): 0.868
Kurtosis: 1.500 Cond. No. 67.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
9
ID status y_Values a b c d
3 4 9 145.65 0.46 62 0.157 3.145
7 8 9 178.09 0.91 22 0.156 9.155
OLS Regression Results
==============================================================================
Dep. Variable: y_Values R-squared: 1.000
Model: OLS Adj. R-squared: nan
Method: Least Squares F-statistic: nan
Date: Wed, 27 Oct 2021 Prob (F-statistic): nan
Time: 13:12:15 Log-Likelihood: 58.629
No. Observations: 2 AIC: -113.3
Df Residuals: 0 BIC: -115.9
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
a 1.4521 inf 0 nan nan nan
b 1.5473 inf 0 nan nan nan
c 0.1974 inf 0 nan nan nan
d 15.5869 inf 0 nan nan nan
==============================================================================
Omnibus: nan Durbin-Watson: 1.800
Prob(Omnibus): nan Jarque-Bera (JB): 0.333
Skew: 0.000 Prob(JB): 0.846
Kurtosis: 1.000 Cond. No. 8.72
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
Of course, given the size of your subsets, the regression results aren't that nice. I assume you have a much larger dataframe.
To extract specific info (like R2) just add print(model.rsquared)
.
Update :
A more complete method to extract information is to add this:
stats_1 = pd.read_html(model.summary().tables[0].as_html(),header=0,index_col=0)[0]
stats_2 = pd.read_html(model.summary().tables[1].as_html(),header=0,index_col=0)[0]
which with return two dataframes:
stat_1
Dep. Variable: y_Values R-squared (uncentered): 0.973
0 Model: OLS Adj. R-squared (uncentered): 0.94600
1 Method: Least Squares F-statistic: 35.97000
2 Date: Wed, 27 Oct 2021 Prob (F-statistic): 0.00216
3 Time: 13:52:04 Log-Likelihood: -37.99200
4 No. Observations: 8 AIC: 83.98000
5 Df Residuals: 4 BIC: 84.30000
6 Df Model: 4 NaN NaN
7 Covariance Type: nonrobust NaN NaN
and
stat_2
index coef std err t P>|t| [0.025 0.975]
0 a 167.3835 45.459 3.682 0.021 41.170 293.597
1 b 1.6286 0.621 2.622 0.059 -0.096 3.353
2 c 83.8313 55.572 1.509 0.206 -70.461 238.123
3 d -2.7363 6.841 -0.400 0.710 -21.729 16.256
You can now pick the columns you want, e.g.:
stat_2['coeff']
index coef
0 a 167.3835
1 b 1.6286
2 c 83.8313
3 d -2.7363
so your loop would be something like:
df_coef =[]
status = list(set(df['status']))
for status in status:
df_redux = df[df['status']==status]
print(df_redux)
X = df_redux[['a', 'b', 'c', 'd']]
Y = df_redux['y_Values']
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
stats_1 = pd.read_html(model.summary().tables[0].as_html(),header=0,index_col=0)[0]
stats_2 = pd.read_html(model.summary().tables[1].as_html(),header=0,index_col=0)[0]
if len(stats_2)!=0:
stats_2['status'] = status
df_coef.append(stats_2)
else:
0
all_coef = pd.concat(df_coef)
df = all_coef[['status', 'coef']]
print(df)
which gives:
status coef
a 1 -248.8778
b 1 -4.9837
c 1 229.5383
d 1 242.9489
a 4 183.1273
b 4 8.9478
c 4 -25.7862
d 4 -34.3392
a 9 1.4521
b 9 1.5473
c 9 0.1974
d 9 15.5869
Then append it to your original df by merging on status
UPDATE 2
Thanks for the solution, got all coef, but what I meant with merging/concatenating predicted values is that when I print out prediction, I got these four tables with row ID and predicted value. What I need is to merge these four tables (which are stored in one variable predictions
), create it as Dataframe with column names ID
and results
.
After that I can merge the new dataframe to original by column 'ID'.
....
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(predictions)
0 401.094849
1 420.949054
2 407.918627
4 363.367876
8 255.865852
...
1556 430.050556
1558 292.949037
1559 306.011285
1560 412.041196
1561 360.829533
Length: 958, dtype: float64
5 366.159418
12 204.606629
18 400.767161
20 401.544449
21 267.192577
...
1530 384.151730
1533 275.356699
1539 376.165539
1543 334.024327
1547 272.197374
Length: 205, dtype: float64
I tried to convert predictions
variable to list or dict, but could not figure out how to concatenate all four tables. Probably easy solution, however I couldn't find it.
Update 3
Does this work for you?
df = pd.read_csv("df.csv", sep=";")
df_coef =[]
status = list(set(df['status']))
for status in status:
df_redux = df[df['status']==status]
X = df_redux[['a', 'b', 'c', 'd']]
Y = df_redux['y_Values']
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
stats_2 = pd.read_html(model.summary().tables[1].as_html(),header=0,index_col=0)[0]
predictions = pd.DataFrame(predictions, columns = ['predictions'])
gf = pd.concat([predictions, df_redux], axis=1)
df_coef.append(gf)
all_coef = pd.concat(df_coef)
which produces:
predictions ID status y_Values a b c d
0 150.51 1 1 150.51 0.26 23 0.151 1.215
1 153.11 2 1 153.11 0.86 14 0.156 1.651
2 189.32 3 1 189.32 0.46 51 0.151 2.154
4 189.65 5 4 189.65 0.91 11 0.123 2.104
5 144.23 6 4 144.23 0.69 16 0.178 3.515
6 198.02 7 4 198.02 0.62 18 0.891 1.561
3 145.65 4 9 145.65 0.46 62 0.157 3.145
7 178.09 8 9 178.09 0.91 22 0.156 9.155
Observe that in the example here y_Values
and predictions
will coincide due to the lack of data.
Automatic Linear/Multiple Regression in Python with 50+ columns
If you are looking for columns with high r squared values, just try a correlation matrix. To ease the visualization, I would recommend you to plot a heat map using seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
df_corr = df.corr()
sns.heatmap(df_corr, cmap="coolwarm", annot=True)
plt.show()
Other suggestion I have to you is to run a Principal Component Analysis (PCA) in your dataset to find the features with highest variability. Usually, these variables are the most important, and can be used to make the best predictions. Just let me know if want more info on this technique.
Can I use numpy.polyfit(x, y, deg) for multiple linear regression
Assuming your equation is a * exercise + b * age + intercept = y
, you can fit a multiple linear regression with numpy or scikit-learn as follows:
from sklearn import linear_model
import numpy as np
np.random.seed(42)
X = np.random.randint(low=1, high=10, size=20).reshape(10, 2)
X = np.c_[X, np.ones(X.shape[0])] # add intercept
y = np.random.randint(low=1, high=10, size=10)
# Option 1
a, b, intercept = np.linalg.pinv((X.T).dot(X)).dot(X.T.dot(y))
print(a, b, intercept)
# Option 2
a, b, intercept = np.linalg.lstsq(X,y, rcond=None)[0]
print(a, b, intercept)
# Option 3
clf = linear_model.LinearRegression(fit_intercept=False)
clf.fit(X, y)
print(clf.coef_)
Related Topics
Difference Between Except: and Except Exception as E:
Why Can a Python Dict Have Multiple Keys with the Same Hash
Error Installing Geopandas:" a Gdal API Version Must Be Specified " in Anaconda
How to Remove Anaconda from Windows Completely
Download Image with Selenium Python
How to Set Folder Permissions in Windows
Capture Arbitrary Path in Flask Route
Make Requests Using Python Over Tor
How to Specify Your Own Distance Function Using Scikit-Learn K-Means Clustering
Right Way to Reverse a Pandas Dataframe
How to Create a Read-Only Class Property in Python
Display Special Characters When Using Print Statement
Handle Flask Requests Concurrently with Threaded=True
Time Complexity of String Slice