Multiple Linear Regression in Python

Multiple linear regression in Python

sklearn.linear_model.LinearRegression will do it:

from sklearn import linear_model

clf = linear_model.LinearRegression()
# texts is assumed to be a list of objects exposing feature attributes x1..x7 and a target y
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])

Then clf.coef_ will have the regression coefficients.
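
For reference, here is a self-contained sketch of the same idea on a synthetic dataset (the data and shapes below are purely illustrative):

import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))                                 # 100 samples, 7 features
y = X @ rng.normal(size=7) + rng.normal(scale=0.1, size=100)  # linear target plus noise

clf = linear_model.LinearRegression()
clf.fit(X, y)
print(clf.coef_)       # the 7 regression coefficients
print(clf.intercept_)  # the fitted intercept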

sklearn.linear_model also provides similar interfaces for various kinds of regularized regression, as sketched below.
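
For example, Ridge (L2) and Lasso (L1) expose the same fit/coef_ interface; a minimal sketch reusing the synthetic X and y from above (the alpha values are just illustrative choices):

from sklearn import linear_model

ridge = linear_model.Ridge(alpha=1.0).fit(X, y)
lasso = linear_model.Lasso(alpha=0.1).fit(X, y)
print(ridge.coef_)
print(lasso.coef_)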

Multiple Linear Regression with TensorFlow

The key issues with your code are the following:

  1. While it is necessary to add a column of ones to the features matrix x_data before running the regression with statsmodels, this is not necessary when running the regression with tensorflow, because the Dense layer fits its own bias term. This means that you are passing 3 features to tensorflow instead of 2, where the additional feature (the first column of x_data) is constant.
  2. You are normalizing x_data after the first column of ones has already been added with x_data = sm.add_constant(x_data). Since a column of ones has zero variance, normalizing it divides by zero and you get a column of nan. This means that the first of the 3 features you are passing to tensorflow is completely missing (i.e. it is always nan).
  3. statsmodels takes y first and then X as inputs, whereas tensorflow takes X first and then y. This means that you have swapped the features and target when running the regression in tensorflow.

I included a complete example below.

import numpy as np
import statsmodels.api as sm
import tensorflow as tf

np.random.seed(0)

N = 500 # number of samples
K = 2 # number of features

# generate the features
m = np.array([0, 0]) # means
s = np.array([5, 1000]) # variances
X = np.random.multivariate_normal(m * np.ones(K), s * np.eye(K), N)

# generate the target
b = 0.5 # bias
w = np.array([0.107, 0]) # weights
y = b + np.dot(X, w) + np.random.normal(0, 1, N)

# normalize the features
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0, ddof=1)

# run a linear regression with statsmodels
reg = sm.OLS(y, sm.add_constant(X)).fit()

# run a linear regression with tensorflow
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=1)
])

model.compile(loss=tf.keras.losses.MeanSquaredError(),
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01))
model.fit(X, y, epochs=1000, verbose=0)

bias = model.layers[0].get_weights()[1]
weights = model.layers[0].get_weights()[0].flatten()

# compare the parameters
print('Statsmodels parameters:')
print(np.round(reg.params, 3))
# [0.523 0.25 0.063]

print('Tensorflow parameters:')
print(np.round(np.append(bias, weights), 3))
# [0.528 0.25 0.066]

python sklearn multiple linear regression display r-squared

There are many different ways to compute R^2 and the adjusted R^2; the following are a few of them (computed with the data you provided):

import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression()
X, y = df[['NumberofEmployees', 'ValueofContract']], df.AverageNumberofTickets
model.fit(X, y)

Recall that SST = SSR + SSE (total sum of squares = regression sum of squares + error sum of squares; see the standard definitions).
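
In symbols, with n observations and p predictors, the quantities computed below are:

R^2 = 1 - SS_Residual / SS_Total
adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)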

# compute with formulas from the theory
yhat = model.predict(X)
SS_Residual = sum((y-yhat)**2)
SS_Total = sum((y-np.mean(y))**2)
r_squared = 1 - (float(SS_Residual))/SS_Total
adjusted_r_squared = 1 - (1-r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)
print(r_squared, adjusted_r_squared)
# 0.877643371323 0.863248473832

# compute with sklearn's model.score (there is no function for the adjusted R^2 in sklearn, so derive it)
print(model.score(X, y), 1 - (1 - model.score(X, y)) * (len(y) - 1) / (len(y) - X.shape[1] - 1))
# 0.877643371323 0.863248473832
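
As a further option (not in the original answer), sklearn.metrics.r2_score computes the same plain R^2 directly; the adjusted version still has to be derived by hand:

from sklearn.metrics import r2_score

r2 = r2_score(y, model.predict(X))
print(r2, 1 - (1 - r2) * (len(y) - 1) / (len(y) - X.shape[1] - 1))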

Another way:

# compute with statsmodels, by adding intercept manually
import statsmodels.api as sm
X1 = sm.add_constant(X)
result = sm.OLS(y, X1).fit()
#print dir(result)
print(result.rsquared, result.rsquared_adj)
# 0.877643371323 0.863248473832

Yet another way:

# compute with statsmodels, another way, using the formula interface
import statsmodels.formula.api as smf
result = smf.ols(formula="AverageNumberofTickets ~ NumberofEmployees + ValueofContract", data=df).fit()
#print(result.summary())
print(result.rsquared, result.rsquared_adj)
# 0.877643371323 0.863248473832

Perform multiple linear regression for groups based on column unique values

One way to do this is the following. To run the regression on the whole dataframe:

import statsmodels.api as sm

X = df[['a', 'b', 'c', 'd']]
Y = df['y_Values']

model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)

which will return

                               OLS Regression Results
=======================================================================================
Dep. Variable:               y_Values   R-squared (uncentered):              0.973
Model:                            OLS   Adj. R-squared (uncentered):         0.946
Method:                 Least Squares   F-statistic:                         35.97
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                0.00216
Time:                        13:12:10   Log-Likelihood:                    -37.992
No. Observations:                   8   AIC:                                 83.98
Df Residuals:                       4   BIC:                                 84.30
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a            167.3835     45.459      3.682      0.021      41.170     293.597
b              1.6286      0.621      2.622      0.059      -0.096       3.353
c             83.8313     55.572      1.509      0.206     -70.461     238.123
d             -2.7363      6.841     -0.400      0.710     -21.729      16.256
==============================================================================
Omnibus:                        1.673   Durbin-Watson:                   2.460
Prob(Omnibus):                  0.433   Jarque-Bera (JB):                0.446
Skew:                           0.574   Prob(JB):                        0.800
Kurtosis:                       2.860   Cond. No.                         146.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

You can pick and choose what values to extract.
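
For instance, the fitted results object exposes the usual quantities as attributes (names from statsmodels' RegressionResults):

print(model.params)      # coefficients
print(model.pvalues)     # p-values
print(model.rsquared)    # R-squared
print(model.conf_int())  # confidence intervals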

To do it by individual status:

statuses = df['status'].unique()
for status in statuses:
    print(status)
    df_redux = df[df['status'] == status]
    print(df_redux)
    # four predictors for multiple regression; use a single column,
    # e.g. X = df_redux['a'], for simple linear regression
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)

    print_model = model.summary()
    print(print_model)

which gives:

1
   ID  status  y_Values     a   b      c      d
0   1       1    150.51  0.26  23  0.151  1.215
1   2       1    153.11  0.86  14  0.156  1.651
2   3       1    189.32  0.46  51  0.151  2.154
                            OLS Regression Results
==============================================================================
Dep. Variable:               y_Values   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                nan
Time:                        13:12:15   Log-Likelihood:                 77.832
No. Observations:                   3   AIC:                            -149.7
Df Residuals:                       0   BIC:                            -152.4
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a           -248.8778        inf         -0        nan         nan         nan
b             -4.9837        inf         -0        nan         nan         nan
c            229.5383        inf          0        nan         nan         nan
d            242.9489        inf          0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.443
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.281
Skew:                           0.016   Prob(JB):                        0.869
Kurtosis:                       1.500   Cond. No.                         554.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
4
   ID  status  y_Values     a   b      c      d
4   5       4    189.65  0.91  11  0.123  2.104
5   6       4    144.23  0.69  16  0.178  3.515
6   7       4    198.02  0.62  18  0.891  1.561
                            OLS Regression Results
==============================================================================
Dep. Variable:               y_Values   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                nan
Time:                        13:12:15   Log-Likelihood:                 82.381
No. Observations:                   3   AIC:                            -158.8
Df Residuals:                       0   BIC:                            -161.5
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a            183.1273        inf          0        nan         nan         nan
b              8.9478        inf          0        nan         nan         nan
c            -25.7862        inf         -0        nan         nan         nan
d            -34.3392        inf         -0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.154
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.284
Skew:                           0.072   Prob(JB):                        0.868
Kurtosis:                       1.500   Cond. No.                         67.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
9
   ID  status  y_Values     a   b      c      d
3   4       9    145.65  0.46  62  0.157  3.145
7   8       9    178.09  0.91  22  0.156  9.155
                            OLS Regression Results
==============================================================================
Dep. Variable:               y_Values   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                nan
Time:                        13:12:15   Log-Likelihood:                 58.629
No. Observations:                   2   AIC:                            -113.3
Df Residuals:                       0   BIC:                            -115.9
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a              1.4521        inf          0        nan         nan         nan
b              1.5473        inf          0        nan         nan         nan
c              0.1974        inf          0        nan         nan         nan
d             15.5869        inf          0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.800
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.333
Skew:                           0.000   Prob(JB):                        0.846
Kurtosis:                       1.000   Cond. No.                         8.72
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.

Of course, given the size of your subsets, the regression results aren't that nice. I assume you have a much larger dataframe.

To extract specific info (like R2) just add print(model.rsquared).

Update:

A more complete method to extract information is to add this:

import pandas as pd

stats_1 = pd.read_html(model.summary().tables[0].as_html(), header=0, index_col=0)[0]
stats_2 = pd.read_html(model.summary().tables[1].as_html(), header=0, index_col=0)[0]

which will return two DataFrames:

stats_1

   Dep. Variable:            y_Values   R-squared (uncentered):          0.973
0  Model:                         OLS   Adj. R-squared (uncentered):   0.94600
1  Method:              Least Squares   F-statistic:                  35.97000
2  Date:             Wed, 27 Oct 2021   Prob (F-statistic):            0.00216
3  Time:                     13:52:04   Log-Likelihood:              -37.99200
4  No. Observations:                8   AIC:                          83.98000
5  Df Residuals:                    4   BIC:                          84.30000
6  Df Model:                        4   NaN                                NaN
7  Covariance Type:         nonrobust   NaN                                NaN

and

stats_2

  index      coef  std err      t  P>|t|   [0.025    0.975]
0     a  167.3835   45.459  3.682  0.021   41.170   293.597
1     b    1.6286    0.621  2.622  0.059   -0.096     3.353
2     c   83.8313   55.572  1.509  0.206  -70.461   238.123
3     d   -2.7363    6.841 -0.400  0.710  -21.729    16.256

You can now pick the columns you want, e.g.:

stats_2['coef']

  index      coef
0     a  167.3835
1     b    1.6286
2     c   83.8313
3     d   -2.7363
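
As a side note, statsmodels can also hand you the coefficient table as a DataFrame without the read_html round trip: summary2() stores its tables as DataFrames (the column labels may differ slightly from the read_html version):

coef_table = model.summary2().tables[1]
print(coef_table)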

so your loop would be something like:

df_coef = []
statuses = df['status'].unique()
for status in statuses:
    df_redux = df[df['status'] == status]
    print(df_redux)
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)
    stats_1 = pd.read_html(model.summary().tables[0].as_html(), header=0, index_col=0)[0]
    stats_2 = pd.read_html(model.summary().tables[1].as_html(), header=0, index_col=0)[0]
    if len(stats_2) != 0:
        stats_2['status'] = status
        df_coef.append(stats_2)

all_coef = pd.concat(df_coef)
coef_long = all_coef[['status', 'coef']]
print(coef_long)

which gives:

   status      coef
a       1 -248.8778
b       1   -4.9837
c       1  229.5383
d       1  242.9489
a       4  183.1273
b       4    8.9478
c       4  -25.7862
d       4  -34.3392
a       9    1.4521
b       9    1.5473
c       9    0.1974
d       9   15.5869

Then append it to your original df by merging on status, e.g. as sketched below.
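
A possible sketch of that merge (assuming, as in the output above, that all_coef keeps the variable names a..d in its index):

coef_wide = (all_coef[['status', 'coef']]
                 .rename_axis('variable')   # name the index that holds the variable names a..d
                 .reset_index()
                 .pivot(index='status', columns='variable', values='coef')
                 .add_prefix('coef_')
                 .reset_index())
df_merged = df.merge(coef_wide, on='status', how='left')
print(df_merged.head())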

Update 2

Thanks for the solution, I got all the coefficients, but what I meant by merging/concatenating the predicted values is that when I print out predictions, I get these four tables with row ID and predicted value. What I need is to merge these four tables (which are stored in the one variable predictions) into a DataFrame with columns ID and results.

After that I can merge the new dataframe to the original one on the column 'ID'.

....
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(predictions)

0 401.094849
1 420.949054
2 407.918627
4 363.367876
8 255.865852
...
1556 430.050556
1558 292.949037
1559 306.011285
1560 412.041196
1561 360.829533

Length: 958, dtype: float64
5 366.159418
12 204.606629
18 400.767161
20 401.544449
21 267.192577
...
1530 384.151730
1533 275.356699
1539 376.165539
1543 334.024327
1547 272.197374
Length: 205, dtype: float64

I tried converting the predictions variable to a list or dict, but could not figure out how to concatenate all four tables. It is probably an easy solution, but I couldn't find it.

Update 3

Does this work for you?

df = pd.read_csv("df.csv", sep=";")
df_coef = []
statuses = df['status'].unique()
for status in statuses:
    df_redux = df[df['status'] == status]
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)
    stats_2 = pd.read_html(model.summary().tables[1].as_html(), header=0, index_col=0)[0]
    predictions = pd.DataFrame(predictions, columns=['predictions'])
    gf = pd.concat([predictions, df_redux], axis=1)
    df_coef.append(gf)

all_coef = pd.concat(df_coef)

which produces:

   predictions  ID  status  y_Values     a   b      c      d
0       150.51   1       1    150.51  0.26  23  0.151  1.215
1       153.11   2       1    153.11  0.86  14  0.156  1.651
2       189.32   3       1    189.32  0.46  51  0.151  2.154
4       189.65   5       4    189.65  0.91  11  0.123  2.104
5       144.23   6       4    144.23  0.69  16  0.178  3.515
6       198.02   7       4    198.02  0.62  18  0.891  1.561
3       145.65   4       9    145.65  0.46  62  0.157  3.145
7       178.09   8       9    178.09  0.91  22  0.156  9.155

Observe that in this small example y_Values and predictions coincide: each subgroup has at least as many predictors as observations, so the fit is exact.

Automatic Linear/Multiple Regression in Python with 50+ columns

If you are looking for columns with high r-squared values, just try a correlation matrix. To make it easier to read, I would recommend plotting it as a heat map with seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

df_corr = df.corr()
sns.heatmap(df_corr, cmap="coolwarm", annot=True)
plt.show()

Another suggestion is to run a Principal Component Analysis (PCA) on your dataset to find the features with the highest variability. These are often the most informative variables and can be used to make good predictions; a short sketch follows below. Just let me know if you want more info on this technique.
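
A minimal PCA sketch with scikit-learn (standardizing first, since PCA is scale-sensitive; keeping 95% of the variance is just an illustrative choice, and df is assumed to be your dataframe):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(df.select_dtypes('number'))
pca = PCA(n_components=0.95)          # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance explained by each retained component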

Can I use numpy.polyfit(x, y, deg) for multiple linear regression

numpy.polyfit fits a polynomial in a single variable, so it is not the right tool for several predictors. Assuming your equation is a * exercise + b * age + intercept = y, you can fit a multiple linear regression with numpy or scikit-learn as follows:

from sklearn import linear_model
import numpy as np
np.random.seed(42)

X = np.random.randint(low=1, high=10, size=20).reshape(10, 2)  # two columns: exercise, age
X = np.c_[X, np.ones(X.shape[0])]  # add a column of ones for the intercept
y = np.random.randint(low=1, high=10, size=10)

# Option 1: normal equations via the pseudo-inverse
a, b, intercept = np.linalg.pinv((X.T).dot(X)).dot(X.T.dot(y))
print(a, b, intercept)

# Option 2: numpy's least-squares solver
a, b, intercept = np.linalg.lstsq(X, y, rcond=None)[0]
print(a, b, intercept)

# Option 3: scikit-learn (fit_intercept=False because the intercept column was added manually)
clf = linear_model.LinearRegression(fit_intercept=False)
clf.fit(X, y)
print(clf.coef_)

