Multiple Linear Regression in Python

Multiple linear regression in Python

sklearn.linear_model.LinearRegression will do it:

from sklearn import linear_model

clf = linear_model.LinearRegression()
# texts is assumed to be a list of objects exposing feature attributes x1..x7 and a target y
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])

Then clf.coef_ will have the regression coefficients.
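
For reference, here is a self-contained sketch of the same idea on a synthetic dataset (the data and shapes below are purely illustrative):

import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))                                 # 100 samples, 7 features
y = X @ rng.normal(size=7) + rng.normal(scale=0.1, size=100)  # linear target plus noise

clf = linear_model.LinearRegression()
clf.fit(X, y)
print(clf.coef_)       # the 7 regression coefficients
print(clf.intercept_)  # the fitted intercept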

sklearn.linear_model also provides similar interfaces for various kinds of regularized regression, as sketched below.
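
For example, Ridge (L2) and Lasso (L1) expose the same fit/coef_ interface; a minimal sketch reusing the synthetic X and y from above (the alpha values are just illustrative choices):

from sklearn import linear_model

ridge = linear_model.Ridge(alpha=1.0).fit(X, y)
lasso = linear_model.Lasso(alpha=0.1).fit(X, y)
print(ridge.coef_)
print(lasso.coef_)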

Multiple Linear Regression with TensorFlow

The key issues with your code are the following:

  1. While it is necessary to add a column of ones to the features matrix x_data before running the regression with statsmodels, this is not necessary when running the regression with tensorflow, because the Dense layer fits its own bias term. This means that you are passing 3 features to tensorflow instead of 2, where the additional feature (the first column of x_data) is constant.
  2. You are normalizing x_data after the first column of ones has already been added with x_data = sm.add_constant(x_data). Since a column of ones has zero variance, normalizing it divides by zero and you get a column of nan. This means that the first of the 3 features you are passing to tensorflow is completely missing (i.e. it is always nan).
  3. statsmodels takes y first and then X as inputs, whereas tensorflow takes X first and then y. This means that you have swapped the features and target when running the regression in tensorflow.

I included a complete example below.

import numpy as np
import statsmodels.api as sm
import tensorflow as tf

np.random.seed(0)

N = 500 # number of samples
K = 2 # number of features

# generate the features
m = np.array([0, 0]) # means
s = np.array([5, 1000]) # variances
X = np.random.multivariate_normal(m * np.ones(K), s * np.eye(K), N)

# generate the target
b = 0.5 # bias
w = np.array([0.107, 0]) # weights
y = b + np.dot(X, w) + np.random.normal(0, 1, N)

# normalize the features
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0, ddof=1)

# run a linear regression with statsmodels
reg = sm.OLS(y, sm.add_constant(X)).fit()

# run a linear regression with tensorflow
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=1)
])

model.compile(loss=tf.keras.losses.MeanSquaredError(),
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01))
model.fit(X, y, epochs=1000, verbose=0)

bias = model.layers[0].get_weights()[1]
weights = model.layers[0].get_weights()[0].flatten()

# compare the parameters
print('Statsmodels parameters:')
print(np.round(reg.params, 3))
# [0.523 0.25 0.063]

print('Tensorflow parameters:')
print(np.round(np.append(bias, weights), 3))
# [0.528 0.25 0.066]

python sklearn multiple linear regression display r-squared

There are many different ways to compute R^2 and the adjusted R^2; the following are a few of them (computed with the data you provided):

import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression()
X, y = df[['NumberofEmployees', 'ValueofContract']], df.AverageNumberofTickets
model.fit(X, y)

Recall that SST = SSR + SSE (total sum of squares = regression sum of squares + error sum of squares; see the standard definitions).
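
In symbols, with n observations and p predictors, the quantities computed below are:

R^2 = 1 - SS_Residual / SS_Total
adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)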

# compute with formulas from the theory
yhat = model.predict(X)
SS_Residual = sum((y-yhat)**2)
SS_Total = sum((y-np.mean(y))**2)
r_squared = 1 - (float(SS_Residual))/SS_Total
adjusted_r_squared = 1 - (1-r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)
print(r_squared, adjusted_r_squared)
# 0.877643371323 0.863248473832

# compute with sklearn's model.score (there is no function for the adjusted R^2 in sklearn, so derive it)
print(model.score(X, y), 1 - (1 - model.score(X, y)) * (len(y) - 1) / (len(y) - X.shape[1] - 1))
# 0.877643371323 0.863248473832
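
As a further option (not in the original answer), sklearn.metrics.r2_score computes the same plain R^2 directly; the adjusted version still has to be derived by hand:

from sklearn.metrics import r2_score

r2 = r2_score(y, model.predict(X))
print(r2, 1 - (1 - r2) * (len(y) - 1) / (len(y) - X.shape[1] - 1))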

Another way:

# compute with statsmodels, by adding intercept manually
import statsmodels.api as sm
X1 = sm.add_constant(X)
result = sm.OLS(y, X1).fit()
#print dir(result)
print(result.rsquared, result.rsquared_adj)
# 0.877643371323 0.863248473832

Yet another way:

# compute with statsmodels, another way, using the formula interface
import statsmodels.formula.api as smf
result = smf.ols(formula="AverageNumberofTickets ~ NumberofEmployees + ValueofContract", data=df).fit()
#print(result.summary())
print(result.rsquared, result.rsquared_adj)
# 0.877643371323 0.863248473832

Perform multiple linear regression for groups based on column unique values

One way to do this is the following. To run the regression on the whole dataframe:

import statsmodels.api as sm

X = df[['a', 'b', 'c', 'd']]
Y = df['y_Values']

model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)

which will return

                               OLS Regression Results
=======================================================================================
Dep. Variable:               y_Values   R-squared (uncentered):              0.973
Model:                            OLS   Adj. R-squared (uncentered):         0.946
Method:                 Least Squares   F-statistic:                         35.97
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                0.00216
Time:                        13:12:10   Log-Likelihood:                    -37.992
No. Observations:                   8   AIC:                                 83.98
Df Residuals:                       4   BIC:                                 84.30
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a            167.3835     45.459      3.682      0.021      41.170     293.597
b              1.6286      0.621      2.622      0.059      -0.096       3.353
c             83.8313     55.572      1.509      0.206     -70.461     238.123
d             -2.7363      6.841     -0.400      0.710     -21.729      16.256
==============================================================================
Omnibus:                        1.673   Durbin-Watson:                   2.460
Prob(Omnibus):                  0.433   Jarque-Bera (JB):                0.446
Skew:                           0.574   Prob(JB):                        0.800
Kurtosis:                       2.860   Cond. No.                         146.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

You can pick and choose what values to extract.
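
For instance, the fitted results object exposes the usual quantities as attributes (names from statsmodels' RegressionResults):

print(model.params)      # coefficients
print(model.pvalues)     # p-values
print(model.rsquared)    # R-squared
print(model.conf_int())  # confidence intervals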

To do it by individual status:

statuses = df['status'].unique()
for status in statuses:
    print(status)
    df_redux = df[df['status'] == status]
    print(df_redux)
    # four predictors for multiple regression; use a single column,
    # e.g. X = df_redux['a'], for simple linear regression
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)

    print_model = model.summary()
    print(print_model)

which gives:

1
   ID  status  y_Values     a   b      c      d
0   1       1    150.51  0.26  23  0.151  1.215
1   2       1    153.11  0.86  14  0.156  1.651
2   3       1    189.32  0.46  51  0.151  2.154
                            OLS Regression Results
==============================================================================
Dep. Variable:               y_Values   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                nan
Time:                        13:12:15   Log-Likelihood:                 77.832
No. Observations:                   3   AIC:                            -149.7
Df Residuals:                       0   BIC:                            -152.4
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a           -248.8778        inf         -0        nan         nan         nan
b             -4.9837        inf         -0        nan         nan         nan
c            229.5383        inf          0        nan         nan         nan
d            242.9489        inf          0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.443
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.281
Skew:                           0.016   Prob(JB):                        0.869
Kurtosis:                       1.500   Cond. No.                         554.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
4
   ID  status  y_Values     a   b      c      d
4   5       4    189.65  0.91  11  0.123  2.104
5   6       4    144.23  0.69  16  0.178  3.515
6   7       4    198.02  0.62  18  0.891  1.561
                            OLS Regression Results
==============================================================================
Dep. Variable:               y_Values   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                nan
Time:                        13:12:15   Log-Likelihood:                 82.381
No. Observations:                   3   AIC:                            -158.8
Df Residuals:                       0   BIC:                            -161.5
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a            183.1273        inf          0        nan         nan         nan
b              8.9478        inf          0        nan         nan         nan
c            -25.7862        inf         -0        nan         nan         nan
d            -34.3392        inf         -0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.154
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.284
Skew:                           0.072   Prob(JB):                        0.868
Kurtosis:                       1.500   Cond. No.                         67.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
9
   ID  status  y_Values     a   b      c      d
3   4       9    145.65  0.46  62  0.157  3.145
7   8       9    178.09  0.91  22  0.156  9.155
                            OLS Regression Results
==============================================================================
Dep. Variable:               y_Values   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                nan
Time:                        13:12:15   Log-Likelihood:                 58.629
No. Observations:                   2   AIC:                            -113.3
Df Residuals:                       0   BIC:                            -115.9
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a              1.4521        inf          0        nan         nan         nan
b              1.5473        inf          0        nan         nan         nan
c              0.1974        inf          0        nan         nan         nan
d             15.5869        inf          0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.800
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.333
Skew:                           0.000   Prob(JB):                        0.846
Kurtosis:                       1.000   Cond. No.                         8.72
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.

Of course, given the size of your subsets, the regression results aren't that nice. I assume you have a much larger dataframe.

To extract specific info (like R2) just add print(model.rsquared).

Update:

A more complete method to extract information is to add this:

import pandas as pd

stats_1 = pd.read_html(model.summary().tables[0].as_html(), header=0, index_col=0)[0]
stats_2 = pd.read_html(model.summary().tables[1].as_html(), header=0, index_col=0)[0]

which will return two DataFrames:

stats_1

   Dep. Variable:            y_Values   R-squared (uncentered):          0.973
0  Model:                         OLS   Adj. R-squared (uncentered):   0.94600
1  Method:              Least Squares   F-statistic:                  35.97000
2  Date:             Wed, 27 Oct 2021   Prob (F-statistic):            0.00216
3  Time:                     13:52:04   Log-Likelihood:              -37.99200
4  No. Observations:                8   AIC:                          83.98000
5  Df Residuals:                    4   BIC:                          84.30000
6  Df Model:                        4   NaN                                NaN
7  Covariance Type:         nonrobust   NaN                                NaN

and

stats_2

  index      coef  std err      t  P>|t|   [0.025    0.975]
0     a  167.3835   45.459  3.682  0.021   41.170   293.597
1     b    1.6286    0.621  2.622  0.059   -0.096     3.353
2     c   83.8313   55.572  1.509  0.206  -70.461   238.123
3     d   -2.7363    6.841 -0.400  0.710  -21.729    16.256

You can now pick the columns you want, e.g.:

stats_2['coef']

  index      coef
0     a  167.3835
1     b    1.6286
2     c   83.8313
3     d   -2.7363
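
As a side note, statsmodels can also hand you the coefficient table as a DataFrame without the read_html round trip: summary2() stores its tables as DataFrames (the column labels may differ slightly from the read_html version):

coef_table = model.summary2().tables[1]
print(coef_table)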

so your loop would be something like:

df_coef = []
statuses = df['status'].unique()
for status in statuses:
    df_redux = df[df['status'] == status]
    print(df_redux)
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)
    stats_1 = pd.read_html(model.summary().tables[0].as_html(), header=0, index_col=0)[0]
    stats_2 = pd.read_html(model.summary().tables[1].as_html(), header=0, index_col=0)[0]
    if len(stats_2) != 0:
        stats_2['status'] = status
        df_coef.append(stats_2)

all_coef = pd.concat(df_coef)
coef_long = all_coef[['status', 'coef']]
print(coef_long)

which gives:

   status      coef
a       1 -248.8778
b       1   -4.9837
c       1  229.5383
d       1  242.9489
a       4  183.1273
b       4    8.9478
c       4  -25.7862
d       4  -34.3392
a       9    1.4521
b       9    1.5473
c       9    0.1974
d       9   15.5869

Then append it to your original df by merging on status, e.g. as sketched below.
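
A possible sketch of that merge (assuming, as in the output above, that all_coef keeps the variable names a..d in its index):

coef_wide = (all_coef[['status', 'coef']]
                 .rename_axis('variable')   # name the index that holds the variable names a..d
                 .reset_index()
                 .pivot(index='status', columns='variable', values='coef')
                 .add_prefix('coef_')
                 .reset_index())
df_merged = df.merge(coef_wide, on='status', how='left')
print(df_merged.head())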

Update 2

Thanks for the solution, I got all the coefficients, but what I meant by merging/concatenating the predicted values is that when I print out predictions, I get these four tables with row ID and predicted value. What I need is to merge these four tables (which are stored in the one variable predictions) into a DataFrame with columns ID and results.

After that I can merge the new dataframe to the original one on the column 'ID'.

....
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(predictions)

0 401.094849
1 420.949054
2 407.918627
4 363.367876
8 255.865852
...
1556 430.050556
1558 292.949037
1559 306.011285
1560 412.041196
1561 360.829533

Length: 958, dtype: float64
5 366.159418
12 204.606629
18 400.767161
20 401.544449
21 267.192577
...
1530 384.151730
1533 275.356699
1539 376.165539
1543 334.024327
1547 272.197374
Length: 205, dtype: float64

I tried converting the predictions variable to a list or dict, but could not figure out how to concatenate all four tables. It is probably an easy solution, but I couldn't find it.

Update 3

Does this work for you?

df = pd.read_csv("df.csv", sep=";")
df_coef = []
statuses = df['status'].unique()
for status in statuses:
    df_redux = df[df['status'] == status]
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)
    stats_2 = pd.read_html(model.summary().tables[1].as_html(), header=0, index_col=0)[0]
    predictions = pd.DataFrame(predictions, columns=['predictions'])
    gf = pd.concat([predictions, df_redux], axis=1)
    df_coef.append(gf)

all_coef = pd.concat(df_coef)

which produces:

   predictions  ID  status  y_Values     a   b      c      d
0       150.51   1       1    150.51  0.26  23  0.151  1.215
1       153.11   2       1    153.11  0.86  14  0.156  1.651
2       189.32   3       1    189.32  0.46  51  0.151  2.154
4       189.65   5       4    189.65  0.91  11  0.123  2.104
5       144.23   6       4    144.23  0.69  16  0.178  3.515
6       198.02   7       4    198.02  0.62  18  0.891  1.561
3       145.65   4       9    145.65  0.46  62  0.157  3.145
7       178.09   8       9    178.09  0.91  22  0.156  9.155

Observe that in this small example y_Values and predictions coincide: each subgroup has at least as many predictors as observations, so the fit is exact.

Automatic Linear/Multiple Regression in Python with 50+ columns

If you are looking for columns with high r-squared values, just try a correlation matrix. To make it easier to read, I would recommend plotting it as a heat map with seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

df_corr = df.corr()
sns.heatmap(df_corr, cmap="coolwarm", annot=True)
plt.show()

Another suggestion is to run a Principal Component Analysis (PCA) on your dataset to find the features with the highest variability. These are often the most informative variables and can be used to make good predictions; a short sketch follows below. Just let me know if you want more info on this technique.
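
A minimal PCA sketch with scikit-learn (standardizing first, since PCA is scale-sensitive; keeping 95% of the variance is just an illustrative choice, and df is assumed to be your dataframe):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(df.select_dtypes('number'))
pca = PCA(n_components=0.95)          # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance explained by each retained component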

Can I use numpy.polyfit(x, y, deg) for multiple linear regression

numpy.polyfit fits a polynomial in a single variable, so it is not the right tool for several predictors. Assuming your equation is a * exercise + b * age + intercept = y, you can fit a multiple linear regression with numpy or scikit-learn as follows:

from sklearn import linear_model
import numpy as np
np.random.seed(42)

X = np.random.randint(low=1, high=10, size=20).reshape(10, 2)  # two columns: exercise, age
X = np.c_[X, np.ones(X.shape[0])]  # add a column of ones for the intercept
y = np.random.randint(low=1, high=10, size=10)

# Option 1: normal equations via the pseudo-inverse
a, b, intercept = np.linalg.pinv((X.T).dot(X)).dot(X.T.dot(y))
print(a, b, intercept)

# Option 2: numpy's least-squares solver
a, b, intercept = np.linalg.lstsq(X, y, rcond=None)[0]
print(a, b, intercept)

# Option 3: scikit-learn (fit_intercept=False because the intercept column was added manually)
clf = linear_model.LinearRegression(fit_intercept=False)
clf.fit(X, y)
print(clf.coef_)

