Fama-MacBeth Regression in Python (Pandas or Statsmodels)

An update to reflect the library situation for Fama-MacBeth as of Fall 2018. The fama_macbeth function was removed from pandas some time ago, so what are your options?

  1. If you're using python 3, then you can use the Fama-MacBeth method in LinearModels: https://github.com/bashtage/linearmodels/blob/master/linearmodels/panel/model.py

  2. If you're using Python 2 or just don't want to use LinearModels, then probably your best option is to roll your own.

For example, suppose you have the Fama-French industry portfolios in a panel like the following (you've also computed some variables like past beta or past returns to use as your x-variables):

In [1]: import pandas as pd
   ...: import numpy as np
   ...: import statsmodels.formula.api as smf

In [4]: df = pd.read_csv('industry.csv', parse_dates=['caldt'])
   ...: df.query("caldt == '1995-07-01'")

Out[5]:
      industry      caldt   ret    beta  r12to2  r36to13
18432     Aero 1995-07-01  6.26  0.9696  0.2755   0.3466
18433    Agric 1995-07-01  3.37  1.0412  0.1260   0.0581
18434    Autos 1995-07-01  2.42  1.0274  0.0293   0.2902
18435    Banks 1995-07-01  4.82  1.4985  0.1659   0.2951

Fama-MacBeth primarily involves computing the same cross-sectional regression model month by month, so you can implement it using a groupby. Create a function that takes a dataframe (it will come from the groupby) and a patsy formula, fits the model, and returns the parameter estimates. Here is a barebones version (note this is what the original questioner tried a few years ago; it's possible that back then the statsmodels result's params attribute wasn't returning a pandas Series and so needed to be converted to a Series explicitly, but it works fine in the current version of pandas, 0.23.4):

def ols_coef(x, formula):
    return smf.ols(formula, data=x).fit().params

In [9]: gamma = (df.groupby('caldt')
   ...:            .apply(ols_coef, 'ret ~ 1 + beta + r12to2 + r36to13'))
   ...: gamma.head()

Out[10]:
            Intercept      beta     r12to2   r36to13
caldt
1963-07-01  -1.497012 -0.765721   4.379128 -1.918083
1963-08-01  11.144169 -6.506291   5.961584 -2.598048
1963-09-01  -2.330966 -0.741550  10.508617 -4.377293
1963-10-01   0.441941  1.127567   5.478114 -2.057173
1963-11-01   3.380485 -4.792643   3.660940 -1.210426

Then just compute the mean, the standard error of the mean, and a t-test (or whatever statistics you want). Something like the following:

def fm_summary(p):
    s = p.describe().T
    s['std_error'] = s['std'] / np.sqrt(s['count'])
    s['tstat'] = s['mean'] / s['std_error']
    return s[['mean', 'std_error', 'tstat']]

In [12]: fm_summary(gamma)
Out[12]:
               mean  std_error     tstat
Intercept  0.754904   0.177291  4.258000
beta      -0.012176   0.202629 -0.060092
r12to2     1.794548   0.356069  5.039896
r36to13    0.237873   0.186680  1.274230

Improving Speed

Using statsmodels for the regressions carries significant overhead (particularly given that you only need the estimated coefficients). If you want better efficiency, you can switch from statsmodels to numpy.linalg.lstsq. Write a new function that does the OLS estimation, something like the following (note that I'm not doing anything like checking the rank of these matrices):

def ols_np(data, yvar, xvar):
    # Note: np.linalg.lstsq does not add a constant, so include an
    # intercept column in xvar explicitly if you want one.
    gamma, _, _, _ = np.linalg.lstsq(data[xvar], data[yvar], rcond=None)
    return pd.Series(gamma, index=xvar)
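Since np.linalg.lstsq doesn't add an intercept for you, include a constant column in the x-vars yourself. Here is a hedged sketch of using ols_np in the groupby on made-up data (the column names mirror the example above, but the numbers are fabricated):

```python
import numpy as np
import pandas as pd

def ols_np(data, yvar, xvar):
    gamma, _, _, _ = np.linalg.lstsq(data[xvar], data[yvar], rcond=None)
    return pd.Series(gamma, index=xvar)

# Toy panel: 3 months of 10 "industries" with fabricated returns
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'caldt': np.repeat(pd.date_range('1995-01-01', periods=3, freq='MS'), 10),
    'ret': rng.normal(size=30),
    'beta': rng.normal(size=30),
})
df['const'] = 1.0  # lstsq does not add an intercept for you

# One cross-sectional regression per month, as before
gamma = df.groupby('caldt').apply(ols_np, 'ret', ['const', 'beta'])
print(gamma)
```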

If you're still using an older version of pandas, the built-in fama_macbeth function will work. Here is an example of using it:

>>> df

                y    x
date       id
2012-01-01 1  0.1  0.4
           2  0.3  0.6
           3  0.4  0.2
           4  0.0  1.2
2012-02-01 1  0.2  0.7
           2  0.4  0.5
           3  0.2  0.1
           4  0.1  0.0
2012-03-01 1  0.4  0.8
           2  0.6  0.1
           3  0.7  0.6
           4  0.4 -0.1

Notice the structure: the fama_macbeth function expects the y-var and x-vars to have a MultiIndex with the date as the first level and the stock/firm/entity id as the second level:

>>> fm  = pd.fama_macbeth(y=df['y'],x=df[['x']])
>>> fm

----------------------Summary of Fama-MacBeth Analysis-------------------------

Formula: Y ~ x + intercept
# betas : 3

----------------------Summary of Estimated Coefficients------------------------
Variable Beta Std Err t-stat CI 2.5% CI 97.5%
(x) -0.0227 0.1276 -0.18 -0.2728 0.2273
(intercept) 0.3531 0.0842 4.19 0.1881 0.5181

--------------------------------End of Summary---------------------------------

Note that just printing fm calls fm.summary:

>>> fm.summary

----------------------Summary of Fama-MacBeth Analysis-------------------------

Formula: Y ~ x + intercept
# betas : 3

----------------------Summary of Estimated Coefficients------------------------
Variable Beta Std Err t-stat CI 2.5% CI 97.5%
(x) -0.0227 0.1276 -0.18 -0.2728 0.2273
(intercept) 0.3531 0.0842 4.19 0.1881 0.5181

--------------------------------End of Summary---------------------------------

Also note that the fama_macbeth function automatically adds an intercept (unlike the statsmodels routines), and the x-var has to be a DataFrame, so if you pass just one column you need to pass it as df[['x']].

If you don't want an intercept you have to do:

>>> fm  = pd.fama_macbeth(y=df['y'],x=df[['x']],intercept=False)

Fixed effects in Pandas or Statsmodels

As noted in the comments, PanelOLS has been removed from Pandas as of version 0.20.0. So you really have three options:

  1. If you use Python 3 you can use linearmodels as specified in the more recent answer: https://stackoverflow.com/a/44836199/3435183

  2. Just specify various dummies in your statsmodels specification, e.g. using pd.get_dummies. May not be feasible if the number of fixed effects is large.

  3. Or do some groupby-based demeaning and then use statsmodels (this works well if you're estimating lots of fixed effects). Here is a barebones version of what you could do for one-way fixed effects:

    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    import patsy

    def areg(formula, data=None, absorb=None, cluster=None):

        y, X = patsy.dmatrices(formula, data, return_type='dataframe')

        # Demean within each absorbed group, adding back the grand mean
        ybar = y.mean()
        y = y - y.groupby(data[absorb]).transform('mean') + ybar

        Xbar = X.mean()
        X = X - X.groupby(data[absorb]).transform('mean') + Xbar

        reg = sm.OLS(y, X)
        # Account for df loss from the FE transform
        reg.df_resid -= (data[absorb].nunique() - 1)

        return reg.fit(cov_type='cluster',
                       cov_kwds={'groups': data[cluster].values})

For example, suppose you have a panel of stock data (returns and other stock-level variables for all stocks, every month over a number of months) and you want to regress returns on lagged returns with calendar-month fixed effects (where the calendar month variable is called caldt), clustering the standard errors by calendar month. You can estimate such a fixed effects model with the following:

reg0 = areg('ret~retlag',data=df,absorb='caldt',cluster='caldt')
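As a sanity check on the demeaning approach, the within (demeaned) slope should match a regression with explicit month dummies exactly. Here is a hedged, self-contained comparison on fabricated data (variable names mirror the example above but are otherwise made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Fabricated panel: 50 stocks over 12 months with a month fixed effect
rng = np.random.default_rng(2)
n_months, n_stocks = 12, 50
df = pd.DataFrame({
    'caldt': np.repeat(np.arange(n_months), n_stocks),
    'retlag': rng.normal(size=n_months * n_stocks),
})
month_fe = rng.normal(size=n_months)
df['ret'] = (0.5 * df['retlag'] + month_fe[df['caldt'].to_numpy()]
             + rng.normal(size=len(df)))

# Within estimator: demean both sides by month, drop the intercept
g = df.groupby('caldt')
demeaned = df[['ret', 'retlag']] - g[['ret', 'retlag']].transform('mean')
slope_within = smf.ols('ret ~ retlag - 1', data=demeaned).fit().params['retlag']

# Dummy-variable estimator: explicit month dummies via C()
slope_dummies = smf.ols('ret ~ retlag + C(caldt)', data=df).fit().params['retlag']

print(slope_within, slope_dummies)  # the two slopes agree
```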

And here is what you can do if you're using an older version of Pandas: an example with time fixed effects using pandas' PanelOLS (which lives in the plm module). Notice the import of PanelOLS:

>>> from pandas.stats.plm import PanelOLS
>>> df

                y    x
date       id
2012-01-01 1  0.1  0.2
           2  0.3  0.5
           3  0.4  0.8
           4  0.0  0.2
2012-02-01 1  0.2  0.7
           2  0.4  0.5
           3  0.2  0.3
           4  0.1  0.1
2012-03-01 1  0.6  0.9
           2  0.7  0.5
           3  0.9  0.6
           4  0.4  0.5

Note, the DataFrame must have a MultiIndex set; PanelOLS determines the time and entity effects based on the index:

>>> reg  = PanelOLS(y=df['y'],x=df[['x']],time_effects=True)
>>> reg

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x>

Number of Observations: 12
Number of Degrees of Freedom: 4

R-squared: 0.2729
Adj R-squared: 0.0002

Rmse: 0.1588

F-stat (1, 8): 1.0007, p-value: 0.3464

Degrees of Freedom: model 3, resid 8

-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 0.3694 0.2132 1.73 0.1214 -0.0485 0.7872
---------------------------------End of Summary---------------------------------

Docstring:

PanelOLS(self, y, x, weights = None, intercept = True, nw_lags = None,
entity_effects = False, time_effects = False, x_effects = None,
cluster = None, dropped_dummies = None, verbose = False,
nw_overlap = False)

Implements panel OLS.

See ols function docs

This is another function (like fama_macbeth) where I believe the plan is to move this functionality to statsmodels.

PatsyError when using statsmodels for regression

The problem is that you're passing a grouped dataframe into the patsy.dmatrices function. Since the grouped dataframe is iterable, you can do it in a loop like this, storing all of your X dataframes (one for each group) in a dictionary:

import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
import patsy

# Loading data
df = sm.datasets.get_rdataset("Guerry", "HistData").data

# Extracting Independent variables
formula = 'Suicides ~ Crime_parents + Infanticide'
data = df.groupby(['Region'])[['Suicides', 'Crime_parents', 'Infanticide', 'Region']]
X = {}
for name, group in data:
    Y, X[name] = patsy.dmatrices(formula, group, return_type='dataframe')

print(X)


