Double Clustered Standard Errors for Panel Data

Double clustered standard errors for panel data

Frank Harrell's package rms (which used to be named Design) has a function that I use often when clustering: robcov.

See this part of ?robcov, for example.

cluster: a variable indicating groupings. ‘cluster’ may be any type of
      vector (factor, character, integer).  NAs are not allowed.
      Unique values of ‘cluster’ indicate possibly correlated
      groupings of observations. Note the data used in the fit and
      stored in ‘fit$x’ and ‘fit$y’ may have had observations
      containing missing values deleted. It is assumed that if any
      NAs were removed during the original model fitting, an
      ‘naresid’ function exists to restore NAs so that the rows of
      the score matrix coincide with ‘cluster’. If ‘cluster’ is
      omitted, it defaults to the integers 1,2,...,n to obtain the
      "sandwich" robust covariance matrix estimate.

Higher level cluster standard errors for panel data

I wrote a package (clubTamal) when I encountered the same issue. clubTamal transform a plm object (by re-estimation) into an lm object in order to be able to cluster standard errors using teh multiwayvcov package. You find a Rpubs example and documentation here: https://rpubs.com/eliascis/clubTamal.

The package works for plm estimations with Fixed-Effects (model='within') or First-Difference models (model='fd').

To obtain a clustered covariance matrix use the vcovTamal command.

The package is still under development but can be installed directly from github:

library(devtools)
install_github("eliascis/clubTamal")

Unfortunalty the link to your example data does not work, but clubTamal further installs spd4testing, which constructs a simulated Small Panel Data set for testing purposes.

## packages
library(foreign)
library(lmtest)
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
library(plm)
library(multiwayvcov)
library(spd4testing)
library(clubTamal)

## simulated data
d <- spd4testing()

## formula
f <- formula(y ~ x + factor(year))

## standard estimation
e <- plm(formula = f, data = d, model = "fd")
summary(e)
#> Oneway (individual) effect First-Difference Model
#> 
#> Call:
#> plm(formula = f, data = d, model = "fd")
#> 
#> Unbalanced Panel: n = 6, T = 3-5, N = 26
#> Observations used in estimation: 20
#> 
#> Residuals:
#>     Min.  1st Qu.   Median  3rd Qu.     Max. 
#> -250.333 -115.219   12.651  108.390  228.110 
#> 
#> Coefficients:
#>                  Estimate Std. Error t-value Pr(>|t|)  
#> (Intercept)       -80.937    199.405 -0.4059  0.69096  
#> x                  71.858     25.974  2.7666  0.01514 *
#> factor(year)2002  194.842    216.449  0.9002  0.38325  
#> factor(year)2003  109.118    414.298  0.2634  0.79609  
#> factor(year)2004  446.147    583.234  0.7650  0.45700  
#> factor(year)2005  451.514    752.479  0.6000  0.55807  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Total Sum of Squares:    1078300
#> Residual Sum of Squares: 394270
#> R-Squared:      0.63435
#> Adj. R-Squared: 0.50376
#> F-statistic: 4.85757 on 5 and 14 DF, p-value: 0.0087377
e <- plm(formula = f, data = d, model = "within")
summary(e)
#> Oneway (individual) effect Within Model
#> 
#> Call:
#> plm(formula = f, data = d, model = "within")
#> 
#> Unbalanced Panel: n = 6, T = 3-5, N = 26
#> 
#> Residuals:
#>      Min.   1st Qu.    Median   3rd Qu.      Max. 
#> -167.4294  -59.3741   -6.9404   73.7132  146.4199 
#> 
#> Coefficients:
#>                  Estimate Std. Error t-value Pr(>|t|)   
#> x                  72.362     23.434  3.0879 0.007501 **
#> factor(year)2002  113.786     77.276  1.4725 0.161569   
#> factor(year)2003  -67.413     75.012 -0.8987 0.383013   
#> factor(year)2004  200.420     83.649  2.3960 0.030062 * 
#> factor(year)2005  127.170     81.030  1.5694 0.137400   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Total Sum of Squares:    441370
#> Residual Sum of Squares: 190660
#> R-Squared:      0.56803
#> Adj. R-Squared: 0.28005
#> F-statistic: 3.94491 on 5 and 15 DF, p-value: 0.017501

## clustering
# no clustering
v <- e$vcov
coeftest(e)
#> 
#> t test of coefficients:
#> 
#>                  Estimate Std. Error t value Pr(>|t|)   
#> x                  72.362     23.434  3.0879 0.007501 **
#> factor(year)2002  113.786     77.276  1.4725 0.161569   
#> factor(year)2003  -67.413     75.012 -0.8987 0.383013   
#> factor(year)2004  200.420     83.649  2.3960 0.030062 * 
#> factor(year)2005  127.170     81.030  1.5694 0.137400   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# clustering at id level with plm-package
v <- vcovHC(e, type = "HC1", cluster = "group", tol = 1 * 10^-20)
coeftest(e, v)
#> 
#> t test of coefficients:
#> 
#>                  Estimate Std. Error t value Pr(>|t|)   
#> x                  72.362     24.586  2.9433 0.010070 * 
#> factor(year)2002  113.786     76.548  1.4865 0.157870   
#> factor(year)2003  -67.413     63.962 -1.0540 0.308585   
#> factor(year)2004  200.420     61.464  3.2608 0.005266 **
#> factor(year)2005  127.170     67.310  1.8893 0.078338 . 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## clustering at group level with clubTamal
v <- vcovTamal(estimate = e, data = d, groupvar = "gid")
#> Error in vcovTamal(estimate = e, data = d, groupvar = "gid"): better use the very fast and powerful lfe::felm
coeftest(e, v)
#> 
#> t test of coefficients:
#> 
#>                  Estimate Std. Error t value Pr(>|t|)   
#> x                  72.362     24.586  2.9433 0.010070 * 
#> factor(year)2002  113.786     76.548  1.4865 0.157870   
#> factor(year)2003  -67.413     63.962 -1.0540 0.308585   
#> factor(year)2004  200.420     61.464  3.2608 0.005266 **
#> factor(year)2005  127.170     67.310  1.8893 0.078338 . 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

^{Created on 2020-05-29 by the reprex package (v0.3.0)}

R - fixed effect of panel data analysis and robust standard errors

Set cluster='group' if you want to cluster on the variable serving as the individual index (city in your example).

Set cluster='time' if you want to cluster on the variable serving as the time index (yearin your example).

You can cluster on the time index even for a fixed effects one-way individual model.

For clustering on both index variables, you cannot do that with plm::vcovHC. Look at vcovDC from the same packages which provides double clustering (DC = double clustering), e.g.,

coeftest(fem_city, vcovDC(fem_city)

Panel data regression: Robust standard errors

After some time playing around, it works for me and gives me:

                         Estimate  Std. Error t value  Pr(>|t|)    
(Intercept)            4.5099e-16  5.2381e-16  0.8610  0.389254    
C.MCAP_SEC            -5.9769e-07  1.2677e-07 -4.7149 2.425e-06 ***
C.Impact_change       -5.3908e-04  7.5601e-05 -7.1306 1.014e-12 ***
C.Mom                  3.7560e-04  3.3378e-03  0.1125  0.910406    
C.BM                  -1.6438e-04  1.7368e-05 -9.4645 < 2.2e-16 ***
C.PD                   6.2153e-02  3.8766e-02  1.6033  0.108885    
C.CashGen             -2.7876e-04  1.4031e-02 -0.0199  0.984149    
C.NITA                -8.1792e-02  3.2153e-02 -2.5438  0.010969 *  
C.PE                  -6.6170e-06  4.0138e-06 -1.6485  0.099248 .  
C.PEdummy              1.3143e-02  4.8864e-03  2.6897  0.007154 ** 
factor(DS_CODE)130324 -5.2497e-16  5.2683e-16 -0.9965  0.319028    
factor(DS_CODE)130409 -4.0276e-16  5.2384e-16 -0.7689  0.441986    
factor(DS_CODE)130775 -4.4113e-16  5.2424e-16 -0.8415  0.400089  
...

This leaves us with the question why it doesn't for you. I guess it has something to do with the format of your data. Is everything numeric? I converted the column classes and it looks like that for me:

str(dat)
'data.frame':   48251 obs. of  12 variables:
 $ DS_CODE      : chr  "902172" "902172" "902172" "902172" ...
 $ DNEW         : num  2e+05 2e+05 2e+05 2e+05 2e+05 ...
 $ MCAP_SEC     : num  78122 71421 81907 80010 82462 ...
 $ NITA         : num  0.135 0.135 0.135 0.135 0.135 ...
 $ CashGen      : num  0.198 0.198 0.198 0.198 0.198 ...
 $ BM           : num  0.1074 0.1108 0.097 0.0968 0.0899 ...
 $ PE           : num  57 55.3 63.1 63.2 68 ...
 $ PEdummy      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ L1.retE1M    : num  -0.72492 0.13177 0.00122 0.07214 -0.07332 ...
 $ Mom          : num  0 0 0 0 0 ...
 $ PD           : num  5.41e-54 1.51e-66 3.16e-80 2.87e-79 4.39e-89 ...
 $ Impact_change: num  0 -10.59 -10.43 0.7 -6.97 ...

What does str(data) return for you?

Double Clustered Standard Errors for Panel Data