Logistic regression with robust clustered standard errors in R
You might want to look at the rms
(regression modelling strategies) package. So, lrm
is logistic regression model, and if fit
is the name of your output, you'd have something like this:
fit=lrm(disease ~ age + study + rcs(bmi,3), x=T, y=T, data=dataf)
fit
robcov(fit, cluster=dataf$id)
bootcov(fit,cluster=dataf$id)
You have to specify x=T
, y=T
in the model statement. rcs
indicates restricted cubic splines with 3 knots.
robust and clustered standard error in R for probit and logit regression
I prefer the sandwich
package to compute robust standard errors. One reason is its excellent documentation. See vignette("sandwich")
which clearly shows all available defaults and options, and the corresponding article which explains how you can use ?sandwich
with custom bread
and meat
for special cases.
We can use sandwich
to figure out the difference between the options you posted. The difference will most likely be the degree of freedom correction. Here a comparison for the simple linear regression:
library(rms)
library(sandwich)
fitlm <-lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width,iris)
#Your Blog Post:
X <- model.matrix(fitlm)
n <- dim(X)[1]; k <- dim(X)[2]; dfc <- n/(n-k)
u <- matrix(resid(fitlm))
meat1 <- t(X) %*% diag(diag(crossprod(t(u)))) %*% X
Blog <- sqrt(dfc*diag(solve(crossprod(X)) %*% meat1 %*% solve(crossprod(X))))
# rms fits:
fitols <- ols(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, x=T, y=T, data=iris)
Harrell <- sqrt(diag(robcov(fitols, method="huber")$var))
Harrell_2 <- sqrt(diag(robcov(fitols, method="efron")$var))
# Variations available in sandwich:
variations <- c("const", "HC0", "HC1", "HC2","HC3", "HC4", "HC4m", "HC5")
Zeileis <- t(sapply(variations, function(x) sqrt(diag(vcovHC(fitlm, type = x)))))
rbind(Zeileis, Harrell, Harrell_2, Blog)
(Intercept) Sepal.Width Petal.Length Petal.Width
const 0.2507771 0.06664739 0.05671929 0.1275479
HC0 0.2228915 0.05965267 0.06134461 0.1421440
HC1 0.2259241 0.06046431 0.06217926 0.1440781
HC2 0.2275785 0.06087143 0.06277905 0.1454783
HC3 0.2324199 0.06212735 0.06426019 0.1489170
HC4 0.2323253 0.06196108 0.06430852 0.1488708
HC4m 0.2339698 0.06253635 0.06482791 0.1502751
HC5 0.2274557 0.06077326 0.06279005 0.1454329
Harrell 0.2228915 0.05965267 0.06134461 0.1421440
Harrell_2 0.2324199 0.06212735 0.06426019 0.1489170
Blog 0.2259241 0.06046431 0.06217926 0.1440781
- The result from the blog entry is equivalent to
HC1
. If the blog entry is similar to yourStata
output,Stata
usesHC1
. - Frank Harrel's function yields results similar to
HC0
. As far as I understand, this was the first proposed solution and when you look throughvignette(sandwich)
or the articles mentioned in?sandwich::vcovHC
, other methods have slightly better properties. They differ in their degree of freedom adjustments. Also note that the call torobcov(., method = "efron")
is similar toHC3
.
In any case, if you want identical output, use HC1
or just adjust the variance-covariance matrix approriately. After all, after looking at vignette(sandwich)
for the differences between different versions, you see that you just need to rescale with a constant to get from HC1
to HC0
, which should not be too difficult. By the way, note that HC3
or HC4
are typically preferred due to better small sample properties, and their behavior in the presence of influential observations. So, you probably want to change the defaults in Stata
.
You can use these variance-covariance matrices by supplying it to appropriate functions, such as lmtest::coeftest
or car::linearHypothesis
. For instance:
library(lmtest)
coeftest(fitlm, vcov=vcovHC(fitlm, "HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.855997 0.225924 8.2151 1.038e-13 ***
Sepal.Width 0.650837 0.060464 10.7640 < 2.2e-16 ***
Petal.Length 0.709132 0.062179 11.4046 < 2.2e-16 ***
Petal.Width -0.556483 0.144078 -3.8624 0.0001683 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For cluster-robust standard errors, you'll have to adjust the meat of the sandwich (see ?sandwich
) or look for a function doing that. There are already several sources explaining in excruciating detail how to do it with appropriate codes or functions. There is no reason for me to reinvent the wheel here, so I skip this.
There is also a relatively new and convenient package computing cluster-robust standard errors for linear models and generalized linear models. See here.
Different Robust Standard Errors of Logit Regression in Stata and R
The default so-called "robust" standard errors in Stata correspond to what sandwich()
from the package of the same name computes. The only difference is how the finite-sample adjustment is done. In the sandwich(...)
function no finite-sample adjustment is done at all by default, i.e., the sandwich is divided by 1/n where n is the number of observations. Alternatively, sandwich(..., adjust = TRUE)
can be used which divides by 1/(n - k) where k is the number of regressors. And Stata divides by 1/(n - 1).
Of course, asymptotically these do not differ at all. And except for a few special cases (e.g., OLS linear regression) there is no argument for 1/(n - k) or 1/(n - 1) to work "correctly" in finite samples (e.g., unbiasedness). At least not to the best of my knowledge.
So to obtain the same results as in Stata you can do do:
sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1)
coeftest(myfit, vcov = sandwich1)
This yields
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.9899791 1.1380890 -3.5059 0.0004551 ***
gre 0.0022644 0.0011027 2.0536 0.0400192 *
gpa 0.8040375 0.3451359 2.3296 0.0198259 *
rank2 -0.6754429 0.3144686 -2.1479 0.0317228 *
rank3 -1.3402039 0.3445257 -3.8900 0.0001002 ***
rank4 -1.5514637 0.4160544 -3.7290 0.0001922 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And just for the record: In the binary response case, these "robust" standard errors are not robust against anything. Provided that the model is correctly specified, they are consistent and it's ok to use them but they don't guard against any misspecification in the model. Because the basic assumption for the sandwich standard errors to work is that the model equation (or more precisely the corresponding score function) is correctly specified while the rest of the model may be misspecified. However, in a binary regression there is no room for misspecification because the model equation just consists of the mean (= probability) and the likelihood is the mean and 1 - mean, respectively. This is in contrast to linear or count data regression where there may be heteroskedasticity, overdispersion, etc.
logit binomial regression with clustered standard errors
I was able to replicate the results in stata by doing:
clrobustse <- function(fit.model, clusterid) {
rank=fit.model$rank
N.obs <- length(clusterid)
N.clust <- length(unique(clusterid))
dfc <- N.clust/(N.clust-1)
vcv <- vcov(fit.model)
estfn <- estfun(fit.model)
uj <- apply(estfn, 2, function(x) tapply(x, clusterid, sum))
N.VCV <- N.obs * vcv
ujuj.nobs <- crossprod(uj)/N.obs
vcovCL <- dfc*(1/N.obs * (N.VCV %*% ujuj.nobs %*% N.VCV))
coeftest(fit.model, vcov=vcovCL)
}
clrobustse(logit.model, data$rep78)
How to run fixed-effects logit model with clustered standard errors and survey weights in R?
I would check out the survey
package which provides everything for which you are asking. The first step is to create the survey object, specify the survey weights and then you are off to the races.
library(survey)
my_survey <- svydesign(ids= ~1, strata = ~country, wts = ~wts, data = your_data)
# Then you can use the survey glm to do what you want via
svy_fit <- svy_glm(ethnic ~ elec_prox +
elec_comp + round + country, data = my_survey, family = binomial())
Or at least I would go down this path given you are using survey data.
Related Topics
Plotting Continuous and Discrete Series in Ggplot with Facet
Constrain Multiple Sliderinput in Shiny to Sum to 100
How to Flip Rows and Columns in R
How to Control Ggplot's Plotting Area Proportions Instead of Fitting Them to Devices in R
A^K for Matrix Multiplication in R
Shiny: Switching Between Reactive Data Sets with Rhandsontable
Reading a CSV File with Repeated Row Names in R
R: Merge Based on Multiple Conditions (With Non-Equal Criteria)
Have Lubridate Subtraction Return Only a Numeric Value
Observeevent Shiny Function Used in a Module Does Not Work
How to Manually Set Geom_Bar Fill Color in Ggplot
Generating a Heatmap That Depicts the Clusters in a Dataset Using Hierarchical Clustering in R
Ggplot2 Error "No Layers in Plot"
Fama MACbeth Standard Errors in R
Subsetting a Data.Table by Range Making Use of Binary Search