Confidence intervals for predictions from logistic regression
The usual way is to compute a confidence interval on the scale of the linear predictor, where things will be more normal (Gaussian), and then apply the inverse of the link function to map the confidence interval from the linear-predictor scale to the response scale.
To do this you need two things:
- call predict() with type = "link", and
- call predict() with se.fit = TRUE.
The first produces predictions on the scale of the linear predictor; the second returns the standard errors of those predictions. In pseudo-code:
## foo <- mtcars[,c("mpg","vs")]; names(foo) <- c("x","y") ## Working example data
mod <- glm(y ~ x, data = foo, family = binomial)
preddata <- with(foo, data.frame(x = seq(min(x), max(x), length = 100)))
preds <- predict(mod, newdata = preddata, type = "link", se.fit = TRUE)
preds is then a list with components fit and se.fit.
The confidence interval on the linear predictor is then
critval <- 1.96 ## approx 95% CI
upr <- preds$fit + (critval * preds$se.fit)
lwr <- preds$fit - (critval * preds$se.fit)
fit <- preds$fit
critval is chosen from a t or z (normal) distribution, with the required coverage. For a binomial or Poisson GLM, where the dispersion is fixed at 1, the z (normal) distribution is appropriate; when the dispersion is estimated (Gaussian and quasi- families), a t distribution is used. The 1.96 is the value of the Gaussian distribution giving 95% coverage:
> qnorm(0.975) ## 0.975 as this is upper tail, 2.5% also in lower tail
[1] 1.959964
Now for fit, upr and lwr we need to apply the inverse of the link function to them:
fit2 <- mod$family$linkinv(fit)
upr2 <- mod$family$linkinv(upr)
lwr2 <- mod$family$linkinv(lwr)
Now you can plot all three and the data.
preddata$lwr <- lwr2
preddata$upr <- upr2
ggplot(data=foo, mapping=aes(x=x,y=y)) + geom_point() +
stat_smooth(method="glm", method.args=list(family=binomial)) +
geom_line(data=preddata, mapping=aes(x=x, y=upr), col="red") +
geom_line(data=preddata, mapping=aes(x=x, y=lwr), col="red")
Confidence interval of probability prediction from logistic regression statsmodels
You can use the delta method to find an approximate variance for the predicted probability. Namely,
var(proba) = np.dot(np.dot(gradient.T, cov), gradient)
where gradient is the vector of derivatives of the predicted probability with respect to the model coefficients, and cov is the covariance matrix of the coefficients.
The delta method is proven to work asymptotically for all maximum-likelihood estimates. However, if you have a small training sample, asymptotic methods may not work well, and you should consider bootstrapping.
Here is a toy example of applying delta method to logistic regression:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
# generate data
np.random.seed(1)
x = np.arange(100)
y = (x * 0.5 + np.random.normal(size=100,scale=10)>30)
# estimate the model
X = sm.add_constant(x)
model = sm.Logit(y, X).fit()
proba = model.predict(X) # predicted probability
# estimate confidence interval for predicted probabilities
cov = model.cov_params()
gradient = (proba * (1 - proba) * X.T).T # matrix of gradients for each observation
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient])
c = 1.96 # multiplier for confidence interval
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))
plt.plot(x, proba)
plt.plot(x, lower, color='g')
plt.plot(x, upper, color='g')
plt.show()
It draws the following nice picture:
For your example the code would be
proba = logit.predict(age_range_poly)
cov = logit.cov_params()
gradient = (proba * (1 - proba) * age_range_poly.T).T
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient])
c = 1.96
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))
plt.plot(age_range_poly[:, 1], proba)
plt.plot(age_range_poly[:, 1], lower, color='g')
plt.plot(age_range_poly[:, 1], upper, color='g')
plt.show()
and it would give the following picture
Looks pretty much like a boa-constrictor with an elephant inside.
You could compare it with the bootstrap estimates:
preds = []
for i in range(1000):
boot_idx = np.random.choice(len(age), replace=True, size=len(age))
model = sm.Logit(wage['wage250'].iloc[boot_idx], age[boot_idx]).fit(disp=0)
preds.append(model.predict(age_range_poly))
p = np.array(preds)
plt.plot(age_range_poly[:, 1], np.percentile(p, 97.5, axis=0))
plt.plot(age_range_poly[:, 1], np.percentile(p, 2.5, axis=0))
plt.show()
Results of delta method and bootstrap look pretty much the same.
The authors of the book, however, go a third way. They use the fact that
proba = np.exp(np.dot(x, params)) / (1 + np.exp(np.dot(x, params)))
and calculate the confidence interval for the linear part, and then transform it with the inverse logit (logistic) function
xb = np.dot(age_range_poly, logit.params)
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in age_range_poly])
upper_xb = xb + c * std_errors
lower_xb = xb - c * std_errors
upper = np.exp(upper_xb) / (1 + np.exp(upper_xb))
lower = np.exp(lower_xb) / (1 + np.exp(lower_xb))
plt.plot(age_range_poly[:, 1], upper)
plt.plot(age_range_poly[:, 1], lower)
plt.show()
So they get a diverging interval:
These methods produce such different results because they assume different quantities (the predicted probability and the log-odds, respectively) to be normally distributed. Namely, the delta method assumes predicted probabilities are normal, while in the book, the log-odds are assumed normal. In fact, neither is normal in finite samples; both converge to normality in infinite samples, but their variances converge to zero at the same time. Maximum likelihood estimates are insensitive to reparametrization, but their estimated distribution is not, and that's the problem.
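A minimal numeric check of that difference, using hypothetical coefficients and covariance (not taken from any fit above): for a single observation, the delta-method interval is symmetric around the predicted probability, while the back-transformed link-scale interval is not.

```python
import numpy as np
from scipy.special import expit  # logistic function

# hypothetical fitted coefficients and covariance, for illustration only
params = np.array([-4.0, 0.1])
cov = np.array([[0.25, -0.005],
                [-0.005, 0.0004]])
xrow = np.array([1.0, 20.0])   # one observation: intercept, x = 20

eta = xrow @ params            # linear predictor (log-odds)
var_eta = xrow @ cov @ xrow    # its variance
p = expit(eta)                 # predicted probability

# delta method: var(p) ~= (dp/deta)^2 * var(eta), with dp/deta = p(1-p)
se_p = p * (1 - p) * np.sqrt(var_eta)
delta_ci = (p - 1.96 * se_p, p + 1.96 * se_p)

# link-scale interval, back-transformed to the probability scale
link_ci = (expit(eta - 1.96 * np.sqrt(var_eta)),
           expit(eta + 1.96 * np.sqrt(var_eta)))
```

delta_ci is centered exactly on p; link_ci is skewed away from it, which is the divergence the two pictures show.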
Confidence interval of prediction from glm model
predict.glm() doesn't take the same arguments as predict.lm() (see ?predict.glm): you have to do this by hand (or find a package with helper functions). The following code constructs the lower and upper 95% Wald confidence limits on the logit (log-odds) scale and then uses plogis() to back-transform to the probability scale ...
pp <- predict(glm.fit, se.fit = TRUE)
ci_lwr <- with(pp, plogis(fit + qnorm(0.025)*se.fit))
ci_upr <- with(pp, plogis(fit + qnorm(0.975)*se.fit))
> head(ci_lwr)
1 2 3 4 5 6
0.4842931 0.4596593 0.4451171 0.4780052 0.4796479 0.4759596
> head(ci_upr)
1 2 3 4 5 6
0.5433766 0.5347319 0.5339426 0.5581846 0.5492351 0.5398233
How to plot logistic glm predicted values and confidence interval in R
This is a lowest-common-denominator, base-packages-only solution.
Fit the model:
mm <- glm(y~site_name,data=dd,family=binomial)
Make up a prediction frame with the site names:
pframe <- data.frame(site_name=unique(dd$site_name))
Predict (on the logit/linear-predictor scale), with standard errors:
pp <- predict(mm,newdata=pframe,se.fit=TRUE)
linkinv <- family(mm)$linkinv ## inverse-link function
Put together the prediction, lower and upper bounds, and back-transform to the probability scale:
pframe$pred0 <- pp$fit
pframe$pred <- linkinv(pp$fit)
alpha <- 0.95
sc <- abs(qnorm((1-alpha)/2)) ## Normal approx. to likelihood
alpha2 <- 0.5
sc2 <- abs(qnorm((1-alpha2)/2)) ## Normal approx. to likelihood
pframe <- transform(pframe,
lwr=linkinv(pred0-sc*pp$se.fit),
upr=linkinv(pred0+sc*pp$se.fit),
lwr2=linkinv(pred0-sc2*pp$se.fit),
upr2=linkinv(pred0+sc2*pp$se.fit))
Plot.
with(pframe,
{
plot(site_name,pred,ylim=c(0,1))
arrows(as.numeric(site_name),lwr,as.numeric(site_name),upr,
angle=90,code=3,length=0.1)
})
As a boxplot:
with(pframe,
{
bxp(list(stats=rbind(lwr,lwr2,pred,upr2,upr),
n = rep(1,nrow(pframe)),
conf = NA,
out = NULL,
group = NULL,
names=as.character(site_name)))
})
There are lots of other ways to do this; I would recommend
library("ggplot2")
ggplot(pframe,aes(site_name,pred))+
geom_pointrange(aes(ymin=lwr,ymax=upr))+
geom_linerange(aes(ymin=lwr2,ymax=upr2),lwd=1.5)+
coord_flip()
An alternative solution is to fit the model via y~site_name-1
, which in this case will assign a separate parameter to the probability of each site, and use profile()
/confint()
to find the confidence intervals; this will be slightly more accurate than relying on the Normality of the sampling distributions of the parameters/predictions as done in the answer above.
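In the same spirit of improving on plain Wald intervals for per-group probabilities, a sketch in Python using statsmodels' proportion_confint on made-up per-site counts. This uses the Wilson score interval rather than profile likelihood, but it shares the advantage of behaving sensibly near 0 and 1, where Wald intervals can spill outside [0, 1].

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint

# made-up per-site successes and trial counts, for illustration only
counts = np.array([12, 30, 7])
nobs = np.array([40, 45, 50])

p_hat = counts / nobs
# Wilson score intervals: asymmetric, always strictly inside (0, 1)
lwr, upr = proportion_confint(counts, nobs, alpha=0.05, method="wilson")
```

With a y ~ site_name - 1 parameterization, each site's probability estimate is just its sample proportion, so interval methods for a binomial proportion apply site by site.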
Generating confidence intervals for predicted probabilities after running mlogit function in R
One approach here is Monte Carlo simulation. You'd simulate repeated draws from a multivariate-normal sampling distribution whose parameters are given by your model results.
For each simulation, estimate your predicted probabilities, and use their empirical distribution over simulations to get your confidence intervals.
library(MASS)
est_betas <- m$coefficients
est_preds <- predict(m, newdata = Fish_test)
sim_betas <- mvrnorm(1000, m$coefficients, vcov(m))
sim_preds <- apply(sim_betas, 1, function(x) {
m$coefficients <- x
predict(m, newdata = Fish_test)
})
sim_ci <- apply(sim_preds, 1, quantile, c(.025, .975))
cbind(prob = est_preds, t(sim_ci))
# prob 2.5% 97.5%
# beach 0.1414336 0.10403634 0.1920795
# boat 0.3869535 0.33521346 0.4406527
# charter 0.3363766 0.28751240 0.3894717
# pier 0.1352363 0.09858375 0.1823240
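The same simulation idea translates directly to Python. This is a sketch with made-up multinomial-logit coefficients and covariance (mlogit's estimated coefficients and vcov would take their place): draw parameter vectors from a multivariate normal, compute the implied choice probabilities for one case, and take empirical percentiles.

```python
import numpy as np

rng = np.random.default_rng(42)

# made-up fit: coefficients for 3 non-baseline alternatives (columns),
# intercept + one covariate (rows), plus a covariance for the 6 parameters
beta_hat = np.array([[0.8, 0.5, -0.4],
                     [-0.02, 0.01, 0.03]])
cov = 0.001 * np.eye(beta_hat.size)

xrow = np.array([1.0, 30.0])  # one case: intercept, covariate = 30

def probs(beta_flat):
    """Softmax over the baseline (utility 0) and the 3 alternatives."""
    util = np.concatenate([[0.0], xrow @ beta_flat.reshape(2, 3)])
    e = np.exp(util - util.max())
    return e / e.sum()

# simulate parameter draws, map each to predicted probabilities
draws = rng.multivariate_normal(beta_hat.ravel(), cov, size=1000)
sim = np.array([probs(d) for d in draws])

point = probs(beta_hat.ravel())
lwr, upr = np.percentile(sim, [2.5, 97.5], axis=0)
```

Each row of sim is a full probability vector, so the percentile bounds come out per alternative, just as sim_ci does in the R code above.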