Confidence Intervals for Predictions from Logistic Regression

Confidence intervals for predictions from logistic regression

The usual way is to compute a confidence interval on the scale of the linear predictor, where things will be more normal (Gaussian), and then apply the inverse of the link function to map the confidence interval from the linear-predictor scale to the response scale.

To do this you need two things:

  1. call predict() with type = "link", and
  2. call predict() with se.fit = TRUE.

The first produces predictions on the scale of the linear predictor, and the second returns the standard errors of those predictions. In R:

## foo <- mtcars[,c("mpg","vs")]; names(foo) <- c("x","y") ## Working example data
mod <- glm(y ~ x, data = foo, family = binomial)
preddata <- with(foo, data.frame(x = seq(min(x), max(x), length = 100)))
preds <- predict(mod, newdata = preddata, type = "link", se.fit = TRUE)

preds is then a list with components fit and se.fit.

The confidence interval on the linear predictor is then

critval <- 1.96 ## approx 95% CI
upr <- preds$fit + (critval * preds$se.fit)
lwr <- preds$fit - (critval * preds$se.fit)
fit <- preds$fit

critval is chosen from a t or z (normal) distribution, with the coverage required, depending on the family: for families with a fixed dispersion parameter (binomial, Poisson), use z; for families where the dispersion is estimated (gaussian, Gamma, and the quasi- families), a t distribution on the residual degrees of freedom is more appropriate. The 1.96 is the value of the Gaussian distribution giving 95% coverage:

> qnorm(0.975) ## 0.975 as this is upper tail, 2.5% also in lower tail
[1] 1.959964
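
If your family does have an estimated dispersion parameter, the t-based equivalent (using the fitted model mod from above) is:

critval <- qt(0.975, df = df.residual(mod)) ## t critical value on the residual degrees of freedom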

Now we need to apply the inverse of the link function to fit, upr and lwr.

fit2 <- mod$family$linkinv(fit)
upr2 <- mod$family$linkinv(upr)
lwr2 <- mod$family$linkinv(lwr)

Now you can plot all three and the data.

library(ggplot2)

preddata$lwr <- lwr2
preddata$upr <- upr2
ggplot(data = foo, mapping = aes(x = x, y = y)) +
    geom_point() +
    stat_smooth(method = "glm", method.args = list(family = binomial)) +
    geom_line(data = preddata, mapping = aes(x = x, y = upr), col = "red") +
    geom_line(data = preddata, mapping = aes(x = x, y = lwr), col = "red")

[Figure: the data with the stat_smooth fit and the manually computed interval drawn in red]

Confidence interval of probability prediction from logistic regression statsmodels

You can use the delta method to find the approximate variance of a predicted probability. Namely,

var(proba) = np.dot(np.dot(gradient.T, cov), gradient)

where gradient is the vector of derivatives of the predicted probability with respect to the model coefficients, and cov is the covariance matrix of the coefficients.
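
For the logistic model this gradient has a closed form: since

$$p = \frac{1}{1 + e^{-x^\top \beta}},$$

differentiating with respect to $\beta$ gives

$$\frac{\partial p}{\partial \beta} = p\,(1 - p)\,x,$$

which is exactly what the gradient line in the code below computes.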

The delta method is proven to work asymptotically for all maximum likelihood estimates. However, if you have a small training sample, asymptotic methods may not work well and you should consider bootstrapping.

Here is a toy example of applying delta method to logistic regression:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# generate data
np.random.seed(1)
x = np.arange(100)
y = (x * 0.5 + np.random.normal(size=100, scale=10) > 30)
# estimate the model
X = sm.add_constant(x)
model = sm.Logit(y, X).fit()
proba = model.predict(X) # predicted probability

# estimate confidence interval for predicted probabilities
cov = model.cov_params()
gradient = (proba * (1 - proba) * X.T).T # matrix of gradients for each observation
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient])
c = 1.96 # multiplier for confidence interval
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))

plt.plot(x, proba)
plt.plot(x, lower, color='g')
plt.plot(x, upper, color='g')
plt.show()

It draws the following nice picture:
[Figure: predicted probability with delta-method 95% bounds in green]

For your example, the code would be:

proba = logit.predict(age_range_poly)
cov = logit.cov_params()
gradient = (proba * (1 - proba) * age_range_poly.T).T
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient])
c = 1.96
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))

plt.plot(age_range_poly[:, 1], proba)
plt.plot(age_range_poly[:, 1], lower, color='g')
plt.plot(age_range_poly[:, 1], upper, color='g')
plt.show()

and it would give the following picture

[Figure: predicted probability vs. age with delta-method 95% bounds]

Looks pretty much like a boa-constrictor with an elephant inside.

You could compare it with the bootstrap estimates:

preds = []
for i in range(1000):
    boot_idx = np.random.choice(len(age), replace=True, size=len(age))
    model = sm.Logit(wage['wage250'].iloc[boot_idx], age[boot_idx]).fit(disp=0)
    preds.append(model.predict(age_range_poly))
p = np.array(preds)
plt.plot(age_range_poly[:, 1], np.percentile(p, 97.5, axis=0))
plt.plot(age_range_poly[:, 1], np.percentile(p, 2.5, axis=0))
plt.show()

[Figure: bootstrap 2.5% and 97.5% percentile bounds]

Results of delta method and bootstrap look pretty much the same.

The authors of the book, however, take a third route. They use the fact that

proba = np.exp(np.dot(x, params)) / (1 + np.exp(np.dot(x, params)))

and calculate a confidence interval for the linear part, which they then transform with the inverse-logit (logistic) function:

xb = np.dot(age_range_poly, logit.params)
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in age_range_poly])
upper_xb = xb + c * std_errors
lower_xb = xb - c * std_errors
upper = np.exp(upper_xb) / (1 + np.exp(upper_xb))
lower = np.exp(lower_xb) / (1 + np.exp(lower_xb))
plt.plot(age_range_poly[:, 1], upper)
plt.plot(age_range_poly[:, 1], lower)
plt.show()

So they get the diverging interval:

[Figure: the diverging interval produced by the link-scale (log-odds) method]

These methods produce such different results because they assume different quantities (the predicted probability and the log-odds) to be normally distributed. Namely, the delta method assumes predicted probabilities are normal, while in the book the log-odds are. In fact, neither is normal in finite samples; both converge to normality in infinite samples, but their variances converge to zero at the same time. Maximum likelihood estimates are invariant under reparametrization, but the normal approximation to their distribution is not, and that's the problem.
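
To make the contrast concrete: writing $\hat\eta = x^\top\hat\beta$ for the linear predictor and $s = \sqrt{x^\top \Sigma x}$ for its standard error, the two intervals are

$$\hat p \pm c\,\hat p\,(1-\hat p)\,s \quad \text{(delta method)} \qquad \text{versus} \qquad \operatorname{expit}(\hat\eta \pm c\,s) \quad \text{(link scale)}.$$

The first is symmetric around $\hat p$ and can cross 0 or 1 (hence the clipping in the code above); the second always stays inside $[0, 1]$ but widens asymmetrically where $s$ is large.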

Confidence interval of prediction from glm model

predict.glm() doesn't take an interval argument the way predict.lm() does (see ?predict.glm): you have to do this by hand (or find a package with helper functions). The following code constructs the lower and upper 95% Wald confidence limits on the logit (log-odds) scale and then uses plogis() to back-transform them to the probability scale ...

pp <- predict(glm.fit, se.fit = TRUE)
ci_lwr <- with(pp, plogis(fit + qnorm(0.025)*se.fit))
ci_upr <- with(pp, plogis(fit + qnorm(0.975)*se.fit))
> head(ci_lwr)
        1         2         3         4         5         6 
0.4842931 0.4596593 0.4451171 0.4780052 0.4796479 0.4759596 
> head(ci_upr)
        1         2         3         4         5         6 
0.5433766 0.5347319 0.5339426 0.5581846 0.5492351 0.5398233
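
If you would rather not do it by hand, helper packages exist. A sketch using the emmeans package (assuming, for illustration, a model with a single predictor x; neither the package nor the predictor name comes from the question):

library(emmeans)
## estimated marginal means, back-transformed to the probability scale, with CIs
emmeans(glm.fit, ~ x, type = "response")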

How to plot logistic glm predicted values and confidence interval in R

This is a lowest-common-denominator, base-package-only solution.

Fit the model:

mm <- glm(y~site_name,data=dd,family=binomial)

Make up a prediction frame with the site names:

pframe <- data.frame(site_name=unique(dd$site_name))

Predict (on the logit/linear-predictor scale), with standard errors:

pp <- predict(mm,newdata=pframe,se.fit=TRUE)
linkinv <- family(mm)$linkinv ## inverse-link function

Put together the prediction, lower and upper bounds, and back-transform to the probability scale:

pframe$pred0 <- pp$fit
pframe$pred <- linkinv(pp$fit)
alpha <- 0.95
sc <- abs(qnorm((1-alpha)/2))   ## Normal approx. to likelihood (95% interval)
alpha2 <- 0.5
sc2 <- abs(qnorm((1-alpha2)/2)) ## Normal approx. to likelihood (50% interval)
pframe <- transform(pframe,
                    lwr  = linkinv(pred0 - sc*pp$se.fit),
                    upr  = linkinv(pred0 + sc*pp$se.fit),
                    lwr2 = linkinv(pred0 - sc2*pp$se.fit),
                    upr2 = linkinv(pred0 + sc2*pp$se.fit))

Plot.

with(pframe,
{
    plot(site_name, pred, ylim = c(0,1))
    arrows(as.numeric(site_name), lwr, as.numeric(site_name), upr,
           angle = 90, code = 3, length = 0.1)
})

As a boxplot:

with(pframe,
{
    bxp(list(stats = rbind(lwr, lwr2, pred, upr2, upr),
             n = rep(1, nrow(pframe)),
             conf = NA,
             out = NULL,
             group = NULL,
             names = as.character(site_name)))
})

There are lots of other ways to do this; I would recommend

library("ggplot2")
ggplot(pframe,aes(site_name,pred))+
geom_pointrange(aes(ymin=lwr,ymax=upr))+
geom_linerange(aes(ymin=lwr2,ymax=upr2),lwd=1.5)+
coord_flip()

An alternative solution is to fit the model as y ~ site_name - 1, which in this case assigns a separate parameter to each site's probability, and to use profile()/confint() to find the confidence intervals; this is slightly more accurate than relying on the normality of the sampling distributions of the parameters/predictions as done above. A sketch follows.
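
A minimal sketch of that alternative, reusing dd and site_name from above:

mm2 <- glm(y ~ site_name - 1, data = dd, family = binomial)
ci_logit <- confint(mm2)  ## profile-likelihood CIs on the logit scale
plogis(ci_logit)          ## back-transform each limit to the probability scale

Because plogis() is monotone, transforming the endpoints preserves the interval's coverage.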

Generating confidence intervals for predicted probabilities after running mlogit function in R

One approach here is Monte Carlo simulation. You'd simulate repeated draws from a multivariate-normal sampling distribution whose parameters are given by your model results.

For each simulation, estimate your predicted probabilities, and use their empirical distribution over simulations to get your confidence intervals.

library(MASS)

est_betas <- m$coefficients
est_preds <- predict(m, newdata = Fish_test)

sim_betas <- mvrnorm(1000, m$coefficients, vcov(m))
sim_preds <- apply(sim_betas, 1, function(x) {
    m$coefficients <- x
    predict(m, newdata = Fish_test)
})
sim_ci <- apply(sim_preds, 1, quantile, c(.025, .975))

cbind(prob = est_preds, t(sim_ci))
#              prob       2.5%     97.5%
# beach   0.1414336 0.10403634 0.1920795
# boat    0.3869535 0.33521346 0.4406527
# charter 0.3363766 0.28751240 0.3894717
# pier    0.1352363 0.09858375 0.1823240
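
One practical note: the interval endpoints come from random draws, so fix the seed before mvrnorm() if you need them to be reproducible:

set.seed(123) ## any fixed seed; makes the simulated coefficient draws reproducible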

