How does predict.lm() compute confidence interval and prediction interval?
When the interval and level arguments are specified, predict.lm can return a confidence interval (CI) or a prediction interval (PI). This answer shows how to obtain the CI and PI without setting these arguments. There are two ways:
- use the middle-stage result from predict.lm;
- do everything from scratch.
Knowing how to work both ways gives you a thorough understanding of the prediction procedure.
Note that we will only cover the type = "response" (default) case for predict.lm. Discussion of type = "terms" is beyond the scope of this answer.
Setup
I gather your code here so that other readers can copy, paste and run it. I have also changed the variable names so that they have clearer meanings. In addition, I have expanded newdat to include more than one row, to show that our computations are "vectorized".
dat <- structure(list(V1 = c(20L, 60L, 46L, 41L, 12L, 137L, 68L, 89L,
4L, 32L, 144L, 156L, 93L, 36L, 72L, 100L, 105L, 131L, 127L, 57L,
66L, 101L, 109L, 74L, 134L, 112L, 18L, 73L, 111L, 96L, 123L,
90L, 20L, 28L, 3L, 57L, 86L, 132L, 112L, 27L, 131L, 34L, 27L,
61L, 77L), V2 = c(2L, 4L, 3L, 2L, 1L, 10L, 5L, 5L, 1L, 2L, 9L,
10L, 6L, 3L, 4L, 8L, 7L, 8L, 10L, 4L, 5L, 7L, 7L, 5L, 9L, 7L,
2L, 5L, 7L, 6L, 8L, 5L, 2L, 2L, 1L, 4L, 5L, 9L, 7L, 1L, 9L, 2L,
2L, 4L, 5L)), .Names = c("V1", "V2"),
class = "data.frame", row.names = c(NA, -45L))
lmObject <- lm(V1 ~ V2, data = dat)
newdat <- data.frame(V2 = c(6, 7))
The following is the output of predict.lm, to be compared with our manual computations later.
predict(lmObject, newdat, se.fit = TRUE, interval = "confidence", level = 0.90)
#$fit
# fit lwr upr
#1 89.63133 87.28387 91.9788
#2 104.66658 101.95686 107.3763
#
#$se.fit
# 1 2
#1.396411 1.611900
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508
predict(lmObject, newdat, se.fit = TRUE, interval = "prediction", level = 0.90)
#$fit
# fit lwr upr
#1 89.63133 74.46433 104.7983
#2 104.66658 89.43930 119.8939
#
#$se.fit
# 1 2
#1.396411 1.611900
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508
Use middle-stage result from predict.lm
## use `se.fit = TRUE`
z <- predict(lmObject, newdat, se.fit = TRUE)
#$fit
# 1 2
# 89.63133 104.66658
#
#$se.fit
# 1 2
#1.396411 1.611900
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508
What is se.fit?
z$se.fit is the standard error of the predicted mean z$fit, used to construct the CI for z$fit. We also need quantiles of the t-distribution with z$df degrees of freedom.
alpha <- 0.90 ## 90%
Qt <- c(-1, 1) * qt((1 - alpha) / 2, z$df, lower.tail = FALSE)
#[1] -1.681071 1.681071
## 90% confidence interval
CI <- z$fit + outer(z$se.fit, Qt)
colnames(CI) <- c("lwr", "upr")
CI
# lwr upr
#1 87.28387 91.9788
#2 101.95686 107.3763
We see that this agrees with predict.lm(, interval = "confidence").
What is the standard error for PI?
PI is wider than CI, as it accounts for residual variance:
variance_of_PI = variance_of_CI + variance_of_residual
Note that this is defined point-wise. For a non-weighted linear regression (as in your example), the residual variance is equal everywhere (known as homoscedasticity), and it is z$residual.scale ^ 2. Thus the standard error for PI is
se.PI <- sqrt(z$se.fit ^ 2 + z$residual.scale ^ 2)
# 1 2
#9.022228 9.058082
and PI is constructed as
PI <- z$fit + outer(se.PI, Qt)
colnames(PI) <- c("lwr", "upr")
PI
# lwr upr
#1 74.46433 104.7983
#2 89.43930 119.8939
We see that this agrees with predict.lm(, interval = "prediction").
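The two constructions above differ only in the standard error that is multiplied by the t quantile, so they can be folded into one small helper. The sketch below is an illustration, not part of predict.lm: the name interval_from_se and its arguments are made up for this example, it assumes an unweighted fit, and it uses the built-in cars data so that it runs on its own.

```r
## Hypothetical helper (not part of base R): build a CI or PI from the
## middle-stage result of predict.lm(..., se.fit = TRUE).
## Assumes an unweighted fit, so the residual variance is residual.scale^2.
interval_from_se <- function(z, level = 0.95,
                             type = c("confidence", "prediction")) {
  type <- match.arg(type)
  ## t quantiles for a two-sided interval with coverage `level`
  Qt <- c(-1, 1) * qt((1 - level) / 2, z$df, lower.tail = FALSE)
  ## the PI adds the residual variance to the variance of the predicted mean
  se <- if (type == "prediction") {
    sqrt(z$se.fit^2 + z$residual.scale^2)
  } else {
    z$se.fit
  }
  out <- cbind(z$fit, z$fit + outer(se, Qt))
  colnames(out) <- c("fit", "lwr", "upr")
  out
}

## illustrative model on built-in data
fit_cars <- lm(dist ~ speed, data = cars)
new_cars <- data.frame(speed = c(10, 20))
z_cars <- predict(fit_cars, new_cars, se.fit = TRUE)

ci_manual <- interval_from_se(z_cars, level = 0.90, type = "confidence")
pi_manual <- interval_from_se(z_cars, level = 0.90, type = "prediction")
```

Both matrices can be compared against predict(fit_cars, new_cars, interval = ..., level = 0.90) with all.equal().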
Remark
Things are more complicated if you have a weighted linear regression, where the residual variance is not equal everywhere, so z$residual.scale ^ 2 should be weighted. It is easier to construct the PI for fitted values (that is, you don't set newdata when using interval = "prediction" in predict.lm), because the weights are known (you must have provided them via the weights argument when calling lm). For out-of-sample prediction (that is, you pass a newdata to predict.lm), predict.lm expects you to tell it how the residual variance should be weighted. You need to use either the pred.var or the weights argument of predict.lm; otherwise you get a warning from predict.lm complaining that it has insufficient information for constructing the PI. The following is quoted from ?predict.lm:
The prediction intervals are for a single observation at each case
in ‘newdata’ (or by default, the data used for the fit) with error
variance(s) ‘pred.var’. This can be a multiple of ‘res.var’, the
estimated value of sigma^2: the default is to assume that future
observations have the same error variance as those used for
fitting. If ‘weights’ is supplied, the inverse of this is used as
a scale factor. For a weighted fit, if the prediction is for the
original data frame, ‘weights’ defaults to the weights used for
the model fit, with a warning since it might not be the intended
result. If the fit was weighted and ‘newdata’ is given, the
default is to assume constant prediction variance, with a warning.
Note that construction of CI is not affected by the type of regression.
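To make the weighted case from the remark concrete, here is a small sketch. The data and the weights are made up for illustration (cars with random weights; w and w_new are assumptions, not anything from the question). It passes weights to predict.lm for the new cases and reproduces the PI by hand, dividing the residual variance by each new weight as the quoted help text describes.

```r
## Illustrative data only: cars with made-up weights for a weighted fit.
set.seed(1)
w <- runif(nrow(cars), 0.5, 2)        ## assumed weights used in the fit
wfit <- lm(dist ~ speed, data = cars, weights = w)

new_cars <- data.frame(speed = c(10, 20))
w_new <- c(0.8, 1.5)                  ## assumed weights for the new cases

## Built-in PI: `weights` tells predict.lm how to scale the residual variance
pi_builtin <- predict(wfit, new_cars, interval = "prediction",
                      level = 0.90, weights = w_new)

## Manual PI: residual variance at each new case is residual.scale^2 / weight
zw <- predict(wfit, new_cars, se.fit = TRUE)
Qt <- c(-1, 1) * qt(0.05, zw$df, lower.tail = FALSE)   ## 90% two-sided
se_pi <- sqrt(zw$se.fit^2 + zw$residual.scale^2 / w_new)
pi_manual <- zw$fit + outer(se_pi, Qt)
colnames(pi_manual) <- c("lwr", "upr")
```

A new case with a larger weight gets a smaller residual variance and therefore a narrower PI, all else being equal.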
Do everything from scratch
Basically we want to know how to obtain fit, se.fit, df and residual.scale in z.
The predicted mean can be computed by a matrix-vector multiplication Xp %*% b, where Xp is the linear predictor matrix and b is the regression coefficient vector.
Xp <- model.matrix(delete.response(terms(lmObject)), newdat)
b <- coef(lmObject)
yh <- c(Xp %*% b) ## c() reshape the single-column matrix to a vector
#[1] 89.63133 104.66658
And we see that this agrees with z$fit. The variance-covariance matrix for yh is Xp %*% V %*% t(Xp), where V is the variance-covariance matrix of b, which can be computed by
V <- vcov(lmObject) ## use `vcov` function in R
# (Intercept) V2
# (Intercept) 7.862086 -1.1927966
# V2 -1.192797 0.2333733
The full variance-covariance matrix of yh is not needed to compute point-wise CI or PI. We only need its main diagonal. So instead of doing diag(Xp %*% V %*% t(Xp)), we can do it more efficiently via
var.fit <- rowSums((Xp %*% V) * Xp) ## point-wise variance for predicted mean
# 1 2
#1.949963 2.598222
sqrt(var.fit) ## this agrees with `z$se.fit`
# 1 2
#1.396411 1.611900
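The rowSums shortcut works because the i-th diagonal entry of Xp %*% V %*% t(Xp) is the quadratic form x_i' V x_i, which is the sum over j of (Xp %*% V)[i, j] * Xp[i, j]. A quick check on the built-in cars data (the object names here are chosen for this example only):

```r
fit_cars <- lm(dist ~ speed, data = cars)
Xp_cars <- model.matrix(delete.response(terms(fit_cars)),
                        data.frame(speed = c(5, 10, 15, 20)))
V_cars <- vcov(fit_cars)

## full 4 x 4 covariance matrix, of which only the diagonal is kept
var_full <- diag(Xp_cars %*% V_cars %*% t(Xp_cars))
## row-wise quadratic forms x_i' V x_i; off-diagonal entries are never formed
var_fast <- rowSums((Xp_cars %*% V_cars) * Xp_cars)
```

With n new rows the first form builds an n x n matrix, while the second only ever holds n x p matrices.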
The residual degrees of freedom are readily available in the fitted model:
dof <- df.residual(lmObject)
#[1] 43
Finally, to compute the residual variance, use the Pearson estimator:
sig2 <- c(crossprod(lmObject$residuals)) / dof
# [1] 79.45063
sqrt(sig2) ## this agrees with `z$residual.scale`
#[1] 8.913508
Remark
Note that in the case of weighted regression, sig2 should be computed as
sig2 <- c(crossprod(sqrt(lmObject$weights) * lmObject$residuals)) / dof
Appendix: a self-written function that mimics predict.lm
The code in "Do everything from scratch" has been cleanly organized into a function lm_predict in this Q & A: linear model with lm: how to get prediction variance of sum of predicted values.
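The linked lm_predict is not reproduced here. As a rough stand-in, the steps of "Do everything from scratch" could be collected along the following lines. This is a sketch under stated assumptions: the name predict_by_hand is hypothetical, and it assumes an unweighted, full-rank fit with numeric predictors and type = "response".

```r
## Hypothetical sketch; NOT the lm_predict function from the linked Q & A.
## Assumes an unweighted, full-rank lm fit with numeric predictors.
predict_by_hand <- function(lmObject, newdata, level = 0.95,
                            interval = c("confidence", "prediction")) {
  interval <- match.arg(interval)
  Xp <- model.matrix(delete.response(terms(lmObject)), newdata)
  yh <- c(Xp %*% coef(lmObject))                     ## predicted mean
  var.fit <- rowSums((Xp %*% vcov(lmObject)) * Xp)   ## point-wise variance
  dof <- df.residual(lmObject)
  sig2 <- c(crossprod(residuals(lmObject))) / dof    ## residual variance
  ## the PI adds the residual variance; the CI does not
  se <- sqrt(if (interval == "prediction") var.fit + sig2 else var.fit)
  Qt <- qt((1 - level) / 2, dof, lower.tail = FALSE)
  cbind(fit = yh, lwr = yh - Qt * se, upr = yh + Qt * se)
}

fit_cars <- lm(dist ~ speed, data = cars)
new_cars <- data.frame(speed = c(10, 20))
by_hand_ci <- predict_by_hand(fit_cars, new_cars, level = 0.90,
                              interval = "confidence")
by_hand_pi <- predict_by_hand(fit_cars, new_cars, level = 0.90,
                              interval = "prediction")
```

Factor predictors would need the factor levels and contrasts of the original fit to be passed to model.matrix, which this sketch leaves out.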
why I can't get a confidence interval using predict function in R
First, the function predict() that you are using is the method predict.glm(). If you look at its help file, it does not even have arguments 'interval' or 'level'. It doesn't flag them as erroneous because predict.glm() has the (in)famous ... argument, that absorbs all 'extra' arguments. You can write confidence=34.2 and interval="woohoo" and it still gives the same answer. It only produces the estimate and the standard error.
Second, one COULD then take the fit +/- 2*se to get an approximate 95 percent confidence interval. However, without getting into the weeds of confidence intervals, pivotal statistics, non-normality in the response scale, etc., this doesn't give very satisfying intervals because, for instance, they often include impossible negative values.
So, I think a better approach is to form an interval in the link scale, then transform it (this is still an approximation, but probably better):
X <- c(1,0,2,0,3,1,0,1,2,0)
Y <- c(16,9,17,12,22,13,8,15,19,11)
model <- glm(Y ~ X, family = "poisson")
tmp <- predict(model, newdata=data.frame(X=4),se.fit=TRUE, type = "link")
exp(tmp$fit - 2*tmp$se.fit)
1
19.02976
exp(tmp$fit + 2*tmp$se.fit)
1
47.97273
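To see what the link-scale construction buys you, the sketch below builds both intervals for the same model and new point, using qnorm(0.975) instead of the rounded 2. The variable names (crit, ci_link, ci_resp) are chosen for this example. The back-transformed interval is always positive and is asymmetric around the fitted mean, while the response-scale interval is symmetric and has no such guarantee.

```r
X <- c(1, 0, 2, 0, 3, 1, 0, 1, 2, 0)
Y <- c(16, 9, 17, 12, 22, 13, 8, 15, 19, 11)
model <- glm(Y ~ X, family = "poisson")
newX <- data.frame(X = 4)

crit <- qnorm(0.975)   ## normal quantile for an approximate 95% interval

## interval built on the link (log) scale, then back-transformed with exp()
lk <- predict(model, newdata = newX, se.fit = TRUE, type = "link")
eta <- unname(lk$fit)
se_eta <- unname(lk$se.fit)
ci_link <- exp(c(lwr = eta - crit * se_eta, upr = eta + crit * se_eta))

## interval built directly on the response scale, for comparison
rs <- predict(model, newdata = newX, se.fit = TRUE, type = "response")
mu <- unname(rs$fit)
se_mu <- unname(rs$se.fit)
ci_resp <- c(lwr = mu - crit * se_mu, upr = mu + crit * se_mu)
```

Printing ci_link and ci_resp side by side shows how far the two approximations differ for this small data set.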
Matching R's confidence interval from predict.lm()
You are confusing the prediction interval standard error (for a new observation) vs the confidence interval standard error (for the mean response).
Specifically, for prediction, the SE has an extra 1 in the expression of the standard error. I show the full calculation below.
See the code below:
set.seed(3)
x <- rnorm(392, 20, 5)
y <- 2*x + 3 + rnorm(392, sd=3)
lm.fit <- lm(y~x)
If this is a new observation you need a 1 in the scaling factor (see the first term of Scaling_factor):
pred <- predict(lm.fit, data.frame(x = 20), interval = "prediction", se.fit = TRUE)
est <- pred$fit[, "fit"] ## point estimate at x = 20
MSE <- sum(lm.fit$residuals^2)/(length(x)-2)
Scaling_factor = (1 + 1/length(x) + ((20 - mean(x))^2) / sum( (x- mean(x))^2 ))
est - qt(.975, length(x)-2)*sqrt(MSE*Scaling_factor)
est + qt(.975, length(x)-2)*sqrt(MSE*Scaling_factor)
If you are predicting the mean response (which is what linear regression usually does) you don't need that term.
pred <- predict(lm.fit, data.frame(x = 20), interval = "confidence", se.fit = TRUE)
est <- pred$fit[, "fit"] ## point estimate at x = 20
MSE <- sum(lm.fit$residuals^2)/(length(x)-2)
Scaling_factor = (1/length(x) + ((20 - mean(x))^2) / sum( (x- mean(x))^2 ))
est - qt(.975, length(x)-2)*sqrt(MSE*Scaling_factor)
est + qt(.975, length(x)-2)*sqrt(MSE*Scaling_factor)
Notice that these now match the lwr and upr values in pred$fit exactly. For a mathematical analysis, take a look at this great writeup:
Linear Regression for the mean response and for an individual response
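Specifically, for the mean response at x_0:
$$\operatorname{se}(\hat{y}_0) = \sqrt{\mathrm{MSE}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i}(x_i - \bar{x})^2}\right)}$$
However, for an individual observation:
$$\operatorname{se}(y_0 - \hat{y}_0) = \sqrt{\mathrm{MSE}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i}(x_i - \bar{x})^2}\right)}$$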
That extra "1" is what gives you the difference. When you predict an individual response there is additional randomness: the MSE estimates how much individual responses vary around the mean response, and that variability has to be added to the uncertainty in the estimated mean itself.
Intuitively, this makes a lot of sense. As your number of observations -> infinity the 1 keeps the standard error from going to 0 (as there's always some variability for any one person).
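The two scaling factors can also be checked against what predict.lm itself reports. In the sketch below (the names n, x0, leverage, se_mean and se_new are chosen for this example), the standard error of the mean response reproduces se.fit, and adding the "1" is the same as adding residual.scale^2 to se.fit^2.

```r
set.seed(3)
x <- rnorm(392, 20, 5)
y <- 2 * x + 3 + rnorm(392, sd = 3)
lm.fit <- lm(y ~ x)

n <- length(x)
x0 <- 20
pred <- predict(lm.fit, data.frame(x = x0), se.fit = TRUE)

MSE <- sum(residuals(lm.fit)^2) / (n - 2)
## scaling factor without the extra 1
leverage <- 1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2)

se_mean <- sqrt(MSE * leverage)         ## SE of the mean response
se_new  <- sqrt(MSE * (1 + leverage))   ## SE for a new observation
```

As n grows, leverage shrinks towards 0, so se_mean goes to 0 while se_new approaches sqrt(MSE).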
Hope that helps!
How to fit confidence intervals using predict function for glmmTMB
You can use the argument se.fit = TRUE to get the standard errors of the predicted values and then use these to calculate the confidence intervals.
https://www.rdocumentation.org/packages/glmmTMB/versions/1.0.2.1/topics/predict.glmmTMB
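As a sketch of how that might look in practice: the example below assumes the glmmTMB package is installed and uses the Salamanders data that ships with it; the model formula is only an illustration. The interval is an approximate Wald interval built on the link scale and back-transformed with exp(), since the Poisson family uses a log link by default.

```r
## Sketch only: requires the glmmTMB package; Salamanders ships with it.
if (requireNamespace("glmmTMB", quietly = TRUE)) {
  data("Salamanders", package = "glmmTMB")
  gfit <- glmmTMB::glmmTMB(count ~ mined + (1 | site),
                           family = poisson, data = Salamanders)
  ## predictions and their standard errors on the link (log) scale
  gp <- predict(gfit, type = "link", se.fit = TRUE)
  crit <- qnorm(0.975)   ## approximate 95% coverage
  ci <- data.frame(fit = exp(gp$fit),
                   lwr = exp(gp$fit - crit * gp$se.fit),
                   upr = exp(gp$fit + crit * gp$se.fit))
}
```

If the package is not available the block does nothing, so ci only exists after a successful fit.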
Confidence intervals for predictions from logistic regression
The usual way is to compute a confidence interval on the scale of the linear predictor, where things will be more normal (Gaussian) and then apply the inverse of the link function to map the confidence interval from the linear predictor scale to the response scale.
To do this you need two things:
- call predict() with type = "link", and
- call predict() with se.fit = TRUE.
The first produces predictions on the scale of the linear predictor, the second returns the standard errors of the predictions. In code, with mtcars as working example data:
foo <- mtcars[, c("mpg", "vs")]; names(foo) <- c("x", "y") ## working example data
mod <- glm(y ~ x, data = foo, family = binomial)
preddata <- with(foo, data.frame(x = seq(min(x), max(x), length = 100)))
preds <- predict(mod, newdata = preddata, type = "link", se.fit = TRUE)
preds is then a list with components fit and se.fit.
The confidence interval on the linear predictor is then
critval <- 1.96 ## approx 95% CI
upr <- preds$fit + (critval * preds$se.fit)
lwr <- preds$fit - (critval * preds$se.fit)
fit <- preds$fit
critval is chosen from a t or z (normal) distribution as required (I forget exactly now which to use for which type of GLM and what the properties are) with the coverage required. The 1.96 is the value of the Gaussian distribution giving 95% coverage:
> qnorm(0.975) ## 0.975 as this is upper tail, 2.5% also in lower tail
[1] 1.959964
Now for fit, upr and lwr we need to apply the inverse of the link function to them.
fit2 <- mod$family$linkinv(fit)
upr2 <- mod$family$linkinv(upr)
lwr2 <- mod$family$linkinv(lwr)
Now you can plot all three and the data.
preddata$lwr <- lwr2
preddata$upr <- upr2
library(ggplot2)
ggplot(data=foo, mapping=aes(x=x,y=y)) + geom_point() +
stat_smooth(method="glm", method.args=list(family=binomial)) +
geom_line(data=preddata, mapping=aes(x=x, y=upr), col="red") +
geom_line(data=preddata, mapping=aes(x=x, y=lwr), col="red")
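The steps above can be wrapped into one function. The sketch below is an illustration only: glm_ci is a hypothetical name, not a base R function, and it assumes a monotonically increasing link (such as logit or log) so that lwr stays below upr after back-transformation.

```r
## Hypothetical helper (name and arguments are illustrative, not a base R API).
## Assumes an increasing link function, e.g. logit or log.
glm_ci <- function(mod, newdata, level = 0.95) {
  p <- predict(mod, newdata = newdata, type = "link", se.fit = TRUE)
  crit <- qnorm(1 - (1 - level) / 2)   ## normal (Wald) critical value
  linkinv <- family(mod)$linkinv       ## inverse link of the fitted family
  data.frame(fit = linkinv(p$fit),
             lwr = linkinv(p$fit - crit * p$se.fit),
             upr = linkinv(p$fit + crit * p$se.fit))
}

## same working example data as above
foo <- setNames(mtcars[, c("mpg", "vs")], c("x", "y"))
mod <- glm(y ~ x, data = foo, family = binomial)
new_x <- data.frame(x = c(15, 20, 25))
ci <- glm_ci(mod, new_x)
```

Because the limits are mapped through the inverse logit, they always stay inside [0, 1].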
Confidence interval of prediction from glm model
predict.glm() doesn't take the same arguments as predict.lm (see ?predict.glm): you have to do this by hand (or find a package with helper functions). The following code constructs the lower and upper 95% Wald confidence limits on the logit (log-odds) scale and then uses plogis() to back-transform to the probability scale ...
pp <- predict(glm.fit, se.fit = TRUE)
ci_lwr <- with(pp, plogis(fit + qnorm(0.025)*se.fit))
ci_upr <- with(pp, plogis(fit + qnorm(0.975)*se.fit))
> head(ci_lwr)
1 2 3 4 5 6
0.4842931 0.4596593 0.4451171 0.4780052 0.4796479 0.4759596
> head(ci_upr)
1 2 3 4 5 6
0.5433766 0.5347319 0.5339426 0.5581846 0.5492351 0.5398233