predict.lm with newdata

Feeding newdata to R predict function

See ?predict.lm and the Note section, which I quote below:

Note:

Variables are first looked for in ‘newdata’ and then searched for
in the usual way (which will include the environment of the
formula used in the fit). A warning will be given if the
variables found are not of the same length as those in ‘newdata’
if it was supplied.

Whilst it doesn't state the behaviour in terms of "same name" etc, as far as the formula is concerned the terms you passed in to it were of the form foo$var and there are no such variables with names like that either in newdata or along the search path that R will traverse to look for them.

In your second case, you are totally misusing the model formula notation; the idea is to succinctly and symbolically describe the model. Succinctness and repeating the data object ad nauseam are not compatible.

The behaviour you note is exactly consistent with the documented behaviour. In simple terms, you fitted the model with terms data$x and data$y then tried to predict for terms x and y. As far as R is concerned those are different names and hence different things and it did right to not match them.
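A small sketch of the difference, using made-up data (the names dat, new, bad and good are illustrative and do not come from the question). The model fitted with dat$x terms cannot match anything in newdata, so it falls back to the original 20 rows and warns; the model fitted with data = dat predicts for the new rows:

```r
# Synthetic data, for illustration only
set.seed(1)
dat <- data.frame(x = runif(20))
dat$y <- 2 * dat$x + rnorm(20, sd = 0.1)
new <- data.frame(x = c(0.1, 0.5))

# Terms are literally `dat$y` and `dat$x`; no such names exist in `new`
bad <- lm(dat$y ~ dat$x)

# Terms are `y` and `x`: looked up in `data` when fitting, in `newdata` when predicting
good <- lm(y ~ x, data = dat)

# `dat$x` is resolved through the formula environment, so all 20 original
# rows are used (predict() warns about the row mismatch; suppressed here)
n_bad <- length(suppressWarnings(predict(bad, newdata = new)))
n_good <- length(predict(good, newdata = new))
c(n_bad = n_bad, n_good = n_good)
```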

Predict() - Maybe I'm not understanding it

First, you want to use

model <- lm(Total ~ Coupon, data=df)

not model <-lm(df$Total ~ df$Coupon, data=df).

Second, by saying lm(Total ~ Coupon), you are fitting a model that uses Total as the response variable, with Coupon as the predictor. That is, your model is of the form Total = a + b*Coupon, with a and b the coefficients to be estimated. Note that the response goes on the left side of the ~, and the predictor(s) on the right.

Because of this, when you ask R to give you predicted values for the model, you have to provide a set of new predictor values, i.e. new values of Coupon, not Total.

Third, judging by your specification of newdata, it looks like you're actually after a model to fit Coupon as a function of Total, not the other way around. To do this:

model <- lm(Coupon ~ Total, data=df)
new.df <- data.frame(Total=c(79037022, 83100656, 104299800))
predict(model, new.df)
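The original df is not shown, so here is a self-contained sketch with a synthetic stand-in (the values of Total and Coupon, and the names model_rev and wrong, are invented for illustration). It shows that newdata must carry the predictor of whichever model you fitted:

```r
# Synthetic stand-in for the question's data frame (values are made up)
set.seed(42)
df <- data.frame(Total = runif(30, 7e7, 1.1e8))
df$Coupon <- 5 + 1e-7 * df$Total + rnorm(30, sd = 0.5)

# Coupon as a function of Total: newdata needs a column named Total
model <- lm(Coupon ~ Total, data = df)
new.df <- data.frame(Total = c(79037022, 83100656, 104299800))
pred <- predict(model, new.df)

# Total as a function of Coupon: newdata needs a column named Coupon
model_rev <- lm(Total ~ Coupon, data = df)
pred_rev <- predict(model_rev, data.frame(Coupon = c(12, 14)))

# Supplying only Total to the reversed model fails, because Coupon
# cannot be found in newdata or in the formula's environment
wrong <- tryCatch(predict(model_rev, new.df), error = function(e) e)
inherits(wrong, "error")
```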

prediction using linear model and the importance of data.frame

When you call predict on a lm object, the function called is predict.lm. When you run it like:

predict(model_1, Sepal.Width=c(1,3,4,5))

What you are doing is passing c(1,3,4,5) as an argument named Sepal.Width, which predict.lm silently ignores because it has no such argument.

When there is no new input data, you are running predict.lm(model_1), and getting back the fitted values:

table(predict(model_1) == predict(model_1, Sepal.Width=c(1,3,4,5)))

TRUE
150

Because you fitted the model with a formula, predict.lm needs a data frame in order to rebuild the design matrix (the matrix of independent, or exogenous, variables), multiply it by the coefficients and return the predicted values.

In brief, this is what predict.lm does:

newdata = data.frame(Sepal.Width=c(1,3,4,5))
Terms = delete.response(terms(model_1))
X = model.matrix(Terms,newdata)

X
  (Intercept) Sepal.Width
1           1           1
2           1           3
3           1           4
4           1           5

X %*% coefficients(model_1)
      [,1]
1 6.302861
2 5.856139
3 5.632778
4 5.409417

predict(model_1,newdata)

       1        2        3        4
6.302861 5.856139 5.632778 5.409417
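The steps above can be checked end to end. The answer does not show how model_1 was fitted; the sketch below assumes it was lm(Sepal.Length ~ Sepal.Width, data = iris), which is consistent with the 150 fitted values and the Sepal.Width predictor shown. The names same_as_fitted, manual and matches are illustrative:

```r
# Assumption: model_1 was fitted on the built-in iris data like this
model_1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# A stray named argument is swallowed by `...`, so this is just predict(model_1)
same_as_fitted <- identical(predict(model_1),
                            predict(model_1, Sepal.Width = c(1, 3, 4, 5)))

# Rebuild the design matrix for the new data and multiply by the coefficients
newdata <- data.frame(Sepal.Width = c(1, 3, 4, 5))
X <- model.matrix(delete.response(terms(model_1)), newdata)
manual <- drop(X %*% coef(model_1))

# The manual result should agree with predict() given a proper data frame
matches <- isTRUE(all.equal(unname(manual),
                            unname(predict(model_1, newdata))))
c(same_as_fitted = same_as_fitted, matches = matches)
```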

Predict Warning on newdata

apl$grp is a vector, but predict requires the newdata argument to be a data frame.* This data frame must contain columns with the same names as the predictor variables used to fit the model (though it can contain other columns as well). So, the following code should work:

predict(mdl, newdata = apl)

You can use predict rather than predict.lm. mdl is an object of class lm, which causes predict to "dispatch" the predict.lm method automatically.


* Strictly speaking, since this is an lm model, the predict "method" that gets dispatched is predict.lm and that method requires that newdata be a data frame. predict.glm also requires a data frame. However, there are some predict methods that can take other types of arguments. For example:

  • The randomForest package has a predict method for randomForest models that can take a data frame or matrix as the newdata argument.
  • The glmnet package has a predict method for glmnet models that requires a matrix, although the argument is called newx rather than newdata in that case.
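Since apl and mdl are not shown in full, here is a sketch of the same idea on the built-in mtcars data (the model and the names new_wt, p_new and p_extra are stand-ins, not the asker's objects). A bare vector is wrapped in a data frame whose column name matches the predictor, and extra columns in newdata are harmless:

```r
# Stand-in model: one predictor, wt
mdl <- lm(mpg ~ wt, data = mtcars)

# New predictor values held in a plain vector
new_wt <- c(2.5, 3.0)

# Wrap the vector in a data frame, naming the column after the predictor
p_new <- predict(mdl, newdata = data.frame(wt = new_wt))

# A data frame with additional columns works too; only `wt` is used
p_extra <- predict(mdl, newdata = mtcars[1:3, ])

# Predicting on the original rows reproduces the fitted values
isTRUE(all.equal(unname(p_extra), unname(fitted(mdl)[1:3])))
```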

predict.lm gives wrong number of predicted values when I fit and predict a model using a matrix variable

I can't fix your tidyverse code because I don't work with this package. But I am able to explain why predict fails in the first case.

Let me just use the built-in dataset trees for a demonstration:

head(trees, 2)
# Girth Height Volume
#1 8.3 70 10.3
#2 8.6 65 10.3

The normal way to use lm is

fit <- lm(Girth ~ ., trees)

The variable names (on the RHS of ~) are

attr(terms(fit), "term.labels")
#[1] "Height" "Volume"

You need to provide these variables in the newdata when using predict.

predict(fit, newdata = data.frame(Height = 1, Volume = 2))
# 1
#11.16125

Now if you fit a model using a matrix:

X <- as.matrix(trees[2:3])
y <- trees[[1]]
fit2 <- lm(y ~ X)
attr(terms(fit2), "term.labels")
#[1] "X"

The variable you need to provide in newdata for predict is now X, not Height or Volume. Note that since X is a matrix variable, you need to protect it with I() when putting it in a data frame.

newdat <- data.frame(X = I(cbind(1, 2)))
str(newdat)
#'data.frame': 1 obs. of 1 variable:
# $ X: AsIs [1, 1:2] 1 2

predict(fit2, newdat)
# 1
#11.16125

It does not matter that cbind(1, 2) has no column names. What is important is that this matrix is named X in newdat.
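The same approach extends to several new rows. In this sketch the new Height and Volume values are arbitrary, and new_vals, p_matrix, p_frame and agree are illustrative names; the check is that the matrix-based model and the ordinary data-frame model give the same predictions:

```r
# Ordinary fit and matrix-based fit on the built-in trees data
fit <- lm(Girth ~ ., trees)
X <- as.matrix(trees[2:3])
y <- trees[[1]]
fit2 <- lm(y ~ X)

# Two hypothetical new observations: columns are Height, Volume (in that order)
new_vals <- cbind(c(70, 75), c(10, 20))

# The matrix must sit in a single column called X, protected by I()
newdat <- data.frame(X = I(new_vals))
p_matrix <- predict(fit2, newdat)

# Same values supplied by name to the ordinary model
p_frame <- predict(fit, data.frame(Height = new_vals[, 1],
                                   Volume = new_vals[, 2]))

agree <- isTRUE(all.equal(unname(p_matrix), unname(p_frame)))
agree
```

The column order of the new matrix has to match the column order of the matrix used in the fit, because the columns are matched by position rather than by name.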

predict.lm after regression with missing data in Y

R has built-in functionality for this, although it is not necessarily obvious: the na.action argument and the na.exclude function (see ?na.exclude). With na.action = na.exclude, predict() (and similar downstream functions such as residuals() and fitted()) will automatically fill in NA values in the relevant spots.

Set up data:

df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA

Fit model: default na.action is na.omit, which simply removes non-complete cases.

mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000

na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:

mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
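One practical consequence, sketched below with the same toy data: the na.exclude results line up with the rows of df and can be stored as columns directly. Separately, if you want an actual prediction for the row whose y is missing, passing the data as newdata works, since only x is needed to predict. The names p_all, pred and resid are illustrative:

```r
df <- data.frame(x = 1:10, y = 100 * (1:10))
df$y[5] <- NA

mod1 <- lm(y ~ x, data = df)                  # default na.action: na.omit
mod2 <- update(mod1, na.action = na.exclude)  # pads results back to 10 rows

# With newdata, every row with a usable x gets a prediction,
# including row 5 where only the response is missing
p_all <- predict(mod1, newdata = df)

# na.exclude output has one entry per original row, so it fits in df
df$pred <- predict(mod2)
df$resid <- residuals(mod2)

length(predict(mod1))  # 9: the incomplete case was dropped
nrow(df)               # 10
```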

R linear model (lm) predict function with one single array

The problem seems to be the use of newdata = as.data.frame.list(feat_vec). As discussed in your previous question, this returns ugly column names, whereas predict requires newdata to have column names that match the covariate names in your model formula. You should get a warning message when you call predict.

## example data
set.seed(0)
x1 <- runif(20)
x2 <- rnorm(20)
y <- 0.3 * x1 + 0.7 * x2 + rnorm(20, sd = 0.1)

## linear model
model <- lm(y ~ x1 + x2)

## new data
feat_vec <- c(0.4, 0.6)
newdat <- as.data.frame.list(feat_vec)
# X0.4 X0.6
#1 0.4 0.6

## prediction
y_hat <- predict.lm(model, newdata = newdat)
#Warning message:
#'newdata' had 1 row but variables found have 20 rows

What you need is

newdat <- as.data.frame.list(feat_vec,
col.names = attr(model$terms, "term.labels"))
# x1 x2
#1 0.4 0.6

y_hat <- predict.lm(model, newdata = newdat)
# 1
#0.5192413

This is the same as what you can compute manually:

coef = model$coefficients
unname(coef[1] + sum(coef[-1] * feat_vec))
#[1] 0.5192413
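If you do this often, the renaming can be wrapped in a small helper. The function vec_to_newdata below is hypothetical (not part of R or of the original answer), and it assumes the vector's elements are in the same order as the predictors appear in the formula:

```r
## same example data and model as above
set.seed(0)
x1 <- runif(20)
x2 <- rnorm(20)
y <- 0.3 * x1 + 0.7 * x2 + rnorm(20, sd = 0.1)
model <- lm(y ~ x1 + x2)

## hypothetical helper: turn a plain vector into a one-row newdata frame
vec_to_newdata <- function(model, vec) {
  # predictor variable names, taken from the formula without the response
  vars <- all.vars(delete.response(terms(model)))
  stopifnot(length(vec) == length(vars))
  as.data.frame(as.list(setNames(vec, vars)))
}

feat_vec <- c(0.4, 0.6)
newdat <- vec_to_newdata(model, feat_vec)
y_hat <- predict(model, newdata = newdat)

## agrees with the manual computation
manual <- unname(coef(model)[1] + sum(coef(model)[-1] * feat_vec))
isTRUE(all.equal(unname(y_hat), manual))
```

Using all.vars rather than term.labels is a design choice for this sketch: it returns the underlying variable names even when the formula contains transformed terms, though for a plain formula like y ~ x1 + x2 the two give the same result.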

