Feeding newdata to R predict function
See ?predict.lm and the Note section, which I quote below:
Note:
Variables are first looked for in ‘newdata’ and then searched for
in the usual way (which will include the environment of the
formula used in the fit). A warning will be given if the
variables found are not of the same length as those in ‘newdata’
if it was supplied.
Whilst it doesn't state the behaviour in terms of "same name" etc., as far as the formula is concerned the terms you passed in to it were of the form foo$var, and there are no variables with names like that either in newdata or along the search path that R will traverse to look for them.
In your second case, you are totally misusing the model formula notation; the idea is to succinctly and symbolically describe the model. Succinctness and repeating the data object ad nauseam are not compatible.
The behaviour you note is exactly consistent with the documented behaviour. In simple terms, you fitted the model with terms data$x and data$y, then tried to predict for terms x and y. As far as R is concerned those are different names and hence different things, and it did right to not match them.
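A minimal sketch of the name mismatch described above. The data frame dat and its columns are hypothetical, made up only for illustration; the point is to compare the term labels the two fits end up with and how many predictions come back.

```r
# Hypothetical toy data; the names dat, x and y are assumptions for illustration
dat <- data.frame(x = 1:10, y = 2 * (1:10) + 1)

# Fitted with dat$... in the formula: the predictor term is literally named "dat$x"
bad <- lm(dat$y ~ dat$x)
attr(terms(bad), "term.labels")

# Fitted with bare names plus data=: the predictor term is named "x"
good <- lm(y ~ x, data = dat)
attr(terms(good), "term.labels")

new <- data.frame(x = c(11, 12))

# No variable called dat$x exists in new, so R finds dat on the search path
# and predicts for the 10 original rows (with a warning, silenced here)
p_bad <- suppressWarnings(predict(bad, newdata = new))
length(p_bad)

# Here x is found in new, so there is one prediction per new row
p_good <- predict(good, newdata = new)
length(p_good)
```

Dropping suppressWarnings() should show the length-mismatch warning quoted in the Note.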
Predict() - Maybe I'm not understanding it
First, you want to use
model <- lm(Total ~ Coupon, data=df)
not
model <- lm(df$Total ~ df$Coupon, data=df)
Second, by saying lm(Total ~ Coupon), you are fitting a model that uses Total as the response variable, with Coupon as the predictor. That is, your model is of the form Total = a + b*Coupon, with a and b the coefficients to be estimated. Note that the response goes on the left side of the ~, and the predictor(s) on the right.
Because of this, when you ask R to give you predicted values for the model, you have to provide a set of new predictor values, i.e. new values of Coupon, not Total.
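To make the Total = a + b*Coupon form concrete, here is a small sketch with made-up numbers. Only the column names Total and Coupon come from the question; the values are assumptions. It checks that predict() on new Coupon values gives the same result as plugging them into the fitted line by hand.

```r
# Hypothetical data; only the column names come from the question
df <- data.frame(Coupon = c(1, 2, 3, 4, 5),
                 Total  = c(12, 19, 31, 42, 48))

model <- lm(Total ~ Coupon, data = df)

# Estimated intercept (a) and slope (b)
a <- coef(model)[["(Intercept)"]]
b <- coef(model)[["Coupon"]]

# newdata must carry the predictor (Coupon), not the response (Total)
new.df <- data.frame(Coupon = c(6, 7))

by_hand    <- a + b * new.df$Coupon
by_predict <- unname(predict(model, new.df))

all.equal(by_hand, by_predict)
```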
Third, judging by your specification of newdata, it looks like you're actually after a model to fit Coupon as a function of Total, not the other way around. To do this:
model <- lm(Coupon ~ Total, data=df)
new.df <- data.frame(Total=c(79037022, 83100656, 104299800))
predict(model, new.df)
prediction using linear model and the importance of data.frame
When you call predict on an lm object, the function called is predict.lm. When you run it like:
predict(model_1, Sepal.Width=c(1,3,4,5))
What you are doing is passing c(1,3,4,5) as an argument named Sepal.Width, which predict.lm ignores since it has no argument of that name.
When there is no new input data, you are effectively running predict.lm(model_1) and getting back the fitted values:
table(predict(model_1) == predict(model_1, Sepal.Width=c(1,3,4,5)))
TRUE
150
Because you fitted the model with a formula, predict.lm needs a data frame so that it can rebuild the design matrix (the matrix of predictors), multiply it by the coefficients, and return the predicted values.
Briefly, this is what predict.lm is doing:
newdata = data.frame(Sepal.Width=c(1,3,4,5))
Terms = delete.response(terms(model_1))
X = model.matrix(Terms,newdata)
X
  (Intercept) Sepal.Width
1           1           1
2           1           3
3           1           4
4           1           5
X %*% coefficients(model_1)
      [,1]
1 6.302861
2 5.856139
3 5.632778
4 5.409417
predict(model_1,newdata)
       1        2        3        4
6.302861 5.856139 5.632778 5.409417
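The whole comparison can be run end to end as below. Note that the fitting call for model_1 is not shown in the answer, so the lm() line is an assumption (a regression of Sepal.Length on Sepal.Width in the built-in iris data), chosen because it is consistent with the printed output.

```r
# Assumption: model_1 was fitted like this (the original call is not shown)
model_1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# Sepal.Width here is an unknown argument swallowed by `...`,
# so this returns the fitted values for the training rows
ignored <- predict(model_1, Sepal.Width = c(1, 3, 4, 5))
length(ignored)

# Passing a data frame through newdata gives one prediction per new row
newdata <- data.frame(Sepal.Width = c(1, 3, 4, 5))
wanted <- predict(model_1, newdata = newdata)
length(wanted)

# Same numbers as the design-matrix product
X <- model.matrix(delete.response(terms(model_1)), newdata)
manual <- drop(X %*% coefficients(model_1))
all.equal(unname(wanted), unname(manual))
```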
Predict Warning on newdata
apl$grp is a vector, but predict requires the newdata argument to be a data frame.* This data frame must contain columns with the same names as the predictor variables used to fit the model (though it can contain other columns as well). So, the following code should work:
predict(mdl, newdata = apl)
You can use predict rather than predict.lm. mdl is an object of class lm, which causes predict to "dispatch" the predict.lm method automatically.
* Strictly speaking, since this is an lm model, the predict "method" that gets dispatched is predict.lm, and that method requires that newdata be a data frame. predict.glm also requires a data frame. However, there are some predict methods that can take other types of arguments. For example:
- The randomForest package has a predict method for randomForest models that can take a data frame or matrix as the newdata argument.
- The glmnet package has a predict method for glmnet models that requires a matrix, although the argument is called newx rather than newdata in that case.
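If all you have is a bare vector of new predictor values, one way to satisfy predict.lm is to wrap it in a data frame under the predictor's name. The sketch below uses made-up data: the names apl, grp and mdl follow the question, while the response column out and all values are assumptions.

```r
# Hypothetical stand-in for the question's data; `out` and the values are assumptions
apl <- data.frame(grp = c(1, 2, 3, 4, 5, 6),
                  out = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2))
mdl <- lm(out ~ grp, data = apl)

# A plain vector of new predictor values
new_vec <- c(7, 8)

# Wrap it in a data frame whose column name matches the predictor in the formula
new_df <- data.frame(grp = new_vec)

pred <- predict(mdl, newdata = new_df)
length(pred)

# The generic dispatches to predict.lm, so both calls agree
identical(pred, predict.lm(mdl, newdata = new_df))
```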
predict.lm gives wrong number of predicted values when I fit and predict a model using a matrix variable
I can't fix your tidyverse code because I don't work with this package. But I am able to explain why predict fails in the first case.
Let me just use the built-in dataset trees for a demonstration:
head(trees, 2)
# Girth Height Volume
#1 8.3 70 10.3
#2 8.6 65 10.3
The normal way to use lm is
fit <- lm(Girth ~ ., trees)
The variable names (on the RHS of ~) are
attr(terms(fit), "term.labels")
#[1] "Height" "Volume"
You need to provide these variables in the newdata when using predict.
predict(fit, newdata = data.frame(Height = 1, Volume = 2))
# 1
#11.16125
Now if you fit a model using a matrix:
X <- as.matrix(trees[2:3])
y <- trees[[1]]
fit2 <- lm(y ~ X)
attr(terms(fit2), "term.labels")
#[1] "X"
The variable you need to provide in newdata for predict is now X, not Height or Volume. Note that since X is a matrix variable, you need to protect it with I() when feeding it to a data frame.
newdat <- data.frame(X = I(cbind(1, 2)))
str(newdat)
#'data.frame': 1 obs. of 1 variable:
# $ X: AsIs [1, 1:2] 1 2
predict(fit2, newdat)
# 1
#11.16125
It does not matter that cbind(1, 2) has no column names. What is important is that this matrix is named X in newdat.
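The same idea extends to several new observations: each row of the matrix stored under the name X becomes one prediction. The sketch below reuses trees; the new Height and Volume values are arbitrary numbers picked for illustration. It also checks the result against the ordinary formula fit.

```r
X <- as.matrix(trees[2:3])   # columns: Height, Volume
y <- trees[[1]]              # Girth
fit2 <- lm(y ~ X)

# Three new observations (arbitrary illustrative values), one per row,
# with columns in the same order as X: Height, then Volume
new_X <- rbind(c(70, 10), c(80, 30), c(75, 20))

# The matrix must enter the data frame as a single column named X
newdat <- data.frame(X = I(new_X))
pred <- predict(fit2, newdat)
length(pred)

# Cross-check against the formula-based fit
fit <- lm(Girth ~ ., trees)
ref <- predict(fit, data.frame(Height = new_X[, 1], Volume = new_X[, 2]))
all.equal(unname(pred), unname(ref))
```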
predict.lm after regression with missing data in Y
There is built-in (though not necessarily obvious) functionality for this in R: the na.action argument together with the na.exclude function (see ?na.exclude). With na.action = na.exclude set, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.
Set up data:
df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA
Fit model: default na.action is na.omit, which simply removes non-complete cases.
mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000
na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:
mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
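As a sketch of the "similar downstream processing functions" mentioned above, residuals() and fitted() are padded in the same way, so they line up with the rows of the original data. The last lines show that if you want a value for the row with missing y, passing the data explicitly as newdata should return one, since only x is needed for prediction. The object names mod_omit and mod_excl are mine.

```r
df <- data.frame(x = 1:10, y = 100 * (1:10))
df$y[5] <- NA

mod_omit <- lm(y ~ x, data = df)                          # default: na.omit
mod_excl <- lm(y ~ x, data = df, na.action = na.exclude)

# With na.omit the incomplete row is dropped from the results
length(residuals(mod_omit))

# With na.exclude the results are padded back to the original number of rows
length(residuals(mod_excl))
fitted_all <- fitted(mod_excl)
is.na(fitted_all[5])

# Supplying newdata predicts from x alone, so row 5 gets a value even though y is NA
pred_all <- predict(mod_excl, newdata = df)
pred_all[5]
```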
R linear model (lm) predict function with one single array
The problem seems to be the use of newdata = as.data.frame.list(feat_vec). As discussed in your previous question, this returns ugly column names, whereas predict requires newdata to have column names that match the covariate names in your model formula. You should get a warning message when you call predict.
## example data
set.seed(0)
x1 <- runif(20)
x2 <- rnorm(20)
y <- 0.3 * x1 + 0.7 * x2 + rnorm(20, sd = 0.1)
## linear model
model <- lm(y ~ x1 + x2)
## new data
feat_vec <- c(0.4, 0.6)
newdat <- as.data.frame.list(feat_vec)
# X0.4 X0.6
#1 0.4 0.6
## prediction
y_hat <- predict.lm(model, newdata = newdat)
#Warning message:
#'newdata' had 1 row but variables found have 20 rows
What you need is
newdat <- as.data.frame.list(feat_vec,
col.names = attr(model$terms, "term.labels"))
# x1 x2
#1 0.4 0.6
y_hat <- predict.lm(model, newdata = newdat)
# 1
#0.5192413
This is the same as what you can compute manually:
coef = model$coefficients
unname(coef[1] + sum(coef[-1] * feat_vec))
#[1] 0.5192413
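An alternative sketch for building the one-row newdata, assuming the same simulated model as above: transpose the vector into a 1 x 2 matrix, convert it to a data frame, and set the names with setNames(). This is just another way to get matching column names, not something the original answer prescribes.

```r
## same example data and model as above
set.seed(0)
x1 <- runif(20)
x2 <- rnorm(20)
y <- 0.3 * x1 + 0.7 * x2 + rnorm(20, sd = 0.1)
model <- lm(y ~ x1 + x2)

feat_vec <- c(0.4, 0.6)

## one row, columns renamed to match the covariates in the formula
newdat <- setNames(as.data.frame(t(feat_vec)),
                   attr(terms(model), "term.labels"))
names(newdat)

y_hat <- predict(model, newdata = newdat)

## agrees with the manual computation from the coefficients
manual <- unname(coef(model)[1] + sum(coef(model)[-1] * feat_vec))
all.equal(unname(y_hat), manual)
```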