prediction using linear model and the importance of data.frame
When you call predict
on a lm
object, the function called is predict.lm. When you run it like:
predict(model_1, Sepal.Width=c(1,3,4,5))
What you are doing is providing c(1,3,4,5)
an argument or parameter to Sepal.Width
, which predict.lm
ignores since this argument does not exist for this function.
When there is no new input data, you are running predict.lm(model_1)
, and getting back the fitted values:
table(predict(model_1) == predict(model_1, Sepal.Width=c(1,3,4,5)))
TRUE
150
In this case, you fitted the model with a formula, the predict.lm
function needs your data frame to reconstruct the independent or exogenous matrix, matrix multiply with the coefficients and return you the predicted values.
This is briefly what predict.lm
is doing:
newdata = data.frame(Sepal.Width=c(1,3,4,5))
Terms = delete.response(terms(model_1))
X = model.matrix(Terms,newdata)
X
(Intercept) Sepal.Width
1 1 1
2 1 3
3 1 4
4 1 5
X %*% coefficients(model_1)
[,1]
1 6.302861
2 5.856139
3 5.632778
4 5.409417
predict(model_1,newdata)
1 2 3 4
6.302861 5.856139 5.632778 5.409417
Applying lm() and predict() to multiple columns in a data frame
I had an inclination to close your question as a duplicate to Fitting a linear model with multiple LHS, but sadly the prediction issue is not addressed over there. On the other hand, Prediction of 'mlm' linear model object from lm()
talks about prediction, but is a little bit far off your situation, as you work with formula interface instead of matrix interface.
I did not manage to locate a perfect duplicate target in "mlm" tag. So I think it a good idea to contribute another answer for this tag. As I said in linked questions, predict.mlm
does not support se.fit
, and at the moment, this is also a missing issue in "mlm" tag. So I would take this chance to fill such gap.
Here is a function to get standard error of prediction:
f <- function (mlmObject, newdata) {
## model formula
form <- formula(mlmObject)
## drop response (LHS)
form[[2]] <- NULL
## prediction matrix
X <- model.matrix(form, newdata)
Q <- forwardsolve(t(qr.R(mlmObject$qr)), t(X))
## unscaled prediction standard error
unscaled.se <- sqrt(colSums(Q ^ 2))
## residual standard error
sigma <- sqrt(colSums(residuals(mlmObject) ^ 2) / mlmObject$df.residual)
## scaled prediction standard error
tcrossprod(unscaled.se, sigma)
}
For your given example, you can do
## fit an `mlm`
fit <- lm(cbind(x3, x4, x5) ~ x1 + x2, data = train)
## prediction (mean only)
pred <- predict(fit, newdata = test)
# x3 x4 x5
#1 0.555956679 0.38628159 0.60649819
#2 0.003610108 0.47653430 0.95848375
#3 -0.458483755 0.48014440 1.27256318
#4 -0.379061372 -0.03610108 1.35920578
#5 1.288808664 0.12274368 0.17870036
#6 1.389891697 0.46570397 0.01624549
## prediction error
pred.se <- f(fit, newdata = test)
# [,1] [,2] [,3]
#[1,] 0.1974039 0.3321300 0.2976205
#[2,] 0.3254108 0.5475000 0.4906129
#[3,] 0.5071956 0.8533510 0.7646849
#[4,] 0.6583707 1.1077014 0.9926075
#[5,] 0.5049637 0.8495959 0.7613200
#[6,] 0.3552794 0.5977537 0.5356451
We can verify that f
is correct:
## `lm1`, `lm2` and `lm3` are defined in your question
predict(lm1, test, se.fit = TRUE)$se.fit
# 1 2 3 4 5 6
#0.1974039 0.3254108 0.5071956 0.6583707 0.5049637 0.3552794
predict(lm2, test, se.fit = TRUE)$se.fit
# 1 2 3 4 5 6
#0.3321300 0.5475000 0.8533510 1.1077014 0.8495959 0.5977537
predict(lm3, test, se.fit = TRUE)$se.fit
# 1 2 3 4 5 6
#0.2976205 0.4906129 0.7646849 0.9926075 0.7613200 0.5356451
using lm in list column to predict new values using purrr
You could take advantage of the newdata
argument to predict
.
I use map2_dbl
so it returns just the single value rather than a list.
mutate(Pred = map2_dbl(model, 1:5, ~predict(.x, newdata = data.frame(ind = .y))))
# A tibble: 5 x 4
groups the_data model Pred
<fctr> <list> <list> <dbl>
1 A <tibble [5 x 2]> <S3: lm> -0.4822045
2 B <tibble [5 x 2]> <S3: lm> -0.1357712
3 C <tibble [5 x 2]> <S3: lm> -0.2455760
4 D <tibble [5 x 2]> <S3: lm> 0.4818425
5 E <tibble [5 x 2]> <S3: lm> -0.3473236
If you add ind
to the dataset before prediction you can use that column instead of 1:5
.
mutate(ind = 1:5) %>%
mutate(Pred = map2_dbl(model, ind, ~predict(.x, newdata = data.frame(ind = .y) )))
# A tibble: 5 x 5
groups the_data model ind Pred
<fctr> <list> <list> <int> <dbl>
1 A <tibble [5 x 2]> <S3: lm> 1 -0.4822045
2 B <tibble [5 x 2]> <S3: lm> 2 -0.1357712
3 C <tibble [5 x 2]> <S3: lm> 3 -0.2455760
4 D <tibble [5 x 2]> <S3: lm> 4 0.4818425
5 E <tibble [5 x 2]> <S3: lm> 5 -0.3473236
Using lm and predict on data in matrices
You can store a whole matrix in one column of a data.frame
:
x <- a [, -1]
y <- a [, 1]
data <- data.frame (y = y, x = I (x))
str (data)
## 'data.frame': 10 obs. of 2 variables:
## $ y: num 0.818 0.767 -0.666 0.788 -0.489 ...
## $ x: AsIs [1:10, 1:9] 0.916274.... 0.386565.... 0.703230.... -2.64091.... 0.274617.... ...
model <- lm (y ~ x)
newdata <- data.frame (x = I (b [, -1]))
predict (model, newdata)
## 1 2
## -3.795722 -4.778784
The paper about the pls package, (Mevik, B.-H. and Wehrens, R. The pls Package: Principal Component and Partial Least Squares Regression in R Journal of Statistical Software, 2007, 18, 1 - 24.) explains this technique.
Another example with a spectroscopic data set (quinine fluorescence), is in vignette ("flu")
of my package hyperSpec.
Predict Warning on newdata
apl$grp
is a vector, but predict
requires the newdata
argument to be a data frame.* This data frame must contain columns with the same names as the predictor variables used to fit the model (though it can contain other columns as well). So, the following code should work:
predict(mdl, newdata = apl)
You can use predict
rather than predict.lm
. mdl
is an object of class lm
, which causes predict
to "dispatch" the predict.lm
method automatically.
* Strictly speaking, since this is an lm
model, the predict
"method" that gets dispatched is predict.lm
and that method requires that newdata
be a data frame. predict.glm
also requires a data frame. However, there are some predict
methods that can take other types of arguments. For example:
- The
randomForest
package has a predict method forrandomForest
models that can take a data frame or matrix as thenewdata
argument. - The
glmnet
package has apredict
method forglmnet
models that requires a matrix, although the argument is callednewx
rather thannewdata
in that case.
Related Topics
Plotly - Different Colours for Different Surfaces
R: Reading a Binary File That Is Zipped
How to Merge Two Data Frame Based on Partial String Match with R
Logistic Regression: How to Try Every Combination of Predictors in R
Get Value of Last Non-Na Row Per Column in Data.Table
Calculate Centroid Within/Inside a Spatialpolygon
Error in Install.Packages:Type =="Both" Cannot Be Used with 'Repos =Null'
Changing the Order of Dodged Bars in Ggplot2 Barplot
R: in Barplot Midpoints Are Not Centered W.R.T. Bars
Using Tidy Eval for Multiple Dplyr Filter Conditions
How to Highlight Area Between Two Lines? Ggplot
Add Multiple Curves/Functions to One Ggplot Through Looping
How to Display Line Numbers for Code Chunks in Rmarkdown HTML and PDF
Shiny Error in Match.Arg(Position):'Arg' Must Be Null or a Character Vector