Lm and Predict - Agreement of Data.Frame Names

prediction using linear model and the importance of data.frame

When you call predict on a lm object, the function called is predict.lm. When you run it like:

predict(model_1, Sepal.Width=c(1,3,4,5))

What you are doing is providing c(1,3,4,5) an argument or parameter to Sepal.Width, which predict.lm ignores since this argument does not exist for this function.

When there is no new input data, you are running predict.lm(model_1), and getting back the fitted values:

table(predict(model_1) == predict(model_1, Sepal.Width=c(1,3,4,5)))

TRUE 
 150

In this case, you fitted the model with a formula, the predict.lm function needs your data frame to reconstruct the independent or exogenous matrix, matrix multiply with the coefficients and return you the predicted values.

This is briefly what predict.lm is doing:

newdata = data.frame(Sepal.Width=c(1,3,4,5))
Terms = delete.response(terms(model_1))
X = model.matrix(Terms,newdata)

X
  (Intercept) Sepal.Width
1           1           1
2           1           3
3           1           4
4           1           5

X %*% coefficients(model_1)
      [,1]
1 6.302861
2 5.856139
3 5.632778
4 5.409417

predict(model_1,newdata)

       1        2        3        4 
6.302861 5.856139 5.632778 5.409417

Applying lm() and predict() to multiple columns in a data frame

I had an inclination to close your question as a duplicate to Fitting a linear model with multiple LHS, but sadly the prediction issue is not addressed over there. On the other hand, Prediction of 'mlm' linear model object from lm() talks about prediction, but is a little bit far off your situation, as you work with formula interface instead of matrix interface.

I did not manage to locate a perfect duplicate target in "mlm" tag. So I think it a good idea to contribute another answer for this tag. As I said in linked questions, predict.mlm does not support se.fit, and at the moment, this is also a missing issue in "mlm" tag. So I would take this chance to fill such gap.

Here is a function to get standard error of prediction:

f <- function (mlmObject, newdata) {
  ## model formula
  form <- formula(mlmObject)
  ## drop response (LHS)
  form[[2]] <- NULL
  ## prediction matrix
  X <- model.matrix(form, newdata)
  Q <- forwardsolve(t(qr.R(mlmObject$qr)), t(X))
  ## unscaled prediction standard error
  unscaled.se <- sqrt(colSums(Q ^ 2))
  ## residual standard error
  sigma <- sqrt(colSums(residuals(mlmObject) ^ 2) / mlmObject$df.residual)
  ## scaled prediction standard error
  tcrossprod(unscaled.se, sigma)
  }

For your given example, you can do

## fit an `mlm`
fit <- lm(cbind(x3, x4, x5) ~ x1 + x2, data = train)

## prediction (mean only)
pred <- predict(fit, newdata = test)

#            x3          x4         x5
#1  0.555956679  0.38628159 0.60649819
#2  0.003610108  0.47653430 0.95848375
#3 -0.458483755  0.48014440 1.27256318
#4 -0.379061372 -0.03610108 1.35920578
#5  1.288808664  0.12274368 0.17870036
#6  1.389891697  0.46570397 0.01624549

## prediction error
pred.se <- f(fit, newdata = test)

#          [,1]      [,2]      [,3]
#[1,] 0.1974039 0.3321300 0.2976205
#[2,] 0.3254108 0.5475000 0.4906129
#[3,] 0.5071956 0.8533510 0.7646849
#[4,] 0.6583707 1.1077014 0.9926075
#[5,] 0.5049637 0.8495959 0.7613200
#[6,] 0.3552794 0.5977537 0.5356451

We can verify that f is correct:

## `lm1`, `lm2` and `lm3` are defined in your question
predict(lm1, test, se.fit = TRUE)$se.fit
#        1         2         3         4         5         6 
#0.1974039 0.3254108 0.5071956 0.6583707 0.5049637 0.3552794 

predict(lm2, test, se.fit = TRUE)$se.fit
#        1         2         3         4         5         6 
#0.3321300 0.5475000 0.8533510 1.1077014 0.8495959 0.5977537 

predict(lm3, test, se.fit = TRUE)$se.fit
#        1         2         3         4         5         6 
#0.2976205 0.4906129 0.7646849 0.9926075 0.7613200 0.5356451

using lm in list column to predict new values using purrr

You could take advantage of the newdata argument to predict.

I use map2_dbl so it returns just the single value rather than a list.

mutate(Pred = map2_dbl(model, 1:5, ~predict(.x, newdata = data.frame(ind = .y))))

# A tibble: 5 x 4
  groups         the_data    model       Pred
  <fctr>           <list>   <list>      <dbl>
1      A <tibble [5 x 2]> <S3: lm> -0.4822045
2      B <tibble [5 x 2]> <S3: lm> -0.1357712
3      C <tibble [5 x 2]> <S3: lm> -0.2455760
4      D <tibble [5 x 2]> <S3: lm>  0.4818425
5      E <tibble [5 x 2]> <S3: lm> -0.3473236

If you add ind to the dataset before prediction you can use that column instead of 1:5.

mutate(ind = 1:5) %>%
    mutate(Pred = map2_dbl(model, ind, ~predict(.x, newdata = data.frame(ind = .y) )))

# A tibble: 5 x 5
  groups         the_data    model   ind       Pred
  <fctr>           <list>   <list> <int>      <dbl>
1      A <tibble [5 x 2]> <S3: lm>     1 -0.4822045
2      B <tibble [5 x 2]> <S3: lm>     2 -0.1357712
3      C <tibble [5 x 2]> <S3: lm>     3 -0.2455760
4      D <tibble [5 x 2]> <S3: lm>     4  0.4818425
5      E <tibble [5 x 2]> <S3: lm>     5 -0.3473236

Using lm and predict on data in matrices

You can store a whole matrix in one column of a data.frame:

x <- a [, -1]
y <- a [,  1]
data <- data.frame (y = y, x = I (x))
str (data)
## 'data.frame':    10 obs. of  2 variables:
## $ y: num  0.818 0.767 -0.666 0.788 -0.489 ...
## $ x: AsIs [1:10, 1:9] 0.916274.... 0.386565.... 0.703230.... -2.64091.... 0.274617.... ...

model <- lm (y ~ x)
newdata <- data.frame (x = I (b [, -1]))
predict (model, newdata) 
##         1         2 
## -3.795722 -4.778784

The paper about the pls package, (Mevik, B.-H. and Wehrens, R. The pls Package: Principal Component and Partial Least Squares Regression in R Journal of Statistical Software, 2007, 18, 1 - 24.) explains this technique.

Another example with a spectroscopic data set (quinine fluorescence), is in vignette ("flu") of my package hyperSpec.

Predict Warning on newdata

apl$grp is a vector, but predict requires the newdata argument to be a data frame.^* This data frame must contain columns with the same names as the predictor variables used to fit the model (though it can contain other columns as well). So, the following code should work:

predict(mdl, newdata = apl)

You can use predict rather than predict.lm. mdl is an object of class lm, which causes predict to "dispatch" the predict.lm method automatically.

* Strictly speaking, since this is an lm model, the predict "method" that gets dispatched is predict.lm and that method requires that newdata be a data frame. predict.glm also requires a data frame. However, there are some predict methods that can take other types of arguments. For example:

The randomForest package has a predict method for randomForest models that can take a data frame or matrix as the newdata argument.
The glmnet package has a predict method for glmnet models that requires a matrix, although the argument is called newx rather than newdata in that case.