Add Column of Predicted Values to Data Frame with dplyr

With modelr, there is an elegant tidyverse solution.

The inputs

library(dplyr)
library(purrr)
library(tidyr)

# generate the inputs like in the question
example_table <- data.frame(x = c(1:5, 1:5),
                            y = c((1:5) + rnorm(5), 2*(5:1)),
                            groups = rep(LETTERS[1:2], each = 5))

models <- example_table %>% 
  group_by(groups) %>% 
  do(model = lm(y ~ x, data = .)) %>%
  ungroup()
# as_tibble() replaces the deprecated tbl_df()
example_table <- left_join(as_tibble(example_table), models, by = "groups")

The solution

# generate the extra column
example_table %>%
  group_by(groups) %>%
  do(modelr::add_predictions(., first(.$model)))

The explanation

add_predictions adds a new column to a data frame using a given model. Unfortunately, it only takes one model as an argument. Meet do. Using do, we can run add_predictions individually over each group.

Inside do, . represents the data frame for the current group, .$model is its model column, and first() takes the first model of each group.
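To make the per-group logic concrete, here is a minimal base R sketch of roughly what the do() call amounts to: split by group, fit, predict, and bind the pieces back together. It assumes the same simulated example_table as above (with a seed added for reproducibility); the names pieces and with_pred are my own, and it uses predict() rather than modelr.

```r
set.seed(1)
example_table <- data.frame(x = c(1:5, 1:5),
                            y = c((1:5) + rnorm(5), 2 * (5:1)),
                            groups = rep(LETTERS[1:2], each = 5))

# Fit one model per group and predict on that group's own rows
pieces <- lapply(split(example_table, example_table$groups), function(d) {
  fit <- lm(y ~ x, data = d)
  d$pred <- predict(fit, newdata = d)
  d
})

# Bind the per-group pieces back into one data frame
with_pred <- do.call(rbind, pieces)
rownames(with_pred) <- NULL
with_pred
```

Group B is exactly linear in this simulated data, so its predictions should match y up to floating-point error.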

Simplified

With only one model, add_predictions works very well.

# take one of the models (row 6 is the first row of group B)
model <- example_table$model[[6]]

# generate the extra column
example_table %>%
  modelr::add_predictions(model)

Recipes

Nowadays, the tidyverse is shifting from the modelr package to recipes (part of tidymodels), so that might be the new way to go once the package matures.

Add predictions for models by group

There are a couple of additional ways you can attack this.

Probably the most direct, but you lose the intermediate model:

rmod <- df %>%
  group_by(country) %>%
  mutate(fit = lm(value ~ year)$fitted.values) %>%
  ungroup()
rmod
# # A tibble: 22 × 4
#     year country value      fit
#    <dbl>   <chr> <dbl>    <dbl>
# 1   2001  France    55 38.13636
# 2   2002  France    53 39.00000
# 3   2003  France    31 39.86364
# 4   2004  France    10 40.72727
# 5   2005  France    30 41.59091
# 6   2006  France    37 42.45455
# 7   2007  France    54 43.31818
# 8   2008  France    58 44.18182
# 9   2009  France    50 45.04545
# 10  2010  France    40 45.90909
# # ... with 12 more rows

Another way uses a "tidy" model for enclosing data, models, and results into individual cells within the frame:

rmod <- df %>%
  group_by(country) %>%
  nest() %>%
  mutate(mdl = map(data, ~ lm(value ~ year, data=.))) %>%
  mutate(fit = map(mdl, ~ .$fitted.values))
rmod
# # A tibble: 2 × 4
#   country              data      mdl        fit
#     <chr>            <list>   <list>     <list>
# 1  France <tibble [11 × 2]> <S3: lm> <dbl [11]>
# 2     USA <tibble [11 × 2]> <S3: lm> <dbl [11]>

The advantage of this method is that you can access other properties of the model as needed, for example summary( filter(rmod, country == "France")$mdl[[1]] ). (The [[1]] is required because with tibbles, $mdl will always return a list.)
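As an illustration of pulling other model properties out of the list column, here is a small sketch that adds each model's R-squared as a plain numeric column. The df below is a hypothetical stand-in for the question's data (random values, not the numbers shown above), and the column name r2 is my own.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical stand-in for the question's df
set.seed(42)
df <- data.frame(year = rep(2001:2011, 2),
                 country = rep(c("France", "USA"), each = 11),
                 value = round(runif(22, 10, 60)))

rmod <- df %>%
  group_by(country) %>%
  nest() %>%
  mutate(mdl = map(data, ~ lm(value ~ year, data = .x)),
         # map_dbl returns a numeric vector instead of a list
         r2  = map_dbl(mdl, ~ summary(.x)$r.squared))
rmod
```

The same pattern should work for anything that reduces a model to a single number, such as a coefficient or the residual standard error.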

And you can extract/unnest it as follows:

select(rmod, -mdl) %>% unnest()
# # A tibble: 22 × 4
#    country      fit  year value
#      <chr>    <dbl> <dbl> <dbl>
# 1   France 38.13636  2001    55
# 2   France 39.00000  2002    53
# 3   France 39.86364  2003    31
# 4   France 40.72727  2004    10
# 5   France 41.59091  2005    30
# 6   France 42.45455  2006    37
# 7   France 43.31818  2007    54
# 8   France 44.18182  2008    58
# 9   France 45.04545  2009    50
# 10  France 45.90909  2010    40
# # ... with 12 more rows

(The columns are re-ordered, unfortunately, but that's aesthetic and easily remedied.)

EDIT

If you want/need to use modelr-specifics here, try:

rmod <- df %>%
  group_by(country) %>%
  nest() %>%
  mutate(mdl = map(data, ~ lm(value ~ year, data=.))) %>%
  mutate(fit = map(mdl, ~ .$fitted.values)) %>%
  mutate(data = map2(data, mdl, modelr::add_predictions))
rmod
# # A tibble: 2 x 4
#   country data              mdl      fit       
#   <chr>   <list>            <list>   <list>    
# 1 France  <tibble [11 x 3]> <S3: lm> <dbl [11]>
# 2 USA     <tibble [11 x 3]> <S3: lm> <dbl [11]>
select(rmod, -mdl, -fit) %>% unnest()
# # A tibble: 22 x 4
#    country  year value  pred
#    <chr>   <dbl> <dbl> <dbl>
#  1 France  2001.   55.  38.1
#  2 France  2002.   53.  39.0
#  3 France  2003.   31.  39.9
#  4 France  2004.   10.  40.7
#  5 France  2005.   30.  41.6
#  6 France  2006.   37.  42.5
#  7 France  2007.   54.  43.3
#  8 France  2008.   58.  44.2
#  9 France  2009.   50.  45.0
# 10 France  2010.   40.  45.9
# # ... with 12 more rows

In R, how to add the fitted value column to the original dataframe?

Suppose:

fm <- lm(demand ~ Time, BOD)

Then try this:

cbind(BOD, resid = resid(fm), fitted = fitted(fm))

or this:

BOD$resid <- resid(fm)
BOD$fitted <- fitted(fm)

ADDED:

If you have NA values in demand, then your fitted values and residuals will have a different length than the number of rows of your data, so the above will not work. In such a case, use na.exclude like this:

BOD$demand[3] <- NA # set up test data
fm <- lm(demand ~ Time, BOD, na.action = na.exclude)

na.exclude will automatically pad the predictions and residuals with NA values so that they are of the same length as the original data. Now the previous lines should work.
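A quick way to see the difference is to compare the lengths directly. This sketch works on a copy of the built-in BOD data (the name bod is my own) and assumes the default na.action option, which is na.omit.

```r
bod <- BOD                # copy of the built-in data set (6 rows)
bod$demand[3] <- NA       # introduce a missing response

fm_omit    <- lm(demand ~ Time, bod)                          # default: na.omit
fm_exclude <- lm(demand ~ Time, bod, na.action = na.exclude)

length(fitted(fm_omit))     # one fewer than nrow(bod)
length(fitted(fm_exclude))  # same as nrow(bod), padded with NA

# With na.exclude the assignment lines up with the original rows
bod$fitted <- fitted(fm_exclude)
bod$resid  <- resid(fm_exclude)
bod
```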

Predicted values from a series of linear models

One option would be to get the predictions as a column in a data.frame using do. The difference from the other answer is the use of data.frame to get the predictions in a column. You can add in the dv variable to this dataset to keep things straight.

df %>%
    gather(dv, value, y1, y2, -x1, -x2) %>%
    group_by(dv) %>%
    do(mod = lm(value ~ x1 + x2, data = .)) %>%
    do(data.frame(dv = .$dv, pred = predict(.$mod, newdata = df)))

Source: local data frame [200 x 2]
Groups: <by row>

      dv     pred
   (chr)    (dbl)
1     y1 4.936012
2     y1 4.948939
3     y1 4.992472
4     y1 4.733290
5     y1 4.921581
6     y1 5.115699
7     y1 4.981135
8     y1 4.837326
9     y1 4.641484
10    y1 4.739197
..   ...      ...

The downside of that (to me) is that you don't have the data used for the predictions alongside the actual predicted values. You could certainly cbind to the prediction dataset, but another useful option is to use augment from package broom within do. In this second alternative I use augment within the first call to do, although it's not required.

You can supply the dataset you want to predict with (and add the predictions to) using the newdata argument within augment. In this example I used the dataset df2 (just the explanatory variable columns, x1 and x2, of your df dataset).

library(broom)
df2 = df[ , 3:4] # Dataset for predictions
df %>%
    gather(dv, value, y1, y2, -x1, -x2) %>%
    group_by(dv) %>%
    do(augment(lm(value ~ x1 + x2, data = .), newdata = df2))

Source: local data frame [200 x 5]
Groups: dv [2]

      dv        x1       x2  .fitted   .se.fit
   (chr)     (dbl)    (dbl)    (dbl)     (dbl)
1     y1  5.863764 6.201406 4.936012 0.1521102
2     y1  4.419014 7.028888 4.948939 0.1936563
3     y1  7.917369 6.081930 4.992472 0.1255001
4     y1  4.338864 4.019565 4.733290 0.1842635
5     y1 13.307611 2.674705 4.921581 0.1757911
6     y1 14.986879 4.666154 5.115699 0.1614377
7     y1 12.941636 3.679022 4.981135 0.1409247
8     y1  7.474526 4.088868 4.837326 0.1310659
9     y1  2.136858 3.706184 4.641484 0.2357699
10    y1  9.307190 1.885127 4.739197 0.2008851
..   ...       ...      ...      ...       ...
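For comparison, the cbind route mentioned earlier could look roughly like this for a single model in base R. The data here is a hypothetical stand-in for your df (only x1, x2 and y1, with made-up values), and I named the output columns .fitted and .se.fit only to mirror what augment returns.

```r
# Hypothetical stand-in data with two predictors and one response
set.seed(1)
df <- data.frame(x1 = runif(20, 0, 15), x2 = runif(20, 0, 8))
df$y1 <- 5 + 0.02 * df$x1 - 0.01 * df$x2 + rnorm(20, sd = 0.5)

fit <- lm(y1 ~ x1 + x2, data = df)

# Dataset to predict on: just the explanatory variables
newdata <- df[, c("x1", "x2")]

# se.fit = TRUE makes predict() return a list with fit and se.fit
pred <- predict(fit, newdata = newdata, se.fit = TRUE)

# Keep the predictors next to the predictions
out <- cbind(newdata, .fitted = pred$fit, .se.fit = pred$se.fit)
head(out)
```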

How to add a column of fitted values to a data frame by group?

For the lm models you could try

library(nlme)     # lmList to do lm by group
library(ggplot2)  # fortify to get out the fitted/resid data
do.call(rbind, lapply(lmList(y ~ x | g, data=X), fortify))

This gives you the residual and fitted data in ".resid" and ".fitted" columns as well as a bunch of other fit data. By default the rownames will be prefixed with the letters from g.

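With the rqss models that might fail, wrap the fit in tryCatch:

library(quantreg) # provides rqss

do.call(rbind, lapply(split(X, X$g), function(z) {
    # return NULL instead of stopping if the fit fails for this group
    fit <- tryCatch({
        rqss(y ~ x, data=z)
    }, error=function(e) NULL)
    # a failed group contributes an empty data frame
    if (is.null(fit)) data.frame(resid=numeric(0), fitted=numeric(0))
    else data.frame(resid=fit$resid, fitted=fitted(fit))
}))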
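The same skip-on-failure pattern can be tried without quantreg. In this sketch I swap in lm and a made-up data frame X in which group "b" has no usable response values, so its fit errors out and the group simply drops from the result. The g column in the output is my addition so you can see which group each row came from.

```r
# Made-up data: group "b" has only NA responses, so lm() fails there
X <- data.frame(g = rep(c("a", "b"), each = 4),
                x = rep(1:4, times = 2),
                y = c(2, 4, 6, 8, NA, NA, NA, NA))

res <- do.call(rbind, lapply(split(X, X$g), function(z) {
  # NULL signals that the fit failed for this group
  fit <- tryCatch(lm(y ~ x, data = z), error = function(e) NULL)
  if (is.null(fit)) {
    return(data.frame(g = character(0), resid = numeric(0), fitted = numeric(0)))
  }
  data.frame(g = z$g, resid = resid(fit), fitted = fitted(fit))
}))
res
```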

R - Making predictions and confidence intervals with different models for each group of data

The output of predict is a matrix, so convert it to a data frame and then unnest:

library(tidyverse)

mtcars %>% 
  group_by(gear) %>% 
  nest() %>% 
  inner_join(x) %>% 
  mutate(preds = map2(model, data, 
         ~as.data.frame(predict(.x, .y, interval = "confidence")))) %>%
  unnest(cols = c(preds, data))

#   gear   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  carb model    fit   lwr   upr
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <dbl> <dbl> <dbl>
# 1     4  21       6 160     110  3.9   2.62  16.5     0     1     4 <lm>    22.0  19.6  24.4
# 2     4  21       6 160     110  3.9   2.88  17.0     0     1     4 <lm>    21.2  19.2  23.2
# 3     4  22.8     4 108      93  3.85  2.32  18.6     1     1     1 <lm>    25.1  23.0  27.1
# 4     4  24.4     4 147.     62  3.69  3.19  20       1     0     2 <lm>    26.0  21.5  30.6
# 5     4  22.8     4 141.     95  3.92  3.15  22.9     1     0     2 <lm>    22.2  19.9  24.4
# 6     4  19.2     6 168.    123  3.92  3.44  18.3     1     0     4 <lm>    17.8  15.1  20.5
# 7     4  17.8     6 168.    123  3.92  3.44  18.9     1     0     4 <lm>    17.8  15.1  20.5
# 8     4  32.4     4  78.7    66  4.08  2.2   19.5     1     1     1 <lm>    28.7  26.6  30.8
# 9     4  30.4     4  75.7    52  4.93  1.62  18.5     1     1     2 <lm>    32.3  29.3  35.3
#10     4  33.9     4  71.1    65  4.22  1.84  19.9     1     1     1 <lm>    30.0  27.5  32.5
# … with 22 more rows
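To see why the as.data.frame step is needed, here is a single-model sketch in base R. The formula mpg ~ wt is my own choice for illustration and is not the model from the question.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# With interval = "confidence", predict() returns a matrix
ci <- predict(fit, newdata = mtcars, interval = "confidence")
is.matrix(ci)
colnames(ci)   # fit, lwr, upr

# Converting to a data frame gives three ordinary columns to bind on
out <- cbind(mtcars, as.data.frame(ci))
head(out[, c("mpg", "wt", "fit", "lwr", "upr")])
```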

Using lm in list column to predict new values using purrr

You could take advantage of the newdata argument to predict.

I use map2_dbl so it returns just the single value rather than a list.

mutate(Pred = map2_dbl(model, 1:5, ~predict(.x, newdata = data.frame(ind = .y))))

# A tibble: 5 x 4
  groups         the_data    model       Pred
  <fctr>           <list>   <list>      <dbl>
1      A <tibble [5 x 2]> <S3: lm> -0.4822045
2      B <tibble [5 x 2]> <S3: lm> -0.1357712
3      C <tibble [5 x 2]> <S3: lm> -0.2455760
4      D <tibble [5 x 2]> <S3: lm>  0.4818425
5      E <tibble [5 x 2]> <S3: lm> -0.3473236

If you add ind to the dataset before prediction you can use that column instead of 1:5.

mutate(ind = 1:5) %>%
    mutate(Pred = map2_dbl(model, ind, ~predict(.x, newdata = data.frame(ind = .y) )))

# A tibble: 5 x 5
  groups         the_data    model   ind       Pred
  <fctr>           <list>   <list> <int>      <dbl>
1      A <tibble [5 x 2]> <S3: lm>     1 -0.4822045
2      B <tibble [5 x 2]> <S3: lm>     2 -0.1357712
3      C <tibble [5 x 2]> <S3: lm>     3 -0.2455760
4      D <tibble [5 x 2]> <S3: lm>     4  0.4818425
5      E <tibble [5 x 2]> <S3: lm>     5 -0.3473236
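Since the snippets above are only the tail end of a pipeline, here is a self-contained sketch of the whole thing. The data is made up (the names raw and dep are my own, and the values will not match the output above), and it assumes a tidyr version that supports the nest(new_col = c(...)) syntax.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Made-up data: five groups, five observations each
set.seed(1)
raw <- data.frame(groups = rep(LETTERS[1:5], each = 5),
                  ind = rep(1:5, times = 5),
                  dep = rnorm(25))

res <- as_tibble(raw) %>%
  nest(the_data = c(ind, dep)) %>%
  mutate(model = map(the_data, ~ lm(dep ~ ind, data = .x)),
         ind   = 1:5,
         # one prediction per model, at that row's value of ind
         Pred  = map2_dbl(model, ind,
                          ~ predict(.x, newdata = data.frame(ind = .y))))
res
```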

Creating a data frame in R from forecasts to use it for predict

I guess the following code should help you to create a newdata data.frame.

Required libraries to reproduce:

# library(dplyr)
# library(stats)
# library(forecast)

With the following code I will create an example data set with 5 time series:

set.seed(123)

dta <- ts(dplyr::tibble(
  AA = arima.sim(list(order=c(1,0,0), ar=.5), n=100, mean = 12), 
  AB = arima.sim(list(order=c(1,0,0), ar=.5), n=100, mean = 12), 
  AC = arima.sim(list(order=c(1,0,0), ar=.5), n=100, mean = 11), 
  BA = arima.sim(list(order=c(1,0,0), ar=.5), n=100, mean = 10), 
  BB = arima.sim(list(order=c(1,0,0), ar=.5), n=100, mean = 14)
), start = c(2013, 1), frequency = 12)

head(dta)
tail(dta)

Now we will do the batch forecasting and create one new `data.frame`/`matrix`.

nseries <- ncol(dta)
h <- 12 # forecast horizon

newdata <- matrix(nrow = h, ncol = nseries) # empty newdata matrix

for (i in seq_len(nseries)) {
  
  newdata[,i] <- forecast::forecast(forecast::auto.arima(dta[,i]), h = h)$mean
}

colnames(newdata) <- colnames(dta)

head(newdata)
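If you would rather not depend on the forecast package, a rough base R variant is sketched below. Note that it swaps auto.arima for a fixed AR(1) fitted with stats::arima, which is an assumption on my side and not an automatic model search. It also uses only two made-up series and converts the result to a data.frame at the end.

```r
set.seed(123)

# Two made-up monthly AR(1) series shifted to different levels
dta <- ts(cbind(AA = 12 + arima.sim(list(ar = 0.5), n = 100),
                AB = 10 + arima.sim(list(ar = 0.5), n = 100)),
          start = c(2013, 1), frequency = 12)

h <- 12 # forecast horizon

# One column of point forecasts per series
newdata <- sapply(seq_len(ncol(dta)), function(i) {
  fit <- arima(dta[, i], order = c(1, 0, 0))    # fixed AR(1), not auto.arima
  as.numeric(predict(fit, n.ahead = h)$pred)
})
colnames(newdata) <- colnames(dta)

newdata <- as.data.frame(newdata)
head(newdata)
```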

I hope I understood the problem correctly.
