Add Column of Predicted Values to Data Frame with dplyr

With modelr, there is an elegant tidyverse solution.

The inputs

library(dplyr)
library(purrr)
library(tidyr)

# generate the inputs like in the question
example_table <- data.frame(x = c(1:5, 1:5),
                            y = c((1:5) + rnorm(5), 2*(5:1)),
                            groups = rep(LETTERS[1:2], each = 5))

# fit one model per group and store it in a list column
models <- example_table %>%
  group_by(groups) %>%
  do(model = lm(y ~ x, data = .)) %>%
  ungroup()
# as_tibble() replaces the deprecated tbl_df()
example_table <- left_join(as_tibble(example_table), models, by = "groups")

The solution

# generate the extra column
example_table %>%
  group_by(groups) %>%
  do(modelr::add_predictions(., first(.$model)))

The explanation

add_predictions adds a new column to a data frame using a given model. Unfortunately, it takes only one model as an argument. Meet do: with do, we can run add_predictions on each group individually.

Inside do, . represents the data frame of the current group, .$model is its model column, and first() takes the first model of that group.
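In recent dplyr versions do() is marked as superseded. As a minimal sketch of the same per-group idea, assuming a dplyr version that provides group_modify(), the block below refits the model inside each group instead of reusing the stored model column, and uses predict() in place of add_predictions. The column name pred and the set.seed call are choices made for this illustration only.

```r
library(dplyr)

set.seed(1)  # only so the illustration is reproducible
example_table <- data.frame(x = c(1:5, 1:5),
                            y = c((1:5) + rnorm(5), 2 * (5:1)),
                            groups = rep(LETTERS[1:2], each = 5))

with_pred <- example_table %>%
  group_by(groups) %>%
  # .x is the data of the current group (without the grouping column)
  group_modify(~ mutate(.x, pred = predict(lm(y ~ x, data = .x), newdata = .x))) %>%
  ungroup()

with_pred
```

Group B is exactly linear in x here, so its predictions reproduce y, which makes for a quick sanity check.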

Simplified

With only one model, add_predictions works very well.

# take one of the models
model <- example_table$model[[6]]

# generate the extra column
example_table %>%
  modelr::add_predictions(model)

Recipes

Nowadays, the tidyverse is shifting from the modelr package to recipes, so that might be the new way to go once that package matures.

Add predictions for models by group

There are a couple of additional ways you can attack this.

Probably the most direct, but you lose the intermediate model:

rmod <- df %>%
  group_by(country) %>%
  mutate(fit = lm(value ~ year)$fitted.values) %>%
  ungroup()
rmod
# # A tibble: 22 × 4
# year country value fit
# <dbl> <chr> <dbl> <dbl>
# 1 2001 France 55 38.13636
# 2 2002 France 53 39.00000
# 3 2003 France 31 39.86364
# 4 2004 France 10 40.72727
# 5 2005 France 30 41.59091
# 6 2006 France 37 42.45455
# 7 2007 France 54 43.31818
# 8 2008 France 58 44.18182
# 9 2009 France 50 45.04545
# 10 2010 France 40 45.90909
# # ... with 12 more rows

Another way uses a "tidy" model for enclosing data, models, and results into individual cells within the frame:

rmod <- df %>%
  group_by(country) %>%
  nest() %>%
  mutate(mdl = map(data, ~ lm(value ~ year, data = .))) %>%
  mutate(fit = map(mdl, ~ .$fitted.values))
rmod
# # A tibble: 2 × 4
# country data mdl fit
# <chr> <list> <list> <list>
# 1 France <tibble [11 × 2]> <S3: lm> <dbl [11]>
# 2 USA <tibble [11 × 2]> <S3: lm> <dbl [11]>

The advantage of this method is that you can access other properties of the model as needed, for example summary( filter(rmod, country == "France")$mdl[[1]] ). (The [[1]] is required because with tibbles, $mdl will always return a list.)
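As a small self-contained sketch of reaching into the stored models: the df below is hypothetical (random values shaped like the 22-row, two-country data in the output above), and the slope column is a name made up for this illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)
# hypothetical stand-in for the question's data
df <- data.frame(year    = rep(2001:2011, times = 2),
                 country = rep(c("France", "USA"), each = 11),
                 value   = sample(10:60, 22, replace = TRUE))

rmod <- df %>%
  group_by(country) %>%
  nest() %>%
  ungroup() %>%
  mutate(mdl = map(data, ~ lm(value ~ year, data = .x)))

# pull out one model and inspect it like any other lm object
france_mdl <- filter(rmod, country == "France")$mdl[[1]]
coef(france_mdl)

# or extract one number per model into an ordinary column
slopes <- rmod %>%
  mutate(slope = map_dbl(mdl, ~ coef(.x)[["year"]]))
```

Keeping the models in a list column means such summaries can be added later without refitting.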

And you can extract/unnest it as follows:

select(rmod, -mdl) %>% unnest()
# # A tibble: 22 × 4
# country fit year value
# <chr> <dbl> <dbl> <dbl>
# 1 France 38.13636 2001 55
# 2 France 39.00000 2002 53
# 3 France 39.86364 2003 31
# 4 France 40.72727 2004 10
# 5 France 41.59091 2005 30
# 6 France 42.45455 2006 37
# 7 France 43.31818 2007 54
# 8 France 44.18182 2008 58
# 9 France 45.04545 2009 50
# 10 France 45.90909 2010 40
# # ... with 12 more rows

(The columns are re-ordered, unfortunately, but that's aesthetic and easily remedied.)

EDIT

If you want/need to use modelr-specifics here, try:

rmod <- df %>%
  group_by(country) %>%
  nest() %>%
  mutate(mdl = map(data, ~ lm(value ~ year, data = .))) %>%
  mutate(fit = map(mdl, ~ .$fitted.values)) %>%
  # namespace the call so it works without library(modelr)
  mutate(data = map2(data, mdl, modelr::add_predictions))
rmod
# # A tibble: 2 x 4
# country data mdl fit
# <chr> <list> <list> <list>
# 1 France <tibble [11 x 3]> <S3: lm> <dbl [11]>
# 2 USA <tibble [11 x 3]> <S3: lm> <dbl [11]>
select(rmod, -mdl, -fit) %>% unnest()
# # A tibble: 22 x 4
# country year value pred
# <chr> <dbl> <dbl> <dbl>
# 1 France 2001. 55. 38.1
# 2 France 2002. 53. 39.0
# 3 France 2003. 31. 39.9
# 4 France 2004. 10. 40.7
# 5 France 2005. 30. 41.6
# 6 France 2006. 37. 42.5
# 7 France 2007. 54. 43.3
# 8 France 2008. 58. 44.2
# 9 France 2009. 50. 45.0
# 10 France 2010. 40. 45.9
# # ... with 12 more rows

In R, how to add the fitted value column to the original dataframe?

Suppose:

fm <- lm(demand ~ Time, BOD)

Then try this:

cbind(BOD, resid = resid(fm), fitted = fitted(fm))

or this:

BOD$resid <- resid(fm)
BOD$fitted <- fitted(fm)

ADDED:

If you have NA values in demand, then your fitted values and residuals will have a different length than the number of rows of your data, meaning the above will not work. In such a case, use na.exclude like this:

BOD$demand[3] <- NA # set up test data
fm <- lm(demand ~ Time, BOD, na.action = na.exclude)

na.exclude will automatically pad the predictions and residuals with NA values so that they are of the same length as the original data. Now the previous lines should work.
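The length mismatch and the padding can be checked directly. This is a minimal sketch on a copy of BOD (the names bod, fm_omit and fm_excl are made up here), assuming the default na.action option of na.omit:

```r
bod <- BOD               # work on a copy; BOD has 6 rows
bod$demand[3] <- NA      # introduce a missing response

fm_omit <- lm(demand ~ Time, bod)                          # default: row 3 is dropped
fm_excl <- lm(demand ~ Time, bod, na.action = na.exclude)  # row 3 is padded with NA

length(fitted(fm_omit))  # one shorter than nrow(bod)
length(fitted(fm_excl))  # same as nrow(bod)

# with na.exclude the assignment lines up with the original rows
bod$resid  <- resid(fm_excl)
bod$fitted <- fitted(fm_excl)
bod
```

Both fits have identical coefficients; only the way fitted() and resid() report the dropped row differs.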

Predicted values from a series of linear models

One option would be to get the predictions as a column in a data.frame using do. The difference from the other answer is the use of data.frame to get the predictions in a column. You can add in the dv variable to this dataset to keep things straight.

df %>%
  gather(dv, value, y1, y2, -x1, -x2) %>%
  group_by(dv) %>%
  do(mod = lm(value ~ x1 + x2, data = .)) %>%
  do(data.frame(dv = .$dv, pred = predict(.$mod, newdata = df)))

Source: local data frame [200 x 2]
Groups: <by row>

dv pred
(chr) (dbl)
1 y1 4.936012
2 y1 4.948939
3 y1 4.992472
4 y1 4.733290
5 y1 4.921581
6 y1 5.115699
7 y1 4.981135
8 y1 4.837326
9 y1 4.641484
10 y1 4.739197
.. ... ...

The downside of that (to me) is that you don't have the data used for the predictions alongside the actual predicted values. You could certainly cbind to the prediction dataset, but another useful option is to use augment from package broom within do. In this second alternative I use augment within the first call to do, although it's not required.

You can supply the dataset you want to predict with (and add the predictions to) through the newdata argument of augment. In this example I used the dataset df2 (just the explanatory variable columns, x1 and x2, of your df dataset).

library(broom)
df2 = df[ , 3:4] # Dataset for predictions
df %>%
  gather(dv, value, y1, y2, -x1, -x2) %>%
  group_by(dv) %>%
  do( augment(lm(value ~ x1 + x2, data = .), newdata = df2) )

Source: local data frame [200 x 5]
Groups: dv [2]

dv x1 x2 .fitted .se.fit
(chr) (dbl) (dbl) (dbl) (dbl)
1 y1 5.863764 6.201406 4.936012 0.1521102
2 y1 4.419014 7.028888 4.948939 0.1936563
3 y1 7.917369 6.081930 4.992472 0.1255001
4 y1 4.338864 4.019565 4.733290 0.1842635
5 y1 13.307611 2.674705 4.921581 0.1757911
6 y1 14.986879 4.666154 5.115699 0.1614377
7 y1 12.941636 3.679022 4.981135 0.1409247
8 y1 7.474526 4.088868 4.837326 0.1310659
9 y1 2.136858 3.706184 4.641484 0.2357699
10 y1 9.307190 1.885127 4.739197 0.2008851
.. ... ... ... ... ...

How to add a column of fitted values to a data frame by group?

For the lm models you could try

library(nlme)     # lmList to do lm by group
library(ggplot2) # fortify to get out the fitted/resid data
do.call(rbind, lapply(lmList(y ~ x | g, data=X), fortify))

This gives you the residual and fitted data in ".resid" and ".fitted" columns as well as a bunch of other fit data. By default the rownames will be prefixed with the letters from g.

With the rqss models, which might fail for some groups:

library(quantreg) # rqss
do.call(rbind, lapply(split(X, X$g), function(z) {
  # return NULL instead of stopping when a group's fit fails
  fit <- tryCatch({
    rqss(y ~ x, data=z)
  }, error=function(e) NULL)
  if (is.null(fit)) data.frame(resid=numeric(0), fitted=numeric(0))
  else data.frame(resid=fit$resid, fitted=fitted(fit))
}))
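The same split / lapply / rbind pattern also works for plain lm fits in base R and keeps the original columns next to the fitted values. A minimal sketch, with a made-up X that has the columns x, y and g assumed above:

```r
set.seed(1)
# hypothetical data with a grouping column g
X <- data.frame(g = rep(c("a", "b"), each = 5),
                x = rep(1:5, times = 2))
X$y <- 2 * X$x + rnorm(10)

per_group <- lapply(split(X, X$g), function(z) {
  fit <- lm(y ~ x, data = z)
  # bind this group's fitted values and residuals to its own rows
  cbind(z, fitted = fitted(fit), resid = resid(fit))
})

result <- do.call(rbind, per_group)
result
```

As with the lmList version, the row names of the combined result are prefixed with the group labels.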

R - Making predictions and confidence intervals with different models for each group of data

With an interval argument, the output of predict is a matrix. Convert it to a data frame and then unnest:

library(tidyverse)

mtcars %>%
  group_by(gear) %>%
  nest() %>%
  inner_join(x) %>%
  mutate(preds = map2(model, data,
                      ~as.data.frame(predict(.x, .y, interval = "confidence")))) %>%
  unnest(cols = c(preds, data))

# gear mpg cyl disp hp drat wt qsec vs am carb model fit lwr upr
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <dbl> <dbl> <dbl>
# 1 4 21 6 160 110 3.9 2.62 16.5 0 1 4 <lm> 22.0 19.6 24.4
# 2 4 21 6 160 110 3.9 2.88 17.0 0 1 4 <lm> 21.2 19.2 23.2
# 3 4 22.8 4 108 93 3.85 2.32 18.6 1 1 1 <lm> 25.1 23.0 27.1
# 4 4 24.4 4 147. 62 3.69 3.19 20 1 0 2 <lm> 26.0 21.5 30.6
# 5 4 22.8 4 141. 95 3.92 3.15 22.9 1 0 2 <lm> 22.2 19.9 24.4
# 6 4 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 <lm> 17.8 15.1 20.5
# 7 4 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 <lm> 17.8 15.1 20.5
# 8 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 1 <lm> 28.7 26.6 30.8
# 9 4 30.4 4 75.7 52 4.93 1.62 18.5 1 1 2 <lm> 32.3 29.3 35.3
#10 4 33.9 4 71.1 65 4.22 1.84 19.9 1 1 1 <lm> 30.0 27.5 32.5
# … with 22 more rows
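The matrix shape of the predict output can be checked on a single model first. A minimal base R sketch; the formula mpg ~ wt is an arbitrary choice for illustration and not the model from the question:

```r
fit <- lm(mpg ~ wt, data = mtcars)

p <- predict(fit, newdata = mtcars, interval = "confidence")
is.matrix(p)   # a matrix, not a vector
colnames(p)    # fit, lwr, upr

# converting to a data frame gives three ordinary columns to bind or unnest
out <- cbind(mtcars, as.data.frame(p))
head(out)
```

Without the conversion, the whole matrix would end up as a single matrix-column rather than as separate fit, lwr and upr columns.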

Using lm in list column to predict new values using purrr

You could take advantage of the newdata argument to predict.

I use map2_dbl so it returns just the single value rather than a list.

mutate(Pred = map2_dbl(model, 1:5, ~predict(.x, newdata = data.frame(ind = .y))))

# A tibble: 5 x 4
groups the_data model Pred
<fctr> <list> <list> <dbl>
1 A <tibble [5 x 2]> <S3: lm> -0.4822045
2 B <tibble [5 x 2]> <S3: lm> -0.1357712
3 C <tibble [5 x 2]> <S3: lm> -0.2455760
4 D <tibble [5 x 2]> <S3: lm> 0.4818425
5 E <tibble [5 x 2]> <S3: lm> -0.3473236

If you add ind to the dataset before prediction you can use that column instead of 1:5.

mutate(ind = 1:5) %>%
mutate(Pred = map2_dbl(model, ind, ~predict(.x, newdata = data.frame(ind = .y) )))

# A tibble: 5 x 5
groups the_data model ind Pred
<fctr> <list> <list> <int> <dbl>
1 A <tibble [5 x 2]> <S3: lm> 1 -0.4822045
2 B <tibble [5 x 2]> <S3: lm> 2 -0.1357712
3 C <tibble [5 x 2]> <S3: lm> 3 -0.2455760
4 D <tibble [5 x 2]> <S3: lm> 4 0.4818425
5 E <tibble [5 x 2]> <S3: lm> 5 -0.3473236
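For a version that runs on its own, here is a minimal sketch with made-up data. The tibble toy and the response column dep are hypothetical, the column names groups, the_data, model and ind follow the output above, and the nest(the_data = ...) syntax assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
# hypothetical data: 5 groups with 5 observations each
toy <- tibble(groups = rep(LETTERS[1:5], each = 5),
              ind    = rep(1:5, times = 5),
              dep    = rnorm(25))

res <- toy %>%
  nest(the_data = c(ind, dep)) %>%
  mutate(model = map(the_data, ~ lm(dep ~ ind, data = .x)),
         # one new value of ind per model, one predicted number back
         Pred  = map2_dbl(model, 1:5,
                          ~ predict(.x, newdata = data.frame(ind = .y))))
res
```

The newdata data frame must use the same predictor name as the model formula (ind here), otherwise predict cannot match the column.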

Creating a data frame in R from forecasts to use it for predict

I guess the following code should help you create a newdata data.frame.

Libraries required to reproduce:

# library(dplyr)
# library(stats)
# library(forecast)

With the following code I will create example data with 5 time series:

set.seed(123)

dta <- ts(dplyr::tibble(
AA = arima.sim(list(order=c(1,0,0), ar=.5), n=100, mean = 12),
AB = arima.sim(list(order=c(1,0,0), ar=.5), n=100, mean = 12),
AC = arima.sim(list(order=c(1,0,0), ar=.5), n=100, mean = 11),
BA = arima.sim(list(order=c(1,0,0), ar=.5), n=100, mean = 10),
BB = arima.sim(list(order=c(1,0,0), ar=.5), n=100, mean = 14)
), start = c(2013, 1), frequency = 12)

head(dta)
tail(dta)

Now we will do the batch forecasting and create one newdata data.frame/matrix.

nseries <- ncol(dta)
h <- 12 # forecast horizon

newdata <- matrix(nrow = h, ncol = nseries) # empty newdata matrix

for (i in seq_len(nseries)) {
  newdata[,i] <- forecast::forecast(forecast::auto.arima(dta[,i]), h = h)$mean
}

colnames(newdata) <- colnames(dta)

head(newdata)
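The loop above leaves newdata as a matrix, whereas predict usually expects a data frame. The following sketch shows the conversion step. So that it runs without the forecast package, it swaps in base stats::arima with a fixed AR(1) order instead of auto.arima, which is an assumption of this illustration and not the approach above; the helper sim and the two-series data are made up as well.

```r
set.seed(123)
# hypothetical helper: an AR(1) series shifted to a given level
sim <- function(mu) {
  as.numeric(arima.sim(list(order = c(1, 0, 0), ar = 0.5), n = 100)) + mu
}
dta <- ts(cbind(AA = sim(12), AB = sim(10)),
          start = c(2013, 1), frequency = 12)

h <- 12 # forecast horizon

# one column of point forecasts per series
fc <- sapply(colnames(dta), function(nm) {
  fit <- arima(dta[, nm], order = c(1, 0, 0))
  as.numeric(predict(fit, n.ahead = h)$pred)
})

newdata <- as.data.frame(fc) # data frame with one column per series
head(newdata)
```

The column names carry over from dta, so the result can be passed as newdata to a model whose predictors use the same names.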

I hope I understood the problem correctly.


