Reusing a Model Built in R

Reusing a Model Built in R

Reusing a model to predict for new observations

If the model is not computationally costly, I tend to document the entire model building process in an R script that I rerun when needed. If a random element is involved in the model fitting, I make sure to set a known random seed.

If the model is computationally costly to compute, then I still use a script as above, but save out the model objects using save() into and rda object. I then tend to modify the script such that if the saved object exists, load it, or if not, refit the model, using a simple if()...else clause wrapped around the relevant parts of the code.

When loading your saved model object, be sure to reload any required packages, although in your case if the logit model were fit via glm() there will not be any additional packages to load beyond R.

Here is an example:

> set.seed(345)
> df <- data.frame(x = rnorm(20))
> df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
> ## model
> m1 <- lm(y ~ x, data = df)
> ## save this model
> save(m1, file = "my_model1.rda")
>
> ## a month later, new observations are available:
> newdf <- data.frame(x = rnorm(20))
> ## load the model
> load("my_model1.rda")
> ## predict for the new `x`s in `newdf`
> predict(m1, newdata = newdf)
1 2 3 4 5 6
6.1370366 6.5631503 2.9808845 5.2464261 4.6651015 3.4475255
7 8 9 10 11 12
6.7961764 5.3592901 3.3691800 9.2506653 4.7562096 3.9067537
13 14 15 16 17 18
2.0423691 2.4764664 3.7308918 6.9999064 2.0081902 0.3256407
19 20
5.4247548 2.6906722

If wanting to automate this, then I would probably do the following in a script:

## data
df <- data.frame(x = rnorm(20))
df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))

## check if model exists? If not, refit:
if(file.exists("my_model1.rda")) {
## load model
load("my_model1.rda")
} else {
## (re)fit the model
m1 <- lm(y ~ x, data = df)
}

## predict for new observations
## new observations
newdf <- data.frame(x = rnorm(20))
## predict
predict(m1, newdata = newdf)

Of course, the data generation code would be replaced by code loading your actual data.

Updating a previously fitted model with new observations

If you want to refit the model using additional new observations. Then update() is a useful function. All it does is refit the model with one or more of the model arguments updated. If you want to include new observations in the data used to fit the model, add the new observations to the data frame passed to argument 'data', and then do the following:

m2 <- update(m1, . ~ ., data = df)

where m1 is the original, saved model fit, . ~ . is the model formula changes, which in this case means include all existing variables on both the left and right hand sides of ~ (in other words, make no changes to the model formula), and df is the data frame used to fit the original model, expanded to include the newly available observations.

Here is a working example:

> set.seed(123)
> df <- data.frame(x = rnorm(20))
> df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
> ## model
> m1 <- lm(y ~ x, data = df)
> m1

Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept) x
4.960 2.222

>
> ## new observations
> newdf <- data.frame(x = rnorm(20))
> newdf <- transform(newdf, y = 5 + (2.3 * x) + rnorm(20))
> ## add on to df
> df <- rbind(df, newdf)
>
> ## update model fit
> m2 <- update(m1, . ~ ., data = df)
> m2

Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept) x
4.928 2.187

Other have mentioned in comments formula(), which extracts the formula from a fitted model:

> formula(m1)
y ~ x
> ## which can be used to set-up a new model call
> ## so an alternative to update() above is:
> m3 <- lm(formula(m1), data = df)

However, if the model fitting involves additional arguments, like 'family', or 'subset' arguments in more complex model fitting functions. If update() methods are available for your model fitting function (which they are for many common fitting functions, like glm()), it provides a simpler way to update a model fit than extracting and reusing the model formula.

If you intend to do all the modelling and future prediction in R, there doesn't really seem much point in abstracting the model out via PMML or similar.

Reuse a HoltWinters model using new data

You can use update:

mdl <- HoltWinters(EuStockMarkets[,"FTSE"],gamma=FALSE)

predict(mdl,n.ahead=10)
Time Series:
Start = c(1998, 170)
End = c(1998, 179)
Frequency = 260
fit
[1,] 5451.093
[2,] 5447.186
[3,] 5443.279
[4,] 5439.373
[5,] 5435.466
[6,] 5431.559
[7,] 5427.652
[8,] 5423.745
[9,] 5419.838
[10,] 5415.932

predict(update(mdl,x=EuStockMarkets[,"CAC"]),n.ahead=10)]
Time Series:
Start = c(1998, 170)
End = c(1998, 179)
Frequency = 260
fit
[1,] 3995.127
[2,] 3995.253
[3,] 3995.380
[4,] 3995.506
[5,] 3995.633
[6,] 3995.759
[7,] 3995.886
[8,] 3996.013
[9,] 3996.139
[10,] 3996.266

How to correctly `dput` a fitted linear model (by `lm`) to an ASCII file and recreate it later?

Step 1:

You need to control de-parsing options:

dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R") 

You can read more on all possible options in ?.deparseOpts.


The "quoteExpressions" wraps all calls / expressions / languages with quote, so that they are not evaluated when you later re-parse it. Note:

  • source is doing parsing;
  • call field in your fitted "lm" object is a call:

    fit$call
    # lm(formula = z ~ x, data = dat_train)

So, without "quoteExpressions", R will try to evaluate lm call during parsing. And if we evaluate it, it is fitting a linear model, and R will aim to find dat_train, which will not exist in your new R session.


The "showAttributes" is another mandatory option, as "lm" object has class attributes. You certainly don't want to discard all class attributes and only export a plain "list" object, right? Moreover, many elements in a "lm" object, like model (the model frame), qr (the compact QR matrix) and terms (terms info), etc all have attributes. You want to keep them all.


If you don't set control, the default setting with:

control = c("keepNA", "keepInteger", "showAttributes")

will be used. As you can see, there is no "quoteExpressions", so you will get into trouble.

You can also specify "keepInteger" and "keepNA", but I don't see the need for "lm" object.

------

Step 2:

The above step will get source working correctly. You can recover your model:

fit1 <- source("model.R")$value

However, it is not yet ready for generic functions like summary and predict to work. Why?

The critical issue is the terms object in fit1 is not really a "terms" object, but only a formula (it is even not a formula, but only a "language" object without "formula" class!). Just compare fit$terms and fit1$terms, and you will see the difference. Don't be surprised; we've set "quoteExpressions" earlier. While that is definitely helpful to prevent evaluation of call, it has side-effect for terms. So we need to reconstruct terms as best as we can.

Fortunately, it is sufficient to do:

fit1$terms <- terms.formula(fit1$terms)

Though this still does not recover all information in fit$terms (like variable classes are missing), it is readily a valid "terms" object.

Why is a "terms" object critical? Because all generic functions rely on it. You may not need to know more on this, as it is really technical, so I will stop here.

Once this is done, we can successfully use predict (and summary, too):

predict(fit1)  ## no `newdata` given, using model frame `fit1$model`
# 1 2 3 4
#1.03 2.01 2.99 3.97

predict(fit1, dat_score) ## with `newdata`
# 1 2
#1.52 3.48

-------

Conclusion remark:

Although I have shown you how to get things work, I don't really recommend you doing this in general. An "lm" object will be pretty large when you fit a model to a large dataset, for example, residuals, fitted.values are long vectors, and qr and model are huge matrices / data frames. So think about this.

Importing a model into R, that was created in a newer version of H2O

You can download the R package for 3.22.0.1 from here:

  • http://h2o-release.s3.amazonaws.com/h2o/rel-xia/1/index.html

At the time of this comment, this is the latest stable release.

The version in CRAN is often a release or two behind, but you can always download the latest stable version from the H2O website. All versions of H2O are archived in S3. Every version is there, you just need to find the right link.

The message "Found version 3.22.0.1, but running version 3.20.0.8" means there is a mismatch between the version of the R package and a running H2O server on your host. You might start by making sure the H2O java processes are all stopped before trying to start a new one from R. (If you're not exactly sure how to do that, worst case just reboot your laptop.)



Related Topics



Leave a reply



Submit