How to correctly `dput` a fitted linear model (by `lm`) to an ASCII file and recreate it later?
Step 1:
You need to control the deparsing options:
dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R")
You can read about all possible options in ?.deparseOpts.
The "quoteExpressions" option wraps all calls / expressions / language objects in `quote()`, so that they are not evaluated when you later re-parse the file. Note that `source` does parsing, and the `call` field in your fitted "lm" object is a call:
fit$call
# lm(formula = z ~ x, data = dat_train)
So, without "quoteExpressions", R would try to evaluate the `lm` call during parsing. Evaluating it means fitting a linear model, and R would go looking for `dat_train`, which will not exist in your new R session.
The "showAttributes" option is also mandatory, as an "lm" object has a class attribute. You certainly don't want to discard all class attributes and export only a plain "list" object, right? Moreover, many elements of an "lm" object, like `model` (the model frame), `qr` (the compact QR matrix) and `terms` (the terms information), all have attributes. You want to keep them all.
If you don't set `control`, the default setting
control = c("keepNA", "keepInteger", "showAttributes")
will be used. As you can see, it does not include "quoteExpressions", so you will get into trouble.
You can also keep "keepInteger" and "keepNA", but I don't see the need for an "lm" object.
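To see the effect of "quoteExpressions" directly, you can compare the deparsed call with and without it. A minimal sketch using the built-in mtcars data (not the asker's dat_train):

```r
## Fit a small model on built-in data
fit <- lm(mpg ~ wt, data = mtcars)

## Default deparsing leaves the call bare, so source() would evaluate it:
deparse(fit$call)
# "lm(formula = mpg ~ wt, data = mtcars)"

## With "quoteExpressions", the call is wrapped in quote(), so it
## survives re-parsing without being evaluated:
deparse(fit$call, control = "quoteExpressions")
# "quote(lm(formula = mpg ~ wt, data = mtcars))"
```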
------
Step 2:
The above step gets `source` working correctly. You can recover your model:
fit1 <- source("model.R")$value
However, it is not yet ready for generic functions like `summary` and `predict` to work. Why?
The critical issue is that the `terms` object in `fit1` is not really a "terms" object, but only a formula (in fact not even a formula, just a "language" object without the "formula" class!). Compare `fit$terms` and `fit1$terms`, and you will see the difference. Don't be surprised; we set "quoteExpressions" earlier. While that is definitely helpful in preventing evaluation of `call`, it has a side effect on `terms`. So we need to reconstruct `terms` as best we can.
Fortunately, it is sufficient to do:
fit1$terms <- terms.formula(fit1$terms)
Though this still does not recover all the information in `fit$terms` (for example, the variable classes are missing), it is a valid "terms" object.
Why is a "terms" object critical? Because all generic functions rely on it. You may not need to know more about this, as it is really technical, so I will stop here.
Once this is done, we can successfully use `predict` (and `summary`, too):
predict(fit1) ## no `newdata` given, using model frame `fit1$model`
# 1 2 3 4
#1.03 2.01 2.99 3.97
predict(fit1, dat_score) ## with `newdata`
# 1 2
#1.52 3.48
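Putting both steps together, here is a self-contained round trip on the built-in mtcars data (the asker's dat_train / dat_score are not available, so treat this as a sketch of the workflow):

```r
## Step 1: fit and export with the right deparse options
fit <- lm(mpg ~ wt, data = mtcars)
out <- tempfile(fileext = ".R")
dput(fit, control = c("quoteExpressions", "showAttributes"), file = out)

## Step 2: re-import (as if in a fresh session) and repair `terms`
fit1 <- source(out)$value
fit1$terms <- terms.formula(fit1$terms)

## Predictions from the recovered model match the original
all.equal(predict(fit), predict(fit1))
all.equal(predict(fit, data.frame(wt = 3)), predict(fit1, data.frame(wt = 3)))
```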
-------
Concluding remark:
Although I have shown you how to get things working, I don't really recommend doing this in general. An "lm" object can be pretty large when you fit a model to a large dataset: for example, `residuals` and `fitted.values` are long vectors, while `qr` and `model` are huge matrices / data frames. So think about this.
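To get a feel for the size issue, `object.size()` is handy; a quick sketch:

```r
fit <- lm(mpg ~ wt, data = mtcars)
## The model frame and QR decomposition typically dominate the size,
## and both grow with the number of observations
format(object.size(fit), units = "Kb")
format(object.size(fit$model), units = "Kb")
format(object.size(fit$qr), units = "Kb")
```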
R, Is there a function for deparsing and (re)parsing any regression output for lm/glm/semgented models?
As @AEF mentions in their comment, you might be looking for serialization with `saveRDS()`. If you specifically need a text format rather than a binary one, you can use `ascii = TRUE`.
fit0 <- lm(mpg ~ wt, data = mtcars)
saveRDS(fit0, "model.rds", ascii = TRUE)
fit1 <- readRDS("model.rds")
predict(fit0, data.frame(wt = 3))
#> 1
#> 21.25171
predict(fit1, data.frame(wt = 3))
#> 1
#> 21.25171
Reusing a Model Built in R
Reusing a model to predict for new observations
If the model is not computationally costly, I tend to document the entire model building process in an R script that I rerun when needed. If a random element is involved in the model fitting, I make sure to set a known random seed.
If the model is computationally costly to compute, then I still use a script as above, but save the model objects using `save()` into an .rda file. I then tend to modify the script so that if the saved object exists it is loaded, and if not, the model is refitted, using a simple `if () ... else` clause wrapped around the relevant parts of the code.
When loading your saved model object, be sure to reload any required packages, although in your case, if the logit model were fit via `glm()`, there would be no additional packages to load beyond base R.
Here is an example:
> set.seed(345)
> df <- data.frame(x = rnorm(20))
> df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
> ## model
> m1 <- lm(y ~ x, data = df)
> ## save this model
> save(m1, file = "my_model1.rda")
>
> ## a month later, new observations are available:
> newdf <- data.frame(x = rnorm(20))
> ## load the model
> load("my_model1.rda")
> ## predict for the new `x`s in `newdf`
> predict(m1, newdata = newdf)
1 2 3 4 5 6
6.1370366 6.5631503 2.9808845 5.2464261 4.6651015 3.4475255
7 8 9 10 11 12
6.7961764 5.3592901 3.3691800 9.2506653 4.7562096 3.9067537
13 14 15 16 17 18
2.0423691 2.4764664 3.7308918 6.9999064 2.0081902 0.3256407
19 20
5.4247548 2.6906722
If wanting to automate this, then I would probably do the following in a script:
## data
df <- data.frame(x = rnorm(20))
df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
## check if model exists? If not, refit:
if(file.exists("my_model1.rda")) {
## load model
load("my_model1.rda")
} else {
## (re)fit the model
m1 <- lm(y ~ x, data = df)
}
## predict for new observations
## new observations
newdf <- data.frame(x = rnorm(20))
## predict
predict(m1, newdata = newdf)
Of course, the data generation code would be replaced by code loading your actual data.
Updating a previously fitted model with new observations
If you want to refit the model using additional new observations, then `update()` is a useful function. All it does is refit the model with one or more of the model arguments updated. If you want to include new observations in the data used to fit the model, add the new observations to the data frame passed to the `data` argument, and then do the following:
m2 <- update(m1, . ~ ., data = df)
where `m1` is the original, saved model fit, `. ~ .` is the model formula change, which in this case means include all existing variables on both the left- and right-hand sides of `~` (in other words, make no changes to the model formula), and `df` is the data frame used to fit the original model, expanded to include the newly available observations.
Here is a working example:
> set.seed(123)
> df <- data.frame(x = rnorm(20))
> df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
> ## model
> m1 <- lm(y ~ x, data = df)
> m1
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
4.960 2.222
>
> ## new observations
> newdf <- data.frame(x = rnorm(20))
> newdf <- transform(newdf, y = 5 + (2.3 * x) + rnorm(20))
> ## add on to df
> df <- rbind(df, newdf)
>
> ## update model fit
> m2 <- update(m1, . ~ ., data = df)
> m2
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
4.928 2.187
Others have mentioned in comments `formula()`, which extracts the formula from a fitted model:
> formula(m1)
y ~ x
> ## which can be used to set-up a new model call
> ## so an alternative to update() above is:
> m3 <- lm(formula(m1), data = df)
However, if the model fitting involves additional arguments, like the `family` or `subset` arguments in more complex model fitting functions, the formula alone will not carry them over. If `update()` methods are available for your model fitting function (and they are for many common fitting functions, like `glm()`), `update()` provides a simpler way to update a model fit than extracting and reusing the model formula.
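For example (a hedged sketch with simulated data), `update()` carries the `family` argument over, whereas refitting from `formula(g1)` alone silently falls back to the default gaussian family:

```r
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- rbinom(50, 1, plogis(d$x))

g1 <- glm(y ~ x, data = d, family = binomial)

g2 <- update(g1, . ~ ., data = d)   # refit; family = binomial is preserved
family(g2)$family
# "binomial"

g3 <- glm(formula(g1), data = d)    # formula alone loses the family
family(g3)$family
# "gaussian"
```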
If you intend to do all the modelling and future prediction in R, there doesn't really seem much point in abstracting the model out via PMML or similar.
Import text file as single character string
Here's a variant of the solution from @JoshuaUlrich that uses the correct size instead of a hard-coded size:
fileName <- 'foo.txt'
readChar(fileName, file.info(fileName)$size)
Note that readChar allocates space for the number of bytes you specify, so readChar(fileName, .Machine$integer.max)
does not work well...
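A self-contained check with a temporary file (foo.txt above is just a placeholder name; the byte comparison avoids platform differences in line endings):

```r
f <- tempfile()
writeLines(c("hello", "world"), f)

## read the whole file as a single string, sized exactly to the file
txt <- readChar(f, file.info(f)$size)
nchar(txt, type = "bytes") == file.info(f)$size
# TRUE
```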
convert string back into object in r
I'm assuming you used `base::dput()` as in the following example (based on this answer):
# Generate some model over some data
data <- sample(1:100, 30)
df <- data.frame(x = data, y = 2 * data + 20)
model <- lm(y ~ x, df)
# Assuming this is what you did you have the model structure inside model.R
dput(model, control = c("quoteExpressions", "showAttributes"), file = "model.R")
# ----- This is where you are, I presume -----
# So you can copy the content of model.R here (attention to the single quotes)
mstr <- '...'
# Execute the content of mstr as a piece of code (loading the model)
model1 <- eval(parse(text = mstr))
# Parse the formulas
model1$terms <- terms.formula(model1$terms)
# ----- Test it -----
# New data
df1 <- data.frame(x = 101:110)
pred <- as.integer(predict(model, df1))
pred1 <- as.integer(predict(model1, df1))
identical(pred, pred1)
# [1] TRUE
model
#
# Call:
# lm(formula = y ~ x, data = df)
#
# Coefficients:
# (Intercept) x
# 20 2
#
model1
#
# Call:
# lm(formula = y ~ x, data = df)
#
# Coefficients:
# (Intercept) x
# 20 2
# Check summary too (you'll see some minor differences)
# summary(model)
# summary(model1)
R: How to or should I drop an insignificant orthogonal polynomial basis in a linear model?
No, don't check with `summary` in this case; you should use `anova`. A polynomial term from `poly()` or a spline term from `bs()` contributes more than one coefficient, so it is more like a factor variable with multiple levels.
> anova(polymod)
Analysis of Variance Table
Response: soil_m_sat
Df Sum Sq Mean Sq F value Pr(>F)
poly(x + y, degree = 2) 2 113484 56742 1600.8 < 2.2e-16 ***
poly(z, degree = 3) 3 68538 22846 644.5 < 2.2e-16 ***
Residuals 290 10280 35
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The ANOVA table clearly shows that you need all model terms. Do not drop any.
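The difference between the two checks can be reproduced on simulated data (a sketch; the asker's gue data is not available):

```r
set.seed(1)
d <- data.frame(x = runif(100))
d$y <- 1 + 2 * d$x + 0.5 * d$x^2 + rnorm(100, sd = 0.1)

fm <- lm(y ~ poly(x, degree = 2), data = d)

## anova() tests the polynomial term as a whole: one F-test on 2 Df
anova(fm)

## summary() shows a separate t-test per basis column, which invites
## dropping individual columns -- exactly what you should not do here
summary(fm)
```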
But I still need to answer your question and make you feel happy. It is not impossible to drop the `poly(x + y, degree = 2)1` term, but you need to access the model matrix for this purpose. You may do
gue$XY_poly <- with(gue, poly(x + y, degree = 2))[, 2] ## use the 2nd column only
fit <- lm(soil_m_sat ~ XY_poly + poly(z, degree = 3), data = gue)
summary(fit)
## ...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.3071 0.3459 151.217 < 2e-16 ***
XY_poly -18.8515 7.3894 -2.551 0.0112 *
poly(z, degree = 3)1 -418.1634 6.4937 -64.395 < 2e-16 ***
poly(z, degree = 3)2 116.5327 6.9171 16.847 < 2e-16 ***
poly(z, degree = 3)3 -28.7773 5.9517 -4.835 2.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.951 on 291 degrees of freedom
Multiple R-squared: 0.9464, Adjusted R-squared: 0.9457
F-statistic: 1285 on 4 and 291 DF, p-value: < 2.2e-16
Is it possible to use a database link between an oracle database and a postgresql database on different physical servers
To connect from Oracle to PostgreSQL, see http://download.oracle.com/docs/cd/B19306_01/server.102/b14232/toc.htm; this forum can also help: http://forums.oracle.com/forums/forum.jspa?forumID=63