How to correctly `dput` a fitted linear model (by `lm`) to an ASCII file and recreate it later?
Step 1:
You need to control the deparsing options:
dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R")
You can read about all possible options in ?.deparseOpts.
The "quoteExpressions" option wraps all calls / expressions / language objects in `quote()`, so that they are not evaluated when you later re-parse the file. Note that `source` does parsing, and the `call` field in your fitted "lm" object is a call:
fit$call
# lm(formula = z ~ x, data = dat_train)
So, without "quoteExpressions", R would try to evaluate the `lm` call during parsing. Evaluating it means fitting a linear model, and R would go looking for `dat_train`, which will not exist in your new R session.
The "showAttributes" option is also mandatory, as an "lm" object has a class attribute. You certainly don't want to discard all class attributes and export only a plain "list" object, right? Moreover, many elements of an "lm" object, like `model` (the model frame), `qr` (the compact QR matrix) and `terms` (the terms information), all have attributes. You want to keep them all.
If you don't set `control`, the default setting
control = c("keepNA", "keepInteger", "showAttributes")
will be used. As you can see, it does not include "quoteExpressions", so you will get into trouble.
You can also keep "keepInteger" and "keepNA", but I don't see the need for an "lm" object.
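To see the effect of "quoteExpressions" directly, you can compare the deparsed call with and without it. A minimal sketch using the built-in mtcars data (not the asker's dat_train):

```r
## Fit a small model on built-in data
fit <- lm(mpg ~ wt, data = mtcars)

## Default deparsing leaves the call bare, so source() would evaluate it:
deparse(fit$call)
# "lm(formula = mpg ~ wt, data = mtcars)"

## With "quoteExpressions", the call is wrapped in quote(), so it
## survives re-parsing without being evaluated:
deparse(fit$call, control = "quoteExpressions")
# "quote(lm(formula = mpg ~ wt, data = mtcars))"
```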
------
Step 2:
The above step gets `source` working correctly. You can recover your model:
fit1 <- source("model.R")$value
However, it is not yet ready for generic functions like `summary` and `predict` to work. Why?
The critical issue is that the `terms` object in `fit1` is not really a "terms" object, but only a formula (in fact not even a formula, just a "language" object without the "formula" class!). Compare `fit$terms` and `fit1$terms`, and you will see the difference. Don't be surprised; we set "quoteExpressions" earlier. While that is definitely helpful in preventing evaluation of `call`, it has a side effect on `terms`. So we need to reconstruct `terms` as best we can.
Fortunately, it is sufficient to do:
fit1$terms <- terms.formula(fit1$terms)
Though this still does not recover all the information in `fit$terms` (for example, the variable classes are missing), it is a valid "terms" object.
Why is a "terms" object critical? Because all generic functions rely on it. You may not need to know more about this, as it is really technical, so I will stop here.
Once this is done, we can successfully use `predict` (and `summary`, too):
predict(fit1) ## no `newdata` given, using model frame `fit1$model`
# 1 2 3 4
#1.03 2.01 2.99 3.97
predict(fit1, dat_score) ## with `newdata`
# 1 2
#1.52 3.48
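Putting both steps together, here is a self-contained round trip on the built-in mtcars data (the asker's dat_train / dat_score are not available, so treat this as a sketch of the workflow):

```r
## Step 1: fit and export with the right deparse options
fit <- lm(mpg ~ wt, data = mtcars)
out <- tempfile(fileext = ".R")
dput(fit, control = c("quoteExpressions", "showAttributes"), file = out)

## Step 2: re-import (as if in a fresh session) and repair `terms`
fit1 <- source(out)$value
fit1$terms <- terms.formula(fit1$terms)

## Predictions from the recovered model match the original
all.equal(predict(fit), predict(fit1))
all.equal(predict(fit, data.frame(wt = 3)), predict(fit1, data.frame(wt = 3)))
```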
-------
Concluding remark:
Although I have shown you how to get things working, I don't really recommend doing this in general. An "lm" object can be pretty large when you fit a model to a large dataset: for example, `residuals` and `fitted.values` are long vectors, while `qr` and `model` are huge matrices / data frames. So think about this.
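To get a feel for the size issue, `object.size()` is handy; a quick sketch:

```r
fit <- lm(mpg ~ wt, data = mtcars)
## The model frame and QR decomposition typically dominate the size,
## and both grow with the number of observations
format(object.size(fit), units = "Kb")
format(object.size(fit$model), units = "Kb")
format(object.size(fit$qr), units = "Kb")
```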
R, Is there a function for deparsing and (re)parsing any regression output for lm/glm/semgented models?
As @AEF mentions in their comment, you might be looking for serialization with `saveRDS()`. If you specifically need a text format rather than a binary one, you can use `ascii = TRUE`.
fit0 <- lm(mpg ~ wt, data = mtcars)
saveRDS(fit0, "model.rds", ascii = TRUE)
fit1 <- readRDS("model.rds")
predict(fit0, data.frame(wt = 3))
#> 1
#> 21.25171
predict(fit1, data.frame(wt = 3))
#> 1
#> 21.25171
Reusing a Model Built in R
Reusing a model to predict for new observations
If the model is not computationally costly, I tend to document the entire model building process in an R script that I rerun when needed. If a random element is involved in the model fitting, I make sure to set a known random seed.
If the model is computationally costly to compute, then I still use a script as above, but save the model objects using `save()` into an .rda file. I then tend to modify the script so that if the saved object exists it is loaded, and if not, the model is refitted, using a simple `if () ... else` clause wrapped around the relevant parts of the code.
When loading your saved model object, be sure to reload any required packages, although in your case, if the logit model were fit via `glm()`, there would be no additional packages to load beyond base R.
Here is an example:
> set.seed(345)
> df <- data.frame(x = rnorm(20))
> df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
> ## model
> m1 <- lm(y ~ x, data = df)
> ## save this model
> save(m1, file = "my_model1.rda")
>
> ## a month later, new observations are available:
> newdf <- data.frame(x = rnorm(20))
> ## load the model
> load("my_model1.rda")
> ## predict for the new `x`s in `newdf`
> predict(m1, newdata = newdf)
1 2 3 4 5 6
6.1370366 6.5631503 2.9808845 5.2464261 4.6651015 3.4475255
7 8 9 10 11 12
6.7961764 5.3592901 3.3691800 9.2506653 4.7562096 3.9067537
13 14 15 16 17 18
2.0423691 2.4764664 3.7308918 6.9999064 2.0081902 0.3256407
19 20
5.4247548 2.6906722
If wanting to automate this, then I would probably do the following in a script:
## data
df <- data.frame(x = rnorm(20))
df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
## check if model exists? If not, refit:
if(file.exists("my_model1.rda")) {
## load model
load("my_model1.rda")
} else {
## (re)fit the model
m1 <- lm(y ~ x, data = df)
}
## predict for new observations
## new observations
newdf <- data.frame(x = rnorm(20))
## predict
predict(m1, newdata = newdf)
Of course, the data generation code would be replaced by code loading your actual data.
Updating a previously fitted model with new observations
If you want to refit the model using additional new observations, then `update()` is a useful function. All it does is refit the model with one or more of the model arguments updated. If you want to include new observations in the data used to fit the model, add the new observations to the data frame passed to the `data` argument, and then do the following:
m2 <- update(m1, . ~ ., data = df)
where `m1` is the original, saved model fit, `. ~ .` is the model formula change, which in this case means include all existing variables on both the left- and right-hand sides of `~` (in other words, make no changes to the model formula), and `df` is the data frame used to fit the original model, expanded to include the newly available observations.
Here is a working example:
> set.seed(123)
> df <- data.frame(x = rnorm(20))
> df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
> ## model
> m1 <- lm(y ~ x, data = df)
> m1
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
4.960 2.222
>
> ## new observations
> newdf <- data.frame(x = rnorm(20))
> newdf <- transform(newdf, y = 5 + (2.3 * x) + rnorm(20))
> ## add on to df
> df <- rbind(df, newdf)
>
> ## update model fit
> m2 <- update(m1, . ~ ., data = df)
> m2
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
4.928 2.187
Others have mentioned in comments `formula()`, which extracts the formula from a fitted model:
> formula(m1)
y ~ x
> ## which can be used to set-up a new model call
> ## so an alternative to update() above is:
> m3 <- lm(formula(m1), data = df)
However, if the model fitting involves additional arguments, like the `family` or `subset` arguments in more complex model fitting functions, the formula alone will not carry them over. If `update()` methods are available for your model fitting function (and they are for many common fitting functions, like `glm()`), `update()` provides a simpler way to update a model fit than extracting and reusing the model formula.
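For example (a hedged sketch with simulated data), `update()` carries the `family` argument over, whereas refitting from `formula(g1)` alone silently falls back to the default gaussian family:

```r
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- rbinom(50, 1, plogis(d$x))

g1 <- glm(y ~ x, data = d, family = binomial)

g2 <- update(g1, . ~ ., data = d)   # refit; family = binomial is preserved
family(g2)$family
# "binomial"

g3 <- glm(formula(g1), data = d)    # formula alone loses the family
family(g3)$family
# "gaussian"
```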
If you intend to do all the modelling and future prediction in R, there doesn't really seem much point in abstracting the model out via PMML or similar.
Import text file as single character string
Here's a variant of the solution from @JoshuaUlrich that uses the correct size instead of a hard-coded size:
fileName <- 'foo.txt'
readChar(fileName, file.info(fileName)$size)
Note that readChar allocates space for the number of bytes you specify, so readChar(fileName, .Machine$integer.max)
does not work well...
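A self-contained check with a temporary file (foo.txt above is just a placeholder name; the byte comparison avoids platform differences in line endings):

```r
f <- tempfile()
writeLines(c("hello", "world"), f)

## read the whole file as a single string, sized exactly to the file
txt <- readChar(f, file.info(f)$size)
nchar(txt, type = "bytes") == file.info(f)$size
# TRUE
```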
convert string back into object in r
I'm assuming you used `base::dput()` as in the following example (based on this answer):
# Generate some model over some data
data <- sample(1:100, 30)
df <- data.frame(x = data, y = 2 * data + 20)
model <- lm(y ~ x, df)
# Assuming this is what you did you have the model structure inside model.R
dput(model, control = c("quoteExpressions", "showAttributes"), file = "model.R")
# ----- This is where you are, I presume -----
# So you can copy the content of model.R here (attention to the single quotes)
mstr <- '...'
# Execute the content of mstr as a piece of code (loading the model)
model1 <- eval(parse(text = mstr))
# Parse the formulas
model1$terms <- terms.formula(model1$terms)
# ----- Test it -----
# New data
df1 <- data.frame(x = 101:110)
pred <- as.integer(predict(model, df1))
pred1 <- as.integer(predict(model1, df1))
identical(pred, pred1)
# [1] TRUE
model
#
# Call:
# lm(formula = y ~ x, data = df)
#
# Coefficients:
# (Intercept) x
# 20 2
#
model1
#
# Call:
# lm(formula = y ~ x, data = df)
#
# Coefficients:
# (Intercept) x
# 20 2
# Check summary too (you'll see some minor differences)
# summary(model)
# summary(model1)
R: How to or should I drop an insignificant orthogonal polynomial basis in a linear model?
No, don't check with `summary` in this case; you should use `anova`. A polynomial term from `poly()` or a spline term from `bs()` contributes more than one coefficient, so it is more like a factor variable with multiple levels.
> anova(polymod)
Analysis of Variance Table
Response: soil_m_sat
Df Sum Sq Mean Sq F value Pr(>F)
poly(x + y, degree = 2) 2 113484 56742 1600.8 < 2.2e-16 ***
poly(z, degree = 3) 3 68538 22846 644.5 < 2.2e-16 ***
Residuals 290 10280 35
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The ANOVA table clearly shows that you need all model terms. Do not drop any.
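The difference between the two checks can be reproduced on simulated data (a sketch; the asker's gue data is not available):

```r
set.seed(1)
d <- data.frame(x = runif(100))
d$y <- 1 + 2 * d$x + 0.5 * d$x^2 + rnorm(100, sd = 0.1)

fm <- lm(y ~ poly(x, degree = 2), data = d)

## anova() tests the polynomial term as a whole: one F-test on 2 Df
anova(fm)

## summary() shows a separate t-test per basis column, which invites
## dropping individual columns -- exactly what you should not do here
summary(fm)
```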
But I still need to answer your question and make you feel happy. It is not impossible to drop the `poly(x + y, degree = 2)1` term, but you need to access the model matrix for this purpose. You may do
gue$XY_poly <- with(gue, poly(x + y, degree = 2))[, 2] ## use the 2nd column only
fit <- lm(soil_m_sat ~ XY_poly + poly(z, degree = 3), data = gue)
summary(fit)
## ...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.3071 0.3459 151.217 < 2e-16 ***
XY_poly -18.8515 7.3894 -2.551 0.0112 *
poly(z, degree = 3)1 -418.1634 6.4937 -64.395 < 2e-16 ***
poly(z, degree = 3)2 116.5327 6.9171 16.847 < 2e-16 ***
poly(z, degree = 3)3 -28.7773 5.9517 -4.835 2.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.951 on 291 degrees of freedom
Multiple R-squared: 0.9464, Adjusted R-squared: 0.9457
F-statistic: 1285 on 4 and 291 DF, p-value: < 2.2e-16
Is it possible to use a database link between an oracle database and a postgresql database on different physical servers
To connect from Oracle to PostgreSQL, see http://download.oracle.com/docs/cd/B19306_01/server.102/b14232/toc.htm; this forum can also help: http://forums.oracle.com/forums/forum.jspa?forumID=63