How to 'Compress' an lm() Object for Later Prediction

Is there a way to 'compress' an lm() object for later prediction?

You can fit your models with biglm; a biglm model object is smaller than an lm model object. You can also use predict.biglm (with make.function = TRUE) to create a function that takes the newdata design matrix and returns the predicted values.

Another option is to save with saveRDS. The resulting files appear to be slightly smaller because saveRDS serializes a single object, whereas save can store multiple objects and so carries more overhead.

 library(biglm)
 m <- lm(log(Volume)~log(Girth)+log(Height), trees)
 mm <- lm(log(Volume)~log(Girth)+log(Height), trees, model = FALSE, x = FALSE, y = FALSE)
 bm <- biglm(log(Volume)~log(Girth)+log(Height), trees)
 pred <- predict(bm, make.function = TRUE)
 save(m, file = 'm.rdata')
 save(mm, file = 'mm.rdata')
 save(bm, file = 'bm.rdata')
 save(pred, file = 'pred.rdata')
 saveRDS(m, file = 'm.rds')
 saveRDS(mm, file = 'mm.rds')
 saveRDS(bm, file = 'bm.rds')
 saveRDS(pred, file = 'pred.rds')

 file.info(paste(rep(c('m','mm','bm','pred'),each=2) ,c('.rdata','.rds'),sep=''))
#             size isdir mode mtime               ctime               atime               exe
#  m.rdata    2806 FALSE  666 2013-03-07 11:29:30 2013-03-07 11:24:23 2013-03-07 11:29:30  no
#  m.rds      2798 FALSE  666 2013-03-07 11:29:30 2013-03-07 11:29:30 2013-03-07 11:29:30  no
#  mm.rdata   2113 FALSE  666 2013-03-07 11:29:30 2013-03-07 11:24:28 2013-03-07 11:29:30  no
#  mm.rds     2102 FALSE  666 2013-03-07 11:29:30 2013-03-07 11:29:30 2013-03-07 11:29:30  no
#  bm.rdata    592 FALSE  666 2013-03-07 11:29:30 2013-03-07 11:24:34 2013-03-07 11:29:30  no
#  bm.rds      583 FALSE  666 2013-03-07 11:29:30 2013-03-07 11:29:30 2013-03-07 11:29:30  no
#  pred.rdata 1007 FALSE  666 2013-03-07 11:29:30 2013-03-07 11:24:40 2013-03-07 11:29:30  no
#  pred.rds    995 FALSE  666 2013-03-07 11:29:30 2013-03-07 11:27:30 2013-03-07 11:29:30  no

How can I reduce the size of a linear model saved by a Shiny app?

Check out this great post for methods and information on trimming the fat from glm/lm objects.

I use this method, which I took from the above.

How to save a fitted R model for later use

If you include the argument model = FALSE (it is TRUE by default) when fitting the model, the model frame that was used will be excluded from the resulting object. You can get an estimate of the memory used to store the model object with:

object.size(my_model)

R attribute .Environment consuming large amounts of RAM in nnet package

tl;dr: this is OK, except for some very special cases

Background

The .Environment attribute in R contains a reference to the context in which an R closure (usually a formula or a function) was defined. An R environment is a store that holds the values of variables, much like a list. This allows the formula to refer to these variables, for example:

> f = function(g) return(y ~ g(x))
> form = f(exp)
> lm(form, list(y=1:10, x=log(1:10)))
...
Coefficients:
(Intercept)     g(x)
3.37e-15        1.00e+00

In this example, the formula form is effectively y ~ exp(x), because g is given the value exp. To be able to find the value of g (which is an argument to function f), the formula needs to hold a reference to the environment constructed inside the call to function f.

You can see the environment attached to a formula by using the attributes() or environment() functions as follows:

> attributes(form)
$class
[1] "formula"

$.Environment
<environment: R_GlobalEnv>

> environment(form)
<environment: R_GlobalEnv>

Your question

I believe you are using the nnet() function variant with a formula (rather than matrices), i.e.

> nnet(y ~ x1 + x2, ...)

Unfortunately, R keeps the entire environment (including all the variables defined where your formula is defined) allocated, even if your formula does not refer to any of it. There is no easy way for the language to tell what you may or may not be using from the environment.

One solution is to explicitly retain only the required parts of the environment. In particular, if your formula does not refer to anything in the environment (which is the most common case), it is safe to remove it.

I would suggest removing the environment from your formula before you call nnet, something like this:

    form = y~x + z
    environment(form) = NULL
    ...
    result = nnet(form, ...)

R Function to slice a vector/matrix in a rolling manner

I just discovered the base R function embed and now it is one of my favorite things:

> numcol <- 3
> embed(1:10, numcol)
     [,1] [,2] [,3]
[1,]    3    2    1
[2,]    4    3    2
[3,]    5    4    3
[4,]    6    5    4
[5,]    7    6    5
[6,]    8    7    6
[7,]    9    8    7
[8,]   10    9    8

It basically does exactly what you describe by making a matrix of rolling windows of your data, with the second input being the window size. If order matters you can reverse the columns using:

embed(1:10, numcol)[ , numcol:1]

Save classifier to disk in scikit-learn

Classifiers are just objects that can be pickled and dumped like any other. To continue your example:

import pickle  # on Python 2, use cPickle instead

# save the classifier (gnb is the fitted classifier from your example)
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(gnb, fid)

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = pickle.load(fid)
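
Since gnb comes from the question's example, here is a self-contained round trip you can run without scikit-learn. It is only a sketch: it uses Python 3's pickle module, and statistics.NormalDist is a stand-in I chose for a fitted classifier (an assumption, not part of the original example). The file name is also just illustrative.

```python
import os
import pickle
import tempfile
from statistics import NormalDist

# Stand-in for a fitted classifier (assumption: a fitted estimator is
# dumped and loaded the same way as this picklable object).
model = NormalDist.from_samples([2.1, 2.5, 3.0, 3.4, 3.9])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_dumped_model.pkl")

    # save the fitted object
    with open(path, "wb") as fid:
        pickle.dump(model, fid)

    # size of the file on disk, in bytes
    size_bytes = os.path.getsize(path)

    # load it again
    with open(path, "rb") as fid:
        model_loaded = pickle.load(fid)

# the restored object should carry the same fitted parameters
same_params = (model_loaded.mean, model_loaded.stdev) == (model.mean, model.stdev)
# and give the same answers as the original
same_output = model_loaded.cdf(3.0) == model.cdf(3.0)
```

Comparing the output of the loaded object against the original, as in the last two lines, is a cheap sanity check before you rely on the dumped file.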

Edit: if you are using a sklearn Pipeline with custom transformers that cannot be serialized by pickle (or by joblib), Neuraxle's custom ML pipeline saving is one solution: you can define your own step savers on a per-step basis. On saving, each step's saver is called if one is defined; steps without a saver fall back to joblib.
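
To make the per-step saver idea concrete, here is a rough plain-Python sketch. This is not Neuraxle's API: every name in it (make_scaler, custom_savers, save_pipeline, load_pipeline) is hypothetical, the "pipeline" is just a dict of steps, and the fallback is pickle rather than joblib so that it runs with the standard library only. The "scale" step is a lambda, which pickle cannot serialize, so it gets its own saver.

```python
import json
import os
import pickle
import tempfile
from statistics import NormalDist


def make_scaler(factor):
    # A lambda cannot be pickled, so this plays the "custom transformer" role.
    scaler = lambda x: x * factor
    scaler.factor = factor  # keep the parameter reachable for the saver
    return scaler


# Custom saver for the scaler: store only its parameter, rebuild on load.
def save_scaler(step, path):
    with open(path, "w") as fid:
        json.dump({"factor": step.factor}, fid)


def load_scaler(path):
    with open(path) as fid:
        return make_scaler(json.load(fid)["factor"])


# Default saver, used for every step without a custom one.
def save_pickle(step, path):
    with open(path, "wb") as fid:
        pickle.dump(step, fid)


def load_pickle(path):
    with open(path, "rb") as fid:
        return pickle.load(fid)


custom_savers = {"scale": (save_scaler, load_scaler)}
default_saver = (save_pickle, load_pickle)


def save_pipeline(steps, directory):
    # One file per step, written by that step's saver if it has one.
    for name, step in steps.items():
        save, _ = custom_savers.get(name, default_saver)
        save(step, os.path.join(directory, name))


def load_pipeline(names, directory):
    loaded = {}
    for name in names:
        _, load = custom_savers.get(name, default_saver)
        loaded[name] = load(os.path.join(directory, name))
    return loaded


# Hypothetical two-step pipeline: one unpicklable step, one picklable step.
steps = {"scale": make_scaler(2.0), "dist": NormalDist(10.0, 3.0)}

with tempfile.TemporaryDirectory() as tmp:
    save_pipeline(steps, tmp)
    restored = load_pipeline(list(steps), tmp)
```

The design choice is that a step's saver stores just enough state to rebuild the step, while everything else goes through the default serializer unchanged.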